Last night all hell broke loose. Partners could not see their dashboards, our people could not share the dashboards, the dashboards didn’t show up, and so on. They are in the US, I’m in Estonia. I was getting ready for bed.
Something had killed our API, which uses a MongoDB cluster with Beanie ODM. It worked but it didn’t. It was alive but it was dead. It acted completely strangely. I started looking for the error. For background and context, our stack in this case:
- Python 3.12, mostly
- Beanie ODM
- MongoDB 7
- AWS Elastic Beanstalk (meaning the app runs in Docker)
- AWS CodePipeline for build and deployment
There it was. Good, clean, full of meaning and hints:
line 26, in merge_models\n for k, right_value in right.__iter__():\n ^^^^^^^^^^^^^^\nAttributeError: 'NoneType' object has no attribute '__iter__'. Did you mean: '__str__'?", "taskName": "Task-7822"}
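To make that message concrete, here is a minimal sketch of how a merge helper can blow up this way. It is not Beanie's actual code: `Doc`, `merge_models_sketch` and `reproduce` are made-up names, and the only thing borrowed from the traceback is the `right.__iter__()` call. The assumption is simply that the second argument arrived as `None`.

```python
class Doc:
    """Hypothetical stand-in for a document model; iterating yields (field, value) pairs."""

    def __init__(self, **fields):
        self.__dict__.update(fields)

    def __iter__(self):
        return iter(self.__dict__.items())


def merge_models_sketch(left, right):
    """Copy every field of `right` onto `left` (simplified, illustration only)."""
    # Same shape as the line in the traceback: this fails if `right` is None.
    for k, right_value in right.__iter__():
        setattr(left, k, right_value)


def reproduce():
    """Return the error text raised when the right-hand side is None."""
    saved = Doc(title="Partner dashboard")
    read_back = None  # assumption: whatever was supposed to supply the document returned nothing
    try:
        merge_models_sketch(saved, read_back)
    except AttributeError as exc:
        return str(exc)
    return None


if __name__ == "__main__":
    print(reproduce())
```

So the useful question is not "why does the merge fail" but "who handed it a `None`".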
First I tried to reproduce it in the development env. I copied all the fresh data from the production MongoDB cluster and so on. Everything worked.
Then I tried to reproduce it in our staging env, which runs on the same MongoDB cluster but on a different database. It had fresh data and everything. It all worked …
I started thinking that if the code is 100% the same in all environments, then it must be the data. Something must be wrong with the data in the database, right? It makes sense, it’s logical.
But how, if I had copied it from production and it all worked in dev and staging? HOW???
I spent a few good hours changing completely pointless things in the code, like switching type hints from “dict” to “Dict”, because why not. It’s 1 a.m. and I’m just changing anything I can in the code because perhaps it would fix it.
Around 2 a.m. I started getting desperate and poured myself another glass of bourbon.
I was too lazy to set up remote debugging between PyCharm and the AWS env, and TBH it’s a pain in the ass to do. I dug around in my code and in Beanie’s. Then, going line by line through Beanie, I discovered something: the entity that I’m saving is at one point read back, and it comes back as None. Nada. Null. Zilch. Anti-matter. And then it came to me! Earlier that day, about 8-10 hours before, I had changed the MongoDB connection string, of course “for the better” and “for performance reasons”. I added 2 parameters there:
- w=0
- journal=false
With a MongoDB cluster, w=0 means that writes don’t wait for acknowledgment from ANY replica set member, not even the primary, and journal=false means they don’t wait for the on-disk journal either. The code can move on fast and let the database deal with writing the data whenever it has time for it. What can go wrong here?
Yes, everything can go wrong here. The trouble is in those edge cases where the code needs to read the same freshly written entity from the same database right away. It’s not there yet. Why? Because the code didn’t bother to wait for the entity to be properly written. I changed the connection string back, ran all kinds of tests and it all worked well. I had resurrected our API.
It was 2:30 a.m. I went to bed and couldn’t get to sleep because a new product was occupying my brain: a product that would log all my work-related decisions automatically, then search through them and analyze whether something I did earlier today may be the reason for the problems I have now. Anyone interested in such a product? Oh, I guess the MVP is there already: it’s my brain that DOESN’T WORK!