The Digital Firefighters: Inside story of tech teams during outages

Uddeshya Singh
Tech @ Trell
8 min read · Jul 30, 2021


“Wait, is Trell / Zomato / <insert-some-famous-brand-name> down?”

This has honestly been one of the most frequent queries I have seen on Twitter, Reddit, or really any social media platform (assuming that platform itself wasn't affected by the outage, that is). And it naturally begs the question: why do these outages occur, and what steps are taken to fix them?

Most recent Akamai outage

While there are plenty of popular resources that cover outages like Google down, Facebook down, etc. (for example, Hussein Nasser's outage-coverage videos), there is rarely any content covering the plight and mental state of the tech team and leadership when the services go down. So this one is going to serve that purpose, courtesy of my very generous and talented peers working @ Trell.

Before we dive in… 🤿

I would like to take this time to broadly categorize technical outages as best as I can.

  1. The ones caused by an innocent, deadline-dreaded code push. These are honestly the easiest to figure out and track down. You generally just have to roll back the commit, re-test it in a pre-prod environment, and push again. In a proactive engineering environment this usually costs around 10–30 seconds of downtime, and yes, good integration and end-to-end testing can and should prevent it from happening in the first place.
  2. The ones which are solved by scaling up or optimizing. This category is also something a tech team generally has eyes on, provided the Product / Operations teams are active in letting the tech team know about a major jump in incoming traffic. Generally, the team just has to scale the infrastructure a little for the time being and later optimize the APIs and underlying business logic (a minimal scaling sketch follows this list).
  3. The ones where it's all third party and you can literally only pray to God, or as we Indians say, "bhagwaan bharose". These generally occur when you are using third-party managed services and you really can't do much except wait for the managed service to come back up, or partially redirect the traffic for the time being. Either way, your bank account is going to take a hit and the stakeholders are going to have a "talk" with you.
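As promised in category 2, here is a minimal sketch of the temporary scaling step, assuming an RDS instance managed with boto3. The instance identifier, region, and target class are hypothetical, not Trell's actual setup.

```python
# A minimal sketch of "scale the infrastructure for the time being" for an RDS
# instance. The identifiers, region, and instance class below are hypothetical.
import boto3

rds = boto3.client("rds", region_name="ap-south-1")

# Move the instance to a larger class; ApplyImmediately skips the maintenance
# window, at the cost of a brief restart/failover.
rds.modify_db_instance(
    DBInstanceIdentifier="feed-service-db",  # hypothetical instance name
    DBInstanceClass="db.r5.2xlarge",         # hypothetical target size
    ApplyImmediately=True,
)
```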

Now, the theory is boring. Want some practicals? Let me give you a real-life outage that Trell faced, and no, it was not a CDN outage (though we were hit by that too when Fastly was acting up, but that's a different soup altogether). Allow me to walk you through the drama that unfolded and what I learned along the way.

The one where Text Index backfired 🤦

The first impact 💣

On one fine Friday evening, multiple services started showing increased latency, and the Trell consumer-bugs group started flooding up with a single complaint (at first): the feed doesn't seem to be working, or is extremely, painfully, woefully slow. Alongside it came a performance drop in internal analysis-level services.

Now, we were aware that the main content feed was undergoing a revamp to serve the gradually increasing scale of users consuming the platform. The caricature below explains the feed a little better. (No, it's not the entire feed service, just a small detail to explain the context, and yes, it's hand-drawn, hence not so refined.)

A small view of feed service

Context

Now you see, that detailing service is heavy and very, very computation-intensive, mostly because of some expensive queries being run on top of it. The revamp will eventually push this query job away from our primary database to a simple NoSQL database used specifically for fast reads of video metadata, while the counting functions can be shifted to a write-ahead cache design pattern (but that's about as much information as I am going to let out; a dedicated blog post on it will hopefully be out very soon).
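To make the counting idea a bit more concrete, here is a minimal sketch of a counter cache that buffers increments in Redis and periodically flushes them to MySQL in batches, which is one way to read the write-ahead cache pattern mentioned above. The keys, table, and column names are illustrative assumptions, not Trell's actual schema.

```python
# Illustrative counter cache: buffer view counts in Redis on the hot path and
# flush them to MySQL in periodic batches. Keys, table and column names are
# assumptions for the sake of the example, not Trell's actual schema.
import pymysql
import redis

cache = redis.Redis(host="localhost", port=6379)

def record_view(video_id: int) -> None:
    """Hot path: one in-memory increment, no database write."""
    cache.hincrby("video:view_counts", video_id, 1)

def flush_counts(conn: pymysql.connections.Connection) -> None:
    """Periodic job: drain buffered counts into MySQL in one batch."""
    counts = cache.hgetall("video:view_counts")
    if not counts:
        return
    with conn.cursor() as cur:
        for video_id, delta in counts.items():
            cur.execute(
                "UPDATE video_metadata SET view_count = view_count + %s WHERE id = %s",
                (int(delta), int(video_id)),
            )
    conn.commit()
    # Note: increments arriving between hgetall() and delete() are lost here;
    # a production version would rotate the hash key atomically before draining.
    cache.delete("video:view_counts")
```

The point is simply that the per-view write never touches the primary database; only the periodic flush does.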

So, the first course of action? A call in the Slack "war room" to discuss how to deal with this building-up situation.

Investigating the crime scene

First, we checked our databases and found that our readers were choking because of new cron jobs crunching data every hour, so the writer had to absorb most of the redirected load; hence, whatever service used writes was gradually choking.
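For context, the quickest way to see this kind of choking is to dump the running sessions, longest-running first. A minimal sketch, assuming pymysql and placeholder connection details:

```python
# Quick diagnostic sketch: dump the running sessions on the choking instance,
# longest-running first. Endpoint, credentials, and schema are placeholders.
import pymysql

conn = pymysql.connect(
    host="writer.example.internal",  # placeholder endpoint
    user="admin",
    password="***",
    database="trell",                # placeholder schema
    cursorclass=pymysql.cursors.DictCursor,
)

with conn.cursor() as cur:
    cur.execute("SHOW FULL PROCESSLIST")
    sessions = sorted(cur.fetchall(), key=lambda row: row["Time"], reverse=True)
    for s in sessions[:20]:
        print(s["Id"], s["Time"], s["Command"], (s["Info"] or "")[:80])
```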

Trial and Error 🔧

We began by optimizing the details query and redirecting the reads to a separate data source so that our readers could get some breathing space; eventually, RDS could redistribute the incoming load to the reader replicas while the master slowly churned through its backlog of writes.
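A minimal sketch of what the read/write split can look like with Aurora: read-only queries go to the cluster's reader endpoint so RDS can spread them across replicas, while writes stay on the writer endpoint. The endpoints, credentials, and sample table are placeholders.

```python
# Sketch of the read/write split: read-only traffic goes to the Aurora reader
# endpoint, writes stay on the cluster (writer) endpoint. Endpoints, credentials,
# and the sample table are placeholders.
import pymysql

WRITER_ENDPOINT = "mycluster.cluster-xxxx.ap-south-1.rds.amazonaws.com"     # placeholder
READER_ENDPOINT = "mycluster.cluster-ro-xxxx.ap-south-1.rds.amazonaws.com"  # placeholder

def connect(host: str) -> pymysql.connections.Connection:
    return pymysql.connect(host=host, user="app", password="***", database="trell")

read_conn = connect(READER_ENDPOINT)   # feed / detail reads go here
write_conn = connect(WRITER_ENDPOINT)  # inserts and updates stay on the writer

with read_conn.cursor() as cur:
    cur.execute("SELECT id, title FROM video_metadata WHERE id = %s", (42,))
    print(cur.fetchone())
```

The idea is that the reader endpoint load-balances across the replicas, so relieving read pressure never has to touch the write path.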

Any wins? No.

The master was not able to bring down the number of active sessions on it, so the decision was taken to horizontally scale the reader replicas and see if the sessions on the master went down.
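A sketch of what horizontally scaling the readers looks like for an Aurora cluster: adding one more reader instance via boto3. The cluster and instance identifiers, region, class, and engine value are hypothetical.

```python
# Sketch of adding one more reader to an Aurora cluster with boto3. Cluster and
# instance identifiers, region, class, and engine value are hypothetical.
import boto3

rds = boto3.client("rds", region_name="ap-south-1")

rds.create_db_instance(
    DBInstanceIdentifier="trell-aurora-reader-3",   # hypothetical new reader
    DBClusterIdentifier="trell-aurora-cluster",     # hypothetical cluster name
    DBInstanceClass="db.r5.xlarge",
    Engine="aurora-mysql",  # use the engine value matching the cluster's version
)
```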

Any wins? Nope again.

Next, the active processes were killed internally and the master machine itself was vertically scaled up to increase the available vCPUs and speed up processing. And this was where all hell broke loose.
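On RDS/Aurora you generally can't issue a plain KILL against other sessions; the documented route is the mysql.rds_kill() procedure. A sketch of the "kill the active processes" step under that assumption, with placeholder connection details and an arbitrary 5-minute cutoff:

```python
# Sketch of terminating runaway sessions on RDS/Aurora via mysql.rds_kill().
# Connection details are placeholders and the 5-minute cutoff is arbitrary.
import pymysql

conn = pymysql.connect(host="writer.example.internal", user="admin",
                       password="***", database="trell",
                       cursorclass=pymysql.cursors.DictCursor)

with conn.cursor() as cur:
    cur.execute("SHOW FULL PROCESSLIST")
    for row in cur.fetchall():
        # Only long-running application queries; never touch system threads.
        if row["Command"] == "Query" and row["Time"] > 300 and row["Info"]:
            cur.execute("CALL mysql.rds_kill(%s)", (row["Id"],))
```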

When the writer DB was scaled

The second wave 🌊

The developers could see their feeds improving, and the bugs group was almost too silent to be true. The writer machine was still choked, and yet the feed was somehow working… something wasn't adding up.

Soon enough, a developer pointed out that they were not able to log in with a new account, and that was the cue for disaster number two. Or, to put it better, we had only been tending to the symptoms so far; now we finally had to deal with the root cause. Basically, existing users were able to log in to their accounts, but new users were not able to create one.

The authentication architecture depends on putting a user token inside an activity table. The query in itself is cutely simple.
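The actual query isn't shown here, so the snippet below is a purely illustrative stand-in for "put a user token inside an activity table": a single-row insert that should normally take microseconds. Table, columns, and values are made up.

```python
# Purely illustrative stand-in for the "cutely simple" auth query: a one-row
# insert into an activity table. Table, columns, and values are made up.
import pymysql

conn = pymysql.connect(host="writer.example.internal", user="auth_service",
                       password="***", database="trell")

with conn.cursor() as cur:
    cur.execute(
        "INSERT INTO user_activity (user_id, auth_token, created_at) "
        "VALUES (%s, %s, NOW())",
        (1234567, "opaque-session-token"),
    )
conn.commit()
```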

And we couldn't really figure out why a query that should ideally take microseconds was locking up the table for minutes. Sometimes the request would go through; other times? Not really.
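One way to investigate such a stall is to look at InnoDB's own view of transactions and lock waits via SHOW ENGINE INNODB STATUS (available on MySQL 5.6+ and Aurora MySQL). A minimal sketch with placeholder connection details:

```python
# Dump InnoDB's transaction/lock-wait state and surface only the lines that
# indicate blocked transactions. Connection details are placeholders.
import pymysql

conn = pymysql.connect(host="writer.example.internal", user="admin",
                       password="***", cursorclass=pymysql.cursors.DictCursor)

with conn.cursor() as cur:
    cur.execute("SHOW ENGINE INNODB STATUS")
    status = cur.fetchone()["Status"]

for line in status.splitlines():
    if "LOCK WAIT" in line or "waiting for" in line.lower():
        print(line)
```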

Calling in the support engineers

Using a managed service comes with its costs, and it was finally time to call for support, because everything at the architecture level seemed fine. So the AWS support manager and engineers came in and took a look at what was ailing our RDS setup.

The sole input from a highly trained pool of support engineers, over a call that stretched a long 150 minutes, was…

“Is there a FTS_DOC_ID_INDEX named in your database? Please remove it and see”

While this input was correct, we couldn’t really use it instrumentally. Why? Keep on reading, you’ll see.

The hunt for FTS index and temporary breathers 🍨

There was no index by the name of "FTS_DOC_ID_INDEX", but we did manage to find something else.

Increasing innodb_ft_cache_size seemed to give us a little breathing space (we tweaked it to roughly 10x its previous value), and the queries were thankfully going through. By this time it was 10 at night. While most of the engineers dropped off, the tech leadership kept working on a contingency plan: drop the old activity-information table and create a new one with the logged-in users' information, while the guest users' information was back-populated.
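For reference, the innodb_ft_* settings mentioned above can be inspected with a simple query; on RDS they are changed through the DB parameter group rather than at runtime, since innodb_ft_cache_size is not a dynamic variable. Connection details below are placeholders.

```python
# Inspect the current InnoDB full-text settings. Connection details are
# placeholders; the actual change goes through the RDS DB parameter group.
import pymysql

conn = pymysql.connect(host="writer.example.internal", user="admin",
                       password="***", cursorclass=pymysql.cursors.DictCursor)

with conn.cursor() as cur:
    cur.execute("SHOW VARIABLES LIKE 'innodb_ft%'")
    for row in cur.fetchall():
        print(row["Variable_name"], "=", row["Value"])
```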

In short, rows were reduced and it seemed to be working.

Until it wasn’t…

The next morning the database was choking again, and the same table was the culprit. This time, a deeper investigation into "FTS_DOC_ID_INDEX" was launched, and it finally turned out that FTS index == Full-Text Search index, and not the name of a particular index in our database. (Yes, it must be pretty hard to notice something so obvious and relay it to customers promptly so they can do damage control. Totally understandable :) )
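With "FTS index" finally read as "full-text search index", hunting them down is a single query against information_schema: list every FULLTEXT index in the schema. A minimal sketch, with placeholder connection details and schema name:

```python
# List every FULLTEXT index in a schema. Connection details and the schema
# name are placeholders.
import pymysql

conn = pymysql.connect(host="writer.example.internal", user="admin",
                       password="***", cursorclass=pymysql.cursors.DictCursor)

with conn.cursor() as cur:
    cur.execute(
        "SELECT TABLE_NAME, INDEX_NAME, COLUMN_NAME "
        "FROM information_schema.STATISTICS "
        "WHERE INDEX_TYPE = 'FULLTEXT' AND TABLE_SCHEMA = %s",
        ("trell",),
    )
    for row in cur.fetchall():
        print(row["TABLE_NAME"], row["INDEX_NAME"], row["COLUMN_NAME"])
```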

If you want to know more about Full-Text Search Indexes, I’d recommend this documentation right here: MySQL 5.6 FTS reference.

The AHA moment! 🦇

The sharp eyes of a senior DevOps engineer on our team found the culprit: a full-text index on a varchar column named handleName. A simple varchar column, in a table with 200 million entries, took down a major component of our authentication system.

We removed that little troublemaker and optimized the services dependent on this particular table so that they wouldn't respond extremely slowly and inflate our gradually recovering p95.
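A sketch of removing that troublemaker, assuming hypothetical table and index names; dropping an index on a 200-million-row table is best scheduled for a low-traffic window.

```python
# Drop the offending full-text index. Table and index names are hypothetical.
import pymysql

conn = pymysql.connect(host="writer.example.internal", user="admin",
                       password="***", database="trell")

with conn.cursor() as cur:
    cur.execute("ALTER TABLE user_activity DROP INDEX ft_handle_name")
```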

Trust building, the final step

Yes, the trust-building process is the final step the tech team has to go through, because even when the event is behind them, it leaves an impression that feeds the stereotype that whenever something is wrong in the organization's operations, it must be that "tech is down", or

Orders are low? Pakka tech fata hoga. (The tech platform must be down)

Of course, a chat with the stakeholders is always on the cards in this scenario, but that's generally for the tech leads and engineering managers to handle, not the juniors.

And in the end, a proper RCA (root cause analysis) has to be documented, and a general guideline published on what should be kept in mind when designing further architecture and features.

RCA in our case

Full-Text Search indexes were not compatible with the RDS Aurora version we were using; this was essentially a bug in the Aurora product, and a patch for it is slated for the next Aurora release (yes, if you are able to use FTS indexes in your database on the latest Aurora version, you know whom to thank).

Conclusion

And that was it: the conclusion of grueling hours spent getting the systems back online. While it's all fun and games to watch a technical outage from the outside, thinking you finally got a small break to chill and enjoy, always remember there is a tech team preparing to sacrifice its weekend to get the engines running again while simultaneously performing disaster control. And no, it's not always an intern's fault; sometimes all it takes to bring down a system for a couple of days is a clueless support team.
