This is the next article of the series describing how we’re increasing our service availability in Citymobil (you can read the previous parts here: part 1, part 2, part 3). In further parts, I’ll talk about the accidents and outages in detail.
1. Bad release: database overload
Let me begin with a specific example of this type of outage. We deployed an optimization: added USE INDEX in an SQL query; during testing as well as in production, it sped up short queries, but the long ones — slowed down. The long queries slowdown was only noticed in production. As a result, a lot of long parallel queries caused the database to be down for an hour. We thoroughly studied the way USE INDEX worked; we described it in the Do’s and Dont’s file and warned the engineers against the incorrect usage. We also analyzed the query and realized that it retrieves mostly historical data and, therefore, can be run on a separate replica for historical requests. Even if this replica goes down due to an overload, the business will keep running.
We kept stumbling upon the same issues afterward, and at some point, we decided to address this matter. We’d studied the code through and moved all the queries that we could without compromising our service to the replicas. The replicas themselves were divided depending on their level of criticality, so that none of them could fail and stop the service. As a result, we came up with an architecture with the following databases:
- master database — for write operations and queries that are super sensitive to the data freshness;
- production replicas — for short queries that are less sensitive to data freshness;
- replicas for price surge coefficients. These replicas can be 30-60 seconds behind; that’s not crucial as the coefficients don’t change that frequently and if this replica goes down, the service won’t stop working; the prices just won’t fully match the supply and demand balance;
- replica for operational settings and the call center. If it crashes, the business will keep running but without customer and driver support, and we won’t be able to temporarily alter some settings;
- many replicas for ad hoc analytics and dashboards;
- MPP (massively parallel processing) database for some massive analytics with complete slices on the historical data.
This architecture provided us with vast space for growth and reduced a number of crashes due to nonoptimal SQL queries. But it’s still far from being perfect. We are planning to implement sharding, so that we can scale updates and deletes and super sensitive to data freshness queries. MySQL margin of safety is not infinite. We’ll need some heavy artillery in the form of some in-memory database (e.g. Tarantool) soon. I’ll definitely be talking about it in my next articles.
While we were dealing with nonoptimal code and queries, we understood the following: any nonoptimality should be eliminated before it has been released and not after. This decreases the risk of outages and efforts of engineering teams for optimization. If the code has already been deployed with some new releases on top of it, it’s much harder to optimize it. As a result, we introduced a mandatory code review for optimization. It’s carried out by our most experienced engineers, our elite force.
We also started collecting the best code optimization methods suitable for our realia in the Do’s and Dont’s. They are listed below. Please, do not take these practices as undeniable truth and do not try to replicate them blindly. Every method makes sense only for some specific situation and specific business. This is just an example to clarify the specifics:
- If an SQL query doesn’t depend on the user (for example, driver’s demand map with maximum fares and surge coefficients — this map is the same for any user of the driver app), then this query must be performed in a cron script with some specific frequency (once a minute is enough in that case). The result must be written in a cache (Memcached or Redis) that should be used in production code.
- If an SQL query operates with the data which delay isn’t crucial for the business, then its result must be placed in a cache with some TTL (30 seconds, for instance) and then be read from the cache by production code.
- If in a specific server method implementation (in PHP or another server-side language) you decide to make an SQL query, you need to make sure that data that you need hasn’t already arrived with some other SQL query or about to arrive in your code below.
- Same as above goes to requests to a cache. A cache is fast but still is a database. So it can also be overloaded. A common mistake is that you think that a cache is sort of a normal in-memory variable and use it as a variable. But access to it involves a network round trip and generates a workload to Redis or Memcached. Therefore, if the data has already arrived from the cache then don’t just take from the cache what has already been taken.
- If during a web request processing (again in PHP or any other language) you need to call a function, you need to make sure there’ll be no extra SQL queries or cache access made within. If such a function call is inevitable, you must make sure that it can’t be modified, or its logic hadn’t broken down in order to prevent unnecessary database/cache queries.
- If it’s necessary to perform an SQL query, you have to be absolutely sure that fields that you need cannot be added to already existing queries in your code above or below.
2. Bad manual operation
Examples of these accidents:
- bad ALTER that overloaded the database or caused replica delay;
- bad DROP (e.g., we came across a bug in MySQL that blocked the database whilst dropping a table);
- a heavy query on a master, made manually by mistake;
- a web-server was being configured even though it was under the real workload whereas we thought it was out-of-operation.
To minimize outages due to these reasons, we have to investigate an accident nature every time it occurs. We haven’t figured out a general rule yet. Again, let’s look at some specific examples. Surge coefficients (taxi fare at the time and place of high demand is multiplied by them) stopped working at some point. The reason was that there was a python script working on a database replica server where the data for calculation of coefficients was taken from and the script used up all the memory and replica went down. The script had been running for a while; it was operating right on the replica for the sake of convenience. The problem was solved by restart of the script. The following conclusions were drawn: do not run foreign scripts on a database server (it was written in Do’s and Don’t’s; otherwise, it’d be an empty shot!), monitor the memory usage on a database server and alert via SMS if that server is about to run out of memory.
It’s essential always to draw conclusions and not get comfortable in the «saw the problem, fixed it, forgot about it» kind of situation. Quality service can only be offered if one walks away with a conclusion. Besides that, the SMS alerts are critical — they increase the service quality level, they don’t let it go down and allow us to increase its reliability. Like a mountain climber who gets to a stable position and then pulls himself up to another stable position, but only higher this time.
Monitoring and alerts are not visible, but they act like iron hooks cutting into the rock of unknown preventing us from falling below our service level agreement that we are continually increasing.
3. Easter egg
What we call «an Easter egg» — is a delayed-action mine that we haven’t tripped yet, even though it’s been existing for a while. Outside of this article, this term is used for undocumented features created on purpose. In our case, it’s not a feature at all, but rather a bug that acts like a time bomb and appears as a side effect of some well-intentioned activity.
- overfill of 32-bit
- nonoptimality of code/configuration, triggered by high workload;
- delayed replica by a nonoptimal query brought on by a new pattern of usage or by heavier workload;
- delayed replica by a nonoptimal UPDATE operation on the master that was caused by a new workload pattern and delayed the replication.
Another popular kind of Easter egg is nonoptimal code; to be more specific — nonoptimal SQL query. The table used to be smaller and the workload was lighter — the query worked well. With the linear in time table growth and linear in time workload increase, resources consumption by a database management system was growing quadratically. Usually, that leads to a drastic negative effect: it’s like everything used to be ok and then suddenly — oops!
Rarer scenarios — combination of bugs and Easter eggs. A release with a bug led to enlargement of a database table and increased a number of table rows of a specific kind whilst an already existing Easter egg caused the database overload due to the slower queries to this expanded table.
We used to have some not workload related Easter eggs, though. For instance, a 32-bit
auto_incrementfield in MYSQL. After a little over 2 billion rows, inserts fail. Therefore, in the modern world, we must use 64-bit
auto_incrementfields only. We learned that lesson well.
How to deal with Easter eggs? The answer sounds straightforward: a) search for the old eggs, b) don’t allow the new ones to appear. We are trying to do both. A search for the old eggs goes hand in hand with our continuous code optimization. We appointed two of the most experienced engineers to perform optimization almost full-time. They find queries in slow.log that use databases resources the most; they optimize these queries and the code around them. We low a possibility of new eggs emergence via testing of every commit for optimality performed by the sensei engineers mentioned above. Their task is to point at the mistakes affecting the performance, to suggest the way to make things better and pass this knowledge onto other engineers.
At some point, right after finding another Easter egg, we realized that it was a good thing to look for slow queries, but we should’ve also looked for the queries that appear to be slow but work fast. These are the next contenders for crashing everything in case of another table explosive growth. The stupid but obvious example here is a query that full scans a table of 10 rows without using indexes at all. It will work fast for the time being. However, when the table is big enough, then the query will take the database down. That’s the very Easter egg.
4. External causes
These are the causes that we seem unable to control very well. Put another way, those are the causes that can only be mitigated but not eliminated. For example:
- Throttling our requests by a map service provider. It can be mitigated via the service usage control, adherence to a specific workload level, planning of workload increase beforehand and the service expansion purchase. However, we can’t do without maps.
- Network failure in a data center. It can be mitigated by placing the copy of the service in a backup data center. However, we can’t do without a data center either physical or cloud.
- Payment service down. It can be mitigated by payment services backup. However, we can’t do without payments.
- Errant blocking of traffic by a DDoS protection service. It can be mitigated by turning off the DDoS protection service by default with turning it on only in case of a DDoS attack. However, we can’t do without DDoS protection.
Since even mitigation of an external cause is a long and expensive endeavor, we began to collect statistics for accidents caused by the external reasons and wait for critical mass accumulation. We don’t have a recipe for defining of a critical mass. It’s all about mere intuition. For example, if we were completely down five times due to, let’s say, the DDoS protection service issues, then with every subsequent downtime a need for alternative would become more and more acute.
On the other hand, if we can somehow make everything work with an unavailable external service, we definitely do that. Post-mortem analysis of every outage helps us here. There’s always should be a conclusion. Which means that like it or not — we always come up with a workaround.
In the final part, I’m going to talk about one more type of outages and the conclusions we made about them, how we modified the development process, what automation we introduced. Stay tuned!