This is the final part of the series describing how we’re increasing our service availability in Citymobil (you can read the previous part here). Now I’m going to talk about one more type of outage, the conclusions we drew from it, how we modified the development process, and what automation we introduced.
1. Bad release: bug
This is the most unpleasant kind of outage. It is the only kind that has no visible symptoms besides complaints from end users or business users. That’s why such an incident, especially a small one, can remain unnoticed in production for a while.
All other kinds of outages are more or less similar to «Bad release: 500 internal server errors». The only difference is that they aren’t triggered by a release, but rather by workload, a manual operation or an external service problem.
To describe the method of dealing with this kind of outage, we should recall an old joke:
A mathematician and a physicist are given the same task: to boil water. They are given some tools: a stove, a kettle, a faucet with water, and matches. Each of them in turn fills the kettle with water, turns on the gas and heats the kettle. Then the task is simplified: they are given a kettle filled with water and a stove that is already on. The task is the same: boil water. The physicist puts the kettle on the stove. The mathematician empties the kettle, turns off the gas and says: «The problem is reduced to one that’s already been solved.» © anekdotov.net
This kind of outage must be reduced to «Bad release: 500 internal server errors» at all costs. Ideally, bugs would be logged the same way as 500 errors. Of course, you can’t log the event of a bug: if you could, you wouldn’t have made the bug in the first place.
One idea for tracking a bug is to search for its traces in the database. These traces let us see that there’s a bug and send out alerts. How can we make this happen? We began investigating every big bug and coming up with solutions: what kind of monitoring and SMS alerting would make this bug reveal itself right away, just like a 500 error? Here are some examples.
1.1. Example of an outage caused by a bug
Suddenly we started receiving a massive number of complaints from our users: orders paid via Apple Pay couldn’t be closed. We started our investigation and reproduced the problem in the test environment. The root cause was found: we had updated the format of the credit card expiration date because the payment processing service had changed it, but we didn’t do it correctly for payments via Apple Pay; therefore, all Apple Pay payments were declined. We fixed it as soon as we understood the issue, deployed the fix and the problem was gone. However, the problem had been live for 45 minutes.
In the wake of this issue, we started monitoring the number of unsuccessful Apple Pay payments and created SMS and IVR alerts with a nonzero threshold (some unsuccessful payments are normal for the service; for instance, when a client has no money in her account or her credit card is blocked). Since then, we find out immediately when the threshold is crossed. If a new release breaks Apple Pay processing, monitoring tells us right away and we roll the release back within 3 minutes (the process of manual rollback is described in one of the previous articles). So it used to be 45 minutes of partial downtime, and now it’s 3 minutes. Profit!
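As a rough illustration of such a check, here is a minimal Python sketch. The record fields (`method`, `status`, `created_at`), the 5-minute window and the threshold value are assumptions made for the example, not details of our production monitoring.

```python
from datetime import datetime, timedelta

# Hypothetical sliding window; the real window size is not stated in the article.
WINDOW = timedelta(minutes=5)


def count_recent_failures(payments, now, window=WINDOW):
    """Count failed Apple Pay payments whose timestamp falls in [now - window, now]."""
    return sum(
        1
        for p in payments
        if p["method"] == "apple_pay"
        and p["status"] == "failed"
        and now - window <= p["created_at"] <= now
    )


def check_apple_pay(payments, now, threshold=3):
    """Return an alert text when failures exceed the nonzero threshold, else None.

    The threshold is nonzero because a few declined payments are normal
    (no money in the account, blocked card).
    """
    failures = count_recent_failures(payments, now)
    if failures > threshold:
        return f"ALERT: {failures} failed Apple Pay payments in the last window"
    return None
```

In a real setup the returned text would be handed to whatever sends the SMS or IVR call; that part is left out here.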
1.2. Other examples
A bug in the order list. We deployed an optimization of the order list in the driver app. The code had a bug, and as a result drivers sometimes saw an empty order list. We found out about the bug by chance: one of the engineers was playing with the driver app and came across the issue. We quickly identified the bad release and rolled it back right away. Afterwards we built a graph of the average number of orders in the list, based on data from the database, and looked at it retrospectively, month by month. We noticed a recent dip caused by that bug and created an SMS alert driven by the same SQL query that builds the graph. The alert fires when the average number of orders in the list drops below a lower threshold, which was set based on the minimum for the current month.
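An alert of this kind might look roughly like the sketch below. It uses SQLite only to stay self-contained; the table `order_list_views` and its columns are hypothetical names invented for the example, and the choice to stay silent when there is no data is an assumption as well.

```python
import sqlite3

# Hypothetical schema: one row per order list shown to a driver.
#   order_list_views(shown_at TEXT, orders_shown INTEGER)
AVG_SQL = """
SELECT AVG(orders_shown)
FROM order_list_views
WHERE shown_at >= ? AND shown_at < ?
"""


def average_orders_shown(conn, start, end):
    """Average list length in [start, end); None when the interval has no rows."""
    return conn.execute(AVG_SQL, (start, end)).fetchone()[0]


def low_orders_alert(conn, start, end, lower_threshold):
    """True when the average list length dropped below the lower threshold."""
    avg = average_orders_shown(conn, start, end)
    if avg is None:
        # No data at all is a different problem; this sketch does not alert on it.
        return False
    return avg < lower_threshold
```

Timestamps are passed as ISO-formatted strings, which compare correctly as text in SQLite.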
A bug in cashback. We were changing the logic of giving cashback to users. Among other things, we sent it to the wrong group of clients. We fixed the problem and built a graph of the cashback given away; it showed a drastic increase that had never happened before, so we created an SMS alert with an appropriate threshold.
Another bug in payments. A new release introduced a bug: it took forever to place an order, card payment didn’t work, and drivers asked clients to pay in cash. We found out about the problem through call center complaints. We deployed a fix and created an alert on the order closing time, with thresholds chosen by analyzing historical graphs.
As you can tell, we are using the same approach for dealing with all the incidents of this kind:
- We find out about a problem.
- We troubleshoot it and find a bug in code.
- We fix it.
- We figure out which traces in the database point to the problem (traces can also be found in web-server logs or Kibana).
- We build a graph of these traces.
- We go back in time and look at the ups and downs in the graph.
- We select a good threshold for an alert.
- When the problem arises again, we immediately find out about it via an SMS alert.
What’s good about this method: one graph and one alert cover a whole group of problems (examples of problem groups: orders can’t be closed, extra bonuses, Apple Pay payments don’t go through, etc.).
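The threshold selection step in the list above can be sketched as follows. This is only one possible heuristic: the `margin` parameter is an assumption introduced for the example, and `history` is expected to contain values from normal periods only (the dip or spike caused by the bug itself should be left out).

```python
def pick_thresholds(history, margin=0.25):
    """Derive lower and upper alert thresholds from historical metric values.

    The thresholds sit `margin` below the historical minimum and `margin`
    above the historical maximum, so normal fluctuations stay silent.
    """
    if not history:
        raise ValueError("history must not be empty")
    lower = min(history) * (1 - margin)
    upper = max(history) * (1 + margin)
    return lower, upper


def is_anomalous(value, lower, upper):
    """True when the current value leaves the band seen in normal history."""
    return value < lower or value > upper
```

A metric like the average order list length would mostly use the lower bound, while a metric like cashback given away would mostly use the upper one.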
Eventually, creating alerts and monitoring for every big bug became a part of our engineering culture. In order not to lose this culture, we formalized it just a bit. We began to force ourselves to create a report for every outage. The report is a form with answers to the following questions: root cause, how we fixed it, business impact, takeaways. All the fields are mandatory, so we have to draw conclusions whether we like it or not. This process change was, of course, written down in the Do’s and Don’ts.
Our process automation level was increasing, and we decided it was time to create a web interface that would show the current state of the development process. We called this web interface «Kotan» (from the Russian word «катить», «to roll out») :-)
«Kotan» has the following functionality:
List of incidents. It contains all the alerts triggered in the past that required an immediate human reaction. For every incident we record the time it started, the time it ended (if it has ended), a link to the report (if the incident is over and a report exists) and a link to the alerts directory showing what type of alert the incident belongs to.
The alerts directory. This is essentially a list of all the alerts. To make it clearer, the difference between an alert and an incident is the following: the alert is like a class, whereas the incident is an object. For example, «number of 500 errors is greater than 1» is the alert, and «number of 500 errors is greater than 1, and they happened on this date, at this time, lasted this long» is an incident. Every alert is added to the system manually through the process described above, after we solve some specific problem that the alert system had never detected before. This iterative approach keeps the risk of false positive alerts (ones that require no action) low. The directory contains the complete report history for every type of alert, which helps diagnose an issue quicker: you receive an alert, go to «Kotan», click on the directory, check out the history and get an idea of where to dig. A key to successful troubleshooting is having all the information at hand. Each directory entry also has a link to the alert’s source code (so you know exactly what situation the alert signals) and a written description of the best current methods of handling this alert.
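The class/object analogy can be made literal in a small data model. The sketch below is illustrative only: the field names are assumptions based on the description above, not the actual «Kotan» schema.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass(frozen=True)
class Alert:
    """A type of problem we know how to detect (an entry in the alerts directory)."""
    name: str
    description: str   # e.g. "number of 500 errors is greater than 1"
    source_url: str    # link to the alert's source code
    runbook: str       # best current methods of handling this alert


@dataclass
class Incident:
    """One concrete occurrence of an alert, with its own start and end time."""
    alert: Alert
    started_at: datetime
    ended_at: Optional[datetime] = None   # None while the incident is ongoing
    report_id: Optional[int] = None       # set once a report has been written

    @property
    def is_open(self) -> bool:
        return self.ended_at is None

    def duration(self) -> Optional[timedelta]:
        """How long the incident lasted, or None if it is still ongoing."""
        if self.ended_at is None:
            return None
        return self.ended_at - self.started_at
```

Many `Incident` objects can point to the same `Alert`, which is what makes the per-alert report history possible.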
Reports. These are all the reports ever written. Every report has links to all the incidents it’s associated with (sometimes incidents come in groups with the same root cause, and we create one report for the whole group), the date the report was written, a flag confirming that the problem is solved and, most importantly, the root cause, how it was fixed, the impact on business, and the takeaways.
List of takeaways. Every takeaway has a note stating whether it’s been implemented, implementation is still coming, or it’s not needed (with an explanation why not).
3. What’s changed in the software development process?
A critical component of availability improvement is the software development process. The process is constantly changing, and the goal of every change is to decrease the chance of incidents. Decisions to amend the process shouldn’t be made in the abstract; they should be based on experience, facts and numbers. The process must not be imposed from the top down, but built from the bottom up with all team members actively participating, since the many heads of the whole team are better than the one head of a manager. The process must be followed and monitored; otherwise, there’s no sense in having it. Team members must correct each other when someone diverges from it: who else would do it for them? The main functions must be automated as much as possible, since humans make mistakes constantly, especially in creative work.
In order to be sure that each incident has takeaways, we have done the following:
- Every alert automatically blocks the releases.
- When we receive a closing alert (a text message stating that the incident is over), we don’t unblock the releases right away; instead, we’re prompted to create an incident report.
- A report must contain the following information: the root cause of the incident, how it was fixed, business impact, takeaways.
- The report is written by the team that troubleshot the incident, that is, by those who have the complete information about it.
- Releases stay automatically blocked until such a report is created and approved. That motivates the team to concentrate and write the report right after the incident is fixed.
- The report must be approved by someone who’s not on the team, to make sure that it contains all the information needed to understand it.
By doing so, we, on the one hand, achieved self-discipline in recording each incident in history, and on the other, provided automated control. It is now impossible not to draw conclusions or not to write a report.
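The release-blocking rules above can be sketched as a small state holder. This is a simplified illustration, not the actual implementation: the class name, the method names and the report field names are all assumptions made for the example.

```python
# Hypothetical names for the mandatory report fields.
REQUIRED_FIELDS = ("root_cause", "fix", "business_impact", "takeaways")


class ReleaseGate:
    """Blocks releases from the moment an alert fires until its report is approved."""

    def __init__(self):
        self._incidents = {}  # incident_id -> state dict

    def alert_fired(self, incident_id):
        self._incidents[incident_id] = {"closed": False, "report": None, "team": frozenset()}

    def alert_closed(self, incident_id):
        # The incident is over, but releases stay blocked until the report is approved.
        self._incidents[incident_id]["closed"] = True

    def submit_report(self, incident_id, report, team):
        missing = [f for f in REQUIRED_FIELDS if not report.get(f)]
        if missing:
            raise ValueError(f"mandatory fields missing: {missing}")
        state = self._incidents[incident_id]
        state["report"] = report
        state["team"] = frozenset(team)

    def approve_report(self, incident_id, approver):
        state = self._incidents[incident_id]
        if not state["closed"] or state["report"] is None:
            raise ValueError("incident is still open or has no report yet")
        if approver in state["team"]:
            raise ValueError("approver must be outside the troubleshooting team")
        del self._incidents[incident_id]  # this incident no longer blocks releases

    @property
    def releases_blocked(self):
        return bool(self._incidents)
```

A deployment script would check `releases_blocked` before every release and refuse to proceed while it is true.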
4. In lieu of an epilogue
In lieu of an epilogue, let me quickly summarize what we changed in the software development process in order to decrease the number of lost trips.
| What did we change? | Why did we change it? |
| --- | --- |
| We started to keep an incident log. | To have takeaways and prevent future incidents. |
| We create a post-mortem for every big outage (one with many lost trips). | To learn how to quickly troubleshoot and fix such outages in the future. |
| We created the Do’s and Don’ts file. | To share the nuggets of wisdom with all the engineers. |
| Only one release per five minutes. | To reduce the duration of troubleshooting. |
| First, we deploy code on one low-priority web-server, and only then on all the others. | To reduce the impact of a bad release. |
| Automated bad release rollback. | To reduce the impact of a bad release. |
| No deployments during an outage. | To speed up troubleshooting. |
| We write about releases and incidents in the group chat. | To speed up troubleshooting. |
| We monitor graphs for 3 minutes after every release. | To speed up troubleshooting. |
| SMS and IVR alerts regarding issues. | To speed up troubleshooting. |
| Every bug (especially a big one) is followed by new monitoring and alerting. | To speed up troubleshooting of a similar situation in the future. |
| Code optimization review. | To reduce the chance of incidents due to database overload. |
| Regular code optimization (with MySQL slow.log as an input). | To reduce the number of incidents due to «Easter eggs». |
| Every incident must have a takeaway. | It reduces the chance of such an incident in the future. |
| Every incident must have an alert. | It reduces the duration of troubleshooting and fixing such incidents in the future. |
| Automated blocking of new releases after an incident until a report is written and approved. | It increases the chance of having proper takeaways, thus reducing the chance of such incidents in the future. |
| «Kotan», an automated service improvement tool. | It reduces the duration of outages and the chance of their occurrence; it also reduces the duration of troubleshooting. |
Thanks for reading till the end. Good luck with your business! I wish you fewer lost orders, transactions, purchases, trips and whatever else is crucial for you!