Implementing Fault-Tolerance PostgreSQL Cluster with Patroni

I'm a DevOps Teamlead at Miro. Our service is a high-load one: it is used by 4 million users worldwide, daily active users - 168,000. Our servers are hosted by Amazon and located in a single Ireland region. We have more than 200 active servers, with almost 70 of them being database servers.

The service's backend is a monolith stateful Java application that maintains a persistent WebSocket connection for each client. When several users collaborate using the same whiteboard, they see changes on the whiteboard in real-time. That's because we write every change to a database, resulting in ~20,000 requests per second to the databases. During peak hours, the data is written to Redis at ~80,000–100,000 RPS.

I am going to speak about why it is important to us to maintain PostgreSQL high availability, what methods we've applied to solve the problem, and what results we've achieved so far.

Why We Switched From Redis to PostgreSQL

In the beginning, our service worked with the Redis key-value database that stores all data in the server's RAM.

Pros of using Redis:

• It responds quickly, as all data is stored in RAM.
• Redis instances are convenient to back up and replicate.

Cons of using Redis for us:
• Redis lacks true transactions. We tried to imitate them in the application; unfortunately, this approach didn't work very well and required writing very complex code.
• The amount of data you can put in Redis is limited by the server's memory amount. The RAM consumption will increase as data volume grows, resulting in hitting the limitations of a chosen instance. To change the instance type in AWS, it is required to stop the whole service.
• For us, it is necessary to always keep latency low due to a large number of requests. We consider a 0.17–0.20ms latency optimal. With 0.30–0.40ms latency, it will take a lot of time for backend application to respond, resulting in degraded service. Unfortunately, this happened to us in September 2018, when one of the Redis instances had a latency twice as high compared to a baseline. To mitigate this issue, we stopped the service in the middle of the day and replaced the problem Redis instance.
• It is easy to render the data inconsistent, even if the errors in the source code are minor, and then spend a lot of time writing code that will fix this inconsistency.

We considered these cons and understood that we had to switch to something that is handier, has true transactions, and is less influenced by latency. We conducted research, analyzed a variety of options, and then chose PostgreSQL.

As of now, we have been moving to this new database for 2 years and have migrated only a half portion of data so far - that's why we are working with Redis and PostgreSQL in parallel.

At the very beginning of this migration, our application interacted with the database in a direct way, querying both Redis and PostgreSQL masters. The PostgreSQL cluster comprised the master and the slave with async replication. The scheme of interaction was as follows:

Implementing PgBouncer

While we were migrating from one database to another, our product evolved: the number of users increased as did the number of servers that worked with PostgreSQL, and then the number of connections available became insufficient. PostgreSQL forks a separate process for each connection, consuming the computational resources. The number of connections can be increased only to a certain level; otherwise, there is a chance that the database operation will become nonoptimal. In this case, the ideal solution was to choose a connection manager and place it in front of the database.

We had two candidates for the connection manager: Pgpool and PgBouncer. The former does not support transactions when working with the database, so PgBouncer became our tool of choice.

We came up with the following workflow: our application interacts with the single PgBouncer, PostgreSQL master instances are placed behind the PgBouncer, and a single slave with async replication enabled is placed behind each master instance.

Then we began to shard the PostgreSQL database on an application level because we couldn't store all data in a single PostgreSQL database and we were concerned with the database speed. The scheme described above is quite convenient if you need to achieve the same goals: it is sufficient to update the PgBouncer configuration when adding a new PostgreSQL shard to make the application work with this shard immediately.

PgBouncer High Availability

This scheme worked until our single PgBouncer instance died. We are hosted in AWS, where all instances run on hardware that dies periodically. If a piece of hardware dies, the instance is moved to a new piece of hardware and then runs. PgBouncer was moved too, but it became unavailable. Our service was down for 25 minutes as a result of this crash. Amazon suggests creating redundancy of critical services on the user side to fight such situations - something we hadn't implemented at that point.

After this incident, we took PgBouncer and PostgreSQL cluster reliability seriously, as a similar situation might repeat with any AWS instance we used.

We implemented high availability for PgBouncer as follows: all application servers interact with Network Load Balancer, with two PgBouncer instances behind it. Each PgBouncer connects to the PostgreSQL master in each shard. Now, if an AWS instance with a PgBouncer is down again, all traffic will be redirected to the other PgBouncer. AWS takes care of the Network Load Balancer availability.

This workflow allows adding new PgBouncer servers with ease.

Building Fault-Tolerance PostgreSQL Cluster

When working on this task, we took into consideration a few different options: custom failover scripts, repmgr, AWS RDS, and Patroni.

Custom Scripts

Such scripts can monitor the PostgreSQL master, promoting a slave to master and reconfiguring PgBouncer if the master fails.

The advantage of this solution is its simplicity because you write your own scripts and know for sure how they work.

On the other hand, the disadvantages are:

• It is possible for the master to be not dead but rendered unavailable due to a network failure. Having no knowledge of the network failure, the custom script will promote a slave to the master role, while the current master will continue to work. As a result, we will get two master servers, not knowing which one of these holds the latest and actual data. A «split-brain» is another term for this situation.
• Now we are out of slaves. Our configuration comprises a single master and a single slave. When the script promotes the slave to the master as a result of failover, no more slaves are left, so we are forced to manually add a new slave.
• Failover requires extra monitoring configuration. We have 12 PostgreSQL shards, meaning that we have to monitor 12 clusters. It is also necessary to not forget to update the failover script when increasing the shard count.

A custom failover script is a fairly complex thing and requires nontrivial maintenance. For a single PostgreSQL cluster, scripting is the simplest option; however, it is not suitable for us due to its poor scalability.

Repmgr

Replication Manager for PostgreSQL can manage a PostgreSQL cluster. This tool does not have an out-of-the-box automatic failover feature, so it is required to write a wrapper for the tool to do the job. Saying that, things may get even more complex when compared to working with custom scripts, so we didn't even give Repmgr a try.

AWS RDS

This tool does everything we need, can create backups, and manages the connection pool. Also, it has an automatic failover feature: if the master dies, its slave will become the new master instead, and AWS will rewrite the DNS records so they point to the new master. Moreover, slaves can be located in different availability zones.

Speaking of disadvantages, AWS RDS is not fine-tunable enough for us. For example, we enforce some restrictions on TCP connections on our instances, but it is impossible to achieve the same with RDS:

net.ipv4.tcp_keepalive_time=10
net.ipv4.tcp_keepalive_intvl=1
net.ipv4.tcp_keepalive_probes=5
net.ipv4.tcp_retries2=3

Also, AWS RDS is almost double the price compared to the regular price of an instance: that is the main reason why we discarded this tool.

Patroni

This is a Python template to manage PostgreSQL that has good documentation and an automatic failover feature. Also, its source code is hosted on GitHub.

Pros of using Patroni:

• Each configuration option is described in detail; it is clear how it works.
• The automatic failover feature works out-of-the-box.
• The template is written in Python, and we use Python a lot, so it will be easy for us to fix issues, if any, and maybe even to contribute to the project.
• It completely manages PostgreSQL, allows changing the configuration on every node in the cluster simultaneously, and can be used to restart the cluster to apply the new configuration, if necessary.

Cons of using Patroni:

• It is unclear from documentation how to work with PgBouncer in the right way. But this hardly should be considered a Patroni disadvantage, as its goal is solely to manage PostgreSQL. It is our job to handle the connections to Patroni in a proper way.
• There are not many examples of using Patroni in huge environments, while many examples exist that describe deployment from scratch.

Ultimately, we chose Patroni as the tool that would help us to build a high availability PostgreSQL cluster.

Implementing Patroni

We had twelve PostgreSQL shards configured as «one master with one slave with async replication» before Patroni was integrated into our infrastructure. Application servers communicated with the databases via a Network Load Balancer. All PostgreSQL servers were placed behind two PgBouncer instances that were put behind the balancer.

Prior to deploying Patroni, we had to choose a distributed store to keep all cluster configurations together. Patroni supports many distributed configuration stores, like etcd, Zookeeper, and Consul. Luckily, we had an unused Consul cluster connected to Vault in production. So we decided that it was the perfect time to start using Consul for the things it was intended for.

How Patroni Interacts with Consul

So we had a three-node Consul cluster and a Patroni cluster that comprised a leader and a replica (Patroni refers to the «master node» as the «leader» and to the «slave node» as the «replica»). Each instance of the Patroni cluster continuously sends information about the state of the cluster to Consul. Therefore, it is possible to fetch the current leader and the actual configuration of the Patroni cluster from Consul.

To connect Patroni to Consul, it is sufficient to read the official documentation. According to that, it is necessary to specify the Consul endpoint to connect to in http(s)://host:port format; optionally, you can specify the scheme to use:

host: the host:port for the Consul endpoint, in format:
http(s)://host:port
scheme: (optional) http or https, defaults to http

It seemed simple to do, but a few hidden caveats were revealed by us. We worked with Consul via a secure connection (HTTPS) and our config file looked as follows:

consul:
host: https://server.production.consul:8080
verify: true
cacert: {{ consul_cacert }}
cert: {{ consul_cert }}
key: {{ consul_key }}

Apparently, it did not work. On start, Patroni tried to connect to Consul via HTTP with no luck, despite HTTPS being specified in the config file.

We managed to solve this issue by examining the Patroni source code (luckily for us, it is in Python). It turned out that the host parameter was not supposed to be parsed at all, and it was necessary to explicitly specify the protocol via the scheme parameter. So we came up with the following working config for Patroni to interact with Consul:

consul:
host: server.production.consul:8080
scheme: https
verify: true
cacert: {{ consul_cacert }}
cert: {{ consul_cert }}
key: {{ consul_key }}

Using Consul-Template

So, the configuration store was chosen. Then it was necessary to understand how to force PgBouncer to switch its configuration in the case of changing the leader in the Patroni cluster. The Patroni documentation provided us with no answers, as there were no descriptions on how to work with PgBouncer at all.

Searching for a solution, we found an article (unfortunately, I can't remember the article's name) that mentioned Consul-template as a very helpful instrument to make PgBouncer and Patroni play together nicely. This pushed us to do some digging on Consul-template.

Turned out that Consul-template continuously monitors the configuration of the PostgreSQL cluster that is stored in Consul. When the leader is changed in the Patroni cluster, Consul-template updates the PgBouncer configuration and then commands it to reload the configuration.

One of the huge advantages of the template is that it is stored as code. This means that if it is necessary to add a new shard, it is sufficient to make a new commit and automatically update the template, sticking with the «Infrastructure-as-Code» paradigm.

New Service Architecture with Patroni

As a result, we ended up with the following architecture for our service:

All application servers interact with the load balancer that sends the servers' requests to one of the PgBouncer instances. The Consul-template runs on each of these instances, monitoring the Patroni cluster and checking the PgBouncer configs to be in the actual state. Then a PgBouncer instance sends the requests to the current leader of every Patroni cluster.

Manual Testing

To check the failover feature, the aforementioned architecture was tested in a small staging environment prior to being deployed in production.

We opened a whiteboard, then moved a sticker and killed the leader of the Patroni cluster at the very same moment. To kill the leader, it is sufficient to shut down the instance the leader resides on via the AWS console.

The sticker was reverted to its original position within a 10–20 second interval, then we were able to move it in a normal way again. This meant that the Patroni cluster worked as expected: it changed the leader and notified Consul, then Consul-template immediately fetched this information, replaced the PgBouncer configuration, and sent the command to reload.

How to Survive High Load While Maintaining Minimal Downtime

It all worked great! But new questions were arising:

1. How would the new architecture handle a real workload?
2. we deploy the new service architecture to production in a quick and safe way?

To address the first question, we conducted performance testing in the staging environment. This environment was built to be completely identical to the production one and was supplied with test data that was nearly identical to the production data in terms of dataset size. We simply decided to kill one of the PostgreSQL master instances during the test to find out what would happen. But prior to doing that, it was important to check the procedure of automated deployment because we had a few PostgreSQL shards in the staging environment; additionally, this procedure would provide us with perfectly tested configuration scripts.

Both tasks are ambitious, but we had PostgreSQL 9.6. Why not update it straight to 11.2?

So we decided to split the overall task into two stages: upgrade the database to 11.2 in the first place, and then run the Patroni cluster. (This version of PostgreSQL was actual in March 2019, our current PostgreSQL version is 11.7)

To perform a quick upgrade of the PostgreSQL version, it is necessary to use the -k option. This option instructs PostgreSQL to create hard links on a storage drive, so it is not required to copy the data. With the -k option specified, the update process takes one second to complete for databases that are 300–400GB in size.

We had a lot of PostgreSQL shards, so the update process had to be automated. To achieve this, we wrote an Ansible playbook that did the upgrading for us:

/usr/lib/postgresql/11/bin/pg_upgrade \
--old-bindir=''  --new-bindir='' \
--old-options=' -c config_file=' \
--new-options=' -c config_file='

Note that it is important to execute this playbook with the - check option first to be sure that the update is possible. Also, our playbook temporarily replaces config files while upgrading the database. This playbook took 30 seconds to execute in our environment; it was a perfect result.

Running Patroni Cluster

To solve the second aforementioned problem, it is enough to take a look at the Patroni configuration file. The official Patroni repository contains an example config file that uses initdb to initialize a new database on the first start of Patroni. We already had the databases configured, so we just removed the init section from this config file.

When we began to deploy Patroni onto the existing PostgreSQL cluster, we faced a new issue: both Patroni servers ran as leaders. This happened because Patroni did not know anything about the state of the cluster it was deployed on and tried to treat these servers as members of separate clusters that shared the same name. To solve this issue, we deleted the directory containing the data on the slave (replica) instance:

rm -rf /var/lib/postgresql/

Warning: this operation should be performed only on the slave (replica) instance!

When connecting a clean slave to Patroni, Patroni does base backup on the leader, then rolls out this backup to the replica and finishes setting up the replica by using wal logs to completely synchronize the replica state with the leader.

Another problem we encountered was related to the cluster names: all PostgreSQL clusters are named main by default. This is OK if each cluster knows nothing about the others. But if you want to use Patroni, then all clusters should have unique names. The solution was to change the clusters' names in the PostgreSQL config files.

Conducting Performance Testing

We conducted a test that imitated real people working with whiteboards. When the workload achieved our average daily workload value, we repeated the previous test we had done in staging: we shut down an instance with the PostgreSQL leader.

The automatic failover feature worked as expected: Patroni changed the leader, then the Consul-template updated the PgBouncer config and commanded it to reload. The graphs in our Grafana demonstrated that there was a 20–30 seconds latency and a small number of errors related to the database connections.

This was a regular situation - these failover values are acceptable for our service and it is considerably better than having the service go down.

Promoting Patroni to Production

Finally, we came up with the following plan:

1. Deploy Consul-template to the PgBouncer servers, then run it.
2. Update PostgreSQL to 11.2.
3. Change PostgreSQL clusters' names.
4. Deploy Patroni cluster.

As a bonus, this plan allowed us to do the first step at practically any time. We could put one PgBouncer in maintenance mode to deploy and run Consul-template on it, while leaving another PgBouncer fully operational, and vice versa. So we did.

We used Ansible playbooks to do a quick deployment, as we already had tested these playbooks in the staging environment where the scenario ran from 1.5 to 2 minutes for each shard. We could deploy all required components to each shard without stopping our service. Unfortunately, this operation involved shutting down PostgreSQL for a few minutes, so it was considered unacceptable.

Our maintenance windows are scheduled to take place every three months. We completely shut down our service and update our database instances during these windows. So it was a great decision to deploy the new services during the next maintenance window, which was planned to take place in a week. Meanwhile, we decided to take some extra preparations just to make sure everything would go as planned:

• We deployed an extra slave for each PostgreSQL shard to have a copy of the latest data. That would help to keep the service operational if something went wrong.
• We added a new instance to each shard. That instance was intended to be used as a new replica in the Patroni cluster, eliminating the necessity for removing the data from the slave.

All these preparations helped us to minimize the chances of getting errors during Patroni deployment.

We restarted our service - all worked as expected, and people continued to use the service. However, the monitoring graphs demonstrated an anomalously high load on the Consul servers:

Why did we miss this issue while testing in the staging environment? This issue demonstrates quite well why it is necessary to follow the «Infrastructure-as-Code» paradigm, making sure the whole infrastructure, from the staging to production environments, complies with this paradigm. Otherwise, it is super easy to encounter the same issue in your own infrastructure. What happened?

Consul had been used in production for a long time, then it was introduced to the staging environment, resulting in the staging environment running a newer version of Consul than was used in production. The issue of high consumption of CPU resources by the Consul-template had been fixed in one of these newer Consul releases. So we just upgraded Consul in production to solve this issue.

Restart Patroni Cluster

However, we ran into a new problem we hadn't a clue about. When updating Consul, we simply removed a Consul node from the cluster by executing the consul leave command. Patroni connected to another Consul server, and our service maintained its operational status. But when we updated the last Consul instance in the cluster by issuing the same consul leave command, all Patroni clusters restarted with the following error in log files:

ERROR: get_cluster
Traceback (most recent call last):
...
ERROR: Error communicating with DCS
<b>LOG: database system is shut down</b>

The Patroni cluster was unable to fetch the info about itself and restarted.

To find a solution, we contacted the Patroni authors via GitHub issue. They suggested a few improvements to our config files:
consul:
consul.checks: []
bootstrap:
dcs:
retry_timeout: 8

We reproduced the issue in the staging environment and tested updated configs. Unfortunately, it didn't help.

As of now, this issue is still unresolved. We are going to try the following:

• Use Consul-agent on each instance of Patroni cluster.
• Eliminate the issue by fixing the source code.

We understand where the error is triggered: it is likely caused by the default timeout value that can't be overridden in the config file. The Consul cluster hangs for a time longer than one second when removing the last instance from the Consul cluster, rendering Patroni unable to get the cluster status. This all leads to the Patroni cluster reboot.

Summary

After the successful deployment of Patroni, we added an additional replica to each cluster. Now all clusters have something similar to a quorum: a single leader and a pair of replicas allow being secured if split-brain occurs when switching the instances' roles.

Patroni cluster has been working in production for over three months. And it has come in handy already. Not long ago, one of the leaders died in AWS, the automatic failover did its job, and customers continued their work. So, Patroni got its primary job done.

It is very convenient to change the cluster configuration. It is enough to change the configuration on a single instance for it to be replicated to all cluster members. Patroni will notify you if a reboot is required to apply a new configuration. Also, Patroni can restart the cluster with a single command, which is convenient.

The automatic failover feature is working and has already helped us.

Now it's possible to update PostgreSQL without service downtime. To do so, it is necessary to update the replicas to a newer version first, then change the leader in the Patroni cluster, and then update the former leader. In this scenario, automatic failover is tested as well.

Fortunately, no more errors have been encountered so far.