Optimisations for PostgreSQL serving Rails application / Habr

As Senior Software Engineer at company building messaging platform for healthcare industry I am responsible, including other duties, for performance of our application. We develop pretty standard web-service using Ruby on Rails application for business logic and API, React + Redux for users' facing single page application, as database we use PostgreSQL. Common reasons for performance problems in similar stacks are heavy queries to database and I would like to tell the story how we applied non-standard but fairly simple optimisations to improve performance.

Our business operates in US, so we have to be HIPAA compliant and follow certain security policies, security audit is something that we are always prepared for. To reduce risks and costs we rely on a special cloud provider to run our applications and databases, very similar to what Heroku does. On one hand it allows us to focus on building our platform but on the other hand it adds an additional limitation to our infrastructure. Talking shortly — we cannot scale up infinitely. As a successful startup we double number of users every few month and one day our monitoring told us that we were exceeding disk IO quota on the database server. Underlying AWS started throttling which was resulting in a significant performance degradation. Ruby application was not capable to serve all incoming traffic because Unicorn workers were spending too much time awaiting for database's response, customers were unhappy.

Standard solutions

In the beginning of the article I mentioned the phrase "non standard optimisations" because all low-hanging fruits were already picked:

we removed all N + 1 queries. Ruby gem Bullet was the primary tool
all needed indices on database were added, all not-needed were removed, thanks to pg_stat_statements
some queries with multiple joins were rewritten for better efficiency
we separated queries to fetch paginated collections from decoration queries. For example, initially we added counter of messages per dialog by joining tables but it was replaced with an additional query to augment results. The next query does Index Only Scan and really cheap:

SELECT COUNT(1), dialog_id 
FROM messages
WHERE dialog_id IN (1, 2, 3, 4)
GROUP BY dialog_id;

added a few caches. Actually, this did not work well because as a messaging application we have many updates

All these tricks did a great job for a couple of month until we faced again the same performance problem — more users, higher load. We were looking for something else.

Advanced solutions

We did not want to use heavy artillery and implement denormalisation and partitioning because these solutions require deep knowledge in databases, shift team's focus from implementing features to maintenance and by the end we wanted to avoid complexity in our application. Lastly, we used PostgreSQL 9.3 where partitions are based on triggers with all their costs. KISS principle in action.

Custom solutions

Compress data

We decided to focus on the main symptom — disk IO. As less data we store, as less IO capacity we need, this was the primary idea. We started looking for opportunities to compress data and the first candidates were columns like user_type provided with polymorphic associations by ActiveRecord. In the application we use modules a lot which lead us to have long strings like Module::SubModule::ModelName for polymorphic associations. What we did — convert all these columns' types from varchar to ENUM. The Rails migration looks like this:

class AddUserEnumType < ActiveRecord::Migration[5.2]
  disable_ddl_transaction!

  def up
    ActiveRecord::Base.connection.execute <<~SQL
      CREATE TYPE user_type_enum AS ENUM (
        'Module::Submodule::UserModel1',
        'Module::Submodule::UserModel2',
        'Module::Submodule::UserModel3'
      );
    SQL

    add_column :messages, :sender_type_temp, :user_type_enum

    Message
      .in_batches(of: 10000)
      .update_all('sender_type_temp = sender_type::user_type_enum')

    safety_assured do
      rename_column :messages, :sender_type, :sender_type_old
      rename_column :messages, :sender_type_temp, :sender_type
    end
  end
end

A few notes about this migration for people who not familiar with Rails:

disable_ddl_transaction! disables transactional migration. This is very risky to do but we wanted to avoid long transaction. Please be sure that you don't disable transactions on migration without a need for it.
On the first step we create a new ENUM data type on PostgreSQL. The best feature on ENUM is a small size, a really small comparing to varchar. ENUM has some difficulties with adding new values but usually we don't add new user types often.
add a new column sender_type_temp with the user_type_enum
fill values to the new column in_batches to avoid a long lock on the table messages
last step swaps old column with new. This is the most dangerous step because if the column sender_type got turned to sender_type_old but sender_type_temp had failed to become sender_type we would get a lot of troubles.
safety_assured comes from the gem strong_migration which helps to avoid mistakes on writing migrations. Renaming column is not a safe operation, so we had to confirm that we understand what we were doing. Actually, there is a safer but longer way including multiple deployments.

Needless to say that we run all similar migrations during lowest activity periods with proper testing.

We converted all polymorphic columns into ENUM, dropped old columns after a few days of monitoring and finally run VACUUM to decrease fragmentation. This saved us roughly 10% of total disk space but
some tables with a few columns were compressed twice! What was more important — some tables started to be cached in memory (remember, we cannot easily add more RAM) by PostgreSQL and this dramatically decreased the required disk IO.

Don't trust your service provider

Another thing was found in the article How a single PostgreSQL config change improved slow query performance by 50x — our PostgreSQL provider makes an automatic configuration for server based on requested volume of RAM, Disk and CPU but by whatever reason they left the parameter random_page_cost with the default value which is 4 optimised for HDD. They charge us to run databases on SSD but did not configure PostgreSQL properly. After contacting them we got way better execution plans:

EXPLAIN ANALYSE SELECT COUNT(*) AS count_dialog_id, dialog_id as dialog_id
                FROM messages
                WHERE sender_type = 'Module::Submodule::UserModel1'
                  AND sender_id = 1234
                GROUP BY dialog_id;

db=# SET random_page_cost = 4;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=76405.45..76411.92 rows=647 width=12) (actual time=2428.412..2428.420 rows=12 loops=1)
   Group Key: dialog_id
   ->  Bitmap Heap Scan on messages  (cost=605.90..76287.72 rows=23545 width=4) (actual time=82.442..2376.033 rows=79466 loops=1)
         Recheck Cond: ((sender_id = 1234) AND (sender_type = 'Module::Submodule::UserModel1'::user_type_enum))
         Heap Blocks: exact=23672
         ->  Bitmap Index Scan on index_messages_on_sender_id_and_sender_type_and_message_type  (cost=0.00..600.01 rows=23545 width=0) (actual time=76.067..76.068 rows=79466 loops=1)
               Index Cond: ((sender_id = 1234) AND (sender_type = 'Module::Submodule::UserModel1'::user_type_enum))
 Planning time: 3.849 ms
 Execution time: 2428.691 ms
(9 rows)

db=# SET random_page_cost = 1;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=21359.54..21366.01 rows=647 width=12) (actual time=97.905..97.913 rows=12 loops=1)
   Group Key: dialog_id
   ->  Index Scan using index_messages_on_sender_id_and_sender_type_and_message_type on messages  (cost=0.56..21241.81 rows=23545 width=4) (actual time=0.058..60.444 rows=79466 loops=1)
         Index Cond: ((sender_id = 1234) AND (sender_type = 'Module::Submodule::UserModel1'::user_type_enum))
 Planning time: 0.277 ms
 Execution time: 98.070 ms
(6 rows)

Move data away

We moved a huge table to another database. We have to keep audits of every change in the system by law and this requirement is implemented with gem PaperTrail. This library creates a table in the production database where all changes of objects under monitoring are saved. We use library multiverse to integrate another database instance to our Rails app. By the way — it is going to be a standard feature of Rails 6. There are some configurations:

Describe connection in the file config/database.yml

external_default: &external_default
  url: "<%= ENV['AUDIT_DATABASE_URL'] %>"

external_development:
  <<: *external_default

Base class for ActiveRecord models from another database:

class ExternalRecord < ActiveRecord::Base

  self.abstract_class = true
  establish_connection :"external_#{Rails.env}"

end

Model which implements PaperTrail versions:

class ExternalVersion < ExternalRecord

  include PaperTrail::VersionConcern

end

Use case in the model under audit:

class Message < ActiveRecord::Base

  has_paper_trail class_name: "ExternalVersion"

end

Summary

We finally added more RAM to our PostgreSQL instance and currently we consume only 10% of available disk IO. We survived up until this point because we applied a few tricks — compressed data in our production database, corrected configuration and moved not relevant data away. Probably, these won't help in your particular case but I hope this article could give some ideas about custom and simple optimisation. Of course, don't forget to go through the check list of standard problems listed in the beginning.

P.S.: Highly recommend a great DBA add-on for psql.