If you work with PostgreSQL, you've likely run into performance issues at some point — especially as your database grows. Things may have been running smoothly at first, but as your client database expanded, queries started slowing down. Sound familiar? Here's a guide to help you identify and fix problematic queries, so you can get your PostgreSQL database running at peak performance again.

Identifying problematic queries
First, let's figure out which queries might need optimization and how to identify them.
Long queries
Slow performance is often blamed on long-running queries. However, ‘long’ is a relative term: 30 seconds to load a website header is far too long, yet the same 30 seconds spent searching a massive library for books about elephants might be perfectly reasonable. Let's define what counts as a long query and start hunting them down:
Add the parameter
log_min_duration_statement = <time_in_ms>
to the postgresql.conf file. It's crucial to strike the right balance: the lower the threshold, the more queries we log and the more resources we spend tracking them.
Update the configuration with
SELECT pg_reload_conf();
or restart the server to apply the setting.
Test it with
log_min_duration_statement = 10000
(10 seconds) by running two simple queries:
SELECT pg_sleep(5);
SELECT pg_sleep(12);
The first query runs for less than 10 seconds, so nothing is logged. The second one crosses the threshold and is logged as follows:
[2263] LOG: duration: 12007.490 ms
rows: 1
size: 6 bytes
statement: SELECT pg_sleep(12);
This is the default behavior. For more detailed query information, logging can be configured further. See the documentation for a list of parameters.
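For instance, a few related postgresql.conf parameters from the stock documentation (the values here are illustrative, not recommendations):
log_line_prefix = '%m [%p] '   # prepend a timestamp and process ID to every log line
log_lock_waits = on            # log when a session waits on a lock longer than deadlock_timeout
log_statement = 'ddl'          # additionally log every DDL statement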
Suspicious queries
To track the execution statistics of all SQL statements, we can enable the pg_stat_statements
module:
Add
shared_preload_libraries = 'pg_stat_statements'
to postgresql.conf and restart the server, since this parameter only takes effect at startup.
Create the extension with
CREATE EXTENSION pg_stat_statements;
This will provide us with various parameters, the most informative being:
query — the executed query text;
calls — the number of times the query was called;
total_exec_time — total execution time of all calls;
mean_exec_time — average execution time;
stddev_exec_time — standard deviation in execution time;
rows — total number of rows returned or affected.
Queries with a high standard deviation can be considered suspicious: they execute very fast sometimes and very slow other times. Possible causes include:
Execution time depends on the number of rows: queries with fewer rows execute quickly, while those with many rows take longer. The latter should be optimized.
Execution time does not depend on the number of rows but varies between runs. There may be an error in the query, or something might be blocking its execution.
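For instance, to surface such queries we might sort by the deviation-to-mean ratio (a minimal sketch; column names are as in PostgreSQL 13 and later):
SELECT query, calls, mean_exec_time, stddev_exec_time
FROM pg_stat_statements
WHERE calls > 100  -- ignore rarely executed queries
ORDER BY stddev_exec_time / NULLIF(mean_exec_time, 0) DESC NULLS LAST
LIMIT 20;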
Excessive queries
We can sort our query statistics by total_exec_time to identify queries consuming significant resources despite being fast individually:
SELECT *
FROM pg_stat_statements
ORDER BY total_exec_time DESC;
This helps identify quick queries that cumulatively create a significant load. We may not be able to speed them up, but we should review them. Check if:
The query is used unnecessarily as a safeguard. For example, a developer might ping the database with SELECT 1 before or after every table change.
The query recomputes the same expression over rarely changing data. We can cache the result in a materialized view to avoid repeated calculations, as sketched below.
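A hypothetical sketch, reusing the expense table from the EXPLAIN example later in this article:
-- Cache an aggregate over rarely changing data
CREATE MATERIALIZED VIEW expense_totals AS
SELECT id, sum(value) AS total_value
FROM expense
GROUP BY id;
-- Re-run the calculation only when the underlying data changes
REFRESH MATERIALIZED VIEW expense_totals;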
Sorting queries by CPU usage can also help illustrate their impact on overall performance:
SELECT (100 * total_exec_time / sum(total_exec_time) OVER ()) AS cpu_perc
FROM pg_stat_statements
ORDER BY cpu_perc DESC;
For the curious
To expand our selection of suspicious queries, we can use the pgpro_stats extension. It stores query plans, collects statistics on wait events, resource usage, and cache invalidation, and lets you adjust the collection frequency. More details can be found in the documentation.
Real-time query monitoring
Now let's check what's running right now. Are queries blocked? Are they executing optimally?
What's running now?
The pg_stat_activity system view contains one row for each active backend:
SELECT pid, backend_type, state, query, wait_event_type, wait_event
FROM pg_stat_activity;
For idle backends, the query column shows the last statement they executed.
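For example, to list only the backends that are actually doing something, together with how long their current statement has been running (a sketch):
SELECT pid, now() - query_start AS duration, state, query
FROM pg_stat_activity
WHERE state <> 'idle'
ORDER BY duration DESC NULLS LAST;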

Is a query blocked?
To check blocked processes, we can use built-in IDE tools, third-party software, or write our own query using system views. We'll use pg_locks to see lock information and pg_stat_activity to monitor active processes:
SELECT * FROM pg_locks
LEFT JOIN pg_stat_activity
ON pg_locks.pid = pg_stat_activity.pid;
Interpreting the raw result of such a join can be quite challenging. A more refined query can show exactly which query is blocking which, and you don't even have to write it yourself: searching for keywords like ‘pg_locks monitoring’ will turn up ready-made examples.
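As a starting point, here is a minimal sketch built on the pg_blocking_pids() function (available since PostgreSQL 9.6):
SELECT blocked.pid    AS blocked_pid,
       blocked.query  AS blocked_query,
       blocking.pid   AS blocking_pid,
       blocking.query AS blocking_query
FROM pg_stat_activity AS blocked
JOIN pg_stat_activity AS blocking
  ON blocking.pid = ANY(pg_blocking_pids(blocked.pid));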

Is the query executing optimally?
The EXPLAIN command can help answer this question.
EXPLAIN displays the planner's tree: each node indicates the operation used at a particular stage, the cost of the operation, and the estimated number of rows expected at that node.
Important: EXPLAIN (without the ANALYZE flag) does not execute the query itself but only describes its execution based on the statistics available to the planner.
Let's consider the output of the EXPLAIN command using an example query that calculates the maximum profit among all sold goods:
EXPLAIN SELECT max(income.value - expense.value)
FROM income FULL JOIN expense ON income.id = expense.id
WHERE expense.value < income.value;
Query result:
[The original article shows the EXPLAIN output as a screenshot; the plan is analyzed below.]
Query plans can be collected not only manually but also automatically using the auto_explain module, built into the vanilla version of PostgreSQL. This module logs execution plans for slow queries:
auto_explain.log_min_duration (integer) — the number of milliseconds after which the plan is logged;
auto_explain.log_analyze (boolean) — whether to save EXPLAIN or EXPLAIN ANALYZE.
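A minimal postgresql.conf sketch (parameter names are from the auto_explain documentation; the threshold is illustrative):
shared_preload_libraries = 'auto_explain'  # the module can also be loaded per session with LOAD
auto_explain.log_min_duration = 1000       # log plans of queries running longer than 1 second
auto_explain.log_analyze = on              # log EXPLAIN ANALYZE output instead of plain EXPLAIN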
To better understand complex EXPLAIN results for intricate queries, you can use query plan visualizers, easily found by searching for ‘explain visualizer postgres’. The first result you encounter will likely be useful.
Let's redraw the query plan as a tree.
According to the planner, both tables consist of 500,000 rows, which will be scanned sequentially. The tables are joined using the Hash Join method.
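Schematically, the tree looks like this (an illustrative reconstruction from the discussion above; which table ends up on the hashed side is an assumption):
Aggregate
  ->  Hash Join
        Hash Cond: (income.id = expense.id)
        Join Filter: (expense.value < income.value)
        ->  Seq Scan on expense  (rows=500000)
        ->  Hash
              ->  Seq Scan on income  (rows=500000)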
To extract more information from the plan, we can use the ANALYZE flag. Unlike plain EXPLAIN, which only describes query execution, EXPLAIN ANALYZE actually runs the query. Keep in mind:
If you do not want to modify data but are only interested in execution details, wrap the query in a transaction and perform a ROLLBACK, as sketched below.
Plain EXPLAIN runs almost instantly and does not load the database, whereas EXPLAIN ANALYZE may consume time and system resources to execute the SQL query.
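For data-modifying statements, the pattern looks like this (a sketch with a hypothetical UPDATE):
BEGIN;
EXPLAIN ANALYZE
UPDATE expense SET value = value + 1;  -- the UPDATE actually executes here...
ROLLBACK;                              -- ...but all of its changes are discarded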
In addition to the estimated cost, rows, and width, actual execution data is added:
actual time — how long the operator spent in this plan node;
rows — the actual number of rows processed by the operator;
loops — the number of times the node was executed.
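Here is what a single node looks like with both estimates and actuals (the numbers are illustrative):
Seq Scan on income  (cost=0.00..8334.00 rows=500000 width=8) (actual time=0.015..42.011 rows=500000 loops=1)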
The planner has several methods for scanning and joining tables. To understand how the planner chooses the method in each case, let's examine the cost parameter in our plan.
Understanding cost
Cost is described by two numbers:
The startup cost: the cost of preparation before the first result row can be fetched, expressed in arbitrary units. For sequential scans this value is zero, as the first row is available immediately.
The total cost: the cost of the operation up to and including retrieval of the last row.
Higher-level nodes wait for results from lower-level ones, so the total cost accumulates at the root of the tree. In our example, lower nodes already have high costs due to sequential scans of large tables.
Would indexing the tables speed up the query? In our case, the planner continues to use sequential scanning instead of index scanning. To understand why, let's examine join methods.
Join methods complexity
Nested Loop requires no preparation (the first cost value is 0) and has quadratic complexity.
Merge Join preparation cost depends on the number of rows in the node, with linear complexity.
Hash Join is similar to Merge Join but has a higher constant factor and requires more preparation time.
[The original article illustrates this with a graph of cost versus estimated row count; N1 and N2 below are the crossover points of the three curves.]
If the estimated row count is below N1, Nested Loop is chosen; if it is above N2, Hash Join is used; otherwise, Merge Join is selected.
The choice of operation is mathematically justified, but selecting the optimal option isn't always guaranteed for two reasons:
The estimated number of rows is based on statistics, which may be outdated or differ from the actual data.
The actual cost of an operation can deviate from the theoretical one. For instance, with a Hash Join the cost function is linear, but if an inefficient hash function is used or there are many duplicate identifiers, the cost may grow to N².

Imagine the estimated complexity of each operation drawn with dashed lines and the actual complexity with solid lines (the original article shows this as a graph). The planner makes decisions based on the information available to it: if the estimated number of rows is less than N1, it selects the first method (purple in the original figure), and if it's greater, it picks the second method (blue).
However, it's possible that the actual complexity of the ‘blue’ method is higher, yet if the estimated row count reaches N1 or more, the planner will still choose it.

Forcing a different plan
Can we force the planner to choose another scanning method?
Each method known to the planner has a configuration parameter enable_<method>. By default, all parameters are enabled. Let's try disabling sequential scanning:
SET enable_seqscan = off;
This command does not fully disable the method but adds a cost of 10 billion units.

After disabling sequential scans, the planner was forced to use index scans (the original article shows the resulting plan as a screenshot). The join method also changed from Hash Join to Merge Join, and execution time dropped by a factor of 2.5!
However, use this trick sparingly:
The planner usually selects the best execution method.
Disabling an operation may speed up one query but slow down others.
The SET command affects the entire session, so reset parameters after the query or use SET LOCAL within a transaction, as sketched below.
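For example, to confine the change to a single transaction, reusing the query from earlier (a sketch):
BEGIN;
SET LOCAL enable_seqscan = off;  -- affects only this transaction
EXPLAIN ANALYZE
SELECT max(income.value - expense.value)
FROM income FULL JOIN expense ON income.id = expense.id
WHERE expense.value < income.value;
COMMIT;  -- the setting reverts automatically at transaction end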
Optimizing query execution
There are several extensions to optimize query execution:
sr_plan — allows saving a specific execution plan for reuse;
pg_hint_plan — enables execution plan control through hints in SQL comments;
AQO — uses machine learning to improve row count estimates.
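For instance, with pg_hint_plan a hint is embedded in a special comment in front of the statement (a sketch; the hint names are taken from the pg_hint_plan documentation, and the choice of hints here is illustrative):
/*+ MergeJoin(income expense) */
SELECT max(income.value - expense.value)
FROM income FULL JOIN expense ON income.id = expense.id
WHERE expense.value < income.value;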
We've learned how to analyze query plans, but what if a query has been running for a long time? Should we cancel it or wait for the result? There's no clear answer, and unfortunately, we forgot to save the plan. But there is a solution: the pg_query_state module allows you to view the output of EXPLAIN ANALYZE while the query is still running. We'll cover pg_query_state in detail in a separate article, which will coincide with the release of the module's new version. Stay tuned for updates on the Postgres Professional blog.
Assessing query progress
To check query progress, PostgreSQL provides dynamic views pg_stat_progress_* for commands such as ANALYZE, CREATE INDEX, VACUUM, CLUSTER, Base Backup, and COPY.
Running
SELECT * FROM pg_stat_progress_vacuum;
twice, a second apart, we can see changes in the processed-row counters and estimate the completion percentage (the original article shows the two result snapshots as screenshots).
We can see that the number of ‘dead’ rows has increased, and the relation ID where the operation was performed has changed. This means that the vacuum process has finished processing one table and moved on to the next. If we know the total number of ‘dead’ rows, we can estimate the percentage of the operation completed.
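Rather than counting dead rows, the view's heap block counters give a convenient progress percentage; a sketch (column names as of PostgreSQL 13–16):
SELECT relid::regclass AS table_name,
       phase,
       round(100.0 * heap_blks_scanned / NULLIF(heap_blks_total, 0), 1) AS pct_scanned
FROM pg_stat_progress_vacuum;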
For user processes, a similar tool is the pg_query_state module, which we've mentioned before.
Conclusion
Let's summarize our step-by-step plan to track and optimize long queries:
Identify potential queries for optimization.
Eliminate redundant queries.
Check for locks.
Study execution plans.
Create or remove indexes, tune parameters, and add extensions.
Monitor query progress.
Decide whether to continue or cancel the query.
We hope our experience in troubleshooting inefficient queries helps you too. Share your methods in the comments!