
Are redundant statistics slowing down your Postgres? Try sampling in pg_stat_statements


pg_stat_statements is the standard PostgreSQL extension used to track query statistics: number of executions, total and average execution time, number of returned rows, and other metrics. This information lets you analyze query behavior over time, identify problem areas, and make informed optimization decisions. However, on systems with high concurrency, pg_stat_statements itself can become a bottleneck and cause performance drops. In this article, we will look at the scenarios in which the extension becomes a source of problems, how sampling works, and when applying it can reduce overhead.

Issue

Let's briefly recall how pg_stat_statements works, so we can see when and why the extension may slow down performance and which of its internal mechanisms can become a bottleneck under high load.
The key data structure in pg_stat_statements is a hash table. Each bucket in it contains execution metrics for a specific query. The key for this table is formed based on four parameters:

  • queryid — unique identifier of the normalized query;

  • User OID;

  • Database OID;

  • toplevel flag indicating whether the query is top-level, meaning it's not nested within any internal function or subquery.

Using this key, pg_stat_statements finds or creates the corresponding entry in the hash table and updates the collected query metrics.

When a query arrives, pg_stat_statements performs the following sequence of operations:

  1. Search for the bucket: a shared LWLock is acquired on the hash table, which is then searched by key;

  2. Normalization (optional): if no matching bucket exists, the query is first normalized: literals are replaced with placeholders such as $1, $2, etc.;

  3. Creating a new bucket: the LWLock is upgraded to exclusive mode and a new bucket is created;

  4. Recording query information in the bucket: to update the query metrics, the bucket's SpinLock is acquired. The SpinLock and LWLock are then released.

With a large number of unique queries or high concurrency, these locking operations become a bottleneck in pg_stat_statements. Let's examine a scenario where every SQL query is unique from the extension's perspective. A machine with 48 CPUs can reproduce such a load. To make the queries unique, we create 1000 identical tables with different names:

init_script.sql

DO $$

DECLARE

    i INT;

BEGIN

    FOR i IN 1..1000 LOOP

        EXECUTE format('CREATE TABLE table_%s (id INT PRIMARY KEY, value TEXT);', i);

        EXECUTE format('INSERT INTO table_%s (id, value) VALUES (1, ''test'');', i);

    END LOOP;

END;

$$;

Then, using pgbench's built-in random number generator (pgbench_script.sql), we generate queries against these tables so that each query is different and lands in a new bucket of the pg_stat_statements hash table.

pgbench_script.sql

\set table1_id random(1, 1000)

\set table2_id random(1, 1000)

\set table3_id random(1, 1000)

 

SELECT t1.value AS value1, t2.value AS value2, t3.value AS value3

FROM table_:table1_id t1

JOIN table_:table2_id t2 ON t1.id = t2.id

JOIN table_:table3_id t3 ON t2.id = t3.id

WHERE t1.id = 1 AND t2.id = 1 AND t3.id = 1;

To see what causes the performance drop, we will query pg_stat_activity once a second while the benchmark is running. The result of each probe is written to the /tmp/waits file:

waits.sql

\o /tmp/waits

select 'OUT', COALESCE(wait_event, 'None') wait_event, COALESCE(wait_event_type, 'No wait') wait_event_type from pg_stat_activity where state = 'active';

\watch 1

After the benchmark completes, we group all wait types and count how many times each occurred:

cat /tmp/waits | grep OUT | awk '{print $2 "|" $3}' FS="|" | sort | uniq -c | sort -n -r -k1

Then we run the benchmark, compare system performance with pg_stat_statements enabled and disabled, and identify the reasons for the performance drop. We use the standard pgbench utility with the following settings:

  • Number of clients (-c): 48, matching the number of CPUs;

  • Threads (-j): 20, limiting contention at the OS level to avoid overloading the CPU with context switches;

  • Duration (-T): 120 seconds;

  • Script: run pgbench_script.sql (-f pgbench_script.sql);

  • Metrics: total number of delays and final TPS.

pgbench -c48 -j20 -T120 -f pgbench_script.sql --progress 10 | grep "tps = " 2>>$RESULTS >>$RESULTS &

Now let's combine all these actions into one script:

RESULTS="/tmp/results"

 

rm -rfv $RESULTS

nohup pgbench -c48 -j20 -T120 -f pgbench_script.sql --progress 10 | grep "tps = " 2>>$RESULTS >>$RESULTS &

timeout 125 psql -f waits.sql

echo " count | wait_event | wait_event_type" >>$RESULTS

echo "--------------------------------------" >>$RESULTS

cat /tmp/waits | grep OUT | awk '{print $2 "|" $3}' FS="|" | sort | uniq -c | sort -n -r -k1 >>$RESULTS

cat $RESULTS

rm -rfv /tmp/waits

We will get the following results:

# With pg_stat_statements off

tps = 237 437.104223 (without initial connection time)

 count | wait_event | wait_event_type

--------------------------------------

   2922  None            No wait

    918  ClientRead    Client

 

# With pg_stat_statements on

tps =  32 112.129029 (without initial connection time)

 count |     wait_event     | wait_event_type

--------------------------------------

   4703  pg_stat_statements      LWLock

    884  None                    No wait

    143  ClientRead              Client

As we can see, with a large number of unique queries, enabling pg_stat_statements can reduce performance dramatically: in this run TPS dropped more than sevenfold, from roughly 237,000 to 32,000. All of this is caused by frequent acquisition of the LWLock in exclusive mode.

Now let's look at another scenario: a large number of identical queries. Here we will need a more powerful machine with 192 CPUs. For the test, we again use the script that periodically checks pg_stat_activity, but this time we generate the load with the same query executed via pgbench -M prepared -S with 192 clients:

pgbench -c192 -j20 -T120 -M prepared -S --progress 10 | grep "tps = " 2>>$RESULTS >>$RESULTS &

When we run this benchmark...

RESULTS="/tmp/results"

 

rm -rfv $RESULTS

pgbench -i -s500

nohup pgbench -c192 -j20 -T120 -M prepared -S --progress 10 | grep "tps = " 2>>$RESULTS >>$RESULTS &

timeout 125 psql -f waits.sql

echo " count | wait_event | wait_event_type" >>$RESULTS

echo "--------------------------------------" >>$RESULTS

cat /tmp/waits | grep OUT | awk '{print $2}' FS="|" | sort | uniq -c | sort -n -r -k1 >>$RESULTS

cat $RESULTS

rm -rfv /tmp/waits

...we get the following results:

# Results with pg_stat_statements off

tps = 1 015 425.438193 (without initial connection time)

 count | wait_event | wait_event_type

--------------------------------------

  13201  None             No wait

   3482  ClientRead       Client

 

# Results with pg_stat_statements on

tps =   484 338.163894 (without initial connection time)

 count | wait_event | wait_event_type

--------------------------------------

 11 214  SpinDelay        Timeout

   9481  None             No wait

    930  ClientRead       Client

If we try to reproduce this on a machine with 48 CPUs (i.e., with 48 clients)...

pgbench -c48 -j20 -T120 -M prepared -S --progress 10 | grep "tps = " 2>>$RESULTS >>$RESULTS &

...the performance difference stays well within the bounds of statistical error.

# Results with pg_stat_statements off

tps = 625 335.965464 (without initial connection time)

 count | wait_event | wait_event_type

--------------------------------------

    979  ClientRead         Client

    927  None               No wait

 

# Results with pg_stat_statements on

tps = 611 708.477697 (without initial connection time)

 count | wait_event | wait_event_type

--------------------------------------

   1000  ClientRead         Client

    978  None               No wait

This shows that the impact of pg_stat_statements on workloads with identical queries becomes noticeable only at very high levels of parallelism. The main reason is contention for the same entry in the hash table, accompanied by frequent SpinLock acquisitions when updating the query metrics in that bucket. When many backends execute the same query simultaneously, they all try to update the same structure: incrementing call counters, execution time, and other metrics. This causes severe contention on the SpinLock, which under high load introduces delays and reduces TPS.

What is sampling?

Query sampling is a method of uniform filtering in which only a portion of all queries is included in the sample. In the context of pg_stat_statements, this means that metrics are recorded not for every executed query but only for some of them, each with equal probability. PostgreSQL already uses a similar approach elsewhere: log_transaction_sample_rate and log_statement_sample_rate reduce log volume, and auto_explain.sample_rate and pg_store_plans.sample_rate work the same way. Tantor Postgres 17.5 adds a corresponding setting to pg_stat_statements: the GUC parameter pg_stat_statements.sample_rate, whose value (from 0 to 1) determines what fraction of queries is tracked in the extension's statistics. A query is sampled if the following condition holds:

Sampling

is_query_sampled = pgss_sample_rate != 0.0 &&

                    (pgss_sample_rate == 1.0 ||

                    pg_prng_double(&pg_global_prng_state) <= pgss_sample_rate);

Since the number of queries is very large, this inequality filters out all but the specified fraction of queries. The sampling method has one significant drawback: not all queries are tracked in pg_stat_statements. This affects the completeness of the collected information, especially when debugging or analyzing rare but problematic queries. On the other hand, since only a portion of queries is tracked, the pressure on locks is reduced, and consequently overall system performance improves.

Limitations of sampling

If queries are sampled at the stage of adding a new bucket to the hash table, thereby relieving the LWLock, there is a risk of losing a valuable query that would have provided crucial information had it been tracked in pg_stat_statements. Moreover, even if sampling solved the performance problem, there would still be the issue of security and correctness of query storage. The point is that a bucket can be added to the hash table either before or after query execution, but the structure required for normalizing the query is created only before execution, at the parsing stage, when a structure to store the literals for normalization is built. If at this stage pg_stat_statements decides not to save the query due to sampling, but then (after execution) still tries to add it to the hash table, the query will be recorded in its original, non-normalized form. This can leak sensitive information into pg_stat_statements (for example, passwords or personal data in the literals of a SQL query). Therefore, sampling during query parsing is unacceptable: it can violate security requirements.
Nevertheless, both the PostgreSQL community and the developers at Tantor Labs are tackling the problem of a large number of unique queries in a different way: by merging similar queries under one queryid. This reduces the number of unique entries in the hash table and, accordingly, the frequency of its locking; the community has already merged several classes of similar queries under a single queryid this way.

In scenarios where the load on a SpinLock becomes the bottleneck, for example when the same hash-table entry is updated frequently, sampling can be quite effective. Since a SpinLock protects only a single bucket, acquiring it less often (by skipping some queries) reduces contention between backends and thus improves overall performance.

Sampling results

Let's revisit the scenario above on the 192-CPU machine. The same waits.sql script is used, with pg_stat_statements enabled. Now we run a benchmark to assess the impact of pg_stat_statements.sample_rate on performance and the nature of the waits. We loop over five sample_rate values: 1, 0.75, 0.5, 0.25, and 0, and for each value run a load test with pgbench:

benchmark.sh

CONNECTIONS=192

RESULTS="/tmp/results"

pgbench -i -s500

 

rm -rfv $RESULTS

for i in 1 .75 .5 .25 0

do

  psql -c "alter system set pg_stat_statements.sample_rate = ${i};" 2>/dev/null >/dev/null

  psql -c "select pg_reload_conf();" 2>/dev/null >/dev/null

  psql -c "show pg_stat_statements.sample_rate;" 2>/dev/null >/dev/null

  echo -e "\nsample_rate = $i" >>$RESULTS

  nohup pgbench -c $CONNECTIONS -j20 -T120 -S -Mprepared --progress 10 | grep "tps = " 2>>$RESULTS >>$RESULTS &

  timeout 125 psql -f /tmp/waits.sql

  echo " count | wait_event | wait_event_type" >>$RESULTS

  echo "--------------------------------------" >>$RESULTS

  cat /tmp/waits | grep OUT | awk '{print $2 "|" $3}' FS="|" | sort | uniq -c | sort -n -r -k1 >>$RESULTS

  rm -rfv /tmp/waits

done

cat $RESULTS

After running the benchmark with various pg_stat_statements.sample_rate values, we got the results in the table below. It shows how performance and the nature of the waits change as we vary the fraction of queries that hit the hash table:

 sample_rate |       tps | SpinDelay | NoWait | ClientRead
-------------+-----------+-----------+--------+-----------
        1.00 |   484 338 |    11 107 |  9 568 |        929
        0.75 |   909 547 |     4 781 | 12 079 |      2 100
        0.50 | 1 019 507 |       174 | 13 253 |      3 378
        0.25 | 1 019 507 |         - | 13 397 |      3 423
        0.00 | 1 015 425 |         - | 13 106 |      3 502

At sample_rate = 1.0, when metrics are collected for all queries, TPS is at its lowest and there is a huge number of SpinLock waits. Decreasing sample_rate to 0.75 already nearly doubles TPS and cuts SpinDelay waits by a factor of 2.3. At sample_rate = 0.25 and below, SpinDelay disappears entirely.
Thus, sampling effectively reduces the overhead of pg_stat_statements and significantly improves performance in scenarios with high contention on SpinLocks.

Conclusion

In pg_stat_statements, an LWLock is used when adding a new entry to the hash table, and with a large number of unique queries it can become a bottleneck. The PostgreSQL community is trying to solve this problem by reducing the number of new entries (namely, by merging similar queries under one queryid), and Tantor Postgres 17.5 adds configuration parameters to mask arrays and temporary tables: pg_stat_statements.mask_const_arrays and pg_stat_statements.mask_temp_tables. This helps cluster similar queries more precisely.

In turn, a SpinLock protects each individual bucket and becomes a source of contention when counters for identical queries are updated frequently, especially on machines with many CPUs. To address this, the pg_stat_statements.sample_rate parameter was added in Tantor Postgres SE 17.5: it reduces the load by sampling queries, eliminating the problem described in this article.

The developers at Tantor Labs have proposed merging the query sampling mechanism for pg_stat_statements into mainline PostgreSQL. There's already an active discussion on pgsql-hackers; if you have any thoughts or feedback, feel free to jump in!
