pg_stat_statements is the standard PostgreSQL extension for tracking query statistics: the number of executions, total and average execution time, number of returned rows, and other metrics. This information makes it possible to analyze query behavior over time, identify problem areas, and make informed optimization decisions. However, on systems with high contention, pg_stat_statements itself can become a bottleneck and cause performance drops. In this article, we will analyze the scenarios in which the extension becomes a source of problems, how sampling works, and the cases in which it can reduce overhead.
Issue
Let's briefly recall how pg_stat_statements is structured in order to understand when and why the extension may slow down performance, and which of its internal mechanisms can become a bottleneck under high load.
The key data structure in pg_stat_statements is a hash table. Each bucket in it contains the execution metrics of a specific query. The key for this table is built from four parameters:
queryid: the unique identifier of the normalized query;
the user OID;
the database OID;
the toplevel flag, indicating whether the query is top-level, i.e. not nested within any function or subquery.
Using this key, the hash table finds or creates the corresponding entry and updates the collected query metrics.
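To make the key structure concrete, here is a minimal Python model (purely illustrative; the real extension is C code built on PostgreSQL's dynahash): statistics live in a dictionary keyed by the four-tuple described above, so the same statement run by different users or in different databases lands in different entries.

```python
from collections import defaultdict

# Hypothetical model of the pg_stat_statements hash table:
# the key mirrors the four components described above.
stats = defaultdict(lambda: {"calls": 0, "total_time": 0.0})

def record(queryid, user_oid, db_oid, toplevel, elapsed_ms):
    # Find or create the entry for this key and update its metrics.
    entry = stats[(queryid, user_oid, db_oid, toplevel)]
    entry["calls"] += 1
    entry["total_time"] += elapsed_ms

# The same statement run by two different users creates two entries.
record(0xDEADBEEF, 10, 5, True, 1.2)
record(0xDEADBEEF, 10, 5, True, 0.8)
record(0xDEADBEEF, 16384, 5, True, 1.0)

print(len(stats))                                 # 2 distinct keys
print(stats[(0xDEADBEEF, 10, 5, True)]["calls"])  # 2
```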

Upon receiving a query, pg_stat_statements executes the following sequence of operations:
Searching for the bucket: a shared LWLock is acquired on the hash table, and the table is searched by key;
Normalization (optional): if no matching bucket exists, the query is first normalized: literals are replaced with placeholders such as $1, $2, and so on;
Creating a new bucket: the LWLock is upgraded to exclusive mode and a new bucket is created;
Recording query information in the bucket: to update the query metrics, that bucket's SpinLock is acquired; the SpinLock and the LWLock are then released.
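The normalization step can be illustrated with a simplified Python sketch. Note this is only a rough stand-in: the real extension records the locations of constants in the parse tree during parsing and rewrites the query text from those offsets, not with regular expressions.

```python
import re

def normalize(query: str) -> str:
    """Replace string and numeric literals with $1, $2, ... placeholders.

    Simplified illustration only: pg_stat_statements uses parse-tree
    constant locations, not regular expressions."""
    counter = 0
    def repl(_match):
        nonlocal counter
        counter += 1
        return f"${counter}"
    # Match single-quoted strings or bare numbers.
    return re.sub(r"'(?:[^']|'')*'|\b\d+(?:\.\d+)?\b", repl, query)

print(normalize("SELECT * FROM users WHERE id = 42 AND name = 'alice'"))
# SELECT * FROM users WHERE id = $1 AND name = $2
```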
With a large number of unique queries or high contention, these locking operations make pg_stat_statements a bottleneck. Let's examine this using a scenario in which every SQL query is unique from the perspective of pg_stat_statements. Such a load can be reproduced on a machine with 48 CPUs. To make the queries unique, we will create 1000 identical tables with different names:
init_script.sql
DO $$
DECLARE
i INT;
BEGIN
FOR i IN 1..1000 LOOP
EXECUTE format('CREATE TABLE table_%s (id INT PRIMARY KEY, value TEXT);', i);
EXECUTE format('INSERT INTO table_%s (id, value) VALUES (1, ''test'');', i);
END LOOP;
END;
$$;
Then, using pgbench's built-in random number generator (pgbench_script.sql), we will generate queries against these tables so that each query is distinct and lands in a new bucket of the pg_stat_statements hash table.
pgbench_script.sql
\set table1_id random(1, 1000)
\set table2_id random(1, 1000)
\set table3_id random(1, 1000)
SELECT t1.value AS value1, t2.value AS value2, t3.value AS value3
FROM table_:table1_id t1
JOIN table_:table2_id t2 ON t1.id = t2.id
JOIN table_:table3_id t3 ON t2.id = t3.id
WHERE t1.id = 1 AND t2.id = 1 AND t3.id = 1;
To see what causes the performance drop, we will query pg_stat_activity once a second while the benchmark is running. The result of each probe will be written to the /tmp/waits file:
waits.sql
\o /tmp/waits
select 'OUT', COALESCE(wait_event, 'None') wait_event, COALESCE(wait_event_type, 'No wait') wait_event_type from pg_stat_activity where state = 'active';
\watch 1
After the benchmark is completed, we group all types of delays and count how many times each of them occurred:
cat /tmp/waits | grep OUT | awk '{print $2 "|" $3}' FS="|" | sort | uniq -c | sort -n -r -k1
Then we run the benchmark, compare system performance with the pg_stat_statements extension enabled and disabled, and report the causes of the performance drop. For this we use the standard pgbench utility with the following settings:
Number of clients (-c): 48, matching the number of CPUs;
Threads (-j): 20, which limits contention at the OS level and avoids excessive context switching;
Duration (-T): 120 seconds;
Script: pgbench_script.sql (-f pgbench_script.sql);
Metrics: the counts of wait events and the final TPS.
pgbench -c48 -j20 -T120 -f pgbench_script.sql --progress 10 | grep "tps = " 2>>$RESULTS >>$RESULTS &
Now let's combine all these actions in one script:
RESULTS="/tmp/results"
rm -rfv $RESULTS
nohup pgbench -c48 -j20 -T120 -f pgbench_script.sql --progress 10 | grep "tps = " 2>>$RESULTS >>$RESULTS &
timeout 125 psql -f waits.sql
echo " count | wait_event | wait_event_type" >>$RESULTS
echo "--------------------------------------" >>$RESULTS
cat /tmp/waits | grep OUT | awk '{print $2 "|" $3}' FS="|" | sort | uniq -c | sort -n -r -k1 >>$RESULTS
cat $RESULTS
rm -rfv /tmp/waits
We will get the following results:
# With pg_stat_statements off
tps = 237437.104223 (without initial connection time)
count | wait_event | wait_event_type
--------------------------------------
2922 None No wait
918 ClientRead Client
# With pg_stat_statements on
tps = 32112.129029 (without initial connection time)
count | wait_event | wait_event_type
--------------------------------------
4703 pg_stat_statements LWLock
884 None No wait
143 ClientRead Client
As we can see, with a large number of unique queries, an enabled pg_stat_statements can reduce performance dramatically: in this run TPS dropped more than sevenfold. All of this is caused by frequent exclusive LWLock acquisition.
Now let's look at another scenario: a large number of identical queries. Here we will need a more powerful machine with 192 CPUs. For the test, we will again use the script that periodically checks pg_stat_activity, but this time we will create the load with a single query executed through pgbench -M prepared -S with 192 clients:
pgbench -c192 -j20 -T120 -M prepared -S --progress 10 | grep "tps = " 2>>$RESULTS >>$RESULTS &
When we run this benchmark...
RESULTS="/tmp/results"
rm -rfv $RESULTS
pgbench -i -s500
nohup pgbench -c192 -j20 -T120 -M prepared -S --progress 10 | grep "tps = " 2>>$RESULTS >>$RESULTS &
timeout 125 psql -f waits.sql
echo " count | wait_event | wait_event_type" >>$RESULTS
echo "--------------------------------------" >>$RESULTS
cat /tmp/waits | grep OUT | awk '{print $2}' FS="|" | sort | uniq -c | sort -n -r -k1 >>$RESULTS
cat $RESULTS
rm -rfv /tmp/waits
...we get the following results:
# Results with pg_stat_statements off
tps = 1015425.438193 (without initial connection time)
count | wait_event | wait_event_type
--------------------------------------
13201 None No wait
3482 ClientRead Client
# Results with pg_stat_statements on
tps = 484338.163894 (without initial connection time)
count | wait_event | wait_event_type
--------------------------------------
11214 SpinDelay Timeout
9481 None No wait
930 ClientRead Client
If we try to reproduce this on a machine with 48 CPUs (i.e., with 48 clients)...
pgbench -c48 -j20 -T120 -M prepared -S --progress 10 | grep "tps = " 2>>$RESULTS >>$RESULTS &
...the performance difference stays well within statistical error.
# Results with pg_stat_statements off
tps = 625335.965464 (without initial connection time)
count | wait_event | wait_event_type
--------------------------------------
979 ClientRead Client
927 None No wait
# Results with pg_stat_statements on
tps = 611708.477697 (without initial connection time)
count | wait_event | wait_event_type
--------------------------------------
1000 ClientRead Client
978 None No wait
This indicates that the impact of pg_stat_statements on a workload of identical queries becomes noticeable only at very high levels of parallelism. The main reason is contention for the same entry in the hash table, accompanied by frequent SpinLock acquisitions when the query metrics in that bucket are updated. When many backends execute the same query simultaneously, they all try to update the same structure, incrementing call counters, execution time, and other metrics. This leads to severe contention on the SpinLock, which under high load causes delays and reduces TPS.
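The effect can be modeled in miniature with Python threads (a rough analogy: PostgreSQL backends are processes and a SpinLock is much cheaper than a Python Lock, but the serialization point is the same): because every identical query maps to the same entry, every update must take the same lock.

```python
import threading

entry = {"calls": 0, "total_time": 0.0}
entry_lock = threading.Lock()   # plays the role of the bucket's SpinLock

def worker(iterations):
    for _ in range(iterations):
        # Every thread contends for the same lock, because every
        # identical query maps to the same hash-table entry.
        with entry_lock:
            entry["calls"] += 1
            entry["total_time"] += 0.01

threads = [threading.Thread(target=worker, args=(10_000,)) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(entry["calls"])   # 80000: correct, but fully serialized on one lock
```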
What is sampling?
Query sampling is a method of uniform filtering in which only a portion of all queries is included in the sample. In the context of pg_stat_statements, this means that metrics are recorded not for every executed query but only for some of them, each with equal probability. A similar approach is used elsewhere in PostgreSQL: log_transaction_sample_rate and log_statement_sample_rate reduce log volume, and auto_explain.sample_rate and pg_store_plans.sample_rate work the same way. In Tantor Postgres 17.5, a corresponding setting was added to pg_stat_statements: the GUC parameter pg_stat_statements.sample_rate, whose value (from 0 to 1) sets the fraction of queries tracked in the extension's statistics. A query is sampled if the following condition is met:
Sampling
is_query_sampled = pgss_sample_rate != 0.0 &&
(pgss_sample_rate == 1.0 ||
pg_prng_double(&pg_global_prng_state) <= pgss_sample_rate);
Since the number of queries is very large, this inequality filters in only the specified fraction of them. The sampling method has one significant drawback: not all queries are tracked in pg_stat_statements. This affects the completeness of the collected information, which matters especially when debugging or analyzing rare but problematic queries. On the other hand, since only a portion of the queries is tracked, the pressure on the locks is reduced and, consequently, overall system performance improves.
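The condition above translates almost verbatim into Python (a sketch: pg_prng_double draws from PostgreSQL's PRNG, modeled here with random.random). Note the short-circuiting: rate 0 never samples and rate 1 always samples without consuming a random number; otherwise a uniform draw decides with probability equal to the rate.

```python
import random

def is_query_sampled(sample_rate: float, rng=random.random) -> bool:
    # Mirrors the C condition: rate 0 never samples, rate 1 always does,
    # otherwise a uniform draw in [0, 1) decides with probability ~rate.
    return sample_rate != 0.0 and (sample_rate == 1.0 or rng() <= sample_rate)

random.seed(42)
n = 100_000
tracked = sum(is_query_sampled(0.25) for _ in range(n))
print(tracked / n)   # close to 0.25
```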
Limitations of sampling
If queries were sampled at the stage of adding a new bucket to the hash table, thereby relieving the LWLock, there would be a risk of losing a valuable query that would have provided crucial information had it been tracked in pg_stat_statements. Moreover, even if such sampling did take care of the performance problem, the issue of security and correctness of query storage would remain. The point is that a bucket can be added to the hash table either before or after query execution, but the structure required for normalizing the query is available only before execution, namely at the parsing stage, when a structure storing the locations of literals is created. If at this stage pg_stat_statements decides not to save the query due to sampling, but then (after execution) still tries to add it to the hash table, the query is recorded in its original (non-normalized) form. This may leak sensitive information into pg_stat_statements (for example, passwords or personal data in the literals of a SQL query). Therefore, sampling during query parsing is unacceptable: it can violate security requirements.
Nevertheless, both the PostgreSQL community and the developers at Tantor Labs are also attacking the problem of a large number of unique queries from another angle: by merging similar queries under one queryid. This reduces the number of unique entries in the hash table and, accordingly, the frequency of locking it. The community has already merged the following kinds of queries into a single queryid:
SET commands;
Identical queries with IN (...) lists of different lengths.
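The IN (...) case can be illustrated with a short Python sketch. The community implementation squashes the constant list in the parse tree before the queryid is computed; the regex below, operating on already-normalized text with an invented "IN ($1, ...)" rendering, only illustrates the idea.

```python
import re

def squash_in_lists(normalized_query: str) -> str:
    # Collapse IN ($1, $2, ...) lists of any length into a single
    # placeholder list so the queries share one fingerprint.
    # Illustration only: the real implementation works on the parse
    # tree, not on the query text.
    return re.sub(r"IN \(\$\d+(?:, \$\d+)*\)", "IN ($1, ...)",
                  normalized_query, flags=re.IGNORECASE)

q2 = "SELECT * FROM t WHERE id IN ($1, $2)"
q5 = "SELECT * FROM t WHERE id IN ($1, $2, $3, $4, $5)"
print(squash_in_lists(q2) == squash_in_lists(q5))   # True
```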
In scenarios where the load on a SpinLock becomes the bottleneck, for example when the same hash-table entry is updated frequently, sampling can be quite effective. Since a SpinLock protects only a single bucket, acquiring it less often (by skipping some queries) reduces contention between backends and thus improves overall performance.
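A back-of-the-envelope model (an assumed workload of identical queries, not a benchmark) shows why: skipped queries never touch the bucket, so the number of SpinLock acquisitions falls roughly in proportion to sample_rate.

```python
import random

def spinlock_acquisitions(sample_rate, n_queries=50_000, seed=1):
    # Count how many times the bucket's SpinLock would be taken for a
    # stream of identical queries at a given sampling rate (model only).
    rng = random.Random(seed)
    acquisitions = 0
    for _ in range(n_queries):
        sampled = sample_rate != 0.0 and (
            sample_rate == 1.0 or rng.random() <= sample_rate)
        if sampled:
            acquisitions += 1   # one SpinLock acquire per tracked query
    return acquisitions

full = spinlock_acquisitions(1.0)
half = spinlock_acquisitions(0.5)
print(full, half)   # 50000 acquisitions vs. roughly half that
```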
Sampling results
Let's revisit the scenario above on the powerful 192-CPU machine. The same waits.sql script is used for the test, with pg_stat_statements enabled. Now let's run a benchmark to assess the impact of pg_stat_statements.sample_rate on performance and on the nature of the waits. We loop over five sample_rate values: 1, 0.75, 0.5, 0.25, and 0. For each value, we run a load test using pgbench:
CONNECTIONS=192
RESULTS="/tmp/results"
pgbench -i -s500
rm -rfv $RESULTS
for i in 1 .75 .5 .25 0
do
psql -c "alter system set pg_stat_statements.sample_rate = ${i};" 2>/dev/null >/dev/null
psql -c "select pg_reload_conf();" 2>/dev/null >/dev/null
psql -c "show pg_stat_statements.sample_rate;" 2>/dev/null >/dev/null
echo -e "\nsample_rate = $i" >>$RESULTS
nohup pgbench -c $CONNECTIONS -j20 -T120 -S -Mprepared --progress 10 | grep "tps = " 2>>/tmp/results >>$RESULTS &
timeout 125 psql -f /tmp/waits.sql
echo " count | wait_event | wait_event_type" >>$RESULTS
echo "--------------------------------------" >>$RESULTS
cat /tmp/waits | grep OUT | awk '{print $2 "|" $3}' FS="|" | sort | uniq -c | sort -n -r -k1 >>$RESULTS
rm -rfv /tmp/waits
done
cat $RESULTS
After running the benchmark with various pg_stat_statements.sample_rate
values, we got the results in the table below. It shows how performance and the nature of waits change when we change the fraction of queries hitting the hash table:
sample_rate |     tps | SpinDelay | NoWait | ClientRead
       1.00 |  484338 |     11107 |   9568 |        929
       0.75 |  909547 |      4781 |  12079 |       2100
       0.50 | 1019507 |       174 |  13253 |       3378
       0.25 | 1019507 |         - |  13397 |       3423
       0.00 | 1015425 |         - |  13106 |       3502
At sample_rate = 1.0, when metrics are collected for every query, TPS is at its lowest and there is a huge number of SpinLock waits. As sample_rate decreases to 0.75, TPS rises sharply and SpinDelay drops by a factor of 2.3. At sample_rate = 0.25 and below, SpinDelay disappears entirely.
Thus, sampling effectively reduces the overhead of pg_stat_statements and significantly improves performance in scenarios with high contention on a SpinLock.
Conclusion
In pg_stat_statements, an LWLock is used when adding a new entry to the hash table, and with a large number of unique queries it can become a bottleneck. The PostgreSQL community is trying to solve this problem by reducing the number of new entries (namely, by merging similar queries under one queryid), and in Tantor Postgres 17.5 corresponding configuration parameters were added to mask arrays and temporary tables: pg_stat_statements.mask_const_arrays and pg_stat_statements.mask_temp_tables. This helps cluster similar queries more precisely.
In turn, a SpinLock protects each individual bucket and becomes a source of contention when the counters of identical queries are updated frequently, especially on machines with many CPUs. To solve this problem, the pg_stat_statements.sample_rate parameter was added in Tantor Postgres SE 17.5: it reduces the load by sampling queries, thereby eliminating the problem described in this article.
The developers at Tantor Labs have proposed merging the query sampling mechanism for pg_stat_statements into the main branch of PostgreSQL. An active discussion is already under way on pgsql-hackers; if you have thoughts or feedback, feel free to join in!