Extended statistics — what is it?
The extended statistics feature lets you tell PostgreSQL to collect additional statistics on a set of table columns. Why would you need that? Let me quickly explain using the public power plant dataset. For instance, the primary_fuel used by a power plant is implicitly tied to the country column. So, if you run a simple query like:
SELECT count(*) FROM power_plants
WHERE country = '<XXX>' AND primary_fuel = 'Solar';
You’ll see that Norway returns 0 rows, while Spain returns 243. That’s obvious to us — latitude, climate, etc. — but the database doesn’t know that. During query planning, PostgreSQL estimates the row count incorrectly: 93 for Norway and 253 for Spain. In a more complex query, where this estimate feeds into a JOIN or some other operator, the consequences can be unfortunate. Extended statistics calculate joint value distributions across column sets and help detect these dependencies.
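To see the misestimate yourself, a plain EXPLAIN ANALYZE of that filter is enough. A sketch, assuming the dataset stores ISO alpha-3 country codes (so 'NOR' stands for Norway):
EXPLAIN (ANALYZE, TIMING OFF, SUMMARY OFF)
SELECT count(*) FROM power_plants
WHERE country = 'NOR' AND primary_fuel = 'Solar';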
ORMs often make things worse. In this same power plant dataset, you might see filters on country and country_long. From their descriptions, it’s clear these two fields are directly related. But when an ORM generates a GROUP BY on both fields, the estimation fails dramatically:
EXPLAIN (ANALYZE, COSTS ON, TIMING OFF, BUFFERS OFF, SUMMARY OFF)
SELECT country, country_long FROM power_plants
GROUP BY country, country_long;
HashAggregate (rows=3494 width=16) (actual rows=167.00 loops=1)
  Group Key: country, country_long
  ->  Seq Scan on power_plants (rows=34936 width=16) (actual rows=34936.00 loops=1)
A human would never write that query, but in the age of AI and auto-generated queries, this is the kind of mess we need to deal with.
So what does extended statistics actually offer? It lets you define three types of stats on a column set: MCV (Most Common Values), ndistinct, and dependencies.
For scan filters, MCV is the most effective: if the value combo you’re filtering on is common in the table, the planner will estimate it accurately. If the combo is rare (like solar power in Norway), then PostgreSQL uses rough estimates like ntuples / ndistinct, but can improve them by excluding the values covered by the MCV list.
For estimating group counts (GROUP BY, DISTINCT, IncrementalSort, Memoize, Hash Join), the ndistinct stat on column combinations is very helpful.
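The GROUP BY misestimate shown earlier is exactly the case for an ndistinct entry on that column pair. A sketch with an explicit kinds list (the object name is just an example; if the list is omitted, PostgreSQL builds all applicable kinds):
CREATE STATISTICS pp_country_pair (ndistinct) ON country, country_long FROM power_plants;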
To see the effect of extended stats, let’s apply it to our example:
CREATE STATISTICS ON country,primary_fuel FROM power_plants;
ANALYZE;
Now the earlier query gives much more accurate cardinality estimates for filtering and grouping by these two fields. For Norway, it estimates 1 row; for Spain, 253. Feel free to try it with country = 'RUS' or 'AUT'. Sure, the table isn’t that big, but the tool clearly works.
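You can also peek at what was collected through the pg_stats_ext view (the exact contents will depend on your copy of the dataset):
SELECT statistics_name, attnames, kinds, n_distinct
FROM pg_stats_ext
WHERE tablename = 'power_plants';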
Yet I rarely see extended statistics used in the wild. Probably one reason is fear that ANALYZE will take too long. But more likely, it’s the complexity of knowing when and where to create the stats.
Finding the right stats definition
Are there any empirical rules for when and what kind of stats to create? I came up with two rules of thumb:
Rule 1 is based on index definitions. If a DBA has gone to the trouble of creating an index on a set of columns, it probably means queries will often filter on those columns. And query performance on those filters is critical — a good reason to improve plan quality.
Of course, not every multi-column filter suffers from bad estimates. That’s a drawback of this rule — we might end up creating useless stats. And if the expected use case is point lookups, a small estimation error (1 vs 5 rows) doesn’t matter much.
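Rule 1 is easy to approximate by hand with a catalog query. A rough sketch that lists multi-column, non-expression indexes as candidates:
SELECT indexrelid::regclass AS index_name,
       indrelid::regclass AS table_name,
       indnkeyatts AS key_columns
FROM pg_index
WHERE indnkeyatts > 1
  AND indexprs IS NULL -- skip expression indexes for simplicity
ORDER BY indnkeyatts DESC;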
Rule 2 is based on real-world filter patterns. Here, we pick candidate queries using two factors: (1) how much load the query puts on the DB (e.g. pages read), and (2) whether it contains multi-column scan filters. It would also be nice to only consider cases where the actual row count differs significantly from the estimate.
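A first cut at the candidate list could come straight from pg_stat_statements, ordered by pages read; spotting the multi-column filters and the estimation errors still has to happen on the query text or the finished plan:
SELECT queryid, calls,
       shared_blks_read + shared_blks_hit AS pages_touched,
       left(query, 80) AS query_preview
FROM pg_stat_statements
ORDER BY shared_blks_read DESC
LIMIT 20;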
This rule is more selective — it reduces the number of generated stats definitions. But it raises tricky questions:
When to create stats? With Rule 1, you create stats when you create the index. With Rule 2, you either use a timer to collect queries, or trigger it manually. The absence of a heavy bonus-calculating query for 29 days doesn’t mean we don’t want it fast on day 30. Even if its load is low, the accountant might not want to wait hours for it to run.
How to clean up old stats? With Rule 1, stats disappear when you drop the index. With Rule 2, it’s not so clear. A query might stop showing up because the sales season ended — but it could return next year. This adds uncertainty and potential instability to planner behavior.
How big should the estimate error be to justify new stats? ×2? ×10? ×100?
So, I decided to start with the simpler Rule 1 — and build the tech for Rule 2 later. The plan is to create a recommender that analyzes pg_stat_statements and finished query plans, and suggests stat definitions — with reasons for each suggestion.
Extension overview
The extension idea is simple (see the repo). First, we need a hook to collect created object IDs — I used object_access_hook. Then we need to find the right time to filter for composite indexes and add stats definitions to the database — ProcessUtility_hook worked well for this.
Because extended stats (the ndistinct and dependencies kinds) are calculated for all column combinations, the computational cost grows fast:
3 columns = 4 ndistinct stats and 9 dependencies
8 columns = 247 and 1016 respectively
For n columns, every subset of two or more columns gets its own entry, which works out to 2^n - n - 1 ndistinct items and n*2^(n-1) - n dependency items. No wonder PostgreSQL limits a statistics object to 8 columns.
To avoid overloading the DB, I added a columns_limit parameter (max columns per stat) and a stattypes parameter (which types to include).
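For example (the same GUCs appear in the experiment section below; the values here are only an illustration):
SET pg_index_stats.columns_limit = 3;
SET pg_index_stats.stattypes = 'mcv, ndistinct';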
When auto-stats are created, a dependency is registered not only on the table but also on the index used as a template. So if the index is dropped, the stats go too.
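Such dependencies live in pg_depend, so they can be inspected; a sketch that shows, per statistics object, which relations (the table and the template index) it depends on:
SELECT s.stxname, d.refobjid::regclass AS depends_on, d.deptype
FROM pg_depend d
JOIN pg_statistic_ext s ON s.oid = d.objid
WHERE d.classid = 'pg_statistic_ext'::regclass
  AND d.refclassid = 'pg_class'::regclass;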
Should the extension also own the stats (so DROP EXTENSION deletes them)? Not sure — the extension can work as a module without CREATE EXTENSION, affecting all DBs in the cluster.
To separate auto-generated stats from manual ones, the extension adds a comment to each stat object with the library and stat name.
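That makes the auto-generated objects easy to tell apart in the catalog; the exact comment text is extension-specific, so treat this as a sketch:
SELECT s.stxname, obj_description(s.oid, 'pg_statistic_ext') AS note
FROM pg_statistic_ext s
WHERE obj_description(s.oid, 'pg_statistic_ext') IS NOT NULL;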
There are also pg_index_stats_remove and pg_index_stats_rebuild functions for bulk cleanup and regeneration — useful if the schema existed before loading the module or the DB settings changed.
One tricky part is reducing redundant stats. With many indexes, we need to cut down the ANALYZE load. I introduced a deduplication procedure (see pg_index_stats.compactify).
Example: if you have an index on t(x1,x2) and create one on t(x2,x1), no new stats are needed. If you already have t(x1,x2,x3), then creating t(x2,x1) does require new MCV stats — but ndistinct and dependencies can be skipped.
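To watch the deduplication on a toy case (table and column names are made up; the module must be active so that each index triggers auto-stats), compare pg_statistic_ext after each step:
CREATE TABLE t (x1 int, x2 int, x3 int);
CREATE INDEX ON t (x1, x2);
CREATE INDEX ON t (x2, x1);     -- same column set, just reordered
CREATE INDEX ON t (x1, x2, x3); -- superset of the first two
SELECT stxname, stxkind FROM pg_statistic_ext WHERE stxrelid = 't'::regclass;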
Experiment
Theory is nice, but you need practice. I didn’t have a loaded PostgreSQL instance handy, so I found an old dump from a system with ~10k tables and 3× as many indexes. About 20k of those indexes had more than one column, and over 1k had five or more.
A good test subject — too bad there’s no useful workload to go with it. Running ANALYZE on this DB took 22 seconds. With the extension and a 5-column limit, it took 55 seconds.
Raw ANALYZE times by column limit and stat types:
Limit | MCV | MCV + NDISTINCT | MCV + NDISTINCT + DEPENDENCIES |
---|---|---|---|
2 | 21s | 27s | 28s |
4 | 23s | 37s | 67s |
5 | 24s | 51s | 118s |
6 | 25s | 68s | 196s |
8 | 30s | 137s | 574s |
Clearly, covering all column combinations gets expensive — especially dependencies. So either keep the limit at 3–5 columns, or go with Rule 2. Now I understand why SQL Server has a separate worker just for updating such stats — it is costly.
How about redundancy cleanup? Let’s run another test:
SET pg_index_stats.columns_limit = 5;
SET pg_index_stats.stattypes = 'mcv, ndistinct, dependencies';
SET pg_index_stats.compactify = 'off';
SELECT pg_index_stats_rebuild();
ANALYZE;
SET pg_index_stats.compactify = 'on';
SELECT pg_index_stats_rebuild();
ANALYZE;
To monitor stat objects, run:
-- Total number of stat items
SELECT sum(nelems) FROM (
  SELECT array_length(stxkind, 1) AS nelems
  FROM pg_statistic_ext
) AS s;

-- Total per stat type
SELECT elem, count(elem) FROM (
  SELECT unnest(stxkind) AS elem FROM pg_statistic_ext
) AS s
GROUP BY elem;
Results:
Metric | Without dedup | With dedup |
---|---|---|
ANALYZE time (s) | 141 | 123 |
Total stat elements | 74,353 | 61,409 |
MCV | 24,783 | 24,089 |
DISTINCT | 24,783 | 18,658 |
DEPENDENCIES | 24,783 | 18,658 |
EXPRESSIONS | 4 | 4 |
The gain isn’t huge — about 15% on time and a bit more on stats volume. But it’s still protection against corner cases. Interesting note: dedup reduced the number of MCV stats too, meaning many indexes differ only in column order. Also, expression stats showed up even though we never asked for them: PostgreSQL creates these automatically if the statistics definition includes expressions. Not a huge issue, but it would be nice to control this behavior.
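If you want to see where those came from, the objects carrying expression statistics are easy to list ('e' in stxkind marks them):
SELECT stxrelid::regclass AS table_name, stxname
FROM pg_statistic_ext
WHERE 'e' = ANY (stxkind);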
Comparing with joinsel
At Postgres Professional, we have another stats collector — joinsel. It’s not a direct competitor to extended statistics: it creates a composite type based on the index definition and uses the regular stats in pg_statistic.
Pros: includes MCV, histogram (so it can handle range filters), and works with PostgreSQL core mechanics.
Cons: no dependencies, and only one ndistinct value for the whole composite type (though this can be fixed).
Let’s test ANALYZE with joinsel:
SET enable_compound_index_stats = 'on';
SELECT pg_index_stats_remove();
\timing on
ANALYZE;
Time: 41248.977 ms (00:41.249)
ANALYZE time doubled, which is reasonable. And the best part — the complexity grows linearly with index column count.
Conclusion
The verdict on Rule 1: with careful limits, it’s a valid and usable approach.
Another takeaway: we need better extended statistics tools in PostgreSQL core — to fine-tune what gets generated.
As for the assistant and Rule 2? Left for better times. If some brave soul with free time and patience wants to take it on — hit me up, I’ll help you get started.