
For me, the most problematic part of academic papers on database system technologies is the performance evaluation section. It hardly needs saying that this part of the research is often the key one for a developer: it is what justifies spending time on the entire preceding text. Equally important, this section must ensure reproducibility and allow for independent analysis.
In fields like fluid dynamics and heat engineering, for instance, the theory of similarity has been developed. It allows recording experiment results using dimensionless quantities (e.g., Nusselt, Prandtl, Reynolds numbers) and, by repeating the experiment under slightly different conditions, enables meaningful comparisons.
I haven’t seen anything like that in the field of database systems. The testing section usually includes a brief description of the hardware/software setup and a set of graphs, where the main evaluated parameter is either execution time or TPS (transactions per second).
This approach seems to be the only viable one when comparing different DBMSs and deciding on which configuration to use in production. However, execution time depends on many factors, including server settings, caching effects, the chosen query plan, and the use of parallelism...
Now imagine we're creating a new query optimization method and want to compare it with one already published. Having only execution-time graphs labeled "before" and "after" with a brief description of the test platform — like here or here — we hit a wall: our figures don't align with the published ones due to the large number of unknowns. In this case, we need a measurable parameter that excludes the influence of other DBMS subsystems, is more portable, and is easier to analyze. I'm sure developers of, for example, a new storage engine would also appreciate removing the optimizer’s impact from their benchmarks.
Trying to reproduce experiments from papers or compare my method to the one proposed by an author, I keep running into the same problem: the uncertainty of the commonly used execution time metric is too high to draw solid conclusions — it more accurately reflects the efficiency of code in a specific runtime environment than the quality of the query plan itself.
Execution time is a noisy metric: even running the same benchmark ten times on the same machine and the same instance can yield significantly different results. For example, ten consecutive runs of all 113 queries from the Join Order Benchmark (JOB) on my desktop showed up to 50% variance in execution time, and that was under ideal, fully controlled conditions. What kind of variance will a third-party researcher get, and how are they supposed to analyze it?

How are we supposed to compare query plans executed with different numbers of parallel workers? On the test machine, parallelism gives a speed boost — but in production, where hundreds of backends compete, it may become a bottleneck. So maybe we need a better metric?
In the specific area of query optimization, execution time feels like too blunt a metric. For comparing different optimization approaches, or for evaluating the effect of a new transformation in the same PostgreSQL optimizer, a more focused characteristic works better.
Given that the core operations of a DBMS are data operations, it would be natural to choose a parameter like the number of operations performed on table rows during query execution, accounting for the number of attributes per row. Minimizing this parameter would indicate a more efficient query plan.
However, collecting such statistics is a very complex task. So let’s look for something slightly less precise but more easily retrievable. For instance, DBAs often use the number of pages read. A page here means a buffer cache page, which is also a table data block on disk.
It doesn’t make much sense to separately account for pages in RAM versus on disk — that gives us more information about eviction strategy and disk behavior than about the efficiency of the query plan. For our purpose, it’s sufficient to mechanically sum these values. We should also include temporary disk cache pages used by sorting, hashing, and other algorithms to store rows that didn’t fit in memory.
It’s important to note that the same page sometimes needs to be counted more than once. During sequential scans, we access the page once to read tuples. But during rescan operations (like the inner side of a nested loop join), we read the page again — and should count it again.
Postgres already provides the infrastructure to measure the number of pages read via the pg_stat_statements extension. I use the following approach: before executing each benchmark query, I run SELECT pg_stat_statements_reset(); after execution, I retrieve the statistics with:
SELECT
    -- pages found in the buffer cache plus pages read from disk,
    -- for both shared and local (temporary-table) buffers,
    -- plus temporary file blocks read by sorts, hashes, and so on
    shared_blks_hit + shared_blks_read + local_blks_hit + local_blks_read +
    temp_blks_read AS blocks,
    total_exec_time::integer AS exec_time
FROM pg_stat_statements
WHERE query NOT LIKE '%pg_stat_statements_reset%';
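For completeness, here is a minimal setup and measurement cycle sketched around that query; the configuration line and the queries/10a.sql path are assumptions to adapt to your own environment:

-- postgresql.conf (assumed minimal setting; requires a server restart):
--   shared_preload_libraries = 'pg_stat_statements'

-- once, in the benchmark database:
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

-- per-query cycle in psql:
SELECT pg_stat_statements_reset();  -- clear counters from previous runs
\i queries/10a.sql                  -- hypothetical path to a JOB query
-- ...then fetch blocks and exec_time with the query above.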
How good is this metric? In the same experiment shown earlier, all ten JOB test runs showed negligible variance in the number of pages read per query:

Just a few pages of difference. Arguably even this minimal variance should be studied, but it’s likely caused by internal operations like stats queries or parallel worker competition.
What does this metric give us? Let’s run a simple experiment. Take one query from the JOB test set (10a.sql) and gradually increase the number of workers involved. Watch how the execution time and the number of pages read change.
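Here is a sketch of how that sweep can be scripted in psql; max_parallel_workers_per_gather is the standard PostgreSQL parallelism knob, and the file path is, again, an assumption:

-- repeat for 0, 1, 2, 4, ... workers
SET max_parallel_workers_per_gather = 2;  -- change this value per run
SELECT pg_stat_statements_reset();
\i queries/10a.sql                        -- hypothetical path to JOB query 10a
-- ...then read blocks and exec_time from pg_stat_statements as shown earlier.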

We’ll see that while the execution time changes, the number of pages read remains largely the same. It only jumps once, when the number of workers increases from 1 to 2, and the pages read double. Looking at EXPLAIN, we see why: with 0 or 1 workers, three out of six joins are nested loops and three are hash joins. With 2 or more workers, one more nested loop appears and one fewer hash join, meaning the page read metric detects a plan change that execution time alone doesn’t make obvious.
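When the aggregate counter jumps like that, per-node buffer statistics help locate the cause. A sketch using standard EXPLAIN options (the query below is a stand-in; substitute the full text of 10a.sql):

-- shows shared/local/temp buffer hits and reads for every plan node,
-- together with the actual row counts
EXPLAIN (ANALYZE, BUFFERS, COSTS OFF)
SELECT count(*) FROM title;  -- stand-in query; use the real 10a.sql text here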
Now let’s use the pages read metric to measure the effect of PostgreSQL’s AQO extension on JOB queries.
Run each JOB query with AQO in learn mode ten times. In this mode, AQO stores cardinality estimates for each plan node at the end of execution and uses them during planning, allowing the optimizer to reject plans with overly optimistic estimates. Given PostgreSQL’s tendency to underestimate join cardinalities, this approach seems sound.
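A rough sketch of such a run, assuming AQO is installed as described in its documentation (the library name and the GUC below follow the extension's published interface, but double-check them for your version):

-- postgresql.conf (assumed; requires a server restart):
--   shared_preload_libraries = 'aqo, pg_stat_statements'

CREATE EXTENSION IF NOT EXISTS aqo;
SET aqo.mode = 'learn';             -- store observed cardinalities after each execution

-- repeat ten times for every JOB query, measuring pages read each time:
SELECT pg_stat_statements_reset();
\i queries/10a.sql                  -- hypothetical path; loop over all 113 queries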
The chart below (log scale) shows how the number of pages read changes compared to the first iteration (when the optimizer has no knowledge about node cardinalities or column distinct values). By the tenth iteration, nearly all queries either improve or remain stable — possibly because the optimizer already chose a good plan early on, or the method didn’t work here.

However, six queries show a page count increase compared to the first iteration. Maybe ten iterations weren’t enough to rule out suboptimal plans? Let’s increase the count to 30 and see what happens:

The chart shows that the plans converge to an optimal solution. Two queries (26b and 33b) converge to a higher page count than in the first iteration, yet their execution time improved by 15–20%. EXPLAIN shows why: one nested loop was replaced by a hash join. Since hash joins read entire tables, they increase page reads, but they are more efficient time-wise, especially in parallel mode.
So, page count isn’t a universal metric for plan efficiency. However, within a single DBMS, it provides a stable baseline, lets us repeat experiments across platforms, compare optimization strategies, and detect changes hidden by noisy execution time metrics. We shouldn’t stop publishing execution time in benchmarks — but maybe we should add pages read alongside?
With a page count chart, the benchmark script (see above), and a link to raw data, others can replicate the experiment, align it to the published results, run further studies, or compare with their own method. Isn’t that handy?
That’s it. The main goal of this post is to draw attention to the issues of reproducibility and objective analysis in DBMS research. Should we look for additional benchmark metrics? How useful is the pages read metric for this purpose? Can we adapt it to compare plans across different, but architecturally similar DBMSs? Could we normalize it by average tuple count per page?
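As a rough starting point for the last question, the planner's own statistics in pg_class are enough to estimate an average tuple count per page; a minimal sketch:

-- estimate the average number of tuples per page for ordinary tables,
-- which could serve as a normalization factor for the pages-read metric
SELECT relname,
       reltuples::numeric / NULLIF(relpages, 0) AS tuples_per_page
FROM pg_class
WHERE relkind = 'r'
  AND relpages > 0
ORDER BY relname;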