
Today I present results from the first step towards enabling temporary tables in PostgreSQL parallel query execution: benchmarking sequential writes and reads of temp buffers. I added functions to the PostgreSQL system catalog to measure buffer flush operations and ran a set of tests. For the planner, I estimated the cost coefficient of flushing a temporary table buffer page to disk relative to the existing DEFAULT_SEQ_PAGE_COST, as well as the cost of a simple scan of temp buffers (a dry-run). Measurements show that sequential writes are about 30% slower than reads, and based on these results I propose a cost formula. I also ran a limited “dry-run”-style estimate and measured write speed when table blocks are distributed randomly across memory pages.
Since the very beginning, temporary tables in PostgreSQL have remained parallel restricted. In my view, the reason is straightforward: temporary tables primarily cover the need for relational variables, and for both performance and code maintainability they should stay as simple as possible. Because PostgreSQL parallel workers are effectively separate backends, they do not have access to the leader process’s local state where temporary tables live. Supporting parallel operations on temporary tables would make this mechanism significantly more complex.
That said, at least two production-grade implementations of parallel operations on temporary tables already exist in enterprise PostgreSQL distributions (Postgres Pro and Tantor). Another argument “for” is the ongoing community discussion around logical replication of DDL, which puts the question of improving temp-table tooling back on the agenda: by analogy with SELECT, before replicating a command we need to detect temporary objects that can be buried arbitrarily deep inside UTILITY statements. So it may be time to move this topic forward in the PostgreSQL community.
After years of incremental work related to temporary objects (temporary types, indexes, functions, and so on), one fundamental issue remains: temp buffer pages are local to the leader process. If those pages are not flushed to disk, parallel workers cannot read them. If we flush them, we need a cost model so the planner can decide when it is worth doing. This might make sense once support for the SQL directive GLOBAL TEMP TABLE appears, but for now there is no indication that this is coming. However, there is something we can do already while preserving the locality of temporary table buffers.
A comment in the code (80558c1), written by Robert Haas in 2015, clarifies the current state of affairs:
```c
/*
 * Currently, parallel workers can't access the leader's temporary
 * tables. We could possibly relax this if we wrote all of its
 * local buffers at the start of the query and made no changes
 * thereafter (maybe we could allow hint bit changes), and if we
 * taught the workers to read them. Writing a large number of
 * temporary buffers could be expensive, though, and we don't have
 * the rest of the necessary infrastructure right now anyway. So
 * for now, bail out if we see a temporary table.
 */
```
The comment points to a possible way forward: if we flush the leader process’s temporary buffers to disk before starting parallel operations, workers would be able to safely scan such a table in parallel directly from disk. The question, however, is the cost — would this lead to performance degradation if the overhead of writing buffers outweighs the gains from parallelism?
It is easy to imagine a case where such functionality would be ineffective: many simple scan queries over temporary tables combined with a large temp_buffers setting and a continuous stream of DML (even if the percentage of affected rows is small) modifying one or more temporary tables. In this scenario, scanning every temporary buffer and writing it to disk before executing each query could degrade overall system performance.
On the path toward implementing parallel scans of temporary tables, the argument about the cost of flushing buffers appears fundamental and must be addressed first. We can tackle this problem by providing the optimizer with a proper cost model. In that case, it will be able to choose between a parallel scan with buffer-flush overhead and a sequential scan without parallel workers.
In effect, we are looking for a constant similar to DEFAULT_SEQ_PAGE_COST, but for writes. To that end, let us turn to real data and measure how much it actually costs to flush temporary buffers. My goal is to determine whether this overhead is a real barrier to parallel execution or whether the problem is somewhat overstated.
Benchmarking toolkit
At the moment, PostgreSQL does not provide direct access to local buffers for measurement or state manipulation. Therefore, I added several functions to pg_proc. The temp-buffers-sandbox branch, based on the current PostgreSQL master, contains all the modifications required for our tests.
The implementation consists of two key commits.
Commit #1. Statistics infrastructure
This commit introduces two new internal statistics that track the state of local buffers:
- `allocated_localbufs` tracks the total number of buffers allocated for temporary tables.
- `dirtied_localbufs` counts how many local buffers contain so-called DIRTY (not synchronized to disk) pages.
These statistics could potentially form the basis of a future cost model, giving the query optimizer visibility into the current state of temporary buffers when estimating the cost of flushing them.
Commit #2. UI functions
This commit adds SQL-callable functions that allow direct manipulation and inspection of local buffers:
- `pg_allocated_local_buffers()` returns the number of currently allocated local buffers.
- `pg_flush_local_buffers()` explicitly flushes all dirty pages to disk.
- `pg_read_temp_relation(relname, randomize)` sequentially (or in random order) reads all blocks of a temporary table into `temp_buffers`.
- `pg_temp_buffers_dirty(relname)` marks all pages of the given table as DIRTY.
These functions provide explicit, block-level measurement of flush and read operations, which is essential for developing accurate cost estimates.
Methodology
You can find the test scripts here.
Luckily, operations on local buffers are fairly simple: they do not take locks, do not write WAL, and avoid other expensive work. That removes concurrency questions from the experiment and simplifies the benchmark logic. To build a cost model we need to measure three things: sequential write speed, sequential read speed, and the overhead of scanning buffers when no I/O is required (the so-called dry-run).
The ratio between read and write speeds lets us derive a sequential page-write cost parameter from the existing DEFAULT_SEQ_PAGE_COST used in PostgreSQL core. The planner can then use this parameter to estimate the cost of flushing dirty local buffers before parallel operations.
The benchmark sequence is as follows:
1. Create a temporary table and fill it with data that fits into the local buffer pool (all pages will be dirty in memory).
2. Call `pg_flush_local_buffers()` to write all dirty buffers to disk. Measure the I/O.
3. Call `pg_flush_local_buffers()` again to measure scan overhead without a real flush (dry-write-run).
4. Evict the test table’s pages by creating a new auxiliary table that fills the entire buffer pool, then drop it.
5. Call `pg_read_temp_relation()` to read all test table blocks from disk into buffers. Measure the I/O.
6. Call `pg_read_temp_relation()` again to measure scan overhead without real reads (dry-read-run).
7. Evict the test table’s pages from `temp_buffers`.
8. Write the test table’s pages into `temp_buffers` at random addresses.
9. Call `pg_temp_buffers_dirty()` to mark the test table’s pages in `temp_buffers` as DIRTY.
10. Call `pg_flush_local_buffers()` to write all test table pages to disk.
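The ten steps above can be sketched as an ordered list of SQL statements. The helper function and the filler SQL are mine, not from the benchmark scripts; only the `pg_*` function names come from the sandbox branch, and I am assuming that step 8 (random page placement) is done via the `randomize` flag of `pg_read_temp_relation()`:

```python
# Hypothetical wrapper that assembles the benchmark sequence in order.
# Each statement would be timed with EXPLAIN (ANALYZE, BUFFERS) on the
# patched temp-buffers-sandbox branch; "..." filler is deliberately elided.

def benchmark_sequence(relname):
    evict = "CREATE TEMP TABLE evict AS SELECT ...; DROP TABLE evict"
    return [
        # 1. Create and fill the test table (fits into temp_buffers).
        f"CREATE TEMP TABLE {relname} AS SELECT ...",
        # 2. Flush all dirty local buffers to disk; measure the I/O.
        "SELECT pg_flush_local_buffers()",
        # 3. Flush again: dry-write-run, scan overhead only.
        "SELECT pg_flush_local_buffers()",
        # 4. Evict the test table's pages via an auxiliary table.
        evict,
        # 5. Read all blocks back from disk sequentially; measure the I/O.
        f"SELECT pg_read_temp_relation('{relname}', false)",
        # 6. Read again: dry-read-run, no real I/O.
        f"SELECT pg_read_temp_relation('{relname}', false)",
        # 7. Evict the test table's pages once more.
        evict,
        # 8. Re-read the table into random buffer addresses (assumption:
        #    this is what the randomize flag does).
        f"SELECT pg_read_temp_relation('{relname}', true)",
        # 9. Mark every page of the test table as DIRTY.
        f"SELECT pg_temp_buffers_dirty('{relname}')",
        # 10. Flush again: write cost under random page placement.
        "SELECT pg_flush_local_buffers()",
    ]
```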
Steps 7–10 model random access only very roughly; I use them here simply to get a feel for the difference between sequential and random disk access.
All measurements are performed using EXPLAIN (ANALYZE, BUFFERS), which records execution time and buffer I/O stats (local read, local write, blocks hit/read). Planning time is negligible (usually < 0.02 ms) and is excluded from analysis. While it is possible to avoid EXPLAIN entirely (and thus avoid the overhead of instrumentation structures), I assume this overhead is minimal and similar for both reads and writes. Using EXPLAIN provides a convenient way to verify execution time and confirm the actual number of affected blocks.
The tests cover buffer pool sizes in powers of two from 128 to 262,144 blocks (from 1 MB to 2 GB), with 30 iterations per size for statistical reliability. Each test allocates 110% of the number of blocks required to fit the table. The extra 10% is used for Free Space Map and Visibility Map metadata. Larger buffer counts likely trigger memory swapping and lead to unreliable results.
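The test-size grid described above can be written down explicitly. This is a sketch under the stated assumptions (8 KB pages, power-of-two pool sizes, 110% allocation for FSM/VM overhead); the names are mine, not from the benchmark scripts:

```python
# Test-size grid: pool sizes in powers of two from 128 to 262,144 blocks
# of 8 KB (1 MB to 2 GB), each test allocating 110% of the table's block
# count so Free Space Map and Visibility Map pages also fit.

BLOCK_SIZE_KB = 8  # PostgreSQL default page size

def test_grid(min_blocks=128, max_blocks=262_144):
    """Yield (nblocks, pool size in MB, temp_buffers blocks to allocate)."""
    nblocks = min_blocks
    while nblocks <= max_blocks:
        pool_mb = nblocks * BLOCK_SIZE_KB // 1024
        alloc = int(nblocks * 1.10)  # extra ~10% for FSM/VM metadata
        yield nblocks, pool_mb, alloc
        nblocks *= 2

for nblocks, pool_mb, alloc in test_grid():
    print(nblocks, pool_mb, alloc)
```

With 30 iterations per size, this grid of 12 sizes yields the 360 measurements reported in the conclusion.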
Benchmark results
Raw test results are available here. On my laptop, the most stable performance is observed in the 4–512 MB range:
| nblocks | bufsize | Write (ms) | Dry-run (write) (ms) | Read (ms) | Dry-run (read) (ms) |
|---|---|---|---|---|---|
| 512 | 4 MB | 0.54 | 0.002 | 0.58 | 0.016 |
| 1,024 | 8 MB | 1.07 | 0.003 | 1.13 | 0.028 |
| 2,048 | 16 MB | 3.02 | 0.004 | 2.42 | 0.054 |
| 4,096 | 32 MB | 6.36 | 0.007 | 4.81 | 0.107 |
| 8,192 | 64 MB | 12.34 | 0.013 | 9.79 | 0.210 |
| 16,384 | 128 MB | 24.63 | 0.026 | 19.35 | 0.421 |
| 32,768 | 256 MB | 49.60 | 0.051 | 38.72 | 0.838 |
| 65,536 | 512 MB | 98.93 | 0.102 | 77.46 | 1.681 |
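The write/read ratio can be checked directly from the stable rows of the table above (millisecond values copied verbatim from my measurements):

```python
# Write/read ratios for the larger stable test sizes.
rows = {
    "64 MB":  (12.34,  9.79),
    "128 MB": (24.63, 19.35),
    "256 MB": (49.60, 38.72),
    "512 MB": (98.93, 77.46),
}

for size, (write_ms, read_ms) in rows.items():
    print(size, round(write_ms / read_ms, 2))
# → ratios between 1.26 and 1.28, i.e. writes roughly 30% slower
```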
I have relatively limited RAM, and larger buffer sizes show qualitatively higher write overhead and much higher variability:
| nblocks | bufsize | Write (ms) | Dry-run (write) (ms) | Read (ms) | Dry-run (read) (ms) |
|---|---|---|---|---|---|
| 131,072 | 1 GB | 283.15 | 0.204 | 180.06 | 3.353 |
| 262,144 | 2 GB | 728.18 | 0.413 | 373.46 | 6.725 |
Dry-run scanning without I/O is minimal: 0.002–0.240 ms.
Based on the experiment, we can conclude that the cost of writing a temporary table page should be close to the sequential page read cost. More precisely, I would recommend the following formula:
DEFAULT_WRITE_TEMP_PAGE_COST = 1.20 × DEFAULT_SEQ_PAGE_COST
The limited random access modelling shows a 10–24% slowdown compared to sequential mode. To be safe, I would recommend applying a correction factor and using 1.3 instead of 1.2.
The low write cost is explained by the nature of temporary tables: no locks and no WAL writes. Some questions remain, though. I’m not sure what storage type the current default seq_page_cost is intended for; my measurements were made on an NVMe SSD. Would the read/write ratio be different on HDD? Does it matter to explore different storage types, or is one universal value good enough for most cases? Also, I modelled random writes in a very limited way—would general random write have a different cost?
The tests also show that we can account for the size of temp_buffers: dry-run buffer scanning overhead is about 1% of the write cost. Therefore, a full formula for pre-flushing temp buffers could look like this (assuming DEFAULT_SEQ_PAGE_COST = 1):
flush_cost = 1.3 × dirtied_localbufs + 0.01 × allocated_localbufs
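As a worked example of this formula, suppose 10,000 of 16,384 allocated local buffers are dirty before a parallel scan (both counts are hypothetical, chosen only for illustration):

```python
# Proposed pre-flush cost model, in units of DEFAULT_SEQ_PAGE_COST = 1:
# 1.3 per dirty page written, plus 0.01 per allocated buffer scanned.

def flush_cost(dirtied_localbufs, allocated_localbufs):
    return 1.3 * dirtied_localbufs + 0.01 * allocated_localbufs

# Hypothetical state: 10,000 dirty pages out of 16,384 allocated buffers.
print(round(flush_cost(10_000, 16_384), 2))  # → 13163.84
```

The planner would then compare this flush cost plus the parallel scan cost against a plain non-parallel sequential scan.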
What’s next?
With a cost model in place, the next steps are fairly obvious—though not trivial:
1. Add a planner flag that signals the presence of temporary objects in a plan subtree. Existing flags like `parallel_safe` and `consider_parallel` are good candidates for extending with this (and related) logic.
2. Add executor support so a worker can access the leader process’s temporary table storage on disk. Most of this work is removing asserts and error messages ;).
3. Add a temp-buffer flush operation to disk, most likely in `Gather`/`GatherMerge`, right before launching parallel workers.
4. Add the cost model so the planner can decide whether to allow parallel scans of temporary tables. There will also be a question about `temp_buffers` sizing for workers: should we split the maximum size across all workers, allow each worker to allocate `temp_buffers` blocks of its own, or something else?
That’s basically everything needed to extend PostgreSQL parallelism to temporary tables.
Conclusion
- 360 measurements were performed, with 30 repetitions for each temporary buffer size. Medium datasets (16–512 MB) show a coefficient of variation consistently below 6%, indicating stable results. Large `temp_buffers` sizes (1–2 GB) show much higher variability (CV > 150% for writes), likely caused by swapping or other OS effects, so treat those results carefully.
- Sequential writes into local buffers are about 30% slower than sequential reads.
- For optimisation purposes, the default write cost can be defined as `1.3 * DEFAULT_SEQ_PAGE_COST`.
THE END.
Spain, Madrid, 02 January 2026.