Postgres Professional (developer of the Postgres Pro DBMS)

SQL Server vs. PostgreSQL query optimization: room for improvement?


For years, we’ve studied Oracle to make PostgreSQL a more migration-friendly option. We introduced tools similar to Oracle's SQL Profile and SQL Plan Baseline as the AQO and sr_plan extensions. In some cases, PostgreSQL even outperforms Oracle, especially in automatic re-optimization.

Migrations from Oracle to PostgreSQL are usually smooth performance-wise, and we’ve even developed session variable extensions to make the transition easier. While many enterprise-only features exist, PostgreSQL often integrates popular solutions directly into the core.

But with SQL Server to PostgreSQL migrations, things get trickier — users report major query slowdowns. Problematic queries vary, with databases ranging from GBs to tens of TBs. In one case, PostgreSQL's query plan was so inefficient that it ran for weeks without finishing, whereas SQL Server completed it in 20ms! This raised questions about SQL Server’s technological edge, making it worth a deep dive.

Temp tables & parallel execution

During a SQL Server to PostgreSQL migration, we noticed a 7-9x slowdown on a seemingly simple JOIN with GROUP BY on small tables. The query looked like this:

SELECT sum(t1.x * t2.count) 
FROM t1, t2 
WHERE t1.x3 = t2.x3 AND t1.x4 = t2.x4 
GROUP BY t1.x1, t1.x2, t1.x3, t1.x4;
PostgreSQL execution plan:

HashAggregate  (actual time=4000s, rows=2.1E3)
  Group Key: t1.x1, t1.x2, t1.x3, t1.x4
  ->  Nested Loop  (actual time=500s, rows=1.5E9)
        ->  Seq Scan on t1 (actual time=0.3s, rows=240000)
        ->  Memoize  (actual time=0.003..0.260s, rows=6100, loops=240000)
              Cache Key: t1.x1, t1.x2, t1.x3, t1.x4
              Cache Mode: logical
              Hits: 200000  Misses: 21000
              ->  Index Scan using t2_idx on t2 (actual time=0.3E-4s, rows=11)
                    Index Cond: t2.x1 = t1.x1 AND t2.x2 = t1.x2
                    Filter: t2.x3 = t1.x3 AND t2.x4 = t1.x4
                    Rows Removed by Filter: 0

Here, t1 and t2 are temp tables with ~200K rows each. The JOIN explodes into 1.5B rows, taking 500s, but the GROUP BY alone takes an hour! Meanwhile, SQL Server runs the same query in 300s.

To understand the cause, let's take a look at the query execution plan in SQL Server:

HashAggregate (parallel 8 streams)
  Hash Join
    Index Scan t1
    Index Scan t2

SQL Server does two things better:

  1. Uses Hash Join instead of Nested Loop (but this only reduces runtime from 500s to 300s)

  2. Parallelizes execution across 8 threads, significantly boosting performance

To understand the importance of these points, let's take a look at the flamegraph:

Flamegraph built from profiling the backend that executed the query

Profiling shows PostgreSQL spends excessive time on hash calculations (hashtext()) and tuple comparisons (texteq()). 

The query groups a billion incoming tuples, produced by the JOIN, into 21,000 groups. That means each group contains around 70,000 tuples — a significant number of duplicates! The second piece of the puzzle lies in the column types — all four columns are of text type.

Let's check the statistics for these columns:

SELECT a.attname, s.stadistinct, s.stanullfrac, s.stawidth
FROM pg_statistic s, pg_attribute a
WHERE s.starelid = 16395 AND s.starelid = a.attrelid
  AND s.staattnum = a.attnum
  AND a.attname IN ('x1', 'x2', 'x3', 'x4');

 attname | stadistinct |  stanullfrac  | stawidth
---------+-------------+---------------+----------
 x1      |           7 |             0 |       72
 x2      |        3574 |             0 |       72
 x3      |           6 | 0.00033333333 |       72
 x4      |           3 |             0 |       50

Since the number of unique values in x1 is so low, almost every comparison on x1 requires a second check on x2. On top of that, each column stores fairly long strings (the widths above sum to 266 bytes), so the executor processes roughly 300 bytes of key data per tuple just to hash or deduplicate it. Hash aggregation has to perform at least two operations per tuple: hash generation and, on a bucket match, string comparison. With a billion tuples, that’s a massive workload and likely the reason for the long execution time. Given no obvious way to speed up the grouping itself, my main guess for SQL Server’s impressive performance is its use of parallelism.
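The per-tuple work described above can be modeled in a few lines (a toy Python sketch of hash aggregation, not PostgreSQL executor code; the dict's key hashing and equality checks stand in for hashtext() and texteq()):

```python
def hash_aggregate(tuples):
    # HashAggregate model: one hash of the wide text key per tuple,
    # plus a key comparison whenever the bucket already exists
    groups = {}
    for key, amount in tuples:
        groups[key] = groups.get(key, 0) + amount
    return groups

# The real query pushes ~70,000 duplicates into each of ~21,000 groups;
# here is a miniature version with two 72-byte text keys:
rows = [(("a" * 72, "b" * 72), 1)] * 4 + [(("c" * 72, "d" * 72), 1)] * 2
result = hash_aggregate(rows)
```

Every tuple pays the hashing cost regardless of how many groups exist, which is why the duplicate-heavy input dominates the runtime.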

Parallelism

According to the docs, SQL Server leverages multithreading to speed up grouping. While parallel execution for arbitrary (especially custom) aggregates can be tricky, hash-based grouping is relatively easy to parallelize. In this case, where most rows are already in memory, parallel hashing seems like an ideal solution.

PostgreSQL, on the other hand, relies on parallel workers, but since they are implemented as separate processes, they come with extra overhead. By default, Postgres plays it safe, using only 2-3 worker processes per query, ensuring fair CPU time distribution across other backends. Plus, parallelization applies to entire query subtrees, not just individual plan nodes, and always includes scan operators. Another limitation: temporary tables aren't visible to parallel workers, making them unusable in vanilla PostgreSQL.

To test the impact of parallel workers, we can convert the temp tables to regular ones and force a higher number of parallel workers. We also need to reduce parallel execution costs and tweak GUC settings like min_parallel_table_scan_size and max_parallel_workers. The final config looks like this:

SET max_parallel_workers = 32; -- set before the instance launch
SET max_parallel_workers_per_gather = 16;
SET parallel_setup_cost = 0.001;
SET parallel_tuple_cost = 0.0001;
SET min_parallel_table_scan_size = 0;

After running the test on PostgreSQL with the same hardware as SQL Server, we finally get comparable performance:

Finalize HashAggregate (actual time=416s)
  Group Key: t1.x1, t1.x2, t1.x3, t1.x4
  ->  Gather (actual time=416s)
        Workers Launched: 9
        ->  Partial HashAggregate (actual time=416s)
              Group Key: t1.x1, t1.x2, t1.x3, t1.x4
              ->  Nested Loop (actual time=68s)
                    ->  Parallel Seq Scan on t1 (actual time=0.08s)
                    ->  Index Scan using t2_idx on t2 (actual time=0.04s)
                          Index Cond: t2.x1 = t1.x1 AND t2.x2 = t1.x2
                          Filter: t2.x3 = t1.x3 AND t2.x4 = t1.x4
                          Rows Removed by Filter: 0

Execution Time: 416.5s
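The Finalize/Partial HashAggregate structure in this plan can be sketched in miniature (a simplified single-process Python model; splitting the input into chunks stands in for the Parallel Seq Scan handing each worker its share):

```python
def partial_aggregate(chunk):
    # Partial HashAggregate: each worker groups only its share of the input
    part = {}
    for key, value in chunk:
        part[key] = part.get(key, 0) + value
    return part

def finalize_aggregate(partials):
    # Finalize HashAggregate: the leader merges per-worker partial results
    total = {}
    for part in partials:
        for key, value in part.items():
            total[key] = total.get(key, 0) + value
    return total

rows = [(i % 4, 1) for i in range(1000)]   # 4 groups, heavy duplication
chunks = [rows[i::8] for i in range(8)]    # 8 workers' shares of the scan
result = finalize_aggregate(partial_aggregate(c) for c in chunks)
```

Because hashing commutes with partitioning, each worker can group its chunk independently and the merge step only touches one row per (worker, group) pair, which is why hash-based grouping parallelizes so well here.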

Now, PostgreSQL is on par with SQL Server! 🚀

  1. Parallel workers significantly improve performance, but PostgreSQL's heavyweight process model is a limitation.

  2. Temp tables block parallel execution, making optimization harder.

  3. SQL Server’s thread-based approach is more flexible.

A multi-threaded hash aggregation function in PostgreSQL could be a game-changer — maybe a solid GSoC project for next year? 

Multi-clause expression selectivity estimation

This case deals with a common problem: JOINs (or filters) with multiple conditions. For example, the following expression with four conditions:

t1.x1 = t2.x1 AND t1.x2 = t2.x2 AND t1.x3 = t2.x3 AND t1.x4 = t2.x4

PostgreSQL estimates the selectivity of the whole expression by evaluating the selectivity s_i of each sub-expression (clause) and multiplying them together, treating the clauses as independent:

s = s1 * s2 * s3 * s4

Using these partial estimates, the planner calculates the total number of rows produced by the JOIN:

rows = rows(t1) * rows(t2) * s1 * s2 * s3 * s4

In most cases, this formula underestimates the number of rows produced. It's rare to find data in adjacent columns that is uniformly distributed "vertically" (within the column) and independently "horizontally" (between columns). In other words, while PostgreSQL planner gives a good estimate for a single JOIN condition, it underestimates when the query involves multiple conditions.

To break this down, let's look at a simple example.

CREATE TABLE t1 (x1 integer, x2 integer, x3 integer, x4 integer);
CREATE TABLE t2 (x1 integer, x2 integer, x3 integer, x4 integer);
INSERT INTO t1 (x1, x2, x3, x4) SELECT value%2, value%10, value%20, value%100 FROM generate_series(1, 1000) AS value;
INSERT INTO t2 (x1, x2, x3, x4) SELECT value%2, value%10, value%20, value%100 FROM generate_series(1, 1000) AS value;
CREATE INDEX t1_idx ON t1 (x1, x2, x3, x4);
CREATE INDEX t2_idx ON t2 (x1, x2, x3, x4);
VACUUM ANALYZE t1, t2;
EXPLAIN (ANALYZE, COSTS ON, BUFFERS OFF)
SELECT * FROM t1 WHERE x1=1 AND x2=1 AND x3=1 AND x4=1;
EXPLAIN (ANALYZE, COSTS ON, BUFFERS OFF)
SELECT * FROM t1 JOIN t2 ON (t1.x1=t2.x1 AND t1.x2=t2.x2 AND t1.x3=t2.x3 AND t1.x4=t2.x4);

Here, we have an obvious implicit functional dependency between all columns. Let's see how selectivity estimation works for the scan operator and JOIN. Running this in PostgreSQL gives:

Simple SELECT:
Index Only Scan using t1_idx on t1  (cost=0.28..4.30 rows=1 width=16)
(actual time=0.067..0.072 rows=10 loops=1)
...
JOIN of two tables:
Nested Loop  (cost=0.29..72.88 rows=25 width=32)
(actual time=0.110..10.203 rows=10000 loops=1)
  ->  Seq Scan on t1  (cost=0.00..16.00 rows=1000 width=16)
                      (actual time=0.019..0.225 rows=1000 loops=1)
  ->  Memoize  (cost=0.29..0.33 rows=1 width=16)
               (actual time=0.002..0.005 rows=10 loops=1000)
        Cache Key: t1.x1, t1.x2, t1.x3, t1.x4
...

It’s clear that the database has no idea about the distribution of values across adjacent columns, so it estimates the number of result tuples by multiplying selectivities for each column, which significantly underestimates the result. As a result, the optimizer falls back to choosing NestLoop, which is a risky choice, though mitigated by the use of the Memoize intermediate caching operator.
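The underestimate can be reproduced with the planner's own arithmetic (a simplified Python model of the standard equality-join estimate, where each clause gets selectivity 1/ndistinct; the ndistinct values follow from the value%N generators above):

```python
# ndistinct per column, from value%2, value%10, value%20, value%100
ndistinct = {"x1": 2, "x2": 10, "x3": 20, "x4": 100}
rows_t1 = rows_t2 = 1000

selectivity = 1.0
for nd in ndistinct.values():
    selectivity /= nd            # independence assumption: s = s1*s2*s3*s4

estimate = round(rows_t1 * rows_t2 * selectivity)
# 1000 * 1000 / (2 * 10 * 20 * 100) = 25, exactly the "rows=25" in the plan,
# against 10,000 actual rows: a 400x underestimate
```

Because all four columns are derived from the same value, the true joint selectivity is only 1/100, but multiplying the per-column selectivities drives the estimate down by another factor of 400.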

How does SQL Server handle this? Let's look at the simple SELECT plan:

…and at the Join plan

SQL Server handled it perfectly. The catch? Without indexes, the result changes drastically. This shows that SQL Server collects statistics on the joint distribution of data across columns during index creation.

After studying SQL Server's documentation and research, we found that to handle uneven value distributions, SQL Server gathers extensive statistics: index statistics, statistics on WHERE conditions, and user-defined statistics (similar to PostgreSQL’s CREATE STATISTICS) with more options. The most interesting ones are histograms for estimating JOIN and WHERE selectivity, and filtered statistics with a WHERE clause that scan only the part of the table satisfying the predicate.

In our case, SQL Server builds EXTENDED STATISTICS (my hypothesis) for columns x1, x2, x3, x4 in tables t1 and t2. With knowledge of the number of unique values and histograms for t1(x1, x2, x3, x4) and t2(x1, x2, x3, x4), SQL Server accurately estimates JOIN cardinality.

PostgreSQL’s extended statistics are very limited right now — mainly for scan filters, and not for JOIN cardinality estimation. Fortunately, the PostgreSQL community has recognized this issue and is working on a solution.

Interestingly, Postgres Professional has developed a non-trivial way to partially compensate for this existing limitation. If you need to jointly evaluate expressions like A.x = N AND A.y = M, you can simply represent the expression differently:

ROW(A.x, A.y) = ROW(N, M)

For this type of expression, a new composite type can be defined. With such a type, you can use standard PostgreSQL tools to compute and store standard statistics for the expression ROW(A.x, A.y), and leverage histograms, MCV (Most Common Values), and knowledge of distinct values for that type to estimate the selectivity of the expression without resorting to extended statistics!
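Why statistics over the composite value help can be shown with a small Python model (illustrative only, not the Postgres Pro implementation): counting distinct pairs captures the functional dependency that independent per-column counts miss.

```python
# x determined by y: v%10 fixes v%2, so the columns are strongly correlated
data = [(v % 2, v % 10) for v in range(1000)]

ndistinct_x = len({x for x, _ in data})   # 2 distinct values in x
ndistinct_y = len({y for _, y in data})   # 10 distinct values in y
ndistinct_pair = len(set(data))           # 10 distinct pairs, not 2*10 = 20

# Independent per-column stats predict selectivity 1/(2*10) for
# x = N AND y = M; stats on the ROW(x, y) value give the correct 1/10
independent_sel = 1 / (ndistinct_x * ndistinct_y)
composite_sel = 1 / ndistinct_pair
```

The same reasoning extends to histograms and MCV lists built over the composite value: they describe the joint distribution directly instead of assuming independence.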

How this works can be seen by checking the estimates the optimizer provides for queries when manipulating the enable_compound_index_stats parameter.

Another advantage often helping SQL Server build optimal plans is its advanced internal parameterization and parameter value caching.

Parameter value caching

To illustrate this optimization aspect, here’s a non-trivial example joining three large tables:

CREATE TEMP TABLE t1 (x numeric PRIMARY KEY, payload text);
CREATE TEMP TABLE t2 (x numeric, y numeric PRIMARY KEY);
CREATE TEMP TABLE t3 (x numeric, payload text);
INSERT INTO t1 (x, payload)
  (SELECT gs, 'long line of text'
   FROM generate_series(1, 1E5) AS gs);
INSERT INTO t2 (x, y)
  (SELECT gs % 10, gs FROM generate_series(1, 1E6) AS gs);
INSERT INTO t3 (x, payload)
  (SELECT -(gs % 10) + 2, 'long line of text'
   FROM generate_series(1, 1E5) AS gs);
VACUUM ANALYZE t1, t2, t3;
EXPLAIN (COSTS OFF, ANALYZE, BUFFERS OFF, TIMING OFF)
SELECT * FROM t3 WHERE x IN (
  SELECT y FROM t2 WHERE x IN (
    SELECT x FROM t1)
);

To simplify the demo, we disable HashJoin (SET enable_hashjoin = off). The resulting PostgreSQL plan is:

Nested Loop Semi Join (actual rows=20000 loops=1)
   ->  Seq Scan on t3 (actual rows=100000 loops=1)
   ->  Nested Loop (actual rows=0 loops=100000)
         ->  Index Scan using t2_pkey on t2 (actual rows=0 loops=100000)
               Index Cond: (y = t3.x)
         ->  Index Only Scan using t1_pkey on t1 (actual rows=1 loops=20000)
               Index Cond: (x = t2.x)
               Heap Fetches: 0
 Planning Time: 1.099 ms
 Execution Time: 675.855 ms

We have three large tables here. Joining them directly would be heavy, so PostgreSQL employs internal parameterization. This means the NestLoop JOIN doesn’t scan the entire inner table; instead, it first fetches a tuple from the outer side and then performs an indexed scan on the inner side with a parameter, thus reducing the number of returned rows. In this example, parameterized scanning happens both for table t1 and the whole subtree that includes the JOIN of t1 and t2.

There’s a major flaw in this plan: table t3 contains many duplicates—only 10 unique values in 100K rows. In such cases, it would be much cheaper to cache the result of the underlying JOIN(t1, t2), since it can only return 10 distinct results, instead of rescanning it 100K times. While this example is simple, the subtree could be far more complex! PostgreSQL has a Memoize node for caching parameter values and query subtree results. However, research shows it doesn’t work for SEMI JOINs and can’t be used directly above the JOIN operator. Moreover, even if we overcome these limitations, PostgreSQL won’t place the Memoize operator above the index scan on t1—because when planning the JOIN (t1, t2), the re-scan count for t1 is estimated as 1, and the join with t3 (i.e. JOIN(JOIN(t1, t2), t3)) that triggers many re-scans happens later, as the optimizer works bottom-up.
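What such a cache above the JOIN(t1, t2) subtree would buy can be modeled in a few lines (a hypothetical Python sketch, not executor code; rescan_subtree stands in for re-executing the subtree with a given parameter):

```python
def make_memoized(rescan):
    # Memoize-style cache: parameter value -> cached subtree result
    cache = {}
    stats = {"hits": 0, "misses": 0}
    def probe(param):
        if param in cache:
            stats["hits"] += 1
        else:
            stats["misses"] += 1          # only a miss rescans the subtree
            cache[param] = rescan(param)
        return cache[param]
    return probe, stats

subtree_scans = []
def rescan_subtree(param):
    subtree_scans.append(param)           # count real re-executions
    return [param]                        # pretend result rows

probe, stats = make_memoized(rescan_subtree)
for v in (i % 10 for i in range(100_000)):   # t3.x: 10 distinct values
    probe(v)
# the subtree runs 10 times instead of 100,000
```

With only 10 distinct parameter values, 99,990 of the 100,000 probes become cache hits, which is exactly the saving SQL Server's spool achieves below.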

Now, let’s see how SQL Server handles this. The equivalent Transact-SQL looks like this:

CREATE TABLE t1 (x numeric PRIMARY KEY, payload text);
CREATE TABLE t2 (x numeric, y numeric PRIMARY KEY);
CREATE TABLE t3 (x numeric, payload text);
INSERT INTO t1 (x, payload)
  (SELECT value, 'long line of text'
   FROM generate_series(1, 100000));
INSERT INTO t2 (x, y)
  (SELECT value % 10, value FROM generate_series(1, 1000000));
INSERT INTO t3 (x, payload)
  (SELECT -(value % 10) + 2, 'long line of text'
   FROM generate_series(1, 100000));
ALTER DATABASE SCOPED CONFIGURATION SET MAXDOP = 1;
SELECT * FROM t3 WHERE x IN (
  SELECT y FROM t2 WHERE x IN (
    SELECT t1.x FROM t1)
)
OPTION (USE HINT('DISABLE_BATCH_MODE_ADAPTIVE_JOINS'), LOOP JOIN);

The query plan (shown in the image below) reveals that SQL Server handles this in 138 ms. While its plan broadly resembles PostgreSQL’s, SQL Server caches subquery results using the Row Count Spool operator. Its parameter caching technique is so well-honed that the optimizer not only recognizes the need to cache parameter values and results, but can also place caches over both the index scan on t1 and the entire JOIN subtree of (t1, t2). That’s nearly a 5x speedup, which is pretty impressive!

But that’s not all. Some of the most striking query speedups I've seen in SQL Server come from another optimization technique:

Internal parameterization of the query plan

In the example above, we saw that the PostgreSQL optimizer is capable of parameterizing scan operators and joins. However, it turns out that PostgreSQL can't parameterize other query plan nodes (for example, it can't propagate a parameter into a subquery), which limits the possible plan space. Where could this be an issue? Let’s take a look at another example:

CREATE TABLE t1 (x1 integer, x2 text);
CREATE TABLE t2 (y1 integer, y2 text);
CREATE INDEX t1_idx0 ON t1 (x1);
CREATE INDEX t2_idx0 ON t2 (y1);
INSERT INTO t1 (x1,x2)
 SELECT value % 100, 'long line of text number 1'
 FROM GENERATE_SERIES(1, 10000) AS value;
INSERT INTO t2 (y1,y2)
 SELECT value % 10000, 'long line of text number 2'
 FROM GENERATE_SERIES(1, 30000) AS value;
VACUUM ANALYZE t1, t2;

First, let's examine PostgreSQL's behavior. We'll try joining tables t1 and t2, while adding a GROUP BY to each table's scan.

EXPLAIN (COSTS OFF, ANALYZE, BUFFERS OFF, TIMING OFF)
SELECT * FROM
 (SELECT x1 FROM t1 GROUP BY x1) AS q1(x1)
 JOIN
 (SELECT y1 FROM t2 GROUP BY y1) AS q2(y1)
ON q2.y1 = q1.x1;
/*
Hash Join (actual rows=100 loops=1)
  Hash Cond: (t2.y1 = t1.x1)
  ->  HashAggregate (actual rows=10000 loops=1)
        Group Key: t2.y1
        ->  Seq Scan on t2 (actual rows=30000 loops=1)
  ->  Hash (actual rows=100 loops=1)
        Buckets: 1024  Batches: 1  Memory Usage: 12kB
        ->  HashAggregate (actual rows=100 loops=1)
              Group Key: t1.x1
              ->  Seq Scan on t1 (actual rows=10000 loops=1)
Execution Time: 25.311 ms
*/

Forcing a NestLoop instead (with HashJoin disabled) increased query execution time 5x, up to 145 ms. Nothing surprising, right? Nested loops are expected to be slow on large tables.

But what if we rewrite the query using lateral joins? Let’s see how that changes things.

EXPLAIN (COSTS OFF, ANALYZE, BUFFERS OFF, TIMING OFF)
SELECT * FROM
(SELECT x1 FROM t1 GROUP BY x1) AS q1(x1),
 LATERAL (SELECT y1 FROM t2 WHERE t2.y1=q1.x1 GROUP BY y1) AS q2(y1);
/*
Nested Loop (actual rows=100 loops=1)
  ->  HashAggregate (actual rows=100 loops=1)
        Group Key: t1.x1
        ->  Seq Scan on t1 (actual rows=10000 loops=1)
  ->  Group (actual rows=1 loops=100)
        ->  Index Only Scan using t2_idx0 on t2 (actual rows=3 loops=100)
              Index Cond: (y1 = t1.x1)
Planning Time: 1.418 ms
Execution Time: 7.456 ms
*/

This is still a NestLoop returning the same result, but it runs significantly faster — about three times faster than HashJoin! The reason is that by declaring a lateral reference, we explicitly add a parameter, allowing the join to target t2 without scanning most of its rows. In the previous example, the grouping in the subquery prevented the creation of a parameterized scan, as PostgreSQL only supports internal parameterization for JOIN and scan operators.
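The contrast between the two strategies can be sketched in Python (a model of the plans above, not executor internals; the dict plays the role of t2_idx0). Without parameterization, the inner subquery must group all of t2 for every outer row; with a parameterized index scan, each probe touches only the matching rows.

```python
from collections import defaultdict

t1 = [v % 100 for v in range(10_000)]       # 100 distinct x1 values
t2 = [v % 10_000 for v in range(30_000)]    # each y1 value appears 3 times

# Equivalent of t2_idx0: y1 -> matching t2 rows
t2_index = defaultdict(list)
for y in t2:
    t2_index[y].append(y)

outer_keys = sorted(set(t1))                # q1: GROUP BY x1 -> 100 keys
rows_touched = 0
result = []
for x in outer_keys:
    matches = t2_index.get(x, [])           # parameterized index scan on t2
    rows_touched += len(matches)
    if matches:
        result.append((x, matches[0]))      # Group node: one y1 per parameter
```

The parameterized path reads about 3 rows per probe (300 total) instead of re-scanning and re-grouping all 30,000 rows of t2 for each of the 100 outer keys, mirroring the 7 ms vs. 25 ms difference above.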

Now, let’s see how SQL Server handles this:

SQL Server achieved roughly the same query execution time as PostgreSQL's best case using lateral joins — but through a slightly different, yet equally interesting, optimization. It deferred the grouping operator's execution to a later phase in the plan tree, which enabled parameterized scanning within the JOIN. Additionally, by anticipating a reduced tuple count after the join, the optimizer aimed to minimize the grouping workload.

However, this isn't our main focus. To eliminate extra optimization techniques and narrow the search space for both optimizers, we replace the JOIN with a LEFT JOIN in the query:

EXPLAIN (COSTS OFF, ANALYZE, TIMING OFF, BUFFERS OFF)
SELECT * FROM
 (SELECT x1 FROM t1 GROUP BY x1) AS q1(x1)
 LEFT JOIN
 (SELECT y1 FROM t2 GROUP BY y1) AS q2(y1)
ON q2.y1 = q1.x1;

In Transact-SQL, the query is almost identical—you just need to add the hint OPTION (LOOP JOIN) to prevent a HashJoin from being chosen. In the end, we have in PostgreSQL:

Nested Loop Left Join (actual rows=100 loops=1)
  Join Filter: (t2.y1 = t1.x1)
  Rows Removed by Join Filter: 490197
  ->  Group (actual rows=100 loops=1)
        Group Key: t1.x1
        ->  Index Only Scan using t1_idx0 on t1 (actual rows=10000 loops=1)
              Heap Fetches: 0
  ->  HashAggregate (actual rows=4903 loops=100)
        Group Key: t2.y1
        Batches: 1  Memory Usage: 913kB
        ->  Seq Scan on t2 (actual rows=30000 loops=1)
Planning Time: 0.412 ms
Execution Time: 75.476 ms

In SQL Server:

1 ms vs. 75 ms, a 75x speedup! All because SQL Server allows parameterization to pass through complex operators. For example, here the parameter used by the NestLoop flows through a StreamAggregate, enabling a targeted IndexScan.

In summary, after examining these (and other, less impressive) cases, it's clear that a broad set of query optimization techniques is missing from the current PostgreSQL version, some of them not even discussed on the pgsql-hackers mailing list. SQL Server’s developers have clearly put in a lot of work to smooth out edge cases like data skew or extreme duplicates, making query execution more adaptive and response times acceptable.

Of course, PostgreSQL has its strengths. For instance, you can define custom sort orders for each aggregate function, and the optimizer has flexibility in ordering columns in GROUP BY or DISTINCT ON clauses, among other things if you dig deep enough.

However, comparing closely related RDBMS is a great way to identify new paths for code evolution—since you already have a model in the form of a query plan! You could even re-engineer or design a new optimization layer by leveraging the target system's unique features and its relatively simpler codebase. Have you encountered cases where migrating between RDBMS led to significant slowdowns or speedups thanks to new, optimal query plans? Let us know in the comments!
