
It all started as a joke by the office coffee machine. But, as with every decent joke, it suddenly sounded worth trying — and before we knew it, we were knee-deep in an experiment that turned out to be anything but trivial, complete with a whole minefield of gotchas.
It began simply: while everyone else was busy debating hardware tuning and squeezing out extra TPS from their systems, we thought — why not just shove a huge chunk of data into PostgreSQL and see how it holds up? Like, really huge. Say, a one-petabyte database. Let’s see how it survives that.
It was December 10, the boss wanted the report by January 20, and New Year was less than a month away. And that itch that all engineers know? It hit hard.
First, let’s acknowledge this: production databases that big are still more of a concept than everyday reality. Sure, hundreds of terabytes are no longer impressive. But petabytes? That’s still lab territory. That said, the line is blurring fast. Real projects involving multi-petabyte databases are starting to pop up — but often the size comes down to what's being stored. Logs? Images? Video? Sure, anyone can blow up a table like that. Then there’s telemetry from factories, analytics generating hundreds of temporary tables, and so on. Data volume keeps growing. In five years, nobody will bat an eye at a petabyte. But not today.
The hardware side of the story
So, if we want to test whether PostgreSQL can handle petabyte pressure, first we need to find that petabyte of space. It’s a big number. We didn’t exactly have spare petabytes lying around in the office. A quick market scan showed that, sure, we could buy 15 TB disks, load 24 of them into 2U chassis, do that three times, and boom — we’re there. But it’s expensive, takes time to deliver, and hey, New Year is coming. We’ve got snowballs to throw, not just pgbench to run.
Enter the cloud. Perfect for MVPs and seasonal workloads. Or so all the cloud marketing promised us. So we googled up some providers and sent them a brief that went something like this:
We need a guaranteed 1 PB of storage, unified. No thin-provisioning tricks, just give it to us.
SSDs only. No HDDs allowed.
N machines with access to it. Don’t care how many exactly, just no fewer than three.
Each VM should have 4 to 16 vCPU. This is an experiment, not a production cluster. Basic Intel Silver or equivalents will do.
Each VM must have at least 64 GiB RAM, preferably 128 or even 256, just so memory doesn’t become the bottleneck. RAM is cheap compared to everything else.
Why do we need all this? To generate data, run queries, measure things, draw some charts, and ponder the results. You know, the usual.
And then came surprise number one. Every single cloud provider bailed on the very first requirement. They just didn’t have that much space available in a single chunk — even if they started clearing it for us. Sure, we could cobble together storage from different regions, but testing performance when node one’s in Moscow, node two’s in Saratov, and node three is somewhere near the Arctic Circle… well, you get it.
So we ditched the cloud and went the old-school route — renting physical servers. Which led to exactly the same disappointment: not a single data center had enough fast disk space to just plug and play. On one hand, we get it — idle hardware is money wasted. On the other, where’s the hot spare mentality? That’s a philosophical question, and reality wasn’t interested in philosophy.
Just as we were about to kill the whole idea, we heard back from a decent data center. They had seven servers that kinda matched what we needed. They were willing to stuff them with disks and rent them to us for cheap. A distributed database with Shardman across seven machines? That sounded intriguing. We gave the green light.
Wait, what’s Shardman?
Shardman is a distributed database engine we’ve been building at Postgres Professional for the past five years. The core idea is simple: take a table, partition it, and distribute those partitions across different nodes in a share-nothing setup. On top of that, you get a unified SQL interface. The system still satisfies full ACID guarantees and is designed for OLTP workloads. The biggest win? Any node can be used as an entry point for your queries.

So now, with hardware sorted out, we had just two weeks left till New Year.
The benchmark
Alright, we’ve got the hardware. Now we need a ruler to measure things. Time’s short, and we’re not building a warp drive here. So let’s grab a solid, well-known benchmark, load data with it, fire some queries, and then go chop veggies for the holiday salad.
Enter YCSB — Yahoo! Cloud Serving Benchmark. Or more specifically, its Go implementation by pingcap. Never heard of it? Here’s what the crew says: it’s a NoSQL benchmark from 2010 created by Yahoo folks. It covers the bare minimum of simple operations. Yes, it’s NoSQL, but that’s fine because it uses a classic key-value model. No messy joins, no distributed transactions, no fancy queries.
It builds a single table called usertable, supports partitioning, and fits right into Shardman without any modifications. The key is dead simple: ycsb_key = 'user' + hash(seq), plus ten varchar(100) columns filled with random data. Four letters and a hash. The rest of the fields? Totally random. No need for business logic, validation, or anything else.
Spoiler: this carefree attitude will come back to bite us later. But that’s a story for the next part.
The structure of the table
            Partitioned table "public.usertable"
  Column  |          Type          | Collation | Nullable | Default
----------+------------------------+-----------+----------+---------
 ycsb_key | character varying(64)  |           | not null |
 field0   | character varying(100) |           |          |
 field1   | character varying(100) |           |          |
 field2   | character varying(100) |           |          |
 field3   | character varying(100) |           |          |
 field4   | character varying(100) |           |          |
 field5   | character varying(100) |           |          |
 field6   | character varying(100) |           |          |
 field7   | character varying(100) |           |          |
 field8   | character varying(100) |           |          |
 field9   | character varying(100) |           |          |
Partition key: HASH (ycsb_key)
Indexes:
    "usertable_pkey" PRIMARY KEY, btree (ycsb_key)
It's time for some math. Each row in our table takes up about 1100 bytes. That means we need to load roughly a trillion of them into the database (yep, that’s thirteen digits). We’ve got two weeks to pull this off, so we need to push about a gigabyte per second — and with that pace, we’d even have a little wiggle room. Sounds doable. Modern disks can handle this kind of load.
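If you enjoy doing the back-of-envelope math straight in psql, the same estimate fits in one query:
-- ~1 PB of payload at ~1100 bytes per row, loaded over 14 days.
SELECT round(1e15 / 1100)                            AS rows_needed,
       pg_size_pretty((1e15 / (14 * 86400))::bigint) AS bytes_per_second;
-- Roughly 0.9 trillion rows and just under a gigabyte per second.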
Alongside the usual go-ycsb benchmarks, we decided to roll our own tests using our tool pg_microbench, which is built on standard Java libraries for multithreading and JDBC for database interaction.
We designed two test cases: YahooSimpleSelect and YahooSimpleUpdate. Both are set up so you can specify which shard to connect to and which shard to read from — just to be 100% sure who’s reading what and from where. For every generated ycsb_key, we calculate its target shard and only run queries against that shard.
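Incidentally, there's an easy way to sanity-check that routing from psql: ask PostgreSQL which partition a given key actually lives in. tableoid resolves either to a local partition or, on Shardman, to the foreign table that points at the remote shard. The key literal below is just a placeholder:
-- Which partition (and therefore which shard) holds this key?
SELECT tableoid::regclass AS partition, ycsb_key
FROM   usertable
WHERE  ycsb_key = 'user1234567890';   -- any existing key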
The test flow was pretty straightforward:
Run shardman.global_analyze() to collect statistics on both sharded and global tables (the call, along with the query shapes we used, is sketched right after this list).
Kick off a 10-minute warm-up with the selected load profile.
Launch the actual test with different thread counts, increasing by powers of two. Each test runs for 5 minutes.
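For reference, step one plus the two workloads boil down to statements of roughly this shape; the exact queries live inside pg_microbench, so treat these as our shorthand, key literal included:
-- Step 1: cluster-wide statistics before measuring anything.
-- (If your Shardman version ships global_analyze as a procedure, use CALL instead.)
SELECT shardman.global_analyze();

-- YahooSimpleSelect: a point lookup by primary key.
SELECT * FROM usertable WHERE ycsb_key = 'user1234567890';

-- YahooSimpleUpdate: a point update of one field for the same kind of key.
UPDATE usertable SET field3 = 'fresh-random-padding'
WHERE ycsb_key = 'user1234567890';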
And that’s the plan. Time to hit "go" and — where are our snacks?
December 10, 2024. Harsh reality
While we were still waiting on the servers, someone had a bright idea: “Let’s run this benchmark in our lab and see what it does!” We spun up a dozen virtual machines, handed each 8 cores, launched the whole thing — and went to lunch.
Half an hour later we checked the results, and… cold sweat. Just 50 GB across all nodes in 30 minutes. That’s nowhere near the gigabyte-per-second pace we were aiming for.
Digging deeper, we found that the generator had simply stopped writing new data. We popped open the code and patched it to insert directly into the correct shard by calculating the node from the sharding key upfront. But why was it so slow?
What was inside? The logic looked standard enough — a classic generator loop:
for (long seq = 0; seq < X; seq++) {
    ycsb_key = "user" + fnv1a(seq);
    ...
}
Seemed legit: generate a key, insert data. But where were the results? Turns out the key was “special.” It included an fnv1a hash of a sequence number. No one really knew why it was needed, so we kept digging.
Eventually, we hit the most boring issue imaginable: hash collisions. The algorithm started generating duplicate keys, leading to endless empty loops. At some point, collisions got so frequent that data generation effectively ground to a halt.
Management decision: stop writing one row at a time. Batch them. Everyone knows batching is faster, period. So we rewrote it to generate batches of rows. And hit a wall — deadlocks. The system completely froze. Why? Because hash collisions weren’t just producing duplicate keys within one thread: the same keys kept turning up in different threads, so multiple threads were endlessly trying to insert the same rows. A textbook nightmare.
December 20. Eleven days left
Nobody wants to spend January debugging PostgreSQL benchmarks. Time for drastic action. We ditched the hash function and went with good old sequential key generation. Straightforward Java batching: 100 rows per transaction, inserts routed to the owning node, no hashing involved. Just pre-generate a million rows, pick random ones into a batch, and go.
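The loader itself is Java, but the per-batch idea is easy to show in plain SQL: sequential keys, random-ish padding, one short transaction per batch. The filler expression here is ours, not the loader's:
-- One batch: 100 rows, sequential keys, 100-char padding per field.
-- (The real loader is Java/JDBC and routes each batch to the local node's
--  partitions; this only shows the shape of the data.)
BEGIN;
INSERT INTO usertable
SELECT 'user' || seq,
       rpad(md5(random()::text), 100, 'x'), rpad(md5(random()::text), 100, 'x'),
       rpad(md5(random()::text), 100, 'x'), rpad(md5(random()::text), 100, 'x'),
       rpad(md5(random()::text), 100, 'x'), rpad(md5(random()::text), 100, 'x'),
       rpad(md5(random()::text), 100, 'x'), rpad(md5(random()::text), 100, 'x'),
       rpad(md5(random()::text), 100, 'x'), rpad(md5(random()::text), 100, 'x')
FROM   generate_series(0, 99) AS seq;   -- batch 0; the next batch starts at 100
COMMIT;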
We also added monitoring — because the original benchmark had no progress indicator — and, for aesthetic pleasure, logged the load process into a separate table, updating it every 5 minutes.
We re-ran the whole thing on the same VMs, went to lunch again — and victory! The generator pushed about 150 GB in 5 minutes. That’s more like it. Time to move on to real hardware.
December 27. We got the servers. Deadline: 24 days
The salads could wait — we had work to do. We installed Debian 12 and set up RAID 0 across 10 disks, 15 TB each. That gave us 140 TB of usable space per server, times seven servers — almost a full petabyte. Nice.
Then came the software: OS, Shardman, monitoring tools, and all the bells and whistles (pgpro_otel_collector, pgpro_stats, pgpro_pwr). We set num_parts to 70 so each node would store 10 equal partitions and none of them would bump into the 32 TB limit on a single partition. We mounted PGDATA directly into the RAID mount point — more on that later.
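Keeping an eye on how close each local partition creeps toward that limit is a one-liner; something like this does the trick (remote shards appear as foreign tables, so only local relations are counted):
-- Size of each local partition of usertable.
SELECT c.oid::regclass AS partition,
       pg_size_pretty(pg_total_relation_size(c.oid)) AS size
FROM   pg_inherits i
JOIN   pg_class    c ON c.oid = i.inhrelid
WHERE  i.inhparent = 'usertable'::regclass
  AND  c.relkind = 'r'
ORDER  BY pg_total_relation_size(c.oid) DESC;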
Since Shardman is a distributed database and very picky about clock skew, we installed chrony on each node to sync time via NTP. Bless the Omnissiah. And then — we fired up the generator on every node.
Before heading home, we took one last look. Each node was loading data at ~150 GB per 10 minutes. One node, planck-6, decided to be an overachiever and generated a bit more than the others. But more is better than less, right? So we locked the office, scattered to our homes, and slipped into a blissful, citrus-scented sleep.
What could possibly go wrong now?
planck-1: 629 GB
planck-2: 632 GB
planck-3: 620 GB
planck-4: 632 GB
planck-5: 631 GB
planck-6: 975 GB
planck-7: 626 GB
December 28, 2024. Never deploy on a Friday. 23 days to deadline
Saturday morning brought an unpleasant surprise: two of our nodes had fallen way behind the rest. While most managed to crank out around 11 terabytes overnight, nodes 3 and 6 barely scraped 4.
planck-1: 11T
planck-2: 11T
planck-3: 3.5T
planck-4: 11T
planck-5: 11T
planck-6: 4.1T
planck-7: 11T
Panic mode engaged. No time to be sad, so we dove under the hood. iostat showed elevated RAID usage and painfully slow I/O operations. Write latency on the NVMe drives for servers 3 and 6 clocked in at around 5 ms — compared to just 0.15 ms on the other machines.

And of course, it’s Saturday. Holiday parties are in full swing, and here we are getting 10 ms response times from disks. Lovely.
We reached out to tech support, who — bless them — jumped in within 10 minutes. SMART said all was fine, firmware was identical across drives, BIOS settings matched, everything seemed in order. We tried rebooting, double-checked versions and configs. As a last resort, we swapped out the NVMe cables — things improved (5 ms became 1 ms), but still not close to the 0.1 ms we were seeing elsewhere.
After a long (and inconclusive) debate, we landed on a theory: one of the disks was acting up, and since our RAID was software-based, its speed was capped by the slowest drive. We weren’t about to swap all the drives and start from scratch, so we pulled a classic hack. We broke the RAID arrays on the problematic nodes and went with standalone disks instead. Each disk got its own tablespace, and we wrote directly to them. It seemed like a solid workaround — at the time.
But trouble came from an unexpected place: turns out Shardman doesn’t support local tablespaces. If you create one, you have to replicate it across all servers. Makes sense for a distributed system, but redoing our setup from scratch? No thanks.
postgres=# CREATE TABLESPACE u02_01 LOCATION '/u02_01';
ERROR: local tablespaces are not supported
HINT: use "global" option to create global tablespace
So, we did what every good hacker does: “Do as I say, not as I do.” We disabled schema sync.
postgres=# set shardman.sync_schema = off;
SET
postgres=# show shardman.sync_schema;
shardman.sync_schema
----------------------
off
(1 row)
postgres=# CREATE TABLESPACE u02_01 LOCATION '/u02_01';
ERROR: tablespace location template doesn't specify all necessary substitutions
HINT: expected to see rgid word in location template
Now our Shardman was, schema-wise, basically local-only and everything should’ve worked… except it didn’t. It refused to create a tablespace unless the {rgid} placeholder showed up in the location template. But hey, for every overly clever restriction, there’s an equally clever workaround. We created a directory named after the node’s rgid, kept {rgid} in the template, and it worked like a charm.
mkdir /u02_01/3
mkdir /u02_02/3
postgres=# CREATE TABLESPACE u02_02 LOCATION '/u02_02/{rgid}';
CREATE TABLESPACE
postgres=# CREATE TABLESPACE u02_03 LOCATION '/u02_03/{rgid}';
CREATE TABLESPACE
Tablespaces were created, shards were spread across the disks, and generation resumed.
ALTER TABLE usertable_2 SET TABLESPACE u02_01;
ALTER TABLE usertable_9 SET TABLESPACE u02_02;
ALTER TABLE usertable_16 SET TABLESPACE u02_03;
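Typing thirty of those by hand gets old fast. psql's \gexec can generate and run them instead, for example rotating the local partitions across the three tablespaces (a sketch; adjust the name pattern and tablespace names to your layout):
-- Generate "ALTER TABLE ... SET TABLESPACE ..." for every local partition,
-- rotating through u02_01 / u02_02 / u02_03, then let \gexec run each row.
SELECT format('ALTER TABLE %I SET TABLESPACE u02_%s',
              c.relname,
              lpad(((row_number() OVER (ORDER BY c.relname)) % 3 + 1)::text, 2, '0'))
FROM   pg_class c
WHERE  c.relname ~ '^usertable_[0-9]+$'
  AND  c.relkind = 'r'
\gexec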
By December 30, we were done tweaking. We set it up to leave about half a terabyte free on each node—just in case. If the disks filled up, we wouldn’t be able to run tests. With everything humming along nicely, we finally headed home to ring in the new year with our families and salad bowls.
January 3, 2025. Never rush. 17 days to deadline
The on-call engineer got an alert: disk space on server 6 was running low, and data loading had stopped. Turns out, in the pre-holiday rush, we had reconfigured the tablespace for tables—but forgot to do the same for indexes:
alter index usertable_12_pkey set tablespace u02_01;
alter index usertable_19_pkey set tablespace u02_02;
We fixed the index tablespaces wherever needed and got generation back on track.
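The same \gexec trick handles the indexes: each primary key simply follows its table into that table's tablespace (again just a sketch on top of the standard pg_indexes and pg_tables views):
-- Put every usertable_* primary key into the same tablespace as its table.
SELECT format('ALTER INDEX %I SET TABLESPACE %I', i.indexname, t.tablespace)
FROM   pg_indexes i
JOIN   pg_tables  t ON t.schemaname = i.schemaname AND t.tablename = i.tablename
WHERE  i.tablename ~ '^usertable_[0-9]+$'
  AND  t.tablespace IS NOT NULL
\gexec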
January 9. What's going on? 11 days to deadline
After ringing in the new year, part of the team came down with the flu, others escaped to the mountains, and some blissfully disconnected from work. But on the 9th, one curious soul checked the monitoring dashboard—and found total chaos.
rasdaemon was screaming about bad memory sticks. Every single server except the seventh was showing critical RAM errors.
2025-01-09T16:58:52.754082+03:00 planck-3 kernel: [1021273.775923] mce: [Hardware Error]: Machine check events logged
2025-01-09T16:58:52.754511+03:00 planck-3 kernel: [1021273.776582] EDAC skx MC5: HANDLING MCE MEMORY ERROR
2025-01-09T16:58:52.754515+03:00 planck-3 kernel: [1021273.776585] EDAC skx MC5: CPU 16: Machine Check Event: 0x0 Bank 17: 0x8c00008200800090
2025-01-09T16:58:52.754517+03:00 planck-3 kernel: [1021273.776591] EDAC skx MC5: TSC 0x8b09ed5a153c5
2025-01-09T16:58:52.754520+03:00 planck-3 kernel: [1021273.776594] EDAC skx MC5: ADDR 0x306533e800
2025-01-09T16:58:52.754523+03:00 planck-3 kernel: [1021273.776597] EDAC skx MC5: MISC 0x9018c07f2898486
2025-01-09T16:58:52.754527+03:00 planck-3 kernel: [1021273.776600] EDAC skx MC5: PROCESSOR 0:0x606a6 TIME 1736431132 SOCKET 1 APIC 0x40
2025-01-09T16:58:52.754530+03:00 planck-3 kernel: [1021273.776617] EDAC MC5: 2 CE memory read error on CPU_SrcID#1_MC#1_Chan#0_DIMM#0 (channel:0 slot:0 page:0x306533e offset:0x800 grain:32 syndrome:0x0 — err_code:0x0080:0x0090 SystemAddress:0x306533e800 ProcessorSocketId:0x1 MemoryControllerId:0x1 ChannelAddress:0x3f94cf800 ChannelId:0x0 RankAddress:0x1fca67800 PhysicalRankId:0x1 DimmSlotId:0x0 Row:0xfe51 Column:0x308 Bank:0x3 BankGroup:0x0 ChipSelect:0x1 ChipId:0x0)
Off to tech support again. They checked everything, replaced what needed replacing, and got us back online within hours. The hardware may have been meh, but kudos to the support team — they really delivered.
January 10. Let’s test. 10 days to deadline
Time to actually run some tests. Alongside go-ycsb, we stuck with our beloved pg_microbench for its multithreading capabilities and the ability to target specific shards for reading and writing.
As a reminder, we had written two tests for pg_microbench: YahooSimpleSelect and YahooSimpleUpdate. Both let you specify exactly which shard to connect to and read from, so you always know who’s doing what.
Our main result came from YahooSimpleSelect, producing a matrix of TPS values for 20 workers:
| planck-1 | planck-2 | planck-3 | planck-4 | planck-5 | planck-6 | planck-7
---------------------------------------------------------------------------------------
planck-1 | 15805 | 710 | 764 | 649 | 751 | 734 | 242
planck-2 | 741 | 10987 | 725 | 629 | 677 | 697 | 229
planck-3 | 984 | 905 | 23673 | 825 | 914 | 931 | 333
planck-4 | 779 | 634 | 670 | 10220 | 623 | 633 | 214
planck-5 | 771 | 712 | 764 | 628 | 7830 | 695 | —
planck-6 | 810 | 751 | 802 | 654 | 693 | 37807 | 219
planck-7 | 223 | 236 | 219 | 206 | 210 | 330 | 1663
Columns = server sending the query, rows = server returning the data.
Naturally, diagonal values (local queries) were the highest — no surprises there. But funnily enough, the much-maligned servers 3 and 6 actually performed better than the rest. Meanwhile, server 7 — our only healthy one — was suspiciously slow. Turned out the problem was with the stripe parameter.

Take a look at this jagged CPU utilization on server 7 during testing. It should’ve been a smooth line — our load was constant. We fired up the usual suspects: perf and flamegraph. No smoking gun, so we dug deeper.
perf record -F 99 -a -g --call-graph=dwarf sleep 30
perf script --header --fields comm,pid,tid,time,event,ip,sym,dso

Zooming in, we spotted issues in the ext4_mb_good_group function. A bit of googling pointed to the stripe parameter — likely set to something non-zero. Bingo. Turns out this was a known bug that had been fixed. The ext4 filesystem on our RAID had stripe=1280 set. Not ideal. Setting it to 0 isn't perfect, but it's a known workaround. So we zeroed it out, did a remount, restarted the tests — and boom.

Tests were flying, CPU was pegged near 100%. Time to fire up the generator again, take another well-earned break, and finally push toward that one petabyte mark.
January 15. Still testing. 5 days to deadline
January 15 is the birthday of a good friend of ours — and the date when data generation finally completed across all servers. That gave us enough time to run the tests we had planned.
Let’s be honest — we didn’t hit the full petabyte. We landed at 863 terabytes, which translates to about 800 billion rows. Think that’s not enough? That we should’ve just waited another day or two? Don’t forget that the actual space used on disk is always greater than the useful data volume. By the end, our disks were nearly bursting.
Node     | Used Space | Row Count (monitoring_insert)
---------+------------+------------------------------
planck-1 | 126 TB     | 799,447,789,415
planck-2 | 107 TB     | 799,506,350,001
planck-3 | 107 TB     | 799,998,897,115
planck-4 | 126 TB     | 799,586,514,368
planck-5 | 126 TB     | 799,559,145,747
planck-6 | 126 TB     | 799,857,655,389
planck-7 | 126 TB     | 799,346,369,872
Total: 863 TB
Max per node: ~800 billion rows
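For the record, the two columns above come from very different places: used space is what the database occupies on disk, and the row count is read from the loader's own progress log, the monitoring_insert table mentioned earlier (the column name below is our guess; the real schema is the loader's business):
-- Space this node's database occupies.
SELECT pg_size_pretty(pg_database_size('postgres')) AS used_space;

-- Latest cumulative counter from the loader's progress log
-- (rows_inserted is a hypothetical column name).
SELECT max(rows_inserted) AS row_count FROM monitoring_insert;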
What did we find on the servers? Everyone’s favorite — autovacuum. Anyone who’s worked with PostgreSQL knows that after dumping a mountain of data, the autovacuum mafia wakes up and starts chewing through those terabytes.
pid | duration | wait_event | mode | database | table | phase | table_size | total_size | scanned | scanned_pct | vacuumed | vacuumed_pct
---------+---------------------+------------------------+------------+----------+----------------+---------------+------------+-------------+---------+--------------+-----------+---------------
1957485 | 1 day 01:41:41.759642 | Timeout.VacuumDelay | wraparound | postgres | usertable_0 | scanning heap | 9309 GB | 13 TB | 3210 GB | 34.5 | 0 bytes | 0.0
1957487 | 1 day 01:41:40.691321 | LWLock.WALWrite | wraparound | postgres | usertable_7 | scanning heap | 9266 GB | 13 TB | 3230 GB | 34.9 | 0 bytes | 0.0
1957491 | 1 day 01:41:39.662693 | IO.WALWrite | wraparound | postgres | usertable_56 | scanning heap | 9266 GB | 13 TB | 3553 GB | 38.3 | 0 bytes | 0.0
1957504 | 1 day 01:41:38.557464 | LWLock.WALWrite | wraparound | postgres | usertable_63 | scanning heap | 9266 GB | 13 TB | 3562 GB | 38.4 | 0 bytes | 0.0
2207587 | 10:57:22.773513 | Timeout.VacuumDelay | regular | postgres | usertable_21 | scanning heap | 11 TB | 13 TB | 9856 GB | 88.6 | 0 bytes | 0.0
2207588 | 10:57:21.801343 | LWLock.WALWrite | regular | postgres | usertable_35 | scanning heap | 11 TB | 13 TB | 9855 GB | 88.6 | 0 bytes | 0.0
2207593 | 10:57:20.791863 | Timeout.VacuumDelay | regular | postgres | usertable_14 | scanning heap | 11 TB | 13 TB | 9856 GB | 88.6 | 0 bytes | 0.0
2207606 | 10:57:19.790292 | Timeout.VacuumDelay | regular | postgres | usertable_28 | scanning heap | 11 TB | 13 TB | 9856 GB | 88.6 | 0 bytes | 0.0
In 24 hours, the autovacuum managed to handle about 4 TB out of each partition’s roughly 13 TB. The rest of the time the workers just sat there waiting on VacuumDelay. It’s a noble cause, sure — but we needed to return the servers soon and report on the budget spent. We turned off VacuumDelay, and things sped up a bit — but still not fast enough.
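For reference, "turning off VacuumDelay" boils down to zeroing PostgreSQL's cost-based vacuum delay, roughly like this:
-- Stop autovacuum from sleeping between pages
-- (vacuum_cost_delay = 0 does the same for manual VACUUM sessions).
ALTER SYSTEM SET autovacuum_vacuum_cost_delay = 0;
SELECT pg_reload_conf();
With the delay out of the way, the picture looked like this: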
pid | duration | wait_event | mode | database | table | phase | table_size | total_size | scanned | scanned_pct | vacuumed | vacuumed_pct
---------+----------------+--------------------+------------+----------+---------------+---------------+------------+-------------+---------+--------------+-----------+---------------
2429208 | 00:42:25.247109 | LWLock.WALWrite | wraparound | postgres | usertable_5 | scanning heap | 12 TB | 13 TB | 452 GB | 3.7 | 0 bytes | 0.0
2429213 | 00:42:24.36491 | LWLock.WALWrite | wraparound | postgres | usertable_12 | scanning heap | 12 TB | 13 TB | 778 GB | 6.3 | 0 bytes | 0.0
2429216 | 00:42:23.352458 | LWLock.WALWrite | wraparound | postgres | usertable_19 | scanning heap | 12 TB | 13 TB | 451 GB | 3.7 | 0 bytes | 0.0
2429231 | 00:42:22.34729 | LWLock.WALWrite | wraparound | postgres | usertable_33 | scanning heap | 12 TB | 13 TB | 452 GB | 3.7 | 0 bytes | 0.0
2429233 | 00:42:21.352138 | LWLock.WALWrite | wraparound | postgres | usertable_26 | scanning heap | 12 TB | 13 TB | 451 GB | 3.7 | 0 bytes | 0.0
2429238 | 00:42:20.35141 | LWLock.WALWrite | wraparound | postgres | usertable_40 | scanning heap | 12 TB | 13 TB | 451 GB | 3.6 | 0 bytes | 0.0
2429240 | 00:42:19.350775 | IO.WALSync | wraparound | postgres | usertable_47 | scanning heap | 12 TB | 13 TB | 450 GB | 3.6 | 0 bytes | 0.0
2429242 | 00:42:18.349118 | LWLock.WALWrite | wraparound | postgres | usertable_54 | scanning heap | 12 TB | 13 TB | 450 GB | 3.6 | 0 bytes | 0.0
2429246 | 00:42:17.336848 | LWLock.WALWrite | wraparound | postgres | usertable_61 | scanning heap | 12 TB | 13 TB | 78 GB | 0.6 | 0 bytes | 0.0
2429259 | 00:42:16.347052 | LWLock.WALWrite | wraparound | postgres | usertable_68 | scanning heap | 12 TB | 13 TB | 78 GB | 0.6 | 0 bytes | 0.0
(10 rows)
Digging further, we realized the bottleneck had shifted to WAL. Each node had its own massive write-ahead log. No time for experiments—so we moved the WAL to tmpfs during autovacuum execution.
pid | duration | wait_event | mode | database | table | phase | table_size | total_size | scanned | scanned_pct | vacuumed | vacuumed_pct
--------|-----------------|---------------------|-----------------|----------|--------------|---------------|------------|------------|---------|-------------|----------|------------
2500393 | 00:02:33.715813 | LWLock:WALWrite | wraparound | postgres | usertable_5 | scanning heap | 12 TB | 13 TB | 816 GB | 6.6 | 0 bytes | 0.0
2500397 | 00:02:32.790738 | IO:DataFileWrite | wraparound | postgres | usertable_12 | scanning heap | 12 TB | 13 TB | 1141 GB | 9.2 | 0 bytes | 0.0
2500398 | 00:02:31.788569 | LWLock:WALWrite | wraparound | postgres | usertable_19 | scanning heap | 12 TB | 13 TB | 813 GB | 6.6 | 0 bytes | 0.0
2500407 | 00:02:30.788247 | | wraparound | postgres | usertable_33 | scanning heap | 12 TB | 13 TB | 814 GB | 6.6 | 0 bytes | 0.0
2500408 | 00:02:29.788889 | LWLock:WALWrite | wraparound | postgres | usertable_26 | scanning heap | 12 TB | 13 TB | 813 GB | 6.6 | 0 bytes | 0.0
2500422 | 00:02:28.786883 | | wraparound | postgres | usertable_40 | scanning heap | 12 TB | 13 TB | 812 GB | 6.5 | 0 bytes | 0.0
2500426 | 00:02:27.785951 | IO:WALWrite | wraparound | postgres | usertable_47 | scanning heap | 12 TB | 13 TB | 811 GB | 6.6 | 0 bytes | 0.0
2500427 | 00:02:26.785828 | | wraparound | postgres | usertable_54 | scanning heap | 12 TB | 13 TB | 811 GB | 6.6 | 0 bytes | 0.0
2500432 | 00:02:25.785364 | IO:DataFileRead | wraparound | postgres | usertable_61 | scanning heap | 12 TB | 13 TB | 440 GB | 3.6 | 0 bytes | 0.0
2500438 | 00:02:24.732249 | | wraparound | postgres | usertable_68 | scanning heap | 12 TB | 13 TB | 438 GB | 3.6 | 0 bytes | 0.0
(10 rows)
That did the trick. In 20 hours, everything got vacuumed, the freezes thawed, the data was ready for testing—and our mood vastly improved.
January 17. Three days left. Time to make some graphs
Graphs are wonderful. They might mean absolutely nothing on their own, but they always look fantastic.
We re-ran our YahooSimpleSelect test with various thread counts, this time limiting ourselves to just the diagonal case: reading and writing on the same node.

You can draw your own conclusions, but here’s one fact: server 3, which gave us the most trouble, posted the highest TPS in this test. Go figure. We still don’t understand how, but it was the fastest at reads, while server 6 led on writes.
Another observation: after 80 workers, performance plateaued, and CPU usage never exceeded 60%.

Next up: YahooSimpleUpdate. Once again, planck-6 surprised us with double the TPS of any other node. Weird. We’re chalking it up to hardware anomalies. But let’s be honest—when you buy matching servers, you expect matching performance. This… was not that.
Now for the main event — go-ycsb. We started with a basic read-only test and, once again, saw variation in load distribution.

But overall, our measurements showed pretty linear scaling. We saw TPS growth up to 1,500 workers—even on modest 16-core Intel Silver CPUs.

Across all tested workload profiles, performance was fairly consistent...

... except for one case: the 50/50 workload clearly showed degradation.

That’s something we’re still digging into.
Final thoughts: what we learned
Everyone loves numbers, so here are ours:
1,200 messages in our internal project chat.
23 hours of overtime during the holiday season. We normally frown on that — but adrenaline and enthusiasm got the better of us.
An unknown number of bad RAM sticks (we lost count).
1 dead fan — it gave its life to circulate hot air and died an unsung hero.
~100 GB of logs, metrics, and results saved for future analysis.
What would we do differently?
Plan for more buffer — both in terms of time and disk space. We actually asked for more storage, but the hardware was limited to 10-disk 1U boxes. No room for expansion.
Choose your benchmark wisely. Most benchmarks are built for quick tests on small datasets. If you’re aiming for petabyte-scale loads, you’ll need something designed for that kind of volume.
Weird hardware anomalies. We never got to the bottom of what was going on with servers 3 and 6. You can blame Schrödinger, quantum noise, or cosmic rays — but it really highlights the need for someone deeply immersed in system internals on your team.
Two big takeaways:
The RAS daemon is amazing — we're not deploying without it ever again.
Mind the relationship between ext4 and RAID configs — mismatches can kill performance.
We could list more about Shardman here, but unless you’re working directly on it, that’d be boring.
Was it worth it? Absolutely. The knowledge we gained was priceless, and the experiment didn’t break the bank. We came out smarter, with new tools, and a deeper understanding of how to tackle an entirely new class of problems.
Oh, and we published our data loader for Shardman on GitHub — just in case you feel like repeating our beautiful chaos.