A year-long hunt for a Linux kernel bug, or the unexpected zeros from XFS

You’ve probably had this happen too: a service runs smoothly, keeps users happy with its stability and performance, and your monitoring stays reassuringly green. Then, the next moment — boom, it’s gone. You panic, dive into the error logs, and find either a vague segfault or nothing at all. What to do is unclear, and production needs saving, so you bring it back up — and everything works just like before. You try to investigate what happened, but over time you switch to other tasks, and the incident fades into the background or gets forgotten entirely.
That’s all fine when you’re on your own. But once you have many customers, sooner or later you start feeling that something isn’t right, and that you need to dig into these spikes of entropy to find the root cause of such incidents.
This article describes our year-long investigation. You’ll learn why PostgreSQL (and any other application) can crash because of a bug in the Linux kernel, what XFS has to do with it, and why memory reclamation might not be as helpful as you thought.















