If you've ever run multiple instances of PostgreSQL or other software on a single machine (whether virtual or physical), you've probably encountered the "noisy neighbor" effect: instances disrupting each other's performance. So, how do you make them get along? We've got the answer!
The Noisy Neighbor Problem
Running multiple instances of PostgreSQL or other applications on a single machine often leads to resource contention, where applications compete for CPU, memory, and I/O, hindering proper operation. We’re familiar with this issue and want to share our methods for addressing it. The short answer is to run your application in a container.
Containers
Advantages:
User-friendly interface.
Ability to download pre-built images from Docker Hub or build custom ones for deployment.
Disadvantages:
Docker containers are not always easy to integrate into certified environments.
They require ongoing management and maintenance.
They can add overhead under heavy network loads.
Control groups (cgroups)
On Linux, built-in control groups (cgroups) allow effective resource management out of the box. The mechanism is well-optimized and adds virtually no overhead.
cgroups enable:
Restricting and managing OS process resources, such as CPU, memory, and I/O devices.
Grouping processes based on task type.
Running PostgreSQL in temporary control groups with time-limited resource constraints that are removed upon process completion.
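For that last point, systemd-run can create a transient scope whose limits vanish when the command exits. A minimal sketch, assuming a hypothetical one-off pg_dump task; the database name and the 2G/200% values are illustrative:
# Run a one-off task in a transient scope with its own resource limits;
# the scope and its constraints are removed once the command finishes
systemd-run --scope -p MemoryMax=2G -p CPUQuota=200% \
    pg_dump -p 5432 -d mydb -f /tmp/mydb.dump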
Key features of cgroups architecture:
cgroups v2, unlike v1, enforces a strict hierarchy: each process can belong to only one control group.
The root group is located at /sys/fs/cgroup.
Parent group controllers affect child groups: disabling a controller in the parent also disables it in child groups. The same applies to restrictions.
Management is possible via systemd, offering a user-friendly interface and flexible limitations.
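The same hierarchy can be inspected and managed without systemd through the cgroup v2 pseudo-filesystem. A minimal sketch; the group name demo is made up for illustration:
# List controllers available in the root group
cat /sys/fs/cgroup/cgroup.controllers
# Let child groups use the cpu and memory controllers
echo "+cpu +memory" > /sys/fs/cgroup/cgroup.subtree_control
# Create a child group, cap its memory at 1 GiB, and move the current shell into it
mkdir /sys/fs/cgroup/demo
echo 1G > /sys/fs/cgroup/demo/memory.max
echo $$ > /sys/fs/cgroup/demo/cgroup.procs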
How to configure cgroups
There is no universal recipe for setting resource limits, as configurations depend on numerous factors. However, we’ll outline basic principles using examples so you can apply them effectively.
Example setup:
Database server: Debian 12 (2 sockets, 4 cores per socket, 16 GB RAM)
PostgreSQL instances: 2 (PostgreSQL 16)
DBA: 1
Tools for managing and monitoring cgroups:
systemd-cgtop – like top, but for control groups.
systemd-cgls – displays the hierarchical structure of control groups.
systemctl status postgresql@16-main.service – shows unit status and applied limitations.
Example 1
Let's check the output of the systemd-cgtop utility with two PostgreSQL instances, main and second. They consume 2.6 GB and 2.7 GB of memory, respectively; the output also shows how many CPU cores each group uses.

Step 1. We can view the current memory cap for PostgreSQL instances managed by systemd using the systemctl show -p MemoryMax <unit_name> command:
# systemctl show -p MemoryMax postgresql@16-main.service
MemoryMax=16661413888
# systemctl show -p MemoryMax postgresql@16-second.service
MemoryMax=16661413888
This means each instance may currently use up to 16661413888 bytes, about 15.5 GiB, effectively all of the machine's memory.
Step 2. We now set a memory limit of 40% of physical memory for both instances using the systemctl set-property <unit_name> MemoryMax=40% command:
# systemctl set-property postgresql@16-main.service MemoryMax=40%
# systemctl show -p MemoryMax postgresql@16-main.service
MemoryMax=6664564736
# systemctl set-property postgresql@16-second.service MemoryMax=40%
# systemctl show -p MemoryMax postgresql@16-second.service
MemoryMax=6664564736
The new cap is around 6.2 GiB; at this point the instance uses only 22.9 MB (essentially idle), versus the 15.5 GiB it could previously claim.

Step 3. We’ll generate different workloads for each instance using this script:
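-- Allow up to 1 GB per sort/hash operation, then join large generated
-- series so each backend builds big in-memory structures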
set work_mem = '1GB';
select * from generate_series(1,1000000) as a
inner join generate_series(1,1000000) as b on a=b
inner join generate_series(1,10000000) as c on b=c;
Step 4. We execute the script using pgbench:
pgbench -c 10 -C -T 1000 -f load_script.sql -n -p 5432 # Instance: main
pgbench -c 5 -C -T 1000 -f load_script.sql -n -p 5433 # Instance: second
After some time, we notice that the second instance stays within limits, consuming 2.7GB without issues.

But what about main? It hit the memory limit and was killed by the Out of Memory (OOM) killer.
dmesg output:
[2928950.287414] memory: usage 6508364kB, limit 6508364kB
[2928950.287490] oom-kill:constraint=CONSTRAINT_MEMCG,...,task=postgres,pid=301415,uid=114
[2928950.287507] Memory cgroup out of memory: Killed process 301415 (postgres)...
The Memory cgroup out of memory message confirms the process was terminated within its cgroup. To have the kernel treat the group as an indivisible unit, consider enabling group OOM kill (the memory.oom.group setting in cgroup v2), which terminates the entire cgroup when its memory limit is exceeded.
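A minimal sketch of enabling it; the cgroup path is resolved at run time rather than hard-coded, since it depends on the distribution's unit layout:
# Resolve the unit's cgroup path and enable whole-group OOM kill
cgpath=/sys/fs/cgroup$(systemctl show -p ControlGroup --value postgresql@16-main.service)
echo 1 > "$cgpath/memory.oom.group"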
If you also set MemoryHigh below MemoryMax, memory is reclaimed earlier, either by swapping anonymous pages or by writing file pages back to their files, instead of anything being killed, which makes MemoryHigh the gentler, preferred option.
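For example, with illustrative values:
# Soft limit: above 5G the kernel throttles the group and reclaims memory,
# but kills nothing; the hard limit stays as a last-resort backstop
systemctl set-property postgresql@16-main.service MemoryHigh=5G
systemctl set-property postgresql@16-main.service MemoryMax=6G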
Example 2
Let's apply CPU restrictions on the same PostgreSQL instances.
pgbench -c 30 -j 30 -T 1000 --progress 5 -p 5432 -n --protocol extended
pgbench -c 30 -j 30 -T 1000 --progress 5 -p 5433 -n --protocol extended
Step 1. We run the benchmarks. Initially, both instances show similar transactions per second (TPS) rates.

Step 2. We reduce CPU time for main by half:
systemctl set-property postgresql@16-main.service CPUQuota=50%
After 65 seconds, the TPS for main drops by half, while second continues normally.

Step 3. We reassign the unused 50% CPU time from main to second:
systemctl set-property postgresql@16-second.service CPUQuota=150%
By the 100-second mark, TPS for second increases, leveraging the extra CPU time.

For better balancing, we can use the CPUWeight property, which sets relative CPU-time priorities across control groups instead of hard caps.
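A minimal sketch with an illustrative 2:1 split:
# Under contention, main gets roughly twice the CPU time of second;
# when the machine is idle, neither instance is capped
systemctl set-property postgresql@16-main.service CPUWeight=200
systemctl set-property postgresql@16-second.service CPUWeight=100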
Example 3
Let's consider binding process groups to a specific NUMA node. In our configuration, cores zero through three are on node zero, and cores four through seven are on node one:
# numactl -H
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3
node 0 size: 7933 MB
node 0 free: 204 MB
node 1 cpus: 4 5 6 7
node 1 size: 7955 MB
node 1 free: 4926 MB
node distances:
node   0   1
  0:  10  20
  1:  20  10
Step 1. We generate the load and use pidstat to check which CPUs the PostgreSQL processes are running on:
# pidstat -C "postgres" -u -G "postgres"

PostgreSQL utilizes all available cores, which may hurt performance on large systems due to uneven memory access times.
Step 2. We assign CPUs to main:
# systemctl set-property postgresql@16-main.service AllowedCPUs=0-3
Step 3. We launch the load.

We see that the load now runs only on node zero's CPUs, meaning we've bound the instance's processes to a specific node. The same can be done for memory by setting the AllowedMemoryNodes property: set it to zero, and memory will be allocated only on node zero.
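Putting the two together, a minimal sketch that pins both scheduling and allocation to node zero:
# Keep main's processes and its memory allocations on NUMA node 0
systemctl set-property postgresql@16-main.service AllowedCPUs=0-3 AllowedMemoryNodes=0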
Monitoring
The cgroups pseudo-filesystem allows monitoring through the following files:
*.pressure (PSI, Pressure Stall Information)
*.stat
*.current
*.events
PSI (Pressure Stall Information)
PSI is a monitoring mechanism that helps identify where performance is being lost. It reports how long processes have been stalled waiting for resources (CPU, memory, I/O), averaged over 10-, 60-, and 300-second windows.
Example output:
some avg10=54.82 avg60=13.98 avg300=3.10 total=10452199
full avg10=0.78 avg60=0.21 avg300=0.04 total=217371
some avg10=54.82 shows the percentage of time at least one process was stalled during the observation period. In this case, over a 10-second window, a process was waiting for a resource about 55% of the time (roughly 5.5 seconds).
full avg10=0.78 is the percentage of time all processes in the group were stalled simultaneously, i.e. no work progressed at all.
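A minimal sketch of pulling these numbers for one instance, resolving the cgroup path through systemd rather than hard-coding it:
# Dump the pressure files of the main instance's control group
cgpath=/sys/fs/cgroup$(systemctl show -p ControlGroup --value postgresql@16-main.service)
cat "$cgpath"/{cpu,memory,io}.pressure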
.stat
Files with the .stat extension provide statistical information about the resources (CPU, memory, I/O) used by process groups.
Example output from cpu.stat:
# cat cpu.stat
usage_usec 860802
user_usec 255721
system_usec 605080
nr_periods 287
nr_throttled 0
throttled_usec 0
nr_bursts 0
burst_usec 0
Where:
usage_usec – total CPU time used (in microseconds).
user_usec – time spent in user space.
system_usec – time spent in kernel space.
nr_periods – number of scheduling periods.
nr_throttled – number of periods in which the group was throttled (ran out of CPU quota).
throttled_usec – total time spent in a throttled state.
nr_bursts – number of periods in which burst capacity was used to exceed the quota.
burst_usec – total CPU time consumed above the quota during bursts.
.events
Files with the .events extension provide information about events occurring in a control group. There are event files for memory, memory.swap, misc, and pids.
Example output from memory.events:
# cat memory.events
low 0
high 11899
max 0
oom 0
oom_kill 0
oom_group_kill 0
Where:
low – number of times memory was reclaimed from the group even though its usage was below the low boundary.
high – number of times processes in the group were throttled and forced into direct reclaim because usage exceeded the high boundary.
max – number of times the memory usage hit the maximum limit.
oom – number of times the group hit its limit and the kernel was about to invoke the OOM killer.
oom_kill – number of times the OOM killer terminated a process in this control group.
oom_group_kill – number of times the OOM killer terminated all processes in this control group. If you set memory.oom.group to 1, the OOM killer terminates every process in the group when triggered; for PostgreSQL, this means a full instance restart rather than a single killed backend.
.current
Files with the .current extension contain information about the current amount of resources used, including caches, memory pages, kernel data structures, and more.
Available .current files include memory, memory.swap, memory.zswap (compressed swap), misc, pids, and rdma.
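A quick sketch of checking live usage, reusing the same path-resolution trick as in the PSI example:
# Current memory footprint (bytes) and process count of the group
cgpath=/sys/fs/cgroup$(systemctl show -p ControlGroup --value postgresql@16-main.service)
cat "$cgpath/memory.current" "$cgpath/pids.current"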
Why use cgroups?
With cgroups, you can:
Prevent noisy neighbor issues by prioritizing critical instances and balancing workloads.
Protect against resource abuse.
Dynamically adjust resource limits without restarting services (see the sketch after this list).
Bonus: cgroups are built into Linux by default.
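On the dynamic-adjustment point, set-property takes effect immediately, and the --runtime flag keeps the change from being persisted:
# Applies at once, no PostgreSQL restart; dropped on the next reboot
systemctl set-property --runtime postgresql@16-main.service CPUQuota=75%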
Using cgroups is an effective way to manage PostgreSQL resource allocation, prevent service interference, and optimize system performance.
Have experience with cgroups? Share your thoughts in the comments!