cgroups v2 resource limits with systemd

11 min read - June 3, 2026

Table of contents

cgroups v2 resource limits with systemd
Enabling cgroups v2
How systemd organises cgroups
CPU limits
Memory limits with cgroups v2
I/O limits
Multi-tenant isolation with slices
Monitoring with systemd-cgtop and PSI

Set CPU, memory, and I/O limits with cgroups v2 and systemd. Practical config for multi-tenant Linux hosts, with PSI monitoring and slice isolation.

cgroups v2 is the Linux kernel's unified resource control framework. It replaces the fragmented v1 hierarchy with a single tree that handles CPU, memory, and I/O consistently, and underpins container isolation in Docker, Kubernetes, and systemd. This post covers how to enable cgroups v2, set limits via systemd, and apply it to real multi-tenant hosting scenarios.

Enabling cgroups v2

Modern distributions ship with cgroups v2 enabled by default: Ubuntu 21.10+, Debian 11+, Fedora 31+, and RHEL/Rocky 9+. Older systems may run a hybrid hierarchy or still default to v1. Check with:

stat -fc %T /sys/fs/cgroup/

Output of cgroup2fs confirms v2 is active. tmpfs typically means v1.
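For provisioning scripts, the same check can be wrapped in a small helper. This is a minimal sketch: the function names classify_cgroup_fstype and cgroup_mode are made up for illustration, and it only distinguishes the two filesystem types mentioned above.

```shell
# Map the filesystem type reported by `stat -fc %T` to a cgroup mode.
# classify_cgroup_fstype is a hypothetical helper name, not a system tool.
classify_cgroup_fstype() {
    case "$1" in
        cgroup2fs) echo "v2" ;;
        tmpfs)     echo "v1-or-hybrid" ;;
        *)         echo "unknown" ;;
    esac
}

# Inspect a mount point (default: /sys/fs/cgroup) and print its mode.
cgroup_mode() {
    fstype=$(stat -fc %T "${1:-/sys/fs/cgroup}" 2>/dev/null) || fstype=""
    classify_cgroup_fstype "$fstype"
}
```

Calling cgroup_mode with no arguments inspects /sys/fs/cgroup; a script can then refuse to continue unless the result is v2.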

To switch a hybrid system to pure v2, edit /etc/default/grub and append the following to GRUB_CMDLINE_LINUX_DEFAULT:

systemd.unified_cgroup_hierarchy=1 cgroup_no_v1=all

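Then regenerate the GRUB configuration and reboot. The commands below are for Debian and Ubuntu; on Fedora and the RHEL family, replace update-grub with sudo grub2-mkconfig -o /boot/grub2/grub.cfg: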

sudo update-grub
sudo reboot

For production, run kernel 5.2 or newer so you get the cgroup freezer for v2, and systemd 244+ for full cpuset delegation. On Rocky Linux 8 and RHEL 8 you may also need to enable accounting explicitly by adding these lines to /etc/systemd/system.conf:

DefaultCPUAccounting=yes
DefaultMemoryAccounting=yes
DefaultIOAccounting=yes

Apply the change with sudo systemctl daemon-reexec. Once the host is back up on the unified hierarchy, verify which controllers are available:

cat /sys/fs/cgroup/cgroup.controllers

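You should see cpu, memory, io, and pids listed, usually alongside others such as cpuset. A controller only takes effect in child cgroups once it is enabled in the parent's cgroup.subtree_control. systemd does this automatically for the units it manages, so on a systemd host you rarely need to touch the file. If you manage part of the hierarchy by hand, the manual equivalent at the root is: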

echo "+cpu +memory +io" | sudo tee /sys/fs/cgroup/cgroup.subtree_control
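Before relying on a controller, it can help to confirm that it is actually listed. The sketch below assumes the single-line, space-separated format of cgroup.controllers; missing_controllers is a hypothetical helper name.

```shell
# Print each required controller that is absent from a controllers file.
# $1: path to a cgroup.controllers-style file; remaining args: required names.
missing_controllers() {
    file=$1
    shift
    # Pad with spaces so "cpu" does not match inside "cpuset".
    available=" $(cat "$file") "
    for ctrl in "$@"; do
        case "$available" in
            *" $ctrl "*) ;;          # present, nothing to report
            *) echo "$ctrl" ;;        # absent
        esac
    done
}
```

For example, missing_controllers /sys/fs/cgroup/cgroup.controllers cpu memory io pids prints nothing when all four are available.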

For a thorough tour of how v2 differs from v1 internally, Michael Kerrisk's NDC TechTown talk is the best single resource.

How systemd organises cgroups

systemd creates a cgroup for every service it starts, named after the unit. nginx.service gets /sys/fs/cgroup/system.slice/nginx.service/, and every process it spawns lives inside that cgroup. Three unit types map directly to the hierarchy:

Unit type	Role	Description
`.slice`	Inner node	Groups related services and defines shared limits
`.service`	Terminal node	Manages processes started by systemd
`.scope`	Leaf node	Tracks processes started externally (container payloads, login sessions)

Four default slices ship out of the box: -.slice (root), system.slice, user.slice, and machine.slice. A limit applied to a slice caps the combined usage of every unit inside it; it is a shared budget, not a per-service allowance.

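One v2 rule worth remembering: processes live in leaf nodes. Outside the root, a cgroup that enables controllers for its children cannot host processes itself (the "no internal processes" rule), which is why systemd never places processes directly in a slice, only in the services and scopes beneath it.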

Always set limits through systemd rather than writing to /sys/fs/cgroup/ directly. Manual writes don't persist across reboots and conflict with systemd's ownership of the hierarchy. Use systemctl set-property for live changes (it persists by default; add --runtime for a change that should last only until the next reboot) and unit drop-ins (systemctl edit nginx.service) for configuration you want to review and keep alongside the unit.

CPU limits

cgroups v2 gives you two CPU controls: a hard cap (cpu.max, exposed as CPUQuota in systemd) and a proportional weight (cpu.weight / CPUWeight).

CPUQuota is an absolute ceiling. CPUQuota=50% allows half a core; CPUQuota=200% allows two full cores worth of time. The service is throttled if it tries to go higher, regardless of how idle the rest of the CPU is.
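To see how a percentage maps onto the kernel interface, the sketch below converts a CPUQuota value into the "quota period" pair that cpu.max holds, both in microseconds. It assumes the default 100 ms period; quota_to_cpu_max is a hypothetical helper, not a systemd command.

```shell
# Convert a CPUQuota percentage into the "quota period" pair used by cpu.max.
# $1: quota in percent (e.g. 200); $2: period in microseconds (default 100000).
quota_to_cpu_max() {
    pct=$1
    period=${2:-100000}
    echo "$(( pct * period / 100 )) $period"
}
```

With the default period, quota_to_cpu_max 200 yields 200000 100000: the cgroup may consume 200 ms of CPU time in every 100 ms window, which is two cores' worth.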

CPUWeight only matters under contention. The range is 1 to 10,000, default 100. Three services with weights of 150, 100, and 50 receive roughly 50%, 33%, and 17% of CPU time when they all want it at once. When the CPU is otherwise idle, weights don't constrain anything.
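The arithmetic behind those percentages is each weight divided by the sum of the weights of all busy siblings. A small sketch, with weight_shares as a hypothetical helper name:

```shell
# Print the CPU share each sibling gets when all of them are busy at once.
# Arguments: one CPUWeight value per sibling cgroup.
weight_shares() {
    [ "$#" -gt 0 ] || return 1
    printf '%s\n' "$@" | awk '
        { w[NR] = $1; total += $1 }
        END {
            for (i = 1; i <= NR; i++)
                printf "%d -> %.1f%%\n", w[i], 100 * w[i] / total
        }
    '
}
```

Running weight_shares 150 100 50 reproduces the split described above. The shares only hold among cgroups that share a parent and are all competing for CPU at the same moment.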

For latency-sensitive workloads, pin processes to specific cores with AllowedCPUs=. This stops the scheduler from migrating them across cores and keeps the per-core caches warm:

[Service]
CPUQuota=200%
CPUWeight=150
AllowedCPUs=0-3

Use a hard quota when you need predictable cost (multi-tenant billing, noisy-neighbour isolation). Use weights when you want maximum hardware utilisation and just need priority ordering during spikes.

Memory limits with cgroups v2

Memory has two tiers: memory.high (soft, throttling) and memory.max (hard, OOM). For background on swap, page reclaim, and the kernel OOM killer, see our companion post on Linux memory management.

Set memory.high roughly 10 to 20% below memory.max. The kernel starts reclaiming pages and throttling allocations once memory.high is crossed, which usually lets the workload recover before the OOM killer fires. If usage reaches memory.max and reclaim cannot bring it back down, the kernel invokes the OOM killer inside the cgroup.
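When generating unit files from a template, the soft limit can be derived from the hard one. This is a sketch under the rule of thumb above; memory_high_mib is a hypothetical helper and the 15% default headroom is an arbitrary midpoint, not a recommendation from systemd or the kernel.

```shell
# Suggest a MemoryHigh value (MiB) that sits a given percentage below MemoryMax.
# $1: MemoryMax in MiB; $2: headroom in percent (default 15).
memory_high_mib() {
    max=$1
    headroom=${2:-15}
    # Integer arithmetic, so the result rounds down.
    echo $(( max * (100 - headroom) / 100 ))
}
```

For a 4096 MiB hard limit, memory_high_mib 4096 gives 3481, which would go into the unit as MemoryHigh=3481M.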

A typical configuration:

[Service]
MemoryHigh=400M
MemoryMax=512M
MemorySwapMax=0

MemorySwapMax=0 disables swap for this cgroup. Worth doing for latency-sensitive workloads (databases, real-time streaming) where swap I/O would tank tail latency.

For worker pools where leaving orphaned siblings behind would corrupt shared state, set OOMPolicy=kill in the unit's [Service] section. systemd then sets memory.oom.group to 1 on the service's cgroup, so when one process is OOM-killed, the kernel kills every process in the cgroup together.

Check memory.events to see how often a service has been throttled or OOM-killed:

cat /sys/fs/cgroup/system.slice/nginx.service/memory.events

The high and oom_kill counters tell you whether your limits are sized correctly. Both are cumulative since the cgroup was created, so watch the rate of change: counters that keep climbing mean the workload needs more headroom.
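For scripted checks, a single counter can be pulled out of the file. The sketch assumes the "name value" per-line layout of memory.events; memory_event is a hypothetical helper name.

```shell
# Print one counter from a memory.events-style file.
# $1: path to the file; $2: counter name (e.g. high, max, oom_kill).
memory_event() {
    awk -v key="$2" '$1 == key { print $2 }' "$1"
}
```

Sampling memory_event /sys/fs/cgroup/system.slice/nginx.service/memory.events oom_kill at two points in time and comparing the values shows whether kills are still happening or are only historical.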

I/O limits

The I/O controller has the same two-mode design: absolute caps via io.max and proportional sharing via io.weight.

Limits are per block device. The kernel identifies devices by major:minor numbers (find them with lsblk -o NAME,MAJ:MIN), while systemd accepts a device node path and resolves it for you. A typical systemd config:

[Service]
IOReadBandwidthMax=/dev/sda 50M
IOWriteBandwidthMax=/dev/sda 30M
IOReadIOPSMax=/dev/sda 1000
IOWriteIOPSMax=/dev/sda 500
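To confirm what ended up in the kernel, read the cgroup's io.max file, which lists one line per device in the form "MAJ:MIN rbps=... wbps=... riops=... wiops=...", with max meaning unlimited. The sketch below extracts a single value; io_limit is a hypothetical helper name.

```shell
# Print one limit for one device from an io.max-style file.
# $1: path to the file; $2: device as major:minor; $3: key (rbps, wbps, riops, wiops).
io_limit() {
    awk -v dev="$2" -v key="$3" '
        $1 == dev {
            for (i = 2; i <= NF; i++) {
                split($i, kv, "=")
                if (kv[1] == key) print kv[2]
            }
        }
    ' "$1"
}
```

For example, io_limit /sys/fs/cgroup/system.slice/nginx.service/io.max 8:0 riops prints the read IOPS cap for device 8:0, or nothing if no limit line exists for that device.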

io.weight works like cpu.weight: range 1 to 10,000, default 100. Assigning 500 to a customer-facing service and 50 to a nightly backup keeps the backup from saturating the disk during peak hours, but lets it use full bandwidth when nothing else needs it.

I/O limits only apply when you target the right device. The kernel tracks I/O by block device, so a limit on /dev/sda does nothing for I/O going to /dev/nvme0n1. On hosts with multiple disks, set limits per device.

Multi-tenant isolation with slices

For shared environments, define a slice per tenant. Create /etc/systemd/system/tenant-a.slice:

[Slice]
CPUQuota=200%
CPUWeight=150
MemoryHigh=3584M
MemoryMax=4096M
MemorySwapMax=0
IOReadBandwidthMax=/dev/sda 200M
TasksMax=512

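TasksMax=512 caps the total number of processes and threads, which stops a fork bomb in one tenant from taking down the host. Assign tenant services to this slice (via Slice=tenant-a.slice in their unit files) and they share these limits as a group. Note that systemd treats dashes in slice names as hierarchy separators, so tenant-a.slice is created as a child of an implicit tenant.slice.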
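systemd encodes a slice's position in the tree in its name: each dash marks one level of nesting, so a slice called tenant-a.slice sits inside tenant.slice. The sketch below derives the cgroup path, relative to /sys/fs/cgroup, from a slice name. slice_cgroup_path is a hypothetical helper, and it ignores escaped dashes (\x2d) in names.

```shell
# Print the cgroup path (relative to /sys/fs/cgroup) for a slice unit name.
# $1: slice name, e.g. tenant-a.slice
slice_cgroup_path() {
    name=${1%.slice}
    # The root slice "-.slice" maps to the top of the tree: print nothing.
    [ "$name" = "-" ] && return 0
    path=""
    prefix=""
    old_ifs=$IFS
    IFS=-
    # Walk the dash-separated components, growing the prefix one piece at a time.
    for part in $name; do
        prefix=${prefix:+$prefix-}$part
        path=${path:+$path/}$prefix.slice
    done
    IFS=$old_ifs
    echo "$path"
}
```

slice_cgroup_path tenant-a.slice prints tenant.slice/tenant-a.slice, which is the directory to look in when reading that tenant's cgroup files by hand.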

This pattern also works for separating noisy background work from user-facing services. Put backups, log rotation, and batch jobs in a background.slice with low CPUWeight and IOWeight values. They get full resources when the system is idle and step aside when production traffic arrives.

For container runtimes like Docker and Podman, add Delegate=yes to their systemd unit files. This lets them manage their own sub-cgroups without root, and the limits set on the parent slice still apply to everything underneath.

Monitoring with systemd-cgtop and PSI

For a live top-style view of CPU, memory, and I/O per cgroup, run:

systemd-cgtop

For the static hierarchy and which processes live where, use systemd-cgls.

The single most useful v2 feature for production monitoring is Pressure Stall Information (PSI). PSI reports the percentage of time tasks in a cgroup were stalled waiting for a resource, exposed in three files per cgroup. Because tenant-a.slice sits under tenant.slice in the hierarchy (systemd nests slices by the dashes in their names), the paths are:

cat /sys/fs/cgroup/tenant.slice/tenant-a.slice/cpu.pressure
cat /sys/fs/cgroup/tenant.slice/tenant-a.slice/memory.pressure
cat /sys/fs/cgroup/tenant.slice/tenant-a.slice/io.pressure

A CPU at 100% utilisation with 0% pressure is healthy. Every task that wants CPU is getting it. The same CPU at 80% utilisation but 30% pressure means tasks are queueing for runtime. Alert on PSI, not on utilisation: it catches contention that utilisation metrics miss completely.
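A PSI file has a "some" line and a "full" line, each with avg10, avg60, and avg300 percentages and a cumulative total. The sketch below reads the 10-second average and turns it into an exit status suitable for a cron job or health check. The names psi_avg10 and psi_exceeds are hypothetical, and any threshold you pass is something to tune per workload.

```shell
# Print the avg10 value from a PSI file.
# $1: path to a *.pressure file; $2: line to read, "some" (default) or "full".
psi_avg10() {
    awk -v kind="${2:-some}" '$1 == kind { sub(/^avg10=/, "", $2); print $2 }' "$1"
}

# Succeed (exit 0) when the "some" avg10 value exceeds a threshold in percent.
# $1: path to a *.pressure file; $2: threshold, e.g. 25
psi_exceeds() {
    awk -v limit="$2" '
        $1 == "some" { sub(/^avg10=/, "", $2); found = ($2 + 0 > limit + 0) }
        END { exit (found ? 0 : 1) }
    ' "$1"
}
```

A check such as psi_exceeds /sys/fs/cgroup/system.slice/nginx.service/cpu.pressure 25 && echo "CPU pressure high" can feed whatever alerting you already run.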

Adjust limits live without restarting anything:

sudo systemctl set-property tenant-a.slice MemoryMax=6144M

The change applies immediately and persists across reboots. Combined with PSI-based alerting, this lets you respond to load shifts before they turn into OOM kills or runaway latency.

If you're running high-density multi-tenant workloads and need a host with the headroom to apply these policies cleanly, our dedicated servers are built for it.
