
Linux Memory Management: Swap, OOM Killer & Cgroups

12 min read - May 31, 2026

Table of contents

Linux memory management explained: swap, OOM killer, and cgroups
How Linux manages memory pages
Configuring swap
The OOM killer
Cgroups and memory limits
Memory configuration by server role


How Linux swap, the OOM killer, and cgroups work together — with configuration examples for databases, web servers, and multi-tenant VPS hosts.


Linux memory management explained: swap, OOM killer, and cgroups

Linux handles memory differently than most operating systems. High RAM usage isn't always a problem — the kernel actively uses free memory for caching to speed up disk reads. But when real memory pressure builds, three mechanisms do the work: swap, the OOM killer, and cgroups. Understanding how each one behaves, and how to configure them, is the difference between a server that degrades gracefully under load and one that crashes without warning.

How Linux manages memory pages

Every process runs in its own virtual address space, up to 128 TiB of user space on x86-64 with 4-level page tables. The kernel maps these virtual addresses to physical RAM through page tables, with the Translation Lookaside Buffer (TLB) caching recent lookups. A TLB hit takes around 1 nanosecond; a miss costs 20–100 nanoseconds, which adds up in memory-intensive workloads like databases.

Physical memory is divided into 4 KB pages, and the kernel splits them into two categories:

File-backed pages — tied to files on disk. The kernel can discard clean ones or flush dirty ones without needing swap.
Anonymous pages — heap and stack memory with no backing file. These must be written to swap before the kernel can free them.

On servers with high memory demand, a large proportion of anonymous pages means swap gets involved early. Watch the si (swap in) and so (swap out) columns in vmstat 1 — persistent non-zero values are your first warning that the system is under pressure.

For monitoring, use the right tool for the job:

Tool	Best for	Key metric
`free -h`	Quick system-wide snapshot	`available` column
`vmstat 1`	Real-time swap and I/O monitoring	`si`, `so`
`htop`	Interactive per-process view	Memory bars, process list
`smem`	Accurate per-process usage	USS (Unique Set Size)
`/proc/meminfo`	Kernel-level detail	`MemAvailable`, `Dirty`, `Slab`

One common mistake: watching the free column in free -h and panicking. The available column is what matters. It includes memory the kernel can reclaim from cache on demand. A server showing only 512 MB free but 5 GB available is not in trouble.

When free memory falls below a zone's low watermark, the kernel's kswapd daemon starts reclaiming pages in the background. If allocations outpace it and free memory reaches the min watermark, the allocating process has to perform direct reclaim itself and blocks until pages are freed. This is where latency spikes come from. Set an alert when MemAvailable falls below 10–15% of total RAM so you have time to respond.
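As a minimal sketch of that alert, the functions below compute MemAvailable as a percentage of MemTotal and compare it to a threshold. The function names and the 15% default are illustrative choices, not part of any standard tool; the optional file argument exists only so the logic can be tried against a saved copy of /proc/meminfo.

```shell
# Print MemAvailable as a whole-number percentage of MemTotal.
# Reads /proc/meminfo by default; pass another file in the same format to test.
mem_available_pct() {
    local meminfo="${1:-/proc/meminfo}"
    awk '
        $1 == "MemTotal:"     { total = $2 }
        $1 == "MemAvailable:" { avail = $2 }
        END {
            if (total > 0) printf "%d\n", avail * 100 / total
            else exit 1
        }
    ' "$meminfo"
}

# Return 0 (alert) when availability is below the threshold (default 15%).
mem_alert() {
    local threshold="${1:-15}" meminfo="${2:-/proc/meminfo}"
    local pct
    pct="$(mem_available_pct "$meminfo")" || return 2
    [ "$pct" -lt "$threshold" ]
}
```

Called from cron or a monitoring agent, something like `mem_alert 15 && echo "low memory"` would fire only when less than 15% of RAM is still available.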

Configuring swap

Swap is a disk area — either a partition or a file — where the kernel moves inactive anonymous pages when RAM fills up. The speed gap is significant: DDR4 RAM has roughly 100 ns latency, while NVMe SSDs are around 100,000 ns and SATA SSDs closer to 500,000 ns. Swap is a safety buffer, not extra RAM. A server that consistently relies on swap has a memory problem that more swap won't fix.

Use a swap file rather than a partition. It's easier to resize and doesn't require repartitioning.

sudo fallocate -l 2G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile

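Add the file to /etc/fstab to persist across reboots. The chmod 600 step is required: any data paged out of RAM is readable from swap, so the file must not be world-readable. If swapon rejects a file created with fallocate (this depends on the filesystem), create the file with dd instead.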
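One way to add the fstab entry without duplicating it on repeated runs is sketched below. The function name is hypothetical; the entry format (`none swap sw 0 0`) is the conventional one for a swap file, and the second argument lets you try it on a copy of fstab before touching the real file.

```shell
# Append a swap entry to an fstab file unless one for this path already exists.
# The target defaults to /etc/fstab (needs root); pass another path to dry-run.
ensure_swap_fstab() {
    local swapfile="${1:-/swapfile}" fstab="${2:-/etc/fstab}"
    # Look for an existing line whose first field is the swap file
    # and whose third field (filesystem type) is "swap".
    if awk -v f="$swapfile" \
        '$1 == f && $3 == "swap" { found = 1 } END { exit !found }' "$fstab"; then
        return 0
    fi
    printf '%s none swap sw 0 0\n' "$swapfile" >> "$fstab"
}
```

After a reboot, `swapon --show` and `free -h` should both list the swap file if the entry was picked up.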

After creating swap, tune vm.swappiness. The default of 60 is aggressive. For most hosting workloads you want the kernel to prefer RAM and only use swap as a last resort:

Server role	`vm.swappiness`	`vm.vfs_cache_pressure`
General web server	10–20	50
Database (MySQL/PostgreSQL)	1–5	50
Default (most distros)	60	100

For swap sizing: 1–2 GB is enough for a 2 GB VPS handling occasional traffic spikes. On systems with 8 GB or more, a fixed 2–4 GB swap is generally sufficient. The goal is to give the kernel a pressure valve for cold pages, not to extend total addressable memory.

On RAM-constrained servers with ample CPU, zram creates a compressed swap area in memory, avoiding disk I/O entirely. It's worth considering on multi-tenant VPS hosts where NVMe is shared across tenants. Watch for I/O contention if swap lives on the same device as database files — heavy swapping and high-throughput disk writes don't coexist well.

The OOM killer

When the kernel exhausts RAM and swap and can't reclaim enough memory through other means, the OOM killer steps in. It scores processes using the oom_badness() function:

points = (rss_anon + rss_file + rss_shmem + swapents + pgtables_pages) + (oom_score_adj × totalpages / 1000)

The process with the highest score gets killed. The formula favors large memory consumers, and oom_score_adj shifts the result up or down by a fraction of total memory. The kernel also avoids killing several processes in quick succession: while a previously selected victim is still exiting and releasing its memory, it does not pick another one.
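To get a feel for how much oom_score_adj moves the result, the formula can be worked through with plain integer arithmetic. This is only an approximation of the scoring above, with a hypothetical function name; all inputs are page counts (4 KB each) except the adjustment, and `rss` stands for the sum of the three rss_* terms.

```shell
# Approximate the OOM score formula with integer arithmetic.
# Usage: oom_points RSS_PAGES SWAP_PAGES PGTABLE_PAGES OOM_SCORE_ADJ TOTAL_PAGES
oom_points() {
    local rss="$1" swapents="$2" pgtables="$3" adj="$4" totalpages="$5"
    # The adjustment is scaled by totalpages / 1000, computed with integer division.
    echo $(( rss + swapents + pgtables + adj * (totalpages / 1000) ))
}

# Example: a 1 GiB process (262144 pages) on a 4 GiB host (1048576 pages).
# oom_points 262144 0 512 0    1048576   # prints 262656
# oom_points 262144 0 512 200  1048576   # prints 472256
# oom_points 262144 0 512 -900 1048576   # prints -680544
```

With an adjustment of -900 the same process ends up far below every unadjusted process, which is why a strongly negative value protects a service without making it entirely exempt the way -1000 does.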

Two types of OOM events appear in logs:

Global OOM — the entire system is out of RAM and swap. Logs prefix with Out of memory:
Cgroup OOM — a container or service hit its memory.max limit. Logs prefix with Memory cgroup out of memory:

To review past OOM events:

dmesg -T | grep -i "out of memory"
journalctl -k --grep="oom"
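To separate the two event types at a glance, a small filter can count them by the log prefixes listed above. This is a sketch with a hypothetical function name; it reads kernel log text on stdin and assumes nothing about the rest of each line.

```shell
# Count global vs. cgroup OOM kills in kernel log text read from stdin.
count_oom_events() {
    awk '
        /Memory cgroup out of memory:/ { cgroup++; next }
        /Out of memory:/               { global++ }
        END { printf "global=%d cgroup=%d\n", global, cgroup }
    '
}
```

Typical use would be `dmesg -T | count_oom_events` or `journalctl -k | count_oom_events`. A high cgroup count with no global events points at limits that are too tight rather than a host that is out of memory.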

Pay attention to the order field in OOM logs. A value above 0 suggests memory fragmentation rather than total exhaustion — the kernel couldn't find enough contiguous pages even with free memory available.

You can influence which processes the OOM killer targets by adjusting /proc/<pid>/oom_score_adj. The range is -1000 (never kill) to +1000 (kill first). For systemd-managed services, set this permanently in the unit file:

[Service]
OOMScoreAdjust=-1000

Additional sysctl parameters for tuning OOM behavior:

Parameter	Value	Effect
`vm.overcommit_memory`	0	Default heuristic overcommit mode
`vm.overcommit_memory`	2	Strict mode; prevents allocations exceeding RAM × overcommit_ratio / 100 + swap
`vm.panic_on_oom`	1	Panics the kernel instead of killing a process; the machine reboots only if `kernel.panic` is also set
`vm.oom_kill_allocating_task`	1	Kills the process that triggered OOM rather than the largest consumer

For proactive monitoring, check /proc/pressure/memory (Pressure Stall Information, available since kernel 4.20). Watch the some avg10 value: below 5% is healthy, sustained above 20% means an OOM event is likely coming. A rising allocstall count in /proc/vmstat (reported per zone on recent kernels, for example allocstall_normal) is another early signal: it counts direct reclaim stalls, which often precede OOM kills. systemd-oomd can act on PSI thresholds, and earlyoom on available-memory and free-swap thresholds, before the kernel's OOM killer fires.
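A rough PSI check along those lines might look like the following. The function names and the "elevated" label for the 5–20% band are assumptions made for this sketch; the file argument is there so the parsing can be tried against a saved sample.

```shell
# Print the "some avg10" value from a PSI file (default /proc/pressure/memory).
psi_some_avg10() {
    local psi="${1:-/proc/pressure/memory}"
    awk '$1 == "some" { sub(/^avg10=/, "", $2); print $2 }' "$psi"
}

# Classify pressure: below 5 is healthy, above 20 is critical,
# anything in between is reported as "elevated" (a label chosen here).
psi_status() {
    local value
    value="$(psi_some_avg10 "${1:-}")"
    [ -n "$value" ] || return 1
    awk -v v="$value" 'BEGIN {
        if (v < 5)       print "healthy"
        else if (v > 20) print "critical"
        else             print "elevated"
    }'
}
```

Because avg10 is a ten-second average, a single "critical" reading can be a brief spike; alerting on several consecutive readings is closer to the "sustained" condition described above.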

Cgroups and memory limits

Control groups (cgroups) let you organize processes into groups and enforce hard resource limits. Introduced in Linux 2.6.24, they're the foundation of container runtimes including Docker, Podman, Kubernetes, and LXC. The kernel tracks memory usage per cgroup, covering anonymous memory, file-backed pages, and kernel objects. If a cgroup hits its limit, the kernel reclaims memory within that group or triggers a cgroup-scoped OOM kill.

Cgroup v1 and v2 differ primarily in how they're structured. V1 mounts each controller (memory, CPU, I/O) separately under /sys/fs/cgroup/<controller>/, which leads to inconsistent resource tracking. V2 uses a unified hierarchy at /sys/fs/cgroup/. In Kubernetes, cgroup v2 support became generally available in version 1.25, and cgroup v1 support moved to maintenance mode in 1.31.

To check which version your system uses:

stat -fc %T /sys/fs/cgroup/

cgroup2fs means v2; tmpfs typically means v1.

Feature	Cgroup v1	Cgroup v2
Hierarchy	Multiple, per-controller	Single, unified
Hard memory limit	`memory.limit_in_bytes`	`memory.max`
Soft memory limit	`memory.soft_limit_in_bytes`	`memory.high` (throttles)
Usage tracking	`memory.usage_in_bytes`	`memory.current`
Pressure metrics	Limited	PSI integrated

The key memory controls in cgroup v2:

Parameter	Type	Description
`memory.max`	Hard limit	Usage is capped here; if reclaim cannot keep the cgroup under it, the OOM killer is triggered
`memory.high`	Soft limit	Throttles allocation and triggers reclaim before hitting the hard limit
`memory.low`	Soft protection	Memory below this threshold is reclaimed last
`memory.min`	Hard protection	Memory below this level is never reclaimed
`memory.swap.max`	Swap limit	Set to 0 to disable swap for this cgroup
`memory.oom.group`	Boolean	If enabled, OOM kills all processes in the cgroup together

A practical rule: set memory.high around 10–20% below memory.max to give the kernel room to reclaim before hitting the hard limit. When sizing memory.max, add 20–30% above the application's peak usage to account for page cache, which counts against cgroup memory totals.
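The sizing rule above is simple arithmetic, sketched here as a helper that turns an observed peak into systemd directives. The function name and the defaults (25% headroom, memory.high 15% below memory.max) are illustrative picks from within the ranges given, not recommendations for any particular workload.

```shell
# Derive MemoryMax/MemoryHigh (in MiB) from an application's observed peak usage.
# Usage: memory_limits PEAK_MIB [HEADROOM_PCT] [HIGH_GAP_PCT]
memory_limits() {
    local peak="$1" headroom="${2:-25}" gap="${3:-15}"
    # Hard limit: peak plus headroom for page cache.
    local max=$(( peak * (100 + headroom) / 100 ))
    # Throttle point: a fixed percentage below the hard limit.
    local high=$(( max * (100 - gap) / 100 ))
    printf '[Service]\nMemoryMax=%dM\nMemoryHigh=%dM\n' "$max" "$high"
}

# Example: an application peaking at 800 MiB.
# memory_limits 800
#   [Service]
#   MemoryMax=1000M
#   MemoryHigh=850M
```

The output is in unit-file form, so it can be pasted into a drop-in created with `systemctl edit <service>`.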

Manage cgroups via systemd rather than writing directly to the cgroup filesystem. Use unit file directives like MemoryMax=, MemoryHigh=, and MemoryMin= for persistent limits. For quick tests:

systemd-run --scope -p MemoryMax=512M <command>

For web server worker pools, setting memory.oom.group=1 ensures a clean kill if one worker exceeds its limit — no orphaned processes left behind. For database engines, memory.min protects the buffer pool from being reclaimed under system-wide pressure.

Memory configuration by server role

The right memory settings depend on what the server is doing. Applying the same configuration to a database and a PHP web server will hurt one of them.

Server role	`vm.swappiness`	OOM strategy	Cgroup policy
Database	1–5	Protect (`OOMScoreAdjust=-900`)	Use `memory.min` to protect buffer pool
Web/app server	10–20	Default	Cap per worker pool via `memory.max`
Background worker	60	Killable (`OOMScoreAdjust=+200`)	Throttle via `memory.high`
Multi-tenant VPS	60 (with zram)	Default	Hard isolation per tenant via `memory.max`

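For MySQL, allocate 50–70% of available RAM to innodb_buffer_pool_size; PostgreSQL's counterpart is shared_buffers, for which about 25% of RAM is the commonly recommended starting point. For either engine, disable Transparent Huge Pages to reduce latency spikes, and protect the process with OOMScoreAdjust=-900 in the systemd unit file.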

For PHP-FPM, size worker pools against actual memory usage. Each worker typically uses 30–100 MB. Divide allocated RAM by average worker size to get a safe pm.max_children value. Use memory.max in cgroups to cap the pool.
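The division described above, as a minimal sketch with a hypothetical function name. Both inputs are in MiB, and the result rounds down so the pool never exceeds its allocation.

```shell
# Estimate a safe pm.max_children from the pool's RAM budget and average worker size.
# Usage: fpm_max_children POOL_RAM_MIB AVG_WORKER_MIB
fpm_max_children() {
    local pool_mib="$1" worker_mib="$2"
    [ "$worker_mib" -gt 0 ] || return 1
    echo $(( pool_mib / worker_mib ))
}

# Example: 2 GiB reserved for PHP-FPM, workers averaging 64 MiB.
# fpm_max_children 2048 64   # prints 32
```

Measure the average worker size under realistic traffic rather than at idle; workers tend to grow as they serve heavier requests.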

For write-heavy workloads, set vm.dirty_ratio to around 10% and vm.dirty_background_ratio to 3%. This flushes dirty pages more frequently, avoiding large I/O stalls.

Make kernel tuning persistent by saving parameters to /etc/sysctl.d/90-memory.conf. Settings applied at runtime are lost on reboot.

For a summary of recommended values by role:

Parameter	Web/app server	Database server
`vm.swappiness`	10–20	1–5
`vm.vfs_cache_pressure`	50	50
`vm.dirty_ratio`	15%	10%
`vm.min_free_kbytes`	65536	65536
OOM protection	Default	`OOMScoreAdjust=-900`
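The table translates directly into a sysctl drop-in. The sketch below prints one for a given role; the function name is hypothetical, and where the table gives a range for vm.swappiness it picks the low end (10 for web, 1 for database), which is a choice made for this example.

```shell
# Print sysctl settings for a role ("web" or "db") using the values from the table.
memory_sysctl_conf() {
    case "$1" in
        web) local swappiness=10 dirty_ratio=15 ;;
        db)  local swappiness=1  dirty_ratio=10 ;;
        *)   echo "usage: memory_sysctl_conf web|db" >&2; return 1 ;;
    esac
    cat <<EOF
vm.swappiness = $swappiness
vm.vfs_cache_pressure = 50
vm.dirty_ratio = $dirty_ratio
vm.min_free_kbytes = 65536
EOF
}
```

To apply it, something like `memory_sysctl_conf db | sudo tee /etc/sysctl.d/90-memory.conf` followed by `sudo sysctl --system` writes the file and loads it without a reboot.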

If you're running high-density workloads and need a server with the headroom to apply these policies properly, FDC's dedicated servers are worth a look.
