#server-performance

Linux Memory Management: Swap, OOM Killer & Cgroups

12 min read - May 31, 2026

Table of contents
  • Linux memory management explained: swap, OOM killer, and cgroups
  • How Linux manages memory pages
  • Configuring swap
  • The OOM killer
  • Cgroups and memory limits
  • Memory configuration by server role

How Linux swap, the OOM killer, and cgroups work together — with configuration examples for databases, web servers, and multi-tenant VPS hosts.

Linux memory management explained: swap, OOM killer, and cgroups

Linux treats unused RAM as wasted RAM. High memory usage isn't always a problem: the kernel deliberately fills otherwise idle memory with page cache to speed up disk reads, and hands it back when applications need it. But when real memory pressure builds, three mechanisms do the work: swap, the OOM killer, and cgroups. Understanding how each one behaves, and how to configure them, is the difference between a server that degrades gracefully under load and one that crashes without warning.

How Linux manages memory pages

Every process runs in its own virtual address space, up to 128 TiB of user space on x86-64 with the default 4-level page tables. The kernel maps these virtual addresses to physical RAM through page tables, with the Translation Lookaside Buffer (TLB) caching recent lookups. A TLB hit takes around 1 nanosecond; a miss costs roughly 20–100 nanoseconds because the page tables have to be walked, which adds up in memory-intensive workloads like databases.
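To see why the miss penalty matters, here is a minimal sketch of the expected translation cost per memory access. It uses the rough hit and miss latencies quoted above; the function name and the example hit rates are assumptions for illustration, not measurements.

```python
def avg_translation_ns(hit_rate: float, hit_ns: float = 1.0, miss_ns: float = 100.0) -> float:
    """Expected address-translation cost per access, in nanoseconds.

    hit_ns and miss_ns default to the rough figures from the text
    (about 1 ns for a hit, worst case about 100 ns for a miss).
    """
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    # Weighted average of the hit cost and the miss cost.
    return hit_rate * hit_ns + (1.0 - hit_rate) * miss_ns


if __name__ == "__main__":
    # Hypothetical hit rates, chosen only to show the trend.
    for rate in (0.999, 0.99, 0.95):
        print(f"hit rate {rate:.1%}: {avg_translation_ns(rate):.2f} ns per access")
```

Even a few percent of misses dominates the average, which is why workloads with large, randomly accessed working sets are sensitive to TLB behavior.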

Physical memory is divided into 4 KB pages, and the kernel splits them into two categories:

  • File-backed pages — tied to files on disk. The kernel can discard clean ones or flush dirty ones without needing swap.
  • Anonymous pages — heap and stack memory with no backing file. These must be written to swap before the kernel can free them.

On servers with high memory demand, a large proportion of anonymous pages means swap gets involved early. Watch the si (swap in) and so (swap out) columns in vmstat 1 — persistent non-zero values are your first warning that the system is under pressure.
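If you would rather collect this signal from a script than watch vmstat, the kernel exposes cumulative swap counters in /proc/vmstat as pswpin and pswpout, both counted in pages. The sketch below computes per-second rates from two snapshots; the function names and the sampling interval are assumptions, and the units differ from vmstat, which reports si/so in KiB per second.

```python
def parse_vmstat(text: str) -> dict:
    """Parse the 'name value' lines of /proc/vmstat into a dict of ints."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters


def swap_rates(before: str, after: str, interval_s: float) -> tuple:
    """Return (pages swapped in per second, pages swapped out per second)."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    first, second = parse_vmstat(before), parse_vmstat(after)
    pages_in = second["pswpin"] - first["pswpin"]
    pages_out = second["pswpout"] - first["pswpout"]
    return pages_in / interval_s, pages_out / interval_s
```

In practice you would read /proc/vmstat twice, a few seconds apart, and pass both snapshots in. Rates that stay above zero across several intervals correspond to the persistent si/so values described above.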

For monitoring, use the right tool for the job:

| Tool | Best for | Key metric |
| --- | --- | --- |
| free -h | Quick system-wide snapshot | available column |
| vmstat 1 | Real-time swap and I/O monitoring | si, so |
| htop | Interactive per-process view | Memory bars, process list |
| smem | Accurate per-process usage | USS (Unique Set Size) |
| /proc/meminfo | Kernel-level detail | MemAvailable, Dirty, Slab |

One common mistake: watching the free column in free -h and panicking. The available column is what matters. It includes memory the kernel can reclaim from cache on demand. A server showing only 512 MB free but 5 GB available is not in trouble.

When free memory drops below the kernel's low watermark, the kswapd daemon starts reclaiming pages in the background and keeps going until free memory is back above the high watermark. If allocations outpace kswapd, the kernel falls into direct reclaim, where the allocating process itself is blocked until pages are freed. This is where latency spikes come from. Set an alert when MemAvailable falls below 10–15% of total RAM so you have time to respond.
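The alert rule above can be sketched as a small check on the contents of /proc/meminfo. The function names and the 15% default are assumptions for illustration; the field names MemTotal and MemAvailable are the ones the kernel reports.

```python
def parse_meminfo(text: str) -> dict:
    """Parse /proc/meminfo text into {field: value}; sizes are in kB."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values


def available_pct(meminfo_text: str) -> float:
    """MemAvailable as a percentage of MemTotal."""
    info = parse_meminfo(meminfo_text)
    return 100.0 * info["MemAvailable"] / info["MemTotal"]


def needs_alert(meminfo_text: str, threshold_pct: float = 15.0) -> bool:
    """True when available memory is below the alert threshold."""
    return available_pct(meminfo_text) < threshold_pct
```

On a live system you would pass in the result of reading /proc/meminfo. Note that the check deliberately ignores MemFree, for the reason given earlier: a low free value alone says little.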


 

Configuring swap

Swap is a disk area — either a partition or a file — where the kernel moves inactive anonymous pages when RAM fills up. The speed gap is significant: DDR4 RAM has roughly 100 ns latency, while NVMe SSDs are around 100,000 ns and SATA SSDs closer to 500,000 ns. Swap is a safety buffer, not extra RAM. A server that consistently relies on swap has a memory problem that more swap won't fix.

Use a swap file rather than a partition. It's easier to resize and doesn't require repartitioning.

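# Allocate a 2 GB file. If swapon later rejects it (some filesystems treat
# preallocated files as having holes), create the file with dd from /dev/zero instead.
sudo fallocate -l 2G /swapfile
sudo chmod 600 /swapfile   # root-only access: swap can hold paged-out process data
sudo mkswap /swapfile      # write the swap signature
sudo swapon /swapfile      # enable the swap file immediately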

Add the file to /etc/fstab (for example with the line /swapfile none swap sw 0 0) so it persists across reboots. The chmod 600 step is required: any data paged out of RAM is readable from swap, so the file must not be world-readable.

After creating swap, tune vm.swappiness. The default of 60 lets the kernel swap out anonymous pages fairly readily. For most hosting workloads you want the kernel to prefer RAM and only use swap as a last resort:

| Server role | vm.swappiness | vm.vfs_cache_pressure |
| --- | --- | --- |
| General web server | 10–20 | 50 |
| Database (MySQL/PostgreSQL) | 1–5 | 50 |
| Default (most distros) | 60 | 100 |

For swap sizing: 1–2 GB is enough for a 2 GB VPS handling occasional traffic spikes. On systems with 8 GB or more, a fixed 2–4 GB swap is generally sufficient. The goal is to give the kernel a pressure valve for cold pages, not to extend total addressable memory.

On RAM-constrained servers with ample CPU, zram creates a compressed swap area in memory, avoiding disk I/O entirely. It's worth considering on multi-tenant VPS hosts where NVMe is shared across tenants. Watch for I/O contention if swap lives on the same device as database files — heavy swapping and high-throughput disk writes don't coexist well.

The OOM killer

When the kernel exhausts RAM and swap and can't reclaim enough memory through other means, the OOM killer steps in. It scores processes using the oom_badness() function:

points = (rss_anon + rss_file + rss_shmem + swapents + pgtables_pages) + (oom_score_adj × totalpages / 1000)

The process with the highest score gets killed. The formula favors large memory consumers. Current kernels also avoid killing several processes in quick succession: while an earlier victim is still exiting and its memory is being freed, the OOM killer holds off on selecting another one.
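As a worked example of the scoring formula above, the sketch below computes points for three hypothetical processes. All values are in pages. The function name, the process figures, and the choice to divide totalpages by 1000 with integer division before multiplying are assumptions for illustration, not a copy of the kernel source.

```python
def oom_points(rss_anon, rss_file, rss_shmem, swapents, pgtables_pages,
               oom_score_adj, totalpages):
    """Score a process using the formula quoted above (all sizes in pages).

    Returns None for oom_score_adj == -1000, which exempts the process.
    """
    if not -1000 <= oom_score_adj <= 1000:
        raise ValueError("oom_score_adj must be in [-1000, 1000]")
    if oom_score_adj == -1000:
        return None
    usage = rss_anon + rss_file + rss_shmem + swapents + pgtables_pages
    # Each adjustment unit is worth one thousandth of total memory.
    return usage + oom_score_adj * (totalpages // 1000)


# Hypothetical host: 4,000,000 pages (about 16 GB with 4 KB pages).
TOTAL = 4_000_000
candidates = {
    "large-db (adj -900)": oom_points(2_000_000, 0, 0, 0, 0, -900, TOTAL),
    "web-worker (adj 0)": oom_points(500_000, 0, 0, 0, 0, 0, TOTAL),
    "batch-job (adj +200)": oom_points(100_000, 0, 0, 0, 0, 200, TOTAL),
}
victim = max(candidates, key=candidates.get)
```

With these numbers the small batch job scores highest, because an adjustment of +200 adds 800,000 points, while the database's -900 subtracts 3,600,000 and pushes it far down the list despite its size.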

Two types of OOM events appear in logs:

  • Global OOM — the entire system is out of RAM and swap. Logs prefix with Out of memory:
  • Cgroup OOM — a container or service hit its memory.max limit. Logs prefix with Memory cgroup out of memory:

To review past OOM events:

dmesg -T | grep -i "out of memory"
journalctl -k --grep="oom"
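Once you have the kernel log text, the two prefixes can be told apart programmatically. The sketch below is one way to do it; the regular expression assumes the "Killed process PID (name)" wording used by recent kernels, and the function name is hypothetical.

```python
import re

# Check the cgroup prefix first; both alternatives are matched case-sensitively.
OOM_LINE = re.compile(
    r"(Memory cgroup out of memory|Out of memory): Killed process (\d+) \(([^)]+)\)"
)


def classify_oom_events(log_text: str) -> list:
    """Return (scope, pid, process_name) for each OOM kill line found."""
    events = []
    for line in log_text.splitlines():
        match = OOM_LINE.search(line)
        if match:
            scope = "cgroup" if match.group(1).startswith("Memory cgroup") else "global"
            events.append((scope, int(match.group(2)), match.group(3)))
    return events
```

Feeding it the output of dmesg -T or journalctl -k gives a quick count of how many kills were system-wide and how many were caused by a cgroup limit, which usually points to very different fixes.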

Pay attention to the order field in OOM logs. An allocation of order N asks for 2^N physically contiguous pages, so order=0 is a single page. A value above 0 suggests memory fragmentation rather than total exhaustion: the kernel couldn't find enough contiguous pages even with free memory available.

You can influence which processes the OOM killer targets by adjusting /proc/<pid>/oom_score_adj. The range is -1000 (never kill) to +1000 (kill first). For systemd-managed services, set this permanently in the unit file:

[Service]
OOMScoreAdjust=-1000

Additional sysctl parameters for tuning OOM behavior:

| Parameter | Value | Effect |
| --- | --- | --- |
| vm.overcommit_memory | 0 | Default heuristic overcommit mode |
| vm.overcommit_memory | 2 | Strict mode; prevents allocations exceeding swap + RAM × overcommit_ratio / 100 |
| vm.panic_on_oom | 1 | Kernel panics instead of killing a process (it reboots only if kernel.panic is set to a non-zero timeout) |
| vm.oom_kill_allocating_task | 1 | Kills the process that triggered OOM rather than the largest consumer |

For proactive monitoring, check /proc/pressure/memory (Pressure Stall Information, available since kernel 4.20). Watch the some avg10 value: below 5% is healthy, sustained above 20% means an OOM event is likely coming. Rising allocstall counters in /proc/vmstat (reported per zone as allocstall_* on newer kernels) are another early signal: they count direct reclaim stalls, which often precede OOM kills. Userspace tools can step in before the kernel's OOM killer fires: systemd-oomd acts on PSI thresholds, while earlyoom acts on low available memory and free swap.
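The PSI thresholds above can be sketched as a small parser for /proc/pressure/memory, which contains a "some" line and a "full" line of key=value fields. The function names and the "elevated" label for the band between 5% and 20% are assumptions for illustration.

```python
def parse_psi(text: str) -> dict:
    """Parse PSI text into {"some": {...}, "full": {...}} with float values."""
    result = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        metrics = {}
        for field in fields[1:]:
            key, _, value = field.partition("=")
            metrics[key] = float(value)
        result[fields[0]] = metrics
    return result


def pressure_level(psi_text: str) -> str:
    """Classify one reading of 'some avg10' using the thresholds from the text."""
    avg10 = parse_psi(psi_text)["some"]["avg10"]
    if avg10 < 5.0:
        return "healthy"
    if avg10 > 20.0:
        return "critical"
    return "elevated"
```

A single reading is only a snapshot. Since the warning sign is sustained pressure, a monitoring job would sample this repeatedly and alert only when several consecutive readings come back "critical".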

Cgroups and memory limits

Control groups (cgroups) let you organize processes into groups and enforce hard resource limits. Introduced in Linux 2.6.24, they're the foundation of container runtimes including Docker, Podman, Kubernetes, and LXC. The kernel tracks memory usage per cgroup, covering anonymous memory, file-backed pages, and kernel objects. If a cgroup hits its limit, the kernel reclaims memory within that group or triggers a cgroup-scoped OOM kill.

Cgroup v1 and v2 differ primarily in how they're structured. V1 mounts each controller (memory, CPU, I/O) separately under /sys/fs/cgroup/<controller>/, which leads to inconsistent resource tracking. V2 uses a unified hierarchy at /sys/fs/cgroup/. Kubernetes declared cgroup v2 support stable in version 1.25 and moved cgroup v1 support into maintenance mode in 1.31.

To check which version your system uses:

stat -fc %T /sys/fs/cgroup/

cgroup2fs means v2; tmpfs typically means v1.
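The same check can be done from a script by looking at the files under the cgroup mount point. This is a minimal sketch that relies on two conventions: a v2 root contains a cgroup.controllers file, and a v1 layout has a memory/ controller directory. The function name and the "unknown" fallback are assumptions for illustration.

```python
from pathlib import Path


def cgroup_version(root: str = "/sys/fs/cgroup") -> str:
    """Guess the cgroup version from the layout of the mount point."""
    base = Path(root)
    # The unified (v2) hierarchy exposes cgroup.controllers at its root.
    if (base / "cgroup.controllers").is_file():
        return "v2"
    # A v1 layout mounts each controller in its own directory.
    if (base / "memory").is_dir():
        return "v1"
    return "unknown"
```

On hybrid setups, where v1 controllers are mounted alongside a separate unified directory, this reports "v1", which matches what the memory controller is actually using there.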

| Feature | Cgroup v1 | Cgroup v2 |
| --- | --- | --- |
| Hierarchy | Multiple, per-controller | Single, unified |
| Hard memory limit | memory.limit_in_bytes | memory.max |
| Soft memory limit | memory.soft_limit_in_bytes | memory.high (throttles) |
| Usage tracking | memory.usage_in_bytes | memory.current |
| Pressure metrics | Limited | PSI integrated |

The key memory controls in cgroup v2:

| Parameter | Type | Description |
| --- | --- | --- |
| memory.max | Hard limit | Exceeding this triggers the OOM killer |
| memory.high | Soft limit | Throttles allocation and triggers reclaim before hitting the hard limit |
| memory.low | Soft protection | Memory below this threshold is reclaimed last |
| memory.min | Hard protection | Memory below this level is never reclaimed |
| memory.swap.max | Swap limit | Set to 0 to disable swap for this cgroup |
| memory.oom.group | Boolean | If enabled, OOM kills all processes in the cgroup together |

A practical rule: set memory.high around 10–20% below memory.max to give the kernel room to reclaim before hitting the hard limit. When sizing memory.max, add 20–30% above the application's peak usage to account for page cache, which counts against cgroup memory totals.
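The sizing rule above is simple arithmetic, sketched here with integer percentages to avoid rounding surprises. The function name and the default percentages (25% headroom, 15% gap) are assumptions picked from the middle of the ranges given in the text.

```python
def suggest_limits(peak_bytes: int, headroom_pct: int = 25, high_gap_pct: int = 15) -> dict:
    """Derive memory.max and memory.high from an application's peak usage.

    memory.max  = peak plus headroom_pct (room for page cache)
    memory.high = memory.max minus high_gap_pct (room to reclaim first)
    """
    if peak_bytes <= 0:
        raise ValueError("peak_bytes must be positive")
    memory_max = peak_bytes * (100 + headroom_pct) // 100
    memory_high = memory_max * (100 - high_gap_pct) // 100
    return {"memory.high": memory_high, "memory.max": memory_max}
```

For an application that peaks at 1,000,000,000 bytes, this suggests a memory.max of 1,250,000,000 and a memory.high of 1,062,500,000. Treat the result as a starting point and adjust it against real usage data.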

Manage cgroups via systemd rather than writing directly to the cgroup filesystem. Use unit file directives like MemoryMax=, MemoryHigh=, and MemoryMin= for persistent limits. For quick tests:

systemd-run --scope -p MemoryMax=512M <command>

For web server worker pools, setting memory.oom.group=1 ensures a clean kill if one worker exceeds its limit — no orphaned processes left behind. For database engines, memory.min protects the buffer pool from being reclaimed under system-wide pressure.

Memory configuration by server role

The right memory settings depend on what the server is doing. Applying the same configuration to a database and a PHP web server will hurt one of them.

| Server role | vm.swappiness | OOM strategy | Cgroup policy |
| --- | --- | --- | --- |
| Database | 1–5 | Protect (OOMScoreAdjust=-900) | Use memory.min to protect buffer pool |
| Web/app server | 10–20 | Default | Cap per worker pool via memory.max |
| Background worker | 60 | Killable (OOMScoreAdjust=+200) | Throttle via memory.high |
| Multi-tenant VPS | 60 (with zram) | Default | Hard isolation per tenant via memory.max |

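For MySQL, allocate 50–70% of available RAM to innodb_buffer_pool_size. PostgreSQL's counterpart is shared_buffers, which is usually sized smaller (around 25% of RAM is a common starting point) because PostgreSQL also leans on the OS page cache. For both engines, disable Transparent Huge Pages to reduce latency spikes, and protect the process with OOMScoreAdjust=-900 in the systemd unit file.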

For PHP-FPM, size worker pools against actual memory usage. Each worker typically uses 30–100 MB. Divide allocated RAM by average worker size to get a safe pm.max_children value. Use memory.max in cgroups to cap the pool.
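The pm.max_children calculation above is a single division, sketched here so it can be dropped into a provisioning script. The function name and the example figures are assumptions; measure your own average worker size before relying on the result.

```python
def max_children(pool_ram_mb: int, avg_worker_mb: int) -> int:
    """Safe pm.max_children: RAM allocated to the pool divided by average worker size."""
    if pool_ram_mb <= 0 or avg_worker_mb <= 0:
        raise ValueError("both values must be positive")
    # Round down so the pool never exceeds its allocation; always allow one worker.
    return max(1, pool_ram_mb // avg_worker_mb)
```

For example, 2048 MB allocated to the pool with workers averaging 64 MB gives 32 children, while the same allocation with 100 MB workers gives 20.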

For write-heavy workloads, set vm.dirty_ratio to around 10% and vm.dirty_background_ratio to 3%. This flushes dirty pages more frequently, avoiding large I/O stalls.

Make kernel tuning persistent by saving parameters to /etc/sysctl.d/90-memory.conf and loading them with sudo sysctl --system. Settings applied only at runtime, for example with sysctl -w, are lost on reboot.
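A minimal sketch of generating that file from per-role presets follows. The preset names and the function are hypothetical, and where this article gives a range the sketch picks the low end (for example 10 from 10–20), which is an assumption you should adjust.

```python
# Hypothetical presets built from the values recommended in this article.
PRESETS = {
    "web": {
        "vm.swappiness": 10,
        "vm.vfs_cache_pressure": 50,
        "vm.dirty_ratio": 15,
        "vm.min_free_kbytes": 65536,
    },
    "database": {
        "vm.swappiness": 1,
        "vm.vfs_cache_pressure": 50,
        "vm.dirty_ratio": 10,
        "vm.min_free_kbytes": 65536,
    },
}


def render_sysctl_conf(role: str) -> str:
    """Render the contents of a sysctl.d file for the given role."""
    if role not in PRESETS:
        raise KeyError(f"unknown role: {role}")
    lines = [f"# Memory tuning for role: {role}"]
    lines += [f"{key} = {value}" for key, value in PRESETS[role].items()]
    return "\n".join(lines) + "\n"
```

The function only returns text. Writing it to /etc/sysctl.d/90-memory.conf and reloading is left to your deployment tooling, which keeps the sketch safe to run anywhere.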

For a summary of recommended values by role:

| Parameter | Web/app server | Database server |
| --- | --- | --- |
| vm.swappiness | 10–20 | 1–5 |
| vm.vfs_cache_pressure | 50 | 50 |
| vm.dirty_ratio | 15% | 10% |
| vm.min_free_kbytes | 65536 | 65536 |
| OOM protection | Default | OOMScoreAdjust=-900 |

If you're running high-density workloads and need a server with the headroom to apply these policies properly, FDC's dedicated servers are worth a look.
