NUMA Awareness and CPU Pinning for Dedicated Servers

16 min read - June 16, 2026

Table of contents

NUMA awareness and CPU pinning for dedicated servers
How NUMA works on multi-socket servers
Inspecting NUMA topology on Linux
CPU pinning and memory policies
Picking a strategy by workload
BIOS and kernel settings to check

How to inspect NUMA topology and pin Linux workloads to the right cores and memory. Covers numactl, taskset, systemd, BIOS settings, and workload-specific strategies.

NUMA awareness and CPU pinning for dedicated servers

On any multi-socket server, where a process runs and where its memory lives are two different questions, and getting them out of sync is one of the easiest ways to leave performance on the table. NUMA awareness and CPU pinning are the two knobs that fix this. This post covers how NUMA works, how to inspect it on Linux, and how to pin workloads correctly for databases, AI training, and latency-sensitive services.

How NUMA works on multi-socket servers

A NUMA (Non-Uniform Memory Access) node is a group of CPU cores bound to a local block of RAM through a dedicated memory controller. On a two-socket server you usually have two nodes. Any core can read any address, but local access is roughly 80 ns while a cross-socket hop over Intel's UPI or AMD's Infinity Fabric is around 130–150 ns. On larger systems with more sockets, the worst-case node can push past 250 ns.
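To get a feel for what those figures mean across a whole workload, a back-of-the-envelope blend helps. The sketch below is illustrative only: it takes the 80 ns local figure from above, assumes 140 ns for a remote access (the middle of the cross-socket range), and uses a hypothetical helper name, avg_latency_ns.

```shell
# Blended memory latency for a given share of remote accesses.
# Usage: avg_latency_ns LOCAL_NS REMOTE_NS REMOTE_FRACTION
avg_latency_ns() {
    awk -v l="$1" -v r="$2" -v f="$3" \
        'BEGIN { printf "%.1f\n", (1 - f) * l + f * r }'
}

# 30% of accesses going remote (an assumed share, not a measurement)
avg_latency_ns 80 140 0.30
```

With those assumed inputs, a 30% remote share lifts the average from 80 ns to 98 ns, which is why a modest amount of cross-socket traffic is already visible in throughput.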

Bandwidth follows the same pattern. A two-socket Sapphire Rapids system has a theoretical peak of around 600 GB/s when cores hit local memory, but the inter-socket link is a fraction of that, so traffic crossing it bottlenecks fast. High-core-count processors make this more granular: Intel's Sub-NUMA Clustering (SNC) and AMD's Nodes Per Socket (NPS) split each socket into multiple NUMA domains, so a "two-socket" box can easily present four or eight nodes to Linux.
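For reference, a figure of roughly 600 GB/s lines up with the theoretical peak of the memory channels rather than a measured rate. The arithmetic below is a sketch that assumes eight DDR5-4800 channels per socket and 8 bytes per transfer; peak_bw_gbs is a hypothetical helper name.

```shell
# Theoretical peak DRAM bandwidth in GB/s.
# Usage: peak_bw_gbs CHANNELS_PER_SOCKET MEGATRANSFERS_PER_SEC SOCKETS
peak_bw_gbs() {
    # channels x MT/s x 8 bytes per transfer x sockets, converted from MB/s to GB/s
    awk -v c="$1" -v m="$2" -v s="$3" \
        'BEGIN { printf "%.1f\n", c * m * 8 * s / 1000 }'
}

peak_bw_gbs 8 4800 2   # both sockets
peak_bw_gbs 8 4800 1   # one socket
```

Under those assumptions the result is 614.4 GB/s for two sockets and 307.2 GB/s for one. Measured bandwidth on real workloads will be lower.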

Without NUMA awareness, the Linux scheduler will happily migrate a thread between sockets while its working set stays on the original node. Every subsequent access becomes a remote one. The visible symptom is high CPU utilisation with low actual throughput, because the cores are spending their time waiting on memory. I/O devices make this worse. A GPU or NIC is attached to a specific PCIe root, which belongs to one NUMA node. If the process feeding it runs on the other socket, every DMA transfer crosses the interconnect.

Inspecting NUMA topology on Linux

Four tools cover almost everything you need:

lscpu for a quick socket and node summary.
numactl --hardware for node memory totals and the inter-node distance matrix.
numastat for per-node hit/miss counters, or per-process memory placement with -p.
lstopo (from hwloc) for cache hierarchy and PCIe device locality.

Start with numactl --hardware. It lists each node, the cores and memory belonging to it, and the distance matrix. A value of 10 is local; larger values are remote, typically 20 or more across sockets. If you see a single node on a multi-socket box, your BIOS most likely has Node Interleaving enabled and is hiding the topology. Fix that first (see below).
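If numactl is not installed, the same topology can be read straight from sysfs. This is a minimal sketch: the function names are hypothetical, and the optional path argument exists only so the functions can be pointed at a copy of the tree.

```shell
# List each NUMA node with its CPUs and its row of the distance matrix.
# Usage: list_numa_nodes [NODE_ROOT]
list_numa_nodes() {
    local root="${1:-/sys/devices/system/node}" dir
    for dir in "$root"/node[0-9]*; do
        [ -r "$dir/cpulist" ] || continue
        printf '%s cpus=%s distances=%s\n' "${dir##*/}" \
            "$(cat "$dir/cpulist")" "$(cat "$dir/distance" 2>/dev/null)"
    done
}

# Exit status 0 when exactly one node is visible, the symptom described above.
single_node_visible() {
    [ "$(list_numa_nodes "${1:-/sys/devices/system/node}" | wc -l)" -eq 1 ]
}
```

On a multi-socket box, something like single_node_visible && echo "check Node Interleaving in the BIOS" makes a quick pre-flight check.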

For a specific process, numastat -p <PID> breaks down where its memory is actually allocated, in MB per node. For the system as a whole, plain numastat prints per-node allocation counters, cumulative since boot. Four of them matter:

numa_hit: pages allocated on this node as intended. You want this high.
numa_miss: pages allocated on this node although the process preferred another node, usually because the preferred node was full.
numa_foreign: pages intended for this node that ended up on another one. Indicates memory pressure on this node.
other_node: pages allocated on this node while the requesting process was running on another node. Steady growth here is the classic sign of bad pinning.
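The kernel also exposes these counters per node in /sys/devices/system/node/nodeN/numastat, which is convenient for scripting. The sketch below turns one such file into two ratios. The function name is hypothetical, and because the counters are cumulative since boot, comparing two samples taken a few minutes apart is more telling than a single reading.

```shell
# Summarise one node's counters as percentages.
# Usage: numa_ratios /sys/devices/system/node/node0/numastat
numa_ratios() {
    awk '
        { v[$1] = $2 }
        END {
            alloc = v["numa_hit"] + v["numa_miss"]      # all pages placed on this node
            ran   = v["local_node"] + v["other_node"]   # split by where the task was running
            printf "miss=%.2f%% other_node=%.2f%%\n",
                (alloc ? 100 * v["numa_miss"] / alloc : 0),
                (ran ? 100 * v["other_node"] / ran : 0)
        }' "$1"
}
```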

For GPU or NIC workloads, run lstopo-no-graphics and look at which NUMA node each PCIe device is attached to. If the cores driving the device are on the other node, that's the first thing to fix.
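Device locality is also available from sysfs without hwloc. A minimal sketch follows, with hypothetical function names; the PCI address and interface name in the usage note are placeholders. A reported node of -1 means the firmware did not expose locality for that device.

```shell
# NUMA node of a PCI device, given its address.
# Usage: pci_numa_node BDF [PCI_ROOT]
pci_numa_node() {
    local bdf="$1" root="${2:-/sys/bus/pci/devices}"
    cat "$root/$bdf/numa_node"
}

# CPUs that are local to that PCI device.
pci_local_cpus() {
    local bdf="$1" root="${2:-/sys/bus/pci/devices}"
    cat "$root/$bdf/local_cpulist"
}

# NUMA node of a network interface.
# Usage: nic_numa_node IFNAME [NET_ROOT]
nic_numa_node() {
    local ifname="$1" root="${2:-/sys/class/net}"
    cat "$root/$ifname/device/numa_node"
}
```

For example, pci_numa_node 0000:17:00.0 or nic_numa_node eth0, substituting the address shown by lspci or the interface name on your system.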

CPU pinning and memory policies

CPU pinning (or CPU affinity) binds a process to specific cores so the scheduler can't migrate it. By itself that's not enough, because Linux uses a first-touch memory policy by default: pages are allocated on the node of whichever CPU first touches them. If a thread starts on the wrong node before it gets pinned, its memory stays there. You need to control both placement and allocation together.

Three tools cover the common cases:

Tool	Controls	Use for
`taskset`	CPU cores only	Quick one-off binding of an existing process
`numactl`	CPU cores and memory	Launching workloads with strict locality
systemd	CPU cores and memory, persistent	Services that need pinning across reboots

numactl supports four memory policies:

--membind=N: allocate only on node N, fail if full.
--preferred=N: prefer node N, fall back to others if needed.
--interleave=all: round-robin across nodes for even bandwidth distribution.
--localalloc: allocate on whichever node the running CPU is on.
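To show how the four policies typically combine with CPU binding, here is a small dry-run helper. It only prints the command line and never executes it. The function name and the short policy keywords are hypothetical; the numactl flags are the ones listed above.

```shell
# Print (do not run) a numactl command line for a named policy.
# Usage: numactl_cmd POLICY NODE COMMAND [ARGS...]
numactl_cmd() {
    local policy="$1" node="$2"
    shift 2
    case "$policy" in
        bind)       echo "numactl --cpunodebind=$node --membind=$node $*" ;;
        preferred)  echo "numactl --cpunodebind=$node --preferred=$node $*" ;;
        interleave) echo "numactl --interleave=all $*" ;;
        local)      echo "numactl --cpunodebind=$node --localalloc $*" ;;
        *)          echo "unknown policy: $policy" >&2; return 1 ;;
    esac
}
```

Printing first makes it easy to review the exact flags before wiring them into a launch script.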

Pinning a workload to one node

First, identify which cores belong to your target node:

numactl --hardware

Then launch the application bound to that node for both cores and memory:

numactl --cpunodebind=0 --membind=0 ./your_application

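For an already-running process, adjust CPU affinity with taskset. The -a flag applies the change to every thread rather than only the main one. Pages the process has already allocated stay where they are:

taskset -a -cp 0-7 <PID>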
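To check that an affinity change reached every thread of a process, the per-thread status files under /proc can be read directly. A minimal sketch, with a hypothetical function name; the second argument is only there so the function can be pointed at a copy of the tree.

```shell
# Print "TID CPULIST" for every thread of a process.
# Usage: thread_affinities PID [PROC_ROOT]
thread_affinities() {
    local pid="$1" proc="${2:-/proc}" st tid
    for st in "$proc/$pid"/task/*/status; do
        [ -r "$st" ] || continue
        tid="${st%/status}"   # strip the file name
        tid="${tid##*/}"      # keep the thread ID directory name
        awk -v tid="$tid" '/^Cpus_allowed_list:/ { print tid, $2 }' "$st"
    done
}
```

Any thread still showing the full CPU range was not covered by the change.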

To make it survive a reboot, set it in the systemd unit (NUMAPolicy= and NUMAMask= need systemd 243 or later):

[Service]
CPUAffinity=0-7
NUMAPolicy=bind
NUMAMask=0

Reload and restart:

sudo systemctl daemon-reload && sudo systemctl restart <service>
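If you would rather not edit a packaged unit file, a drop-in can carry the same three settings. The sketch below writes one. The function name, the 10-numa.conf file name, and the myapp.service unit in the usage note are all hypothetical.

```shell
# Write a drop-in that pins a unit to a CPU list and a NUMA node.
# Usage: write_numa_dropin UNIT CPULIST NODE [UNIT_ROOT]
write_numa_dropin() {
    local unit="$1" cpus="$2" node="$3" root="${4:-/etc/systemd/system}"
    local dir="$root/$unit.d"
    mkdir -p "$dir" || return 1
    cat > "$dir/10-numa.conf" <<EOF
[Service]
CPUAffinity=$cpus
NUMAPolicy=bind
NUMAMask=$node
EOF
}
```

Run as root it would look like write_numa_dropin myapp.service 0-7 0, followed by the usual daemon-reload and restart.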

When you're manually pinning, turn off the kernel's auto-balancer so it doesn't fight your placement:

sudo sysctl -w kernel.numa_balancing=0

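Add it to /etc/sysctl.conf to persist. Then verify with numastat -p <PID> over a couple of minutes of real workload. If nearly all of the process's memory stays on the node you bound it to, and the other_node counter in plain numastat stops climbing, the pinning is taking effect.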
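Another way to see where a pinned process's pages actually landed is /proc/<PID>/numa_maps, which tags every mapping with per-node page counts such as N0=6 N1=4. The sketch below totals them. The function name is hypothetical, and the counts are in units of each mapping's page size, so huge-page mappings weigh more than their count suggests.

```shell
# Sum the per-node page counts across all mappings of a process.
# Usage: pages_per_node /proc/<PID>/numa_maps
pages_per_node() {
    awk '{
        for (i = 1; i <= NF; i++)
            if ($i ~ /^N[0-9]+=/) {
                split($i, kv, "=")       # kv[1] is the node tag, kv[2] the page count
                pages[kv[1]] += kv[2]
            }
    }
    END { for (n in pages) print n, pages[n] }' "$1" | sort
}
```

After a correct bind, almost all pages should be reported under the node you chose.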

Picking a strategy by workload

The right policy depends on whether your workload benefits more from low latency or from aggregate bandwidth across all nodes.

Workload	Policy	Why
Databases (PostgreSQL, MySQL, SQL Server)	`--cpunodebind` + `--membind`	Large shared buffers, latency-sensitive query paths
In-memory cache (Redis, Memcached)	Single-node bind	Everything is RAM access, remote latency shows up immediately
AI/ML training and inference	Bind to the GPU's NUMA node	Keeps host-to-GPU transfers off the inter-socket link
Analytics (Spark, Elasticsearch)	`--interleave=all`	Large working set needs bandwidth across all nodes
Latency-sensitive APIs, trading	Strict pin + IRQ affinity	Predictability matters more than peak throughput
Network-heavy (RoCEv2, InfiniBand)	Pin to NIC's NUMA node, dedicate cores for IRQs	Keeps interrupt processing local and out of the way of app threads
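For the IRQ affinity rows, one common approach is to write a CPU list into /proc/irq/<N>/smp_affinity_list for each of the NIC's interrupt vectors. The sketch below is one way to do that, under a few assumptions: the NIC uses MSI or MSI-X (so its vectors appear under msi_irqs in sysfs), irqbalance is stopped or told to leave these IRQs alone, and the function name is hypothetical. It needs root on a real system.

```shell
# Point every MSI/MSI-X vector of a NIC at a fixed CPU list.
# Usage: pin_nic_irqs IFNAME CPULIST [SYS_ROOT] [PROC_ROOT]
pin_nic_irqs() {
    local ifname="$1" cpus="$2" sys="${3:-/sys}" proc="${4:-/proc}"
    local path irq
    for path in "$sys/class/net/$ifname/device/msi_irqs"/*; do
        irq="${path##*/}"    # each entry is named after its IRQ number
        [ -w "$proc/irq/$irq/smp_affinity_list" ] || continue
        echo "$cpus" > "$proc/irq/$irq/smp_affinity_list"
    done
}
```

Choosing cores on the NIC's own node that the application is not pinned to keeps interrupt handling local without competing with app threads.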

For GPU workloads specifically, run lstopo to find which NUMA node the GPU sits on, then launch the training or inference process with numactl --cpunodebind=N --membind=N for that same N. This is one of the easiest wins on a multi-socket GPU server, because the default scheduler placement is often wrong.

For HPC and MPI workloads that span both sockets, pin each rank to a single node with --localalloc rather than interleaving everything. Each rank gets local memory, and the parallelism happens at the rank level.

One practical note: if you pin to a single node, leave 2–4 GB of headroom on it. A node running close to full triggers reclaim, which costs you the latency you were trying to save.
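Headroom can be checked per node from /sys/devices/system/node/nodeN/meminfo. A minimal sketch with hypothetical function names; the threshold is whatever headroom you decide on, for example 2048 MB.

```shell
# Free memory on one node, in MB.
# Usage: node_free_mb /sys/devices/system/node/node0/meminfo
node_free_mb() {
    # Lines look like: "Node 0 MemFree:  3145728 kB"
    awk '$3 == "MemFree:" { printf "%d\n", $4 / 1024 }' "$1"
}

# Exit status 0 when the node has at least MIN_MB free.
# Usage: has_headroom MEMINFO_FILE MIN_MB
has_headroom() {
    local free
    free="$(node_free_mb "$1")"
    [ -n "$free" ] && [ "$free" -ge "$2" ]
}
```

A check such as has_headroom /sys/devices/system/node/node0/meminfo 2048 || echo "node0 is nearly full" fits into a monitoring script.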

BIOS and kernel settings to check

Tool output is only as accurate as the topology the firmware exposes. A few settings to confirm:

Node Interleaving: disable it. When enabled, the BIOS presents all memory as a single flat pool and hides NUMA from the OS entirely. numactl --hardware will show one node on a multi-socket box if this is the case.
Sub-NUMA Clustering (Intel) or Nodes Per Socket (AMD): enable on high-core-count processors when you want finer locality. Confirm the new node count with lscpu after reboot.
vm.zone_reclaim_mode: set to 0 for most production servers. A non-zero value aggressively reclaims local memory rather than allocating remotely, which can evict useful page cache.
kernel.numa_balancing: leave on for general-purpose workloads, turn off when you're manually pinning. The auto-balancer will migrate pages and threads in ways that conflict with your policy.
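The two kernel settings can be read back from /proc/sys to confirm what is actually in effect. A minimal sketch with a hypothetical function name; a key is reported as unavailable when the running kernel does not expose it.

```shell
# Report the current values of the NUMA-related sysctls discussed above.
# Usage: check_numa_sysctls [PROC_SYS_ROOT]
check_numa_sysctls() {
    local root="${1:-/proc/sys}" key path
    for key in kernel.numa_balancing vm.zone_reclaim_mode; do
        # kernel.numa_balancing maps to kernel/numa_balancing under /proc/sys
        path="$root/$(printf '%s' "$key" | tr . /)"
        if [ -r "$path" ]; then
            printf '%s=%s\n' "$key" "$(cat "$path")"
        else
            printf '%s=unavailable\n' "$key"
        fi
    done
}
```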

If you're running NUMA tuning on bare metal where you control the BIOS, kernel parameters, and IRQ affinity, you can apply all of the above without working around hypervisor abstractions. That's the main reason this kind of work is easier on dedicated hardware than in cloud VMs.

For multi-socket dedicated servers with full root access, see FDC's dedicated servers.