NUMA Awareness and CPU Pinning for Dedicated Servers
16 min read - June 16, 2026

How to inspect NUMA topology and pin Linux workloads to the right cores and memory. Covers numactl, taskset, systemd, BIOS settings, and workload-specific strategies.
On any multi-socket server, where a process runs and where its memory lives are two different questions, and getting them out of sync is one of the easiest ways to leave performance on the table. NUMA awareness and CPU pinning are the two knobs that fix this. This post covers how NUMA works, how to inspect it on Linux, and how to pin workloads correctly for databases, AI training, and latency-sensitive services.
How NUMA works on multi-socket servers
A NUMA (Non-Uniform Memory Access) node is a group of CPU cores bound to a local block of RAM through a dedicated memory controller. On a two-socket server you usually have two nodes. Any core can read any address, but local access is roughly 80 ns while a cross-socket hop over Intel's UPI or AMD's Infinity Fabric is around 130–150 ns. On larger systems with more sockets, the worst-case node can push past 250 ns.
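To get a feel for what those latencies mean in aggregate, a rough model is a weighted average of local and remote access time. The sketch below assumes the 80 ns and roughly 140 ns figures quoted above; the helper name is hypothetical, and real-world latency also depends on caches, prefetching, and load.

```shell
# Hypothetical helper: blended memory latency in ns for a given remote-access fraction.
# Usage: blended_latency_ns LOCAL_NS REMOTE_NS REMOTE_FRACTION
blended_latency_ns() (
    awk -v l="$1" -v r="$2" -v f="$3" \
        'BEGIN { printf "%.1f\n", (1 - f) * l + f * r }'
)

# Example: half of all accesses land on the remote node.
# blended_latency_ns 80 140 0.5
```

With half the accesses remote, the blended figure comes out at 110 ns, about 38% higher than the all-local case.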
Bandwidth follows the same pattern. A two-socket Sapphire Rapids system has a theoretical peak of roughly 600 GB/s when cores hit local memory (eight DDR5-4800 channels per socket), but the inter-socket link delivers only a fraction of that, so traffic crossing it bottlenecks fast. High-core-count processors make this more granular: Intel's Sub-NUMA Clustering (SNC) and AMD's Nodes Per Socket (NPS) split each socket into multiple NUMA domains, so a "two-socket" box can easily present four or eight nodes to Linux.
Without NUMA awareness, the Linux scheduler will happily migrate a thread between sockets while its working set stays on the original node. Every subsequent access becomes a remote one. The visible symptom is high CPU utilisation with low actual throughput, because the cores are spending their time waiting on memory. I/O devices make this worse. A GPU or NIC is attached to a specific PCIe root, which belongs to one NUMA node. If the process feeding it runs on the other socket, every DMA transfer crosses the interconnect.
Inspecting NUMA topology on Linux
Four tools cover almost everything you need:
- lscpu for a quick socket and node summary.
- numactl --hardware for node memory totals and the inter-node distance matrix.
- numastat for per-node hit/miss counters, and numastat -p for per-process memory placement.
- lstopo (from hwloc) for cache hierarchy and PCIe device locality.
Start with numactl --hardware. It lists each node, the cores and memory belonging to it, and the distance matrix. A value of 10 is local; anything higher is remote, and cross-socket hops typically show 20 or more. If you see a single node on a multi-socket box, your BIOS probably has Node Interleaving enabled and is hiding the topology. Fix that first (see the BIOS section below).
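If numactl isn't installed, the same information is available from sysfs. This is a minimal sketch that reads the kernel's per-node cpulist and distance files; the function name and the optional base-directory argument are assumptions added for illustration and testing.

```shell
# Print each NUMA node's CPU list and distance row from sysfs.
# The optional argument overrides the sysfs base directory (useful for testing).
numa_topology() (
    base="${1:-/sys/devices/system/node}"
    for dir in "$base"/node[0-9]*; do
        [ -d "$dir" ] || continue
        node="${dir##*/node}"
        cpus=$(cat "$dir/cpulist" 2>/dev/null)
        dist=$(cat "$dir/distance" 2>/dev/null)
        printf 'node %s: cpus=%s distances=[%s]\n' "$node" "$cpus" "$dist"
    done
)

# Example: numa_topology
```

Each distance row is ordered by node number, so the entry for the node itself is the local value of 10.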
For a specific process, numastat -p <PID> shows how much of its memory sits on each node. Plain numastat reports the system-wide allocation counters per node. Four of them matter:
- numa_hit: pages allocated on this node, as intended. You want this high.
- numa_miss: pages allocated on this node although the process preferred another, usually because the preferred node was full.
- numa_foreign: pages intended for this node but allocated elsewhere. It is the counterpart of numa_miss and a sign of memory pressure on this node.
- other_node: pages allocated on this node while the requesting process was running on a different one. High values here are the classic sign of bad pinning.
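The same counters are exposed per node in sysfs, which makes them easy to script. The sketch below computes what share of a node's allocations came from tasks running elsewhere; the function name is hypothetical, and because the counters are cumulative since boot, comparing two samples taken a minute apart is more telling than a single reading.

```shell
# Percentage of a node's allocations made by tasks running on another node.
# Reads the kernel's per-node counter file; pass a different path to test.
other_node_pct() (
    file="${1:-/sys/devices/system/node/node0/numastat}"
    awk '$1 == "local_node" { l = $2 }
         $1 == "other_node" { o = $2 }
         END {
             if (l + o > 0) printf "%.1f\n", 100 * o / (l + o)
             else print "0.0"
         }' "$file"
)

# Example: other_node_pct /sys/devices/system/node/node1/numastat
```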
For GPU or NIC workloads, run lstopo-no-graphics and look at which NUMA node each PCIe device is attached to. If the cores driving the device are on the other node, that's the first thing to fix.
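For a scriptable version of the same lookup, sysfs records the NUMA node and local CPU list of every PCI device. This is a sketch only: the function name is made up, and the PCI address in the example is a placeholder you would replace with one from lspci.

```shell
# Print the NUMA node and local CPU list for a PCI device.
# Usage: pci_numa_info 0000:3b:00.0 [sysfs_pci_base]
pci_numa_info() (
    dev="$1"
    base="${2:-/sys/bus/pci/devices}"
    node=$(cat "$base/$dev/numa_node" 2>/dev/null) || {
        echo "unknown device: $dev" >&2
        exit 1
    }
    cpus=$(cat "$base/$dev/local_cpulist" 2>/dev/null)
    # The kernel reports -1 when firmware did not describe the device's locality.
    printf '%s node=%s local_cpus=%s\n' "$dev" "$node" "$cpus"
)

# Example (placeholder address): pci_numa_info 0000:3b:00.0
```

For a network interface, the same files are reachable through /sys/class/net/<iface>/device/.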
CPU pinning and memory policies
CPU pinning (or CPU affinity) binds a process to specific cores so the scheduler can't migrate it. By itself that's not enough, because Linux uses a first-touch memory policy by default: each page is allocated on the node of the CPU that first touches it. If a thread starts on the wrong node before it gets pinned, the pages it has already touched stay there. You need to control both placement and allocation together.
Three tools cover the common cases:
| Tool | Controls | Use for |
|---|---|---|
| taskset | CPU cores only | Quick one-off binding of an existing process |
| numactl | CPU cores and memory | Launching workloads with strict locality |
| systemd | CPU cores and memory, persistent | Services that need pinning across reboots |
numactl supports four memory policies:
- --membind=N: allocate only on node N, fail if full.
- --preferred=N: prefer node N, fall back to others if needed.
- --interleave=all: round-robin across nodes for even bandwidth distribution.
- --localalloc: allocate on whichever node the running CPU is on.
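To see how the four policies translate into command lines, here is a small dry-run helper. The policy labels (bind, preferred, interleave, local) and the function name are invented for this sketch, and pairing each memory policy with --cpunodebind is one reasonable choice rather than a requirement. It only prints the command, so you can inspect it before running anything.

```shell
# Build (but do not run) a numactl command line for a named memory policy.
# Usage: numa_cmd POLICY NODE COMMAND [ARGS...]
numa_cmd() (
    policy="$1"; node="$2"; shift 2
    case "$policy" in
        bind)       flags="--cpunodebind=$node --membind=$node" ;;
        preferred)  flags="--cpunodebind=$node --preferred=$node" ;;
        interleave) flags="--interleave=all" ;;
        local)      flags="--cpunodebind=$node --localalloc" ;;
        *) echo "unknown policy: $policy" >&2; exit 2 ;;
    esac
    echo "numactl $flags $*"
)

# Example: numa_cmd preferred 1 ./your_application
```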
Pinning a workload to one node
First, identify which cores belong to your target node:
numactl --hardware
Then launch the application bound to that node for both cores and memory:
numactl --cpunodebind=0 --membind=0 ./your_application
For an already-running process, adjust CPU affinity with taskset. The -a flag applies the change to every thread of the process. This moves threads only; pages that are already allocated stay where they are:
taskset -acp 0-7 <PID>
To make it survive a reboot, set it in the systemd unit:
[Service]
CPUAffinity=0-7
NUMAPolicy=bind
NUMAMask=0
Reload and restart:
sudo systemctl daemon-reload && sudo systemctl restart <service>
When you're manually pinning, turn off the kernel's auto-balancer so it doesn't fight your placement:
sudo sysctl -w kernel.numa_balancing=0
Add kernel.numa_balancing=0 to /etc/sysctl.conf to persist it. Then verify with numastat -p <PID> over a couple of minutes of real workload. If nearly all of the process's memory stays on the target node, and other_node in plain numastat stops climbing, the pinning is taking effect.
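As an extra check on placement, the kernel lists per-node page counts for every mapping in /proc/<PID>/numa_maps. The sketch below totals them per node; the function name is hypothetical, it assumes 4 kB pages when a line carries no kernelpagesize_kB field, and reading another user's process typically needs root.

```shell
# Sum a process's mapped memory per NUMA node from /proc/<PID>/numa_maps.
# Usage: mem_per_node_kb PID   (a file path as second argument overrides the source)
mem_per_node_kb() (
    file="${2:-/proc/$1/numa_maps}"
    awk '{
        size = 4   # fall back to 4 kB pages if the field is absent
        for (i = 1; i <= NF; i++)
            if ($i ~ /^kernelpagesize_kB=/) { split($i, kv, "="); size = kv[2] }
        for (i = 1; i <= NF; i++)
            if ($i ~ /^N[0-9]+=/) {
                split($i, kv, "=")
                kb[substr(kv[1], 2)] += kv[2] * size
            }
    }
    END { for (n in kb) printf "node %s: %d kB\n", n, kb[n] }' "$file" | sort
)

# Example: mem_per_node_kb 12345
```

After a correct bind to node 0, almost everything should show up under node 0, with at most small shared-library mappings elsewhere.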
Picking a strategy by workload
The right policy depends on whether your workload benefits more from low latency or from aggregate bandwidth across all nodes.
| Workload | Policy | Why |
|---|---|---|
| Databases (PostgreSQL, MySQL, SQL Server) | --cpunodebind + --membind | Large shared buffers, latency-sensitive query paths |
| In-memory cache (Redis, Memcached) | Single-node bind | Everything is RAM access, remote latency shows up immediately |
| AI/ML training and inference | Bind to the GPU's NUMA node | Keeps host-to-GPU transfers off the inter-socket link |
| Analytics (Spark, Elasticsearch) | --interleave=all | Large working set needs bandwidth across all nodes |
| Latency-sensitive APIs, trading | Strict pin + IRQ affinity | Predictability matters more than peak throughput |
| Network-heavy (RoCEv2, InfiniBand) | Pin to NIC's NUMA node, dedicate cores for IRQs | Keeps interrupt processing local and out of the way of app threads |
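For the rows that call for IRQ affinity, the usual mechanism is writing a CPU list to /proc/irq/<N>/smp_affinity_list. The sketch below finds the IRQs belonging to a device and prints the writes it would perform, without applying them. The function names and the eth0 interface are placeholders, the match is a plain substring search, and a running irqbalance daemon may later override manual settings.

```shell
# List IRQ numbers whose /proc/interrupts entry mentions a device name.
# Usage: irqs_for_device eth0 [interrupts_file]
irqs_for_device() (
    dev="$1"
    file="${2:-/proc/interrupts}"
    awk -v dev="$dev" \
        'index($0, dev) && $1 ~ /^[0-9]+:$/ { sub(/:$/, "", $1); print $1 }' "$file"
)

# Print the commands that would steer those IRQs to a CPU list (dry run).
# Usage: irq_pin_plan eth0 0-3 [interrupts_file]
irq_pin_plan() (
    for irq in $(irqs_for_device "$1" "$3"); do
        echo "echo $2 > /proc/irq/$irq/smp_affinity_list"
    done
)

# Example: irq_pin_plan eth0 0-3
```

Review the printed lines, then run them as root. Choosing cores on the NIC's own node that the application threads do not use keeps interrupt handling both local and out of the way.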
For GPU workloads specifically, run lstopo to find which NUMA node the GPU sits on, then launch the training or inference process with numactl --cpunodebind=N --membind=N for that same N. This is one of the easiest wins on a multi-socket GPU server, because the default scheduler placement is often wrong.
For HPC and MPI workloads that span both sockets, pin each rank to a single node with --localalloc rather than interleaving everything. Each rank gets local memory, and the parallelism happens at the rank level.
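One way to do this is a small per-rank wrapper that the MPI launcher starts instead of the real binary. The sketch below assumes the launcher exports a node-local rank as OMPI_COMM_WORLD_LOCAL_RANK (Open MPI) or SLURM_LOCALID (Slurm), and spreads ranks over nodes round-robin. Both function names are hypothetical, and many launchers have their own binding options that may suit you better.

```shell
# Map a node-local MPI rank to a NUMA node, round-robin.
# Usage: rank_to_node LOCAL_RANK NUM_NODES
rank_to_node() (
    echo $(( $1 % $2 ))
)

# Wrapper body: read the local rank from the launcher's environment,
# then replace this shell with the real program under numactl.
# Usage: numa_rank_exec NUM_NODES COMMAND [ARGS...]
numa_rank_exec() {
    nodes="$1"; shift
    rank="${OMPI_COMM_WORLD_LOCAL_RANK:-${SLURM_LOCALID:-0}}"
    node=$(rank_to_node "$rank" "$nodes")
    exec numactl --cpunodebind="$node" --localalloc "$@"
}

# Example, inside a wrapper script started once per rank:
# numa_rank_exec 2 ./your_application
```

Round-robin is an assumption here. If neighbouring ranks exchange a lot of data, mapping them in contiguous blocks per node may work better.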
One practical note: if you pin to a single node, leave 2–4 GB of headroom on it. A node running close to full triggers reclaim, which costs you the latency you were trying to save.
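A quick pre-flight check for that headroom can read the node's meminfo file in sysfs. This is a sketch with a made-up function name and a default headroom of 4 GiB. Note that MemFree excludes reclaimable page cache, so the result is on the conservative side.

```shell
# Check whether a node has room for a working set plus headroom (all in GiB).
# Usage: node_has_room NODE WORKING_SET_GIB [HEADROOM_GIB] [meminfo_file]
node_has_room() (
    node="$1"; need="$2"; headroom="${3:-4}"
    file="${4:-/sys/devices/system/node/node$node/meminfo}"
    awk -v need="$need" -v hr="$headroom" '
        $3 == "MemFree:" { free_gib = $4 / 1048576 }   # value is reported in kB
        END {
            printf "free=%.1f GiB, required=%.1f GiB\n", free_gib, need + hr
            if (free_gib >= need + hr) exit 0
            exit 1
        }' "$file"
)

# Example: node_has_room 0 24 && numactl --cpunodebind=0 --membind=0 ./your_application
```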
BIOS and kernel settings to check
Tool output is only as accurate as the topology the firmware exposes. A few settings to confirm:
- Node Interleaving: disable it. When enabled, the BIOS presents all memory as a single flat pool and hides NUMA from the OS entirely. numactl --hardware will show one node on a multi-socket box if this is the case.
- Sub-NUMA Clustering (Intel) or Nodes Per Socket (AMD): enable on high-core-count processors when you want finer locality. Confirm with lscpu after reboot.
- vm.zone_reclaim_mode: set to 0 for most production servers. A non-zero value aggressively reclaims local memory rather than allocating remotely, which can evict useful page cache.
- kernel.numa_balancing: leave on for general-purpose workloads, turn off when you're manually pinning. The auto-balancer will migrate pages and threads in ways that conflict with your policy.
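The checks above can be rolled into one short report. This sketch reads the two sysctls straight from /proc/sys and counts the node directories in sysfs; the function name and the overridable root paths are assumptions added for illustration and testing.

```shell
# Report NUMA-related kernel settings and the number of visible nodes.
# Optional arguments override the /proc/sys and sysfs roots for testing.
numa_settings_report() (
    procsys="${1:-/proc/sys}"
    nodes="${2:-/sys/devices/system/node}"
    count=0
    for d in "$nodes"/node[0-9]*; do
        [ -d "$d" ] && count=$((count + 1))
    done
    echo "visible_nodes=$count"
    echo "numa_balancing=$(cat "$procsys/kernel/numa_balancing" 2>/dev/null || echo n/a)"
    echo "zone_reclaim_mode=$(cat "$procsys/vm/zone_reclaim_mode" 2>/dev/null || echo n/a)"
)

# Example: numa_settings_report
```

A visible_nodes value of 1 on a multi-socket machine points back at the Node Interleaving setting.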
If you're running NUMA tuning on bare metal where you control the BIOS, kernel parameters, and IRQ affinity, you can apply all of the above without working around hypervisor abstractions. That's the main reason this kind of work is easier on dedicated hardware than in cloud VMs.
For multi-socket dedicated servers with full root access, see FDC's dedicated servers.
