
NUMA Awareness and CPU Pinning for Dedicated Servers

16 min read - June 16, 2026

Table of contents
  • NUMA awareness and CPU pinning for dedicated servers
  • How NUMA works on multi-socket servers
  • Inspecting NUMA topology on Linux
  • CPU pinning and memory policies
  • Picking a strategy by workload
  • BIOS and kernel settings to check

How to inspect NUMA topology and pin Linux workloads to the right cores and memory. Covers numactl, taskset, systemd, BIOS settings, and workload-specific strategies.

NUMA awareness and CPU pinning for dedicated servers

On any multi-socket server, where a process runs and where its memory lives are two different questions, and getting them out of sync is one of the easiest ways to leave performance on the table. NUMA awareness and CPU pinning are the two knobs that fix this. This post covers how NUMA works, how to inspect it on Linux, and how to pin workloads correctly for databases, AI training, and latency-sensitive services.

How NUMA works on multi-socket servers

A NUMA (Non-Uniform Memory Access) node is a group of CPU cores bound to a local block of RAM through a dedicated memory controller. On a two-socket server you usually have two nodes. Any core can read any address, but local access is roughly 80 ns while a cross-socket hop over Intel's UPI or AMD's Infinity Fabric is around 130–150 ns. On larger systems with more sockets, the worst-case node can push past 250 ns.
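As a rough back-of-the-envelope illustration, you can weight the two latencies by the share of accesses that go remote. The sketch below assumes the 80 ns local figure and a 140 ns remote figure from the range above; the function name is made up for this example, and real workloads are also shaped by caches and prefetching, so treat the result as an intuition aid rather than a prediction.

```shell
# avg_latency_ns LOCAL_NS REMOTE_NS REMOTE_FRACTION
# Weighted average of local and remote latency (hypothetical helper name).
avg_latency_ns() {
    LC_ALL=C awk -v l="$1" -v r="$2" -v f="$3" \
        'BEGIN { printf "%.1f\n", (1 - f) * l + f * r }'
}

# Half of all accesses remote, using 80 ns local and 140 ns remote:
avg_latency_ns 80 140 0.5    # prints 110.0
```

In this simplified model, a thread whose accesses are split evenly between nodes already sees an average latency almost 40% higher than the local figure.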

Bandwidth follows the same pattern. A two-socket Sapphire Rapids system has a theoretical peak of roughly 600 GB/s (eight DDR5-4800 channels per socket) when cores hit local memory, but the inter-socket link carries only a fraction of that, so traffic crossing it bottlenecks fast. High-core-count processors make this more granular: Intel's Sub-NUMA Clustering (SNC) and AMD's Nodes Per Socket (NPS) split each socket into multiple NUMA domains, so a "two-socket" box can easily present four or eight nodes to Linux.

Without NUMA awareness, the Linux scheduler will happily migrate a thread between sockets while its working set stays on the original node. Every subsequent access becomes a remote one. The visible symptom is high CPU utilisation with low actual throughput, because the cores are spending their time waiting on memory. I/O devices make this worse. A GPU or NIC is attached to a specific PCIe root, which belongs to one NUMA node. If the process feeding it runs on the other socket, every DMA transfer crosses the interconnect.

Inspecting NUMA topology on Linux

Four tools cover almost everything you need:

  • lscpu for a quick socket and node summary.
  • numactl --hardware for node memory totals and the inter-node distance matrix.
  • numastat for per-node allocation counters, and numastat -p for a per-process memory breakdown.
  • lstopo (from hwloc) for cache hierarchy and PCIe device locality.

Start with numactl --hardware. It lists each node, the cores and memory belonging to it, and the distance matrix. A value of 10 means local; anything higher is remote, typically 20 or more across sockets. If you see a single node on a multi-socket box, your BIOS has Node Interleaving enabled and is hiding the topology. Fix that first (see below).
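If numactl isn't installed, the kernel exposes the same topology under /sys/devices/system/node. Here is a minimal sketch that prints each node's CPU list and its row of the distance matrix. The function name is hypothetical, and the optional argument exists only so the function can be pointed at a different directory.

```shell
# print_numa_nodes [SYSFS_NODE_DIR]
# One line per NUMA node: node name, CPU list, and its distance row.
print_numa_nodes() {
    local root="${1:-/sys/devices/system/node}"
    local dir
    for dir in "$root"/node[0-9]*; do
        [ -d "$dir" ] || continue          # skip if the glob matched nothing
        printf '%s cpus=%s distances=%s\n' \
            "$(basename "$dir")" \
            "$(cat "$dir/cpulist")" \
            "$(tr ' ' ',' < "$dir/distance")"
    done
}

# Example: print_numa_nodes
```

A single output line on a multi-socket machine points to the same Node Interleaving problem described above.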

For a specific process, numastat -p <PID> shows how many megabytes of its memory sit on each node. Run numastat with no arguments for the kernel's per-node allocation counters, which are system-wide rather than per-process. Four of them matter:

  • numa_hit: allocations that were intended for this node and landed on it. You want this high.
  • numa_miss: allocations that landed on this node because the intended node was full.
  • numa_foreign: allocations intended for this node that had to go elsewhere. A rising count indicates memory pressure on the node.
  • other_node: allocations on this node made by a process running on a different node. High values here are the classic sign of bad pinning.
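The kernel also exposes these counters per node in /sys/devices/system/node/nodeN/numastat, one "name value" pair per line. As a rough sketch, the helper below (hypothetical name) turns local_node and other_node into a percentage of allocations that came from CPUs on another node. The counters are cumulative since boot, so it's usually more telling to compare two readings taken a few minutes apart.

```shell
# remote_alloc_pct NUMASTAT_FILE
# Share of allocations on this node that were made from CPUs on another node.
remote_alloc_pct() {
    LC_ALL=C awk '
        $1 == "local_node" { l = $2 }
        $1 == "other_node" { o = $2 }
        END {
            if (l + o > 0) printf "%.1f\n", 100 * o / (l + o)
            else           print "0.0"
        }
    ' "$1"
}

# Example: remote_alloc_pct /sys/devices/system/node/node0/numastat
```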

For GPU or NIC workloads, run lstopo-no-graphics and look at which NUMA node each PCIe device is attached to. If the cores driving the device are on the other node, that's the first thing to fix.
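For a scriptable version of the same check, sysfs records the NUMA node of each PCIe device in a numa_node file. A minimal sketch, assuming you already know the device's PCI address (lspci -D lists them); the function name and the example address are placeholders.

```shell
# pci_numa_node PCI_ADDRESS [SYSFS_PCI_DIR]
# Prints the NUMA node a PCIe device is attached to.
# The kernel reports -1 when the platform does not expose the locality.
pci_numa_node() {
    local addr="$1" root="${2:-/sys/bus/pci/devices}"
    local file="$root/$addr/numa_node"
    if [ ! -r "$file" ]; then
        echo "no such device: $addr" >&2
        return 1
    fi
    cat "$file"
}

# Example (hypothetical address): pci_numa_node 0000:3b:00.0
```

For a network interface, /sys/class/net/<interface>/device/numa_node gives the same value without needing the PCI address.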

CPU pinning and memory policies

CPU pinning (or CPU affinity) binds a process to specific cores so the scheduler can't migrate it. By itself that's not enough, because Linux uses a first-touch memory policy by default: a page is allocated on the node of whichever CPU first touches it. If a thread starts on the wrong node before it gets pinned, the memory it has already touched stays there. You need to control both placement and allocation together.

Three tools cover the common cases:

| Tool | Controls | Use for |
| --- | --- | --- |
| taskset | CPU cores only | Quick one-off binding of an existing process |
| numactl | CPU cores and memory | Launching workloads with strict locality |
| systemd | CPU cores and memory, persistent | Services that need pinning across reboots |

numactl's four most commonly used memory policies are:

  • --membind=N: allocate only on node N, fail if full.
  • --preferred=N: prefer node N, fall back to others if needed.
  • --interleave=all: round-robin across nodes for even bandwidth distribution.
  • --localalloc: allocate on whichever node the running CPU is on.
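To make the four options concrete, the sketch below prints the full numactl command line for each policy without running anything. The function name and the short policy keywords are made up for this example; only the numactl flags themselves come from the list above.

```shell
# numa_cmd POLICY NODE COMMAND [ARGS...]
# Prints (does not run) the numactl command line for a given policy.
numa_cmd() {
    local policy="$1" node="$2"
    shift 2
    case "$policy" in
        bind)       echo "numactl --cpunodebind=$node --membind=$node $*" ;;
        preferred)  echo "numactl --cpunodebind=$node --preferred=$node $*" ;;
        interleave) echo "numactl --interleave=all $*" ;;
        local)      echo "numactl --cpunodebind=$node --localalloc $*" ;;
        *)          echo "unknown policy: $policy" >&2; return 1 ;;
    esac
}

# Example: numa_cmd preferred 1 ./your_application
```

Printing the command first is a cheap way to review the placement before launching a long-running job with it.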

Pinning a workload to one node

First, identify which cores belong to your target node:

numactl --hardware

Then launch the application bound to that node for both cores and memory:

numactl --cpunodebind=0 --membind=0 ./your_application

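For an already-running process, adjust CPU affinity with taskset. This only changes where the threads run, so pages that were already allocated stay on their current node. Without -a (--all-tasks), taskset changes only the thread whose ID you pass, so add it for multi-threaded processes: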

taskset -cp 0-7 <PID>

To make it survive a reboot, set it in the systemd unit:

[Service]
CPUAffinity=0-7
NUMAPolicy=bind
NUMAMask=0

Reload and restart:

sudo systemctl daemon-reload && sudo systemctl restart <service>

When you're manually pinning, turn off the kernel's auto-balancer so it doesn't fight your placement:

sudo sysctl -w kernel.numa_balancing=0

Add kernel.numa_balancing=0 to /etc/sysctl.conf to persist. Then verify with numastat -p <PID> over a couple of minutes of real workload. If nearly all of the process's memory sits on the node you bound it to, the pinning is taking effect.
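If you'd rather check placement from a script, /proc/<PID>/numa_maps lists every mapping of a process with N<node>=<pages> tokens. The sketch below sums those tokens per node; the function name is hypothetical. Counts are in pages of each mapping's own page size, so treat the totals as approximate if the process uses huge pages, and note that reading another user's numa_maps needs root.

```shell
# pages_per_node NUMA_MAPS_FILE
# Sums the N<node>=<pages> tokens of a numa_maps file, one line per node.
pages_per_node() {
    LC_ALL=C awk '
        {
            for (i = 2; i <= NF; i++)
                if ($i ~ /^N[0-9]+=[0-9]+$/) {
                    split($i, kv, "=")
                    pages[kv[1]] += kv[2]
                }
        }
        END { for (n in pages) print n, pages[n] }
    ' "$1" | sort
}

# Example: pages_per_node /proc/<PID>/numa_maps
```

After a clean bind to node 0, almost all pages should show up under N0.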

Picking a strategy by workload

The right policy depends on whether your workload benefits more from low latency or from aggregate bandwidth across all nodes.

| Workload | Policy | Why |
| --- | --- | --- |
| Databases (PostgreSQL, MySQL, SQL Server) | --cpunodebind + --membind | Large shared buffers, latency-sensitive query paths |
| In-memory cache (Redis, Memcached) | Single-node bind | Everything is RAM access, remote latency shows up immediately |
| AI/ML training and inference | Bind to the GPU's NUMA node | Keeps host-to-GPU transfers off the inter-socket link |
| Analytics (Spark, Elasticsearch) | --interleave=all | Large working set needs bandwidth across all nodes |
| Latency-sensitive APIs, trading | Strict pin + IRQ affinity | Predictability matters more than peak throughput |
| Network-heavy (RoCEv2, InfiniBand) | Pin to NIC's NUMA node, dedicate cores for IRQs | Keeps interrupt processing local and out of the way of app threads |
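For the IRQ affinity rows, one possible approach is to look up a device's interrupt numbers in /proc/interrupts and write a CPU list to /proc/irq/<N>/smp_affinity_list. The sketch below assumes the device's name appears in its /proc/interrupts lines; both function names and the eth0 interface are placeholders. Writing the affinity needs root, and a running irqbalance daemon may overwrite it.

```shell
# irqs_for_device NAME [INTERRUPTS_FILE]
# Prints the IRQ numbers whose /proc/interrupts line mentions NAME.
irqs_for_device() {
    awk -v name="$1" '
        $1 ~ /^[0-9]+:$/ && index($0, name) { sub(/:/, "", $1); print $1 }
    ' "${2:-/proc/interrupts}"
}

# pin_irqs CPULIST IRQ...
# Restricts each listed IRQ to the given CPUs (needs root).
pin_irqs() {
    local cpus="$1" irq
    shift
    for irq in "$@"; do
        echo "$cpus" > "/proc/irq/$irq/smp_affinity_list"
    done
}

# Example (hypothetical interface and cores):
#   pin_irqs 0-3 $(irqs_for_device eth0)
```

Pick cores on the NIC's NUMA node that the application threads are not pinned to, so interrupt handling stays local without competing with them.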

For GPU workloads specifically, run lstopo to find which NUMA node the GPU sits on, then launch the training or inference process with numactl --cpunodebind=N --membind=N for that same N. This is one of the easiest wins on a multi-socket GPU server, because the default scheduler placement is often wrong.

For HPC and MPI workloads that span both sockets, pin each rank to a single node with --localalloc rather than interleaving everything. Each rank gets local memory, and the parallelism happens at the rank level.
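One way to do this is a small wrapper that each rank runs through. The sketch below assumes Open MPI, which sets OMPI_COMM_WORLD_LOCAL_RANK for every rank, and assigns ranks to nodes round-robin. The function names and the wrapper idea are illustrative, and many MPI launchers have built-in binding options that may serve you better.

```shell
# node_for_rank LOCAL_RANK NUM_NODES
# Round-robin mapping of a node-local rank to a NUMA node.
node_for_rank() {
    echo $(( $1 % $2 ))
}

# run_rank_local NUM_NODES COMMAND [ARGS...]
# Meant as the body of a per-rank wrapper script: binds the rank's CPUs
# to one node and lets memory follow via --localalloc.
run_rank_local() {
    local nodes="$1" node
    shift
    node=$(node_for_rank "${OMPI_COMM_WORLD_LOCAL_RANK:-0}" "$nodes")
    exec numactl --cpunodebind="$node" --localalloc "$@"
}

# Example inside a hypothetical wrapper.sh on a two-node machine:
#   run_rank_local 2 "$@"
```

With two nodes, local ranks 0, 2, 4 land on node 0 and ranks 1, 3, 5 on node 1.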

One practical note: if you pin to a single node, leave 2–4 GB of headroom on it. A node running close to full triggers reclaim, which costs you the latency you were trying to save.
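A quick way to check that headroom is the per-node meminfo file in sysfs. The sketch below reads MemFree for a node and compares it with a minimum in MiB; the function names are hypothetical. MemFree does not count reclaimable page cache, so it's a conservative reading.

```shell
# node_free_mib NODE_MEMINFO_FILE
# Free memory on a node in MiB, from lines like "Node 0 MemFree: 4194304 kB".
node_free_mib() {
    awk '$3 == "MemFree:" { print int($4 / 1024) }' "$1"
}

# has_headroom NODE_MEMINFO_FILE MIN_MIB
# Succeeds when the node has at least MIN_MIB free.
has_headroom() {
    [ "$(node_free_mib "$1")" -ge "$2" ]
}

# Example:
#   has_headroom /sys/devices/system/node/node0/meminfo 4096 || echo "node 0 is tight"
```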

BIOS and kernel settings to check

Tool output is only as accurate as the topology the firmware exposes. A few settings to confirm:

  • Node Interleaving: disable it. When enabled, the BIOS presents all memory as a single flat pool and hides NUMA from the OS entirely. numactl --hardware will show one node on a multi-socket box if this is the case.
  • Sub-NUMA Clustering (Intel) or Nodes Per Socket (AMD): enable on high-core-count processors when you want finer locality. Confirm the new node count in lscpu after reboot.
  • vm.zone_reclaim_mode: set to 0 for most production servers. A non-zero value aggressively reclaims local memory rather than allocating remotely, which can evict useful page cache.
  • kernel.numa_balancing: leave on for general-purpose workloads, turn off when you're manually pinning. The auto-balancer will migrate pages and threads in ways that conflict with your policy.
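Both kernel settings can be read straight from /proc/sys. A minimal sketch that prints the current values (the function name is hypothetical, and the optional argument only exists so it can be pointed at another directory); it prints n/a when a file is missing, for example on a kernel built without automatic NUMA balancing.

```shell
# check_numa_sysctls [PROC_SYS_DIR]
# Prints the current values of the two NUMA-related sysctls.
check_numa_sysctls() {
    local root="${1:-/proc/sys}"
    printf 'kernel.numa_balancing=%s\n' \
        "$(cat "$root/kernel/numa_balancing" 2>/dev/null || echo "n/a")"
    printf 'vm.zone_reclaim_mode=%s\n' \
        "$(cat "$root/vm/zone_reclaim_mode" 2>/dev/null || echo "n/a")"
}

# Example: check_numa_sysctls
```

On a manually pinned production server you would expect both values to be 0.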

If you're running NUMA tuning on bare metal where you control the BIOS, kernel parameters, and IRQ affinity, you can apply all of the above without working around hypervisor abstractions. That's the main reason this kind of work is easier on dedicated hardware than in cloud VMs.

For multi-socket dedicated servers with full root access, see FDC's dedicated servers.
