
strace and perf: Linux troubleshooting cheat sheet

13 min read - June 4, 2026

Table of contents
  • strace and perf for Linux troubleshooting
  • When to use strace vs perf
  • Installing strace and perf
  • Tracing system calls with strace
  • Profiling CPU with perf
  • A practical workflow on a live server

When to use strace vs perf on Linux, the commands you'll actually run, and how to keep overhead low when debugging a busy production server.

strace and perf for Linux troubleshooting

When a Linux server is slow, crashing, or burning CPU and the application logs don't explain why, two tools cover most of the gap. strace tells you what a process is asking the kernel for. perf tells you where the CPU is spending its time. Together they answer the "why is it stuck" and "what is it doing" questions that nothing else handles as cheaply.

This post covers when to reach for each tool, how to install them, the commands you'll actually run, and how to keep overhead manageable on a live server.


 

When to use strace vs perf

The split is simple. Use perf when the CPU is busy and you need to know which function is responsible. Use strace when a process is hanging, crashing, returning weird errors, or behaving in a way the logs don't explain.

perf samples the CPU's hardware performance counters through the kernel's perf_events interface at a configurable frequency, so at modest sampling rates overhead is usually around 1% or less and generally safe to run in production. strace intercepts every system call through ptrace, which can slow a syscall-heavy process by 10x to 100x. Use it sparingly on live systems, and always with filters.

| Symptom | Start with | Follow up with |
|---|---|---|
| High CPU usage | perf top or perf record -g | strace -c on the hot process |
| Slow disk or I/O wait | perf stat for cache misses | strace -e trace=file |
| Process hang or silent error | strace -e trace=file,network | perf stat to rule out CPU pressure |
| Lock contention or slow API | strace -c, watch for futex | perf record -g |

Installing strace and perf

Both tools are in the standard repositories. strace relies on the ptrace syscall, which has been part of every modern kernel for years. perf uses the perf_events interface and needs a package that matches your running kernel.

On Ubuntu:

sudo apt install strace linux-tools-common linux-tools-$(uname -r)

On Debian, perf ships in the linux-perf package instead:

sudo apt install strace linux-perf

On RHEL, AlmaLinux, or Fedora:

sudo dnf install strace perf

The $(uname -r) bit matters. A perf binary built against a different kernel version produces confusing output and may silently drop events. Verify with perf --version and strace -V after installing.
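One way to check the match is to compare the major.minor part of the running kernel with the version perf reports. The sketch below assumes perf --version prints a line such as "perf version 6.8.12"; the helper names major_minor and versions_match are made up for illustration.

```shell
# Extract "major.minor" from a version string such as "6.8.0-45-generic"
# or "perf version 6.8.12". Helper names here are illustrative only.
major_minor() {
  printf '%s\n' "$1" | sed -E 's/^[^0-9]*([0-9]+\.[0-9]+).*$/\1/'
}

# Succeeds (exit status 0) when both strings share the same major.minor.
versions_match() {
  [ "$(major_minor "$1")" = "$(major_minor "$2")" ]
}

# Usage on a real host (assumes perf is installed):
#   versions_match "$(uname -r)" "$(perf --version)" \
#     && echo "perf matches the running kernel" \
#     || echo "perf/kernel mismatch: install the matching package"
```

Comparing only major.minor is a deliberate simplification: distribution kernels append their own patch and build suffixes, so an exact string match would fail even on a correct install.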

For perf to show function names instead of hex addresses, you need debug symbols. Install the relevant -dbg or -debuginfo package (for example, libc6-dbg on Debian), and compile your own binaries with -g in GCC.

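Inside Docker containers, strace needs --cap-add=SYS_PTRACE and perf needs --cap-add=SYS_ADMIN on the docker run command line. On kernel 5.8 and later the narrower CAP_PERFMON capability exists for perf, but whether it is enough depends on the container's seccomp profile. Without these capabilities the tools fail in ways that look like bugs.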
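As a concrete form of those flags, here is a small sketch that prints (rather than runs) a docker run line with both capabilities. The function name docker_debug_cmd and the image name in the example are placeholders, not anything from a real deployment.

```shell
# Print (do not run) a docker command line that grants the two
# capabilities named in the text. The image name is supplied by the caller.
docker_debug_cmd() {
  printf 'docker run --rm -it --cap-add=SYS_PTRACE --cap-add=SYS_ADMIN %s\n' "$1"
}

# Example (my-app:latest is a hypothetical image):
#   docker_debug_cmd my-app:latest
```

SYS_ADMIN is a broad capability, so it is worth granting it only to a short-lived debugging container and not to the long-running service.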

Tracing system calls with strace

Running strace command traces a process from launch. To attach to a running process, use strace -p PID. For anything multi-threaded or that forks, add -f to follow children, or you'll miss most of the activity.

Output lines end with a return value. -1 ENOENT means the file the process asked for isn't there. -1 EACCES means permissions. Those two errors account for a surprising share of production bugs on their own.
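To pull just those two errors out of a noisy trace, a plain grep on the return value is usually enough. This is a minimal sketch; the function name missing_or_denied is made up, and PID in the usage comment is a placeholder.

```shell
# Keep only strace lines whose return value is -1 ENOENT or -1 EACCES.
# Reads strace output on stdin.
missing_or_denied() {
  grep -E ' = -1 (ENOENT|EACCES) '
}

# Usage on a real host (strace writes to stderr, hence 2>&1):
#   strace -f -p "$PID" 2>&1 | missing_or_denied
```

A matching line looks like openat(AT_FDCWD, "/etc/app/config.yaml", O_RDONLY) = -1 ENOENT (No such file or directory), where the path is an invented example.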

The most useful flag is -e trace=GROUP, which limits output to a named syscall category and keeps the noise manageable.

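| Group | Calls included | What it's good for |
|---|---|---|
| file | openat, stat, access, unlink (calls that take a path) | Missing configs, permission errors |
| desc | read, write, close, fstat (calls that take a file descriptor) | Slow I/O on files that are already open |
| network | socket, connect, bind, recvfrom | Connection refused, DNS failures, TLS issues |
| process | execve, clone, wait4 | Crashes, fork storms, missing binaries |
| futex | futex (a single syscall, not a class) | Lock contention and thread stalls |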

On a busy server, start with strace -c -p PID for ten or twenty seconds. The -c flag prints a summary of syscall counts, total time, and errors when you detach. That tells you which category is worth a closer look without flooding the terminal. Then re-attach with a narrow filter.
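The summary is a fixed-width table, so ranking it takes one line of awk. The sketch below assumes the usual strace -c layout, with "% time" in the first column and the syscall name in the last; top_syscalls is a made-up helper name, and PID and the /tmp path are placeholders.

```shell
# Print "syscall %time" for the top N rows of an `strace -c` summary read
# from stdin (N defaults to 3). Data rows start with a number; the header,
# the dashed rules, and the final "total" row are skipped.
top_syscalls() {
  awk '$1 ~ /^[0-9.]+$/ && $NF != "total" { print $NF, $1 }' |
    sort -k2,2nr | head -n "${1:-3}"
}

# Usage on a real host. SIGINT makes strace detach and write its summary:
#   timeout -s INT 15 strace -c -f -p "$PID" -o /tmp/strace-summary.txt
#   top_syscalls 3 < /tmp/strace-summary.txt
```

Wrapping the capture in timeout bounds how long the process stays traced, which matters more on production than the exact duration.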

Other flags worth knowing: -T logs the time spent in each call, -Z shows only failed calls, and -o file writes to a log instead of the terminal, which is much faster on a noisy process.
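With -T, each line ends in the call duration inside angle brackets, which makes it easy to filter for slow calls afterwards. A minimal sketch, assuming that trailing <seconds> format; slow_calls is a hypothetical helper, and PID and the log path are placeholders.

```shell
# From `strace -T` output on stdin, print only calls that took at least
# MIN seconds (default 0.1). -T appends the duration as <seconds>.
slow_calls() {
  awk -v min="${1:-0.1}" '
    match($0, /<[0-9]+\.[0-9]+>$/) {
      secs = substr($0, RSTART + 1, RLENGTH - 2)   # strip the < and >
      if (secs + 0 >= min + 0) print
    }'
}

# Usage on a real host:
#   strace -f -T -o /tmp/trace.log -p "$PID"    # Ctrl-C to detach
#   slow_calls 0.5 < /tmp/trace.log
```

Filtering the log after the fact keeps the trace itself cheap: strace only has to write to a file while attached.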

Profiling CPU with perf

perf has four commands you'll use most of the time.

| Command | What it does | Common flags |
|---|---|---|
| perf stat | Snapshot of counters: cycles, cache misses, context switches | -e, -p, -a |
| perf top | Live view of the hottest functions on the system | --sort comm,dso,symbol |
| perf record | Captures samples to perf.data for offline analysis | -F, -g, -p |
| perf report | Reads perf.data, ranks functions by sample share | --stdio, --sort |

Start with perf stat -p PID for a quick overview. The numbers to watch:

  • IPC (instructions per cycle) below 1.0: the CPU is stalling, usually on memory access.
  • High LLC-load-misses: the working set doesn't fit in cache, so the CPU is waiting on RAM.
  • High context-switch counts: classic for I/O-bound workloads where threads keep blocking on disk or network.
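IPC is simply instructions divided by cycles. perf stat normally prints it as "insn per cycle" when both counters are collected, so the helper below (a made-up name, ipc) only shows the arithmetic. The event list in the usage comment assumes your CPU exposes LLC-load-misses, which is not always the case inside VMs.

```shell
# Compute instructions per cycle from two raw counter values.
# Arguments: INSTRUCTIONS CYCLES (plain integers, no thousands separators).
ipc() {
  awk -v i="$1" -v c="$2" 'BEGIN {
    if (c + 0 == 0) { print "n/a"; exit }   # avoid dividing by zero
    printf "%.2f\n", i / c
  }'
}

# Usage on a real host (PID is a placeholder):
#   perf stat -e cycles,instructions,LLC-load-misses,context-switches \
#     -p "$PID" -- sleep 10
#   ipc 1200000 2000000
```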

If something looks off, follow up with perf record -F 99 -g -p PID -- sleep 30. The -F 99 samples at 99 Hz, which is enough to find hot functions and avoids syncing with the kernel timer at round numbers like 100 Hz. The -g flag captures call graphs so perf report can show you which paths into a function are responsible.

In perf report, the Overhead column is the share of total samples attributed to each symbol. High overhead in _int_malloc points to heavy allocation, and high overhead in memcpy to heavy copying of data. High overhead in one of your own functions is the hotspot you came looking for.

If you see hex addresses instead of function names, the binary is stripped or debug symbols are missing. Install the matching -dbg package, or rebuild the binary with -g.
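For a quick text summary, perf report --stdio can be piped through a small filter that keeps only the percentage and the symbol. This sketch assumes the usual --stdio row layout, where the symbol follows a marker such as [.] or [k]; top_symbols is a hypothetical helper and PID is a placeholder.

```shell
# From `perf report --stdio` output on stdin, print "percent symbol" for
# the first N symbol rows (N defaults to 5). Comment lines and call-chain
# lines are skipped because their first field is not a bare percentage.
top_symbols() {
  awk '$1 ~ /^[0-9.]+%$/ {
    sym = $0
    sub(/^.*\[.\] /, "", sym)   # drop everything up to the [.] or [k] marker
    print $1, sym
  }' | head -n "${1:-5}"
}

# Usage on a real host:
#   perf record -F 99 -g -p "$PID" -- sleep 30
#   perf report --stdio --no-children -g none | top_symbols 5
```

In the usage line, --no-children makes the first column each symbol's own share, and -g none suppresses the call-chain tree so the output stays flat.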

A practical workflow on a live server

For a real incident on a busy box, the routine looks like this:

  1. Confirm the symptom with cheap tools first: top, vmstat, iostat. If you're on a VM, check the st (steal) column. Sustained values above roughly 5% suggest the hypervisor host is contended, so the bottleneck may not be your code.
  2. If CPU is high, run perf top for a few seconds, then perf record -F 99 -g -p PID -- sleep 30 for the offending process. A 30 second capture at 99 Hz is at most about 2,970 samples per busy CPU, so perf.data stays small: around 1.7 MB is a reasonable ballpark, though the size grows with stack depth and thread count.
  3. If the process is hung, slow, or returning errors, run strace -c -p PID for ten seconds and read the summary. If one syscall category dominates, narrow with strace -e trace=GROUP -T -p PID.
  4. When you've found the suspect syscall or function, detach. Don't leave either tool running on production any longer than you have to.
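The steal check in step 1 can be scripted so it does not depend on eyeballing columns. A minimal sketch, assuming vmstat prints a header row containing "st"; steal_pct is a made-up helper name.

```shell
# Print the "st" (steal) value from the last sample of vmstat output on
# stdin. The column is located by header name, since its position differs
# between procps versions.
steal_pct() {
  awk '
    { for (i = 1; i <= NF; i++) if ($i == "st") col = i }   # find header column
    col && $1 ~ /^[0-9]+$/ { val = $col }                    # remember latest data row
    END { if (val != "") print val }'
}

# Usage on a real host (the first vmstat sample is an average since boot,
# so take several and read the last one):
#   [ "$(vmstat 1 5 | steal_pct)" -gt 5 ] && echo "steal is high"
```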

Two cautions. strace output can include environment variables, file paths, and bytes read from sockets, so sanitise logs before sharing them outside your team. And if you're going to be doing this regularly, look at bpftrace and the wider eBPF toolkit as the next step: the same kind of visibility at far lower overhead than strace, and designed for production use from the start.
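A crude first pass at sanitising a trace is to blank out every quoted string, since that is where paths, environment entries, and socket payloads appear. This is only a sketch (redact_strings is a hypothetical helper): numeric arguments such as file descriptors and sizes stay visible, so review the result before sharing it.

```shell
# Replace the contents of every double-quoted string in strace output
# with a fixed marker. Escaped quotes inside a string are handled.
redact_strings() {
  sed -E 's/"([^"\\]|\\.)*"/"<redacted>"/g'
}

# Usage (the log path is a placeholder):
#   redact_strings < /tmp/trace.log > /tmp/trace.redacted.log
```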

If you run workloads where deep diagnostic access matters and shared infrastructure isn't an option, take a look at our dedicated servers.
