strace and perf: Linux troubleshooting cheat sheet
13 min read - June 4, 2026

When to use strace vs perf on Linux, the commands you'll actually run, and how to keep overhead low when debugging a busy production server.
strace and perf for Linux troubleshooting
When a Linux server is slow, crashing, or burning CPU and the application logs don't explain why, two tools cover most of the gap. strace tells you what a process is asking the kernel for. perf tells you where the CPU is spending its time. Together they answer the "why is it stuck" and "what is it doing" questions that nothing else handles as cheaply.
This post covers when to reach for each tool, how to install them, the commands you'll actually run, and how to keep overhead manageable on a live server.
When to use strace vs perf
The split is simple. Use perf when the CPU is busy and you need to know which function is responsible. Use strace when a process is hanging, crashing, returning weird errors, or behaving in a way the logs don't explain.
perf samples the CPU's hardware performance counters through the kernel at a configurable frequency, so overhead is usually under 1% and safe to run in production. strace intercepts every system call through ptrace, which can slow a syscall-heavy process by 10x to 100x. Use it sparingly on live systems, and always with filters.
| Symptom | Start with | Follow up with |
|---|---|---|
| High CPU usage | perf top or perf record -g | strace -c on the hot process |
| Slow disk or I/O wait | perf stat for cache misses | strace -e trace=file |
| Process hang or silent error | strace -e trace=file,network | perf stat to rule out CPU pressure |
| Lock contention or slow API | strace -c, watch for futex | perf record -g |
Installing strace and perf
Both tools are in the standard repositories. strace relies on the ptrace syscall, which has been part of every modern kernel for years. perf uses the perf_events interface and needs a package that matches your running kernel.
On Debian or Ubuntu:
sudo apt install strace linux-tools-common linux-tools-$(uname -r)
On RHEL, AlmaLinux, or Fedora:
sudo dnf install strace perf
The $(uname -r) bit matters. A perf binary built against a different kernel version produces confusing output and may silently drop events. Verify with perf --version and strace -V after installing.
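The version check can be wrapped in a small function so it is easy to run on any box. This is only a sketch: check_trace_tools is a hypothetical helper name, and it does no more than print the running kernel next to whatever each tool reports about itself.

```shell
# Print the running kernel and the version line of each tool, if installed.
# check_trace_tools is a hypothetical helper, not part of strace or perf.
check_trace_tools() {
    echo "kernel: $(uname -r)"
    for tool in strace perf; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "$tool: not installed"
            continue
        fi
        # strace prints its version with -V, perf with --version.
        case "$tool" in
            strace) flag=-V ;;
            perf)   flag=--version ;;
        esac
        echo "$tool: $("$tool" "$flag" 2>&1 | head -n 1)"
    done
}
```

Run check_trace_tools and compare the perf version against the kernel line by eye; a mismatch is the cue to reinstall the matching package.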
For perf to show function names instead of hex addresses, you need debug symbols. Install the relevant -dbg or -debuginfo package (for example, libc6-dbg on Debian), and compile your own binaries with -g in GCC.
Inside containers, strace needs --cap-add=SYS_PTRACE and perf needs --cap-add=SYS_ADMIN when launching with Docker. Without these caps the tools fail in ways that look like bugs.
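One way to keep those capability flags straight is a helper that prints the docker run line for each tool instead of executing it. The function name docker_debug_cmd and the image argument are hypothetical; the capabilities are the two named above.

```shell
# Print (do not run) a docker command line with the capability a tool needs.
# docker_debug_cmd is a hypothetical helper; IMAGE is whatever you debug in.
docker_debug_cmd() {
    case "$1" in
        strace) cap=SYS_PTRACE ;;
        perf)   cap=SYS_ADMIN ;;
        *)
            echo "usage: docker_debug_cmd strace|perf IMAGE" >&2
            return 2
            ;;
    esac
    echo "docker run --rm -it --cap-add=$cap $2"
}
```

For example, docker_debug_cmd strace debian:stable prints a command you can review before pasting it into a shell.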
Tracing system calls with strace
Running strace command traces a process from launch. To attach to a running process, use strace -p PID. For anything multi-threaded or that forks, add -f to follow children, or you'll miss most of the activity.
Output lines end with a return value. -1 ENOENT means the file the process asked for isn't there. -1 EACCES means permissions. Those two errors account for a surprising share of production bugs on their own.
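Once a trace has been saved to a file, those two errors can be tallied per path with grep-style filtering. The sketch below assumes a log in strace's usual call(args) = result format; failed_paths is a hypothetical helper name, and it only looks at the first quoted argument on each line.

```shell
# Count ENOENT and EACCES failures per path in a saved strace log.
# failed_paths is a hypothetical helper; $1 is a log such as one written by
# strace -o. Only the first quoted argument on each line is treated as the path.
failed_paths() {
    awk -F'"' '/ = -1 (ENOENT|EACCES) / && NF >= 3 {
        err = ($0 ~ / = -1 ENOENT /) ? "ENOENT" : "EACCES"
        print err, $2
    }' "$1" | sort | uniq -c | sort -rn
}
```

The busiest failing path ends up on the first line, which is usually the missing config or unreadable file worth checking first.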
The most useful flag is -e trace=GROUP, which limits output to a named syscall category and keeps the noise manageable.
| Group | Calls included | What it's good for |
|---|---|---|
| file | openat, stat, access, unlink (calls that take a path; read and write are not included) | Missing configs, permission errors, slow path lookups |
| network | socket, connect, bind, recvfrom | Connection refused, DNS failures, TLS issues |
| process | execve, clone, wait4 | Crashes, fork storms, missing binaries |
| futex | futex | Lock contention and thread stalls |
On a busy server, start with strace -c -p PID for ten or twenty seconds. The -c flag prints a summary of syscall counts, total time, and errors when you detach. That tells you which category is worth a closer look without flooding the terminal. Then re-attach with a narrow filter.
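To avoid forgetting to detach, the summary run can be time-boxed. A minimal sketch, assuming the coreutils timeout command is available and that you have permission to ptrace the target; syscall_summary is a hypothetical wrapper name.

```shell
# Attach to a PID for a fixed window, then detach and print the -c summary.
# syscall_summary is a hypothetical wrapper; the default window is 15 seconds.
# Run it as root or as the user that owns the target process.
syscall_summary() {
    pid=$1
    secs=${2:-15}
    if [ -z "$pid" ]; then
        echo "usage: syscall_summary PID [SECONDS]" >&2
        return 2
    fi
    # timeout sends SIGTERM when the window ends. strace responds by detaching
    # from the target and, with -c, printing its summary table before exiting.
    # -f follows threads and child processes.
    timeout "$secs" strace -c -f -p "$pid"
}
```

Because timeout ends the trace by signal, a non-zero exit status from this wrapper is expected and does not mean the trace failed.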
Other flags worth knowing: -T logs the time spent in each call, -Z shows only failed calls, and -o file writes to a log instead of the terminal, which is much faster on a noisy process.
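Those flags combine naturally into one narrow, logged trace. The wrapper below is a sketch under the same assumptions as the rest of this section: narrow_trace and the default log path are hypothetical, and the accepted names are the four from the table above.

```shell
# Trace one syscall category on a running PID, failed calls only, with
# per-call timing, written to a log file instead of the terminal.
# narrow_trace and the default log path are hypothetical.
narrow_trace() {
    group=$1
    pid=$2
    case "$group" in
        file|network|process|futex) ;;
        *)
            echo "usage: narrow_trace file|network|process|futex PID [LOG]" >&2
            return 2
            ;;
    esac
    if [ -z "$pid" ]; then
        echo "usage: narrow_trace file|network|process|futex PID [LOG]" >&2
        return 2
    fi
    log=${3:-/tmp/strace.$pid.log}
    # -f follows threads, -T adds time per call, -Z keeps only failed calls.
    strace -f -T -Z -e trace="$group" -o "$log" -p "$pid"
}
```

Drop -Z from the command if you need timing for successful calls as well, for example when hunting a slow but working connect.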
Profiling CPU with perf
perf has four commands you'll use most of the time.
| Command | What it does | Common flags |
|---|---|---|
| perf stat | Snapshot of counters: cycles, cache misses, context switches | -e, -p, -a |
| perf top | Live view of the hottest functions on the system | --sort comm,dso,symbol |
| perf record | Captures samples to perf.data for offline analysis | -F, -g, -p |
| perf report | Reads perf.data, ranks functions by sample share | --stdio, --sort |
Start with perf stat -p PID for a quick overview. The numbers to watch:
- IPC (instructions per cycle) below 1.0: the CPU is stalling, usually on memory access.
- High LLC-load-misses: the working set doesn't fit in cache, so the CPU is waiting on RAM.
- High context-switch counts: classic for I/O-bound workloads where threads keep blocking on disk or network.
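perf stat normally prints the instructions-per-cycle figure itself, but the arithmetic behind the first bullet is simple enough to check by hand. The helper below is purely illustrative: ipc is a hypothetical name, and the counter values in the usage note are made up.

```shell
# Instructions per cycle from two raw counter values, to two decimal places.
# ipc is a hypothetical helper: $1 is the instruction count, $2 the cycle count.
ipc() {
    awk -v insns="$1" -v cycles="$2" 'BEGIN {
        if (cycles <= 0) { print "n/a"; exit 1 }
        printf "%.2f\n", insns / cycles
    }'
}
```

With illustrative counts of 1200000000 instructions over 2000000000 cycles, ipc prints 0.60, which is in the stalling range described above.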
If something looks off, follow up with perf record -F 99 -g -p PID -- sleep 30. The -F 99 samples at 99 Hz, which is enough to find hot functions and avoids syncing with the kernel timer at round numbers like 100 Hz. The -g flag captures call graphs so perf report can show you which paths into a function are responsible.
In perf report, the Overhead column is the share of total samples. High overhead in _int_malloc points to heavy allocation, and high overhead in memcpy points to heavy copying. High overhead in one of your own functions is the hotspot you came looking for.
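The record and report steps can be chained so the capture always ends with a readable ranking. This is a sketch, not a fixed recipe: perf_hotspots and the default data file path are hypothetical, and the flags are the ones already discussed plus -o and -i to name the data file.

```shell
# Sample one PID at 99 Hz with call graphs, then print the top of the report.
# perf_hotspots and the default data file path are hypothetical.
perf_hotspots() {
    pid=$1
    secs=${2:-30}
    if [ -z "$pid" ]; then
        echo "usage: perf_hotspots PID [SECONDS] [DATAFILE]" >&2
        return 2
    fi
    data=${3:-/tmp/perf.$pid.data}
    # At 99 Hz, expect roughly 99 * secs samples per thread that stays on-CPU.
    perf record -F 99 -g -p "$pid" -o "$data" -- sleep "$secs" || return 1
    # Plain-text report, grouped by process, binary, and function.
    perf report -i "$data" --stdio --sort comm,dso,symbol | head -n 40
}
```

Keeping the data file means you can rerun perf report later with a different --sort without touching the production process again.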
If you see hex addresses instead of function names, the binary is stripped or debug symbols are missing. Install the matching -dbg package, or rebuild the binary with -g.
A practical workflow on a live server
For a real incident on a busy box, the routine looks like this:
- Confirm the symptom with cheap tools first: top, vmstat, iostat. If you're on a VM, check the st (steal) column. Anything above 5% means the hypervisor is the bottleneck, not your code.
- If CPU is high, run perf top for a few seconds, then perf record -F 99 -g -p PID -- sleep 30 for the offending process. A 30 second capture at 99 Hz produces around 1.7 MB of data.
- If the process is hung, slow, or returning errors, run strace -c -p PID for ten seconds and read the summary. If one syscall category dominates, narrow with strace -e trace=GROUP -T -p PID.
- When you've found the suspect syscall or function, detach. Don't leave either tool running on production any longer than you have to.
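The steal check in the first step can also be done without top by comparing two snapshots of /proc/stat, which is where the st column comes from. The sketch below assumes the standard layout of the aggregate cpu line (user, nice, system, idle, iowait, irq, softirq, steal); steal_between is a hypothetical helper name.

```shell
# Steal time as a percentage of all CPU time between two /proc/stat snapshots.
# steal_between is a hypothetical helper: $1 is the earlier snapshot file,
# $2 the later one. Fields 2..9 of the "cpu" line run from user to steal;
# guest time is already included in user and nice, so it is not added again.
steal_between() {
    awk '$1 == "cpu" {
        t = 0
        for (i = 2; i <= 9; i++) t += $i
        total[++n] = t
        steal[n] = $9
    }
    END {
        dt = total[2] - total[1]
        if (n < 2 || dt <= 0) { print "n/a"; exit 1 }
        printf "%.1f\n", 100 * (steal[2] - steal[1]) / dt
    }' "$1" "$2"
}
```

A typical use is cat /proc/stat > before, wait a few seconds, cat /proc/stat > after, then steal_between before after and compare the result with the 5% threshold.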
Two cautions. strace output can include environment variables, file paths, and bytes read from sockets, so sanitise logs before sharing them outside your team. And if you're going to be doing this regularly, look at bpftrace and the wider eBPF toolkit as the next step: same kind of visibility, sub-1% overhead, built for production from the start.
If you run workloads where deep diagnostic access matters and shared infrastructure isn't an option, take a look at our dedicated servers.
