strace and perf: Linux troubleshooting cheat sheet

13 min read - June 4, 2026

Table of contents

strace and perf for Linux troubleshooting
When to use strace vs perf
Installing strace and perf
Tracing system calls with strace
Profiling CPU with perf
A practical workflow on a live server

When to use strace vs perf on Linux, the commands you'll actually run, and how to keep overhead low when debugging a busy production server.

strace and perf for Linux troubleshooting

When a Linux server is slow, crashing, or burning CPU and the application logs don't explain why, two tools cover most of the gap. strace tells you what a process is asking the kernel for. perf tells you where the CPU is spending its time. Together they answer the "why is it stuck" and "what is it doing" questions that nothing else handles as cheaply.

This post covers when to reach for each tool, how to install them, the commands you'll actually run, and how to keep overhead manageable on a live server.

When to use strace vs perf

The split is simple. Use perf when the CPU is busy and you need to know which function is responsible. Use strace when a process is hanging, crashing, returning weird errors, or behaving in a way the logs don't explain.

perf samples hardware performance counters at a configurable frequency, so overhead is usually under 1% at modest sampling rates and generally safe to run in production. strace intercepts every system call through ptrace, which can slow a syscall-heavy process by 10x to 100x. Use it sparingly on live systems, and always with filters.

Symptom	Start with	Follow up with
High CPU usage	`perf top` or `perf record -g`	`strace -c` on the hot process
Slow disk or I/O wait	`perf stat` for cache misses	`strace -e trace=file`
Process hang or silent error	`strace -e trace=file,network`	`perf stat` to rule out CPU pressure
Lock contention or slow API	`strace -c`, watch for `futex`	`perf record -g`

Installing strace and perf

Both tools are in the standard repositories. strace relies on the ptrace syscall, which has been part of every modern kernel for years. perf uses the perf_events interface and needs a package that matches your running kernel.

On Ubuntu (on Debian, the perf package is named linux-perf rather than linux-tools-*):

sudo apt install strace linux-tools-common linux-tools-$(uname -r)

On RHEL, AlmaLinux, or Fedora:

sudo dnf install strace perf

The $(uname -r) bit matters. A perf binary built against a different kernel version produces confusing output and may silently drop events. Verify with perf --version and strace -V after installing.
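
A quick post-install check could look like the sketch below. check_tool is a hypothetical helper name, not part of either tool; it only reports whether a binary is on PATH, and the version comparison itself is left to you.

```shell
# Hypothetical helper: report whether a tool is on PATH.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "ok: $1 -> $(command -v "$1")"
    else
        echo "missing: $1"
        return 1
    fi
}

# Usage:
#   check_tool strace && strace -V | head -n 1
#   check_tool perf   && perf --version
#   uname -r          # compare with the kernel version perf reports
```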

For perf to show function names instead of hex addresses, you need debug symbols. Install the relevant -dbg or -debuginfo package (for example, libc6-dbg on Debian), and compile your own binaries with -g in GCC.

Inside Docker containers, start the container with --cap-add=SYS_PTRACE for strace and --cap-add=SYS_ADMIN for perf. Without these capabilities the tools fail, usually with a permission error, in ways that look like bugs.
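
As a rough illustration, a throwaway debugging container with both capabilities could be started like this. run_debug_container is a hypothetical wrapper and debug-image is a placeholder image name; SYS_ADMIN is a broad capability, so this is a sketch for short-lived debugging sessions rather than a recommended default.

```shell
# Hypothetical wrapper: start a disposable container with the capabilities
# strace (SYS_PTRACE) and perf (SYS_ADMIN) need.
# Everything after the function name is passed to `docker run` unchanged.
run_debug_container() {
    docker run --rm -it \
        --cap-add=SYS_PTRACE \
        --cap-add=SYS_ADMIN \
        "$@"
}

# Usage ("debug-image" is a placeholder, not a real image):
#   run_debug_container debug-image sh
```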

Tracing system calls with strace

Running strace followed by a command traces that process from launch. To attach to a running process, use strace -p PID. For anything multi-threaded or that forks, add -f to follow threads and children, or you'll miss most of the activity.

Output lines end with a return value. -1 ENOENT means the file the process asked for isn't there. -1 EACCES means permissions. Those two errors account for a surprising share of production bugs on their own.
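
If you saved a trace with -o, a short awk pass can tally the failures by errno. This is a minimal sketch: errno_summary is a hypothetical helper, and it assumes strace's default line layout of name(args) = -1 ERRNO (description).

```shell
# Count failed syscalls by errno in a saved strace log.
# Assumes the default layout: name(args) = -1 ERRNO (description)
errno_summary() {
    awk '
        / = -1 E[A-Z0-9]+ / {
            # Scan from the end of the line so "= -1" inside an argument
            # string is not mistaken for the return value.
            for (i = NF - 1; i > 1; i--)
                if ($i == "-1" && $(i - 1) == "=") { count[$(i + 1)]++; break }
        }
        END { for (e in count) print count[e], e }
    ' "$1" | sort -rn
}

# Usage:
#   strace -f -o trace.log -p PID    # Ctrl-C to detach
#   errno_summary trace.log
```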

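The most useful flag is -e trace=GROUP, which limits output to a named syscall class and keeps the noise manageable. Recent strace versions prefer the %file spelling for classes, and a single syscall name such as futex works in the same position.

Group	Calls included	What it's good for
`file`	`openat`, `stat`, `access`, `unlink`	Missing configs, permission errors, wrong paths
`desc`	`read`, `write`, `close`	Slow I/O on already-open file descriptors
`network`	`socket`, `connect`, `bind`, `recvfrom`	Connection refused, DNS failures, TLS issues
`process`	`execve`, `clone`, `wait4`	Crashes, fork storms, missing binaries
`futex` (a syscall name, not a class)	`futex`	Lock contention and thread stalls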
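
A small wrapper keeps the attach command consistent while you switch filters. trace_group is a hypothetical helper name; the strace flags are the ones described above.

```shell
# Hypothetical helper: attach to a running PID and show only one
# syscall class (or one syscall name) from the table.
trace_group() {
    group=$1
    pid=$2
    strace -f -e "trace=$group" -p "$pid"
}

# Usage (1234 is a placeholder PID):
#   trace_group file 1234
#   trace_group network 1234
#   trace_group futex 1234
```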

On a busy server, start with strace -c -p PID for ten or twenty seconds. The -c flag prints a summary of syscall counts, total time, and errors when you detach. That tells you which category is worth a closer look without flooding the terminal. Then re-attach with a narrow filter.
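
One way to bound that window is to let timeout do the detaching. This is a sketch under the assumption that strace detaches and prints its -c table when it receives SIGINT, which is how it reacts to Ctrl-C; syscall_summary is a hypothetical helper name.

```shell
# Hypothetical helper: collect a syscall summary for a fixed window.
# timeout sends SIGINT when the window ends, the same signal as Ctrl-C,
# so strace detaches from the target and prints the -c summary.
syscall_summary() {
    pid=$1
    seconds=${2:-15}
    timeout -s INT "$seconds" strace -c -f -p "$pid"
}

# Usage (1234 is a placeholder PID):
#   syscall_summary 1234        # 15 second window
#   syscall_summary 1234 20     # 20 second window
```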

Other flags worth knowing: -T logs the time spent in each call, -Z shows only failed calls, and -o file writes to a log instead of the terminal, which is much faster on a noisy process.
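
With -T and -o combined, the log can be ranked by time per call. The sketch below assumes each completed call ends with its duration in angle brackets, as -T prints it; slowest_calls is a hypothetical helper name.

```shell
# Rank syscalls in a `strace -T` log by time spent, slowest first.
# Assumes each completed call ends with its duration, e.g. <0.000123>.
slowest_calls() {
    log=$1
    top=${2:-10}
    awk '
        match($0, /<[0-9]+\.[0-9]+>$/) {
            secs = substr($0, RSTART + 1, RLENGTH - 2)   # strip the < >
            line = substr($0, 1, RSTART - 1)             # the call itself
            sub(/ +$/, "", line)
            print secs "\t" line
        }
    ' "$log" | sort -rn | head -n "$top"
}

# Usage:
#   strace -f -T -o trace.log -p PID    # Ctrl-C to detach
#   slowest_calls trace.log 20
```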

Profiling CPU with perf

perf has four commands you'll use most of the time.

Command	What it does	Common flags
`perf stat`	Snapshot of counters: cycles, cache misses, context switches	`-e`, `-p`, `-a`
`perf top`	Live view of the hottest functions on the system	`--sort comm,dso,symbol`
`perf record`	Captures samples to `perf.data` for offline analysis	`-F`, `-g`, `-p`
`perf report`	Reads `perf.data`, ranks functions by sample share	`--stdio`, `--sort`

Start with perf stat -p PID for a quick overview. The numbers to watch:

IPC (instructions per cycle) below 1.0: the CPU is stalling, usually on memory access.
High LLC-load-misses (request it with -e LLC-load-misses, since it isn't in the default counter set): the working set doesn't fit in cache, so the CPU is waiting on RAM.
High context-switch counts: classic for I/O-bound workloads where threads keep blocking on disk or network.
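
If you want the IPC figure in a script rather than by eye, perf stat can emit CSV with -x. The sketch below assumes the plain CSV layout (counter value in field 1, event name in field 3) and plain event names; on hybrid CPUs the names may be prefixed and would need a different match. ipc_from_csv is a hypothetical helper name.

```shell
# Compute instructions per cycle from `perf stat -x,` CSV output.
# Assumed capture command (PID is a placeholder):
#   perf stat -x, -e cycles,instructions -o stat.csv -p PID -- sleep 10
# CSV columns used here: field 1 = counter value, field 3 = event name.
ipc_from_csv() {
    awk -F, '
        $3 ~ /^cycles/       { cycles = $1 + 0 }   # "+ 0" turns "<not counted>" into 0
        $3 ~ /^instructions/ { insns  = $1 + 0 }
        END {
            if (cycles > 0) printf "IPC %.2f\n", insns / cycles
            else            print "IPC unavailable"
        }
    ' "$1"
}

# Usage:
#   ipc_from_csv stat.csv
```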

If something looks off, follow up with perf record -F 99 -g -p PID -- sleep 30. The -F 99 samples at 99 Hz, which is enough to find hot functions and avoids syncing with the kernel timer at round numbers like 100 Hz. The -g flag captures call graphs so perf report can show you which paths into a function are responsible.
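
Record and report usually go together, so it can help to wrap them. profile_pid is a hypothetical helper; the defaults (30 seconds, perf.data, the first 60 lines of the report) are arbitrary choices for illustration.

```shell
# Hypothetical wrapper: sample one PID at 99 Hz with call graphs,
# then print the top of a plain-text report.
profile_pid() {
    pid=$1
    seconds=${2:-30}
    data=${3:-perf.data}
    perf record -F 99 -g -p "$pid" -o "$data" -- sleep "$seconds" &&
        perf report -i "$data" --stdio --sort comm,dso,symbol | head -n 60
}

# Usage (1234 is a placeholder PID):
#   profile_pid 1234
#   profile_pid 1234 10 /tmp/app.perf.data
```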

In perf report, the Overhead column is the share of total samples. High overhead in _int_malloc points to heavy allocation, and high overhead in memcpy points to heavy copying. High overhead in one of your own functions is the hotspot you came looking for.

If you see hex addresses instead of function names, the binary is stripped or debug symbols are missing. Install the matching -dbg package, or rebuild the binary with -g.

A practical workflow on a live server

For a real incident on a busy box, the routine looks like this:

Confirm the symptom with cheap tools first: top, vmstat, iostat. If you're on a VM, check the st (steal) column. Sustained values above roughly 5% suggest the hypervisor or a noisy neighbour is the bottleneck, not your code.
If CPU is high, run perf top for a few seconds, then perf record -F 99 -g -p PID -- sleep 30 for the offending process. A 30 second capture at 99 Hz is about 3,000 samples per busy thread, so perf.data stays small, around 1.7 MB for a simple case and more with many threads or deep call stacks.
If the process is hung, slow, or returning errors, run strace -c -p PID for ten seconds and read the summary. If one syscall category dominates, narrow with strace -e trace=GROUP -T -p PID.
When you've found the suspect syscall or function, detach. Don't leave either tool running on production any longer than you have to.
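
The steal check from the first step can also be scripted when you'd rather not eyeball top. This is a minimal sketch that compares two samples of the aggregate cpu line in /proc/stat; steal_pct is a hypothetical helper name, and the five second gap in the usage example is an arbitrary choice.

```shell
# Print steal time as a percentage of all CPU time between two samples of
# the aggregate "cpu" line in /proc/stat. Fields after the "cpu" label:
# user nice system idle iowait irq softirq steal guest guest_nice
# (guest time is already counted inside user/nice, so it is not summed again).
steal_pct() {
    printf '%s\n%s\n' "$1" "$2" | awk '
        { total = 0; for (i = 2; i <= 9; i++) total += $i; steal = $9 }
        NR == 1 { t0 = total; s0 = steal }
        NR == 2 {
            dt = total - t0
            if (dt > 0) printf "%.1f\n", 100 * (steal - s0) / dt
            else        print "0.0"
        }
    '
}

# Usage: sample twice, five seconds apart.
#   a=$(grep '^cpu ' /proc/stat); sleep 5; b=$(grep '^cpu ' /proc/stat)
#   steal_pct "$a" "$b"
```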

Two cautions. strace output can include environment variables, file paths, and bytes read from sockets, so sanitise logs before sharing them outside your team. And if you're going to be doing this regularly, look at bpftrace and the wider eBPF toolkit as the next step: the same kind of visibility, usually at a fraction of strace's overhead, and designed with production use in mind.