
iostat and iotop: diagnose Linux storage bottlenecks

14 min read - June 12, 2026


Use iostat and iotop to find Linux disk I/O bottlenecks. Covers the %util gotcha on NVMe, reading await and queue depth, and a workflow for finding and fixing the cause.

iostat and iotop: diagnosing Linux storage bottlenecks

When a Linux server feels slow, storage is one of the first places to look. iostat shows you whether the disk is overwhelmed; iotop shows you which process is causing the load. Used together they answer the two questions that matter: is the disk actually the bottleneck, and if so, what's hammering it? This post covers installation, how to read the output (including where iostat's %util metric lies on modern hardware), and a workflow for going from symptom to fix.

Installing iostat and iotop

iostat comes with the sysstat package; iotop ships separately. Install both:

# Debian/Ubuntu
sudo apt install sysstat iotop
 
# RHEL, AlmaLinux, Rocky, CentOS Stream
sudo dnf install sysstat iotop
 
# Arch
sudo pacman -S sysstat iotop

On Ubuntu, sysstat ships disabled. To collect background data for later analysis with sar, edit /etc/default/sysstat, set ENABLED="true", and restart the service:

sudo systemctl restart sysstat

iotop needs root privileges. On kernels 5.14 and newer, which includes RHEL 9, delay accounting is disabled by default, which leaves the IO and SWAPIN columns empty. Enable it with:

echo 1 | sudo tee /proc/sys/kernel/task_delayacct

Add kernel.task_delayacct = 1 to /etc/sysctl.conf to make it persist across reboots.
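
Before changing anything, it can help to check what the kernel currently reports. The function below is a minimal sketch; the name delayacct_status and its optional file argument are made up for this example, and the argument exists only so the function can be pointed at a test file.

```shell
# Print the state of delay accounting as the kernel reports it.
# $1 (optional) = file to read; defaults to the real sysctl file.
delayacct_status() {
    f="${1:-/proc/sys/kernel/task_delayacct}"
    if [ ! -r "$f" ]; then
        # No such file: this kernel does not expose the sysctl.
        echo "no-sysctl"
        return 0
    fi
    case "$(cat "$f")" in
        1) echo "enabled" ;;
        *) echo "disabled" ;;
    esac
}
```

Run delayacct_status with no arguments on the server. If it prints "disabled", the IO and SWAPIN columns in iotop will stay empty until the sysctl is set to 1.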

Reading iostat output

Run iostat with extended stats and ignore the first sample, which only shows averages since boot:

iostat -xz 2

The -x flag adds extended statistics, -z hides idle devices, and the 2 refreshes every two seconds. The columns that matter:

  • await: average time in milliseconds for an I/O request to complete, including queue time. Recent sysstat releases report it per direction as r_await (reads) and w_await (writes). The single most important number when users complain about slowness.
  • r/s and w/s: read and write IOPS. Combined with rkB/s and wkB/s these tell you whether your workload is random (high IOPS, low throughput) or sequential (low IOPS, high throughput).
  • aqu-sz: average queue depth. For HDDs, anything sustained above 1 means the disk can't keep up.
  • %util: percentage of time the device had at least one in-flight I/O. Useful on HDDs, misleading on NVMe (see below).

A quick threshold reference:

Drive type      await concern   aqu-sz concern   %util reliable?
7200 RPM HDD    > 20 ms         > 1              Yes
SATA SSD        > 10 ms         > 4              Mostly
NVMe            > 1-2 ms        > 16             No
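
These thresholds can be applied automatically by filtering iostat's output. The sketch below assumes a sysstat version that prints r_await, w_await, and aqu-sz columns; it finds them by header name rather than position, because column order differs between releases. The function name flag_slow_devices is made up for this example.

```shell
# Print devices whose latency or queue depth exceeds the given limits.
# $1 = await limit in ms, $2 = aqu-sz limit. Reads `iostat -x` text on stdin.
flag_slow_devices() {
    awk -v await_max="$1" -v q_max="$2" '
        # Header row: remember which column each name sits in.
        $1 ~ /^Device/ {
            for (i = 1; i <= NF; i++) col[$i] = i
            in_table = 1
            next
        }
        # A blank line ends the device table of the current report.
        NF == 0 { in_table = 0; next }
        # Device rows, only when the expected columns exist.
        in_table && ("r_await" in col) && ("w_await" in col) && ("aqu-sz" in col) {
            r = $(col["r_await"]); w = $(col["w_await"]); q = $(col["aqu-sz"])
            if (r + 0 > await_max + 0 || w + 0 > await_max + 0 || q + 0 > q_max + 0)
                printf "%s r_await=%s w_await=%s aqu-sz=%s\n", $1, r, w, q
        }'
}
```

For a 7200 RPM HDD, something like iostat -xzy 2 5 | flag_slow_devices 20 1 would list the devices that crossed the limits in any of five samples. The -y flag skips the since-boot report. Pick the limits from the row that matches the drive type.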

Where %util lies

%util is the metric most people reach for first, and on NVMe it's actively misleading. The kernel counts %util as "any I/O in flight at any moment", which is fine for a spinning disk that processes one request at a time but meaningless for NVMe devices that handle thousands of requests in parallel across hardware queues. An NVMe drive can sit at 100% %util while operating at 5% of its real capacity.

On NVMe, trust r_await, w_await, and aqu-sz instead. If r_await stays under 1 ms and the queue depth is comfortably below the device's hardware queue depth (often 1024 or higher), the drive isn't actually saturated regardless of what %util says.

For a fast-NVMe view in MB/s rather than kB/s:

iostat -xm 1

For a longer capture that you can correlate with application logs later (tee runs under sudo because /var/log is normally writable only by root):

iostat -x -t 5 720 | sudo tee /var/log/iostat.log > /dev/null

That samples every 5 seconds for an hour. sar from the same sysstat package gives you the equivalent data with persistent historical storage and is the better choice for ongoing monitoring.

Confirming with CPU iowait

If iostat shows storage stress, cross-check with the %iowait column in the CPU summary at the top of the same output. Sustained %iowait above 15-20% together with high await confirms the bottleneck is storage. If %iowait is high but await looks normal, run vmstat 1 and check the si and so columns. Non-zero swap activity means you're memory-bound and the disk traffic is paging, not application I/O.
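
The swap check can be scripted rather than eyeballed. This is a rough sketch that assumes the usual vmstat layout, with a column-name row starting "r b" and si/so columns; the function name swap_activity is made up for this example.

```shell
# Count how many vmstat samples show swap-in or swap-out traffic.
# Reads `vmstat <interval> <count>` text on stdin.
swap_activity() {
    awk '
        # Column-name row: map each name to its position.
        $1 == "r" && $2 == "b" {
            for (i = 1; i <= NF; i++) col[$i] = i
            next
        }
        # Numeric sample rows.
        ("si" in col) && ("so" in col) && $1 ~ /^[0-9]+$/ {
            n++
            if ($(col["si"]) + $(col["so"]) > 0) busy++
        }
        END { printf "%d of %d samples show swap activity\n", busy, n }'
}
```

For example, vmstat 1 30 | swap_activity covers thirty seconds. The first vmstat row is an average since boot, so a single hit on its own is not conclusive.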

Reading iotop output

Once iostat confirms a storage bottleneck, iotop tells you which process is responsible. Start with:

sudo iotop -o

The -o flag hides idle processes, leaving only those actively doing I/O. The columns to watch:

  • DISK READ / DISK WRITE: real-time throughput per process. Identifies the obvious heavy hitters.
  • IO: percentage of time the process is blocked on I/O. A process writing just 50 kB/s can show 99% IO if it's doing tiny synchronous fsync() calls. This column matters more than raw throughput.
  • SWAPIN: percentage of time waiting on swap pages. Non-zero here means the system is paging and your "storage problem" is really a memory problem.

For multi-threaded applications (MySQL, PostgreSQL, Java workloads), aggregate threads back into processes with -P, and add -a to accumulate totals since iotop started:

sudo iotop -oPa

The -a flag is the trick for catching bursty workloads like backup jobs that only run for a few seconds at a time and would otherwise be hard to spot in a live view.

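For unattended logging during overnight windows when nobody's watching, run the whole command in a root shell so that the redirect into /var/log also runs as root:

sudo sh -c 'iotop -botqq -d 10 > /var/log/iotop.log'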

That writes a non-interactive snapshot every 10 seconds. Pair it with timestamps from your backup or cron jobs to find the culprit after the fact.
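
A log like this gets long quickly, so a small summary helps. The sketch below assumes the line layout of the classic Python iotop in batch mode with -t: timestamp, TID, priority, user, read value and unit, write value and unit, SWAPIN value and "%", IO value and "%", then the command. Other iotop implementations, or a kernel with delay accounting off, may print something different, so check a few lines of your own log first. The function name iotop_blocked_summary is made up for this example.

```shell
# Count, per command, the samples in which IO% reached a threshold.
# $1 = IO% threshold. Reads an `iotop -botqq` log on stdin.
iotop_blocked_summary() {
    awk -v min_io="$1" '
        # Process rows have a numeric TID in field 2 and the IO value in
        # field 11, followed by a literal "%" in field 12 (assumed layout).
        $2 ~ /^[0-9]+$/ && $12 == "%" && $11 + 0 >= min_io + 0 {
            hits[$13]++
        }
        END { for (c in hits) printf "%d %s\n", hits[c], c }
    ' | sort -rn
}
```

Running sudo cat /var/log/iotop.log | iotop_blocked_summary 50 lists the commands that spent at least half of a sampling interval blocked on I/O, most frequent first.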

A diagnostic workflow

Most disk I/O investigations follow the same path:

  1. iostat -xz 2 to confirm the disk is actually the bottleneck. Look at await, aqu-sz, and %iowait. If these are normal, the problem isn't storage and you should be looking somewhere else entirely.
  2. iotop -oPa to find the process driving the load. Watch the IO column more than the throughput column. The worst offenders are often programs doing many small synchronous writes, not the ones moving the most bytes.
  3. lsof -p <pid> to see which files that process is touching. This usually identifies the workload type immediately: a database write-ahead log, an application log file, a backup mount point, a temp file.
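
To check whether a suspect from step 2 really is issuing many small writes, the kernel's per-process counters in /proc/<pid>/io give a rough answer. This sketch divides wchar (bytes passed to write calls) by syscw (number of write calls). Both counters also include writes to pipes, sockets, and terminals, so treat the result as a hint rather than a measurement. The function name avg_write_size is made up for this example.

```shell
# Average bytes per write call for one process.
# $1 = path to an io file, e.g. /proc/1234/io (other users' processes
# usually need root to read).
avg_write_size() {
    awk '
        $1 == "wchar:" { bytes = $2 }
        $1 == "syscw:" { calls = $2 }
        END {
            if (calls > 0) printf "%d bytes per write call\n", bytes / calls
            else           print "no write calls recorded"
        }' "$1"
}
```

A result of a few hundred bytes from a process that tops the IO column points at small, frequent writes that would benefit from batching.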

Two patterns worth knowing.

If you see kernel threads like jbd2/... (ext4 journal) or txg_sync (ZFS) at the top of iotop's writers, they're responding to writes from other processes, not initiating them. The user-space process driving the journal traffic is the actual cause; keep digging.

On a VPS, high await with low %util is the classic noisy-neighbour signal. Another tenant is monopolising the shared storage and your I/O is queuing on the hypervisor side, not on your virtual disk. You can confirm but not fix this from inside the guest; the answer is either to escalate to the provider or to move to a server with isolated storage.

Fixing common I/O bottlenecks

Once you know what's hitting the disk, the fixes are usually straightforward.

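De-prioritise non-critical I/O. ionice puts a process into the idle scheduling class, where it only uses disk bandwidth when nothing else wants it. The class is only honoured by I/O schedulers that implement priorities, such as BFQ; with the scheduler set to none it has little or no effect: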

ionice -c 3 -p <pid>
sudo ionice -c 3 rsync -a /data /backup

The first form changes a running process; the second launches a new one already in the idle class. Inside iotop, you can change priority on a running process interactively by pressing i.

Move hot workloads to faster storage. If iostat shows a SATA disk overloaded by database writes and there's an idle NVMe in the same box, relocate the database data directory. The orders-of-magnitude difference in IOPS makes this the highest-leverage fix available.

Set the right I/O scheduler. Modern kernels default reasonably, but it's worth checking. For NVMe and SSDs, set the scheduler to none. The device handles queueing in hardware better than the kernel can:

echo none | sudo tee /sys/block/nvme0n1/queue/scheduler

For HDDs handling mixed workloads, mq-deadline is the usual choice.
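
To see what each device is using before changing anything, read the same sysfs file: the kernel lists the available schedulers and puts the active one in square brackets. A minimal sketch, with active_scheduler as a made-up helper name:

```shell
# Print the active scheduler from a sysfs scheduler file.
# $1 = file such as /sys/block/nvme0n1/queue/scheduler, whose content
# looks like "[none] mq-deadline kyber bfq".
active_scheduler() {
    sed -n 's/.*\[\(.*\)\].*/\1/p' "$1"
}

# Example loop over all block devices (left commented out so that
# sourcing this file has no side effects):
# for f in /sys/block/*/queue/scheduler; do
#     printf '%s: %s\n' "$f" "$(active_scheduler "$f")"
# done
```

A value written through /sys is lost on reboot, so a change that should stick needs to be reapplied at boot, for example with a udev rule.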

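Mount with noatime. Linux records a last-accessed timestamp on every file it reads. The default relatime option already limits those updates (roughly once a day per file, or when the file has been modified since its last access), but on read-heavy filesystems the remaining writes are still gratuitous I/O. Add noatime to the mount options in /etc/fstab: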

UUID=... /data ext4 defaults,noatime 0 2
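
To check which access-time mode a filesystem is currently mounted with, look at its options in /proc/mounts. The sketch below assumes that a mount listing neither noatime nor relatime is running with strictatime. The function name atime_mode and its optional second argument are made up for this example.

```shell
# Report the access-time mode of a mount point.
# $1 = mount point, $2 (optional) = mounts table, defaults to /proc/mounts.
atime_mode() {
    awk -v mp="$1" '
        $2 == mp {
            n = split($4, opt, ",")
            mode = "strictatime"    # assumed when no atime option is listed
            for (i = 1; i <= n; i++)
                if (opt[i] == "noatime" || opt[i] == "relatime") mode = opt[i]
        }
        END { print (mode == "" ? "not mounted" : mode) }
    ' "${2:-/proc/mounts}"
}
```

After editing /etc/fstab, sudo mount -o remount,noatime /data applies the option without a reboot, and atime_mode /data should then print noatime.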

Tune writeback for bursty writes. On servers with plenty of RAM, the default dirty-page thresholds let the page cache accumulate gigabytes of unwritten data, then flush it in one large stall. Lower the thresholds in /etc/sysctl.conf to smooth this out:

vm.dirty_ratio = 10
vm.dirty_background_ratio = 5
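
To get a feel for what a given percentage means on a particular server, convert it to megabytes. The kernel applies these ratios to memory it considers dirtyable (free plus reclaimable pages); the sketch below uses MemAvailable from /proc/meminfo as a stand-in, so the result is an estimate. The function name dirty_limit_mib is made up for this example.

```shell
# Estimate how much dirty data a vm.dirty_* ratio allows, in MiB.
# $1 = ratio in percent, $2 (optional) = meminfo file, defaults to /proc/meminfo.
dirty_limit_mib() {
    awk -v ratio="$1" '
        # MemAvailable is reported in kB; convert the share to MiB.
        $1 == "MemAvailable:" { printf "%d MiB\n", $2 * ratio / 100 / 1024 }
    ' "${2:-/proc/meminfo}"
}
```

Comparing dirty_limit_mib 20 with dirty_limit_mib 10 shows how much smaller the worst-case flush becomes. Running sudo sysctl -p loads the new values from /etc/sysctl.conf without a reboot.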

The disk itself is usually not the problem. When iostat shows high IOPS and low throughput, the workload is doing random I/O on data that could be sequential, or running many small writes that could be batched. Look at the application before blaming the hardware.

If you're running storage-heavy workloads on a server where the network can outrun the disk, FDC's NVMe-backed dedicated servers give you the headroom to apply the tuning above productively.
