Linux I/O Scheduler Tuning: mq-deadline, none, BFQ

16 min read - June 1, 2026

Table of contents
  • Linux I/O scheduler tuning: mq-deadline, none, and BFQ
  • How mq-deadline, none, and BFQ differ
  • Matching a scheduler to your workload
  • Changing and tuning scheduler parameters
  • Verifying performance after tuning
  • Picking the right scheduler

How to pick and tune the right Linux I/O scheduler for NVMe, SATA, and HDD workloads, with sysfs commands, udev rules, and fio benchmarking steps.

Linux I/O scheduler tuning: mq-deadline, none, and BFQ

The Linux I/O scheduler decides the order in which read and write requests reach your storage device, and the right choice depends almost entirely on your hardware. Use none for NVMe, mq-deadline for SATA SSDs and HDDs running mixed workloads, and bfq when you need to stop one process from starving the others. This guide covers how the three main schedulers work, how to match one to your workload, and how to tune and verify the result.

How mq-deadline, none, and BFQ differ

Each scheduler handles requests with a different strategy. Knowing how they differ is what lets you choose deliberately instead of running whatever the kernel picked at boot.

mq-deadline

The mq-deadline scheduler makes sure no request waits indefinitely. It keeps separate sorted queues for reads and writes, ordering them by Logical Block Address to cut seek time, and enforces deadlines: 500 ms for reads and 5 seconds for writes by default. When a request hits its deadline, it jumps to the front of the queue.

Reads take priority over writes, since reads usually block the application while writes are handled asynchronously. To stop writes from being starved entirely, the scheduler dispatches a batch of writes once reads have been picked ahead of waiting writes a set number of times (the writes_starved tunable). The result is consistent low latency, which makes it a strong fit for database servers and any workload that mixes reads and writes.

none

The none scheduler does almost nothing. It passes requests to the device in roughly First-In-First-Out order, with no reordering or prioritization of its own (the block layer can still merge adjacent requests before they are dispatched). That suits modern NVMe drives, which manage their own internal queues and can track tens of thousands of in-flight requests at once. Removing the software scheduling layer gives the shortest possible path from application to device, which is exactly what high-throughput NVMe workloads want.

The catch is that this only works when the hardware can schedule intelligently on its own. On HDDs or SATA SSDs with shallow queues, skipping software reordering usually makes performance worse, not better.

BFQ

BFQ (Budget Fair Queuing) puts fairness first. Instead of time slices, it gives each process a budget measured in disk sectors. Large sequential readers get bigger budgets to keep throughput up, while latency-sensitive tasks get smaller budgets so they are serviced quickly, and a feedback loop adjusts the budgets as it runs.

BFQ keeps interactive tasks responsive even under heavy load, so video playback or a database query stays smooth while a large file transfer runs in the background. That fairness costs CPU. Its per-request overhead is roughly 1.9 microseconds, about three times that of mq-deadline, and on a slower ARM core that overhead caps throughput well below what the same scheduler reaches on a fast x86 chip. On servers where raw throughput and CPU efficiency matter most, that tradeoff is hard to justify.

| Scheduler | Algorithm | CPU overhead | Best hardware | Primary goal |
|---|---|---|---|---|
| mq-deadline | Sorted LBA with deadlines | Low (~0.7 µs/req) | SATA SSDs, HDDs, virtual disks | Predictable low latency |
| none | FIFO, no reordering | Negligible | NVMe SSDs | Maximum throughput |
| bfq | Proportional-share budgets | Moderate (~1.9 µs/req) | HDDs, shared and desktop systems | Fairness and responsiveness |

Matching a scheduler to your workload

Two things decide the right scheduler: your storage hardware and your application's access pattern. Start with the hardware. If the device already reorders requests, like an NVMe drive with capable firmware, software scheduling only adds overhead, so none wins. On spinning HDDs, where seek time dominates, software reordering cuts latency, so mq-deadline or bfq is the better pick. SATA SSDs sit in between: faster than HDDs but without NVMe's deep queues, which is where mq-deadline fits.

The same logic applies when something else is already scheduling for you. Guest VMs on virtio-blk rely on the host to schedule I/O, and hardware RAID controllers with write-back cache optimize their own ordering. In both cases none avoids paying for the work twice.
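
The hardware rules above can be condensed into a small helper. This is a minimal sketch, not a definitive policy: the function name suggest_scheduler and its optional "shared" argument are hypothetical, it recognizes NVMe and virtio-blk devices only by their usual nvme* and vd* kernel names, and it assumes a shared host wants bfq on everything that is not NVMe or virtio-blk.

```shell
# Suggest a starting scheduler from a kernel device name.
# Usage: suggest_scheduler DEVICE [shared]
# Hypothetical helper; the mapping mirrors the guidance in this article.
suggest_scheduler() {
    dev="$1"
    sharing="${2:-dedicated}"
    case "$dev" in
        nvme*|vd*)
            # NVMe firmware, or the VM host, already schedules the I/O.
            echo none
            ;;
        *)
            if [ "$sharing" = shared ]; then
                echo bfq            # fairness between competing tenants
            else
                echo mq-deadline    # SATA SSDs and HDDs with mixed I/O
            fi
            ;;
    esac
}
```

For example, suggest_scheduler nvme0n1 prints none, and suggest_scheduler sdb shared prints bfq. Treat the answer as a starting point to benchmark, not a verdict.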

Access pattern is the second factor. A database doing thousands of random 4K reads per second has nothing in common with a training job streaming large sequential blocks off an NVMe array, and they want different schedulers. The table below maps common workloads to a starting point.

| Workload | Storage | Scheduler | Reason |
|---|---|---|---|
| AI/ML training | NVMe SSD | none | Sequential high throughput; firmware handles queuing |
| OLTP database | NVMe SSD | none | Low-latency random I/O; avoid software overhead |
| OLTP database | SATA SSD | mq-deadline | Prevents write starvation; predictable tail latency |
| Data warehouse / OLAP | NVMe / fast SSD | none | Deep parallel queues; maximum throughput |
| General web hosting | SATA SSD / HDD | mq-deadline | Consistent response for mixed small-file I/O |
| Shared / multi-tenant hosting | HDD / SSD | bfq | Fairness between tenants; stops I/O monopolization |
| Virtual machine guest | virtio-blk | none | Host already schedules; double-scheduling wastes CPU |
| Backup / archive | HDD | mq-deadline | Sequential throughput with starvation prevention |

There is one exception worth flagging. Even on NVMe, if tail latency at p99 or p999 is the metric you care about, such as in financial systems, mq-deadline can beat none by enforcing strict deadlines and preventing the occasional delayed request.

Changing and tuning scheduler parameters

Both switching schedulers and tuning their parameters happen through sysfs, with no reboot required to test a change.

Switching the active scheduler

Check what is available for a device, where the value in brackets is the active one:

cat /sys/block/sda/queue/scheduler
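
If you manage more than one disk, reading the bracketed value by eye gets tedious. The sketch below pulls the active scheduler out of that line and lists it for every block device. The function names active_scheduler and list_schedulers are hypothetical, and the optional directory argument exists only so the helper can be pointed at something other than /sys/block.

```shell
# Print the active scheduler from a line such as "[mq-deadline] kyber bfq none".
# The bracketed entry is the one currently in use.
active_scheduler() {
    printf '%s\n' "$1" | sed -n 's/.*\[\([^]]*\)].*/\1/p'
}

# List every block device with its active scheduler, tab-separated.
# The optional argument is the sysfs block directory (default: /sys/block).
list_schedulers() {
    root="${1:-/sys/block}"
    for f in "$root"/*/queue/scheduler; do
        [ -r "$f" ] || continue
        dev="${f#"$root"/}"     # strip the leading directory
        dev="${dev%%/*}"        # keep only the device name
        printf '%s\t%s\n' "$dev" "$(active_scheduler "$(cat "$f")")"
    done
}
```

Calling list_schedulers with no argument reads the live system and needs no root privileges.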

Switch to a different scheduler at runtime. This takes effect immediately but does not survive a reboot:

echo bfq | sudo tee /sys/block/sda/queue/scheduler

If bfq is not listed, load the module first:

sudo modprobe bfq

To make a choice persistent, use a udev rule rather than the old elevator= kernel parameter, which no longer changes the scheduler on RHEL 9 and similar releases. Save the following rule as /etc/udev/rules.d/60-io-scheduler.rules to set mq-deadline on every non-rotational SCSI or SATA disk named sda through sdz:

ACTION=="add|change", SUBSYSTEM=="block", KERNEL=="sd[a-z]", ATTR{queue/rotational}=="0", ATTR{queue/scheduler}="mq-deadline"

Reload and apply it without rebooting:

sudo udevadm control --reload-rules && sudo udevadm trigger

On RHEL-based systems, TuneD profiles do the same job through system-wide profiles instead of per-device rules.

Parameters worth tuning

Each scheduler exposes its tunables under /sys/block/<device>/queue/iosched/. For mq-deadline, the deadlines are the main levers. Latency-sensitive databases on SATA SSDs benefit from shorter ones:

echo 100 | sudo tee /sys/block/sda/queue/iosched/read_expire
echo 1000 | sudo tee /sys/block/sda/queue/iosched/write_expire
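
Before overwriting any tunable, it helps to record what was there so you can roll back. A minimal sketch, assuming the tunables are plain readable files under queue/iosched/ as described above; the function name dump_iosched and its optional second argument (an alternative sysfs root) are hypothetical.

```shell
# Print every scheduler tunable for a device as name=value lines.
# Usage: dump_iosched DEVICE [SYSFS_BLOCK_DIR]
dump_iosched() {
    dir="${2:-/sys/block}/$1/queue/iosched"
    for f in "$dir"/*; do
        # Skip anything that is not a readable regular file.
        [ -f "$f" ] && [ -r "$f" ] || continue
        printf '%s=%s\n' "${f##*/}" "$(cat "$f")"
    done
}
```

Running dump_iosched sda > sda-iosched.before ahead of a change leaves a file you can diff against a later snapshot.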

For bfq on high-throughput systems, disabling the latency heuristics raises throughput:

echo 0 | sudo tee /sys/block/sda/queue/iosched/low_latency
echo 0 | sudo tee /sys/block/sda/queue/iosched/slice_idle

| Scheduler | Parameter | Default | Tuning goal |
|---|---|---|---|
| mq-deadline | read_expire | 500 ms | Lower for faster read response |
| mq-deadline | write_expire | 5000 ms | Lower to reduce write latency |
| mq-deadline | writes_starved | 2 | Increase for read-heavy loads |
| mq-deadline | fifo_batch | 16 | Set to 1 for minimum latency |
| bfq | low_latency | 1 | Set to 0 for maximum throughput |
| bfq | slice_idle | 8 ms | Set to 0 for SSDs or RAID |
| bfq | strict_guarantees | 0 | Set to 1 for strict bandwidth sharing |

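For shared hosting, BFQ pairs well with cgroups v2. Per-cgroup I/O weights let you give a database process ten times the I/O share of a backup job, for example, so background work cannot drown out interactive traffic. On current kernels BFQ reads those weights from io.bfq.weight, while the plain io.weight file belongs to the separate io.cost controller, so check which file your system exposes. Whatever you change, BFQ's higher per-request cost adds up on CPU-bound, high-IOPS systems, so benchmark before committing.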
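
As a rough illustration of the ten-to-one split, the sketch below writes a BFQ weight into a cgroup directory. Several things here are assumptions to verify on your own system: that BFQ is the active scheduler and exposes its weight as io.bfq.weight under cgroups v2, that valid weights run from 1 to 1000 with a default of 100, and the cgroup names in the usage note, which are made up. The function name set_bfq_weight is hypothetical.

```shell
# Write a BFQ I/O weight into a cgroup v2 directory.
# Usage: set_bfq_weight CGROUP_DIR WEIGHT
# Assumes the weight file is io.bfq.weight and the range is 1..1000.
set_bfq_weight() {
    cgroup_dir="$1"
    weight="$2"
    case "$weight" in
        ''|*[!0-9]*)
            echo "weight must be a positive integer" >&2
            return 1
            ;;
    esac
    if [ "$weight" -lt 1 ] || [ "$weight" -gt 1000 ]; then
        echo "weight must be between 1 and 1000" >&2
        return 1
    fi
    printf '%s\n' "$weight" > "$cgroup_dir/io.bfq.weight"
}
```

Run as root, set_bfq_weight /sys/fs/cgroup/database 1000 followed by set_bfq_weight /sys/fs/cgroup/backup 100 would give the first group ten times the share of the second. On systemd hosts, the IOWeight= unit setting is usually the more maintainable way to get the same effect.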

Verifying performance after tuning

Always capture a baseline before you change anything. Without it, you have no way to tell whether a tweak helped.

fio is the standard tool for this. It reproduces specific workload patterns through block size, queue depth, and I/O engine settings. Always pass --direct=1 so it bypasses the page cache and measures the scheduler and device directly rather than cached reads. Match the test to the real workload:

| Workload | fio parameters |
|---|---|
| OLTP database | --rw=randread --bs=4k --iodepth=32 --direct=1 |
| Data warehouse | --rw=read --bs=1m --iodepth=32 --direct=1 |
| Write-ahead / redo log | --rw=write --bs=4k --iodepth=1 --direct=1 |
| Object storage | --rw=randrw --bs=64k --iodepth=64 --direct=1 |
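
The parameters in the table are only part of a full fio invocation. Here is one way the OLTP row might be fleshed out. Everything beyond the table's four flags is an assumption chosen for illustration: the job name, the 1G scratch file size, the libaio engine, and the 60-second time-based run. The function name fio_randread and the FIO override variable are hypothetical.

```shell
# Run the OLTP-style random-read test against a scratch file.
# Usage: fio_randread SCRATCH_FILE [IODEPTH]
# Set FIO=echo to preview the command instead of running it.
fio_randread() {
    target="$1"          # scratch file on the filesystem under test
    depth="${2:-32}"     # queue depth, 32 as in the table
    ${FIO:-fio} --name=oltp-randread \
        --filename="$target" --size=1G \
        --rw=randread --bs=4k --iodepth="$depth" --direct=1 \
        --ioengine=libaio --runtime=60 --time_based \
        --group_reporting
}
```

Point it at a scratch file, never at a file or raw device holding data you care about. Run the identical command before and after each scheduler change so the two results are comparable.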

Run the same test across iodepth values from 1 to 256 to find the device's saturation point, the depth where IOPS stop climbing and latency spikes. For live monitoring after a change, iostat -x 1 reports the metrics that matter: r_await and w_await for read and write completion latency, aqu-sz for average queue depth, and %util for device utilization. When %util sits near 100 percent on a single HDD, the hardware is the limit and no scheduler change will help. SSDs, NVMe drives, and RAID arrays service many requests in parallel, so on those %util can read 100 percent well before the device is saturated; judge them by the await and aqu-sz figures instead.
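
A queue-depth sweep is easy to script. The sketch below generates the depths by doubling from 1, which covers the 1 to 256 range in nine runs; doubling is an assumption made for brevity, and a finer step may suit your device better. The function name iodepths, the scratch file path, and the output file names in the commented loop are all hypothetical.

```shell
# Print queue depths from 1 up to a maximum, doubling each step.
# Usage: iodepths [MAX]   (default 256)
iodepths() {
    max="${1:-256}"
    d=1
    while [ "$d" -le "$max" ]; do
        printf '%s\n' "$d"
        d=$((d * 2))
    done
}

# Example sweep (commented out so that defining the helper runs nothing):
# for d in $(iodepths 256); do
#     fio --name=sweep --filename=/mnt/test/fio.dat --size=1G \
#         --rw=randread --bs=4k --iodepth="$d" --direct=1 \
#         --ioengine=libaio --runtime=30 --time_based \
#         --output-format=json --output="depth-$d.json"
# done
```

Plotting IOPS and completion latency from the resulting files against depth makes the saturation point visible as the knee of the curve.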

To separate software cost from hardware cost, run blktrace with btt. It splits latency into Q2D, the time spent in the software queue, and D2C, the time the device takes to service the request. If Q2D dominates, the scheduler is your bottleneck. If D2C dominates, the hardware is.
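
The usual sequence is to capture a trace, convert it to the binary form btt reads, and then run btt on it. This is a sketch under assumptions: a 30-second capture, output files named trace and trace.bin in the current directory, and a shell that already has the privileges blktrace needs (typically root). The function name trace_device and the RUN override variable are hypothetical.

```shell
# Capture a block trace and summarize it with btt.
# Usage: trace_device DEVICE [SECONDS]
# Set RUN=echo to print the commands instead of executing them.
trace_device() {
    dev="$1"
    secs="${2:-30}"
    # 1. Record block-layer events for the given number of seconds.
    ${RUN:-} blktrace -d "$dev" -w "$secs" -o trace || return 1
    # 2. Merge the per-CPU trace files into one binary dump, no text output.
    ${RUN:-} blkparse -i trace -d trace.bin -O || return 1
    # 3. Print btt's latency breakdown for the capture.
    ${RUN:-} btt -i trace.bin
}
```

Start the capture while the workload is running, for example trace_device /dev/sda 30, so the trace reflects real traffic instead of an idle disk.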

One thing to keep in mind when reading the results: scheduler choice mostly shapes the tail of the latency distribution, not the median. Switching from none to mq-deadline on NVMe might nudge median latency up a few microseconds while cutting p99 and p999 latency by half. For user-facing services bound by SLAs, that tradeoff is almost always worth it, which is why measuring tail latency, not average throughput, is the point of the exercise.
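
To see why the median hides what the tail reveals, it helps to compute both from the same samples. The sketch below uses the nearest-rank method on one latency value per line; the function name percentile is hypothetical, and fio's own percentile output may use a different estimation method, so expect small differences.

```shell
# Nearest-rank percentile of numbers read from stdin, one per line.
# Usage: percentile P < samples.txt   (P between 0 and 100)
percentile() {
    sort -n | awk -v p="$1" '
        { v[NR] = $1 }                  # samples arrive sorted ascending
        END {
            if (NR == 0) exit 1         # no input, nothing to report
            r = p * NR / 100            # fractional rank
            i = int(r)
            if (r - i > 1e-9) i++       # round up to the next whole rank
            if (i < 1) i = 1
            print v[i]
        }'
}
```

With 98 samples of 100 µs and two of 5000 µs, percentile 50 reports 100 while percentile 99 reports 5000: the median never notices the slow requests that p99 puts front and center.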

Picking the right scheduler

Scheduler tuning is about fitting the algorithm to the hardware and the access pattern, then proving it with measurement. The short version:

  • NVMe: use none and let the firmware do the queuing.
  • SATA SSDs and HDDs with mixed I/O: use mq-deadline for predictable latency.
  • Shared or multi-tenant hosts: use bfq to keep one workload from starving the rest.
  • Tail latency, not median: scheduler changes show up at p99 and p999, so that is what to measure.
  • Make it persistent: use udev rules or TuneD, never the dead elevator= parameter.

Getting the most out of any scheduler starts with hardware that can keep up. If you need NVMe-backed servers built for high-throughput, low-latency workloads, explore FDC's VPS options.
