Linux I/O Scheduler Tuning: mq-deadline, none, BFQ
16 min read - June 1, 2026

How to pick and tune the right Linux I/O scheduler for NVMe, SATA, and HDD workloads, with sysfs commands, udev rules, and fio benchmarking steps.
The Linux I/O scheduler decides the order in which read and write requests reach your storage device, and the right choice depends almost entirely on your hardware. Use none for NVMe, mq-deadline for SATA SSDs and HDDs running mixed workloads, and bfq when you need to stop one process from starving the others. This guide covers how the three main schedulers work, how to match one to your workload, and how to tune and verify the result.
How mq-deadline, none, and BFQ differ
Each scheduler handles requests with a different strategy. Knowing how they differ is what lets you choose deliberately instead of running whatever the kernel picked at boot.
mq-deadline
The mq-deadline scheduler makes sure no request waits indefinitely. It keeps separate sorted queues for reads and writes, ordering them by Logical Block Address to cut seek time, and enforces deadlines: 500 ms for reads and 5 seconds for writes by default. When a request hits its deadline, it jumps to the front of the queue.
Reads take priority over writes, since reads usually block the application while writes are handled asynchronously. To stop writes from being starved entirely, the scheduler services a batch of overdue writes after a set number of reads. The result is consistent low latency, which makes it a strong fit for database servers and any workload that mixes reads and writes.
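To make the deadline arithmetic concrete, here is a toy model in shell and awk. It is a sketch, not the kernel's implementation: it only computes which requests have passed their deadline, using the default expiry values quoted above. The function name `expired_requests` and the `id type arrival_ms` input format are made up for illustration.

```shell
# Toy model of mq-deadline expiry (illustrative only, not kernel code).
# stdin:  one request per line as "id type arrival_ms", where type is R or W.
# $1:     current time in ms.
# $2, $3: optional read and write expiry in ms (defaults 500 and 5000).
# stdout: ids of requests whose deadline has already passed.
expired_requests() (
    now_ms=$1
    read_expire_ms=${2:-500}
    write_expire_ms=${3:-5000}
    awk -v now="$now_ms" -v re="$read_expire_ms" -v we="$write_expire_ms" '
        {
            # A deadline is the arrival time plus the expiry for that type.
            deadline = $3 + ($2 == "R" ? re : we)
            if (now >= deadline) print $1
        }'
)
```

For example, `printf 'a R 0\nb W 0\n' | expired_requests 600` prints `a`: the read is 100 ms past its deadline, while the write still has 4.4 seconds left.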
none
The none scheduler does almost nothing. It passes requests to the device in First-In-First-Out order, with no reordering or prioritization. The block layer can still merge adjacent requests, but no scheduling policy sits on top. That suits modern NVMe drives, which manage their own internal queues and can track tens of thousands of in-flight requests at once. Removing the software scheduling layer gives the shortest possible path from application to device, which is exactly what high-throughput NVMe workloads want.
The catch is that this only works when the hardware can schedule intelligently on its own. On HDDs or SATA SSDs with shallow queues, skipping software reordering usually makes performance worse, not better.
BFQ
BFQ (Budget Fair Queuing) puts fairness first. Instead of time slices, it gives each process a budget measured in disk sectors. Large sequential readers get bigger budgets to keep throughput up, while latency-sensitive tasks get smaller budgets so they are serviced quickly, and a feedback loop adjusts the budgets as it runs.
BFQ keeps interactive tasks responsive even under heavy load, so video playback or a database query stays smooth while a large file transfer runs in the background. That fairness costs CPU. Its per-request overhead is roughly 1.9 microseconds, about three times that of mq-deadline, and on a slower ARM core that overhead caps throughput well below what the same scheduler reaches on a fast x86 chip. On servers where raw throughput and CPU efficiency matter most, that tradeoff is hard to justify.
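A quick back-of-the-envelope calculation shows why per-request cost matters at high IOPS. The sketch below assumes, as a simplification, that all scheduler work for a device is serialized on one core, so the ceiling is one second divided by the per-request cost. The helper name `max_iops_per_core` is hypothetical.

```shell
# Rough ceiling on requests per second for one core, assuming every request
# costs $1 microseconds of scheduler work and none of it runs in parallel.
# This is arithmetic only, not a measurement.
max_iops_per_core() {
    awk -v us="$1" 'BEGIN { printf "%d\n", 1000000 / us }'
}
```

With the figures quoted here, `max_iops_per_core 1.9` prints `526315` and `max_iops_per_core 0.7` prints `1428571`. Real limits depend on the CPU and the rest of the I/O path, so treat these as orders of magnitude rather than predictions.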
| Scheduler | Algorithm | CPU overhead | Best hardware | Primary goal |
|---|---|---|---|---|
| mq-deadline | Sorted LBA with deadlines | Low (~0.7 µs/req) | SATA SSDs, HDDs, virtual disks | Predictable low latency |
| none | FIFO, no reordering | Negligible | NVMe SSDs | Maximum throughput |
| bfq | Proportional-share budgets | Moderate (~1.9 µs/req) | HDDs, shared and desktop systems | Fairness and responsiveness |
Matching a scheduler to your workload
Two things decide the right scheduler: your storage hardware and your application's access pattern. Start with the hardware. If the device already reorders requests, like an NVMe drive with capable firmware, software scheduling only adds overhead, so none wins. On spinning HDDs, where seek time dominates, software reordering cuts latency, so mq-deadline or bfq are the better picks. SATA SSDs sit in between: faster than HDDs but without NVMe's deep queues, which is where mq-deadline fits.
The same logic applies when something else is already scheduling for you. Guest VMs on virtio-blk rely on the host to schedule I/O, and hardware RAID controllers with write-back cache optimize their own ordering. In both cases none avoids paying for the work twice.
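The hardware rules above can be written down as a small helper. This is a minimal sketch of this guide's starting points, not a definitive policy. It assumes NVMe devices are named `nvme*` and virtio-blk devices `vd*`, and it takes an optional second argument to mark a shared host. Hardware RAID cannot be detected from the device name, so that case is left to you. The function name `suggest_scheduler` is hypothetical.

```shell
# suggest_scheduler DEVICE [SHARED]
#   DEVICE: kernel device name, e.g. nvme0n1, vda, sda
#   SHARED: "yes" if several tenants compete for the device (default "no")
# Prints the starting-point scheduler recommended in this guide.
suggest_scheduler() (
    dev=$1
    shared=${2:-no}
    case "$dev" in
        nvme*) echo none ;;   # NVMe firmware does its own queuing
        vd*)   echo none ;;   # virtio-blk guest: the host already schedules
        *)
            # SATA SSDs and HDDs: fairness first on shared hosts,
            # predictable latency otherwise.
            if [ "$shared" = "yes" ]; then
                echo bfq
            else
                echo mq-deadline
            fi
            ;;
    esac
)
```

For example, `suggest_scheduler nvme0n1` prints `none` and `suggest_scheduler sda yes` prints `bfq`. Treat the result as a first guess to benchmark, not a final answer.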
Access pattern is the second factor. A database doing thousands of random 4K reads per second has nothing in common with a training job streaming large sequential blocks off an NVMe array, and they want different schedulers. The table below maps common workloads to a starting point.
| Workload | Storage | Scheduler | Reason |
|---|---|---|---|
| AI/ML training | NVMe SSD | none | Sequential high throughput; firmware handles queuing |
| OLTP database | NVMe SSD | none | Low-latency random I/O; avoid software overhead |
| OLTP database | SATA SSD | mq-deadline | Prevents write starvation; predictable tail latency |
| Data warehouse / OLAP | NVMe / fast SSD | none | Deep parallel queues; maximum throughput |
| General web hosting | SATA SSD / HDD | mq-deadline | Consistent response for mixed small-file I/O |
| Shared / multi-tenant hosting | HDD / SSD | bfq | Fairness between tenants; stops I/O monopolization |
| Virtual machine guest | virtio-blk | none | Host already schedules; double-scheduling wastes CPU |
| Backup / archive | HDD | mq-deadline | Sequential throughput with starvation prevention |
There is one exception worth flagging. Even on NVMe, if tail latency at p99 or p999 is the metric you care about, such as in financial systems, mq-deadline can beat none by enforcing strict deadlines and preventing the occasional delayed request.
Changing and tuning scheduler parameters
Both switching schedulers and tuning their parameters happen through sysfs, with no reboot required to test a change.
Switching the active scheduler
Check what is available for a device, where the value in brackets is the active one:

    cat /sys/block/sda/queue/scheduler

Switch to a different scheduler at runtime. This takes effect immediately but does not survive a reboot:

    echo bfq | sudo tee /sys/block/sda/queue/scheduler

If bfq is not listed, load the module first:

    sudo modprobe bfq

To make a choice persistent, use a udev rule rather than the old elevator= kernel parameter, which no longer changes the scheduler on RHEL 9 and similar releases. This rule sets mq-deadline for all non-rotational SCSI disks in /etc/udev/rules.d/60-io-scheduler.rules:

    ACTION=="add|change", SUBSYSTEM=="block", KERNEL=="sd[a-z]", ATTR{queue/rotational}=="0", ATTR{queue/scheduler}="mq-deadline"

Reload and apply it without rebooting:

    sudo udevadm control --reload-rules && sudo udevadm trigger

On RHEL-based systems, TuneD profiles do the same job through system-wide profiles instead of per-device rules.
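After a rule or profile is applied, it helps to confirm what each device actually ended up with. The sketch below parses the bracketed entry out of each scheduler file. The function names are hypothetical, and the optional directory argument exists only so the helper can be pointed somewhere other than /sys/block for testing.

```shell
# Print the active scheduler from a sysfs scheduler file,
# e.g. "[mq-deadline] kyber bfq none" gives "mq-deadline".
# Prints nothing if the file has no bracketed entry.
active_scheduler() {
    sed -n 's/.*\[\([^]]*\)].*/\1/p' "$1"
}

# Print "device scheduler" for every block device that exposes a
# scheduler file under $1 (default: /sys/block).
list_schedulers() (
    root=${1:-/sys/block}
    for f in "$root"/*/queue/scheduler; do
        [ -r "$f" ] || continue          # also skips an unmatched glob
        dev=${f%/queue/scheduler}        # strip the trailing path
        echo "${dev##*/} $(active_scheduler "$f")"
    done
)
```

Running `list_schedulers` with no arguments prints one line per device, which makes it easy to spot a disk the udev rule did not match.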
Parameters worth tuning
Each scheduler exposes its tunables under /sys/block/<device>/queue/iosched/. For mq-deadline, the deadlines are the main levers. Latency-sensitive databases on SATA SSDs benefit from shorter ones:

    echo 100 | sudo tee /sys/block/sda/queue/iosched/read_expire
    echo 1000 | sudo tee /sys/block/sda/queue/iosched/write_expire

For bfq on high-throughput systems, disabling the latency heuristics can raise throughput:

    echo 0 | sudo tee /sys/block/sda/queue/iosched/low_latency
    echo 0 | sudo tee /sys/block/sda/queue/iosched/slice_idle

| Scheduler | Parameter | Default | Tuning goal |
|---|---|---|---|
| mq-deadline | read_expire | 500 ms | Lower for faster read response |
| mq-deadline | write_expire | 5000 ms | Lower to reduce write latency |
| mq-deadline | writes_starved | 2 | Increase for read-heavy loads |
| mq-deadline | fifo_batch | 16 | Set to 1 for minimum latency |
| bfq | low_latency | 1 | Set to 0 for maximum throughput |
| bfq | slice_idle | 8 ms | Set to 0 for SSDs or RAID |
| bfq | strict_guarantees | 0 | Set to 1 for strict bandwidth sharing |
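Because these values reset at reboot and differ between kernels, it is worth recording them before and after a change. The following sketch prints every tunable in an iosched directory as `name=value`. The function name `dump_iosched` is hypothetical.

```shell
# Print "name=value" for each readable tunable in an iosched directory,
# e.g. dump_iosched /sys/block/sda/queue/iosched
dump_iosched() (
    dir=$1
    for f in "$dir"/*; do
        [ -f "$f" ] && [ -r "$f" ] || continue
        printf '%s=%s\n' "${f##*/}" "$(cat "$f")"
    done
)
```

Saving the output to a file before tuning, then comparing it with a second dump afterwards using `diff`, shows exactly which parameters changed and what to restore if a tweak backfires.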
For shared hosting, BFQ pairs well with cgroups v2. Assigning io.weight values lets you give a database process ten times the I/O share of a backup job, for example, so background work cannot drown out interactive traffic. Whatever you change, BFQ's higher per-request cost adds up on CPU-bound, high-IOPS systems, so benchmark before committing.
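To see what a set of weights means in practice, the sketch below turns `name weight` pairs into percentage shares. It only does the arithmetic and touches no cgroup files. The weights 1000 and 100 in the example are illustrative values chosen to give the ten-to-one ratio mentioned above, and the shares only apply while the groups are actually competing for the device. The function name `io_shares` is hypothetical.

```shell
# stdin:  one "name weight" pair per line.
# stdout: each name with its proportional share of the total weight.
io_shares() {
    LC_ALL=C awk '
        { name[NR] = $1; w[NR] = $2; total += $2 }
        END {
            for (i = 1; i <= NR; i++)
                printf "%s %.1f%%\n", name[i], 100 * w[i] / total
        }'
}
```

For example, `printf 'database 1000\nbackup 100\n' | io_shares` prints `database 90.9%` and `backup 9.1%`, so the backup job is held to roughly a tenth of the database's share under contention.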
Verifying performance after tuning
Always capture a baseline before you change anything. Without it, you have no way to tell whether a tweak helped.
fio is the standard tool for this. It reproduces specific workload patterns through block size, queue depth, and I/O engine settings. Always pass --direct=1 so it bypasses the page cache and measures the scheduler and device directly rather than cached reads. Match the test to the real workload:
| Workload | fio parameters |
|---|---|
| OLTP database | --rw=randread --bs=4k --iodepth=32 --direct=1 |
| Data warehouse | --rw=read --bs=1m --iodepth=32 --direct=1 |
| Write-ahead / redo log | --rw=write --bs=4k --iodepth=1 --direct=1 |
| Object storage | --rw=randrw --bs=64k --iodepth=64 --direct=1 |
Run the same test across iodepth values from 1 to 256 to find the device's saturation point, the depth where IOPS stop climbing and latency spikes. For live monitoring after a change, iostat -x 1 reports the metrics that matter: r_await and w_await for read and write completion latency, aqu-sz for average queue depth, and %util for device utilization. On an HDD, %util near 100 percent means the hardware is the limit and no scheduler change will help. On SSDs, NVMe drives, and RAID arrays that serve requests in parallel, %util can read 100 percent well before the device is saturated, so rely on the await and queue-depth figures there.
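A queue-depth sweep is easy to script. The sketch below prints one fio command per depth instead of running them, so the sweep can be reviewed first. The doubling steps, the libaio engine, the 30-second runtime, the 1 GiB size, and the JSON output file names are all assumptions rather than values from this guide. The function name `fio_sweep_cmds` is hypothetical.

```shell
# fio_sweep_cmds TARGET [RW] [BS]
# Prints one fio command per queue depth: 1, 2, 4, ... 256.
# TARGET is a test file or device path. Write tests against a raw
# device destroy its contents, so prefer a scratch file.
fio_sweep_cmds() (
    target=$1
    rw=${2:-randread}
    bs=${3:-4k}
    depth=1
    while [ "$depth" -le 256 ]; do
        echo "fio --name=qd$depth --filename=$target --rw=$rw --bs=$bs" \
             "--iodepth=$depth --direct=1 --ioengine=libaio --size=1g" \
             "--runtime=30 --time_based" \
             "--output-format=json --output=qd$depth.json"
        depth=$((depth * 2))
    done
)
```

After checking the output of `fio_sweep_cmds /mnt/scratch/fio.dat`, pipe it to `sh` to run the sweep. The path here is a placeholder for a scratch file on the device under test. Run the identical sweep once per scheduler so the results are comparable.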
To separate software cost from hardware cost, run blktrace with btt. It splits latency into Q2D, the time spent in the software queue, and D2C, the time the device takes to service the request. If Q2D dominates, the scheduler is your bottleneck. If D2C dominates, the hardware is.
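The capture itself takes three commands, shown as comments in the sketch below because they need root and a real device. The 30-second window and the file names are assumptions. The small helper applies the rule of thumb above to two averages you read off the btt report and pass in by hand. Its name, `classify_bottleneck`, is hypothetical.

```shell
# Capture and analyse a trace (needs root; window and file names are examples):
#   sudo blktrace -d /dev/sda -w 30 -o trace
#   blkparse -i trace -d trace.bin -O
#   btt -i trace.bin
#
# classify_bottleneck Q2D D2C
# Both arguments are average times in the same unit.
# Prints which side dominates; a tie is reported as "hardware".
classify_bottleneck() {
    awk -v q2d="$1" -v d2c="$2" 'BEGIN {
        if (q2d > d2c) print "scheduler"
        else           print "hardware"
    }'
}
```

For example, `classify_bottleneck 0.002 0.0001` prints `scheduler`, since time spent queued in software is twenty times the device service time.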
One thing to keep in mind when reading the results: scheduler choice mostly shapes the tail of the latency distribution, not the median. Switching from none to mq-deadline on NVMe might nudge median latency up a few microseconds while cutting p99 and p999 latency by half. For user-facing services bound by SLAs, that tradeoff is almost always worth it, which is why measuring tail latency, not average throughput, is the point of the exercise.
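A small example shows how the median can hide a bad tail. The sketch below computes a nearest-rank percentile from a list of latencies, one number per line. It handles whole-number percentiles only, so p999 would need a finer method. The function name `percentile` and the sample data are made up for illustration.

```shell
# percentile P
# stdin:  one latency value per line, in any consistent unit.
# stdout: the nearest-rank P-th percentile (P must be a whole number, 1..100).
percentile() {
    sort -n | awk -v p="$1" '
        { v[NR] = $1 }
        END {
            if (NR == 0) exit 1
            rank = int((p * NR + 99) / 100)   # ceil(p * NR / 100) for whole p
            if (rank < 1) rank = 1
            print v[rank]
        }'
}
```

Given 98 samples of 1 and 2 samples of 50, `percentile 50` prints `1` while `percentile 99` prints `50`. The median says nothing is wrong, and the tail says two requests in a hundred were fifty times slower.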
Picking the right scheduler
Scheduler tuning is about fitting the algorithm to the hardware and the access pattern, then proving it with measurement. The short version:
- NVMe: use none and let the firmware do the queuing.
- SATA SSDs and HDDs with mixed I/O: use mq-deadline for predictable latency.
- Shared or multi-tenant hosts: use bfq to keep one workload from starving the rest.
- Tail latency, not median: scheduler changes show up at p99 and p999, so that is what to measure.
- Make it persistent: use udev rules or TuneD, never the dead elevator= parameter.
Getting the most out of any scheduler starts with hardware that can keep up. If you need NVMe-backed servers built for high-throughput, low-latency workloads, explore FDC's VPS options.