Fix SG_SS Cookie Mismatch Error on Linux

If you've ever run smartctl -a /dev/sda or hdparm -I /dev/sdb and watched the terminal fill up with SG_SS Cookie Mismatch messages, you know the frustration. The drive works fine for normal reads and writes, but any ATA passthrough command triggers this noise. Let's fix it.

The Fast Fix: Disable NCQ

The most reliable fix is to disable Native Command Queuing (NCQ) on the drive. This tells the kernel's ATA driver to stop using the command queuing mechanism that's causing the mismatch.

# Replace sdX with your drive (for example sda) and run as root.
# Show the current queue depth; with NCQ active it is usually 31 or 32
cat /sys/block/sdX/device/queue_depth
# Limit the drive to one outstanding command
echo 1 > /sys/block/sdX/device/queue_depth

That sets the queue depth to 1, effectively disabling NCQ. If you're using a recent kernel (5.x or later), you can also try replacing the I/O scheduler:

echo mq-deadline > /sys/block/sdX/queue/scheduler

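On some systems, switching from none or kyber to mq-deadline eliminates the mismatch. The likely reason is that the cookie mismatch happens when an ATA passthrough command (like those smartctl sends) clashes with the block layer's internal command tracking. mq-deadline is still a multi-queue (blk-mq) scheduler, since kernels from 5.0 onward no longer have a single-queue block layer, but it sends requests through one ordered dispatch path, and that appears to handle passthrough better than none or kyber.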
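
Before and after changing either setting, it helps to confirm what the drive is actually using. Here is a minimal sketch of a helper that prints both values for one drive. The function name show_queue_state and the SYSFS_ROOT override are my own inventions for illustration (SYSFS_ROOT only exists so you can dry-run against a fake directory tree); the two sysfs files are the same ones used above.

```shell
# Print queue depth and active I/O scheduler for one drive.
# SYSFS_ROOT is a hypothetical override for dry runs; it defaults to /sys.
show_queue_state() {
    dev="$1"
    root="${SYSFS_ROOT:-/sys}"
    depth=$(cat "$root/block/$dev/device/queue_depth")
    # The active scheduler is the bracketed entry, e.g. "none [mq-deadline] kyber"
    sched=$(sed 's/.*\[\(.*\)\].*/\1/' "$root/block/$dev/queue/scheduler")
    echo "$dev depth=$depth scheduler=$sched"
}
```

Call it as show_queue_state sda. A depth of 1 means NCQ is effectively off for that drive.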

Why This Happens

The SG_SS Cookie Mismatch error comes from the libata layer in the Linux kernel. The cause is a race condition in the command completion path. When you issue an ATA passthrough command via the SCSI generic (SG) interface, the kernel assigns a cookie to track the command. If the drive completes the command using a different path (like the internal DMA completion), the cookie doesn't match, and you get the error.

This is not a hardware failure, though many people panic and RMA their drives. The drive is fine. The kernel's ATA driver has a bug or design limitation that becomes visible under certain conditions:

Multi-queue block layer with NCQ enabled (default on modern distros like Ubuntu 20.04+, Fedora 30+, Debian 10+)
Drives that support NCQ (SATA drives almost always do)
Any tool that uses ATA passthrough: smartctl, hdparm, sg_sat_identify

The kernel developers have been patching this over the years: some 4.x kernels had it worse and 5.x improved things, but it still shows up under load. The fix above bypasses the issue by reducing complexity in the I/O path.

Less Common Variations

Virtual Machines and Pass-Through Drives

If you're passing an entire SATA controller or an individual drive to a KVM/QEMU VM, the same error can appear in the host's kernel log. In this case, disable NCQ on the host side for that device:

echo 1 > /sys/block/sdX/device/queue_depth

This setting does not survive a host reboot, so reapply it before starting the VM. To make it permanent, add the command to /etc/rc.local (where your distro still runs it) or to a systemd service.
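
For the boot-time case, a small script is easier to maintain than a list of raw echo lines. The sketch below is one way to do it, not the only one. The function name set_queue_depth and the SYSFS_ROOT override are assumptions of mine for illustration; the only real interface it touches is the queue_depth file shown above.

```shell
# Set the queue depth on each listed drive; skip names that do not exist.
# SYSFS_ROOT is a hypothetical override for dry runs; it defaults to /sys.
set_queue_depth() {
    depth="$1"; shift
    root="${SYSFS_ROOT:-/sys}"
    for dev in "$@"; do
        f="$root/block/$dev/device/queue_depth"
        if [ -w "$f" ]; then
            echo "$depth" > "$f"
        else
            echo "skipping $dev: $f not writable" >&2
        fi
    done
}
```

From rc.local or the script a oneshot systemd unit runs, call it as set_queue_depth 1 sdb sdc. Keep in mind that sdX names can change between boots, so check which drive got which letter if you have several.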

NVMe Drives (Rare)

I've seen one case where an NVMe drive triggered a similar error. The fix there was different: update the NVMe firmware and make sure ASPM is disabled in the BIOS. But the cookie mismatch wording is almost always tied to SATA/AHCI, so if you see it on NVMe, it's usually a red herring pointing to an underlying PCIe issue.

USB Enclosures

External USB drives sometimes generate this error when the bridge chip (JMicron, ASMedia) doesn't properly handle ATA passthrough. In that case, disable UAS (USB Attached SCSI) for that device:

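# VVVV:PPPP is the enclosure's vendor:product ID from lsusb; the u flag ignores UAS. Replug the drive afterwards.
echo "VVVV:PPPP:u" > /sys/module/usb_storage/parameters/quirks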

Or blacklist the uas module entirely. But be warned: this costs performance on USB 3.0 enclosures, because the drive falls back to the slower USB BOT (Bulk-Only Transport).
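
The usb-storage module accepts per-device quirk entries of the form VID:PID:flags, where the u flag tells it to ignore UAS for that one device, so you don't have to give up UAS everywhere. Here is a rough sketch of a helper that builds such an entry from a line of lsusb output. The function name uas_quirk_from_lsusb is made up for this example, and the ID in the comment is only a sample.

```shell
# Read lsusb output on stdin and print a usb-storage quirk entry
# (VID:PID:u) for each line that contains a vendor:product ID.
# Sample input: "Bus 002 Device 003: ID 152d:0578 <description>"
uas_quirk_from_lsusb() {
    sed -n 's/.* ID \([0-9a-fA-F]\{4\}:[0-9a-fA-F]\{4\}\).*/\1:u/p'
}
```

Use it as lsusb | grep -i jmicron | uas_quirk_from_lsusb, adjusting the grep pattern to your bridge chip, and check that it prints exactly one line before you use the result.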

Prevention

You can't prevent a kernel-level bug in the ATA driver, but you can avoid triggering it. Here's what I do on my machines:

Use mq-deadline by default. Switch all SATA drives to mq-deadline instead of none or kyber. Create /etc/udev/rules.d/60-iosched.rules with:

ACTION=="add|change", KERNEL=="sd[a-z]*", ENV{DEVTYPE}=="disk", ATTR{queue/scheduler}="mq-deadline"

Disable NCQ on drives you run diagnostics on regularly. If you're a sysadmin running cron jobs with smartctl, set queue depth to 1 on those drives. It hurts peak throughput a bit, but for diagnostic tools you don't care about speed.
Keep your kernel updated. Newer kernels have fewer race conditions in the ATA layer. As of kernel 6.x, I haven't seen this error on any of my systems after applying the scheduler fix.
Check your motherboard's SATA controller. Some older chipsets (Intel PCH 7-series and earlier) are more prone to this. If you're building a NAS for reliability, use an LSI/Broadcom HBA in IT mode instead of the motherboard controller.
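
For the cron case above, you don't have to leave NCQ off permanently. One option is a wrapper that drops the queue depth to 1, runs the diagnostic, and then restores whatever the depth was before. This is a minimal sketch under my own naming: with_depth_one and the SYSFS_ROOT override are hypothetical, and it does nothing to guard against two jobs touching the same drive at once.

```shell
# Run a command with one drive's queue depth temporarily set to 1.
# SYSFS_ROOT is a hypothetical override for dry runs; it defaults to /sys.
with_depth_one() {
    dev="$1"; shift
    f="${SYSFS_ROOT:-/sys}/block/$dev/device/queue_depth"
    old=$(cat "$f") || return 1
    echo 1 > "$f"
    "$@"                 # the diagnostic command and its arguments
    rc=$?
    echo "$old" > "$f"   # restore the original depth
    return "$rc"
}
```

A cron entry would then call something like with_depth_one sda smartctl -a /dev/sda. The wrapper returns the exit status of the wrapped command, so existing alerting on smartctl's exit code keeps working.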

The bottom line: ignore the panic, disable NCQ, and move on. Your drive is fine. The kernel is being dramatic.