[illumos-Developer] Important - time sensitive: Drive failures and infinite waits

Alasdair Lumsden alasdairrr at gmail.com
Thu May 26 15:40:35 PDT 2011


Hi Haudy,

On 26 May 2011, at 22:46, Haudy Kazemi wrote:
> If anyone experiences this scenario running the whole system in a virtual machine, perhaps you can use your virtual machine software to save a state snapshot.
> 
> There have been sporadic reports sounding similar to Alasdair's over the years.  The zfs-discuss list has records of some of them and related discussion (for examples, look at the archives from 2010-07-22 to 24 to see posts in the thread called '1tb SATA drives' by Miles Nordin, myself, and others.)
> http://www.mail-archive.com/zfs-discuss@opensolaris.org/msg40826.html
> 
> IIRC, many discussions ended with "use enterprise drives, equipment, and TLER" instead of less expensive equipment.  That was not a satisfactory response to discussion participants who held the view that ZFS should be able to smoothly recover from failure of drives/controllers/storage drivers, whether or not those subsystems behaved as expected during a failure.  To do otherwise means ZFS is making storage subsystem behavior and reliability assumptions that are not in tune with its motto of being the "last word in filesystems" and its goal of achieving high reliability and high performance, even on less expensive storage devices.

+1

If the sd subsystem doesn't do the right thing, I'd love it if ZFS would just go right ahead and offline the device, based on a configurable timeout.

There seems to be a "pass the buck" issue - ZFS assumes sd will take care of timeouts, sd assumes the SCSI drivers handle timeouts, and some SCSI drivers probably assume the firmware on the HBA does it. If nobody takes responsibility, you have a big issue on your hands.

If ZFS took control of the situation, any misbehaviour in the lower layers would be much less of an issue.

But, IANAKE (I am not a kernel engineer)
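
That said, the rough shape of what I have in mind is something like the snippet below - purely illustrative C, not actual ZFS or illumos code, and every name in it is made up:

/*
 * Toy sketch: remember when each I/O was handed to a device, and if it
 * blows past a configurable deadline, let ZFS fault the device itself
 * instead of waiting for sd or the HBA driver to notice.
 */
#include <time.h>

typedef struct outstanding_io {
        time_t  issued;         /* when the I/O was issued to the device */
        int     vdev_id;        /* which device it went to */
} outstanding_io_t;

/* hypothetical per-pool tunable, cf. the disk_timeout property further down */
static time_t disk_timeout = 10;        /* seconds */

int
io_has_expired(const outstanding_io_t *io, time_t now)
{
        return (now - io->issued) > disk_timeout;
}

A periodic sweep over the outstanding I/Os could then offline any device with expired requests, subject to the sanity checks below.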

> The slow-death failure scenario comes to mind, where a particular drive still works but is much slower than usual.  No error is reported, yet the response time is severely impacted.  Borderline sectors failing slowly can be triggers.  Given appropriate redundancy, and an expected maximum response time threshold, ZFS could reconstruct the needed data using the other devices anytime the maximum response time threshold was exceeded.  E.g. if a device access request is not responded to within 7 seconds (configurable to allow for different pool architectures), attempt to reconstruct the data; no need for TLER support in the drive, no need for the storage drivers to respond in a particular way.  Monitoring device performance changes and observed delays over time could also serve as a warning that the devices may be failing via slow-death.

This is the hardest situation to detect.

There was talk at the Nexenta conference that ZFS could be enhanced to read from the faster half of a mirror, for example if you do cross-datacenter mirroring where one half of the drives is accessed over a cross-site link. So that's related.
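
Conceptually that read-path change is something along these lines - a toy sketch, not the real mirror code, with made-up names:

#include <stddef.h>

typedef struct mirror_child {
        double  avg_latency_ms;         /* smoothed recent service time */
} mirror_child_t;

/* Send reads to whichever side of the mirror is currently responding fastest. */
size_t
pick_read_child(const mirror_child_t *children, size_t nchildren)
{
        size_t best = 0;

        for (size_t i = 1; i < nchildren; i++) {
                if (children[i].avg_latency_ms < children[best].avg_latency_ms)
                        best = i;
        }
        return best;
}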

On the write side, you have no choice but to write to both disks of a mirror, or all disks of a stripe, so if you have any writes coming through you're still screwed.

My preference would be for a zpool "disk timeout" parameter - if you're using iSCSI you could set a higher timeout, like 120 seconds, while for enterprise SAS disks you could set it to 10 seconds.

The ZFS code could try to detect whether the whole IO subsystem is screwed by attempting some kind of trivial null write to all disks, and if more than, say, 50% hang, take no action - you don't want your entire pool dropping all its disks.
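
Something like this check is what I mean - hypothetical, not real ZFS code:

#include <stddef.h>

/*
 * Probe every disk with a trivial request; only treat the non-responders
 * as faulty if more than half of the disks answered, otherwise assume the
 * whole I/O path is wedged and do nothing.
 */
int
safe_to_offline_stragglers(const int *responded, size_t ndisks)
{
        size_t ok = 0;

        for (size_t i = 0; i < ndisks; i++) {
                if (responded[i])
                        ok++;
        }
        return (ok * 2 > ndisks);       /* strictly more than 50% answered */
}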

But since ZFS shouldn't/wouldn't offline a drive if it would fault the pool, too low a timeout would only degrade a pool, not entirely screw it.

zpool set data disk_timeout=5
zpool set data disk_retries=3

I'm not sure how you detect super-slow drives - that comes down to statistics. Perhaps average drive response times could be monitored, and the drive offlined when its performance drops significantly for longer than the timeout period - say, falling to 25% of normal.
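
As a rough sketch of the statistics side (again hypothetical, made-up names), an exponentially weighted moving average of each drive's response time would do:

/*
 * Flag a drive when its latest response time is, say, four times its
 * long-term average - i.e. it has fallen to roughly 25% of normal speed.
 */
typedef struct drive_stats {
        double  ewma_latency_ms;        /* long-term smoothed latency */
} drive_stats_t;

#define EWMA_ALPHA      0.05    /* weight given to each new sample */
#define SLOWDOWN_FACTOR 4.0     /* ~25% of normal speed */

int
drive_looks_slow(drive_stats_t *st, double sample_ms)
{
        int slow = st->ewma_latency_ms > 0.0 &&
            sample_ms > st->ewma_latency_ms * SLOWDOWN_FACTOR;

        if (st->ewma_latency_ms <= 0.0)         /* first sample seeds the average */
                st->ewma_latency_ms = sample_ms;
        else
                st->ewma_latency_ms = (1.0 - EWMA_ALPHA) * st->ewma_latency_ms +
                    EWMA_ALPHA * sample_ms;
        return slow;
}

In practice you'd also want the slowdown to persist for longer than the timeout period before actually offlining anything.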

But in this bug I hit today, the kernel was locked up waiting on IO. Zpool commands hung. I doubt ZFS could have offlined a device even if it wanted to. There was a suggestion that this is an issue in the mpt driver. But I think improvements may be needed in the sd subsystem so it can handle shitty drivers, since those are a fact of life and are sometimes outside our control. Like with mpt, which is closed.

Cheers,

Alasdair
