[illumos-Developer] Important - time sensitive: Drive failures and infinite waits
Haudy Kazemi
kaze0010 at umn.edu
Thu May 26 14:46:16 PDT 2011
On 5/26/2011 2:10 PM, Alasdair Lumsden wrote:
> Hi Garrett,
>
> I've collected together all the info as best I could here:
>
> https://www.illumos.org/issues/1069
>
> I'm going to send another email with login details so if you find an opportunity to take a look it would of course be much appreciated. It sounds like quite a few other people have been bitten by this over the years. George Wilson believes he's seen it before, as does estibi, and a few others.
>
> Thanks,
>
> Alasdair
If anyone experiences this scenario while running the whole system in a
virtual machine, perhaps you can use your virtual machine software to
save a state snapshot for later analysis.
There have been sporadic reports sounding similar to Alasdair's over the
years. The zfs-discuss list has records of some of them and related
discussion (for example, see the archives from 2010-07-22 through
2010-07-24 for posts in the thread called '1tb SATA drives' by Miles
Nordin, myself, and others.)
http://www.mail-archive.com/zfs-discuss@opensolaris.org/msg40826.html
IIRC, many discussions ended with "use enterprise drives, equipment, and
TLER" instead of less expensive equipment. That was not a satisfactory
response to discussion participants who held the view that ZFS should be
able to smoothly recover from failure of drives/controllers/storage
drivers, whether or not those subsystems behaved as expected during a
failure. To do otherwise means ZFS is making storage-subsystem
behavior and reliability assumptions that are not in tune with its motto
of being the "last word in filesystems" and its goal of achieving high
reliability and high performance, even on less expensive storage devices.
The slow-death failure scenario comes to mind, where a particular drive
still works but is much slower than usual. No error is reported,
but response time is severely impacted. Borderline sectors that fail
slowly can be a trigger. Given appropriate redundancy, and an
expected maximum response time threshold, ZFS could reconstruct the
needed data using the other devices any time the maximum response time
threshold was exceeded. E.g., if a device access request is not
answered within 7 seconds (configurable to allow for different pool
architectures), attempt to reconstruct the data; no need for TLER
support in the drive, and no need for the storage drivers to respond in
a particular way. Monitoring device performance changes over time
and observed delays could also serve as a warning that a device may
be failing via slow-death.
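The deadline-plus-fallback idea above can be sketched in a few lines. This is purely illustrative pseudocode in Python, not real ZFS internals: the device dicts, latencies, and the read functions are invented for the example, and the 7-second default mirrors the configurable threshold suggested above.

```python
# Hypothetical sketch of timeout-driven recovery: if a replica does not
# answer within a configurable deadline, satisfy the read from another
# redundant copy instead of waiting indefinitely. Device names, latencies,
# and read_block() are illustrative only, not real ZFS code.
from concurrent.futures import ThreadPoolExecutor, TimeoutError
import time

DEADLINE_SECS = 7.0  # configurable per pool, as suggested above

def read_block(device, offset):
    """Simulated device read; a slow-death drive responds very late."""
    time.sleep(device["latency"])
    return device["data"][offset]

def redundant_read(devices, offset, deadline=DEADLINE_SECS):
    """Try each replica in turn; fall back when one exceeds the deadline."""
    with ThreadPoolExecutor(max_workers=len(devices)) as pool:
        for dev in devices:
            future = pool.submit(read_block, dev, offset)
            try:
                return future.result(timeout=deadline)
            except TimeoutError:
                # Device exceeded the response threshold: flag it as a
                # possible slow-death candidate and try the next replica.
                print("slow device:", dev["name"])
    raise IOError("no replica answered within the deadline")

if __name__ == "__main__":
    mirror = [
        {"name": "c0t0d0", "latency": 2.0, "data": {0: b"payload"}},   # ailing
        {"name": "c0t1d0", "latency": 0.01, "data": {0: b"payload"}},  # healthy
    ]
    # The slow side of the mirror misses a short deadline, so the read is
    # satisfied from the healthy replica without an error from the drive.
    print(redundant_read(mirror, 0, deadline=0.5))
```

A real implementation would of course issue the fallback reads in parallel and feed the per-device latency history into a health monitor; the point here is just that no TLER support or driver cooperation is required for the fallback itself.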
IIRC, ZFS built on top of iSCSI devices was also impacted by timeout
issues. I don't know whether those iSCSI timeouts were ever addressed
or made configurable.