[illumos-Developer] Important - time sensitive: Drive failures and infinite waits

Richard Elling richard.elling at richardelling.com
Fri May 27 12:16:17 PDT 2011


On May 27, 2011, at 5:33 AM, Gary Mills wrote:

> On Fri, May 27, 2011 at 02:04:42AM +0100, Alasdair Lumsden wrote:
>> 
>> Gordon Ross and George Wilson were kind enough to do some extensive
>> rummaging around prior to the reboot, and with some input from Eric
>> Schrock, it sounds like the issue was a phy lock due to an ASIC
>> fault in the LSI 1068 present on the cards when used with SATA
>> drives specifically.
> 
> Perhaps we need a software watchdog to protect against hardware
> failures of that sort?  Doesn't the SCSI driver time out and do a bus
> reset when the target doesn't respond?

The failure modes are many, and their interactions complex. I recently saw a
bad cable, one of the 4x SAS cables. One path was flaky, so if you were unlucky
enough that your I/O got scheduled down that path, then it looked to ZFS like a
checksum error. All disks connected to the wire (dozens) showed hundreds of
thousands of checksum errors.
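
As a rough illustration of that pattern, here is a minimal sketch of a
heuristic that flags a shared path when most of the disks behind it report
checksum errors at once. Everything in it is an assumption for illustration:
the function name, the thresholds, and both input mappings are hypothetical,
and the disk-to-path mapping would have to be supplied by hand. It only
points at a shared path; it cannot say which component on that path is at
fault.

```python
from collections import defaultdict


def suspect_shared_paths(cksum_errors, disk_to_path,
                         min_disks=3, min_fraction=0.8):
    """Return labels of shared paths where most attached disks show errors.

    cksum_errors: dict of disk name -> checksum error count (hypothetical
                  input, e.g. copied by hand from pool status output).
    disk_to_path: dict of disk name -> label of the shared cable or wide
                  port (hypothetical; the admin must supply this mapping).
    min_disks, min_fraction: arbitrary illustrative thresholds.
    """
    # Group disks by the path they share.
    disks_on_path = defaultdict(list)
    for disk, path in disk_to_path.items():
        disks_on_path[path].append(disk)

    suspects = []
    for path, disks in sorted(disks_on_path.items()):
        # Count the disks on this path that report any checksum errors.
        failing = [d for d in disks if cksum_errors.get(d, 0) > 0]
        if len(disks) >= min_disks and len(failing) / len(disks) >= min_fraction:
            suspects.append(path)
    return suspects


# Made-up example data: every disk on cable-A is failing, one on cable-B.
errors = {"d0": 120000, "d1": 98000, "d2": 310000,
          "d3": 0, "d4": 0, "d5": 7}
paths = {"d0": "cable-A", "d1": "cable-A", "d2": "cable-A",
         "d3": "cable-B", "d4": "cable-B", "d5": "cable-B"}
print(suspect_shared_paths(errors, paths))
```

With the made-up data above, only cable-A is flagged, since a single disk
with a few errors on cable-B looks more like an individual disk problem than
a shared-path one.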

It is not clear to me that we can disable a specific phy in an automated manner
at the OS level. Nor is it clear to me that we can reliably root-cause this
failure mode in all cases: is it the phy, the cable, or the phy at the other end?
 -- richard



