[illumos-Developer] Important - time sensitive: Drive failures and infinite waits
Alasdair Lumsden
alasdairrr at gmail.com
Thu May 26 18:04:42 PDT 2011
Hi All,
After rebooting the box, the drive that seemed to be misbehaving now seems completely fine; at least, some basic throughput tests show it performing normally.
Gordon Ross and George Wilson were kind enough to do some extensive rummaging around prior to the reboot, and with some input from Eric Schrock, it sounds like the issue was a phy lock due to an ASIC fault in the LSI 1068 present on the cards when used with SATA drives specifically.
It was suggested that I look at alternative HBA cards, such as the LSI 9201-16i.
I'm going to do just that, since we need to return this box to production as soon as possible.
I'd like to thank everyone who helped investigate this for their time; it's very much appreciated. I still think it would be good if some logic could be added to sd to work around issues in drivers/firmware/hardware, if it's at all feasible.
The odd thing is that I saw virtually identical behaviour on a different box a few weeks ago, with the mpt_sas driver and SAS drives in an external enclosure in an HA storage cluster. I contacted Garrett when the thing locked up, and he suggested dropping sd_io_time. That case was definitely a disk failure rather than a phy lock, as it persisted after we failed the cluster over to the other head, where the pool import locked up.
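For reference, sd_io_time is the sd driver's per-command timeout in seconds, and it is normally lowered via /etc/system. A minimal sketch, assuming the stock default of 60 seconds; the value 10 here is purely illustrative and is not necessarily what Garrett suggested:

```
* /etc/system fragment (illustrative only; takes effect after a reboot)
* sd_io_time: sd command timeout in seconds (default assumed to be 60)
set sd:sd_io_time=10
```

A lower value makes sd give up on a hung command sooner, at the cost of possibly timing out slow but healthy devices.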
I'm not 100% sure replacing the cards will save me from this happening again, but if there is a known bug in the ASIC then I'd prefer not to take the chance.
Cheers,
Alasdair