[illumos-Developer] Important - time sensitive: Drive failures and infinite waits

Alasdair Lumsden alasdairrr at gmail.com
Fri May 27 12:45:53 PDT 2011


Hi Garrett,

On 27 May 2011, at 20:14, Garrett D'Amore wrote:

> It does.  One type of problem is a drive that does not hard fail but manages to limp along doing a request or two per second.  We don't have a good defense for this at present.  (Internal retry logic in the drives makes this harder too.)

With this issue, I don't think any requests were making it to the drive. Things were wedged, and stayed wedged for 6 hours with no change in the number of queued requests. George Wilson's comments on the bug provide some insight:

https://www.illumos.org/issues/1069

I didn't see this particular issue on snv_130; it only occurred after the box was re-installed with oi_148. estibi on IRC saw exactly the same thing on systems with the same LSI chipset. He has 40 machines, all with the same LSI card (SAS1068E), and never saw things wedge, but saw exactly the same behaviour as me when he upgraded some to oi_148. He has observed this wedge on oi_148 more than 3 times.

There was talk of there being a bug in the ASIC with phy lockups, but it seems odd that the issue wasn't observed with snv_130 but was with oi_148.

Cheers,

Alasdair




