[illumos-Developer] Important - time sensitive: Drive failures and infinite waits

Alasdair Lumsden alasdairrr at gmail.com
Thu May 26 10:58:38 PDT 2011


Hi Richard,

On 26 May 2011, at 18:43, Richard Elling wrote:
> Good. So there should be a record when a retry is sent.  I've got some dtrace running around
> somewhere that watches the reset/retries, or you can enable more detailed sd debugging. See
> http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/common/io/scsi/targets/sd.c#251
> 
> The decision to kick in a hot spare is made in the zfs-retire FMA module that subscribes to various
> io error events. Before this can happen, something has to generate the io error events.

I don't think any errors are making their way up the chain. No errors are reported through iostat (although perhaps they wouldn't be), and things have been wedged for several hours now. There are no FMA events, and nothing is mentioned in dmesg.

There's a kernel thread blocked in biowait; it's been stuck in the same state for many hours now.

I think we've hit a kernel bug here, or perhaps a "kernel assumption" about what the driver should or shouldn't be doing with regard to timeouts.

I can give you a login to the box if you'd like to have a poke around? It's on a public IP.

Cheers,

Alasdair

