[illumos-Developer] Important - time sensitive: Drive failures and infinite waits
Alasdair Lumsden
alasdairrr at gmail.com
Thu May 26 10:58:38 PDT 2011
Hi Richard,
On 26 May 2011, at 18:43, Richard Elling wrote:
> Good. So there should be a record when a retry is sent. I've got some dtrace running around
> somewhere that watches the reset/retries, or you can enable more detailed sd debugging. See
> http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/common/io/scsi/targets/sd.c#251
>
> The decision to kick in a hot spare is made in the zfs-retire FMA module that subscribes to various
> io error events. Before this can happen, something has to generate the io error events.
I don't think any errors are making their way up the chain: no errors are reported through iostat (although perhaps they wouldn't be), and things have been wedged for several hours now. There are no FMA events, and nothing is mentioned in dmesg.
There's a kernel thread blocked in biowait; it's been stuck in the same state for many hours now.
I think we've hit a kernel bug here, or perhaps a "kernel assumption" about what the driver should or shouldn't be doing with regard to timeouts.
I can give you a login to the box if you'd like to have a poke around. It's on a public IP.
Cheers,
Alasdair