[illumos-Developer] Important - time sensitive: Drive failures and infinite waits

Richard Elling richard.elling at richardelling.com
Thu May 26 10:43:32 PDT 2011


On May 26, 2011, at 8:14 AM, Alasdair Lumsden wrote:

> Hi Richard,
> 
> This box is running the latest oi_148 (actually a slightly newer, unreleased oi_148 with additional illumos backports):
> 
> http://hg.openindiana.org/mq_onnv-gate/file/3e2c4091ddeb
> 
> What makes you think the timeout values haven't stuck? mdb is showing the values did propagate to the per-disk sd state:
> 
> root ~ (san01.ixlon1): /usr/bin/uname -a
> SunOS san01.ixlon1.everycity.co.uk 5.11 oi_148 i86pc i386 i86pc
> root ~ (san01.ixlon1): fmdump -eV
> TIME                           CLASS
> fmdump: warning: /var/fm/fmd/errlog is empty
> root ~ (san01.ixlon1): echo "sd_io_time::print" | mdb -k
> 0x7
> root ~ (san01.ixlon1): echo "::walk sd_state | ::grep '.!=0' | ::sd_state" | mdb -k | egrep "^un|un_retry_count|un_cmd_timeout"
> un: ffffff090f43c640
>    un_retry_count = 0x3
>    un_cmd_timeout = 0x7

Good. So there should be a record whenever a retry is sent. I have some DTrace scripts lying
around somewhere that watch the resets and retries, or you can enable more detailed sd debugging. See
http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/common/io/scsi/targets/sd.c#251
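
With this many disks, the ::sd_state output quoted in this message is easier to check as a
tally than by eye. What follows is only a sketch: tally_sd_state is a made-up helper name,
not part of mdb, and it assumes the output keeps the "field = value" layout shown in the quote.
In the quoted output, four of the units report un_retry_count = 0x5 rather than 0x3, which a
tally like this makes easy to spot.

```shell
#!/bin/sh
# Tally (un_retry_count, un_cmd_timeout) pairs from "::sd_state" output on stdin.
# tally_sd_state is a hypothetical helper name used for illustration only.
tally_sd_state() {
    awk '
        # Drop a leading mail-quote marker ("> ") so pasted output also works.
        { sub(/^>[ \t]*/, "") }
        # Remember the retry count until the matching timeout line arrives.
        $1 == "un_retry_count" { retry = $3 }
        $1 == "un_cmd_timeout" { count[retry " " $3]++ }
        END {
            for (k in count) {
                split(k, f, " ")
                printf "retry=%s timeout=%s units=%d\n", f[1], f[2], count[k]
            }
        }
    ' | sort
}

# On a live system, feed it the same mdb pipeline used in the quoted text:
# echo "::walk sd_state | ::grep '.!=0' | ::sd_state" | mdb -k | tally_sd_state
```

Each output line gives one distinct retry/timeout combination and how many sd units share it,
so any unit whose values differ from the rest shows up as its own line.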

The decision to kick in a hot spare is made in the zfs-retire FMA module, which subscribes to
various I/O error events. Before this can happen, something has to generate those I/O error events.

I'd be very interested in the FMA ereports you're seeing.
 -- richard

> un: ffffff09184a5940
>    un_retry_count = 0x3
>    un_cmd_timeout = 0x7
> un: ffffff09184a5300
>    un_retry_count = 0x3
>    un_cmd_timeout = 0x7
> un: ffffff09184a4cc0
>    un_retry_count = 0x3
>    un_cmd_timeout = 0x7
> un: ffffff09184a4680
>    un_retry_count = 0x3
>    un_cmd_timeout = 0x7
> un: ffffff09184a4040
>    un_retry_count = 0x3
>    un_cmd_timeout = 0x7
> un: ffffff0918a81980
>    un_retry_count = 0x3
>    un_cmd_timeout = 0x7
> un: ffffff0918a81340
>    un_retry_count = 0x3
>    un_cmd_timeout = 0x7
> un: ffffff090f43c000
>    un_retry_count = 0x3
>    un_cmd_timeout = 0x7
> un: ffffff0918bd3980
>    un_retry_count = 0x3
>    un_cmd_timeout = 0x7
> un: ffffff0918bd3340
>    un_retry_count = 0x3
>    un_cmd_timeout = 0x7
> un: ffffff0918bd2d00
>    un_retry_count = 0x3
>    un_cmd_timeout = 0x7
> un: ffffff0918bd26c0
>    un_retry_count = 0x3
>    un_cmd_timeout = 0x7
> un: ffffff0918bd2080
>    un_retry_count = 0x3
>    un_cmd_timeout = 0x7
> un: ffffff09169c8000
>    un_retry_count = 0x3
>    un_cmd_timeout = 0x7
> un: ffffff090dfa26c0
>    un_retry_count = 0x3
>    un_cmd_timeout = 0x7
> un: ffffff0918bd9940
>    un_retry_count = 0x3
>    un_cmd_timeout = 0x7
> un: ffffff0918bb6640
>    un_retry_count = 0x3
>    un_cmd_timeout = 0x7
> un: ffffff0918bd9300
>    un_retry_count = 0x3
>    un_cmd_timeout = 0x7
> un: ffffff0918bd8cc0
>    un_retry_count = 0x3
>    un_cmd_timeout = 0x7
> un: ffffff090edde9c0
>    un_retry_count = 0x3
>    un_cmd_timeout = 0x7
> un: ffffff090eddd700
>    un_retry_count = 0x3
>    un_cmd_timeout = 0x7
> un: ffffff0918bd8680
>    un_retry_count = 0x3
>    un_cmd_timeout = 0x7
> un: ffffff0918bd8040
>    un_retry_count = 0x3
>    un_cmd_timeout = 0x7
> un: ffffff0918bcad40
>    un_retry_count = 0x3
>    un_cmd_timeout = 0x7
> un: ffffff0918bcb380
>    un_retry_count = 0x3
>    un_cmd_timeout = 0x7
> un: ffffff090eddd0c0
>    un_retry_count = 0x3
>    un_cmd_timeout = 0x7
> un: ffffff090f43d2c0
>    un_retry_count = 0x3
>    un_cmd_timeout = 0x7
> un: ffffff0918bf0c80
>    un_retry_count = 0x3
>    un_cmd_timeout = 0x7
> un: ffffff090f43cc80
>    un_retry_count = 0x3
>    un_cmd_timeout = 0x7
> un: ffffff0918a806c0
>    un_retry_count = 0x3
>    un_cmd_timeout = 0x7
> un: ffffff0918a80d00
>    un_retry_count = 0x3
>    un_cmd_timeout = 0x7
> un: ffffff0918bc79c0
>    un_retry_count = 0x3
>    un_cmd_timeout = 0x7
> un: ffffff0918bc7380
>    un_retry_count = 0x3
>    un_cmd_timeout = 0x7
> un: ffffff0918bc6d40
>    un_retry_count = 0x3
>    un_cmd_timeout = 0x7
> un: ffffff0918bc6700
>    un_retry_count = 0x3
>    un_cmd_timeout = 0x7
> un: ffffff0918bc60c0
>    un_retry_count = 0x3
>    un_cmd_timeout = 0x7
> un: ffffff0918bb7900
>    un_retry_count = 0x3
>    un_cmd_timeout = 0x7
> un: ffffff0918bb72c0
>    un_retry_count = 0x3
>    un_cmd_timeout = 0x7
> un: ffffff0918bb6c80
>    un_retry_count = 0x3
>    un_cmd_timeout = 0x7
> un: ffffff0918bcb9c0
>    un_retry_count = 0x5
>    un_cmd_timeout = 0x7
> un: ffffff0918bca700
>    un_retry_count = 0x5
>    un_cmd_timeout = 0x7
> un: ffffff0918a80080
>    un_retry_count = 0x5
>    un_cmd_timeout = 0x7
> un: ffffff090edddd40
>    un_retry_count = 0x5
>    un_cmd_timeout = 0x7
> un: ffffff0918bf12c0
>    un_retry_count = 0x3
>    un_cmd_timeout = 0x7
> un: ffffff090dfa2d00
>    un_retry_count = 0x3
>    un_cmd_timeout = 0x7
> un: ffffff090dfa2080
>    un_retry_count = 0x3
>    un_cmd_timeout = 0x7
> un: ffffff0918bb6000
>    un_retry_count = 0x3
>    un_cmd_timeout = 0x7
> un: ffffff0918bca0c0
>    un_retry_count = 0x3
>    un_cmd_timeout = 0x7
> un: ffffff0918bf0640
>    un_retry_count = 0x3
>    un_cmd_timeout = 0x7
> un: ffffff0918bf1900
>    un_retry_count = 0x3
>    un_cmd_timeout = 0x7
> un: ffffff0918bf0000
>    un_retry_count = 0x3
>    un_cmd_timeout = 0x7
> root ~ (san01.ixlon1): 
> root ~ (san01.ixlon1): 
> 
> But I might have misunderstood.
> 
> Cheers,
> 
> Alasdair



