[illumos-Developer] Important - time sensitive: Drive failures and infinite waits
Richard Elling
richard.elling at richardelling.com
Thu May 26 10:43:32 PDT 2011
On May 26, 2011, at 8:14 AM, Alasdair Lumsden wrote:
> Hi Richard,
>
> This box is running the latest oi_148 (actually, it's running a slightly newer unreleased oi_148 which has additional Illumos backports:
>
> http://hg.openindiana.org/mq_onnv-gate/file/3e2c4091ddeb)
>
> What makes you think the timeout values haven't stuck? mdb is showing the values did propagate to the per-disk sd state:
>
> root ~ (san01.ixlon1): /usr/bin/uname -a
> SunOS san01.ixlon1.everycity.co.uk 5.11 oi_148 i86pc i386 i86pc
> root ~ (san01.ixlon1): fmdump -eV
> TIME CLASS
> fmdump: warning: /var/fm/fmd/errlog is empty
> root ~ (san01.ixlon1): echo "sd_io_time::print" | mdb -k
> 0x7
> root ~ (san01.ixlon1): echo "::walk sd_state | ::grep '.!=0' | ::sd_state" | mdb -k | egrep "^un|un_retry_count|un_cmd_timeout"
> un: ffffff090f43c640
>
> un_retry_count = 0x3
> un_cmd_timeout = 0x7
Good. So there should be a record when a retry is sent. I've got a DTrace script lying around
somewhere that watches the resets/retries, or you can enable more detailed sd debugging. See
http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/common/io/scsi/targets/sd.c#251
The decision to kick in a hot spare is made in the zfs-retire FMA module, which subscribes to various
I/O error events. Before this can happen, something has to generate the I/O error events.
I'd be very interested in the FMA ereports you're seeing.
-- richard
> un: ffffff09184a5940
> un_retry_count = 0x3
> un_cmd_timeout = 0x7
> un: ffffff09184a5300
> un_retry_count = 0x3
> un_cmd_timeout = 0x7
> un: ffffff09184a4cc0
> un_retry_count = 0x3
> un_cmd_timeout = 0x7
> un: ffffff09184a4680
> un_retry_count = 0x3
> un_cmd_timeout = 0x7
> un: ffffff09184a4040
> un_retry_count = 0x3
> un_cmd_timeout = 0x7
> un: ffffff0918a81980
> un_retry_count = 0x3
> un_cmd_timeout = 0x7
> un: ffffff0918a81340
> un_retry_count = 0x3
> un_cmd_timeout = 0x7
> un: ffffff090f43c000
> un_retry_count = 0x3
> un_cmd_timeout = 0x7
> un: ffffff0918bd3980
> un_retry_count = 0x3
> un_cmd_timeout = 0x7
> un: ffffff0918bd3340
> un_retry_count = 0x3
> un_cmd_timeout = 0x7
> un: ffffff0918bd2d00
> un_retry_count = 0x3
> un_cmd_timeout = 0x7
> un: ffffff0918bd26c0
> un_retry_count = 0x3
> un_cmd_timeout = 0x7
> un: ffffff0918bd2080
> un_retry_count = 0x3
> un_cmd_timeout = 0x7
> un: ffffff09169c8000
> un_retry_count = 0x3
> un_cmd_timeout = 0x7
> un: ffffff090dfa26c0
> un_retry_count = 0x3
> un_cmd_timeout = 0x7
> un: ffffff0918bd9940
> un_retry_count = 0x3
> un_cmd_timeout = 0x7
> un: ffffff0918bb6640
> un_retry_count = 0x3
> un_cmd_timeout = 0x7
> un: ffffff0918bd9300
> un_retry_count = 0x3
> un_cmd_timeout = 0x7
> un: ffffff0918bd8cc0
> un_retry_count = 0x3
> un_cmd_timeout = 0x7
> un: ffffff090edde9c0
> un_retry_count = 0x3
> un_cmd_timeout = 0x7
> un: ffffff090eddd700
> un_retry_count = 0x3
> un_cmd_timeout = 0x7
> un: ffffff0918bd8680
> un_retry_count = 0x3
> un_cmd_timeout = 0x7
> un: ffffff0918bd8040
> un_retry_count = 0x3
> un_cmd_timeout = 0x7
> un: ffffff0918bcad40
> un_retry_count = 0x3
> un_cmd_timeout = 0x7
> un: ffffff0918bcb380
> un_retry_count = 0x3
> un_cmd_timeout = 0x7
> un: ffffff090eddd0c0
> un_retry_count = 0x3
> un_cmd_timeout = 0x7
> un: ffffff090f43d2c0
> un_retry_count = 0x3
> un_cmd_timeout = 0x7
> un: ffffff0918bf0c80
> un_retry_count = 0x3
> un_cmd_timeout = 0x7
> un: ffffff090f43cc80
> un_retry_count = 0x3
> un_cmd_timeout = 0x7
> un: ffffff0918a806c0
> un_retry_count = 0x3
> un_cmd_timeout = 0x7
> un: ffffff0918a80d00
> un_retry_count = 0x3
> un_cmd_timeout = 0x7
> un: ffffff0918bc79c0
> un_retry_count = 0x3
> un_cmd_timeout = 0x7
> un: ffffff0918bc7380
> un_retry_count = 0x3
> un_cmd_timeout = 0x7
> un: ffffff0918bc6d40
> un_retry_count = 0x3
> un_cmd_timeout = 0x7
> un: ffffff0918bc6700
> un_retry_count = 0x3
> un_cmd_timeout = 0x7
> un: ffffff0918bc60c0
> un_retry_count = 0x3
> un_cmd_timeout = 0x7
> un: ffffff0918bb7900
> un_retry_count = 0x3
> un_cmd_timeout = 0x7
> un: ffffff0918bb72c0
> un_retry_count = 0x3
> un_cmd_timeout = 0x7
> un: ffffff0918bb6c80
> un_retry_count = 0x3
> un_cmd_timeout = 0x7
> un: ffffff0918bcb9c0
> un_retry_count = 0x5
> un_cmd_timeout = 0x7
> un: ffffff0918bca700
> un_retry_count = 0x5
> un_cmd_timeout = 0x7
> un: ffffff0918a80080
> un_retry_count = 0x5
> un_cmd_timeout = 0x7
> un: ffffff090edddd40
> un_retry_count = 0x5
> un_cmd_timeout = 0x7
> un: ffffff0918bf12c0
> un_retry_count = 0x3
> un_cmd_timeout = 0x7
> un: ffffff090dfa2d00
> un_retry_count = 0x3
> un_cmd_timeout = 0x7
> un: ffffff090dfa2080
> un_retry_count = 0x3
> un_cmd_timeout = 0x7
> un: ffffff0918bb6000
> un_retry_count = 0x3
> un_cmd_timeout = 0x7
> un: ffffff0918bca0c0
> un_retry_count = 0x3
> un_cmd_timeout = 0x7
> un: ffffff0918bf0640
> un_retry_count = 0x3
> un_cmd_timeout = 0x7
> un: ffffff0918bf1900
> un_retry_count = 0x3
> un_cmd_timeout = 0x7
> un: ffffff0918bf0000
> un_retry_count = 0x3
> un_cmd_timeout = 0x7
> root ~ (san01.ixlon1):
> root ~ (san01.ixlon1):
>
> But I might have misunderstood.
>
> Cheers,
>
> Alasdair
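
For a listing as long as the quoted mdb output, it can be easier to tally the (un_retry_count, un_cmd_timeout) pairs than to read every unit by eye. In the output above, every unit shows un_cmd_timeout = 0x7, and four units report un_retry_count = 0x5 where the rest report 0x3. The following is only a rough sketch: the function names (parse_sd_units, tally) and the short SAMPLE text are made up for illustration, and it assumes the input looks like the egrep-filtered text shown in the message, with or without the "> " quote prefix.

```python
import re
from collections import Counter

# Patterns for the egrep-filtered ::sd_state lines; a leading "> " quote
# marker (as in the quoted mail) is tolerated.
UN_RE = re.compile(r"^[>\s]*un:\s*([0-9a-f]+)\s*$")
FIELD_RE = re.compile(
    r"^[>\s]*(un_retry_count|un_cmd_timeout)\s*=\s*(0x[0-9a-f]+)\s*$"
)


def parse_sd_units(text):
    """Return {un_address: {field_name: int_value}} for each unit found."""
    units = {}
    current = None
    for line in text.splitlines():
        m = UN_RE.match(line)
        if m:
            # Start of a new per-disk soft state block.
            current = units.setdefault(m.group(1), {})
            continue
        m = FIELD_RE.match(line)
        if m and current is not None:
            current[m.group(1)] = int(m.group(2), 16)
    return units


def tally(units):
    """Count how many units share each (retry_count, cmd_timeout) pair."""
    return Counter(
        (u.get("un_retry_count"), u.get("un_cmd_timeout"))
        for u in units.values()
    )


# Hypothetical two-unit excerpt in the same shape as the quoted output.
SAMPLE = """\
> un: ffffff090f43c640
>
> un_retry_count = 0x3
> un_cmd_timeout = 0x7
> un: ffffff0918bcb9c0
> un_retry_count = 0x5
> un_cmd_timeout = 0x7
"""

if __name__ == "__main__":
    for (retries, timeout), count in sorted(tally(parse_sd_units(SAMPLE)).items()):
        print(f"un_retry_count={retries} un_cmd_timeout={timeout} units={count}")
```

Saving the filtered mdb output to a file and passing its contents to parse_sd_units() in place of SAMPLE would give the same tally for the whole box. Any pair that shows up only a few times points at the units worth a closer look.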