[illumos-Developer] Important - time sensitive: Drive failures and infinite waits

Garrett D'Amore garrett at damore.org
Thu May 26 07:18:01 PDT 2011


Please supply your zpool status -v output so that we can see your pools.

  -- Garrett D'Amore

On May 26, 2011, at 5:41 PM, Alasdair Lumsden <alasdairrr at gmail.com> wrote:

> Hi All,
> 
> Twice in the past two weeks we've suffered a drive failure that caused an entire storage node to lock up and stop responding to IO, with iostat showing 100% busy against a single disk whilst the others sat idle. The only resolution was to yank the drive out.
> 
> These were two completely different setups as well. One is a pair of Dell R710s attached to LSI SAS 6Gbps disk shelves via an LSI 9200-8e card using the mpt_sas driver, with 36 Seagate Constellation ES SAS disks. The other is a custom build with a Supermicro motherboard, LSI 3801E-R cards using the mpt driver, and 48 Western Digital SATA drives.
> 
> So this is two different machines, different RAID cards, different drivers, different disks, exhibiting exactly the same failure mode.
> 
> On the storage array this happened on today, I had already adjusted the sd timeout to 7 seconds, with 3 retries, using:
> 
> set sd:sd_io_time=7 (/etc/system)
> sd-config-list = "ATA     WDC WD7501AALS-0", "retries-timeout:3"; (/kernel/drv/sd.conf)
> 
> So in theory, when a disk stalls, it should get removed by sd after 21 seconds (7 seconds x 3 retries). Over 30 minutes have now passed, and the machine is still sitting there attempting to write to the pool.
> 
> The good news is that this SAN wasn't in production and has nothing on it (yet). I need to return it to service within the next 48 hours, but in the meantime this is an ideal opportunity for one of the Illumos kernel developers to get on the box and do some diagnosing.
> 
> This is one of the biggest and most serious issues I've seen with using ZFS in SAN/NAS environments: when a drive fails, it doesn't get taken out of service. I've seen it quite a few times before.
> 
> I'm hoping that now that it can be reproduced, the devs can nail this once and for all. Please contact me off-list and I'll provide SSH access details to get on it.
> 
> But this disk may fail completely soon, so please act quickly, otherwise the window of opportunity may be lost.
> 
> Cheers,
> 
> Alasdair
> 
> _______________________________________________
> Developer mailing list
> Developer at lists.illumos.org
> http://lists.illumos.org/m/listinfo/developer
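
For reference, the 21-second figure in the quoted message appears to come from multiplying sd_io_time by the retry count. Below is a minimal sketch of that arithmetic in Python. The function name and the count_initial_attempt flag are hypothetical and used only for illustration. Whether sd counts the initial attempt separately from the retries is an assumption here, so both readings are shown. Timeouts and resets in the HBA driver layer (mpt, mpt_sas) are not modelled at all.

```python
def worst_case_stall_seconds(io_time_s, retries, count_initial_attempt=False):
    """Rough upper bound on how long one command should stay outstanding.

    Simplified model (an assumption, not the actual sd logic): each attempt
    may take up to io_time_s seconds before it is timed out.
    """
    attempts = retries + 1 if count_initial_attempt else retries
    return io_time_s * attempts


# Values from the quoted message: sd_io_time=7, retries-timeout:3
expected_s = worst_case_stall_seconds(7, 3)
expected_with_first_try_s = worst_case_stall_seconds(7, 3, count_initial_attempt=True)

# The quoted message reports the stall had lasted over 30 minutes.
observed_s = 30 * 60

print("expected bound (retries only):", expected_s, "s")
print("expected bound (initial try + retries):", expected_with_first_try_s, "s")
print("observed stall exceeds bound:", observed_s > expected_with_first_try_s)
```

Under either reading the observed stall is far longer than the configured bound, which suggests the wait is happening somewhere these sd settings do not reach.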
