[illumos-Developer] Changing sd_io_time to 8?
Joerg Schilling
Joerg.Schilling at fokus.fraunhofer.de
Tue May 3 06:10:08 PDT 2011
Richard Elling <richard.elling at richardelling.com> wrote:
> On Apr 30, 2011, at 6:59 AM, Mike La Spina wrote:
> > Hi Richard,
> >
> > This would be a very beneficial change, since a single failing sd device in a raidz1 seg would cause a 60 second delay in all IO for that seg. I can see some potential for a negative impact in the VM world, but if it's tunable that's really a non-issue.
> > e.g.
> >
> > if VENDOR = VMware
> > then
> > sd_io_time = 120
>
> IIRC, I wrote an RFE for that about 10 years ago... don't recall the CR number, though.
> As you can see, for high availability systems (I was working with Sun Cluster at the time)
> the timeouts are critical. Similarly, for upper layers, like ZFS, that rely on the lower layers
> to handle errors, waiting 5 minutes to declare an HDD dead is unacceptable.
>
> Judging from some of the responses here, and a depressingly huge amount of misinformation
> on the 'net, the reset logic and control structure could use a good dose of exposure. I did
> that analysis 10 years ago and can redo it today, but please give me a few weeks to pull it together.
It depends on what exactly happened. If the drive dies completely, you should
be able to detect this in a different way and disable high-level retries in the
driver.
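
As a rough, untested sketch of that first case: per-device retry behaviour can
already be trimmed through the sd-config-list property in /kernel/drv/sd.conf.
Note that this tunes the number of retries after a command timeout, not
sd_io_time itself, and the vendor/product string below is only an example (the
vendor field is padded to eight characters):

  # /kernel/drv/sd.conf -- example only: fewer timeout retries for one device type
  sd-config-list =
      "VMware  Virtual disk", "retries-timeout:2";
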
In cases where a drive merely becomes slow, it may still be better to be able
to get the data; but if there is a ZFS RAID on top of the drive, it makes
sense to tell the driver to do a quick abort too.
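
For the slow-drive case, the driver-wide timeout this thread is about can be
lowered persistently in /etc/system, or changed on a running system with
mdb -kw. A minimal sketch, using the 8 seconds from the subject line (treat
the exact value as an assumption, not a recommendation):

  * /etc/system -- takes effect at the next boot
  set sd:sd_io_time=8
  * the FC "ssd" driver keeps its own copy of the tunable, if it is in use
  set ssd:ssd_io_time=8

  # or change the running kernel instead (0t8 = decimal 8); reverts at reboot
  echo 'sd_io_time/W 0t8' | mdb -kw
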
Jörg
--
EMail: joerg at schily.isdn.cs.tu-berlin.de (home)  Jörg Schilling  D-13353 Berlin
       js at cs.tu-berlin.de                  (uni)
       joerg.schilling at fokus.fraunhofer.de (work) Blog: http://schily.blogspot.com/
URL:   http://cdrecord.berlios.de/private/ ftp://ftp.berlios.de/pub/schily