[illumos-Developer] Changing sd_io_time to 8?

Thu May 5 10:50:57 PDT 2011

>-----Original Message-----
>From: Joerg Schilling [mailto:Joerg.Schilling at fokus.fraunhofer.de] 
>Sent: Thursday, May 05, 2011 5:16 AM
>To: Mike La Spina; garrett at nexenta.com
>Cc: developer at lists.illumos.org
>Subject: Re: [illumos-Developer] Changing sd_io_time to 8?
>
>>"Garrett D'Amore" <garrett at nexenta.com> wrote:
>>
>> I think we can be smarter here too.
>>
>> CDROMS need a longer recovery time.
>>
>> Disks on Parallel SCSI might need longer.
>>
>> Disks on fibre, SAS, and SATA, should all respond within a narrow window
>> of time unless something is seriously amiss.
>
>
>There is no difference between Parallel SCSI and other SCSI transports besides 
>just the transport itself.

From a standards perspective I totally agree; however, the specific transport hardware plays a significant role in response behavior. A reset cycle may actually end up with longer waits on SAS than on Fibre Channel, so we have to be cautious.
For example, SAS reset behavior at the SCSI-3 level is very different from parallel SCSI-1.

>A hard-disk that is just booting after a power on or reset and that has read 
>problems in one of the copies for the firmware sectors or in one of the copies
>for the defect list sectors may take a _long_ time to recover too, and it is of 
>course worth waiting at least a minute in such a case.

I have experienced this type of failure; the power cycle is actually the most common failure event with any disk target.
From the perspective of storage system fault recovery, power is cycled on any physical drive swap, so this is definitely an element that needs to be considered.

+1

>If you really like to speed things up, start thinking about a way to reduce the 
>number of driver level _retries_ in case of problems if the related disk is 
>part of a RAID-Z2 and the data could be recovered otherwise.

Totally agree. The longest element in the wait cycle is retrying a request that will not recover.
Currently it's 5 retries at 60 seconds of sd_io_time each.
I would hazard that 2 retries is sufficient for RAID-Zn. Any thoughts?
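
To put rough numbers on that, here is a minimal sketch of the worst-case stall for a request that never completes. It assumes every retry waits out the full sd_io_time and ignores reset and backoff time; the function name and the option to count the first attempt are illustrative only, not how the sd driver accounts for it internally.

```python
def worst_case_stall(io_time_s, retries, include_first_attempt=False):
    """Upper bound, in seconds, spent waiting on a command that never completes.

    Assumption: each attempt consumes the full per-command timeout.
    """
    attempts = retries + (1 if include_first_attempt else 0)
    return io_time_s * attempts


# Compare the current values (60 s, 5 retries) with the proposed ones
# (sd_io_time of 8 from the subject line, 2 retries from this thread).
for io_time_s, retries in [(60, 5), (60, 2), (8, 5), (8, 2)]:
    stall = worst_case_stall(io_time_s, retries)
    print(f"sd_io_time={io_time_s:>2}s retries={retries}: {stall:>3}s")
```

Under those assumptions, going from 5 to 2 retries at 60 seconds cuts the bound from 300 to 120 seconds; with sd_io_time at 8 it would be 40 versus 16 seconds.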

+1 

>Jörg

-- 
 EMail:joerg at schily.isdn.cs.tu-berlin.de (home) Jörg Schilling D-13353 Berlin
       js at cs.tu-berlin.de                (uni)  
       joerg.schilling at fokus.fraunhofer.de (work) Blog: http://schily.blogspot.com/
 URL:  http://cdrecord.berlios.de/private/ ftp://ftp.berlios.de/pub/schily