[bugs] [illumos gate - Bug #1197] Hang after resilver finished with mpt
illumos bugs
bugs at lists.illumos.org
Mon Jul 11 12:47:05 PDT 2011
Issue #1197 has been updated by Roy Sigurd Karlsbakk.
I just rebooted the box, since I couldn't get anything more out of it anyway. After the reboot, a few more drives show as dead, and the drives that had finished resilvering (c4t37d0 and c4t43d0) are now resilvering once more. The full zpool status is below. No issues were reported after the resilver, but the pool/machine still hung, probably because of c4t23d0, which is now marked as faulted. Note that I have seen this and other OI machines with the same controllers kick out drives that later proved to be OK, even after thorough testing.
roy
rsk@prv-backup:~$ zpool status pbpool
  pool: pbpool
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Mon Jul 11 21:37:12 2011
    763G scanned out of 37.6T at 2.60G/s, 4h1m to go
    29.8G resilvered, 1.99% done
config:

        NAME                STATE     READ WRITE CKSUM
        pbpool              DEGRADED     0     0     0
          raidz2-0          ONLINE       0     0     0
            c4t0d0          ONLINE       0     0     0
            c4t1d0          ONLINE       0     0     0
            c4t2d0          ONLINE       0     0     0
            c4t3d0          ONLINE       0     0     0
            c4t4d0          ONLINE       0     0     0
            c4t5d0          ONLINE       0     0     0
            c4t6d0          ONLINE       0     0     0
          raidz2-1          ONLINE       0     0     0
            c4t7d0          ONLINE       0     0     0
            c4t8d0          ONLINE       0     0     0
            c4t9d0          ONLINE       0     0     0
            c4t10d0         ONLINE       0     0     0
            c4t11d0         ONLINE       0     0     0
            c4t12d0         ONLINE       0     0     0
            c4t13d0         ONLINE       0     0     0
          raidz2-2          ONLINE       0     0     0
            c4t14d0         ONLINE       0     0     0
            c4t15d0         ONLINE       0     0     0
            c4t16d0         ONLINE       0     0     0
            c4t17d0         ONLINE       0     0     0
            c4t18d0         ONLINE       0     0     0
            c4t19d0         ONLINE       0     0     0
            c4t20d0         ONLINE       0     0     0
          raidz2-3          DEGRADED     0     0     0
            c4t21d0         ONLINE       0     0     0
            c4t22d0         ONLINE       0     0     0
            spare-2         UNAVAIL      0     0     0
              c4t23d0       FAULTED      0     0     0  corrupted data
              c8t35d0       ONLINE       0     0     0  (resilvering)
            c4t24d0         ONLINE       0     0     0
            c4t25d0         ONLINE       0     0     0
            c4t26d0         ONLINE       0     0     0
            c4t27d0         ONLINE       0     0     0
          raidz2-4          DEGRADED     0     0     0
            c4t28d0         ONLINE       0     0     0
            c4t29d0         ONLINE       0     0     0
            c4t30d0         ONLINE       0     0     0
            c4t31d0         ONLINE       0     0     0
            c4t32d0         FAULTED      0     0     0  too many errors
            c4t33d0         ONLINE       0     0     0
            c4t34d0         ONLINE       0     0     0
          raidz2-5          DEGRADED     0     0     0
            c4t35d0         ONLINE       0     0     0
            c4t36d0         ONLINE       0     0     0
            replacing-2     DEGRADED     0     0     0
              c4t37d0/old   OFFLINE      0     0     0
              c4t37d0       ONLINE       0     0     0  (resilvering)
            c4t38d0         ONLINE       0     0     0
            c4t39d0         ONLINE       0     0     0
            c4t40d0         ONLINE       0     0     0
            c4t41d0         ONLINE       0     0     0
          raidz2-6          DEGRADED     0     0     0
            c4t42d0         ONLINE       0     0     0
            spare-1         DEGRADED     0     0     0
              replacing-0   DEGRADED     0     0     0
                c4t43d0/old FAULTED      0     0     0  corrupted data
                c4t43d0     ONLINE       0     0     0  (resilvering)
              c8t34d0       ONLINE       0     0     0
            c4t44d0         ONLINE       0     0     0
            c8t2d0          ONLINE       0     0     0
            c8t3d0          ONLINE       0     0     0
            c8t4d0          ONLINE       0     0     0
            c8t5d0          ONLINE       0     0     0
          raidz2-7          ONLINE       0     0     0
            c8t6d0          ONLINE       0     0     0
            c8t7d0          ONLINE       0     0     0
            c8t8d0          ONLINE       0     0     0
            c8t9d0          ONLINE       0     0     0
            c8t10d0         ONLINE       0     0     0
            c8t11d0         ONLINE       0     0     0
            c8t12d0         ONLINE       0     0     0
          raidz2-8          ONLINE       0     0     0
            c8t13d0         ONLINE       0     0     0
            c8t14d0         ONLINE       0     0     0
            c8t15d0         ONLINE       0     0     0
            c8t16d0         ONLINE       0     0     0
            c8t17d0         ONLINE       0     0     0
            c8t18d0         ONLINE       0     0     0
            c8t19d0         ONLINE       0     0     0
          raidz2-9          ONLINE       0     0     0
            c8t20d0         ONLINE       0     0     0
            c8t21d0         ONLINE       0     0     0
            c8t22d0         ONLINE       0     0     0
            c8t23d0         ONLINE       0     0     0
            c8t24d0         ONLINE       0     0     0
            c8t25d0         ONLINE       0     0     0
            c8t26d0         ONLINE       0     0     0
          raidz2-10         ONLINE       0     0     0
            c8t27d0         ONLINE       0     0     0
            c8t28d0         ONLINE       0     0     0
            c8t29d0         ONLINE       0     0     0
            c8t30d0         ONLINE       0     0     0
            c8t31d0         ONLINE       0     0     0
            c8t32d0         ONLINE       0     0     0
            c8t33d0         ONLINE       0     0     0
        logs
          mirror-11         ONLINE       0     0     0
            c6d1            ONLINE       0     0     0
            c7d1            ONLINE       0     0     0
        cache
          c8t0d0            ONLINE       0     0     0
          c8t1d0            ONLINE       0     0     0
        spares
          c8t34d0           INUSE     currently in use
          c8t35d0           INUSE     currently in use

errors: No known data errors
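
With this many vdevs, the unhealthy rows are easy to miss by eye. The following is only a minimal Python sketch for filtering them out of saved `zpool status` text. The function name `unhealthy_devices`, the `VDEV_STATES` set, and the short `SAMPLE` excerpt are illustrative assumptions, not part of any ZFS tooling, and the row layout (NAME STATE READ WRITE CKSUM, optionally followed by a note) is assumed from the output above.

```python
# Assumed set of vdev states; taken from the output above plus REMOVED.
VDEV_STATES = {"ONLINE", "DEGRADED", "FAULTED", "OFFLINE", "UNAVAIL", "REMOVED"}


def unhealthy_devices(status_text):
    """Return (name, state, note) for each config row whose state is not ONLINE."""
    rows = []
    for line in status_text.splitlines():
        fields = line.split()
        # A config row is assumed to look like: NAME STATE READ WRITE CKSUM [note...]
        if len(fields) < 5 or fields[1] not in VDEV_STATES:
            continue
        # Skip anything whose three counters are not plain integers.
        if not all(f.isdigit() for f in fields[2:5]):
            continue
        name, state = fields[0], fields[1]
        if state != "ONLINE":
            rows.append((name, state, " ".join(fields[5:])))
    return rows


# Short excerpt of the output above, used only as sample input.
SAMPLE = """\
raidz2-4 DEGRADED 0 0 0
c4t31d0 ONLINE 0 0 0
c4t32d0 FAULTED 0 0 0 too many errors
c4t33d0 ONLINE 0 0 0
"""

if __name__ == "__main__":
    for name, state, note in unhealthy_devices(SAMPLE):
        print(name, state, note)
```

Because the sketch splits on whitespace, it does not depend on indentation, but it also loses the parent/child nesting of the vdevs. Rows such as the `spares` entries (state `INUSE`) are deliberately ignored.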
----------------------------------------
Bug #1197: Hang after resilver finished with mpt
https://www.illumos.org/issues/1197
Author: Roy Sigurd Karlsbakk
Status: New
Priority: Urgent
Assignee:
Category: driver - device drivers
Target version:
Difficulty: Medium
Tags: needs-triage
Hi all
I just had a machine finish a resilver after a drive (well, two actually) died. After the resilver finished, the Icinga (ex-Nagios) check told me the pool was healthy again, so all seemed fine. But about 15 minutes later, Icinga complained that the check had timed out, and the box was unavailable. From a remote console, I could see OpenIndiana spamming it with messages:
scsi: WARNING: /pci@0,0/pci8086,340e@7/pci1000,30a0@0... (mpt0):
Disconnected command timeout for Target 23
This looks familiar - I have seen something similar on other servers, also just after a resilver. The box is using LSI 3801 and 3081 controllers with the mpt driver. Current OS version is OpenIndiana b148.
It looks like this is the same bug I've hit earlier. I only became aware of the resilver connection when this happened on two different machines within two days (the other is 1700 km from here, and I don't have a remote console for it yet - long story).
Is there anything I can do to debug this? I ran 'zpool status' from the console, and it apparently hangs there and never returns.
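
When the console is flooded with these warnings, it can help to tally them per target to see whether a single drive is responsible. Below is a minimal Python sketch that assumes the message text matches the warning quoted above exactly. The function name `timeouts_per_target` and the `SAMPLE_LOG` lines are made up for illustration and are not real log data from this machine.

```python
import re
from collections import Counter

# Pattern assumed from the warning text quoted in this report.
TIMEOUT_RE = re.compile(r"Disconnected command timeout for Target (\d+)")


def timeouts_per_target(log_text):
    """Count 'Disconnected command timeout' warnings per SCSI target number."""
    return Counter(int(m.group(1)) for m in TIMEOUT_RE.finditer(log_text))


# Hypothetical log excerpt, for illustration only.
SAMPLE_LOG = """\
Disconnected command timeout for Target 23
Disconnected command timeout for Target 23
Disconnected command timeout for Target 32
"""

if __name__ == "__main__":
    for target, count in timeouts_per_target(SAMPLE_LOG).most_common():
        print(f"target {target}: {count} timeouts")
```

Mapping a target number to a device name (for example Target 23 to c4t23d0) is an inference based on the device naming on this box, not something the sketch verifies.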
Thank you for any help on this one!
roy