[bugs] [illumos gate - Bug #1197] Hang after resilver finished with mpt

illumos bugs bugs at lists.illumos.org
Mon Jul 11 12:47:05 PDT 2011


Issue #1197 has been updated by Roy Sigurd Karlsbakk.


Just rebooted the box; I couldn't get anything more out of it anyway. After the reboot, it shows that a few more drives have died, and the drives that had finished resilvering (c4t37d0 and c4t43d0) are now resilvering again. The full zpool status output is below. After the resilver, no issues were reported, yet the pool/machine still hung, probably because of a problem with c4t23d0, which is now marked as faulted. Note that I have seen this and other OI machines with the same controllers kick out drives that were later shown to be OK, even after thorough testing.

roy

rsk@prv-backup:~$ zpool status pbpool
  pool: pbpool
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
 scan: resilver in progress since Mon Jul 11 21:37:12 2011
    763G scanned out of 37.6T at 2.60G/s, 4h1m to go
    29.8G resilvered, 1.99% done
config:

        NAME                 STATE     READ WRITE CKSUM
        pbpool               DEGRADED     0     0     0
          raidz2-0           ONLINE       0     0     0
            c4t0d0           ONLINE       0     0     0
            c4t1d0           ONLINE       0     0     0
            c4t2d0           ONLINE       0     0     0
            c4t3d0           ONLINE       0     0     0
            c4t4d0           ONLINE       0     0     0
            c4t5d0           ONLINE       0     0     0
            c4t6d0           ONLINE       0     0     0
          raidz2-1           ONLINE       0     0     0
            c4t7d0           ONLINE       0     0     0
            c4t8d0           ONLINE       0     0     0
            c4t9d0           ONLINE       0     0     0
            c4t10d0          ONLINE       0     0     0
            c4t11d0          ONLINE       0     0     0
            c4t12d0          ONLINE       0     0     0
            c4t13d0          ONLINE       0     0     0
          raidz2-2           ONLINE       0     0     0
            c4t14d0          ONLINE       0     0     0
            c4t15d0          ONLINE       0     0     0
            c4t16d0          ONLINE       0     0     0
            c4t17d0          ONLINE       0     0     0
            c4t18d0          ONLINE       0     0     0
            c4t19d0          ONLINE       0     0     0
            c4t20d0          ONLINE       0     0     0
          raidz2-3           DEGRADED     0     0     0
            c4t21d0          ONLINE       0     0     0
            c4t22d0          ONLINE       0     0     0
            spare-2          UNAVAIL      0     0     0
              c4t23d0        FAULTED      0     0     0  corrupted data
              c8t35d0        ONLINE       0     0     0  (resilvering)
            c4t24d0          ONLINE       0     0     0
            c4t25d0          ONLINE       0     0     0
            c4t26d0          ONLINE       0     0     0
            c4t27d0          ONLINE       0     0     0
          raidz2-4           DEGRADED     0     0     0
            c4t28d0          ONLINE       0     0     0
            c4t29d0          ONLINE       0     0     0
            c4t30d0          ONLINE       0     0     0
            c4t31d0          ONLINE       0     0     0
            c4t32d0          FAULTED      0     0     0  too many errors
            c4t33d0          ONLINE       0     0     0
            c4t34d0          ONLINE       0     0     0
          raidz2-5           DEGRADED     0     0     0
            c4t35d0          ONLINE       0     0     0
            c4t36d0          ONLINE       0     0     0
            replacing-2      DEGRADED     0     0     0
              c4t37d0/old    OFFLINE      0     0     0
              c4t37d0        ONLINE       0     0     0  (resilvering)
            c4t38d0          ONLINE       0     0     0
            c4t39d0          ONLINE       0     0     0
            c4t40d0          ONLINE       0     0     0
            c4t41d0          ONLINE       0     0     0
          raidz2-6           DEGRADED     0     0     0
            c4t42d0          ONLINE       0     0     0
            spare-1          DEGRADED     0     0     0
              replacing-0    DEGRADED     0     0     0
                c4t43d0/old  FAULTED      0     0     0  corrupted data
                c4t43d0      ONLINE       0     0     0  (resilvering)
              c8t34d0        ONLINE       0     0     0
            c4t44d0          ONLINE       0     0     0
            c8t2d0           ONLINE       0     0     0
            c8t3d0           ONLINE       0     0     0
            c8t4d0           ONLINE       0     0     0
            c8t5d0           ONLINE       0     0     0
          raidz2-7           ONLINE       0     0     0
            c8t6d0           ONLINE       0     0     0
            c8t7d0           ONLINE       0     0     0
            c8t8d0           ONLINE       0     0     0
            c8t9d0           ONLINE       0     0     0
            c8t10d0          ONLINE       0     0     0
            c8t11d0          ONLINE       0     0     0
            c8t12d0          ONLINE       0     0     0
          raidz2-8           ONLINE       0     0     0
            c8t13d0          ONLINE       0     0     0
            c8t14d0          ONLINE       0     0     0
            c8t15d0          ONLINE       0     0     0
            c8t16d0          ONLINE       0     0     0
            c8t17d0          ONLINE       0     0     0
            c8t18d0          ONLINE       0     0     0
            c8t19d0          ONLINE       0     0     0
          raidz2-9           ONLINE       0     0     0
            c8t20d0          ONLINE       0     0     0
            c8t21d0          ONLINE       0     0     0
            c8t22d0          ONLINE       0     0     0
            c8t23d0          ONLINE       0     0     0
            c8t24d0          ONLINE       0     0     0
            c8t25d0          ONLINE       0     0     0
            c8t26d0          ONLINE       0     0     0
          raidz2-10          ONLINE       0     0     0
            c8t27d0          ONLINE       0     0     0
            c8t28d0          ONLINE       0     0     0
            c8t29d0          ONLINE       0     0     0
            c8t30d0          ONLINE       0     0     0
            c8t31d0          ONLINE       0     0     0
            c8t32d0          ONLINE       0     0     0
            c8t33d0          ONLINE       0     0     0
        logs
          mirror-11          ONLINE       0     0     0
            c6d1             ONLINE       0     0     0
            c7d1             ONLINE       0     0     0
        cache
          c8t0d0             ONLINE       0     0     0
          c8t1d0             ONLINE       0     0     0
        spares
          c8t34d0            INUSE     currently in use
          c8t35d0            INUSE     currently in use

errors: No known data errors

----------------------------------------
Bug #1197: Hang after resilver finished with mpt
https://www.illumos.org/issues/1197

Author: Roy Sigurd Karlsbakk
Status: New
Priority: Urgent
Assignee: 
Category: driver - device drivers
Target version: 
Difficulty: Medium
Tags: needs-triage


Hi all

I just had a machine finish resilvering after a drive (well, two actually) died. After the resilver finished, the Icinga (formerly Nagios) check reported that the pool was healthy again, so all was fine. But about 15 minutes later, Icinga complained that the check had timed out, and the box was unavailable. From a remote console, I could see OpenIndiana spamming it with messages:

scsi: WARNING: /pci@0,0/pci8086,340e@7/pci1000,30a0@0... (mpt0):
Disconnected command timeout for Target 23

This looks familiar: I have seen similar messages on other servers, also just after a resilver. The box uses LSI 3801 and 3081 controllers with the mpt driver. The current OS version is OpenIndiana b148.

It looks like this is the same bug I've hit before. I only became aware of the link to resilvering when this happened on two different machines within two days (the other one is 1,700 km from here, and I don't have a remote console for it yet; long story).

Is there anything I can do to debug this? I ran 'zpool status' from the console, and it just hangs there without ever returning.

Thank you for any help on this one!

roy

