[bugs] [OpenIndiana Distribution - Bug #841] OI_148a locks up on any problems with ZFS pools

Thu Jul 7 00:00:51 PDT 2011

Issue #841 has been updated by Jim Klimov.

Hello, thanks for looking into this ;)
Alas, with almost 4 months of ZFS adventures on this box behind me, I can't responsibly say exactly what was configured back then.
Currently, and perhaps initially, pool (6-spindle) and rpool have failmode=continue, and dcpool (iSCSI volume) has failmode=wait.
The USB sticks have been out of the equation for several months as well.

Even if failures of "pool" or "dcpool" lock up their IOs by design, this should not strip the OS of any semblance of being alive (X11, SSH, reboot, etc.) - and the rpool is a separate pool on a separate disk.
----------------------------------------
Bug #841: OI_148a locks up on any problems with ZFS pools
https://www.illumos.org/issues/841

Author: Jim Klimov
Status: New
Priority: Normal
Assignee: 
Category: 
Target version: 
Difficulty: Medium
Tags: needs-triage

I have a 6-spindle raidz2 pool named "pool" on commodity hardware. Over that pool I've made a zvol which is exported and re-imported as an iSCSI device, and over that iSCSI device the same system creates another ZFS pool "dcpool".

Occasionally the system times out on some accesses to either pool (at least once due to a flaky SATA connector) and locks up completely - the X11 session is unresponsive, I can't SSH into the machine, and it does not respond to the power button, so I have to reset it.

I expected the ZFS pool to continue working properly (albeit slowly) in such a case and perhaps trigger a resilver - even if one SATA connection is lost, the raidz2 pool without one disk still has an extra disk's worth of parity data.
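
To spell out the arithmetic behind that expectation, here is a minimal sketch in Python. It is only a counting model, not ZFS code: the function name raidz_status and the labels "healthy", "degraded" and "lost" are made up for illustration, and it assumes a single raidz vdev where only the number of missing disks matters.

```python
def raidz_status(total_disks, parity, failed):
    """Return (label, further_failures_tolerated) for one raidz vdev.

    Hypothetical helper: it only counts disks, it does not model ZFS itself.
    """
    if not 0 <= failed <= total_disks:
        raise ValueError("failed must be between 0 and total_disks")
    if failed > parity:
        # More disks missing than parity can cover: data cannot be rebuilt.
        return ("lost", 0)
    label = "healthy" if failed == 0 else "degraded"
    return (label, parity - failed)


# The 6-spindle raidz2 "pool" with one SATA connection lost:
print(raidz_status(total_disks=6, parity=2, failed=1))  # ('degraded', 1)
```

So with one disk gone, the vdev should still be able to lose one more disk before data becomes unrecoverable, which is why a complete lockup was unexpected.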

Also on some timeouts (i.e. the system is too busy processing data) the "dcpool"'s iSCSI localhost connection gets dropped, so the underlying device becomes UNAVAIL. This also causes a system-wide lockup. I kinda hoped that the system would just fail the outstanding IOs (like when a USB flash drive is yanked out) or wait for the device (iSCSI connection) to come back (like on an NFS server reboot).
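
Those two hoped-for behaviours correspond roughly to the values of the ZFS "failmode" pool property as documented in zpool(1M): wait, continue and panic. The sketch below is a toy model of that documented behaviour for a new I/O request arriving after the pool has failed. The names FailMode and on_new_io and the returned strings are hypothetical, and it says nothing about how the kernel actually implements this.

```python
from enum import Enum


class FailMode(Enum):
    WAIT = "wait"          # block I/O until the device comes back
    CONTINUE = "continue"  # EIO for new writes, reads still attempted
    PANIC = "panic"        # console message and crash dump


def on_new_io(mode, op):
    """Toy model of a new request ('read' or 'write') on a failed pool."""
    if op not in ("read", "write"):
        raise ValueError("op must be 'read' or 'write'")
    if mode is FailMode.WAIT:
        return "blocked"
    if mode is FailMode.CONTINUE:
        return "EIO" if op == "write" else "read-from-healthy-devices"
    return "panic"


# Per-pool settings are assumptions for this example only.
settings = {"pool": FailMode.CONTINUE, "dcpool": FailMode.WAIT}
for name, mode in settings.items():
    print(name, on_new_io(mode, "write"))
```

In neither of the first two modes does the documented behaviour call for the whole OS to stop responding; at worst, I/O to the affected pool should block or fail.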

Instead, this system can hang for hours until it is reset. Which makes me nervous, at least. And it's not as reliable as I'd hoped.

Some details regarding the "dcpool":

To be precise, the zvol is compressed in "pool" and for all datasets in "dcpool" I have enabled deduplication, so the system goes through write cycles faster. Instead of 'compress+dedup+dispose of common blocks' it can go like 'dedup+dispose of common blocks+compress unique blocks'.

The iSCSI loopback instead of making "dcpool" directly in the zvol was suggested by Darren Moffat:
http://blogs.sun.com/darren/entry/compress_encrypt_checksum_deduplicate_with

When the system boots up, the "dcpool" is deemed "UNAVAIL" (it was imported and remains cached via zpool.cache). However, after the startup of "iscsi/target" and "iscsi/initiator" it is automatically found, imported and mounted by the time the X11 session logs in.

I had to tweak the SMF service properties - initiator now depends on target, so that they start in proper order for this setup.
