[illumos-Developer] NFS hang during copy

Sun Mar 20 11:21:39 PDT 2011

Sounds bad.  I'd like to see a thread list with stack backtraces from
your server.

	- Garrett

On Sun, 2011-03-20 at 15:49 +0100, Roy Sigurd Karlsbakk wrote:
> Hi all
> 
> I'm fighting a problem with an OpenIndiana 148 server and NFSv3 mounts from Linux clients. A simple cron job moves some data files from another server to the OI box. This runs well for a while, until at some point the client hangs and reports an NFS server connection failure. The call trace from Linux is:
> 
> [  484.712558] INFO: task mv:2353 blocked for more than 120 seconds.
> [  484.712562] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [  484.712566] mv            D 0000000100000a8b     0  2353   2352 0x00000001
> [  484.712573]  ffff880234b75ba8 0000000000000086 ffff880234b75b18 0000000000015980
> [  484.712579]  ffff880234b75fd8 0000000000015980 ffff880234b75fd8 ffff8802349896e0
> [  484.712584]  0000000000015980 0000000000015980 ffff880234b75fd8 0000000000015980
> [  484.712589] Call Trace:
> [  484.712599]  [<ffffffff81100d60>] ? sync_page+0x0/0x50
> [  484.712606]  [<ffffffff8159e053>] io_schedule+0x73/0xc0
> [  484.712610]  [<ffffffff81100d9d>] sync_page+0x3d/0x50
> [  484.712614]  [<ffffffff8159e6cf>] __wait_on_bit+0x5f/0x90
> [  484.712618]  [<ffffffff81100f53>] wait_on_page_bit+0x73/0x80
> [  484.712623]  [<ffffffff8107f250>] ? wake_bit_function+0x0/0x40
> [  484.712628]  [<ffffffff8110b975>] ? pagevec_lookup_tag+0x25/0x40
> [  484.712632]  [<ffffffff8110141d>] filemap_fdatawait_range+0x10d/0x1a0
> [  484.712637]  [<ffffffff811014db>] filemap_fdatawait+0x2b/0x30
> [  484.712640]  [<ffffffff811017e4>] filemap_write_and_wait+0x44/0x50
> [  484.712660]  [<ffffffffa038dfcc>] nfs_setattr+0x14c/0x160 [nfs]
> [  484.712666]  [<ffffffff8116c55b>] notify_change+0x16b/0x310
> [  484.712671]  [<ffffffff8117b15c>] utimes_common+0xdc/0x1b0
> [  484.712675]  [<ffffffff8117b2d1>] do_utimes+0xa1/0xf0
> [  484.712678]  [<ffffffff8117b3e3>] sys_utimensat+0x33/0x90
> [  484.712684]  [<ffffffff8100a307>] tracesys+0xd9/0xde
> 
> When I strace the mv job on the client, it hangs in utimensat():
> 
> read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576) = 1048576
> write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576) = 1048576
> read(3, "\0\0\6q7\17\\\30\3L\342\0\277\2\16\355!\33\362\366\22\201\223\1h\201\16\355\22\n\227\340"..., 1048576) = 848404
> write(4, "\0\0\6q7\17\\\30\3L\342\0\277\2\16\355!\33\362\366\22\201\223\1h\201\16\355\22\n\227\340"..., 848404) = 848404
> read(3, "", 1048576)                    = 0
> utimensat(4, NULL, {{1300591624, 0}, {1300508167, 0}}, 0
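
The syscall pattern in the strace output above (1 MiB reads and writes, then a timestamp update on the destination file descriptor) could be reproduced outside of mv with something like the sketch below. This is only a hypothetical reproducer, not something from the original report: the function name and the paths in the usage note are made up, and it assumes that Python's os.utime() on a file descriptor ends up as a utimensat()/futimens() call on the Linux client.

```python
import os
import time

CHUNK = 1024 * 1024  # 1 MiB, matching the read/write size seen in the strace output


def copy_and_stamp(src, dst, atime, mtime):
    """Copy src to dst in 1 MiB chunks, then set dst's timestamps via its fd.

    Returns the number of seconds spent in the timestamp call, which is
    the step where the mv process in the report appears to block.
    """
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        while True:
            buf = fin.read(CHUNK)
            if not buf:
                break
            fout.write(buf)
        fout.flush()  # hand buffered data to the kernel before stamping
        start = time.monotonic()
        # Timestamp update on the open descriptor (hypothetical stand-in
        # for the utimensat(4, NULL, ...) call issued by mv).
        os.utime(fout.fileno(), (atime, mtime))
        return time.monotonic() - start
```

It might be called as copy_and_stamp("/tmp/testfile", "/mnt/nfs/testfile", 1300591624, 1300508167), where both paths are placeholders for a local file and a file on the NFS mount. If the hang reproduces, the call never returns and the process should show up in the same blocked state as mv.
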
> 
> This server has been working well for well over a year, and it normally still works well, but in this case we see repeated hangs. The affected clients hang with 100% "wio" on one core, and the only temporary fix I've found is to reboot the client. I can't find anything in the server logs, but since the problem occurs on both an elderly Fedora box and an updated Ubuntu 10.04.2 machine, and the setup had been working well for quite some time, I guess the upgrade to OI may be to blame.
> 
> Does anyone know how I can debug this further?
> 
> Vennlige hilsener / Best regards
> 
> roy
> --
> Roy Sigurd Karlsbakk
> (+47) 97542685
> roy at karlsbakk.net
> http://blogg.karlsbakk.net/
> --
> In all pedagogy it is essential that the curriculum be presented intelligibly. It is an elementary imperative for all pedagogues to avoid excessive use of idioms of foreign origin. In most cases adequate and relevant synonyms exist in Norwegian.
> 