[illumos-Developer] missing lwp_exit() in kcfpool_svc()
Bryan Cantrill
bryancantrill at gmail.com
Thu Mar 31 22:39:18 PDT 2011
All,
One of our engineers recently saw a spate of panics in prchoose():
> $c
prchoose+0x72(ffffff0d3761e008)
prgetpsinfo32+0x2b(ffffff0d3761e008, ffffff006a372b00)
pr_read_psinfo_32+0x4e(ffffff0d420b8640, ffffff006a372e20)
prread+0x5c(ffffff0d420b4c80, ffffff006a372e20, 0, ffffff0d52489480, 0)
fop_read+0xc9(ffffff0d420b4c80, ffffff006a372e20, 0, ffffff0d52489480, 0)
read+0x2b8(4, 8047af0, 150)
read32+0x22(4, 8047af0, 150)
_sys_sysenter_post_swapgs+0x149()
We seem to be dying on a stale p_tlist, which should generally be
impossible. ;) Interestingly, the proc_t in question is always
kcfpoold:
> ffffff0d3761e008::ps
S    PID   PPID   PGID    SID    UID      FLAGS             ADDR NAME
R      4      0      0      0      0 0x00020001 ffffff0d3761e008 kcfpoold
And indeed, kcfpool and the kcfpoold proc_t have wildly divergent
ideas of how many threads are associated with kcfpoold:
> *kcfpool::print kcf_pool_t kp_threads
kp_threads = 0x1
> ::pgrep kcfpoold | ::print proc_t p_lwpcnt
p_lwpcnt = 0x11
The problem appears to be in kcfpool_svc(), which is what the
in-kernel (i.e., synthetic) kcfpoold sets its LWPs to run: this
routine simply returns when the size of the thread pool exceeds
available work -- but it in fact needs to grab its own p_lock and call
lwp_exit(), lest the LWP state associated with the process become
stale. Garrett, would you like me to get an illumos issue open on
this?
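
To illustrate the accounting mismatch described above, here is a minimal user-space sketch in C. It is a toy model, not kernel code: the toy_* types and functions are hypothetical stand-ins, and only the field names kp_threads and p_lwpcnt and the idea of calling lwp_exit() come from the message. It assumes, for illustration only, that each worker adds one to both counters when created, and that a worker which merely returns decrements only the pool's count.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-ins (hypothetical); not the real kcf_pool_t or proc_t. */
typedef struct toy_pool { int kp_threads; } toy_pool_t;
typedef struct toy_proc { int p_lwpcnt; } toy_proc_t;

/* Creating a worker: both the pool and the process account for it. */
static void
toy_worker_create(toy_pool_t *kp, toy_proc_t *p)
{
	kp->kp_threads++;
	p->p_lwpcnt++;
}

/*
 * Stand-in for lwp_exit(): drops the LWP from the process's accounting.
 * Per the message, the real call must be made with p_lock held; this
 * single-threaded model has no locking.
 */
static void
toy_lwp_exit(toy_proc_t *p)
{
	assert(p->p_lwpcnt > 0);
	p->p_lwpcnt--;
}

/*
 * A worker leaving the pool because the pool exceeds available work.
 * With call_lwp_exit == false the worker just returns (the reported
 * behavior); with true it also exits its LWP (the proposed fix).
 */
static void
toy_worker_leave(toy_pool_t *kp, toy_proc_t *p, bool call_lwp_exit)
{
	assert(kp->kp_threads > 0);
	kp->kp_threads--;
	if (call_lwp_exit)
		toy_lwp_exit(p);
}

/*
 * Create nworkers, shrink the pool down to keep, and return how far
 * the process's LWP count has drifted from the pool's thread count.
 */
static int
toy_divergence(int nworkers, int keep, bool call_lwp_exit)
{
	toy_pool_t kp = { 0 };
	toy_proc_t p = { 0 };

	for (int i = 0; i < nworkers; i++)
		toy_worker_create(&kp, &p);
	while (kp.kp_threads > keep)
		toy_worker_leave(&kp, &p, call_lwp_exit);

	return (p.p_lwpcnt - kp.kp_threads);
}
```

In this model, toy_divergence(17, 1, false) leaves the two counts 16 apart, which mirrors the kp_threads = 0x1 versus p_lwpcnt = 0x11 gap in the dump, while toy_divergence(17, 1, true) leaves them equal. Whether the real counters moved in exactly this way is an assumption of the sketch.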
The fix seems straightforward, but I would obviously like to
thoroughly test the code path; what is the easiest way to induce this?
Beyond testing the fix, understanding how one induces work in KCF
would also help address why we haven't seen this more broadly -- or
have others seen this?
- Bryan