[illumos-Developer] missing lwp_exit() in kcfpool_svc()

Thu Mar 31 22:39:18 PDT 2011

All,

One of our engineers recently saw a spate of panics in prchoose():

  > $c
  prchoose+0x72(ffffff0d3761e008)
  prgetpsinfo32+0x2b(ffffff0d3761e008, ffffff006a372b00)
  pr_read_psinfo_32+0x4e(ffffff0d420b8640, ffffff006a372e20)
  prread+0x5c(ffffff0d420b4c80, ffffff006a372e20, 0, ffffff0d52489480, 0)
  fop_read+0xc9(ffffff0d420b4c80, ffffff006a372e20, 0, ffffff0d52489480, 0)
  read+0x2b8(4, 8047af0, 150)
  read32+0x22(4, 8047af0, 150)
  _sys_sysenter_post_swapgs+0x149()

We seem to be dying on a stale p_tlist, which should generally be
impossible. ;)  Interestingly, the proc_t in question is always
kcfpoold:

  > ffffff0d3761e008::ps
  S    PID   PPID   PGID    SID    UID      FLAGS             ADDR NAME
  R      4      0      0      0      0 0x00020001 ffffff0d3761e008 kcfpoold

And indeed, kcfpool and the kcfpoold proc_t have wildly divergent
ideas of how many threads are associated with kcfpoold:

  > *kcfpool::print kcf_pool_t kp_threads
  kp_threads = 0x1
  > ::pgrep kcfpoold | ::print proc_t p_lwpcnt
  p_lwpcnt = 0x11

The problem appears to be in kcfpool_svc(), the routine that the
in-kernel (i.e., synthetic) kcfpoold sets its LWPs to run:  this
routine simply returns when the size of the thread pool exceeds
available work -- but it in fact needs to grab the process's p_lock
and call lwp_exit(), lest the LWP state associated with the process
become stale.  Garrett, would you like me to get an illumos issue open
on this?
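
To make the accounting concrete, here is a minimal user-space sketch
of the mismatch.  It is a toy model, not the kernel code: the model_*
types and functions are hypothetical stand-ins, and only the field
names p_lwpcnt and kp_threads come from the mdb output above.  It
assumes each worker bumps both counters on creation and that a worker
which merely returns decrements only the pool's count.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for proc_t and kcf_pool_t (illustration only). */
typedef struct model_proc {
	int p_lwpcnt;		/* LWPs the process believes it has */
} model_proc_t;

typedef struct model_pool {
	int kp_threads;		/* threads the pool believes it has */
} model_pool_t;

/* Creating a worker raises both counts together. */
static void
model_worker_create(model_proc_t *p, model_pool_t *kp)
{
	p->p_lwpcnt++;
	kp->kp_threads++;
}

/*
 * Retiring a worker always lowers the pool's count.  The process's
 * count is lowered only when the worker goes through the modeled
 * lwp_exit() path; a plain return leaves it untouched.
 */
static void
model_worker_retire(model_proc_t *p, model_pool_t *kp, bool call_lwp_exit)
{
	kp->kp_threads--;
	if (call_lwp_exit)
		p->p_lwpcnt--;
}

/* Returns p_lwpcnt - kp_threads after the given creates and retires. */
int
model_divergence(int created, int retired, bool call_lwp_exit)
{
	model_proc_t p = { 0 };
	model_pool_t kp = { 0 };
	int i;

	for (i = 0; i < created; i++)
		model_worker_create(&p, &kp);
	for (i = 0; i < retired; i++)
		model_worker_retire(&p, &kp, call_lwp_exit);

	return (p.p_lwpcnt - kp.kp_threads);
}
```

With 17 workers created and 16 retired by a plain return, the model
ends with kp_threads at 1 and p_lwpcnt at 17 (0x11), which is
consistent with the counts above; with the modeled lwp_exit() path the
two agree.  The actual sequence of creates and exits on the panicking
system is not known, so the numbers are illustrative only.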

The fix seems straightforward, but I would obviously like to
thoroughly test the code path; what is the easiest way to induce this?
Beyond testing the fix, understanding how one induces work in KCF
would also help address why we haven't seen this more broadly -- or
have others seen this?

        - Bryan