[illumos-Developer] missing lwp_exit() in kcfpool_svc()

Garrett D'Amore garrett at damore.org
Thu Mar 31 23:04:01 PDT 2011


  On 03/31/11 10:39 PM, Bryan Cantrill wrote:
> All,
>
> One of our engineers recently saw a spate of panics in prchoose():
>
>    >  $c
>    prchoose+0x72(ffffff0d3761e008)
>    prgetpsinfo32+0x2b(ffffff0d3761e008, ffffff006a372b00)
>    pr_read_psinfo_32+0x4e(ffffff0d420b8640, ffffff006a372e20)
>    prread+0x5c(ffffff0d420b4c80, ffffff006a372e20, 0, ffffff0d52489480, 0)
>    fop_read+0xc9(ffffff0d420b4c80, ffffff006a372e20, 0, ffffff0d52489480, 0)
>    read+0x2b8(4, 8047af0, 150)
>    read32+0x22(4, 8047af0, 150)
>    _sys_sysenter_post_swapgs+0x149()
>
> We seem to be dying on a stale p_tlist, which should generally be
> impossible. ;)  Interestingly, the proc_t in question is always
> kcfpoold:
>
>    >  ffffff0d3761e008::ps
>    S    PID   PPID   PGID    SID    UID      FLAGS             ADDR NAME
>    R      4      0      0      0      0 0x00020001 ffffff0d3761e008 kcfpoold
>
> And indeed, kcfpool and the kcfpoold proc_t have wildly divergent
> ideas of how many threads are associated with kcfpoold:
>
>    >  *kcfpool::print kcf_pool_t kp_threads
>    kp_threads = 0x1
>    >  ::pgrep kcfpoold | ::print proc_t p_lwpcnt
>    p_lwpcnt = 0x11
>
> The problem appears to be in kcfpool_svc(), which is what the
> in-kernel (i.e., synthetic) kcfpoold sets its LWPs to run:  this
> routine simply returns when the size of the thread pool exceeds
> available work -- but it in fact needs to grab its own p_lock and call
> lwp_exit(), lest the LWP state associated with the process become
> stale.  Garrett, would you like me to get an illumos issue open on
> this?
>
> The fix seems straightforward, but I would obviously like to
> thoroughly test the code path; what is the easiest way to induce this?
> Beyond testing the fix, understanding how one induces work in KCF
> would also help address why we haven't seen this more broadly -- or
> have others seen this?


Please do file a bug in illumos; you are welcome to set either yourself
or me as the responsible engineer.  As I'm the guy who hacked that code
together to get rid of the userland process, it's my fault.  :-/  I'm
not surprised I screwed this up, actually; it's the first time I've had
to do anything with LWPs.

Kudos to you or your engineer(s) for figuring this out.  Reading the 
code, I concur with your analysis.
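
For anyone following along, here is a minimal userland sketch of the
accounting problem described above.  It is not the illumos code: the
model_* names, both structs, and the bool standing in for p_lock are
hypothetical, and the thread counts are chosen only to echo the mdb
output in the quoted mail.  It assumes the pool count drops whenever a
worker leaves, and shows that the process-side count only follows if
the worker takes p_lock and goes through lwp_exit().

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-ins; NOT the illumos proc_t/kcf_pool_t. */
typedef struct model_proc {
	bool p_lock_held;	/* models holding p->p_lock */
	int p_lwpcnt;		/* LWPs the process believes it has */
} model_proc_t;

typedef struct model_pool {
	int kp_threads;		/* threads the pool believes it has */
} model_pool_t;

/* Models creating one pool LWP: both counters go up together. */
static void
model_thread_create(model_proc_t *p, model_pool_t *kp)
{
	p->p_lwpcnt++;
	kp->kp_threads++;
}

/* Models lwp_exit(): entered with p_lock held; retires the LWP, drops the lock. */
static void
model_lwp_exit(model_proc_t *p)
{
	assert(p->p_lock_held);
	p->p_lwpcnt--;
	p->p_lock_held = false;
}

/* Models an idle worker leaving the pool when threads exceed work. */
static void
model_worker_leave(model_proc_t *p, model_pool_t *kp, bool call_lwp_exit)
{
	kp->kp_threads--;
	if (!call_lwp_exit)
		return;		/* reported bug: process state is never updated */
	p->p_lock_held = true;	/* models mutex_enter(&p->p_lock) */
	model_lwp_exit(p);
}

/* Returns p_lwpcnt - kp_threads after 'created' workers start and 'left' leave. */
int
model_divergence(int created, int left, bool call_lwp_exit)
{
	model_proc_t p = { false, 0 };
	model_pool_t kp = { 0 };
	int i;

	for (i = 0; i < created; i++)
		model_thread_create(&p, &kp);
	for (i = 0; i < left; i++)
		model_worker_leave(&p, &kp, call_lwp_exit);
	return (p.p_lwpcnt - kp.kp_threads);
}
```

With 17 workers created and 16 leaving, the plain-return variant ends
with kp_threads at 1 and p_lwpcnt still at 17 (0x11), while the
lwp_exit() variant leaves both at 1.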

Generating lots of KCF jobs can be tricky.  If you have a hardware
crypto card, you can do it with OpenSSL.  You can also trigger this
generically with IPsec.  Doing a lot of work with /dev/random *might*
trigger it, but I'd have to look at the code to be sure.

     - Garrett




>          - Bryan
>



