[illumos-Developer] CODE REVIEW -> Illumos 915, compliments of Joyent

Bryan Cantrill bryancantrill at gmail.com
Thu Apr 21 17:58:15 PDT 2011


Obviously looks good to me. ;)  Thanks for picking this one up.  Has
anyone else seen this?  We were getting crushed by it in our VMware
environment, and it has a somewhat distant failure mode.  For the
record, here's my analysis from our internal bug tracking system:

---8<---

We died in cv_wait() called from ill_taskq_dispatch() because our lock
was unheld.  The code in ill_taskq_dispatch(), however, looks airtight --
and the nature of the mutex code makes it impossible for another thread
to have errantly handled the lock.

Some relevant facts (from the dump from January 14th at 17:30:30):

- ill_taskq_dispatch is straight-line code from function entry to the
  call to mutex_enter at ill_taskq_dispatch+0x7e

- The call to mutex_enter uses %r13 to load %rdi

- In ill_taskq_dispatch, %r13 is saved and then loaded before any branches
  (at ill_taskq_dispatch+0x58), and not altered until it is restored at
  ill_taskq_dispatch+0x1bb.

- %r13 from ill_taskq_dispatch is saved at cv_wait+0x12, and is
  0xffffff016bfdcf58

- This value corresponds to the value on the stack as part of the
  mutex_vector_exit() (as expected).

- ill_taskq_dispatch loaded 0xffffff016bfdcf58 into %r13 and called
  mutex_enter on this mutex at ill_taskq_dispatch+0x7e.

- Between the call to mutex_exit() at ill_taskq_dispatch+0xb6 and the
  call to mutex_enter() at ill_taskq_dispatch+0xe5, there is only one
  potential transfer of control in the function:  to call assfail()
  from the basic block at ill_taskq_dispatch+0x1c2.  If this basic block
  were executed, a panic would be induced -- but even if this logic
  were somehow flawed, control would still return to ill_taskq_dispatch+0xe2
  to call mutex_enter() at ill_taskq_dispatch+0xe5.

We therefore know that the current (panicking) thread called mutex_enter()
at ill_taskq_dispatch+0x7e.  If it did indeed drop the lock at
ill_taskq_dispatch+0xb6, there is no path by which the lock is not regained
at ill_taskq_dispatch+0xe5.  We further know that if any other thread
attempted to (spuriously) drop the lock, the system would immediately panic.

With all of this known, very few possibilities remain:

- The owner field was not properly recorded on the call to mutex_enter()
  ("can't happen").

- Control was transferred in such a way that the mutex_enter() at
  ill_taskq_dispatch+0xe5 never happened ("can't happen").

- The lock was not correctly regained in the cv_wait() ("can't happen").

- The lock field was zeroed by some other thread operating on the data
  structure.

Given that all but the last of these are "can't happen" conditions that
seem extraordinarily unlikely to be seen in the same body of code across
multiple builds (as has been the case with this panic), the last must be
accepted as the only real possibility.

With this new disposition, the manipulation of ips_capab_taskq_lock elsewhere
was examined in greater detail -- quickly revealing the source of the
problem in ip_stack_init():

{code}
...
        /*
         * Create the taskq dispatcher thread and initialize related stuff.
         */
        ipst->ips_capab_taskq_thread = thread_create(NULL, 0,
            ill_taskq_dispatch, ipst, 0, &p0, TS_RUN, minclsyspri);
        mutex_init(&ipst->ips_capab_taskq_lock, NULL, MUTEX_DEFAULT, NULL);
        cv_init(&ipst->ips_capab_taskq_cv, NULL, CV_DEFAULT, NULL);
...
{code}

The bug is that this code calls mutex_init() on the lock manipulated by
ill_taskq_dispatch() _after_ having created the thread that calls into
ill_taskq_dispatch()!  This potentially (and silently) zeroes the held
lock, which can lead to all sorts of nastiness -- notably the panic seen
here.  Sure enough, looking at the dump, there is another thread (running
on the other virtual CPU) that was recently in this code, and the t_start
of the panicking thread is the current time (that is, the thread was
just created, as we would expect if this were the bug).

Fortunately, the fix is simple:  the mutex_init() and cv_init() calls
should come before the thread_create(), not after it.

---8<---

That should flesh out the rationale for the fix -- with my apologies
for not having pushed this upstream myself earlier...

        - Bryan


