[illumos-Developer] CODE REVIEW -> Illumos 916, compliments of Joyent
Bryan Cantrill
bryancantrill at gmail.com
Thu Apr 21 18:03:37 PDT 2011
More flesh on this one: Brendan's analysis, Jerry's fix. Here is
Brendan's analysis, again for the record:
---8<---
Summary: I'm getting high contention on the tcps_bind_fanout hash
locks, which is hurting TCP client performance when attempting
thousands of connections per second. I'd suggest increasing the
number of hash locks (currently 512) and padding the hash lock
structure to avoid cache line thrashing.
The lock is used in tcp_bindi():
{code}
tf_t *tbf;
[...]
tbf = &tcps->tcps_bind_fanout[TCP_BIND_HASH(lport)];
{code}
tf_t is:
{code}
/*
* The TCP Fanout structure for bind and acceptor hashes.
* The hash tables and their linkage (tcp_*_hash, tcp_ptp*hn) are
* protected by the per-bucket tf_lock. Each tcp_t
* inserted in the list points back at this lock using tcp_*_lockp.
*
* The bind and acceptor hash queues are lists of tcp_t.
*/
/* listener hash and acceptor hash queue head */
typedef struct tf_s {
        tcp_t           *tf_tcp;
        kmutex_t        tf_lock;
} tf_t;
{code}
That structure is only 16 bytes, so four adjacent buckets share a
single 64-byte cache line. There are 512 of them:
{code}
#define TCP_BIND_FANOUT_SIZE 512
#define TCP_BIND_HASH(lport) (ntohs(lport) & (TCP_BIND_FANOUT_SIZE - 1))
{code}
ZFS's arc.c has some neat code for padding hash locks, e.g.:
{code}
#define HT_LOCK_PAD 64
struct ht_lock {
        kmutex_t        ht_lock;
#ifdef _KERNEL
        unsigned char   pad[(HT_LOCK_PAD - sizeof (kmutex_t))];
#endif
};
{code}
(TCP doesn't need the ifdefs).
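To make the suggestion concrete, here is a minimal sketch of what a
padded bind fanout bucket might look like. This is only an
illustration, not Jerry's actual change: the pad field name, the
64-byte cache line size, and the enlarged fanout value are assumptions.
{code}
/*
 * Illustrative sketch only.  Pad each bind fanout bucket out to a full
 * cache line so adjacent buckets no longer share a line, and
 * (optionally) enlarge the fanout itself.
 */
#define TCP_BIND_FANOUT_SIZE    2048    /* example value; currently 512 */

#define TF_CACHE_LINE_SIZE      64      /* assumed x86 cache line size */

/* listener hash and acceptor hash queue head, padded to a cache line */
typedef struct tf_s {
        tcp_t           *tf_tcp;
        kmutex_t        tf_lock;
        unsigned char   tf_pad[TF_CACHE_LINE_SIZE -
                            (sizeof (tcp_t *) + sizeof (kmutex_t))];
} tf_t;
{code}
TCP_BIND_HASH needs no change, since its mask is derived from
TCP_BIND_FANOUT_SIZE; and the padding alone removes the false sharing
between adjacent buckets even if the fanout stays at 512.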
...
I hit this when testing node.js connection performance. Here was the
connection rate (see the TCPaccept column):
{code}
/var/tmp/brendan]# ./tcpaccept.d
ListenDrop ListenDropQ0 TCPaccept TCPrefused SCAccept SCAF
0 0 2434 0 2785 353
0 0 27232 0 32550 5317
0 0 25262 0 31916 6721
0 0 17810 0 27054 9175
0 0 11078 0 23044 11967
0 0 6669 0 19185 12515
0 0 4256 0 17147 12892
0 0 3196 0 13347 10151
0 0 2646 0 10249 7602
0 0 1633 0 7732 6100
0 0 2825 0 9351 6551
0 0 26220 0 37283 11038
0 0 25146 0 32097 6951
0 0 17496 0 27529 10032
0 0 11285 0 20908 9624
0 0 7825 0 18819 10993
0 0 8019 0 17888 9870
0 0 5678 0 17726 12047
0 0 3872 0 15290 11418
0 0 2805 0 10752 7948
0 0 3402 0 11354 7953
0 0 25156 0 34692 9635
0 0 25376 0 31846 6370
0 0 17993 0 26556 8562
0 0 11100 0 21531 10433
0 0 7527 0 19100 11572
0 0 8392 0 17483 9091
0 0 6541 0 20777 14236
0 0 3326 0 14333 11007
0 0 3571 0 10253 6682
0 0 2892 0 11165 8276
0 0 23881 0 36870 13069
0 0 25153 0 32465 7232
0 0 18701 0 27845 9161
0 0 12086 0 23565 11459
0 0 6478 0 19332 12854
0 0 9615 0 23797 14182
{code}
The variability is large - between 1633 and 27232 accepts per second -
but the cause was not obviously on the server. The client, however,
showed the following mpstat output when the rate was high (good):
{code}
# mpstat 1
[...]
CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
0 0 0 0 453 102 15850 60 369 8172 0 885 1 41 0 58
1 15 0 0 14 0 9525 7 783 2070 5 10743 2 12 0 86
2 7 0 0 96 30 7984 1 512 1853 3 9128 1 9 0 90
3 6 0 0 18 2 7536 1 520 1734 7 8495 1 10 0 89
4 10 0 0 34 0 4056 0 170 906 3 4619 1 4 0 95
5 6 0 0 20 0 3198 1 164 709 1 3707 1 5 0 94
6 3 0 0 87 38 5039 0 254 1105 5 5754 1 7 0 92
7 5 0 0 15 2 5906 2 361 1275 1 7132 1 8 0 91
8 3 0 0 247 0 18452 137 136 1040 2 3121 0 6 0 94
9 9 0 0 15 0 3412 4 161 742 0 3940 0 4 0 96
10 11 0 0 15 4 5647 2 273 1318 3 6486 1 7 0 92
11 11 0 0 10 2 5950 0 348 1414 6 6768 1 8 0 91
12 1 0 0 17 1 859 0 91 187 1 1015 0 2 0 98
13 3 0 0 14 1 4711 0 268 1012 1 5448 0 6 0 94
14 2 0 0 10 4 2641 0 104 645 5 3069 0 4 0 96
15 6 0 0 14 2 3221 1 168 663 3 3831 1 5 0 94
16 9 0 0 12 2 5111 0 254 1199 3 5890 1 6 0 93
17 11 0 0 16019 15992 5707 41 461 4969 1 6628 1 34 0 65
18 8 0 0 15516 15494 5012 17 291 4610 1 5526 1 26 0 73
19 2 0 0 5 0 3720 0 168 928 5 4262 1 5 0 94
20 2 0 0 19 7 5425 2 290 1217 3 6229 0 6 0 94
21 6 0 0 25 8 6224 2 383 1318 0 7155 1 8 0 91
22 4 0 0 101 44 3415 1 113 736 1 3901 1 4 0 95
23 2 0 0 13 3 2986 0 117 639 2 3456 0 4 0 96
{code}
and when the rate was low (bad):
{code}
# mpstat 1
[...]
CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
0 1 0 33 537 106 6468 122 415 11126 0 1791 1 36 0 63
1 1 0 0 77 1 2386 53 228 20277 0 2035 0 34 0 66
2 3 0 0 132 23 2478 59 293 26326 0 2336 0 46 0 54
3 0 0 0 63 0 2087 40 182 20169 1 2027 1 32 0 67
4 4 0 11 82 9 2078 39 174 22191 0 1971 0 32 0 68
5 2 0 11 72 1 1986 39 221 10603 0 2060 0 20 0 80
6 5 0 0 79 4 2333 49 278 21119 0 2302 0 39 0 61
7 4 0 0 58 0 2069 68 194 17313 0 1858 0 25 0 75
8 3 0 33 252 1 6225 117 156 19189 0 1762 1 29 0 70
9 6 0 11 78 2 1936 59 153 11705 0 1815 0 24 0 76
10 3 0 0 70 11 1964 45 182 25487 0 1867 0 35 0 65
11 0 0 33 45 0 2115 24 202 22668 0 1974 0 32 0 68
12 2 0 0 77 16 1757 34 171 17417 0 1747 0 28 0 72
13 1 0 33 52 3 1883 63 169 18895 0 2367 1 30 0 69
14 2 0 0 73 11 2190 35 174 15868 0 2251 0 26 0 74
15 3 0 22 50 3 1114 42 108 15119 0 1135 0 25 0 75
16 3 0 22 89 12 2104 77 205 12225 0 2160 1 26 0 73
17 1 0 22 9776 9694 2044 95 202 14395 0 1895 0 36 0 64
18 1 0 11 8287 8186 2347 64 226 13584 0 2184 0 37 0 63
19 1 0 0 56 3 1506 23 113 8344 0 1379 1 16 0 83
20 1 0 0 82 11 2604 96 244 21009 0 2516 0 32 0 68
21 1 0 11 56 5 1858 36 160 15028 0 1966 0 24 0 76
22 2 0 0 65 7 1485 25 121 19981 0 1400 0 29 0 71
23 1 0 11 49 2 1629 11 81 18719 0 1467 0 26 0 74
{code}
The mutex spins on the client (the smtx column above) corresponded with
the drops in TCP connection rate. DTrace tracked them down to:
{code}
# dtrace -n 'profile-1234 { @[stack()] = count(); }'
dtrace: description 'profile-1234 ' matched 1 probe
[...]
unix`mutex_delay_default+0x7
unix`mutex_vector_enter+0x2ad
ip`tcp_bindi+0x10c
ip`tcp_connect_ipv4+0x11b
ip`tcp_do_connect+0x20f
ip`tcp_connect+0x9f
sockfs`so_connect+0xf8
sockfs`socket_connect+0x47
sockfs`connect+0xec
unix`_sys_sysenter_post_swapgs+0x149
11047
[...]
{code}
Calculating the ratio:
{code}
# dtrace -n 'profile-1234 { @[func(arg0)] = count(); @t["TOTAL:"] = count(); }'
dtrace: description 'profile-1234 ' matched 1 probe
[...]
ip`tcp_input_data 235
unix`mutex_exit 312
unix`mutex_owner_running 344
unix`disp_getwork 393
0x0 551
unix`tsc_read 648
ip`tcp_update_next_port 859
unix`mutex_vector_enter 1443
unix`mutex_enter 1505
unix`default_lock_delay 2640
unix`mutex_delay_default 13989
ip`tcp_bindi 36034
unix`i86_mwait 169023
TOTAL: 238080
{code}
So the time in tcp_bindi() and the mutex functions is about 64% of the
total non-idle CPU time (and that's just the CPU consumption; it doesn't
count other performance effects). This is worth tuning down, and doing
so should help high TCP client connection rates.
---8<---
Most crushing to me personally is a follow-up comment that Brendan had:
"Bryan identified this issue using ::findfalse on page 84 (PDF page
14) of "Postmortem Object Type Identification"
(http://arxiv.org/pdf/cs/0309037v1). In 2003."
Obviously kicking myself for not having fixed this then!
- Bryan