[illumos-Developer] CODE REVIEW -> Illumos 916, compliments of Joyent
Bryan Cantrill
bryancantrill at gmail.com
Thu Apr 21 18:03:37 PDT 2011
More flesh on this one: Brendan's analysis, Jerry's fix. Here is
Brendan's analysis, again for the record:
---8<---
Summary: I'm getting high contention on the tcps_bind_fanout hash
locks, which is hurting TCP client performance when attempting
thousands of connections per second. I'd suggest increasing the
number of hash locks (currently 512) and padding the hash lock
structure to avoid cache line thrashing.
The lock is used in tcp_bindi():
{code}
tf_t *tbf;
[...]
tbf = &tcps->tcps_bind_fanout[TCP_BIND_HASH(lport)];
{code}
tf_t is:
{code}
/*
* The TCP Fanout structure for bind and acceptor hashes.
* The hash tables and their linkage (tcp_*_hash, tcp_ptp*hn) are
* protected by the per-bucket tf_lock. Each tcp_t
* inserted in the list points back at this lock using tcp_*_lockp.
*
* The bind and acceptor hash queues are lists of tcp_t.
*/
/* listener hash and acceptor hash queue head */
typedef struct tf_s {
        tcp_t           *tf_tcp;
        kmutex_t        tf_lock;
} tf_t;
{code}
That structure is only 16 bytes, so four adjacent buckets share a
single 64-byte cache line. There are 512 of them:
{code}
#define TCP_BIND_FANOUT_SIZE 512
#define TCP_BIND_HASH(lport) (ntohs(lport) & (TCP_BIND_FANOUT_SIZE - 1))
{code}
ZFS's arc.c has some neat code for padding hash locks, e.g.:
{code}
#define HT_LOCK_PAD 64
struct ht_lock {
        kmutex_t        ht_lock;
#ifdef _KERNEL
        unsigned char   pad[(HT_LOCK_PAD - sizeof (kmutex_t))];
#endif
};
{code}
(TCP doesn't need the ifdefs).
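To make the suggestion concrete, here is a minimal sketch of what a
padded bind fanout bucket might look like. This is only an
illustration, not Jerry's actual change: the pad field name, the
64-byte cache line size, and the enlarged fanout value are assumptions.
{code}
/*
 * Illustrative sketch only.  Pad each bind fanout bucket out to a full
 * cache line so adjacent buckets no longer share a line, and
 * (optionally) enlarge the fanout itself.
 */
#define TCP_BIND_FANOUT_SIZE    2048    /* example value; currently 512 */

#define TF_CACHE_LINE_SIZE      64      /* assumed x86 cache line size */

/* listener hash and acceptor hash queue head, padded to a cache line */
typedef struct tf_s {
        tcp_t           *tf_tcp;
        kmutex_t        tf_lock;
        unsigned char   tf_pad[TF_CACHE_LINE_SIZE -
                            (sizeof (tcp_t *) + sizeof (kmutex_t))];
} tf_t;
{code}
TCP_BIND_HASH needs no change, since its mask is derived from
TCP_BIND_FANOUT_SIZE; and the padding alone removes the false sharing
between adjacent buckets even if the fanout stays at 512.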
...
I hit this when testing node.js connection performance. Here was the
connection rate (see the TCPaccept column):
{code}
/var/tmp/brendan]# ./tcpaccept.d
ListenDrop ListenDropQ0 TCPaccept TCPrefused SCAccept SCAF
0 0 2434 0 2785 353
0 0 27232 0 32550 5317
0 0 25262 0 31916 6721
0 0 17810 0 27054 9175
0 0 11078 0 23044 11967
0 0 6669 0 19185 12515
0 0 4256 0 17147 12892
0 0 3196 0 13347 10151
0 0 2646 0 10249 7602
0 0 1633 0 7732 6100
0 0 2825 0 9351 6551
0 0 26220 0 37283 11038
0 0 25146 0 32097 6951
0 0 17496 0 27529 10032
0 0 11285 0 20908 9624
0 0 7825 0 18819 10993
0 0 8019 0 17888 9870
0 0 5678 0 17726 12047
0 0 3872 0 15290 11418
0 0 2805 0 10752 7948
0 0 3402 0 11354 7953
0 0 25156 0 34692 9635
0 0 25376 0 31846 6370
0 0 17993 0 26556 8562
0 0 11100 0 21531 10433
0 0 7527 0 19100 11572
0 0 8392 0 17483 9091
0 0 6541 0 20777 14236
0 0 3326 0 14333 11007
0 0 3571 0 10253 6682
0 0 2892 0 11165 8276
0 0 23881 0 36870 13069
0 0 25153 0 32465 7232
0 0 18701 0 27845 9161
0 0 12086 0 23565 11459
0 0 6478 0 19332 12854
0 0 9615 0 23797 14182
{code}
The variability is large - between 1633 and 27232 accepts per second -
but the cause was not obviously on the server. The client, however,
showed the following mpstat output when the rate was high (good):
{code}
# mpstat 1
[...]
CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
0 0 0 0 453 102 15850 60 369 8172 0 885 1 41 0 58
1 15 0 0 14 0 9525 7 783 2070 5 10743 2 12 0 86
2 7 0 0 96 30 7984 1 512 1853 3 9128 1 9 0 90
3 6 0 0 18 2 7536 1 520 1734 7 8495 1 10 0 89
4 10 0 0 34 0 4056 0 170 906 3 4619 1 4 0 95
5 6 0 0 20 0 3198 1 164 709 1 3707 1 5 0 94
6 3 0 0 87 38 5039 0 254 1105 5 5754 1 7 0 92
7 5 0 0 15 2 5906 2 361 1275 1 7132 1 8 0 91
8 3 0 0 247 0 18452 137 136 1040 2 3121 0 6 0 94
9 9 0 0 15 0 3412 4 161 742 0 3940 0 4 0 96
10 11 0 0 15 4 5647 2 273 1318 3 6486 1 7 0 92
11 11 0 0 10 2 5950 0 348 1414 6 6768 1 8 0 91
12 1 0 0 17 1 859 0 91 187 1 1015 0 2 0 98
13 3 0 0 14 1 4711 0 268 1012 1 5448 0 6 0 94
14 2 0 0 10 4 2641 0 104 645 5 3069 0 4 0 96
15 6 0 0 14 2 3221 1 168 663 3 3831 1 5 0 94
16 9 0 0 12 2 5111 0 254 1199 3 5890 1 6 0 93
17 11 0 0 16019 15992 5707 41 461 4969 1 6628 1 34 0 65
18 8 0 0 15516 15494 5012 17 291 4610 1 5526 1 26 0 73
19 2 0 0 5 0 3720 0 168 928 5 4262 1 5 0 94
20 2 0 0 19 7 5425 2 290 1217 3 6229 0 6 0 94
21 6 0 0 25 8 6224 2 383 1318 0 7155 1 8 0 91
22 4 0 0 101 44 3415 1 113 736 1 3901 1 4 0 95
23 2 0 0 13 3 2986 0 117 639 2 3456 0 4 0 96
{code}
and when the rate was low (bad):
{code}
# mpstat 1
[...]
CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
0 1 0 33 537 106 6468 122 415 11126 0 1791 1 36 0 63
1 1 0 0 77 1 2386 53 228 20277 0 2035 0 34 0 66
2 3 0 0 132 23 2478 59 293 26326 0 2336 0 46 0 54
3 0 0 0 63 0 2087 40 182 20169 1 2027 1 32 0 67
4 4 0 11 82 9 2078 39 174 22191 0 1971 0 32 0 68
5 2 0 11 72 1 1986 39 221 10603 0 2060 0 20 0 80
6 5 0 0 79 4 2333 49 278 21119 0 2302 0 39 0 61
7 4 0 0 58 0 2069 68 194 17313 0 1858 0 25 0 75
8 3 0 33 252 1 6225 117 156 19189 0 1762 1 29 0 70
9 6 0 11 78 2 1936 59 153 11705 0 1815 0 24 0 76
10 3 0 0 70 11 1964 45 182 25487 0 1867 0 35 0 65
11 0 0 33 45 0 2115 24 202 22668 0 1974 0 32 0 68
12 2 0 0 77 16 1757 34 171 17417 0 1747 0 28 0 72
13 1 0 33 52 3 1883 63 169 18895 0 2367 1 30 0 69
14 2 0 0 73 11 2190 35 174 15868 0 2251 0 26 0 74
15 3 0 22 50 3 1114 42 108 15119 0 1135 0 25 0 75
16 3 0 22 89 12 2104 77 205 12225 0 2160 1 26 0 73
17 1 0 22 9776 9694 2044 95 202 14395 0 1895 0 36 0 64
18 1 0 11 8287 8186 2347 64 226 13584 0 2184 0 37 0 63
19 1 0 0 56 3 1506 23 113 8344 0 1379 1 16 0 83
20 1 0 0 82 11 2604 96 244 21009 0 2516 0 32 0 68
21 1 0 11 56 5 1858 36 160 15028 0 1966 0 24 0 76
22 2 0 0 65 7 1485 25 121 19981 0 1400 0 29 0 71
23 1 0 11 49 2 1629 11 81 18719 0 1467 0 26 0 74
{code}
The mutex spins on the client (the smtx column above) corresponded with
the drops in TCP connection rate. DTrace tracked them down to:
{code}
# dtrace -n 'profile-1234 { @[stack()] = count(); }'
dtrace: description 'profile-1234 ' matched 1 probe
[...]
unix`mutex_delay_default+0x7
unix`mutex_vector_enter+0x2ad
ip`tcp_bindi+0x10c
ip`tcp_connect_ipv4+0x11b
ip`tcp_do_connect+0x20f
ip`tcp_connect+0x9f
sockfs`so_connect+0xf8
sockfs`socket_connect+0x47
sockfs`connect+0xec
unix`_sys_sysenter_post_swapgs+0x149
11047
[...]
{code}
Calculating the ratio:
{code}
# dtrace -n 'profile-1234 { @[func(arg0)] = count(); @t["TOTAL:"] = count(); }'
dtrace: description 'profile-1234 ' matched 1 probe
[...]
ip`tcp_input_data 235
unix`mutex_exit 312
unix`mutex_owner_running 344
unix`disp_getwork 393
0x0 551
unix`tsc_read 648
ip`tcp_update_next_port 859
unix`mutex_vector_enter 1443
unix`mutex_enter 1505
unix`default_lock_delay 2640
unix`mutex_delay_default 13989
ip`tcp_bindi 36034
unix`i86_mwait 169023
TOTAL: 238080
{code}
So the time in tcp_bindi() and the mutex functions is about 64% of the
total non-idle CPU time (and that's just the CPU consumption; it doesn't
count other performance effects). This is worth tuning down, and doing
so should help high TCP client connection rates.
---8<---
Most crushing to me personally is a follow-up comment that Brendan had:
"Bryan identified this issue using ::findfalse on page 84 (PDF page
14) of "Postmortem Object Type Identification"
(http://arxiv.org/pdf/cs/0309037v1). In 2003."
Obviously kicking myself for not having fixed this then!
- Bryan