[illumos-Developer] DESIGN REVIEW & CODE REVIEW: 917, compliments of Joyent
Jerry Jelinek
jerry.jelinek at joyent.com
Tue Apr 26 12:54:14 PDT 2011
On Tue, Apr 26, 2011 at 11:56 AM, Dan McDonald <danmcd at nexenta.com> wrote:
> Last week I mentioned *four* Joyent-contributed TCP/IP fixes. Here's #3:
>
> http://www.kebe.com/~danmcd/webrevs/917/
>
> This was contributed by Jerry Jelinek. My own review is that the fix itself
> is mechanically fine, and I have no review comments on that front.
>
> What I would like in addition to one other reviewer is something not unlike
> what Bryan mailed out last week with 915 and 916 discussion: the analysis
> that led to the change.
>
> I see that the former constant TCP_ISS_INCR is now an ndd(1m)-tunable
> variable, with an allowable range from 1 to TCP_ISS_INCR. I'd like to know:
>
> - What happened in the field to inspire this?
>
> - Why a range of 1..TCP_ISS_INCR?
>
> - Does this fix work with other settings of tcp_strong_iss?
> We default to 1 (random increment), and have 0 (constant
> increment) or 2 (MD5 hash + random increment). Apart from my
> own feelings that we should default to '2', did folks test these
> changes on systems with tcp_strong_iss set to different values?
Dan,
There are a lot of details about this issue in our internal bug report.
I have extracted what seems to be the most relevant information and
included it here. The analysis on this problem was performed by
Brendan Gregg.
If you still need more info on this, I can try to extract additional details
from our bug report.
Let me know if you have any questions.
Thanks,
Jerry
Summary:
Attempting many TCP connections from the same remote host to the same local
port can result in connections taking multiple seconds to complete - either
3.4s or 1.1s (depending on tcp_rexmit_interval_initial), or multiples thereof.
This has been hit by different customers, especially when using multiple tiers
of software on the same host, which communicate using frequent socket
connections to the same port (localhost). We hit this when testing node.js
with ab (Apache Bench), connecting over loopback.
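For anyone who wants to reproduce this, the workload is easy to recreate; the
port and request counts below are illustrative, not the exact values we used -
any HTTP server listening on loopback will do:

# ab -n 100000 -c 100 http://127.0.0.1:8080/

The outliers show up in ab's connection-time summary.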
A synopsis of the detailed analysis follows below.
In summary: TCP is cycling the remote ephemeral port (16 bit) quickly
(especially given TCP TIME-WAIT is 60 seconds), which risks new connections
being misidentified as old connections (same remote host and port, and same
local port). However, this should be ok since the initial sequence number
(ISN) should usually be positive and there is code to check this and accept
this as a new connection. But! The ISN is often negative, and Solaris thinks
this is an old connection, drops the packet (which was the SYN), resulting in
the remote host timing out and doing a retransmit. The bug here is that the ISN
is going negative on Solaris too often, which we believe is due to ISS_INCR.
The comment in tcp_impl.h doesn't say exactly where ISS_INCR comes from, but it
cites RFC 793 and the original rate of 1 per 4 us. At that original rate, the ISN
increases by 250000 per second. So, tcps_iss_incr_extra is tacking on half a
second worth of extra ISN (ISS_INCR >> 1).
This extra increment is a little suspicious, and it makes the ISNs cycle
through the 32-bit range rather quickly. I decreased it in the live code to a
few values, and found ISS_INCR=25000 worked best (too high and you hit
timeouts, too low and you hit RSTs more quickly).
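To put rough numbers on that (the bump rate below is purely illustrative): at
the RFC 793 clock rate of 250000/second alone, the 32-bit sequence space
(about 4.3 billion values) takes roughly 4300000000 / 250000 = ~17000 seconds,
or close to five hours, to wrap. Each ISS_INCR >> 1 bump adds another 125000,
i.e. half a second worth of clock, so at thousands of those bumps per second
the effective advance rate is orders of magnitude higher and the sequence
space can wrap in seconds rather than hours.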
This issue is going to keep hurting for high connection rates between two hosts.
This bug is to make ISS_INCR tunable so that it can be tuned when needed
to a value that supports high connection rate scalability.
The fix was tested on the problematic workload and shown to be effective.
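For reference, with the fix in place the increment becomes adjustable at
runtime via ndd(1m). The exact parameter name should be checked against the
webrev - assuming it is exposed as tcp_iss_incr, setting the value that worked
best in our testing would look something like:

# ndd -set /dev/tcp tcp_iss_incr 25000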
---
Customer is seeing a small fraction of http connections take 3.4 seconds intra
zone. When the test was moved to another zone under a different global zone,
they are still seeing the delay.
For test runs of fewer than 16k requests, 0.5% of the total requests are in the
3.4 second range. At 61918 requests, 6% of the requests take longer than the
usual sub-second time.
Please determine why Solaris has outliers in the 3.4 second range for socket
creation.
For reference, DTrace was used to see what TCP was doing during the
tests (this output is global, including all zones):
# dtrace -n 'mib:::tcp* { @[probename] = count(); } tick-30s { exit(0); }'
CPU ID FUNCTION:NAME
1 80565 :tick-30s
tcpOutFastRetrans 2
tcpTimRetransDrop 3
tcpInDataUnorderBytes 5
tcpInDataUnorderSegs 5
tcpOutWinUpdate 7
tcpEstabResets 8
tcpInErrs 23
tcpOutSackRetransSegs 24
tcpInClosed 36
tcpTimRetrans 70
tcpRetransBytes 88
tcpRetransSegs 88
tcpOutRsts 99
tcpCurrEstab 197
tcpActiveOpens 333
tcpRttNoUpdate 389
tcpInSegs 551
tcpOutAckDelayed 564
tcpInDataDupBytes 1132
tcpInDataDupSegs 1132
tcpHCInSegs 4436
tcpHCOutSegs 4436
tcpPassiveOpens 12038
tcpInDupAck 12366
tcpOutControl 24902
tcpOutAck 31121
tcpRttUpdate 33004
tcpInDataInorderBytes 38363
tcpInDataInorderSegs 38363
tcpInAckBytes 45676
tcpInAckSegs 45676
tcpOutDataSegs 68109
tcpOutDataBytes 68533
# dtrace -n 'mib:::tcp* { @[probename] = count(); } tick-30s { exit(0); }'
CPU ID FUNCTION:NAME
1 18215 :tick-30s
tcpTimKeepalive 1
tcpInClosed 6
tcpOutSackRetransSegs 7
tcpTimRetransDrop 11
tcpAttemptFails 12
tcpInErrs 22
tcpEstabResets 23
tcpInDataDupBytes 24
tcpInDataDupSegs 24
tcpOutWinUpdate 42
tcpInWinUpdate 83
tcpOutWinProbe 85
tcpTimRetrans 92
tcpOutRsts 1143
tcpRetransBytes 1215
tcpRetransSegs 1215
tcpCurrEstab 1851
tcpPassiveOpens 2527
tcpOutAckDelayed 5180
tcpActiveOpens 12899
tcpInSegs 13775
tcpInDupAck 14058
tcpRttNoUpdate 15689
tcpHCInSegs 25262
tcpHCOutSegs 25262
tcpOutControl 33025
tcpRttUpdate 73152
tcpOutAck 73672
tcpInAckBytes 103657
tcpInAckSegs 103657
tcpInDataInorderBytes 144884
tcpInDataInorderSegs 144884
tcpOutDataSegs 158843
tcpOutDataBytes 158844
This produced a few leads (tcpInErrs, tcpTimRetrans, ...). It was the presence
of tcpTimRetrans that led me to look at the retransmit tunable and the raw
times.
To start with, the usual TCP tunables were adjusted (tcp_conn_req_max_q,
tcp_conn_req_max_q0) but this did not solve the issue.
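For reference, the adjustments were along these lines (the values shown are
illustrative, not necessarily the exact ones used):

# ndd -get /dev/tcp tcp_rexmit_interval_initial
# ndd -set /dev/tcp tcp_conn_req_max_q 4096
# ndd -set /dev/tcp tcp_conn_req_max_q0 4096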
To look for locations dropping the SYNs, all freemsg() calls were traced using
DTrace:
# dtrace -n 'fbt::freemsg:entry { @[stack()] = count(); }'
dtrace: description 'fbt::freemsg:entry ' matched 1 probe
^C
[...]
sockfs`sotpi_getsockopt+0x5d0
sockfs`socket_getsockopt+0x34
sockfs`getsockopt+0xc7
unix`_sys_sysenter_post_swapgs+0x149
188
ip`tcp_time_wait_processing+0x70b
ip`tcp_rput_data+0x366a
ip`squeue_enter+0x336
ip`ip_input+0xc2e
dls`i_dls_link_rx+0x2dd
mac`mac_rx_deliver+0x5d
mac`mac_rx_srs_drain+0x187
mac`mac_rx_srs_process+0x1db
mac`mac_rx_srs_subflow_process+0x211
mac`mac_rx_classify+0x159
mac`mac_rx_flow+0x54
mac`mac_rx+0x1f6
mac`mac_rx_ring+0x4c
e1000g`e1000g_intr_pciexpress+0x17e
unix`av_dispatch_autovect+0x7c
unix`dispatch_hardint+0x33
unix`switch_sp_and_call+0x13
223
[...]
ip`tcp_conn_request+0xc0b
ip`squeue_drain+0x182
ip`squeue_enter+0x3fa
dls`i_dls_link_rx+0x2dd
mac`mac_rx_deliver+0x5d
mac`mac_rx_srs_drain+0x187
mac`mac_rx_srs_process+0x1db
mac`mac_rx_srs_subflow_process+0x211
mac`mac_rx_classify+0x159
mac`mac_rx_flow+0x54
mac`mac_rx+0x1f6
mac`mac_rx_ring+0x4c
e1000g`e1000g_intr_pciexpress+0x17e
unix`av_dispatch_autovect+0x7c
unix`dispatch_hardint+0x33
unix`switch_sp_and_call+0x13
13098
genunix`freemsgchain+0x21
e1000g`e1000g_recycle+0x253
e1000g`e1000g_intr_pciexpress+0x1ae
unix`av_dispatch_autovect+0x7c
unix`dispatch_hardint+0x33
unix`switch_sp_and_call+0x13
28650
While many stacks were identified, it was the tcp_time_wait_processing() stack
that seemed most likely - occurring with a count (223) similar to the number of
long latency connections. The source showed that this path could drop packets
(and send an ACK in response) if they were sent to a TCP session that was still
in the TIME-WAIT state.
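A quick way to see how often that code path fires during a test run (a simple
entry count, not specific to drops) is:

# dtrace -n 'fbt::tcp_time_wait_processing:entry { @ = count(); }'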
To get more information on how SYNs could be dropped in TIME-WAIT, snoop was
used to capture network packets during a 10 second run.
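Such a capture can be taken with something along these lines (the interface
and filter here are illustrative - use whatever interface and port the
workload is actually on):

# snoop -o /tmp/syns.cap -d e1000g0 tcp and port 80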
The first group of packets shows a normal TCP connection at time 8.08 seconds,
using the ephemeral port 59789. The connection finishes, and I assume it will
then remain in TIME-WAIT for 60 seconds (the current value of
tcp_time_wait_interval).
The second group of packets shows a TCP SYN at 9.73 seconds, using the same
ephemeral port, 59789 (which I suspect happens by random chance, since this
host is making thousands of connections in a short period of time). Since
this is the same connection hash (the IP classifier uses
source-IP:source-port:dest-port), it would be classified as belonging to the
connection that is still in TIME-WAIT. This would then take the
tcp_time_wait_processing() code-path, be treated as a duplicate packet (but
it's not, it's supposed to be a new connection), and not complete the TCP
handshake. Instead, that code-path responds with an ACK of the previous
sequence number, which the client responds to with a reset. These 2nd and 3rd
packets follow these two paragraphs from RFC 793:
3. If the connection is in a synchronized state (ESTABLISHED,
FIN-WAIT-1, FIN-WAIT-2, CLOSE-WAIT, CLOSING, LAST-ACK, TIME-WAIT),
any unacceptable segment (out of window sequence number or
unacceptible acknowledgment number) must elicit only an empty
acknowledgment segment containing the current send-sequence number
and an acknowledgment indicating the next sequence number expected
to be received, and the connection remains in the same state.
If an incoming segment has a security level, or compartment, or
precedence which does not exactly match the level, and compartment,
and precedence requested for the connection,a reset is sent and
connection goes to the CLOSED state. The reset takes its sequence
number from the ACK field of the incoming segment.
The third group of packets at time 13.12 seconds (3.39 seconds later, which
matches the latency seen and would be dominated by
tcp_rexmit_interval_initial), appears to be the retransmit (identical
details - port and seq), which does succeed (why it succeeds a second time I'm
not yet sure; the Solaris stack is supposed to be hardened against TCP
TIME-WAIT assassination).
So, it would seem that since load generation is from one client only, it is
recycling ephemeral ports while they are still in TIME-WAIT, causing the SYNs
to be dropped (mis-processed) and later retransmitted.
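An easy way to watch how crowded TIME-WAIT gets on the server during a run
(illustrative; counts connections currently in TIME_WAIT):

# netstat -an -f inet | grep -c TIME_WAIT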
To test this theory, the tcp_smallest_anon_port tunable was lowered (from
its default of 32768) to give the client a wider range of ephemeral ports,
and the tcp_time_wait_interval was lowered on the app server (from its default
of 60000 ms), to close the TIME-WAIT sessions sooner. Both of these changes
improved the performance runs - reducing the number of 3+ second outliers from
over 100 to less than 50.
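The changes were along these lines (the values show the direction of the
change, not necessarily the exact numbers used; "client#" and "server#"
indicate which side each was applied to):

client# ndd -set /dev/tcp tcp_smallest_anon_port 8192
server# ndd -set /dev/tcp tcp_time_wait_interval 10000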
This discovery also explains another behaviour previously seen - if the test
was run twice within 60 seconds, the second test ran much more poorly,
presumably because many sessions were still in TIME-WAIT, keeping the ephemeral
ports open. When tcp_time_wait_interval was reduced, the second run (after the
interval) was as fast as the first.
In summary, making thousands of TCP connections from the same client to the same
host and port within the tcp_time_wait_interval (60 seconds) runs the risk of
ephemeral port recycling, the SYNs being dropped, and TCP retransmit timeouts.
This means that tcp_strong_iss = 0 does not behave as expected on Solaris
(expected being monotonically increasing), given other traffic on the system
(which can reset tcp_iss_incr_extra).
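(For anyone checking their own systems, the ISS generation mode can be
inspected or changed with ndd, e.g.:

# ndd -get /dev/tcp tcp_strong_iss
# ndd -set /dev/tcp tcp_strong_iss 2

where 0, 1 and 2 are the constant-increment, random-increment and MD5 modes
Dan described above.)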
While researching this, I came across the following block comment in the
FreeBSD source code:
1479 * The ISNs in SYN packets must be monotonic; TIME_WAIT recycling
1480 * depends on this property. In addition, these ISNs should be
1481 * unguessable so as to prevent connection hijacking. To satisfy
1482 * the requirements of this situation, the algorithm outlined in
1483 * RFC 1948 is used, with only small modifications.
We've already learned this - our TIME-WAIT recycling is not always working,
and the sequence numbers in SYNs (ISNs) are not monotonic. I don't think
Solaris is oblivious to the issue - the role of tcps_iss_incr_extra should
help with monotonically increasing ISNs. But, in this case, they are not.