[illumos-Developer] DESIGN REVIEW & CODE REVIEW: 917, compliments of Joyent
Jerry Jelinek
jerry.jelinek at joyent.com
Tue Apr 26 12:54:14 PDT 2011
On Tue, Apr 26, 2011 at 11:56 AM, Dan McDonald <danmcd at nexenta.com> wrote:
> Last week I mentioned *four* Joyent-contributed TCP/IP fixes. Here's #3:
>
> http://www.kebe.com/~danmcd/webrevs/917/
>
> This was contributed by Jerry Jelinek. My own review is that the fix itself
> is mechanically fine, and I have no review comments on that front.
>
> What I would like in addition to one other reviewer is something not unlike
> what Bryan mailed out last week with 915 and 916 discussion: the analysis
> that led to the change.
>
> I see that the former constant TCP_ISS_INCR is now an ndd(1m)-tunable
> variable, with an allowable range from 1 to TCP_ISS_INCR. I'd like to know:
>
> - What happened in the field to inspire this?
>
> - Why a range of 1..TCP_ISS_INCR?
>
> - Does this fix work with other settings of tcp_strong_iss?
> We default to 1 (random increment), and have 0 (constant
> increment) or 2 (MD5 hash + random increment). Apart from my
> own feelings that we should default to '2', did folks test these
> changes on systems with tcp_strong_iss set to different values?
Dan,
There are a lot of details about this issue in our internal bug report.
I have extracted what seems to be the most relevant information and
included it here. The analysis on this problem was performed by
Brendan Gregg.
If you still need more info on this, I can try to extract additional details
from our bug report.
Let me know if you have any questions.
Thanks,
Jerry
Summary:
Attempting many TCP connections from the same remote host to the same local
port can result in connections taking multiple seconds to complete - either
3.4s or 1.1s (depending on tcp_rexmit_interval_initial), or multiples thereof.
This has been hit by different customers, especially when using multiple tiers
of software on the same host, which communicate using frequent socket
connections to the same port (localhost). We hit this when testing node.js
with ab (Apache Bench), connecting over loopback.
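For anyone who wants to reproduce this, the workload is easy to recreate; the
port and request counts below are illustrative, not the exact values we used -
any HTTP server listening on loopback will do:

# ab -n 100000 -c 100 http://127.0.0.1:8080/

The outliers show up in ab's connection-time summary.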
A synopsis of the detailed analysis follows below.
In summary: TCP is cycling the remote ephemeral port (16 bit) quickly
(especially given TCP TIME-WAIT is 60 seconds), which risks new connections
being misidentified as old connections (same remote host and port, and same
local port). However, this should be ok since the initial sequence number
(ISN) should usually be positive and there is code to check this and accept
this as a new connection. But! The ISN is often negative, and Solaris thinks
this is an old connection, drops the packet (which was the SYN), resulting in
the remote host timing out and doing a retransmit. The bug here is that the ISN
is going negative on Solaris too often, which we believe is due to ISS_INCR.
The comment in tcp_impl.h doesn't say exactly where ISS_INCR comes from, but it
cites RFC 793 and the original rate of 1 per 4 us. At that original rate, the ISN
increases by 250000 per second. So, tcps_iss_incr_extra is tacking on half a
second worth of extra ISN (ISS_INCR >> 1).
This extra increment is a little suspicious, and it makes the ISNs cycle
through the 32-bit range rather quickly. I decreased it in the live code to a
few values, and found ISS_INCR=25000 worked best (too high and you hit
timeouts, too low and you hit RSTs more quickly).
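To put rough numbers on that (the bump rate below is purely illustrative): at
the RFC 793 clock rate of 250000/second alone, the 32-bit sequence space
(about 4.3 billion values) takes roughly 4300000000 / 250000 = ~17000 seconds,
or close to five hours, to wrap. Each ISS_INCR >> 1 bump adds another 125000,
i.e. half a second worth of clock, so at thousands of those bumps per second
the effective advance rate is orders of magnitude higher and the sequence
space can wrap in seconds rather than hours.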
This issue is going to keep hurting for high connection rates between two hosts.
This bug is to make ISS_INCR tunable so that it can be tuned when needed
to a value that supports high connection rate scalability.
The fix was tested on the problematic workload and shown to be effective.
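For reference, with the fix in place the increment becomes adjustable at
runtime via ndd(1m). The exact parameter name should be checked against the
webrev - assuming it is exposed as tcp_iss_incr, setting the value that worked
best in our testing would look something like:

# ndd -set /dev/tcp tcp_iss_incr 25000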
---
Customer is seeing a small fraction of http connections take 3.4 seconds intra
zone. When the test was moved to another zone under a different global zone,
they are still seeing the delay.
For test runs of fewer than 16k requests, 0.5% of the total requests are in the
3.4 second range. At 61918 requests, 6% of the requests take longer than the
usual sub-second time.
Please determine why Solaris has outliers in the 3.4 second range for socket
creation.
For reference, DTrace was used to see what TCP was doing during the
tests (this output is global, including all zones):
# dtrace -n 'mib:::tcp* { @[probename] = count(); } tick-30s { exit(0); }'
CPU ID FUNCTION:NAME
1 80565 :tick-30s
tcpOutFastRetrans 2
tcpTimRetransDrop 3
tcpInDataUnorderBytes 5
tcpInDataUnorderSegs 5
tcpOutWinUpdate 7
tcpEstabResets 8
tcpInErrs 23
tcpOutSackRetransSegs 24
tcpInClosed 36
tcpTimRetrans 70
tcpRetransBytes 88
tcpRetransSegs 88
tcpOutRsts 99
tcpCurrEstab 197
tcpActiveOpens 333
tcpRttNoUpdate 389
tcpInSegs 551
tcpOutAckDelayed 564
tcpInDataDupBytes 1132
tcpInDataDupSegs 1132
tcpHCInSegs 4436
tcpHCOutSegs 4436
tcpPassiveOpens 12038
tcpInDupAck 12366
tcpOutControl 24902
tcpOutAck 31121
tcpRttUpdate 33004
tcpInDataInorderBytes 38363
tcpInDataInorderSegs 38363
tcpInAckBytes 45676
tcpInAckSegs 45676
tcpOutDataSegs 68109
tcpOutDataBytes 68533
# dtrace -n 'mib:::tcp* { @[probename] = count(); } tick-30s { exit(0); }'
CPU ID FUNCTION:NAME
1 18215 :tick-30s
tcpTimKeepalive 1
tcpInClosed 6
tcpOutSackRetransSegs 7
tcpTimRetransDrop 11
tcpAttemptFails 12
tcpInErrs 22
tcpEstabResets 23
tcpInDataDupBytes 24
tcpInDataDupSegs 24
tcpOutWinUpdate 42
tcpInWinUpdate 83
tcpOutWinProbe 85
tcpTimRetrans 92
tcpOutRsts 1143
tcpRetransBytes 1215
tcpRetransSegs 1215
tcpCurrEstab 1851
tcpPassiveOpens 2527
tcpOutAckDelayed 5180
tcpActiveOpens 12899
tcpInSegs 13775
tcpInDupAck 14058
tcpRttNoUpdate 15689
tcpHCInSegs 25262
tcpHCOutSegs 25262
tcpOutControl 33025
tcpRttUpdate 73152
tcpOutAck 73672
tcpInAckBytes 103657
tcpInAckSegs 103657
tcpInDataInorderBytes 144884
tcpInDataInorderSegs 144884
tcpOutDataSegs 158843
tcpOutDataBytes 158844
This produced a few leads (tcpInErrs, tcpTimRetrans, ...). It was the presence
of tcpTimRetrans that led me to look at the retransmit tunable and the raw
times.
To start with, the usual TCP tunables were adjusted (tcp_conn_req_max_q,
tcp_conn_req_max_q0) but this did not solve the issue.
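For reference, the adjustments were along these lines (the values shown are
illustrative, not necessarily the exact ones used):

# ndd -get /dev/tcp tcp_rexmit_interval_initial
# ndd -set /dev/tcp tcp_conn_req_max_q 4096
# ndd -set /dev/tcp tcp_conn_req_max_q0 4096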
To look for locations dropping the SYNs, all freemsg() calls were traced using
DTrace:
# dtrace -n 'fbt::freemsg:entry { @[stack()] = count(); }'
dtrace: description 'fbt::freemsg:entry ' matched 1 probe
^C
[...]
sockfs`sotpi_getsockopt+0x5d0
sockfs`socket_getsockopt+0x34
sockfs`getsockopt+0xc7
unix`_sys_sysenter_post_swapgs+0x149
188
ip`tcp_time_wait_processing+0x70b
ip`tcp_rput_data+0x366a
ip`squeue_enter+0x336
ip`ip_input+0xc2e
dls`i_dls_link_rx+0x2dd
mac`mac_rx_deliver+0x5d
mac`mac_rx_srs_drain+0x187
mac`mac_rx_srs_process+0x1db
mac`mac_rx_srs_subflow_process+0x211
mac`mac_rx_classify+0x159
mac`mac_rx_flow+0x54
mac`mac_rx+0x1f6
mac`mac_rx_ring+0x4c
e1000g`e1000g_intr_pciexpress+0x17e
unix`av_dispatch_autovect+0x7c
unix`dispatch_hardint+0x33
unix`switch_sp_and_call+0x13
223
[...]
ip`tcp_conn_request+0xc0b
ip`squeue_drain+0x182
ip`squeue_enter+0x3fa
dls`i_dls_link_rx+0x2dd
mac`mac_rx_deliver+0x5d
mac`mac_rx_srs_drain+0x187
mac`mac_rx_srs_process+0x1db
mac`mac_rx_srs_subflow_process+0x211
mac`mac_rx_classify+0x159
mac`mac_rx_flow+0x54
mac`mac_rx+0x1f6
mac`mac_rx_ring+0x4c
e1000g`e1000g_intr_pciexpress+0x17e
unix`av_dispatch_autovect+0x7c
unix`dispatch_hardint+0x33
unix`switch_sp_and_call+0x13
13098
genunix`freemsgchain+0x21
e1000g`e1000g_recycle+0x253
e1000g`e1000g_intr_pciexpress+0x1ae
unix`av_dispatch_autovect+0x7c
unix`dispatch_hardint+0x33
unix`switch_sp_and_call+0x13
28650
While many stacks were identified, it was the tcp_time_wait_processing() stack
that seemed most likely - occurring with a count (223) similar to the number of
long latency connections. The source showed that this path could drop packets
(and send an ACK in response) if they were sent to a TCP session that was still
in the TIME-WAIT state.
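A quick way to see how often that code path fires during a test run (a simple
entry count, not specific to drops) is:

# dtrace -n 'fbt::tcp_time_wait_processing:entry { @ = count(); }'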
To get more information on how SYNs could be dropped in TIME-WAIT, snoop was
used to capture network packets during a 10 second run.
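Such a capture can be taken with something along these lines (the interface
and filter here are illustrative - use whatever interface and port the
workload is actually on):

# snoop -o /tmp/syns.cap -d e1000g0 tcp and port 80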
The first group of packets shows a normal TCP connection at time 8.08 seconds,
using the ephemeral port 59789. The connection finishes, and I assume it will
then remain in TIME-WAIT for 60 seconds (the current value of
tcp_time_wait_interval).
The second group of packets shows a TCP SYN at 9.73 seconds, using the same
ephemeral port, 59789 (which I suspect happens by random chance, since this
host is making thousands of connections in a short period of time). Since
this is the same connection hash (the IP classifier uses
source-IP:source-port:dest-port), it would be classified as belonging to the
connection that is still in TIME-WAIT. This would then take the
tcp_time_wait_processing() code-path, be treated as a duplicate packet (but
it's not, it's supposed to be a new connection), and not complete the TCP
handshake. Instead, that code-path responds with an ACK of the previous
sequence number, which the client responds to with a reset. These 2nd and 3rd
packets follow these two paragraphs from RFC 793:
3. If the connection is in a synchronized state (ESTABLISHED,
FIN-WAIT-1, FIN-WAIT-2, CLOSE-WAIT, CLOSING, LAST-ACK, TIME-WAIT),
any unacceptable segment (out of window sequence number or
unacceptible acknowledgment number) must elicit only an empty
acknowledgment segment containing the current send-sequence number
and an acknowledgment indicating the next sequence number expected
to be received, and the connection remains in the same state.
If an incoming segment has a security level, or compartment, or
precedence which does not exactly match the level, and compartment,
and precedence requested for the connection,a reset is sent and
connection goes to the CLOSED state. The reset takes its sequence
number from the ACK field of the incoming segment.
The third group of packets at time 13.12 seconds (3.39 seconds later, which
matches the latency seen and would be dominated by
tcp_rexmit_interval_initial), appears to be the retransmit (identical
details - port and seq), which does succeed (why it succeeds a second time I'm
not yet sure; the Solaris stack is supposed to be hardened against TCP
TIME-WAIT assassination).
So, it would seem that since load generation is from one client only, it is
recycling ephemeral ports while they are still in TIME-WAIT, causing the SYNs
to be dropped (mis-processed) and later retransmitted.
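An easy way to watch how crowded TIME-WAIT gets on the server during a run
(illustrative; counts connections currently in TIME_WAIT):

# netstat -an -f inet | grep -c TIME_WAIT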
To test this theory, the tcp_smallest_anon_port tunable was lowered (from
its default of 32768) to give the client a wider range of ephemeral ports,
and the tcp_time_wait_interval was lowered on the app server (from its default
of 60000 ms), to close the TIME-WAIT sessions sooner. Both of these changes
improved the performance runs - reducing the number of 3+ second outliers from
over 100 to less than 50.
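The changes were along these lines (the values show the direction of the
change, not necessarily the exact numbers used; "client#" and "server#"
indicate which side each was applied to):

client# ndd -set /dev/tcp tcp_smallest_anon_port 8192
server# ndd -set /dev/tcp tcp_time_wait_interval 10000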
This discovery also explains another behaviour previously seen - if the test
was run twice within 60 seconds, the second test ran much more poorly,
presumably because many sessions were still in TIME-WAIT, keeping the ephemeral
ports open. When tcp_time_wait_interval was reduced, the second run (after the
interval) was as fast as the first.
In summary, making thousands of TCP connections from the same client to the same
host and port within the tcp_time_wait_interval (60 seconds) runs the risk of
ephemeral port recycling, the SYNs being dropped, and TCP retransmit timeouts.
This means that tcp_strong_iss = 0 does not behave as expected on Solaris
(expected being monotonically increasing), given other traffic on the system
(which can reset tcp_iss_incr_extra).
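(For anyone checking their own systems, the ISS generation mode can be
inspected or changed with ndd, e.g.:

# ndd -get /dev/tcp tcp_strong_iss
# ndd -set /dev/tcp tcp_strong_iss 2

where 0, 1 and 2 are the constant-increment, random-increment and MD5 modes
Dan described above.)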
While researching this, I came across the following block comment in the
FreeBSD source code:
1479 * The ISNs in SYN packets must be monotonic; TIME_WAIT recycling
1480 * depends on this property. In addition, these ISNs should be
1481 * unguessable so as to prevent connection hijacking. To satisfy
1482 * the requirements of this situation, the algorithm outlined in
1483 * RFC 1948 is used, with only small modifications.
We've already learned this - our TIME-WAIT recycling is not always working,
and the sequence numbers in SYNs (ISNs) are not monotonic. I don't think
Solaris is oblivious to the issue - the role of tcps_iss_incr_extra should
help with monotonically increasing ISNs. But, in this case, they are not.