--- 1/draft-ietf-tcpm-rfc2581bis-00.txt 2006-06-23 01:14:36.000000000 +0200 +++ 2/draft-ietf-tcpm-rfc2581bis-01.txt 2006-06-23 01:14:36.000000000 +0200 @@ -1,20 +1,18 @@ Network Working Group M. Allman Internet-Draft V. Paxson -Expires: July 2006 ICIR / ICSI +Expires: December 2006 ICIR / ICSI E. Blanton Purdue University - January 2006 - TCP Congestion Control - draft-ietf-tcpm-rfc2581bis-00.txt + draft-ietf-tcpm-rfc2581bis-01.txt Status of this Memo By submitting this Internet-Draft, each author represents that any applicable patent or other IPR claims of which he or she is aware have been or will be disclosed, and any of which he or she becomes aware will be disclosed, in accordance with Section 6 of BCP 79. Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that @@ -118,31 +116,30 @@ RESTART WINDOW (RW): The restart window is the size of the congestion window after a TCP restarts transmission after an idle period (if the slow start algorithm is used; see section 4.1 for more discussion). FLIGHT SIZE: The amount of data that has been sent but not yet acknowledged. DUPLICATE ACKNOWLEDGMENT: An acknowledgment is considered a - "duplicate" in the following algorithms when (a) the - receiver of the ACK has outstanding data, (b) the incoming - acknowledgment carries no data, (c) the SYN and FIN bits are - both off, (d) the acknowledgment number is equal to the greatest - acknowledgment received on the given connection (TCP.UNA from - [RFC793]) and (e) the advertised window in the incoming - acknowledgment equals the advertised window in the last incoming - acknowledgment. Alternatively, a TCP that utilizes selective - acknowledgments [RFC2018] can determine an incoming ACK is a - "duplicate" if the ACK contains previously unknown SACK - information. 
+ "duplicate" in the following algorithms when (a) the receiver of + the ACK has outstanding data, (b) the incoming acknowledgment + carries no data, (c) the SYN and FIN bits are both off, (d) the + acknowledgment number is equal to the greatest acknowledgment + received on the given connection (TCP.UNA from [RFC793]) and (e) + the advertised window in the incoming acknowledgment equals the + advertised window in the last incoming acknowledgment. + Alternatively, a TCP that utilizes selective acknowledgments + [RFC2018,RFC2883] can determine an incoming ACK is a "duplicate" + if the ACK contains previously unknown SACK information. 3. Congestion Control Algorithms This section defines the four congestion control algorithms: slow start, congestion avoidance, fast retransmit and fast recovery, developed in [Jac88] and [Jac90]. In some situations it may be beneficial for a TCP sender to be more conservative than the algorithms allow, however a TCP MUST NOT be more aggressive than the following algorithms allow (that is, MUST NOT send data when the value of cwnd computed by the following algorithms would not allow @@ -174,33 +171,45 @@ IW, the initial value of cwnd, MUST be set using the following guidelines as an upper bound. If SMSS > 2190 bytes: IW = 2 * SMSS bytes and MUST NOT be more than 2 segments If (SMSS > 1095 bytes) and (SMSS <= 2190 bytes): IW = 3 * SMSS bytes and MUST NOT be more than 3 segments if SMSS <= 1095 bytes: IW = 4 * SMSS bytes and MUST NOT be more than 4 segments + As specified in [RFC3390], the SYN/ACK and the acknowledgment of the + SYN/ACK MUST NOT increase the size of the congestion window. + Further, if the SYN or SYN/ACK is lost, the initial window used by a + sender after a correctly transmitted SYN MUST be one segment + consisting of at most SMSS bytes. + A detailed rationale and discussion of the IW setting is provided in [RFC3390]. 
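The IW guidelines above can be read as a small lookup on SMSS. The following is a minimal sketch of that reading and is not part of the draft text. The function name initial_window_upper_bound and the parameter syn_or_synack_lost are hypothetical names chosen for illustration. The sketch also assumes that the one-segment rule for a lost SYN or SYN/ACK simply replaces the SMSS-based bound.

```python
def initial_window_upper_bound(smss: int, syn_or_synack_lost: bool = False) -> int:
    """Return the upper bound on IW, in bytes, for a given SMSS.

    Hypothetical helper for illustration only; the thresholds are the
    ones quoted in the guidelines above (2190 and 1095 bytes).
    """
    if smss <= 0:
        raise ValueError("SMSS must be a positive number of bytes")

    # If the SYN or SYN/ACK was lost, the initial window is limited to
    # one segment of at most SMSS bytes (assumed here to override the
    # SMSS-based bound below).
    if syn_or_synack_lost:
        return smss

    if smss > 2190:
        return 2 * smss  # at most 2 segments
    if smss > 1095:
        return 3 * smss  # at most 3 segments
    return 4 * smss      # at most 4 segments


if __name__ == "__main__":
    for smss in (536, 1095, 1460, 2190, 2191, 4000):
        print(smss, initial_window_upper_bound(smss))
```

The value returned is an upper bound only; a sender may always choose a smaller initial window.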
When larger initial windows are implemented along with Path MTU Discovery [RFC1191], and the MSS being used is found to be too large, the congestion window cwnd SHOULD be reduced to prevent large bursts of smaller segments. Specifically, cwnd SHOULD be reduced by the ratio of the old segment size to the new segment size. - The initial value of ssthresh SHOULD be arbitrarily high (for - example, some implementations use the size of the advertised - window), but ssthresh MUST be reduced in response to congestion. + The initial value of ssthresh SHOULD be arbitrarily high (e.g., to + the size of the largest possible advertised window), but ssthresh + MUST be reduced in response to congestion. Setting ssthresh as high + as possible allows the network conditions, rather than some + arbitrary host limit, to dictate the sending rate. In cases where + the end systems have a solid understanding of the network path, more + carefully setting the initial ssthresh value may have merit (e.g., + such that the end host does not create congestion along the path). + The slow start algorithm is used when cwnd < ssthresh, while the congestion avoidance algorithm is used when cwnd > ssthresh. When cwnd and ssthresh are equal the sender may use either slow start or congestion avoidance. During slow start, a TCP increments cwnd by at most SMSS bytes for each ACK received that acknowledges new data. Slow start ends when cwnd exceeds ssthresh (or, optionally, when it reaches it, as noted above) or when congestion is observed. While traditionally TCP implementations have increased cwnd by precisely SMSS bytes upon @@ -301,20 +309,28 @@ incidentally increase well beyond rwnd. Furthermore, upon a timeout (as specified in [RFC2988]) cwnd MUST be set to no more than the loss window, LW, which equals 1 full-sized segment (regardless of the value of IW). 
Therefore, after retransmitting the dropped segment the TCP sender uses the slow start algorithm to increase the window from 1 full-sized segment to the new value of ssthresh, at which point congestion avoidance again takes over. + As shown in [FF96,RFC3782], slow start-based loss recovery after a + timeout can cause spurious retransmissions that trigger duplicate + acknowledgments. The reaction to the arrival of these duplicate + ACKs in TCP implementations varies widely. This document does not + specify how to treat such acknowledgments, but does note this as an + area that may benefit from additional attention, experimentation and + specification. + 3.2 Fast Retransmit/Fast Recovery A TCP receiver SHOULD send an immediate duplicate ACK when an out- of-order segment arrives. The purpose of this ACK is to inform the sender that a segment was received out-of-order and which sequence number is expected. From the sender's perspective, duplicate ACKs can be caused by a number of network problems. First, they can be caused by dropped segments. In this case, all segments after the dropped segment will trigger duplicate ACKs until the loss is repaired. Second, duplicate ACKs can be caused by the re-ordering @@ -322,26 +338,26 @@ paths [Pax97]). Finally, duplicate ACKs can be caused by replication of ACK or data segments by the network. In addition, a TCP receiver SHOULD send an immediate ACK when the incoming segment fills in all or part of a gap in the sequence space. This will generate more timely information for a sender recovering from a loss through a retransmission timeout, a fast retransmit, or an advanced loss recovery algorithm, as outlined in section 4.3. The TCP sender SHOULD use the "fast retransmit" algorithm to detect and repair loss, based on incoming duplicate ACKs. The fast - retransmit algorithm uses the arrival of 3 duplicate ACKs (4 - identical ACKs without the arrival of any other intervening packets) - as an indication that a segment has been lost. 
After receiving 3 - duplicate ACKs, TCP performs a retransmission of what appears to be - the missing segment, without waiting for the retransmission timer to - expire. + retransmit algorithm uses the arrival of 3 duplicate ACKs (as + defined in section 2, without any intervening ACKs which move + SND.UNA) as an indication that a segment has been lost. After + receiving 3 duplicate ACKs, TCP performs a retransmission of what + appears to be the missing segment, without waiting for the + retransmission timer to expire. After the fast retransmit algorithm sends what appears to be the missing segment, the "fast recovery" algorithm governs the transmission of new data until a non-duplicate ACK arrives. The reason for not performing slow start is that the receipt of the duplicate ACKs not only indicates that a segment has been lost, but also that segments are most likely leaving the network (although a massive segment duplication by the network can invalidate this conclusion). In other words, since the receiver can only generate a duplicate ACK when a segment has arrived, that segment has left the @@ -554,21 +571,27 @@ that were not discussed in detail in 2001. Specifically, this document suggests what TCP connections should do after a relatively long idle period, as well as specifying and clarifying some of the issues pertaining to TCP ACK generation. Finally, the allowable upper bound for the initial congestion window has also been raised from one to two segments. 7. Changes Relative to RFC 2581 A specific definition for "duplicate acknowledgment" has been - added, based on the definition used by BSD TCP. + added, based on the definition used by BSD TCP. In addition, the + definition explicitly does not take into account the presence (or + absence) of DSACK [RFC2883] information. + + The document now notes that what to do with duplicate ACKs after the + retransmission timer has fired is future work and explicitly + unspecified in this document. 
The initial window requirements were changed to allow Larger Initial Windows as standardized in [RFC3390]. Additionally, the steps to take when an initial window is discovered to be too large due to Path MTU Discovery [RFC1191] are detailed. The recommended initial value for ssthresh has been changed to say that it SHOULD be arbitrarily high, where it was previously MAY. This is to provide additional guidance to implementors on the matter. @@ -616,22 +639,23 @@ We wish to emphasize that the shortcomings and mistakes of this document are solely the responsibility of the current authors. Some of the text from this document is taken from "TCP/IP Illustrated, Volume 1: The Protocols" by W. Richard Stevens (Addison-Wesley, 1994) and "TCP/IP Illustrated, Volume 2: The Implementation" by Gary R. Wright and W. Richard Stevens (Addison- Wesley, 1995). This material is used with the permission of Addison-Wesley. - Neal Cardwell, Noritoshi Demizu, Kevin Fall, Sally Floyd, Craig - Partridge and Joe Touch contributed a number of helpful suggestions. + Steve Arden, Neal Cardwell, Noritoshi Demizu, Kevin Fall, John + Heffner, Sally Floyd, Reiner Ludwig, Matt Mathis, Craig Partridge + and Joe Touch contributed a number of helpful suggestions. Normative References [RFC793] Postel, J., "Transmission Control Protocol", STD 7, RFC 793, September 1981. [RFC1122] Braden, R., "Requirements for Internet Hosts -- Communication Layers", STD 3, RFC 1122, October 1989. [RFC1191] Mogul, J. and S. Deering, "Path MTU Discovery", RFC 1191, @@ -694,20 +718,24 @@ [RFC2414] Allman, M., Floyd, S. and C. Partridge, "Increasing TCP's Initial Window Size", RFC 2414, September 1998. [RFC2525] Paxson, V., Allman, M., Dawson, S., Fenner, W., Griner, J., Heavens, I., Lahey, K., Semke, J. and B. Volz, "Known TCP Implementation Problems", RFC 2525, March 1999. [RFC2581] Allman, M., Paxson, V., W. Stevens, TCP Congestion Control, RFC 2581, April 1999. + [RFC2883] Floyd, S., J. Mahdavi, M. Mathis, M. 
Podolsky, An + Extension to the Selective Acknowledgement (SACK) Option for + TCP, RFC 2883, July 2000. + [RFC2988] V. Paxson and M. Allman, "Computing TCP's Retransmission Timer", RFC 2988, November 2000. [RFC3042] Allman, M., Balakrishnan, H. and S. Floyd, "Enhancing TCP's Loss Recovery Using Limited Transmit", RFC 3042, January 2001. [RFC3465] Mark Allman, TCP Congestion Control with Appropriate Byte Counting (ABC), RFC 3465, February 2003. @@ -729,21 +757,21 @@ [WS95] Wright, G. and W. Stevens, "TCP/IP Illustrated, Volume 2: The Implementation", Addison-Wesley, 1995. Authors' Addresses Mark Allman ICIR / ICSI 1947 Center Street Suite 600 Berkeley, CA 94704-1198 - Phone: +1 440 243 7361 + Phone: +1 440 235 1792 EMail: mallman@icir.org http://www.icir.org/mallman/ Vern Paxson ICIR / ICSI 1947 Center Street Suite 600 Berkeley, CA 94704-1198 Phone: +1 510/642-4274 x302 EMail: vern@icir.org