draft-ietf-tcpm-rfc2581bis-01.txt   draft-ietf-tcpm-rfc2581bis-02.txt 
Network Working Group M. Allman Network Working Group M. Allman
Internet-Draft V. Paxson Internet-Draft V. Paxson
Expires: December 2006 ICIR / ICSI Expires: August 2007 ICIR / ICSI
E. Blanton E. Blanton
Purdue University Purdue University
February 2007
TCP Congestion Control TCP Congestion Control
draft-ietf-tcpm-rfc2581bis-01.txt draft-ietf-tcpm-rfc2581bis-02.txt
Status of this Memo Status of this Memo
By submitting this Internet-Draft, each author represents that any By submitting this Internet-Draft, each author represents that any
applicable patent or other IPR claims of which he or she is aware applicable patent or other IPR claims of which he or she is aware
have been or will be disclosed, and any of which he or she becomes have been or will be disclosed, and any of which he or she becomes
aware will be disclosed, in accordance with Section 6 of BCP 79. aware will be disclosed, in accordance with Section 6 of BCP 79.
Internet-Drafts are working documents of the Internet Engineering Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF), its areas, and its working groups. Note that Task Force (IETF), its areas, and its working groups. Note that
skipping to change at page 1, line 36 skipping to change at page 1, line 38
reference material or to cite them other than as "work in progress." reference material or to cite them other than as "work in progress."
The list of current Internet-Drafts can be accessed at The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt. http://www.ietf.org/ietf/1id-abstracts.txt.
The list of Internet-Draft Shadow Directories can be accessed at The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html. http://www.ietf.org/shadow.html.
Copyright Notice Copyright Notice
Copyright (C) The Internet Society (2006). Copyright (C) The Internet Society (2007).
Abstract Abstract
This document defines TCP's four intertwined congestion control This document defines TCP's four intertwined congestion control
algorithms: slow start, congestion avoidance, fast retransmit, and algorithms: slow start, congestion avoidance, fast retransmit, and
fast recovery. In addition, the document specifies how TCP should fast recovery. In addition, the document specifies how TCP should
begin transmission after a relatively long idle period, as well as begin transmission after a relatively long idle period, as well as
discussing various acknowledgment generation methods. discussing various acknowledgment generation methods.
1. Introduction 1. Introduction
This document specifies four TCP [RFC793] congestion control This document specifies four TCP [RFC793] congestion control
algorithms: slow start, congestion avoidance, fast retransmit and algorithms: slow start, congestion avoidance, fast retransmit and
fast recovery. These algorithms were devised in [Jac88] and fast recovery. These algorithms were devised in [Jac88] and
[Jac90]. Their use with TCP is standardized in [RFC1122]. Additional [Jac90]. Their use with TCP is standardized in [RFC1122]. Additional
early work in additive-increase, multiplicative-decrease congestion early work in additive-increase, multiplicative-decrease congestion
control is given in [CJ89]. control is given in [CJ89].
This document is an update of [RFC2001] and [RFC2581]. This document obsoletes [RFC2581] which in turn obsoleted
[RFC2001].
In addition to specifying the congestion control algorithms, this In addition to specifying the congestion control algorithms, this
document specifies what TCP connections should do after a relatively document specifies what TCP connections should do after a relatively
long idle period, as well as specifying and clarifying some of the long idle period, as well as specifying and clarifying some of the
issues pertaining to TCP ACK generation. issues pertaining to TCP ACK generation.
Note that [Ste94] provides examples of these algorithms in action Note that [Ste94] provides examples of these algorithms in action
and [WS95] provides an explanation of the source code for the BSD and [WS95] provides an explanation of the source code for the BSD
implementation of these algorithms. implementation of these algorithms.
skipping to change at page 2, line 53 skipping to change at page 2, line 54
largest segment the receiver is willing to accept. This is the largest segment the receiver is willing to accept. This is the
value specified in the MSS option sent by the receiver during value specified in the MSS option sent by the receiver during
connection startup. Or, if the MSS option is not used, 536 connection startup. Or, if the MSS option is not used, 536
bytes [RFC1122]. The size does not include the TCP/IP headers and bytes [RFC1122]. The size does not include the TCP/IP headers and
options. options.
FULL-SIZED SEGMENT: A segment that contains the maximum number of FULL-SIZED SEGMENT: A segment that contains the maximum number of
data bytes permitted (i.e., a segment containing SMSS bytes of data bytes permitted (i.e., a segment containing SMSS bytes of
data). data).
RECEIVER WINDOW (rwnd) The most recently advertised receiver window. RECEIVER WINDOW (rwnd): The most recently advertised receiver
window.
CONGESTION WINDOW (cwnd): A TCP state variable that limits the CONGESTION WINDOW (cwnd): A TCP state variable that limits the
amount of data a TCP can send. At any given time, a TCP MUST amount of data a TCP can send. At any given time, a TCP MUST
NOT send data with a sequence number higher than the sum of the NOT send data with a sequence number higher than the sum of the
highest acknowledged sequence number and the minimum of cwnd and highest acknowledged sequence number and the minimum of cwnd and
rwnd. rwnd.
INITIAL WINDOW (IW): The initial window is the size of the sender's INITIAL WINDOW (IW): The initial window is the size of the sender's
congestion window after the three-way handshake is completed. congestion window after the three-way handshake is completed.
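The cwnd definition above amounts to a simple bound on what a sender may transmit. The following is a minimal sketch of that bound, not text from either draft. The function names are hypothetical, all quantities are in bytes, and sequence-number wraparound is ignored for brevity.

```python
def highest_sendable_seq(highest_acked, cwnd, rwnd):
    """Upper limit on sequence numbers the sender may transmit.

    Per the cwnd definition: highest acknowledged sequence number plus
    the minimum of cwnd and rwnd. Wraparound is ignored in this sketch.
    """
    return highest_acked + min(cwnd, rwnd)


def may_send(segment_end_seq, highest_acked, cwnd, rwnd):
    """True if a segment ending at segment_end_seq stays within the limit."""
    return segment_end_seq <= highest_sendable_seq(highest_acked, cwnd, rwnd)
```

For example, with highest_acked=1000, cwnd=4000 and rwnd=3000, the receiver window is the binding limit and data up to sequence number 4000 may be sent.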
skipping to change at page 4, line 39 skipping to change at page 4, line 42
A detailed rationale and discussion of the IW setting is provided in A detailed rationale and discussion of the IW setting is provided in
[RFC3390]. [RFC3390].
When larger initial windows are implemented along with Path MTU When larger initial windows are implemented along with Path MTU
Discovery [RFC1191], and the MSS being used is found to be too Discovery [RFC1191], and the MSS being used is found to be too
large, the congestion window cwnd SHOULD be reduced to prevent large, the congestion window cwnd SHOULD be reduced to prevent
large bursts of smaller segments. Specifically, cwnd SHOULD be large bursts of smaller segments. Specifically, cwnd SHOULD be
reduced by the ratio of the old segment size to the new segment reduced by the ratio of the old segment size to the new segment
size. size.
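One possible reading of the Path MTU Discovery adjustment is sketched below. It assumes cwnd is kept in bytes and interprets "reduced by the ratio of the old segment size to the new segment size" as dividing cwnd by old/new, so that the window covers the same number of segments at the smaller size. The function name is hypothetical and the interpretation is an assumption, not wording from either draft.

```python
def rescale_cwnd_for_new_mss(cwnd, old_mss, new_mss):
    """Scale a byte-based cwnd when PMTUD lowers the segment size.

    Dividing by (old_mss / new_mss) is the same as multiplying by
    (new_mss / old_mss); integer arithmetic rounds the result down.
    """
    return cwnd * new_mss // old_mss
```

With cwnd = 4380 bytes (three 1460-byte segments) and a new segment size of 536 bytes, the sketch yields 1608 bytes, which is still three segments.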
The initial value of ssthresh SHOULD be arbitrarily high (e.g., to The initial value of ssthresh SHOULD be set arbitrarily high (e.g.,
the size of the largest possible advertised window), but ssthresh to the size of the largest possible advertised window), but ssthresh
MUST be reduced in response to congestion. Setting ssthresh as high MUST be reduced in response to congestion. Setting ssthresh as high
as possible allows the network conditions, rather than some as possible allows the network conditions, rather than some
arbitrary host limit, to dictate the sending rate. In cases where arbitrary host limit, to dictate the sending rate. In cases where
the end systems have a solid understanding of the network path, more the end systems have a solid understanding of the network path, more
carefully setting the initial ssthresh value may have merit (e.g., carefully setting the initial ssthresh value may have merit (e.g.,
such that the end host does not create congestion along the path). such that the end host does not create congestion along the path).
The slow start algorithm is used when cwnd < ssthresh, while the The slow start algorithm is used when cwnd < ssthresh, while the
congestion avoidance algorithm is used when cwnd > ssthresh. When congestion avoidance algorithm is used when cwnd > ssthresh. When
cwnd and ssthresh are equal the sender may use either slow start or cwnd and ssthresh are equal the sender may use either slow start or
skipping to change at page 6, line 4 skipping to change at page 6, line 6
cwnd += SMSS*SMSS/cwnd (3) cwnd += SMSS*SMSS/cwnd (3)
This adjustment is executed on every incoming ACK that acknowledges This adjustment is executed on every incoming ACK that acknowledges
new data. new data.
Equation (3) provides an acceptable approximation to the underlying Equation (3) provides an acceptable approximation to the underlying
principle of increasing cwnd by 1 full-sized segment per RTT. (Note principle of increasing cwnd by 1 full-sized segment per RTT. (Note
that for a connection in which the receiver is acknowledging that for a connection in which the receiver is acknowledging
every-other packet, (3) is less aggressive than allowed -- roughly every-other packet, (3) is less aggressive than allowed -- roughly
increasing cwnd every second RTT.) increasing cwnd every second RTT.)
Implementation Note: Since integer arithmetic is usually used in TCP Implementation Note: Since integer arithmetic is usually used in TCP
implementations, the formula given in equation 3 can fail to implementations, the formula given in equation 3 can fail to
increase cwnd when the congestion window is larger than SMSS*SMSS. increase cwnd when the congestion window is larger than SMSS*SMSS.
If the above formula yields 0, the result SHOULD be rounded up to 1 If the above formula yields 0, the result SHOULD be rounded up to 1
byte. byte.
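The integer-arithmetic note can be illustrated with a short sketch of equation (3), assuming cwnd and SMSS are both kept in bytes. The function name is hypothetical.

```python
def congestion_avoidance_increment(cwnd, smss):
    """Per-ACK cwnd increase from equation (3), using integer division.

    When cwnd is larger than SMSS*SMSS the quotient is 0, so the
    result is rounded up to 1 byte as the implementation note suggests.
    """
    increment = (smss * smss) // cwnd
    return max(increment, 1)
```

A sender would apply `cwnd += congestion_avoidance_increment(cwnd, smss)` on each incoming ACK that acknowledges new data.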
Implementation Note: older implementations have an additional Implementation Note: Older implementations have an additional
additive constant on the right-hand side of equation (3). This is additive constant on the right-hand side of equation (3). This is
incorrect and can actually lead to diminished performance [RFC2525]. incorrect and can actually lead to diminished performance [RFC2525].
Implementation Note: some implementations maintain cwnd in units of Implementation Note: Some implementations maintain cwnd in units of
bytes, while others in units of full-sized segments. The latter bytes, while others in units of full-sized segments. The latter
will find equation (3) difficult to use, and may prefer to use the will find equation (3) difficult to use, and may prefer to use the
counting approach discussed in the previous paragraph. counting approach discussed in the previous paragraph.
When a TCP sender detects segment loss using the retransmission When a TCP sender detects segment loss using the retransmission
timer and the given segment has not yet been retransmitted, the timer and the given segment has not yet been retransmitted, the
value of ssthresh MUST be set to no more than the value given in value of ssthresh MUST be set to no more than the value given in
equation 4: equation 4:
ssthresh = max (FlightSize / 2, 2*SMSS) (4) ssthresh = max (FlightSize / 2, 2*SMSS) (4)
where, as discussed above, FlightSize is the amount of outstanding where, as discussed above, FlightSize is the amount of outstanding
data in the network. data in the network.
On the other hand, when a TCP sender detects segment loss using the On the other hand, when a TCP sender detects segment loss using the
retransmission timer and the given segment has already been retransmission timer and the given segment has already been
retransmitted at least once, the value of ssthresh MUST be set to no retransmitted at least once, the value of ssthresh is held
more than the value given in equation 5: constant.
ssthresh = max (ssthresh / 2, 2*SMSS) (5)
In other words, upon the first retransmission of a segment the value
of ssthresh should be set to half the amount of outstanding data in
the network, whereas on subsequent retransmissions the value of
ssthresh should simply be halved.
Implementation Note: an easy mistake to make is to simply use cwnd, Implementation Note: An easy mistake to make is to simply use cwnd,
rather than FlightSize, which in some implementations may rather than FlightSize, which in some implementations may
incidentally increase well beyond rwnd. incidentally increase well beyond rwnd.
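The ssthresh handling in the right-hand (-02) column can be sketched as follows. This is an illustration only: the function and parameter names are hypothetical, values are in bytes, and the sketch sets ssthresh to the upper bound allowed by equation (4) although the text permits any value no larger than that.

```python
def ssthresh_after_rto(flight_size, smss, ssthresh, already_retransmitted):
    """Return the new ssthresh after a retransmission timeout.

    First timeout for a segment: equation (4), based on FlightSize
    (not cwnd). Later timeouts for the same segment: held constant.
    """
    if already_retransmitted:
        return ssthresh
    return max(flight_size // 2, 2 * smss)
```

The left-hand (-01) column differs only in the repeated-timeout case, where equation (5) halves ssthresh again instead of holding it constant.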
Furthermore, upon a timeout (as specified in [RFC2988]) cwnd MUST be Furthermore, upon a timeout (as specified in [RFC2988]) cwnd MUST be
set to no more than the loss window, LW, which equals 1 full-sized set to no more than the loss window, LW, which equals 1 full-sized
segment (regardless of the value of IW). Therefore, after segment (regardless of the value of IW). Therefore, after
retransmitting the dropped segment the TCP sender uses the slow retransmitting the dropped segment the TCP sender uses the slow
start algorithm to increase the window from 1 full-sized segment to start algorithm to increase the window from 1 full-sized segment to
the new value of ssthresh, at which point congestion avoidance again the new value of ssthresh, at which point congestion avoidance again
takes over. takes over.
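To make the post-timeout behavior concrete, the sketch below restarts from the loss window and counts ACKs until cwnd reaches ssthresh. It rests on assumptions that are not stated in this excerpt: each ACK acknowledges exactly one full-sized segment, slow start adds SMSS bytes per such ACK, and the sender leaves slow start as soon as cwnd is no longer below ssthresh. The function name is hypothetical.

```python
def slow_start_after_timeout(smss, ssthresh):
    """Grow cwnd from LW (1 full-sized segment) up to ssthresh.

    Returns (acks_needed, final_cwnd). Assumes one full-sized segment
    is acknowledged per ACK, so each ACK adds SMSS bytes to cwnd.
    """
    cwnd = smss  # loss window, regardless of the value of IW
    acks = 0
    while cwnd < ssthresh:
        cwnd += smss
        acks += 1
    return acks, cwnd
```

With SMSS = 1000 and ssthresh = 8000, seven such ACKs bring cwnd to 8000 bytes, after which congestion avoidance applies.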
skipping to change at page 8, line 23 skipping to change at page 8, line 18
3. The lost segment MUST be retransmitted and cwnd set to 3. The lost segment MUST be retransmitted and cwnd set to
ssthresh plus 3*SMSS. This artificially "inflates" the ssthresh plus 3*SMSS. This artificially "inflates" the
congestion window by the number of segments (three) that have congestion window by the number of segments (three) that have
left the network and which the receiver has buffered. left the network and which the receiver has buffered.
4. For each additional duplicate ACK received (after the third), 4. For each additional duplicate ACK received (after the third),
cwnd MUST be incremented by SMSS. This artificially inflates cwnd MUST be incremented by SMSS. This artificially inflates
the congestion window in order to reflect the additional segment the congestion window in order to reflect the additional segment
that has left the network. that has left the network.
Note: [SCWA99] discusses a receiver-based attack whereby many
bogus duplicate ACKs are sent to the data sender in order to
artificially inflate cwnd and cause a higher than appropriate
sending rate to be used. A TCP MAY therefore limit the number
of times cwnd is artificially inflated during loss recovery
to the number of outstanding segments (or, an approximation
thereof).
5. Transmit a segment, if allowed by the new value of cwnd and the 5. Transmit a segment, if allowed by the new value of cwnd and the
receiver's advertised window. receiver's advertised window.
6. When the next ACK arrives that acknowledges new data, a TCP 6. When the next ACK arrives that acknowledges previously
MUST set cwnd to ssthresh (the value set in step 1). This is unacknowledged data, a TCP MUST set cwnd to ssthresh (the value
termed "deflating" the window. set in step 2). This is termed "deflating" the window.
This ACK should be the acknowledgment elicited by the This ACK should be the acknowledgment elicited by the
retransmission from step 1, one RTT after the retransmission retransmission from step 3, one RTT after the retransmission
(though it may arrive sooner in the presence of significant out- (though it may arrive sooner in the presence of significant out-
of-order delivery of data segments at the of-order delivery of data segments at the receiver).
receiver). Additionally, this ACK should acknowledge all the Additionally, this ACK should acknowledge all the intermediate
intermediate segments sent between the lost segment and the segments sent between the lost segment and the receipt of the
receipt of the third duplicate ACK, if none of these were lost. third duplicate ACK, if none of these were lost.
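Steps 3 through 6 can be sketched as a small state holder. This is an illustration under stated assumptions, not an implementation from either draft: the class and method names are hypothetical, cwnd is kept in bytes, retransmission and sending are not modeled, and step 2 (elided in this excerpt) is assumed to set ssthresh using equation (4). The optional [SCWA99] limit from the right-hand column is modeled by one possible reading, in which the three inflations of step 3 count against a budget of FlightSize/SMSS.

```python
class FastRecoverySketch:
    """Window bookkeeping for fast retransmit/fast recovery, in bytes."""

    def __init__(self, cwnd, ssthresh, smss):
        self.cwnd = cwnd
        self.ssthresh = ssthresh
        self.smss = smss
        self.in_recovery = False
        self.inflations_left = 0

    def on_third_dupack(self, flight_size):
        # Step 2 (assumed): ssthresh from equation (4).
        self.ssthresh = max(flight_size // 2, 2 * self.smss)
        # Step 3: inflate by the three segments that left the network.
        self.cwnd = self.ssthresh + 3 * self.smss
        # Optional limit: outstanding segments minus the three just used.
        self.inflations_left = max(flight_size // self.smss - 3, 0)
        self.in_recovery = True

    def on_additional_dupack(self):
        # Step 4: inflate by SMSS, subject to the optional limit.
        if self.in_recovery and self.inflations_left > 0:
            self.cwnd += self.smss
            self.inflations_left -= 1

    def on_new_ack(self):
        # Step 6: deflate the window to ssthresh.
        if self.in_recovery:
            self.cwnd = self.ssthresh
            self.in_recovery = False
```

For example, with SMSS = 1000 and 10000 bytes in flight, the third duplicate ACK sets ssthresh to 5000 and cwnd to 8000. Further duplicate ACKs raise cwnd by 1000 each, up to 15000 under the assumed limit, and the next ACK for new data deflates cwnd to 5000.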
Note: This algorithm is known to generally not recover efficiently Note: This algorithm is known to generally not recover efficiently
from multiple losses in a single flight of packets [FF96]. Section from multiple losses in a single flight of packets [FF96]. Section
4.3 below addresses such cases. 4.3 below addresses such cases.
4. Additional Considerations 4. Additional Considerations
4.1 Re-starting Idle Connections 4.1 Re-starting Idle Connections
A known problem with the TCP congestion control algorithms described A known problem with the TCP congestion control algorithms described
skipping to change at page 11, line 47 skipping to change at page 11, line 49
that were not discussed in detail in 2001. Specifically, this that were not discussed in detail in 2001. Specifically, this
document suggests what TCP connections should do after a relatively document suggests what TCP connections should do after a relatively
long idle period, as well as specifying and clarifying some of the long idle period, as well as specifying and clarifying some of the
issues pertaining to TCP ACK generation. Finally, the allowable issues pertaining to TCP ACK generation. Finally, the allowable
upper bound for the initial congestion window has also been raised upper bound for the initial congestion window has also been raised
from one to two segments. from one to two segments.
7. Changes Relative to RFC 2581 7. Changes Relative to RFC 2581
A specific definition for "duplicate acknowledgment" has been A specific definition for "duplicate acknowledgment" has been
added, based on the definition used by BSD TCP. In addition, the added, based on the definition used by BSD TCP.
definition explicitly does not take into account the presence (or
absence) of DSACK [RFC2883] information.
The document now notes that what to do with duplicate ACKs after the The document now notes that what to do with duplicate ACKs after the
retransmission timer has fired is future work and explicitly retransmission timer has fired is future work and explicitly
unspecified in this document. unspecified in this document.
The initial window requirements were changed to allow Larger The initial window requirements were changed to allow Larger
Initial Windows as standardized in [RFC3390]. Additionally, the Initial Windows as standardized in [RFC3390]. Additionally, the
steps to take when an initial window is discovered to be too large steps to take when an initial window is discovered to be too large
due to Path MTU Discovery [RFC1191] are detailed. due to Path MTU Discovery [RFC1191] are detailed.
skipping to change at page 12, line 17 skipping to change at page 12, line 18
This is to provide additional guidance to implementors on the This is to provide additional guidance to implementors on the
matter. matter.
During slow start, the usage of Appropriate Byte Counting [RFC3465] During slow start, the usage of Appropriate Byte Counting [RFC3465]
with L=1*SMSS is explicitly recommended. The method of increasing with L=1*SMSS is explicitly recommended. The method of increasing
cwnd given in [RFC2581] is still explicitly allowed. Byte counting cwnd given in [RFC2581] is still explicitly allowed. Byte counting
during congestion avoidance is also recommended, while the method during congestion avoidance is also recommended, while the method
from [RFC2581] and other safe methods are still allowed. from [RFC2581] and other safe methods are still allowed.
The treatment of ssthresh on retransmission timeout was clarified. The treatment of ssthresh on retransmission timeout was clarified.
Specifically, Equation (3) from [RFC2581] was split into Equations In particular, ssthresh must be set to half the FlightSize on the
(4) and (5) in this document. first retransmission of a given segment and then is held constant on
subsequent retransmissions of the same segment.
The description of fast retransmit and fast recovery has been The description of fast retransmit and fast recovery has been
clarified, and the use of Limited Transmit [RFC3042] is now clarified, and the use of Limited Transmit [RFC3042] is now
recommended. recommended.
TCPs now MAY limit the number of duplicate ACKs that artificially
inflate cwnd during loss recovery to the number of segments
outstanding to avoid the duplicate ACK spoofing attack described in
[SCWA99].
The restart window has been changed to min(IW,cwnd) from IW. This The restart window has been changed to min(IW,cwnd) from IW. This
behavior was described as "experimental" in [RFC2581]. behavior was described as "experimental" in [RFC2581].
It is now recommended that TCP implementors implement an advanced It is now recommended that TCP implementors implement an advanced
loss recovery algorithm conforming to the principles outlined in loss recovery algorithm conforming to the principles outlined in
this document. this document.
The security considerations have been updated to discuss ACK The security considerations have been updated to discuss ACK
division and recommend byte counting as a counter to this attack. division and recommend byte counting as a counter to this attack.
skipping to change at page 13, line 7 skipping to change at page 13, line 14
document are solely the responsibility of the current authors. document are solely the responsibility of the current authors.
Some of the text from this document is taken from "TCP/IP Some of the text from this document is taken from "TCP/IP
Illustrated, Volume 1: The Protocols" by W. Richard Stevens Illustrated, Volume 1: The Protocols" by W. Richard Stevens
(Addison-Wesley, 1994) and "TCP/IP Illustrated, Volume 2: The (Addison-Wesley, 1994) and "TCP/IP Illustrated, Volume 2: The
Implementation" by Gary R. Wright and W. Richard Stevens (Addison- Implementation" by Gary R. Wright and W. Richard Stevens (Addison-
Wesley, 1995). This material is used with the permission of Wesley, 1995). This material is used with the permission of
Addison-Wesley. Addison-Wesley.
Steve Arden, Neal Cardwell, Noritoshi Demizu, Kevin Fall, John Steve Arden, Neal Cardwell, Noritoshi Demizu, Kevin Fall, John
Heffner, Sally Floyd, Reiner Ludwig, Matt Mathis, Craig Partridge Heffner, Alfred Hoenes, Sally Floyd, Reiner Ludwig, Matt Mathis,
and Joe Touch contributed a number of helpful suggestions. Craig Partridge and Joe Touch contributed a number of helpful
suggestions.
Normative References Normative References
[RFC793] Postel, J., "Transmission Control Protocol", STD 7, RFC [RFC793] Postel, J., "Transmission Control Protocol", STD 7, RFC
793, September 1981. 793, September 1981.
[RFC1122] Braden, R., "Requirements for Internet Hosts -- [RFC1122] Braden, R., "Requirements for Internet Hosts --
Communication Layers", STD 3, RFC 1122, October 1989. Communication Layers", STD 3, RFC 1122, October 1989.
[RFC1191] Mogul, J. and S. Deering, "Path MTU Discovery", RFC 1191, [RFC1191] Mogul, J. and S. Deering, "Path MTU Discovery", RFC 1191,
skipping to change at page 14, line 42 skipping to change at page 14, line 51
Extension to the Selective Acknowledgement (SACK) Option for Extension to the Selective Acknowledgement (SACK) Option for
TCP, RFC 2883, July 2000. TCP, RFC 2883, July 2000.
[RFC2988] V. Paxson and M. Allman, "Computing TCP's Retransmission [RFC2988] V. Paxson and M. Allman, "Computing TCP's Retransmission
Timer", RFC 2988, November 2000. Timer", RFC 2988, November 2000.
[RFC3042] Allman, M., Balakrishnan, H. and S. Floyd, "Enhancing [RFC3042] Allman, M., Balakrishnan, H. and S. Floyd, "Enhancing
TCP's Loss Recovery Using Limited Transmit", RFC 3042, January TCP's Loss Recovery Using Limited Transmit", RFC 3042, January
2001. 2001.
[RFC3390] Allman, M., Floyd, S., C. Partridge, "Increasing TCP's
Initial Window", RFC 3390, October 2002.
[RFC3465] Mark Allman, TCP Congestion Control with Appropriate Byte [RFC3465] Mark Allman, TCP Congestion Control with Appropriate Byte
Counting (ABC), RFC 3465, February 2003. Counting (ABC), RFC 3465, February 2003.
[RFC3517] Ethan Blanton, Mark Allman, Kevin Fall, Lili Wang, A [RFC3517] Ethan Blanton, Mark Allman, Kevin Fall, Lili Wang, A
Conservative Selective Acknowledgment (SACK)-based Loss Recovery Conservative Selective Acknowledgment (SACK)-based Loss Recovery
Algorithm for TCP, RFC 3517, April 2003. Algorithm for TCP, RFC 3517, April 2003.
[RFC3782] Sally Floyd, Tom Henderson, Andrei Gurtov, The NewReno [RFC3782] Sally Floyd, Tom Henderson, Andrei Gurtov, The NewReno
Modification to TCP's Fast Recovery Algorithm, RFC 3782, April Modification to TCP's Fast Recovery Algorithm, RFC 3782, April
2004. 2004.
skipping to change at page 16, line 10 skipping to change at page 16, line 21
at http://www.ietf.org/ipr. at http://www.ietf.org/ipr.
The IETF invites any interested party to bring to its attention any The IETF invites any interested party to bring to its attention any
copyrights, patents or patent applications, or other proprietary copyrights, patents or patent applications, or other proprietary
rights that may cover technology that may be required to implement rights that may cover technology that may be required to implement
this standard. Please address the information to the IETF at this standard. Please address the information to the IETF at
ietf-ipr@ietf.org. ietf-ipr@ietf.org.
Disclaimer of Validity Disclaimer of Validity
This document and the information contained herein are provided on This document and the information contained herein are provided
an "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE on an "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE
REPRESENTS OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE REPRESENTS OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY, THE
INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IETF TRUST AND THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL
IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY
THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTY THAT THE USE OF THE INFORMATION HEREIN WILL NOT INFRINGE
WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. ANY RIGHTS OR ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS
FOR A PARTICULAR PURPOSE.
Copyright Statement Copyright Statement
Copyright (C) The Internet Society (2006). This document is subject Copyright (C) The IETF Trust (2007). This document is subject to
to the rights, licenses and restrictions contained in BCP 78, and the rights, licenses and restrictions contained in BCP 78, and
except as set forth therein, the authors retain all their rights. except as set forth therein, the authors retain all their rights.
Acknowledgment Acknowledgment
Funding for the RFC Editor function is currently provided by the Funding for the RFC Editor function is currently provided by the
Internet Society. Internet Society.
 End of changes. 23 change blocks. 
43 lines changed or deleted 58 lines changed or added
