draft-ietf-tcpm-rfc2581bis-00.txt | draft-ietf-tcpm-rfc2581bis-01.txt | |||
---|---|---|---|---|
Network Working Group M. Allman | Network Working Group M. Allman | |||
Internet-Draft V. Paxson | Internet-Draft V. Paxson | |||
Expires: July 2006 ICIR / ICSI | Expires: December 2006 ICIR / ICSI | |||
E. Blanton | E. Blanton | |||
Purdue University | Purdue University | |||
January 2006 | ||||
TCP Congestion Control | TCP Congestion Control | |||
draft-ietf-tcpm-rfc2581bis-00.txt | draft-ietf-tcpm-rfc2581bis-01.txt | |||
Status of this Memo | Status of this Memo | |||
By submitting this Internet-Draft, each author represents that any | By submitting this Internet-Draft, each author represents that any | |||
applicable patent or other IPR claims of which he or she is aware | applicable patent or other IPR claims of which he or she is aware | |||
have been or will be disclosed, and any of which he or she becomes | have been or will be disclosed, and any of which he or she becomes | |||
aware will be disclosed, in accordance with Section 6 of BCP 79. | aware will be disclosed, in accordance with Section 6 of BCP 79. | |||
Internet-Drafts are working documents of the Internet Engineering | Internet-Drafts are working documents of the Internet Engineering | |||
Task Force (IETF), its areas, and its working groups. Note that | Task Force (IETF), its areas, and its working groups. Note that | |||
skipping to change at page 3, line 22 | skipping to change at page 3, line 22 | |||
RESTART WINDOW (RW): The restart window is the size of the | RESTART WINDOW (RW): The restart window is the size of the | |||
congestion window after a TCP restarts transmission after an | congestion window after a TCP restarts transmission after an | |||
idle period (if the slow start algorithm is used; see section | idle period (if the slow start algorithm is used; see section | |||
4.1 for more discussion). | 4.1 for more discussion). | |||
FLIGHT SIZE: The amount of data that has been sent but not yet | FLIGHT SIZE: The amount of data that has been sent but not yet | |||
acknowledged. | acknowledged. | |||
DUPLICATE ACKNOWLEDGMENT: An acknowledgment is considered a | DUPLICATE ACKNOWLEDGMENT: An acknowledgment is considered a | |||
"duplicate" in the following algorithms when (a) the | "duplicate" in the following algorithms when (a) the receiver of | |||
receiver of the ACK has outstanding data, (b) the incoming | the ACK has outstanding data, (b) the incoming acknowledgment | |||
acknowledgment carries no data, (c) the SYN and FIN bits are | carries no data, (c) the SYN and FIN bits are both off, (d) the | |||
both off, (d) the acknowledgment number is equal to the greatest | acknowledgment number is equal to the greatest acknowledgment | |||
acknowledgment received on the given connection (TCP.UNA from | received on the given connection (TCP.UNA from [RFC793]) and (e) | |||
[RFC793]) and (e) the advertised window in the incoming | the advertised window in the incoming acknowledgment equals the | |||
acknowledgment equals the advertised window in the last incoming | advertised window in the last incoming acknowledgment. | |||
acknowledgment. Alternatively, a TCP that utilizes selective | Alternatively, a TCP that utilizes selective acknowledgments | |||
acknowledgments [RFC2018] can determine an incoming ACK is a | [RFC2018,RFC2883] can determine an incoming ACK is a "duplicate" | |||
"duplicate" if the ACK contains previously unknown SACK | if the ACK contains previously unknown SACK information. | |||
information. | ||||
3. Congestion Control Algorithms | 3. Congestion Control Algorithms | |||
This section defines the four congestion control algorithms: slow | This section defines the four congestion control algorithms: slow | |||
start, congestion avoidance, fast retransmit and fast recovery, | start, congestion avoidance, fast retransmit and fast recovery, | |||
developed in [Jac88] and [Jac90]. In some situations it may be | developed in [Jac88] and [Jac90]. In some situations it may be | |||
beneficial for a TCP sender to be more conservative than the | beneficial for a TCP sender to be more conservative than the | |||
algorithms allow, however a TCP MUST NOT be more aggressive than the | algorithms allow, however a TCP MUST NOT be more aggressive than the | |||
following algorithms allow (that is, MUST NOT send data when the | following algorithms allow (that is, MUST NOT send data when the | |||
value of cwnd computed by the following algorithms would not allow | value of cwnd computed by the following algorithms would not allow | |||
skipping to change at page 4, line 24 | skipping to change at page 4, line 23 | |||
IW, the initial value of cwnd, MUST be set using the following | IW, the initial value of cwnd, MUST be set using the following | |||
guidelines as an upper bound. | guidelines as an upper bound. | |||
If SMSS > 2190 bytes: | If SMSS > 2190 bytes: | |||
IW = 2 * SMSS bytes and MUST NOT be more than 2 segments | IW = 2 * SMSS bytes and MUST NOT be more than 2 segments | |||
If (SMSS > 1095 bytes) and (SMSS <= 2190 bytes): | If (SMSS > 1095 bytes) and (SMSS <= 2190 bytes): | |||
IW = 3 * SMSS bytes and MUST NOT be more than 3 segments | IW = 3 * SMSS bytes and MUST NOT be more than 3 segments | |||
if SMSS <= 1095 bytes: | if SMSS <= 1095 bytes: | |||
IW = 4 * SMSS bytes and MUST NOT be more than 4 segments | IW = 4 * SMSS bytes and MUST NOT be more than 4 segments | |||
As specified in [RFC3390], the SYN/ACK and the acknowledgment of the | ||||
SYN/ACK MUST NOT increase the size of the congestion window. | ||||
Further, if the SYN or SYN/ACK is lost, the initial window used by a | ||||
sender after a correctly transmitted SYN MUST be one segment | ||||
consisting of at most SMSS bytes. | ||||
A detailed rationale and discussion of the IW setting is provided in | A detailed rationale and discussion of the IW setting is provided in | |||
[RFC3390]. | [RFC3390]. | |||
When larger initial windows are implemented along with Path MTU | When larger initial windows are implemented along with Path MTU | |||
Discovery [RFC1191], and the MSS being used is found to be too | Discovery [RFC1191], and the MSS being used is found to be too | |||
large, the congestion window cwnd SHOULD be reduced to prevent | large, the congestion window cwnd SHOULD be reduced to prevent | |||
large bursts of smaller segments. Specifically, cwnd SHOULD be | large bursts of smaller segments. Specifically, cwnd SHOULD be | |||
reduced by the ratio of the old segment size to the new segment | reduced by the ratio of the old segment size to the new segment | |||
size. | size. | |||
The initial value of ssthresh SHOULD be arbitrarily high (for | The initial value of ssthresh SHOULD be arbitrarily high (e.g., to | |||
example, some implementations use the size of the advertised | the size of the largest possible advertised window), but ssthresh | |||
window), but ssthresh MUST be reduced in response to congestion. | MUST be reduced in response to congestion. Setting ssthresh as high | |||
as possible allows the network conditions, rather than some | ||||
arbitrary host limit, to dictate the sending rate. In cases where | ||||
the end systems have a solid understanding of the network path, more | ||||
carefully setting the initial ssthresh value may have merit (e.g., | ||||
such that the end host does not create congestion along the path). | ||||
The slow start algorithm is used when cwnd < ssthresh, while the | The slow start algorithm is used when cwnd < ssthresh, while the | |||
congestion avoidance algorithm is used when cwnd > ssthresh. When | congestion avoidance algorithm is used when cwnd > ssthresh. When | |||
cwnd and ssthresh are equal the sender may use either slow start or | cwnd and ssthresh are equal the sender may use either slow start or | |||
congestion avoidance. | congestion avoidance. | |||
During slow start, a TCP increments cwnd by at most SMSS bytes for | During slow start, a TCP increments cwnd by at most SMSS bytes for | |||
each ACK received that acknowledges new data. Slow start ends when | each ACK received that acknowledges new data. Slow start ends when | |||
cwnd exceeds ssthresh (or, optionally, when it reaches it, as noted | cwnd exceeds ssthresh (or, optionally, when it reaches it, as noted | |||
above) or when congestion is observed. While traditionally TCP | above) or when congestion is observed. While traditionally TCP | |||
implementations have increased cwnd by precisely SMSS bytes upon | implementations have increased cwnd by precisely SMSS bytes upon | |||
skipping to change at page 6, line 44 | skipping to change at page 6, line 53 | |||
incidentally increase well beyond rwnd. | incidentally increase well beyond rwnd. | |||
Furthermore, upon a timeout (as specified in [RFC2988]) cwnd MUST be | Furthermore, upon a timeout (as specified in [RFC2988]) cwnd MUST be | |||
set to no more than the loss window, LW, which equals 1 full-sized | set to no more than the loss window, LW, which equals 1 full-sized | |||
segment (regardless of the value of IW). Therefore, after | segment (regardless of the value of IW). Therefore, after | |||
retransmitting the dropped segment the TCP sender uses the slow | retransmitting the dropped segment the TCP sender uses the slow | |||
start algorithm to increase the window from 1 full-sized segment to | start algorithm to increase the window from 1 full-sized segment to | |||
the new value of ssthresh, at which point congestion avoidance again | the new value of ssthresh, at which point congestion avoidance again | |||
takes over. | takes over. | |||
As shown in [FF96,RFC3782], slow start-based loss recovery after a | ||||
timeout can cause spurious retransmissions that trigger duplicate | ||||
acknowledgments. The reaction to the arrival of these duplicate | ||||
ACKs in TCP implementations varies widely. This document does not | ||||
specify how to treat such acknowledgments, but does note this as an | ||||
area that may benefit from additional attention, experimentation and | ||||
specification. | ||||
3.2 Fast Retransmit/Fast Recovery | 3.2 Fast Retransmit/Fast Recovery | |||
A TCP receiver SHOULD send an immediate duplicate ACK when an out- | A TCP receiver SHOULD send an immediate duplicate ACK when an out- | |||
of-order segment arrives. The purpose of this ACK is to inform the | of-order segment arrives. The purpose of this ACK is to inform the | |||
sender that a segment was received out-of-order and which sequence | sender that a segment was received out-of-order and which sequence | |||
number is expected. From the sender's perspective, duplicate ACKs | number is expected. From the sender's perspective, duplicate ACKs | |||
can be caused by a number of network problems. First, they can be | can be caused by a number of network problems. First, they can be | |||
caused by dropped segments. In this case, all segments after the | caused by dropped segments. In this case, all segments after the | |||
dropped segment will trigger duplicate ACKs until the loss is | dropped segment will trigger duplicate ACKs until the loss is | |||
repaired. Second, duplicate ACKs can be caused by the re-ordering | repaired. Second, duplicate ACKs can be caused by the re-ordering | |||
skipping to change at page 7, line 10 | skipping to change at page 7, line 29 | |||
paths [Pax97]). Finally, duplicate ACKs can be caused by | paths [Pax97]). Finally, duplicate ACKs can be caused by | |||
replication of ACK or data segments by the network. In addition, a | replication of ACK or data segments by the network. In addition, a | |||
TCP receiver SHOULD send an immediate ACK when the incoming segment | TCP receiver SHOULD send an immediate ACK when the incoming segment | |||
fills in all or part of a gap in the sequence space. This will | fills in all or part of a gap in the sequence space. This will | |||
generate more timely information for a sender recovering from a loss | generate more timely information for a sender recovering from a loss | |||
through a retransmission timeout, a fast retransmit, or an advanced | through a retransmission timeout, a fast retransmit, or an advanced | |||
loss recovery algorithm, as outlined in section 4.3. | loss recovery algorithm, as outlined in section 4.3. | |||
The TCP sender SHOULD use the "fast retransmit" algorithm to detect | The TCP sender SHOULD use the "fast retransmit" algorithm to detect | |||
and repair loss, based on incoming duplicate ACKs. The fast | and repair loss, based on incoming duplicate ACKs. The fast | |||
retransmit algorithm uses the arrival of 3 duplicate ACKs (4 | retransmit algorithm uses the arrival of 3 duplicate ACKs (as | |||
identical ACKs without the arrival of any other intervening packets) | defined in section 2, without any intervening ACKs which move | |||
as an indication that a segment has been lost. After receiving 3 | SND.UNA) as an indication that a segment has been lost. After | |||
duplicate ACKs, TCP performs a retransmission of what appears to be | receiving 3 duplicate ACKs, TCP performs a retransmission of what | |||
the missing segment, without waiting for the retransmission timer to | appears to be the missing segment, without waiting for the | |||
expire. | retransmission timer to expire. | |||
After the fast retransmit algorithm sends what appears to be the | After the fast retransmit algorithm sends what appears to be the | |||
missing segment, the "fast recovery" algorithm governs the | missing segment, the "fast recovery" algorithm governs the | |||
transmission of new data until a non-duplicate ACK arrives. The | transmission of new data until a non-duplicate ACK arrives. The | |||
reason for not performing slow start is that the receipt of the | reason for not performing slow start is that the receipt of the | |||
duplicate ACKs not only indicates that a segment has been lost, but | duplicate ACKs not only indicates that a segment has been lost, but | |||
also that segments are most likely leaving the network (although a | also that segments are most likely leaving the network (although a | |||
massive segment duplication by the network can invalidate this | massive segment duplication by the network can invalidate this | |||
conclusion). In other words, since the receiver can only generate a | conclusion). In other words, since the receiver can only generate a | |||
duplicate ACK when a segment has arrived, that segment has left the | duplicate ACK when a segment has arrived, that segment has left the | |||
skipping to change at page 11, line 27 | skipping to change at page 11, line 47 | |||
that were not discussed in detail in 2001. Specifically, this | that were not discussed in detail in 2001. Specifically, this | |||
document suggests what TCP connections should do after a relatively | document suggests what TCP connections should do after a relatively | |||
long idle period, as well as specifying and clarifying some of the | long idle period, as well as specifying and clarifying some of the | |||
issues pertaining to TCP ACK generation. Finally, the allowable | issues pertaining to TCP ACK generation. Finally, the allowable | |||
upper bound for the initial congestion window has also been raised | upper bound for the initial congestion window has also been raised | |||
from one to two segments. | from one to two segments. | |||
7. Changes Relative to RFC 2581 | 7. Changes Relative to RFC 2581 | |||
A specific definition for "duplicate acknowledgment" has been | A specific definition for "duplicate acknowledgment" has been | |||
added, based on the definition used by BSD TCP. | added, based on the definition used by BSD TCP. In addition, the | |||
definition explicitly does not take into account the presence (or | ||||
absence) of DSACK [RFC2883] information. | ||||
The document now notes that what to do with duplicate ACKs after the | ||||
retransmission timer has fired is future work and explicitly | ||||
unspecified in this document. | ||||
The initial window requirements were changed to allow Larger | The initial window requirements were changed to allow Larger | |||
Initial Windows as standardized in [RFC3390]. Additionally, the | Initial Windows as standardized in [RFC3390]. Additionally, the | |||
steps to take when an initial window is discovered to be too large | steps to take when an initial window is discovered to be too large | |||
due to Path MTU Discovery [RFC1191] are detailed. | due to Path MTU Discovery [RFC1191] are detailed. | |||
The recommended initial value for ssthresh has been changed to say | The recommended initial value for ssthresh has been changed to say | |||
that it SHOULD be arbitrarily high, where it was previously MAY. | that it SHOULD be arbitrarily high, where it was previously MAY. | |||
This is to provide additional guidance to implementors on the | This is to provide additional guidance to implementors on the | |||
matter. | matter. | |||
skipping to change at page 12, line 35 | skipping to change at page 13, line 6 | |||
We wish to emphasize that the shortcomings and mistakes of this | We wish to emphasize that the shortcomings and mistakes of this | |||
document are solely the responsibility of the current authors. | document are solely the responsibility of the current authors. | |||
Some of the text from this document is taken from "TCP/IP | Some of the text from this document is taken from "TCP/IP | |||
Illustrated, Volume 1: The Protocols" by W. Richard Stevens | Illustrated, Volume 1: The Protocols" by W. Richard Stevens | |||
(Addison-Wesley, 1994) and "TCP/IP Illustrated, Volume 2: The | (Addison-Wesley, 1994) and "TCP/IP Illustrated, Volume 2: The | |||
Implementation" by Gary R. Wright and W. Richard Stevens (Addison- | Implementation" by Gary R. Wright and W. Richard Stevens (Addison- | |||
Wesley, 1995). This material is used with the permission of | Wesley, 1995). This material is used with the permission of | |||
Addison-Wesley. | Addison-Wesley. | |||
Neal Cardwell, Noritoshi Demizu, Kevin Fall, Sally Floyd, Craig | Steve Arden, Neal Cardwell, Noritoshi Demizu, Kevin Fall, John | |||
Partridge and Joe Touch contributed a number of helpful suggestions. | Heffner, Sally Floyd, Reiner Ludwig, Matt Mathis, Craig Partridge | |||
and Joe Touch contributed a number of helpful suggestions. | ||||
Normative References | Normative References | |||
[RFC793] Postel, J., "Transmission Control Protocol", STD 7, RFC | [RFC793] Postel, J., "Transmission Control Protocol", STD 7, RFC | |||
793, September 1981. | 793, September 1981. | |||
[RFC1122] Braden, R., "Requirements for Internet Hosts -- | [RFC1122] Braden, R., "Requirements for Internet Hosts -- | |||
Communication Layers", STD 3, RFC 1122, October 1989. | Communication Layers", STD 3, RFC 1122, October 1989. | |||
[RFC1191] Mogul, J. and S. Deering, "Path MTU Discovery", RFC 1191, | [RFC1191] Mogul, J. and S. Deering, "Path MTU Discovery", RFC 1191, | |||
skipping to change at page 14, line 5 | skipping to change at page 14, line 31 | |||
[RFC2414] Allman, M., Floyd, S. and C. Partridge, "Increasing TCP's | [RFC2414] Allman, M., Floyd, S. and C. Partridge, "Increasing TCP's | |||
Initial Window Size", RFC 2414, September 1998. | Initial Window Size", RFC 2414, September 1998. | |||
[RFC2525] Paxson, V., Allman, M., Dawson, S., Fenner, W., Griner, J., | [RFC2525] Paxson, V., Allman, M., Dawson, S., Fenner, W., Griner, J., | |||
Heavens, I., Lahey, K., Semke, J. and B. Volz, "Known TCP | Heavens, I., Lahey, K., Semke, J. and B. Volz, "Known TCP | |||
Implementation Problems", RFC 2525, March 1999. | Implementation Problems", RFC 2525, March 1999. | |||
[RFC2581] Allman, M., Paxson, V. and W. Stevens, "TCP Congestion | [RFC2581] Allman, M., Paxson, V. and W. Stevens, "TCP Congestion | |||
Control", RFC 2581, April 1999. | Control", RFC 2581, April 1999. | |||
[RFC2883] Floyd, S., Mahdavi, J., Mathis, M. and M. Podolsky, "An | ||||
Extension to the Selective Acknowledgement (SACK) Option for | ||||
TCP", RFC 2883, July 2000. | ||||
[RFC2988] Paxson, V. and M. Allman, "Computing TCP's Retransmission | [RFC2988] Paxson, V. and M. Allman, "Computing TCP's Retransmission | |||
Timer", RFC 2988, November 2000. | Timer", RFC 2988, November 2000. | |||
[RFC3042] Allman, M., Balakrishnan, H. and S. Floyd, "Enhancing | [RFC3042] Allman, M., Balakrishnan, H. and S. Floyd, "Enhancing | |||
TCP's Loss Recovery Using Limited Transmit", RFC 3042, January | TCP's Loss Recovery Using Limited Transmit", RFC 3042, January | |||
2001. | 2001. | |||
[RFC3465] Allman, M., "TCP Congestion Control with Appropriate Byte | [RFC3465] Allman, M., "TCP Congestion Control with Appropriate Byte | |||
Counting (ABC)", RFC 3465, February 2003. | Counting (ABC)", RFC 3465, February 2003. | |||
skipping to change at page 14, line 40 | skipping to change at page 15, line 16 | |||
[WS95] Wright, G. and W. Stevens, "TCP/IP Illustrated, Volume 2: The | [WS95] Wright, G. and W. Stevens, "TCP/IP Illustrated, Volume 2: The | |||
Implementation", Addison-Wesley, 1995. | Implementation", Addison-Wesley, 1995. | |||
Authors' Addresses | Authors' Addresses | |||
Mark Allman | Mark Allman | |||
ICIR / ICSI | ICIR / ICSI | |||
1947 Center Street | 1947 Center Street | |||
Suite 600 | Suite 600 | |||
Berkeley, CA 94704-1198 | Berkeley, CA 94704-1198 | |||
Phone: +1 440 243 7361 | Phone: +1 440 235 1792 | |||
EMail: mallman@icir.org | EMail: mallman@icir.org | |||
http://www.icir.org/mallman/ | http://www.icir.org/mallman/ | |||
Vern Paxson | Vern Paxson | |||
ICIR / ICSI | ICIR / ICSI | |||
1947 Center Street | 1947 Center Street | |||
Suite 600 | Suite 600 | |||
Berkeley, CA 94704-1198 | Berkeley, CA 94704-1198 | |||
Phone: +1 510/642-4274 x302 | Phone: +1 510/642-4274 x302 | |||
EMail: vern@icir.org | EMail: vern@icir.org | |||
End of changes. 12 change blocks. | ||||
28 lines changed or deleted | 56 lines changed or added | |||
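
The initial window rules shown in the change block at page 4 can be sketched as a small function. This is an illustrative sketch only, not text from either draft: the function name `initial_window_bytes` and the `syn_retransmitted` flag are hypothetical, and the flag is one possible way to model the -01 requirement that a lost SYN or SYN/ACK limits the initial window to one segment of at most SMSS bytes.

```python
def initial_window_bytes(smss: int, syn_retransmitted: bool = False) -> int:
    """Return the upper bound on IW, in bytes, for a given SMSS.

    smss: sender maximum segment size in bytes.
    syn_retransmitted: hypothetical flag, True if the SYN or SYN/ACK
        was lost and had to be retransmitted.
    """
    if smss <= 0:
        raise ValueError("SMSS must be a positive number of bytes")

    # Text added in -01: after a lost SYN or SYN/ACK, the initial window
    # MUST be one segment of at most SMSS bytes.
    if syn_retransmitted:
        return smss

    # The three SMSS ranges from the draft, largest segments first.
    if smss > 2190:
        segments = 2
    elif smss > 1095:
        segments = 3
    else:
        segments = 4
    return segments * smss
```

For example, under these assumptions an SMSS of 1460 bytes yields an upper bound of 3 segments (4380 bytes), while an SMSS of 536 bytes yields 4 segments (2144 bytes). The value is an upper bound, so a sender may choose a smaller initial window.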
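
The duplicate acknowledgment definition, conditions (a) through (e), and the 3-duplicate-ACK trigger for fast retransmit can be sketched as follows. This is a minimal sketch under stated assumptions rather than a definitive implementation: the `AckSegment` and `SenderState` types, their field names, and `on_ack` are hypothetical, sequence number wraparound is ignored, and the SACK-based alternative definition is not modeled. Following the -01 wording, the duplicate count is reset only by an ACK that moves SND.UNA.

```python
from dataclasses import dataclass

DUPACK_THRESHOLD = 3  # fast retransmit fires on the third duplicate ACK


@dataclass
class AckSegment:
    """Hypothetical view of an incoming segment's relevant fields."""
    ack_number: int
    payload_len: int
    syn: bool
    fin: bool
    window: int


@dataclass
class SenderState:
    """Hypothetical sender state; wraparound is ignored for brevity."""
    snd_una: int      # oldest unacknowledged sequence number
    snd_nxt: int      # next sequence number to be sent
    last_window: int  # advertised window in the last incoming ACK
    dupacks: int = 0


def is_duplicate_ack(state: SenderState, seg: AckSegment) -> bool:
    return (
        state.snd_nxt > state.snd_una          # (a) outstanding data
        and seg.payload_len == 0               # (b) ACK carries no data
        and not seg.syn and not seg.fin        # (c) SYN and FIN both off
        and seg.ack_number == state.snd_una    # (d) equals greatest ACK seen
        and seg.window == state.last_window    # (e) window unchanged
    )


def on_ack(state: SenderState, seg: AckSegment) -> bool:
    """Process one ACK; return True if fast retransmit should fire now."""
    if is_duplicate_ack(state, seg):
        state.dupacks += 1
        return state.dupacks == DUPACK_THRESHOLD
    if seg.ack_number > state.snd_una:
        # Only an ACK that moves SND.UNA resets the duplicate count.
        state.snd_una = seg.ack_number
        state.dupacks = 0
    state.last_window = seg.window
    return False
```

In this sketch a pure window update (same acknowledgment number, different advertised window) is not counted as a duplicate and does not reset the count, which is one reading of "without any intervening ACKs which move SND.UNA".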