draft-ietf-tcpm-rfc2581bis-00.txt   draft-ietf-tcpm-rfc2581bis-01.txt 
Network Working Group M. Allman Network Working Group M. Allman
Internet-Draft V. Paxson Internet-Draft V. Paxson
Expires: July 2006 ICIR / ICSI Expires: December 2006 ICIR / ICSI
E. Blanton E. Blanton
Purdue University Purdue University
January 2006
TCP Congestion Control TCP Congestion Control
draft-ietf-tcpm-rfc2581bis-00.txt draft-ietf-tcpm-rfc2581bis-01.txt
Status of this Memo Status of this Memo
By submitting this Internet-Draft, each author represents that any By submitting this Internet-Draft, each author represents that any
applicable patent or other IPR claims of which he or she is aware applicable patent or other IPR claims of which he or she is aware
have been or will be disclosed, and any of which he or she becomes have been or will be disclosed, and any of which he or she becomes
aware will be disclosed, in accordance with Section 6 of BCP 79. aware will be disclosed, in accordance with Section 6 of BCP 79.
Internet-Drafts are working documents of the Internet Engineering Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF), its areas, and its working groups. Note that Task Force (IETF), its areas, and its working groups. Note that
skipping to change at page 3, line 22 skipping to change at page 3, line 22
RESTART WINDOW (RW): The restart window is the size of the RESTART WINDOW (RW): The restart window is the size of the
congestion window after a TCP restarts transmission after an congestion window after a TCP restarts transmission after an
idle period (if the slow start algorithm is used; see section idle period (if the slow start algorithm is used; see section
4.1 for more discussion). 4.1 for more discussion).
FLIGHT SIZE: The amount of data that has been sent but not yet FLIGHT SIZE: The amount of data that has been sent but not yet
acknowledged. acknowledged.
DUPLICATE ACKNOWLEDGMENT: An acknowledgment is considered a DUPLICATE ACKNOWLEDGMENT: An acknowledgment is considered a
"duplicate" in the following algorithms when (a) the "duplicate" in the following algorithms when (a) the receiver of
receiver of the ACK has outstanding data, (b) the incoming the ACK has outstanding data, (b) the incoming acknowledgment
acknowledgment carries no data, (c) the SYN and FIN bits are carries no data, (c) the SYN and FIN bits are both off, (d) the
both off, (d) the acknowledgment number is equal to the greatest acknowledgment number is equal to the greatest acknowledgment
acknowledgment received on the given connection (TCP.UNA from received on the given connection (TCP.UNA from [RFC793]) and (e)
[RFC793]) and (e) the advertised window in the incoming the advertised window in the incoming acknowledgment equals the
acknowledgment equals the advertised window in the last incoming advertised window in the last incoming acknowledgment.
acknowledgment. Alternatively, a TCP that utilizes selective Alternatively, a TCP that utilizes selective acknowledgments
acknowledgments [RFC2018] can determine an incoming ACK is a [RFC2018,RFC2883] can determine an incoming ACK is a "duplicate"
"duplicate" if the ACK contains previously unknown SACK if the ACK contains previously unknown SACK information.
information.
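The five conditions (a) through (e) above can be read as a single predicate. The following is a minimal sketch, not text from either draft revision: the AckSegment type, its field names, and the snd_una / snd_nxt / last_window parameters are assumptions introduced for illustration, and the SACK-based alternative is not modeled.

```python
from dataclasses import dataclass


@dataclass
class AckSegment:
    """Hypothetical view of an incoming segment; field names are illustrative."""
    ack_number: int
    payload_len: int
    syn: bool
    fin: bool
    window: int


def is_duplicate_ack(seg, snd_una, snd_nxt, last_window):
    """Return True when seg satisfies conditions (a) through (e)."""
    has_outstanding_data = snd_nxt != snd_una    # (a) sender has unacknowledged data
    carries_no_data = seg.payload_len == 0       # (b) ACK carries no data
    syn_fin_off = not seg.syn and not seg.fin    # (c) SYN and FIN both off
    same_ack = seg.ack_number == snd_una         # (d) equals greatest ACK seen
    same_window = seg.window == last_window      # (e) advertised window unchanged
    return (has_outstanding_data and carries_no_data and syn_fin_off
            and same_ack and same_window)
```

An ACK that changes the advertised window fails condition (e) under this reading and is treated as a window update rather than a duplicate.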
3. Congestion Control Algorithms 3. Congestion Control Algorithms
This section defines the four congestion control algorithms: slow This section defines the four congestion control algorithms: slow
start, congestion avoidance, fast retransmit and fast recovery, start, congestion avoidance, fast retransmit and fast recovery,
developed in [Jac88] and [Jac90]. In some situations it may be developed in [Jac88] and [Jac90]. In some situations it may be
beneficial for a TCP sender to be more conservative than the beneficial for a TCP sender to be more conservative than the
algorithms allow, however a TCP MUST NOT be more aggressive than the algorithms allow, however a TCP MUST NOT be more aggressive than the
following algorithms allow (that is, MUST NOT send data when the following algorithms allow (that is, MUST NOT send data when the
value of cwnd computed by the following algorithms would not allow value of cwnd computed by the following algorithms would not allow
skipping to change at page 4, line 24 skipping to change at page 4, line 23
IW, the initial value of cwnd, MUST be set using the following IW, the initial value of cwnd, MUST be set using the following
guidelines as an upper bound. guidelines as an upper bound.
If SMSS > 2190 bytes: If SMSS > 2190 bytes:
IW = 2 * SMSS bytes and MUST NOT be more than 2 segments IW = 2 * SMSS bytes and MUST NOT be more than 2 segments
If (SMSS > 1095 bytes) and (SMSS <= 2190 bytes): If (SMSS > 1095 bytes) and (SMSS <= 2190 bytes):
IW = 3 * SMSS bytes and MUST NOT be more than 3 segments IW = 3 * SMSS bytes and MUST NOT be more than 3 segments
if SMSS <= 1095 bytes: if SMSS <= 1095 bytes:
IW = 4 * SMSS bytes and MUST NOT be more than 4 segments IW = 4 * SMSS bytes and MUST NOT be more than 4 segments
As specified in [RFC3390], the SYN/ACK and the acknowledgment of the
SYN/ACK MUST NOT increase the size of the congestion window.
Further, if the SYN or SYN/ACK is lost, the initial window used by a
sender after a correctly transmitted SYN MUST be one segment
consisting of at most SMSS bytes.
A detailed rationale and discussion of the IW setting is provided in A detailed rationale and discussion of the IW setting is provided in
[RFC3390]. [RFC3390].
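As an illustration of the IW upper bound above, the three SMSS ranges can be written as one function. This is a sketch only; the function names are hypothetical, and the second function reflects the SYN-loss rule that appears only in the -01 text.

```python
def initial_window(smss):
    """Upper bound on IW, in bytes, for a given SMSS in bytes."""
    if smss > 2190:
        segments = 2
    elif smss > 1095:
        segments = 3
    else:
        segments = 4
    return segments * smss


def initial_window_after_syn_loss(smss):
    """IW when the SYN or SYN/ACK was lost: one segment of at most SMSS bytes."""
    return smss
```

For example, an SMSS of 1460 bytes falls in the middle range and yields an upper bound of 3 segments (4380 bytes), while an SMSS of 536 bytes yields 4 segments (2144 bytes).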
When larger initial windows are implemented along with Path MTU When larger initial windows are implemented along with Path MTU
Discovery [RFC1191], and the MSS being used is found to be too Discovery [RFC1191], and the MSS being used is found to be too
large, the congestion window cwnd SHOULD be reduced to prevent large, the congestion window cwnd SHOULD be reduced to prevent
large bursts of smaller segments. Specifically, cwnd SHOULD be large bursts of smaller segments. Specifically, cwnd SHOULD be
reduced by the ratio of the old segment size to the new segment reduced by the ratio of the old segment size to the new segment
size. size.
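One possible reading of "reduced by the ratio of the old segment size to the new segment size" is that cwnd is divided by (old size / new size), which keeps the number of segments in the window roughly constant. The sketch below assumes that reading; the function name and the use of integer division are illustrative choices, not taken from the draft.

```python
def rescale_cwnd_for_new_mss(cwnd, old_mss, new_mss):
    """Scale cwnd (bytes) down when Path MTU Discovery shrinks the segment size.

    Assumed interpretation: new_cwnd = cwnd / (old_mss / new_mss).
    """
    if new_mss >= old_mss:
        return cwnd  # segment size did not shrink; nothing to reduce
    return cwnd * new_mss // old_mss
```

Under this interpretation, a 3-segment window of 4380 bytes at an MSS of 1460 becomes 1608 bytes (still 3 segments) when the MSS drops to 536.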
The initial value of ssthresh SHOULD be arbitrarily high (for The initial value of ssthresh SHOULD be arbitrarily high (e.g., to
example, some implementations use the size of the advertised the size of the largest possible advertised window), but ssthresh
window), but ssthresh MUST be reduced in response to congestion. MUST be reduced in response to congestion. Setting ssthresh as high
as possible allows the network conditions, rather than some
arbitrary host limit, to dictate the sending rate. In cases where
the end systems have a solid understanding of the network path, more
carefully setting the initial ssthresh value may have merit (e.g.,
such that the end host does not create congestion along the path).
The slow start algorithm is used when cwnd < ssthresh, while the The slow start algorithm is used when cwnd < ssthresh, while the
congestion avoidance algorithm is used when cwnd > ssthresh. When congestion avoidance algorithm is used when cwnd > ssthresh. When
cwnd and ssthresh are equal the sender may use either slow start or cwnd and ssthresh are equal the sender may use either slow start or
congestion avoidance. congestion avoidance.
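The choice between the two algorithms described above depends only on the relation between cwnd and ssthresh. A small sketch follows; the function name, the string labels, and the tie-breaking flag are assumptions made for illustration, since the text leaves the equal case to the implementation.

```python
def choose_algorithm(cwnd, ssthresh, prefer_slow_start_on_tie=True):
    """Pick the algorithm that governs cwnd growth for the current state."""
    if cwnd < ssthresh:
        return "slow_start"
    if cwnd > ssthresh:
        return "congestion_avoidance"
    # cwnd == ssthresh: either algorithm is permitted
    return "slow_start" if prefer_slow_start_on_tie else "congestion_avoidance"
```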
During slow start, a TCP increments cwnd by at most SMSS bytes for During slow start, a TCP increments cwnd by at most SMSS bytes for
each ACK received that acknowledges new data. Slow start ends when each ACK received that acknowledges new data. Slow start ends when
cwnd exceeds ssthresh (or, optionally, when it reaches it, as noted cwnd exceeds ssthresh (or, optionally, when it reaches it, as noted
above) or when congestion is observed. While traditionally TCP above) or when congestion is observed. While traditionally TCP
implementations have increased cwnd by precisely SMSS bytes upon implementations have increased cwnd by precisely SMSS bytes upon
skipping to change at page 6, line 44 skipping to change at page 6, line 53
incidentally increase well beyond rwnd. incidentally increase well beyond rwnd.
Furthermore, upon a timeout (as specified in [RFC2988]) cwnd MUST be Furthermore, upon a timeout (as specified in [RFC2988]) cwnd MUST be
set to no more than the loss window, LW, which equals 1 full-sized set to no more than the loss window, LW, which equals 1 full-sized
segment (regardless of the value of IW). Therefore, after segment (regardless of the value of IW). Therefore, after
retransmitting the dropped segment the TCP sender uses the slow retransmitting the dropped segment the TCP sender uses the slow
start algorithm to increase the window from 1 full-sized segment to start algorithm to increase the window from 1 full-sized segment to
the new value of ssthresh, at which point congestion avoidance again the new value of ssthresh, at which point congestion avoidance again
takes over. takes over.
As shown in [FF96,RFC3782], slow start-based loss recovery after a
timeout can cause spurious retransmissions that trigger duplicate
acknowledgments. The reaction to the arrival of these duplicate
ACKs in TCP implementations varies widely. This document does not
specify how to treat such acknowledgments, but does note this as an
area that may benefit from additional attention, experimentation and
specification.
3.2 Fast Retransmit/Fast Recovery 3.2 Fast Retransmit/Fast Recovery
A TCP receiver SHOULD send an immediate duplicate ACK when an out- A TCP receiver SHOULD send an immediate duplicate ACK when an out-
of-order segment arrives. The purpose of this ACK is to inform the of-order segment arrives. The purpose of this ACK is to inform the
sender that a segment was received out-of-order and which sequence sender that a segment was received out-of-order and which sequence
number is expected. From the sender's perspective, duplicate ACKs number is expected. From the sender's perspective, duplicate ACKs
can be caused by a number of network problems. First, they can be can be caused by a number of network problems. First, they can be
caused by dropped segments. In this case, all segments after the caused by dropped segments. In this case, all segments after the
dropped segment will trigger duplicate ACKs until the loss is dropped segment will trigger duplicate ACKs until the loss is
repaired. Second, duplicate ACKs can be caused by the re-ordering repaired. Second, duplicate ACKs can be caused by the re-ordering
skipping to change at page 7, line 10 skipping to change at page 7, line 29
paths [Pax97]). Finally, duplicate ACKs can be caused by paths [Pax97]). Finally, duplicate ACKs can be caused by
replication of ACK or data segments by the network. In addition, a replication of ACK or data segments by the network. In addition, a
TCP receiver SHOULD send an immediate ACK when the incoming segment TCP receiver SHOULD send an immediate ACK when the incoming segment
fills in all or part of a gap in the sequence space. This will fills in all or part of a gap in the sequence space. This will
generate more timely information for a sender recovering from a loss generate more timely information for a sender recovering from a loss
through a retransmission timeout, a fast retransmit, or an advanced through a retransmission timeout, a fast retransmit, or an advanced
loss recovery algorithm, as outlined in section 4.3. loss recovery algorithm, as outlined in section 4.3.
The TCP sender SHOULD use the "fast retransmit" algorithm to detect The TCP sender SHOULD use the "fast retransmit" algorithm to detect
and repair loss, based on incoming duplicate ACKs. The fast and repair loss, based on incoming duplicate ACKs. The fast
retransmit algorithm uses the arrival of 3 duplicate ACKs (4 retransmit algorithm uses the arrival of 3 duplicate ACKs (as
identical ACKs without the arrival of any other intervening packets) defined in section 2, without any intervening ACKs which move
as an indication that a segment has been lost. After receiving 3 SND.UNA) as an indication that a segment has been lost. After
duplicate ACKs, TCP performs a retransmission of what appears to be receiving 3 duplicate ACKs, TCP performs a retransmission of what
the missing segment, without waiting for the retransmission timer to appears to be the missing segment, without waiting for the
expire. retransmission timer to expire.
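A minimal sketch of the duplicate ACK counting described above, following the -01 wording (duplicates count only while no intervening ACK moves SND.UNA). The class and method names are hypothetical, sequence number wraparound is ignored, and whether an ACK is a duplicate is assumed to be decided by the caller using the definition in section 2.

```python
DUPACK_THRESHOLD = 3  # number of duplicate ACKs that triggers fast retransmit


class FastRetransmitDetector:
    """Counts consecutive duplicate ACKs for the current SND.UNA."""

    def __init__(self, snd_una):
        self.snd_una = snd_una
        self.dupacks = 0

    def on_ack(self, ack_number, is_duplicate):
        """Return True exactly when the third duplicate ACK arrives."""
        if ack_number != self.snd_una:
            # An ACK that moves SND.UNA ends the run of duplicates.
            self.snd_una = ack_number
            self.dupacks = 0
            return False
        if is_duplicate:
            self.dupacks += 1
            return self.dupacks == DUPACK_THRESHOLD
        return False
```

When on_ack returns True, the sender would retransmit the segment starting at snd_una without waiting for the retransmission timer.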
After the fast retransmit algorithm sends what appears to be the After the fast retransmit algorithm sends what appears to be the
missing segment, the "fast recovery" algorithm governs the missing segment, the "fast recovery" algorithm governs the
transmission of new data until a non-duplicate ACK arrives. The transmission of new data until a non-duplicate ACK arrives. The
reason for not performing slow start is that the receipt of the reason for not performing slow start is that the receipt of the
duplicate ACKs not only indicates that a segment has been lost, but duplicate ACKs not only indicates that a segment has been lost, but
also that segments are most likely leaving the network (although a also that segments are most likely leaving the network (although a
massive segment duplication by the network can invalidate this massive segment duplication by the network can invalidate this
conclusion). In other words, since the receiver can only generate a conclusion). In other words, since the receiver can only generate a
duplicate ACK when a segment has arrived, that segment has left the duplicate ACK when a segment has arrived, that segment has left the
skipping to change at page 11, line 27 skipping to change at page 11, line 47
that were not discussed in detail in 2001. Specifically, this that were not discussed in detail in 2001. Specifically, this
document suggests what TCP connections should do after a relatively document suggests what TCP connections should do after a relatively
long idle period, as well as specifying and clarifying some of the long idle period, as well as specifying and clarifying some of the
issues pertaining to TCP ACK generation. Finally, the allowable issues pertaining to TCP ACK generation. Finally, the allowable
upper bound for the initial congestion window has also been raised upper bound for the initial congestion window has also been raised
from one to two segments. from one to two segments.
7. Changes Relative to RFC 2581 7. Changes Relative to RFC 2581
A specific definition for "duplicate acknowledgment" has been A specific definition for "duplicate acknowledgment" has been
added, based on the definition used by BSD TCP. added, based on the definition used by BSD TCP. In addition, the
definition explicitly does not take into account the presence (or
absence) of DSACK [RFC2883] information.
The document now notes that what to do with duplicate ACKs after the
retransmission timer has fired is future work and explicitly
unspecified in this document.
The initial window requirements were changed to allow Larger The initial window requirements were changed to allow Larger
Initial Windows as standardized in [RFC3390]. Additionally, the Initial Windows as standardized in [RFC3390]. Additionally, the
steps to take when an initial window is discovered to be too large steps to take when an initial window is discovered to be too large
due to Path MTU Discovery [RFC1191] are detailed. due to Path MTU Discovery [RFC1191] are detailed.
The recommended initial value for ssthresh has been changed to say The recommended initial value for ssthresh has been changed to say
that it SHOULD be arbitrarily high, where it was previously MAY. that it SHOULD be arbitrarily high, where it was previously MAY.
This is to provide additional guidance to implementors on the This is to provide additional guidance to implementors on the
matter. matter.
skipping to change at page 12, line 35 skipping to change at page 13, line 6
We wish to emphasize that the shortcomings and mistakes of this We wish to emphasize that the shortcomings and mistakes of this
document are solely the responsibility of the current authors. document are solely the responsibility of the current authors.
Some of the text from this document is taken from "TCP/IP Some of the text from this document is taken from "TCP/IP
Illustrated, Volume 1: The Protocols" by W. Richard Stevens Illustrated, Volume 1: The Protocols" by W. Richard Stevens
(Addison-Wesley, 1994) and "TCP/IP Illustrated, Volume 2: The (Addison-Wesley, 1994) and "TCP/IP Illustrated, Volume 2: The
Implementation" by Gary R. Wright and W. Richard Stevens (Addison- Implementation" by Gary R. Wright and W. Richard Stevens (Addison-
Wesley, 1995). This material is used with the permission of Wesley, 1995). This material is used with the permission of
Addison-Wesley. Addison-Wesley.
Neal Cardwell, Noritoshi Demizu, Kevin Fall, Sally Floyd, Craig Steve Arden, Neal Cardwell, Noritoshi Demizu, Kevin Fall, John
Partridge and Joe Touch contributed a number of helpful suggestions. Heffner, Sally Floyd, Reiner Ludwig, Matt Mathis, Craig Partridge
and Joe Touch contributed a number of helpful suggestions.
Normative References Normative References
[RFC793] Postel, J., "Transmission Control Protocol", STD 7, RFC [RFC793] Postel, J., "Transmission Control Protocol", STD 7, RFC
793, September 1981. 793, September 1981.
[RFC1122] Braden, R., "Requirements for Internet Hosts -- [RFC1122] Braden, R., "Requirements for Internet Hosts --
Communication Layers", STD 3, RFC 1122, October 1989. Communication Layers", STD 3, RFC 1122, October 1989.
[RFC1191] Mogul, J. and S. Deering, "Path MTU Discovery", RFC 1191, [RFC1191] Mogul, J. and S. Deering, "Path MTU Discovery", RFC 1191,
skipping to change at page 14, line 5 skipping to change at page 14, line 31
[RFC2414] Allman, M., Floyd, S. and C. Partridge, "Increasing TCP's [RFC2414] Allman, M., Floyd, S. and C. Partridge, "Increasing TCP's
Initial Window Size", RFC 2414, September 1998. Initial Window Size", RFC 2414, September 1998.
[RFC2525] Paxson, V., Allman, M., Dawson, S., Fenner, W., Griner, J., [RFC2525] Paxson, V., Allman, M., Dawson, S., Fenner, W., Griner, J.,
Heavens, I., Lahey, K., Semke, J. and B. Volz, "Known TCP Heavens, I., Lahey, K., Semke, J. and B. Volz, "Known TCP
Implementation Problems", RFC 2525, March 1999. Implementation Problems", RFC 2525, March 1999.
[RFC2581] Allman, M., Paxson, V., W. Stevens, TCP Congestion [RFC2581] Allman, M., Paxson, V., W. Stevens, TCP Congestion
Control, RFC 2581, April 1999. Control, RFC 2581, April 1999.
[RFC2883] Floyd, S., J. Mahdavi, M. Mathis, M. Podolsky, An
Extension to the Selective Acknowledgement (SACK) Option for
TCP, RFC 2883, July 2000.
[RFC2988] V. Paxson and M. Allman, "Computing TCP's Retransmission [RFC2988] V. Paxson and M. Allman, "Computing TCP's Retransmission
Timer", RFC 2988, November 2000. Timer", RFC 2988, November 2000.
[RFC3042] Allman, M., Balakrishnan, H. and S. Floyd, "Enhancing [RFC3042] Allman, M., Balakrishnan, H. and S. Floyd, "Enhancing
TCP's Loss Recovery Using Limited Transmit", RFC 3042, January TCP's Loss Recovery Using Limited Transmit", RFC 3042, January
2001. 2001.
[RFC3465] Mark Allman, TCP Congestion Control with Appropriate Byte [RFC3465] Mark Allman, TCP Congestion Control with Appropriate Byte
Counting (ABC), RFC 3465, February 2003. Counting (ABC), RFC 3465, February 2003.
skipping to change at page 14, line 40 skipping to change at page 15, line 16
[WS95] Wright, G. and W. Stevens, "TCP/IP Illustrated, Volume 2: The [WS95] Wright, G. and W. Stevens, "TCP/IP Illustrated, Volume 2: The
Implementation", Addison-Wesley, 1995. Implementation", Addison-Wesley, 1995.
Authors' Addresses Authors' Addresses
Mark Allman Mark Allman
ICIR / ICSI ICIR / ICSI
1947 Center Street 1947 Center Street
Suite 600 Suite 600
Berkeley, CA 94704-1198 Berkeley, CA 94704-1198
Phone: +1 440 243 7361 Phone: +1 440 235 1792
EMail: mallman@icir.org EMail: mallman@icir.org
http://www.icir.org/mallman/ http://www.icir.org/mallman/
Vern Paxson Vern Paxson
ICIR / ICSI ICIR / ICSI
1947 Center Street 1947 Center Street
Suite 600 Suite 600
Berkeley, CA 94704-1198 Berkeley, CA 94704-1198
Phone: +1 510/642-4274 x302 Phone: +1 510/642-4274 x302
EMail: vern@icir.org EMail: vern@icir.org
 End of changes. 12 change blocks. 
28 lines changed or deleted 56 lines changed or added