draft-ietf-tcpm-rfc2581bis-03.txt   draft-ietf-tcpm-rfc2581bis-04.txt 
Network Working Group M. Allman Network Working Group M. Allman
Internet-Draft V. Paxson Internet-Draft V. Paxson
Expires: March 2008 ICIR / ICSI Expires: October 2008 ICSI
E. Blanton E. Blanton
Purdue University Purdue University
September 2007
TCP Congestion Control TCP Congestion Control
draft-ietf-tcpm-rfc2581bis-03.txt draft-ietf-tcpm-rfc2581bis-04.txt
Status of this Memo Status of this Memo
By submitting this Internet-Draft, each author represents that any By submitting this Internet-Draft, each author represents that any
applicable patent or other IPR claims of which he or she is aware applicable patent or other IPR claims of which he or she is aware
have been or will be disclosed, and any of which he or she becomes have been or will be disclosed, and any of which he or she becomes
aware will be disclosed, in accordance with Section 6 of BCP 79. aware will be disclosed, in accordance with Section 6 of BCP 79.
Internet-Drafts are working documents of the Internet Engineering Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF), its areas, and its working groups. Note that Task Force (IETF), its areas, and its working groups. Note that
skipping to change at page 1, line 36 skipping to change at page 1, line 34
months and may be updated, replaced, or obsoleted by other documents months and may be updated, replaced, or obsoleted by other documents
at any time. It is inappropriate to use Internet-Drafts as at any time. It is inappropriate to use Internet-Drafts as
reference material or to cite them other than as "work in progress." reference material or to cite them other than as "work in progress."
The list of current Internet-Drafts can be accessed at The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt. http://www.ietf.org/ietf/1id-abstracts.txt.
The list of Internet-Draft Shadow Directories can be accessed at The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html. http://www.ietf.org/shadow.html.
Copyright Notice
Copyright (C) The Internet Society (2007).
Abstract Abstract
This document defines TCP's four intertwined congestion control This document defines TCP's four intertwined congestion control
algorithms: slow start, congestion avoidance, fast retransmit, and algorithms: slow start, congestion avoidance, fast retransmit, and
fast recovery. In addition, the document specifies how TCP should fast recovery. In addition, the document specifies how TCP should
begin transmission after a relatively long idle period, as well as begin transmission after a relatively long idle period, as well as
discussing various acknowledgment generation methods. discussing various acknowledgment generation methods.
1. Introduction 1. Introduction
This document specifies four TCP [RFC793] congestion control This document specifies four TCP [RFC793] congestion control
algorithms: slow start, congestion avoidance, fast retransmit and algorithms: slow start, congestion avoidance, fast retransmit and
fast recovery. These algorithms were devised in [Jac88] and fast recovery. These algorithms were devised in [Jac88] and
[Jac90]. Their use with TCP is standardized in [RFC1122]. Additional [Jac90]. Their use with TCP is standardized in [RFC1122].
early work in additive-increase, multiplicative-decrease congestion Additional early work in additive-increase, multiplicative-decrease
control is given in [CJ89]. congestion control is given in [CJ89].
This document obsoletes [RFC2581] which in turn obsoleted This document obsoletes [RFC2581] which in turn obsoleted
[RFC2001]. [RFC2001].
In addition to specifying the congestion control algorithms, this In addition to specifying the congestion control algorithms, this
document specifies what TCP connections should do after a relatively document specifies what TCP connections should do after a relatively
long idle period, as well as specifying and clarifying some of the long idle period, as well as specifying and clarifying some of the
issues pertaining to TCP ACK generation. issues pertaining to TCP ACK generation.
Note that [Ste94] provides examples of these algorithms in action Note that [Ste94] provides examples of these algorithms in action
skipping to change at page 2, line 39 skipping to change at page 2, line 34
This section provides the definition of several terms that will be This section provides the definition of several terms that will be
used throughout the remainder of this document. used throughout the remainder of this document.
SEGMENT: A segment is ANY TCP/IP data or acknowledgment packet (or SEGMENT: A segment is ANY TCP/IP data or acknowledgment packet (or
both). both).
SENDER MAXIMUM SEGMENT SIZE (SMSS): The SMSS is the size of the SENDER MAXIMUM SEGMENT SIZE (SMSS): The SMSS is the size of the
largest segment that the sender can transmit. This value can be largest segment that the sender can transmit. This value can be
based on the maximum transmission unit of the network, the path based on the maximum transmission unit of the network, the path
MTU discovery [RFC1191] algorithm, RMSS (see next item), or other MTU discovery [RFC1191,RFC4821] algorithm, RMSS (see next item),
factors. The size does not include the TCP/IP headers and or other factors. The size does not include the TCP/IP headers
options. and options.
RECEIVER MAXIMUM SEGMENT SIZE (RMSS): The RMSS is the size of the RECEIVER MAXIMUM SEGMENT SIZE (RMSS): The RMSS is the size of the
largest segment the receiver is willing to accept. This is the largest segment the receiver is willing to accept. This is the
value specified in the MSS option sent by the receiver during value specified in the MSS option sent by the receiver during
connection startup. Or, if the MSS option is not used, 536 connection startup. Or, if the MSS option is not used, 536
bytes [RFC1122]. The size does not include the TCP/IP headers and bytes [RFC1122]. The size does not include the TCP/IP headers
options. and options.
FULL-SIZED SEGMENT: A segment that contains the maximum number of FULL-SIZED SEGMENT: A segment that contains the maximum number of
data bytes permitted (i.e., a segment containing SMSS bytes of data bytes permitted (i.e., a segment containing SMSS bytes of
data). data).
RECEIVER WINDOW (rwnd): The most recently advertised receiver RECEIVER WINDOW (rwnd): The most recently advertised receiver
window. window.
CONGESTION WINDOW (cwnd): A TCP state variable that limits the CONGESTION WINDOW (cwnd): A TCP state variable that limits the
amount of data a TCP can send. At any given time, a TCP MUST amount of data a TCP can send. At any given time, a TCP MUST
skipping to change at page 3, line 21 skipping to change at page 3, line 18
LOSS WINDOW (LW): The loss window is the size of the congestion LOSS WINDOW (LW): The loss window is the size of the congestion
window after a TCP sender detects loss using its retransmission window after a TCP sender detects loss using its retransmission
timer. timer.
RESTART WINDOW (RW): The restart window is the size of the RESTART WINDOW (RW): The restart window is the size of the
congestion window after a TCP restarts transmission after an congestion window after a TCP restarts transmission after an
idle period (if the slow start algorithm is used; see section idle period (if the slow start algorithm is used; see section
4.1 for more discussion). 4.1 for more discussion).
FLIGHT SIZE: The amount of data that has been sent but not yet FLIGHT SIZE: The amount of data that has been sent but not yet
acknowledged. cumulatively acknowledged.
DUPLICATE ACKNOWLEDGMENT: An acknowledgment is considered a DUPLICATE ACKNOWLEDGMENT: An acknowledgment is considered a
"duplicate" in the following algorithms when (a) the receiver of "duplicate" in the following algorithms when (a) the receiver of
the ACK has outstanding data, (b) the incoming acknowledgment the ACK has outstanding data, (b) the incoming acknowledgment
carries no data, (c) the SYN and FIN bits are both off, (d) the carries no data, (c) the SYN and FIN bits are both off, (d) the
acknowledgment number is equal to the greatest acknowledgment acknowledgment number is equal to the greatest acknowledgment
received on the given connection (TCP.UNA from [RFC793]) and (e) received on the given connection (TCP.UNA from [RFC793]) and (e)
the advertised window in the incoming acknowledgment equals the the advertised window in the incoming acknowledgment equals the
advertised window in the last incoming acknowledgment. advertised window in the last incoming acknowledgment.
Alternatively, a TCP that utilizes selective acknowledgments Alternatively, a TCP that utilizes selective acknowledgments
[RFC2018,RFC2883] can determine an incoming ACK is a "duplicate" [RFC2018,RFC2883] can leverage the SACK information to determine
if the ACK contains previously unknown SACK information. when an incoming ACK is a "duplicate" (e.g., if the ACK contains
previously unknown SACK information).
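As an illustration only, the five non-SACK conditions (a) through (e) above can be written as a single predicate. This is a minimal sketch, not part of either draft: the Ack record, its field names, and the snd_una / snd_nxt / last_window parameters are hypothetical, and sequence-number wraparound is deliberately ignored.

```python
from dataclasses import dataclass


@dataclass
class Ack:
    """Hypothetical view of an incoming ACK segment (names are assumptions)."""
    ack_no: int        # acknowledgment number carried by the segment
    payload_len: int   # bytes of data carried by the segment
    syn: bool
    fin: bool
    window: int        # advertised window carried by the segment


def is_duplicate_ack(ack: Ack, snd_una: int, snd_nxt: int, last_window: int) -> bool:
    """Apply conditions (a)-(e); sequence wraparound is not handled here."""
    has_outstanding_data = snd_nxt != snd_una            # (a)
    return (
        has_outstanding_data
        and ack.payload_len == 0                         # (b)
        and not ack.syn and not ack.fin                  # (c)
        and ack.ack_no == snd_una                        # (d)
        and ack.window == last_window                    # (e)
    )
```

A SACK-capable sender could substitute a different test, as the definition allows; that variant is not sketched here.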
3. Congestion Control Algorithms 3. Congestion Control Algorithms
This section defines the four congestion control algorithms: slow This section defines the four congestion control algorithms: slow
start, congestion avoidance, fast retransmit and fast recovery, start, congestion avoidance, fast retransmit and fast recovery,
developed in [Jac88] and [Jac90]. In some situations it may be developed in [Jac88] and [Jac90]. In some situations it may be
beneficial for a TCP sender to be more conservative than the beneficial for a TCP sender to be more conservative than the
algorithms allow; however, a TCP MUST NOT be more aggressive than the algorithms allow; however, a TCP MUST NOT be more aggressive than the
following algorithms allow (that is, MUST NOT send data when the following algorithms allow (that is, MUST NOT send data when the
value of cwnd computed by the following algorithms would not allow value of cwnd computed by the following algorithms would not allow
the data to be sent). the data to be sent).
Also note that the algorithms specified in this document work in
terms of using loss as the signal of congestion. Explicit
Congestion Notification (ECN) could also be used as specified in
[RFC3168].
3.1 Slow Start and Congestion Avoidance 3.1 Slow Start and Congestion Avoidance
The slow start and congestion avoidance algorithms MUST be used by a The slow start and congestion avoidance algorithms MUST be used by a
TCP sender to control the amount of outstanding data being injected TCP sender to control the amount of outstanding data being injected
into the network. To implement these algorithms, two variables are into the network. To implement these algorithms, two variables are
added to the TCP per-connection state. The congestion window (cwnd) added to the TCP per-connection state. The congestion window (cwnd)
is a sender-side limit on the amount of data the sender can transmit is a sender-side limit on the amount of data the sender can transmit
into the network before receiving an acknowledgment (ACK), while the into the network before receiving an acknowledgment (ACK), while the
receiver's advertised window (rwnd) is a receiver-side limit on the receiver's advertised window (rwnd) is a receiver-side limit on the
amount of outstanding data. The minimum of cwnd and rwnd governs amount of outstanding data. The minimum of cwnd and rwnd governs
skipping to change at page 4, line 14 skipping to change at page 4, line 16
Another state variable, the slow start threshold (ssthresh), is used Another state variable, the slow start threshold (ssthresh), is used
to determine whether the slow start or congestion avoidance to determine whether the slow start or congestion avoidance
algorithm is used to control data transmission, as discussed below. algorithm is used to control data transmission, as discussed below.
Beginning transmission into a network with unknown conditions Beginning transmission into a network with unknown conditions
requires TCP to slowly probe the network to determine the available requires TCP to slowly probe the network to determine the available
capacity, in order to avoid congesting the network with an capacity, in order to avoid congesting the network with an
inappropriately large burst of data. The slow start algorithm is inappropriately large burst of data. The slow start algorithm is
used for this purpose at the beginning of a transfer, or after used for this purpose at the beginning of a transfer, or after
repairing loss detected by the retransmission timer. repairing loss detected by the retransmission timer. Slow start
additionally serves to start the "ACK clock" used by the TCP sender
to release data into the network in the slow start, congestion
avoidance, and loss recovery algorithms.
IW, the initial value of cwnd, MUST be set using the following IW, the initial value of cwnd, MUST be set using the following
guidelines as an upper bound. guidelines as an upper bound.
If SMSS > 2190 bytes: If SMSS > 2190 bytes:
IW = 2 * SMSS bytes and MUST NOT be more than 2 segments IW = 2 * SMSS bytes and MUST NOT be more than 2 segments
If (SMSS > 1095 bytes) and (SMSS <= 2190 bytes): If (SMSS > 1095 bytes) and (SMSS <= 2190 bytes):
IW = 3 * SMSS bytes and MUST NOT be more than 3 segments IW = 3 * SMSS bytes and MUST NOT be more than 3 segments
If SMSS <= 1095 bytes: If SMSS <= 1095 bytes:
IW = 4 * SMSS bytes and MUST NOT be more than 4 segments IW = 4 * SMSS bytes and MUST NOT be more than 4 segments
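The three cases above can be sketched as a small function. This is an illustrative sketch only; the function name is hypothetical, and the value returned is the upper bound in bytes, so an implementation remains free to choose a smaller initial window.

```python
def initial_window_upper_bound(smss: int) -> int:
    """Return the upper bound on IW, in bytes, for a given SMSS."""
    if smss > 2190:
        return 2 * smss      # at most 2 segments
    if smss > 1095:
        return 3 * smss      # at most 3 segments
    return 4 * smss          # at most 4 segments
```

For example, under these rules an SMSS of 1460 bytes falls in the middle case and yields a bound of 4380 bytes (3 segments), while an SMSS of 536 bytes yields 2144 bytes (4 segments).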
As specified in [RFC3390], the SYN/ACK and the acknowledgment of the As specified in [RFC3390], the SYN/ACK and the acknowledgment of the
SYN/ACK MUST NOT increase the size of the congestion window. SYN/ACK MUST NOT increase the size of the congestion window.
Further, if the SYN or SYN/ACK is lost, the initial window used by a Further, if the SYN or SYN/ACK is lost, the initial window used by a
sender after a correctly transmitted SYN MUST be one segment sender after a correctly transmitted SYN MUST be one segment
consisting of at most SMSS bytes. consisting of at most SMSS bytes.
A detailed rationale and discussion of the IW setting is provided in A detailed rationale and discussion of the IW setting is provided in
[RFC3390]. [RFC3390].
When larger initial windows are implemented along with Path MTU When initial congestion windows of more than one segment are
Discovery [RFC1191], and the MSS being used is found to be too implemented along with Path MTU Discovery [RFC1191], and the MSS
large, the congestion window cwnd SHOULD be reduced to prevent being used is found to be too large, the congestion window cwnd
large bursts of smaller segments. Specifically, cwnd SHOULD be SHOULD be reduced to prevent large bursts of smaller segments.
reduced by the ratio of the old segment size to the new segment Specifically, cwnd SHOULD be reduced by the ratio of the old segment
size. size to the new segment size.
The initial value of ssthresh SHOULD be set arbitrarily high (e.g., The initial value of ssthresh SHOULD be set arbitrarily high (e.g.,
to the size of the largest possible advertised window), but ssthresh to the size of the largest possible advertised window), but ssthresh
MUST be reduced in response to congestion. Setting ssthresh as high MUST be reduced in response to congestion. Setting ssthresh as high
as possible allows the network conditions, rather than some as possible allows the network conditions, rather than some
arbitrary host limit, to dictate the sending rate. In cases where arbitrary host limit, to dictate the sending rate. In cases where
the end systems have a solid understanding of the network path, more the end systems have a solid understanding of the network path, more
carefully setting the initial ssthresh value may have merit (e.g., carefully setting the initial ssthresh value may have merit (e.g.,
such that the end host does not create congestion along the path). such that the end host does not create congestion along the path).
The slow start algorithm is used when cwnd < ssthresh, while the The slow start algorithm is used when cwnd < ssthresh, while the
congestion avoidance algorithm is used when cwnd > ssthresh. When congestion avoidance algorithm is used when cwnd > ssthresh. When
cwnd and ssthresh are equal the sender may use either slow start or cwnd and ssthresh are equal the sender may use either slow start or
congestion avoidance. congestion avoidance.
During slow start, a TCP increments cwnd by at most SMSS bytes for During slow start, a TCP increments cwnd by at most SMSS bytes for
each ACK received that acknowledges new data. Slow start ends when each ACK received that cumulatively acknowledges new data. Slow
cwnd exceeds ssthresh (or, optionally, when it reaches it, as noted start ends when cwnd exceeds ssthresh (or, optionally, when it
above) or when congestion is observed. While traditionally TCP reaches it, as noted above) or when congestion is observed. While
implementations have increased cwnd by precisely SMSS bytes upon traditionally TCP implementations have increased cwnd by precisely
receipt of an ACK covering new data, we RECOMMEND that TCP SMSS bytes upon receipt of an ACK covering new data, we RECOMMEND
implementations increase cwnd, per: that TCP implementations increase cwnd, per:
cwnd += min (N, SMSS) (2) cwnd += min (N, SMSS) (2)
where N is the number of previously unacknowledged bytes where N is the number of previously unacknowledged bytes
acknowledged in the incoming ACK. This adjustment is part of acknowledged in the incoming ACK. This adjustment is part of
Appropriate Byte Counting [RFC3465] and provides robustness against Appropriate Byte Counting [RFC3465] and provides robustness against
misbehaving receivers which may attempt to induce a sender to misbehaving receivers which may attempt to induce a sender to
artificially inflate cwnd using a mechanism known as "ACK Division" artificially inflate cwnd using a mechanism known as "ACK Division"
[SCWA99]. ACK Division consists of a receiver sending multiple ACKs [SCWA99]. ACK Division consists of a receiver sending multiple ACKs
for a single TCP data segment, each acknowledging only a portion of for a single TCP data segment, each acknowledging only a portion of
skipping to change at page 5, line 29 skipping to change at page 5, line 35
inappropriately inflate the amount of data injected into the inappropriately inflate the amount of data injected into the
network. network.
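To make the effect of equation (2) concrete, the following sketch compares a single ACK covering one full-sized segment against the same segment acknowledged in four pieces. The function names and the sample sizes are assumptions chosen for illustration.

```python
def slow_start_increase(cwnd: int, newly_acked: int, smss: int) -> int:
    """Equation (2): grow cwnd by min(N, SMSS) for an ACK covering N new bytes."""
    return cwnd + min(newly_acked, smss)


def apply_acks(cwnd: int, ack_sizes: list[int], smss: int) -> int:
    """Apply equation (2) once per incoming ACK, in order."""
    for newly_acked in ack_sizes:
        cwnd = slow_start_increase(cwnd, newly_acked, smss)
    return cwnd


SMSS = 1460
honest = apply_acks(2 * SMSS, [1460], SMSS)       # one ACK for the whole segment
divided = apply_acks(2 * SMSS, [365] * 4, SMSS)   # same segment, ACKed in 4 pieces
```

Both honest and divided end at the same cwnd, because growth is tied to the bytes acknowledged rather than to the number of ACKs. A sender that added a full SMSS per ACK would instead have grown by four segments in the divided case.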
During congestion avoidance, cwnd is incremented by roughly 1 During congestion avoidance, cwnd is incremented by roughly 1
full-sized segment per round-trip time (RTT). Congestion avoidance full-sized segment per round-trip time (RTT). Congestion avoidance
continues until congestion is detected. The basic guidelines for continues until congestion is detected. The basic guidelines for
incrementing cwnd during congestion avoidance are: incrementing cwnd during congestion avoidance are:
* MAY increment cwnd by SMSS bytes * MAY increment cwnd by SMSS bytes
* SHOULD increment cwnd per equation (2) * SHOULD increment cwnd per equation (2) once per RTT
* MUST NOT increment cwnd by more than SMSS bytes * MUST NOT increment cwnd by more than SMSS bytes
We note that [RFC3465] allows for cwnd increases of more than SMSS We note that [RFC3465] allows for cwnd increases of more than SMSS
bytes for incoming acknowledgments during slow start on an bytes for incoming acknowledgments during slow start on an
experimental basis; however, such behavior is not allowed as part of experimental basis; however, such behavior is not allowed as part of
the standard. the standard.
The RECOMMENDED way to increase cwnd during congestion avoidance is The RECOMMENDED way to increase cwnd during congestion avoidance is
to count the number of bytes that have been acknowledged by ACKs for to count the number of bytes that have been acknowledged by ACKs for
skipping to change at page 5, line 52 skipping to change at page 6, line 4
acknowledged reaches cwnd, then cwnd can be incremented by up to acknowledged reaches cwnd, then cwnd can be incremented by up to
SMSS bytes. Note that during congestion avoidance, cwnd MUST NOT be SMSS bytes. Note that during congestion avoidance, cwnd MUST NOT be
increased by more than SMSS bytes per RTT. This method both allows increased by more than SMSS bytes per RTT. This method both allows
TCPs to increase cwnd by one segment per RTT in the face of delayed TCPs to increase cwnd by one segment per RTT in the face of delayed
ACKs and provides robustness against ACK Division attacks. ACKs and provides robustness against ACK Division attacks.
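A minimal sketch of the byte-counting method follows. The class and attribute names are hypothetical, and the handling of the counter once it reaches cwnd (subtracting cwnd before the increase) is an assumption modeled on Appropriate Byte Counting [RFC3465]; the text shown here does not spell that detail out.

```python
class ByteCountingAvoidance:
    """Congestion avoidance by counting acknowledged bytes (illustrative only)."""

    def __init__(self, cwnd: int, smss: int) -> None:
        self.cwnd = cwnd
        self.smss = smss
        self.bytes_acked = 0   # new bytes acknowledged since the last increase

    def on_ack(self, newly_acked: int) -> None:
        """Process an ACK that acknowledges newly_acked previously unacked bytes."""
        self.bytes_acked += newly_acked
        if self.bytes_acked >= self.cwnd:
            # Assumption: carry the surplus over, then grow by one segment.
            self.bytes_acked -= self.cwnd
            self.cwnd += self.smss
```

With a cwnd of ten 1460-byte segments and a receiver that acknowledges every second segment, five ACKs of 2920 bytes each are needed before cwnd grows by one SMSS, which corresponds to roughly one increase per RTT even with delayed ACKs.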
Another common formula that a TCP MAY use to update cwnd during Another common formula that a TCP MAY use to update cwnd during
congestion avoidance is given in equation 3: congestion avoidance is given in equation 3:
cwnd += SMSS*SMSS/cwnd (3) cwnd += SMSS*SMSS/cwnd (3)
This adjustment is executed on every incoming ACK that acknowledges This adjustment is executed on every incoming ACK that acknowledges
new data. new data. Equation (3) provides an acceptable approximation to the
Equation (3) provides an acceptable approximation to the underlying underlying principle of increasing cwnd by 1 full-sized segment per
principle of increasing cwnd by 1 full-sized segment per RTT. (Note RTT. (Note that for a connection in which the receiver is
that for a connection in which the receiver is acknowledging acknowledging every other packet, (3) is less aggressive than
every other packet, (3) is less aggressive than allowed -- roughly allowed -- roughly increasing cwnd every second RTT.)
increasing cwnd every second RTT.)
Implementation Note: Since integer arithmetic is usually used in TCP Implementation Note: Since integer arithmetic is usually used in TCP
implementations, the formula given in equation 3 can fail to implementations, the formula given in equation 3 can fail to
increase cwnd when the congestion window is larger than SMSS*SMSS. increase cwnd when the congestion window is larger than SMSS*SMSS.
If the above formula yields 0, the result SHOULD be rounded up to 1 If the above formula yields 0, the result SHOULD be rounded up to 1
byte. byte.
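The rounding issue in the note above can be seen in a short sketch of equation (3) using integer arithmetic. The function name is hypothetical; the floor division stands in for the integer arithmetic the note refers to.

```python
def eq3_increase(cwnd: int, smss: int) -> int:
    """Equation (3) with integer division, rounding a zero increment up to 1 byte."""
    increment = (smss * smss) // cwnd
    return cwnd + max(increment, 1)
```

With SMSS = 1460 and cwnd = 14600, the increment is 146 bytes per ACK. Once cwnd exceeds SMSS*SMSS (2131600 bytes for this SMSS), the integer quotient is 0 and the max() supplies the 1-byte minimum.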
Implementation Note: Older implementations have an additional Implementation Note: Older implementations have an additional
additive constant on the right-hand side of equation (3). This is additive constant on the right-hand side of equation (3). This is
incorrect and can actually lead to diminished performance [RFC2525]. incorrect and can actually lead to diminished performance [RFC2525].
skipping to change at page 6, line 34 skipping to change at page 6, line 38
value of ssthresh MUST be set to no more than the value given in value of ssthresh MUST be set to no more than the value given in
equation 4: equation 4:
ssthresh = max (FlightSize / 2, 2*SMSS) (4) ssthresh = max (FlightSize / 2, 2*SMSS) (4)
where, as discussed above, FlightSize is the amount of outstanding where, as discussed above, FlightSize is the amount of outstanding
data in the network. data in the network.
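Equation (4) can be sketched as follows. The function name is hypothetical and integer division is an assumption; the result is the upper bound on the new ssthresh, computed from FlightSize rather than from cwnd.

```python
def ssthresh_upper_bound(flight_size: int, smss: int) -> int:
    """Equation (4): half the outstanding data, but never below 2*SMSS."""
    return max(flight_size // 2, 2 * smss)
```

For instance, with 20000 bytes in flight and SMSS = 1460 the bound is 10000 bytes, while with only 2000 bytes in flight the 2*SMSS floor applies and the bound is 2920 bytes.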
On the other hand, when a TCP sender detects segment loss using the On the other hand, when a TCP sender detects segment loss using the
retransmission timer and the given segment has already been retransmission timer and the given segment has already been
retransmitted at least once, the value of ssthresh is held retransmitted by way of the retransmission timer at least once, the
constant. value of ssthresh is held constant.
Implementation Note: An easy mistake to make is to simply use cwnd, Implementation Note: An easy mistake to make is to simply use cwnd,
rather than FlightSize, which in some implementations may rather than FlightSize, which in some implementations may
incidentally increase well beyond rwnd. incidentally increase well beyond rwnd.
Furthermore, upon a timeout (as specified in [RFC2988]) cwnd MUST be Furthermore, upon a timeout (as specified in [RFC2988]) cwnd MUST be
set to no more than the loss window, LW, which equals 1 full-sized set to no more than the loss window, LW, which equals 1 full-sized
segment (regardless of the value of IW). Therefore, after segment (regardless of the value of IW). Therefore, after
retransmitting the dropped segment the TCP sender uses the slow retransmitting the dropped segment the TCP sender uses the slow
start algorithm to increase the window from 1 full-sized segment to start algorithm to increase the window from 1 full-sized segment to
skipping to change at page 8, line 6 skipping to change at page 8, line 11
TCP SHOULD send a segment of previously unsent data per TCP SHOULD send a segment of previously unsent data per
[RFC3042] provided that the receiver's advertised window allows, [RFC3042] provided that the receiver's advertised window allows,
the total FlightSize would remain less than or equal to cwnd the total FlightSize would remain less than or equal to cwnd
plus 2*SMSS, and that new data is available for transmission. plus 2*SMSS, and that new data is available for transmission.
Further, the TCP sender MUST NOT change cwnd to reflect these Further, the TCP sender MUST NOT change cwnd to reflect these
two segments [RFC3042]. Note that a sender using SACK [RFC2018] two segments [RFC3042]. Note that a sender using SACK [RFC2018]
MUST NOT send new data unless the incoming duplicate MUST NOT send new data unless the incoming duplicate
acknowledgment contains new SACK information. acknowledgment contains new SACK information.
2. When the third duplicate ACK is received, a TCP MUST set 2. When the third duplicate ACK is received, a TCP MUST set
ssthresh to no more than the value given in equation 4. ssthresh to no more than the value given in equation 4. When
[RFC3042] is in use, additional data sent in limited transmit
MUST NOT be included in this calculation.
3. The lost segment MUST be retransmitted and cwnd set to 3. The lost segment starting at SND.UNA MUST be retransmitted and
ssthresh plus 3*SMSS. This artificially "inflates" the cwnd set to ssthresh plus 3*SMSS. This artificially "inflates"
congestion window by the number of segments (three) that have the congestion window by the number of segments (three) that
left the network and which the receiver has buffered. have left the network and which the receiver has buffered.
4. For each additional duplicate ACK received (after the third), 4. For each additional duplicate ACK received (after the third),
cwnd MUST be incremented by SMSS. This artificially inflates cwnd MUST be incremented by SMSS. This artificially inflates
the congestion window in order to reflect the additional segment the congestion window in order to reflect the additional segment
that has left the network. that has left the network.
Note: [SCWA99] discusses a receiver-based attack whereby many Note: [SCWA99] discusses a receiver-based attack whereby many
bogus duplicate ACKs are sent to the data sender in order to bogus duplicate ACKs are sent to the data sender in order to
artificially inflate cwnd and cause a higher than appropriate artificially inflate cwnd and cause a higher than appropriate
sending rate to be used. A TCP MAY therefore limit the number sending rate to be used. A TCP MAY therefore limit the number
of times cwnd is artificially inflated during loss recovery of times cwnd is artificially inflated during loss recovery
to the number of outstanding segments (or, an approximation to the number of outstanding segments (or, an approximation
thereof). thereof).
5. Transmit a segment, if allowed by the new value of cwnd and the 5. When previously unsent data is available and the new value of
receiver's advertised window. cwnd and the receiver's advertised window allow, a TCP SHOULD
send 1*SMSS bytes of previously unsent data.
6. When the next ACK arrives that acknowledges previously 6. When the next ACK arrives that acknowledges previously
unacknowledged data, a TCP MUST set cwnd to ssthresh (the value unacknowledged data, a TCP MUST set cwnd to ssthresh (the value
set in step 2). This is termed "deflating" the window. set in step 2). This is termed "deflating" the window.
This ACK should be the acknowledgment elicited by the This ACK should be the acknowledgment elicited by the
retransmission from step 3, one RTT after the retransmission retransmission from step 3, one RTT after the retransmission
(though it may arrive sooner in the presence of significant out- (though it may arrive sooner in the presence of significant out-
of-order delivery of data segments at the receiver). of-order delivery of data segments at the receiver).
Additionally, this ACK should acknowledge all the intermediate Additionally, this ACK should acknowledge all the intermediate
skipping to change at page 8, line 56 skipping to change at page 9, line 10
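The window arithmetic of steps 2, 3, 4 and 6 above can be sketched as a small state holder. This is an illustration under stated assumptions, not an implementation of either draft: the class and method names are hypothetical, the caller is assumed to classify ACKs and to pass a FlightSize that excludes any limited-transmit data, and the actual retransmission and the sending of new data (steps 1 and 5) are left to the caller.

```python
from dataclasses import dataclass


@dataclass
class RecoveryState:
    cwnd: int
    ssthresh: int
    smss: int
    dup_acks: int = 0
    in_recovery: bool = False

    def on_duplicate_ack(self, flight_size: int) -> bool:
        """Return True when the segment at SND.UNA should be retransmitted."""
        self.dup_acks += 1
        if self.dup_acks < 3:
            return False                  # step 1: cwnd is left unchanged
        if self.dup_acks == 3:
            # Step 2: flight_size must exclude limited-transmit data.
            self.ssthresh = max(flight_size // 2, 2 * self.smss)
            # Step 3: inflate by the three segments that left the network.
            self.cwnd = self.ssthresh + 3 * self.smss
            self.in_recovery = True
            return True
        self.cwnd += self.smss            # step 4: one more segment has left
        return False

    def on_new_ack(self) -> None:
        """Step 6: deflate the window when new data is acknowledged."""
        if self.in_recovery:
            self.cwnd = self.ssthresh
            self.in_recovery = False
        self.dup_acks = 0
```

Starting from cwnd = FlightSize = 14600 with SMSS = 1460, the third duplicate ACK sets ssthresh to 7300 and cwnd to 11680, a fourth duplicate ACK inflates cwnd to 13140, and the next ACK for new data deflates cwnd back to 7300.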
4.1 Re-starting Idle Connections 4.1 Re-starting Idle Connections
A known problem with the TCP congestion control algorithms described A known problem with the TCP congestion control algorithms described
above is that they allow a potentially inappropriate burst of above is that they allow a potentially inappropriate burst of
traffic to be transmitted after TCP has been idle for a relatively traffic to be transmitted after TCP has been idle for a relatively
long period of time. After an idle period, TCP cannot use the ACK long period of time. After an idle period, TCP cannot use the ACK
clock to strobe new segments into the network, as all the ACKs have clock to strobe new segments into the network, as all the ACKs have
drained from the network. Therefore, as specified above, TCP can drained from the network. Therefore, as specified above, TCP can
potentially send a cwnd-size line-rate burst into the network after potentially send a cwnd-size line-rate burst into the network after
an idle period. an idle period. In addition, changing network conditions may have
rendered TCP's notion of the available end-to-end network capacity
between two endpoints, as estimated by cwnd, inaccurate during the
course of a long idle period.
[Jac88] recommends that a TCP use slow start to restart [Jac88] recommends that a TCP use slow start to restart
transmission after a relatively long idle period. Slow start transmission after a relatively long idle period. Slow start
serves to restart the ACK clock, just as it does at the beginning serves to restart the ACK clock, just as it does at the beginning
of a transfer. This mechanism has been widely deployed in the of a transfer. This mechanism has been widely deployed in the
following manner. When TCP has not received a segment for more following manner. When TCP has not received a segment for more
than one retransmission timeout, cwnd is reduced to the value of than one retransmission timeout, cwnd is reduced to the value of
the restart window (RW) before transmission begins. the restart window (RW) before transmission begins.
For the purposes of this standard, we define RW = min(IW,cwnd). For the purposes of this standard, we define RW = min(IW,cwnd).
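The widely deployed restart behavior described above, combined with the definition RW = min(IW,cwnd), can be sketched as follows. The function and parameter names are hypothetical, and the idle time is taken here to be the time since a segment was last received, following the wording above.

```python
def cwnd_after_idle(cwnd: int, iw: int, idle_time: float, rto: float) -> int:
    """Reduce cwnd to RW = min(IW, cwnd) if idle for more than one RTO."""
    if idle_time > rto:
        return min(iw, cwnd)
    return cwnd
```

Because RW is a minimum, a connection whose cwnd is already below IW keeps its smaller window after the idle period rather than being raised to IW.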
skipping to change at page 9, line 42 skipping to change at page 9, line 52
generated for at least every second full-sized segment, and MUST be generated for at least every second full-sized segment, and MUST be
generated within 500 ms of the arrival of the first unacknowledged generated within 500 ms of the arrival of the first unacknowledged
packet. packet.
The requirement that an ACK "SHOULD" be generated for at least every The requirement that an ACK "SHOULD" be generated for at least every
second full-sized segment is listed in [RFC1122] in one place as a second full-sized segment is listed in [RFC1122] in one place as a
SHOULD and another as a MUST. Here we unambiguously state it is a SHOULD and another as a MUST. Here we unambiguously state it is a
SHOULD. We also emphasize that this is a SHOULD, meaning that an SHOULD. We also emphasize that this is a SHOULD, meaning that an
implementor should indeed only deviate from this requirement after implementor should indeed only deviate from this requirement after
careful consideration of the implications. See the discussion of careful consideration of the implications. See the discussion of
"Stretch ACK violation" in [RFC2525] and the references therein for a "Stretch ACK violation" in [RFC2525] and the references therein for
discussion of the possible performance problems with generating ACKs a discussion of the possible performance problems with generating
less frequently than every second full-sized segment. ACKs less frequently than every second full-sized segment.
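As a rough illustration of the ACK generation requirements discussed above, the following sketch tracks when a delayed ACK becomes due. It is an assumption-laden example, not an implementation from the draft: the class name, method names, and the full_size parameter (the receiver's notion of a full-sized segment, in bytes) are hypothetical, and timestamps are plain seconds supplied by the caller.

```python
MAX_ACK_DELAY = 0.5  # seconds; the 500 ms bound stated in the text above


class DelayedAckState:
    """Hypothetical receiver-side bookkeeping for delayed ACKs."""

    def __init__(self, full_size: int):
        self.full_size = full_size      # bytes in a full-sized segment (assumed known)
        self.full_segments = 0          # full-sized segments not yet acknowledged
        self.first_unacked_at = None    # arrival time of first unacknowledged segment

    def _reset(self) -> None:
        self.full_segments = 0
        self.first_unacked_at = None

    def on_segment(self, length: int, now: float) -> bool:
        """Record an arriving data segment; return True if an ACK is due now."""
        if self.first_unacked_at is None:
            self.first_unacked_at = now
        if length >= self.full_size:
            self.full_segments += 1
        # ACK for at least every second full-sized segment (the SHOULD),
        # and never later than 500 ms after the first unacknowledged arrival.
        if (self.full_segments >= 2
                or now - self.first_unacked_at >= MAX_ACK_DELAY):
            self._reset()
            return True
        return False

    def on_timer(self, now: float) -> bool:
        """Called periodically; return True if the 500 ms bound forces an ACK."""
        if (self.first_unacked_at is not None
                and now - self.first_unacked_at >= MAX_ACK_DELAY):
            self._reset()
            return True
        return False
```

The sketch counts segments rather than bytes for simplicity; a receiver that cannot tell what the sender considers a full-sized segment would need a different trigger.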
In some cases, the sender and receiver may not agree on what In some cases, the sender and receiver may not agree on what
constitutes a full-sized segment. An implementation is deemed to constitutes a full-sized segment. An implementation is deemed to
comply with this requirement if it sends at least one acknowledgment comply with this requirement if it sends at least one acknowledgment
every time it receives 2*RMSS bytes of new data from the sender, every time it receives 2*RMSS bytes of new data from the sender,
where RMSS is the Maximum Segment Size specified by the receiver to where RMSS is the Maximum Segment Size specified by the receiver to
the sender (or the default value of 536 bytes, per [RFC1122], if the the sender (or the default value of 536 bytes, per [RFC1122], if the
receiver does not specify an MSS option during connection receiver does not specify an MSS option during connection
establishment). The sender may be forced to use a segment size less establishment). The sender may be forced to use a segment size less
than RMSS due to the maximum transmission unit (MTU), the path MTU than RMSS due to the maximum transmission unit (MTU), the path MTU
skipping to change at page 11, line 35 skipping to change at page 11, line 46
The Internet to a considerable degree relies on the correct The Internet to a considerable degree relies on the correct
implementation of these algorithms in order to preserve network implementation of these algorithms in order to preserve network
stability and avoid congestion collapse. An attacker could cause stability and avoid congestion collapse. An attacker could cause
TCP endpoints to respond more aggressively in the face of congestion TCP endpoints to respond more aggressively in the face of congestion
by forging excessive duplicate acknowledgments or excessive by forging excessive duplicate acknowledgments or excessive
acknowledgments for new data. Conceivably, such an attack could acknowledgments for new data. Conceivably, such an attack could
drive a portion of the network into congestion collapse. drive a portion of the network into congestion collapse.
6. Changes Between RFC 2001 and RFC 2581 6. Changes Between RFC 2001 and RFC 2581
This document has been extensively rewritten editorially and it is [RFC2001] has been extensively rewritten editorially and it is not
not feasible to itemize the list of changes between the two feasible to itemize the list of changes between [RFC2001] and
documents. The intention of this document is not to change any of [RFC2581]. The intention of [RFC2581] is to not change any of the
the recommendations given in RFC 2001, but to further clarify cases recommendations given in [RFC2001], but to further clarify cases
that were not discussed in detail in 2001. Specifically, this that were not discussed in detail in [RFC2001]. Specifically,
document suggests what TCP connections should do after a relatively [RFC2581] suggests what TCP connections should do after a relatively
long idle period, as well as specifying and clarifying some of the long idle period, as well as specifying and clarifying some of the
issues pertaining to TCP ACK generation. Finally, the allowable issues pertaining to TCP ACK generation. Finally, the allowable
upper bound for the initial congestion window has also been raised upper bound for the initial congestion window has also been raised
from one to two segments. from one to two segments.
7. Changes Relative to RFC 2581 7. Changes Relative to RFC 2581
A specific definition for "duplicate acknowledgment" has been A specific definition for "duplicate acknowledgment" has been
added, based on the definition used by BSD TCP. added, based on the definition used by BSD TCP.
The document now notes that what to do with duplicate ACKs after the The document now notes that what to do with duplicate ACKs after the
retransmission timer has fired is future work and explicitly retransmission timer has fired is future work and explicitly
unspecified in this document. unspecified in this document.
The initial window requirements were changed to allow Larger The initial window requirements were changed to allow Larger
Initial Windows as standardized in [RFC3390]. Additionally, the Initial Windows as standardized in [RFC3390]. Additionally, the
steps to take when an initial window is discovered to be too large steps to take when an initial window is discovered to be too large
skipping to change at page 12, line 41 skipping to change at page 12, line 51
The restart window has been changed to min(IW,cwnd) from IW. This The restart window has been changed to min(IW,cwnd) from IW. This
behavior was described as "experimental" in [RFC2581]. behavior was described as "experimental" in [RFC2581].
It is now recommended that TCP implementors implement an advanced It is now recommended that TCP implementors implement an advanced
loss recovery algorithm conforming to the principles outlined in loss recovery algorithm conforming to the principles outlined in
this document. this document.
The security considerations have been updated to discuss ACK The security considerations have been updated to discuss ACK
division and recommend byte counting as a counter to this attack. division and recommend byte counting as a counter to this attack.
Acknowledgments 8. IANA Considerations
This document contains no IANA considerations, but apparently an
Internet *Draft* can no longer be published without this section.
Acknowledgments
The core algorithms we describe were developed by Van Jacobson The core algorithms we describe were developed by Van Jacobson
[Jac88, Jac90]. In addition, Limited Transmit [RFC3042] was [Jac88, Jac90]. In addition, Limited Transmit [RFC3042] was
developed in conjunction with Hari Balakrishnan and Sally Floyd. developed in conjunction with Hari Balakrishnan and Sally Floyd.
The initial congestion window size specified in this document is a The initial congestion window size specified in this document is a
result of work with Sally Floyd and Craig Partridge result of work with Sally Floyd and Craig Partridge
[RFC2414,RFC3390]. [RFC2414,RFC3390].
W. Richard ("Rich") Stevens wrote the first version of this document W. Richard ("Rich") Stevens wrote the first version of this document
[RFC2001] and co-authored the second version [RFC2581]. This [RFC2001] and co-authored the second version [RFC2581]. This
present version much benefits from his clarity and thoughtfulness of present version much benefits from his clarity and thoughtfulness of
skipping to change at page 13, line 13 skipping to change at page 13, line 28
We wish to emphasize that the shortcomings and mistakes of this We wish to emphasize that the shortcomings and mistakes of this
document are solely the responsibility of the current authors. document are solely the responsibility of the current authors.
Some of the text from this document is taken from "TCP/IP Some of the text from this document is taken from "TCP/IP
Illustrated, Volume 1: The Protocols" by W. Richard Stevens Illustrated, Volume 1: The Protocols" by W. Richard Stevens
(Addison-Wesley, 1994) and "TCP/IP Illustrated, Volume 2: The (Addison-Wesley, 1994) and "TCP/IP Illustrated, Volume 2: The
Implementation" by Gary R. Wright and W. Richard Stevens (Addison- Implementation" by Gary R. Wright and W. Richard Stevens (Addison-
Wesley, 1995). This material is used with the permission of Wesley, 1995). This material is used with the permission of
Addison-Wesley. Addison-Wesley.
Steve Arden, Neal Cardwell, Noritoshi Demizu, Kevin Fall, John Anil Agarwal, Steve Arden, Neal Cardwell, Noritoshi Demizu, Gorry
Heffner, Alfred Hoenes, Sally Floyd, Reiner Ludwig, Matt Mathis, Fairhurst, Kevin Fall, John Heffner, Alfred Hoenes, Sally Floyd,
Craig Partridge and Joe Touch contributed a number of helpful Reiner Ludwig, Matt Mathis, Craig Partridge and Joe Touch
suggestions. contributed a number of helpful suggestions.
Normative References Normative References
[RFC793] Postel, J., "Transmission Control Protocol", STD 7, RFC [RFC793] Postel, J., "Transmission Control Protocol", STD 7, RFC
793, September 1981. 793, September 1981.
[RFC1122] Braden, R., "Requirements for Internet Hosts -- [RFC1122] Braden, R., "Requirements for Internet Hosts --
Communication Layers", STD 3, RFC 1122, October 1989. Communication Layers", STD 3, RFC 1122, October 1989.
[RFC1191] Mogul, J. and S. Deering, "Path MTU Discovery", RFC 1191, [RFC1191] Mogul, J. and S. Deering, "Path MTU Discovery", RFC 1191,
skipping to change at page 14, line 33 skipping to change at page 14, line 49
[RFC2018] Mathis, M., Mahdavi, J., Floyd, S. and A. Romanow, "TCP [RFC2018] Mathis, M., Mahdavi, J., Floyd, S. and A. Romanow, "TCP
Selective Acknowledgement Options", RFC 2018, October 1996. Selective Acknowledgement Options", RFC 2018, October 1996.
[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
Requirement Levels", BCP 14, RFC 2119, March 1997. Requirement Levels", BCP 14, RFC 2119, March 1997.
[RFC2414] Allman, M., Floyd, S. and C. Partridge, "Increasing TCP's [RFC2414] Allman, M., Floyd, S. and C. Partridge, "Increasing TCP's
Initial Window Size", RFC 2414, September 1998. Initial Window Size", RFC 2414, September 1998.
[RFC2525] Paxson, V., Allman, M., Dawson, S., Fenner, W., Griner, J., [RFC2525] Paxson, V., Allman, M., Dawson, S., Fenner, W., Griner,
Heavens, I., Lahey, K., Semke, J. and B. Volz, "Known TCP J., Heavens, I., Lahey, K., Semke, J. and B. Volz, "Known TCP
Implementation Problems", RFC 2525, March 1999. Implementation Problems", RFC 2525, March 1999.
[RFC2581] Allman, M., Paxson, V., W. Stevens, TCP Congestion [RFC2581] Allman, M., Paxson, V., W. Stevens, TCP Congestion
Control, RFC 2581, April 1999. Control, RFC 2581, April 1999.
[RFC2883] Floyd, S., J. Mahdavi, M. Mathis, M. Podolsky, An [RFC2883] Floyd, S., J. Mahdavi, M. Mathis, M. Podolsky, An
Extension to the Selective Acknowledgement (SACK) Option for Extension to the Selective Acknowledgement (SACK) Option for
TCP, RFC 2883, July 2000. TCP, RFC 2883, July 2000.
[RFC2988] V. Paxson and M. Allman, "Computing TCP's Retransmission [RFC2988] V. Paxson and M. Allman, "Computing TCP's Retransmission
Timer", RFC 2988, November 2000. Timer", RFC 2988, November 2000.
[RFC3042] Allman, M., Balakrishnan, H. and S. Floyd, "Enhancing [RFC3042] Allman, M., Balakrishnan, H. and S. Floyd, "Enhancing
TCP's Loss Recovery Using Limited Transmit", RFC 3042, January TCP's Loss Recovery Using Limited Transmit", RFC 3042, January
2001. 2001.
[RFC3168] K. Ramakrishnan, S. Floyd, D. Black, "The Addition of
Explicit Congestion Notification (ECN) to IP", RFC 3168,
September 2001.
[RFC3390] Allman, M., Floyd, S., C. Partridge, "Increasing TCP's [RFC3390] Allman, M., Floyd, S., C. Partridge, "Increasing TCP's
Initial Window", RFC 3390, October 2002. Initial Window", RFC 3390, October 2002.
[RFC3465] Mark Allman, TCP Congestion Control with Appropriate Byte [RFC3465] Mark Allman, TCP Congestion Control with Appropriate Byte
Counting (ABC), RFC 3465, February 2003. Counting (ABC), RFC 3465, February 2003.
[RFC3517] Ethan Blanton, Mark Allman, Kevin Fall, Lili Wang, A [RFC3517] Ethan Blanton, Mark Allman, Kevin Fall, Lili Wang, A
Conservative Selective Acknowledgment (SACK)-based Loss Recovery Conservative Selective Acknowledgment (SACK)-based Loss Recovery
Algorithm for TCP, RFC 3517, April 2003. Algorithm for TCP, RFC 3517, April 2003.
[RFC3782] Sally Floyd, Tom Henderson, Andrei Gurtov, The NewReno [RFC3782] Sally Floyd, Tom Henderson, Andrei Gurtov, The NewReno
Modification to TCP's Fast Recovery Algorithm, RFC 3782, April Modification to TCP's Fast Recovery Algorithm, RFC 3782, April
2004. 2004.
[RFC4821] Matt Mathis, John Heffner, Packetization Layer Path MTU
Discovery, RFC 4821, March 2007.
[SCWA99] Savage, S., Cardwell, N., Wetherall, D., and T. Anderson, [SCWA99] Savage, S., Cardwell, N., Wetherall, D., and T. Anderson,
"TCP Congestion Control With a Misbehaving Receiver", ACM "TCP Congestion Control With a Misbehaving Receiver", ACM
Computer Communication Review, 29(5), October 1999. Computer Communication Review, 29(5), October 1999.
[Ste94] Stevens, W., "TCP/IP Illustrated, Volume 1: The Protocols", [Ste94] Stevens, W., "TCP/IP Illustrated, Volume 1: The Protocols",
Addison-Wesley, 1994. Addison-Wesley, 1994.
[WS95] Wright, G. and W. Stevens, "TCP/IP Illustrated, Volume 2: The [WS95] Wright, G. and W. Stevens, "TCP/IP Illustrated, Volume 2: The
Implementation", Addison-Wesley, 1995. Implementation", Addison-Wesley, 1995.
Authors' Addresses Authors' Addresses
Mark Allman Mark Allman
ICIR / ICSI International Computer Science Institute (ICSI)
1947 Center Street 1947 Center Street
Suite 600 Suite 600
Berkeley, CA 94704-1198 Berkeley, CA 94704-1198
Phone: +1 440 235 1792 Phone: +1 440 235 1792
EMail: mallman@icir.org EMail: mallman@icir.org
http://www.icir.org/mallman/ http://www.icir.org/mallman/
Vern Paxson Vern Paxson
ICIR / ICSI International Computer Science Institute (ICSI)
1947 Center Street 1947 Center Street
Suite 600 Suite 600
Berkeley, CA 94704-1198 Berkeley, CA 94704-1198
Phone: +1 510/642-4274 x302 Phone: +1 510/642-4274 x302
EMail: vern@icir.org EMail: vern@icir.org
http://www.icir.org/vern/ http://www.icir.org/vern/
Ethan Blanton Ethan Blanton
Purdue University Computer Sciences Purdue University Computer Sciences
1398 Computer Science Building 1398 Computer Science Building
skipping to change at page 16, line 32 skipping to change at page 16, line 54
on an "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE on an "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE
REPRESENTS OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY, THE REPRESENTS OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY, THE
IETF TRUST AND THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL IETF TRUST AND THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL
WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY
WARRANTY THAT THE USE OF THE INFORMATION HEREIN WILL NOT INFRINGE WARRANTY THAT THE USE OF THE INFORMATION HEREIN WILL NOT INFRINGE
ANY RIGHTS OR ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS ANY RIGHTS OR ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS
FOR A PARTICULAR PURPOSE. FOR A PARTICULAR PURPOSE.
Copyright Statement Copyright Statement
Copyright (C) The IETF Trust (2007). This document is subject to Copyright (C) The IETF Trust (2008). This document is subject to
the rights, licenses and restrictions contained in BCP 78, and the rights, licenses and restrictions contained in BCP 78, and
except as set forth therein, the authors retain all their rights. except as set forth therein, the authors retain all their rights.
Acknowledgment Acknowledgment
Funding for the RFC Editor function is currently provided by the Funding for the RFC Editor function is currently provided by the
Internet Society. Internet Society.
 End of changes. 34 change blocks. 
70 lines changed or deleted 88 lines changed or added
