draft-ietf-tcpm-rfc2581bis-03.txt | draft-ietf-tcpm-rfc2581bis-04.txt | |||
---|---|---|---|---|
Network Working Group M. Allman | Network Working Group M. Allman | |||
Internet-Draft V. Paxson | Internet-Draft V. Paxson | |||
Expires: March 2008 ICIR / ICSI | Expires: October 2008 ICSI | |||
E. Blanton | E. Blanton | |||
Purdue University | Purdue University | |||
September 2007 | ||||
TCP Congestion Control | TCP Congestion Control | |||
draft-ietf-tcpm-rfc2581bis-03.txt | draft-ietf-tcpm-rfc2581bis-04.txt | |||
Status of this Memo | Status of this Memo | |||
By submitting this Internet-Draft, each author represents that any | By submitting this Internet-Draft, each author represents that any | |||
applicable patent or other IPR claims of which he or she is aware | applicable patent or other IPR claims of which he or she is aware | |||
have been or will be disclosed, and any of which he or she becomes | have been or will be disclosed, and any of which he or she becomes | |||
aware will be disclosed, in accordance with Section 6 of BCP 79. | aware will be disclosed, in accordance with Section 6 of BCP 79. | |||
Internet-Drafts are working documents of the Internet Engineering | Internet-Drafts are working documents of the Internet Engineering | |||
Task Force (IETF), its areas, and its working groups. Note that | Task Force (IETF), its areas, and its working groups. Note that | |||
skipping to change at page 1, line 36 | skipping to change at page 1, line 34 | |||
months and may be updated, replaced, or obsoleted by other documents | months and may be updated, replaced, or obsoleted by other documents | |||
at any time. It is inappropriate to use Internet-Drafts as | at any time. It is inappropriate to use Internet-Drafts as | |||
reference material or to cite them other than as "work in progress." | reference material or to cite them other than as "work in progress." | |||
The list of current Internet-Drafts can be accessed at | The list of current Internet-Drafts can be accessed at | |||
http://www.ietf.org/ietf/1id-abstracts.txt. | http://www.ietf.org/ietf/1id-abstracts.txt. | |||
The list of Internet-Draft Shadow Directories can be accessed at | The list of Internet-Draft Shadow Directories can be accessed at | |||
http://www.ietf.org/shadow.html. | http://www.ietf.org/shadow.html. | |||
Copyright Notice | ||||
Copyright (C) The Internet Society (2007). | ||||
Abstract | Abstract | |||
This document defines TCP's four intertwined congestion control | This document defines TCP's four intertwined congestion control | |||
algorithms: slow start, congestion avoidance, fast retransmit, and | algorithms: slow start, congestion avoidance, fast retransmit, and | |||
fast recovery. In addition, the document specifies how TCP should | fast recovery. In addition, the document specifies how TCP should | |||
begin transmission after a relatively long idle period, as well as | begin transmission after a relatively long idle period, as well as | |||
discussing various acknowledgment generation methods. | discussing various acknowledgment generation methods. | |||
1. Introduction | 1. Introduction | |||
This document specifies four TCP [RFC793] congestion control | This document specifies four TCP [RFC793] congestion control | |||
algorithms: slow start, congestion avoidance, fast retransmit and | algorithms: slow start, congestion avoidance, fast retransmit and | |||
fast recovery. These algorithms were devised in [Jac88] and | fast recovery. These algorithms were devised in [Jac88] and | |||
[Jac90]. Their use with TCP is standardized in [RFC1122]. Additional | [Jac90]. Their use with TCP is standardized in [RFC1122]. | |||
early work in additive-increase, multiplicative-decrease congestion | Additional early work in additive-increase, multiplicative-decrease | |||
control is given in [CJ89]. | congestion control is given in [CJ89]. | |||
This document obsoletes [RFC2581] which in turn obsoleted | This document obsoletes [RFC2581] which in turn obsoleted | |||
[RFC2001]. | [RFC2001]. | |||
In addition to specifying the congestion control algorithms, this | In addition to specifying the congestion control algorithms, this | |||
document specifies what TCP connections should do after a relatively | document specifies what TCP connections should do after a relatively | |||
long idle period, as well as specifying and clarifying some of the | long idle period, as well as specifying and clarifying some of the | |||
issues pertaining to TCP ACK generation. | issues pertaining to TCP ACK generation. | |||
Note that [Ste94] provides examples of these algorithms in action | Note that [Ste94] provides examples of these algorithms in action | |||
skipping to change at page 2, line 39 | skipping to change at page 2, line 34 | |||
This section provides the definition of several terms that will be | This section provides the definition of several terms that will be | |||
used throughout the remainder of this document. | used throughout the remainder of this document. | |||
SEGMENT: A segment is ANY TCP/IP data or acknowledgment packet (or | SEGMENT: A segment is ANY TCP/IP data or acknowledgment packet (or | |||
both). | both). | |||
SENDER MAXIMUM SEGMENT SIZE (SMSS): The SMSS is the size of the | SENDER MAXIMUM SEGMENT SIZE (SMSS): The SMSS is the size of the | |||
largest segment that the sender can transmit. This value can be | largest segment that the sender can transmit. This value can be | |||
based on the maximum transmission unit of the network, the path | based on the maximum transmission unit of the network, the path | |||
MTU discovery [RFC1191] algorithm, RMSS (see next item), or other | MTU discovery [RFC1191,RFC4821] algorithm, RMSS (see next item), | |||
factors. The size does not include the TCP/IP headers and | or other factors. The size does not include the TCP/IP headers | |||
options. | and options. | |||
RECEIVER MAXIMUM SEGMENT SIZE (RMSS): The RMSS is the size of the | RECEIVER MAXIMUM SEGMENT SIZE (RMSS): The RMSS is the size of the | |||
largest segment the receiver is willing to accept. This is the | largest segment the receiver is willing to accept. This is the | |||
value specified in the MSS option sent by the receiver during | value specified in the MSS option sent by the receiver during | |||
connection startup. Or, if the MSS option is not used, 536 | connection startup. Or, if the MSS option is not used, 536 | |||
bytes [RFC1122]. The size does not include the TCP/IP headers and | bytes [RFC1122]. The size does not include the TCP/IP headers | |||
options. | and options. | |||
FULL-SIZED SEGMENT: A segment that contains the maximum number of | FULL-SIZED SEGMENT: A segment that contains the maximum number of | |||
data bytes permitted (i.e., a segment containing SMSS bytes of | data bytes permitted (i.e., a segment containing SMSS bytes of | |||
data). | data). | |||
RECEIVER WINDOW (rwnd): The most recently advertised receiver | RECEIVER WINDOW (rwnd): The most recently advertised receiver | |||
window. | window. | |||
CONGESTION WINDOW (cwnd): A TCP state variable that limits the | CONGESTION WINDOW (cwnd): A TCP state variable that limits the | |||
amount of data a TCP can send. At any given time, a TCP MUST | amount of data a TCP can send. At any given time, a TCP MUST | |||
skipping to change at page 3, line 21 | skipping to change at page 3, line 18 | |||
LOSS WINDOW (LW): The loss window is the size of the congestion | LOSS WINDOW (LW): The loss window is the size of the congestion | |||
window after a TCP sender detects loss using its retransmission | window after a TCP sender detects loss using its retransmission | |||
timer. | timer. | |||
RESTART WINDOW (RW): The restart window is the size of the | RESTART WINDOW (RW): The restart window is the size of the | |||
congestion window after a TCP restarts transmission after an | congestion window after a TCP restarts transmission after an | |||
idle period (if the slow start algorithm is used; see section | idle period (if the slow start algorithm is used; see section | |||
4.1 for more discussion). | 4.1 for more discussion). | |||
FLIGHT SIZE: The amount of data that has been sent but not yet | FLIGHT SIZE: The amount of data that has been sent but not yet | |||
acknowledged. | cumulatively acknowledged. | |||
DUPLICATE ACKNOWLEDGMENT: An acknowledgment is considered a | DUPLICATE ACKNOWLEDGMENT: An acknowledgment is considered a | |||
"duplicate" in the following algorithms when (a) the receiver of | "duplicate" in the following algorithms when (a) the receiver of | |||
the ACK has outstanding data, (b) the incoming acknowledgment | the ACK has outstanding data, (b) the incoming acknowledgment | |||
carries no data, (c) the SYN and FIN bits are both off, (d) the | carries no data, (c) the SYN and FIN bits are both off, (d) the | |||
acknowledgment number is equal to the greatest acknowledgment | acknowledgment number is equal to the greatest acknowledgment | |||
received on the given connection (TCP.UNA from [RFC793]) and (e) | received on the given connection (TCP.UNA from [RFC793]) and (e) | |||
the advertised window in the incoming acknowledgment equals the | the advertised window in the incoming acknowledgment equals the | |||
advertised window in the last incoming acknowledgment. | advertised window in the last incoming acknowledgment. | |||
Alternatively, a TCP that utilizes selective acknowledgments | Alternatively, a TCP that utilizes selective acknowledgments | |||
[RFC2018,RFC2883] can determine an incoming ACK is a "duplicate" | [RFC2018,RFC2883] can leverage the SACK information to determine | |||
if the ACK contains previously unknown SACK information. | when an incoming ACK is a "duplicate" (e.g., if the ACK contains | |||
previously unknown SACK information). | ||||
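The five conditions (a) through (e) above can be read as a single predicate. The following is a minimal sketch in Python, not taken from either draft; the `AckInfo` record and the parameter names (`snd_una`, `last_window`, `outstanding_bytes`) are hypothetical and stand in for whatever per-connection state an implementation keeps. The SACK-based alternative is not modelled.

```python
from dataclasses import dataclass


@dataclass
class AckInfo:
    """Hypothetical summary of an incoming segment (illustrative only)."""
    ack_number: int   # acknowledgment number carried by the segment
    payload_len: int  # number of data bytes carried by the segment
    syn: bool         # SYN bit
    fin: bool         # FIN bit
    window: int       # advertised window carried by the segment


def is_duplicate_ack(ack: AckInfo, snd_una: int, last_window: int,
                     outstanding_bytes: int) -> bool:
    """Apply conditions (a)-(e) of the DUPLICATE ACKNOWLEDGMENT definition."""
    return (
        outstanding_bytes > 0              # (a) there is outstanding data
        and ack.payload_len == 0           # (b) the ACK carries no data
        and not ack.syn and not ack.fin    # (c) SYN and FIN are both off
        and ack.ack_number == snd_una      # (d) equals greatest ACK received
        and ack.window == last_window      # (e) advertised window unchanged
    )
```

Condition (e) is what keeps a pure window update from being counted as a duplicate.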
3. Congestion Control Algorithms | 3. Congestion Control Algorithms | |||
This section defines the four congestion control algorithms: slow | This section defines the four congestion control algorithms: slow | |||
start, congestion avoidance, fast retransmit and fast recovery, | start, congestion avoidance, fast retransmit and fast recovery, | |||
developed in [Jac88] and [Jac90]. In some situations it may be | developed in [Jac88] and [Jac90]. In some situations it may be | |||
beneficial for a TCP sender to be more conservative than the | beneficial for a TCP sender to be more conservative than the | |||
algorithms allow, however a TCP MUST NOT be more aggressive than the | algorithms allow, however a TCP MUST NOT be more aggressive than the | |||
following algorithms allow (that is, MUST NOT send data when the | following algorithms allow (that is, MUST NOT send data when the | |||
value of cwnd computed by the following algorithms would not allow | value of cwnd computed by the following algorithms would not allow | |||
the data to be sent). | the data to be sent). | |||
Also note that the algorithms specified in this document work in | ||||
terms of using loss as the signal of congestion. Explicit | ||||
Congestion Notification (ECN) could also be used as specified in | ||||
[RFC3168]. | ||||
3.1 Slow Start and Congestion Avoidance | 3.1 Slow Start and Congestion Avoidance | |||
The slow start and congestion avoidance algorithms MUST be used by a | The slow start and congestion avoidance algorithms MUST be used by a | |||
TCP sender to control the amount of outstanding data being injected | TCP sender to control the amount of outstanding data being injected | |||
into the network. To implement these algorithms, two variables are | into the network. To implement these algorithms, two variables are | |||
added to the TCP per-connection state. The congestion window (cwnd) | added to the TCP per-connection state. The congestion window (cwnd) | |||
is a sender-side limit on the amount of data the sender can transmit | is a sender-side limit on the amount of data the sender can transmit | |||
into the network before receiving an acknowledgment (ACK), while the | into the network before receiving an acknowledgment (ACK), while the | |||
receiver's advertised window (rwnd) is a receiver-side limit on the | receiver's advertised window (rwnd) is a receiver-side limit on the | |||
amount of outstanding data. The minimum of cwnd and rwnd governs | amount of outstanding data. The minimum of cwnd and rwnd governs | |||
skipping to change at page 4, line 14 | skipping to change at page 4, line 16 | |||
Another state variable, the slow start threshold (ssthresh), is used | Another state variable, the slow start threshold (ssthresh), is used | |||
to determine whether the slow start or congestion avoidance | to determine whether the slow start or congestion avoidance | |||
algorithm is used to control data transmission, as discussed below. | algorithm is used to control data transmission, as discussed below. | |||
Beginning transmission into a network with unknown conditions | Beginning transmission into a network with unknown conditions | |||
requires TCP to slowly probe the network to determine the available | requires TCP to slowly probe the network to determine the available | |||
capacity, in order to avoid congesting the network with an | capacity, in order to avoid congesting the network with an | |||
inappropriately large burst of data. The slow start algorithm is | inappropriately large burst of data. The slow start algorithm is | |||
used for this purpose at the beginning of a transfer, or after | used for this purpose at the beginning of a transfer, or after | |||
repairing loss detected by the retransmission timer. | repairing loss detected by the retransmission timer. Slow start | |||
additionally serves to start the "ACK clock" used by the TCP sender | ||||
to release data into the network in the slow start, congestion | ||||
avoidance, and loss recovery algorithms. | ||||
IW, the initial value of cwnd, MUST be set using the following | IW, the initial value of cwnd, MUST be set using the following | |||
guidelines as an upper bound. | guidelines as an upper bound. | |||
If SMSS > 2190 bytes: | If SMSS > 2190 bytes: | |||
IW = 2 * SMSS bytes and MUST NOT be more than 2 segments | IW = 2 * SMSS bytes and MUST NOT be more than 2 segments | |||
If (SMSS > 1095 bytes) and (SMSS <= 2190 bytes): | If (SMSS > 1095 bytes) and (SMSS <= 2190 bytes): | |||
IW = 3 * SMSS bytes and MUST NOT be more than 3 segments | IW = 3 * SMSS bytes and MUST NOT be more than 3 segments | |||
If SMSS <= 1095 bytes: | If SMSS <= 1095 bytes: | |||
IW = 4 * SMSS bytes and MUST NOT be more than 4 segments | IW = 4 * SMSS bytes and MUST NOT be more than 4 segments | |||
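The three cases above amount to a small lookup on SMSS. A minimal Python sketch follows; the function name `initial_window` is an illustrative choice, and the value returned is the upper bound in bytes, which a sender is free to undercut.

```python
def initial_window(smss: int) -> int:
    """Return the upper bound on IW, in bytes, for a given SMSS."""
    if smss > 2190:
        segments = 2
    elif smss > 1095:
        segments = 3
    else:
        segments = 4
    return segments * smss
```

For example, with an SMSS of 1460 bytes the bound is 3 * 1460 = 4380 bytes.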
As specified in [RFC3390], the SYN/ACK and the acknowledgment of the | As specified in [RFC3390], the SYN/ACK and the acknowledgment of the | |||
SYN/ACK MUST NOT increase the size of the congestion window. | SYN/ACK MUST NOT increase the size of the congestion window. | |||
Further, if the SYN or SYN/ACK is lost, the initial window used by a | Further, if the SYN or SYN/ACK is lost, the initial window used by a | |||
sender after a correctly transmitted SYN MUST be one segment | sender after a correctly transmitted SYN MUST be one segment | |||
consisting of at most SMSS bytes. | consisting of at most SMSS bytes. | |||
A detailed rationale and discussion of the IW setting is provided in | A detailed rationale and discussion of the IW setting is provided in | |||
[RFC3390]. | [RFC3390]. | |||
When larger initial windows are implemented along with Path MTU | When initial congestion windows of more than one segment are | |||
Discovery [RFC1191], and the MSS being used is found to be too | implemented along with Path MTU Discovery [RFC1191], and the MSS | |||
large, the congestion window cwnd SHOULD be reduced to prevent | being used is found to be too large, the congestion window cwnd | |||
large bursts of smaller segments. Specifically, cwnd SHOULD be | SHOULD be reduced to prevent large bursts of smaller segments. | |||
reduced by the ratio of the old segment size to the new segment | Specifically, cwnd SHOULD be reduced by the ratio of the old segment | |||
size. | size to the new segment size. | |||
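One possible reading of this reduction, sketched in Python under stated assumptions: cwnd is scaled by new_mss / old_mss, so that the window still covers the same number of segments after the segment size shrinks. The function name `rescale_cwnd` and the use of integer division are illustrative choices, not something the text specifies.

```python
def rescale_cwnd(cwnd: int, old_mss: int, new_mss: int) -> int:
    """Scale cwnd (bytes) down when Path MTU Discovery lowers the MSS.

    Assumption: "reduced by the ratio of the old segment size to the
    new segment size" is read as cwnd / (old_mss / new_mss).
    """
    if new_mss >= old_mss:
        return cwnd  # the MSS did not shrink, so nothing to reduce
    return cwnd * new_mss // old_mss
```

With cwnd = 4380 bytes (three 1460-byte segments) and a new MSS of 1000 bytes, the result is 3000 bytes, which is still three segments rather than a burst of four smaller ones.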
The initial value of ssthresh SHOULD be set arbitrarily high (e.g., | The initial value of ssthresh SHOULD be set arbitrarily high (e.g., | |||
to the size of the largest possible advertised window), but ssthresh | to the size of the largest possible advertised window), but ssthresh | |||
MUST be reduced in response to congestion. Setting ssthresh as high | MUST be reduced in response to congestion. Setting ssthresh as high | |||
as possible allows the network conditions, rather than some | as possible allows the network conditions, rather than some | |||
arbitrary host limit, to dictate the sending rate. In cases where | arbitrary host limit, to dictate the sending rate. In cases where | |||
the end systems have a solid understanding of the network path, more | the end systems have a solid understanding of the network path, more | |||
carefully setting the initial ssthresh value may have merit (e.g., | carefully setting the initial ssthresh value may have merit (e.g., | |||
such that the end host does not create congestion along the path). | such that the end host does not create congestion along the path). | |||
The slow start algorithm is used when cwnd < ssthresh, while the | The slow start algorithm is used when cwnd < ssthresh, while the | |||
congestion avoidance algorithm is used when cwnd > ssthresh. When | congestion avoidance algorithm is used when cwnd > ssthresh. When | |||
cwnd and ssthresh are equal the sender may use either slow start or | cwnd and ssthresh are equal the sender may use either slow start or | |||
congestion avoidance. | congestion avoidance. | |||
During slow start, a TCP increments cwnd by at most SMSS bytes for | During slow start, a TCP increments cwnd by at most SMSS bytes for | |||
each ACK received that acknowledges new data. Slow start ends when | each ACK received that cumulatively acknowledges new data. Slow | |||
cwnd exceeds ssthresh (or, optionally, when it reaches it, as noted | start ends when cwnd exceeds ssthresh (or, optionally, when it | |||
above) or when congestion is observed. While traditionally TCP | reaches it, as noted above) or when congestion is observed. While | |||
implementations have increased cwnd by precisely SMSS bytes upon | traditionally TCP implementations have increased cwnd by precisely | |||
receipt of an ACK covering new data, we RECOMMEND that TCP | SMSS bytes upon receipt of an ACK covering new data, we RECOMMEND | |||
implementations increase cwnd, per: | that TCP implementations increase cwnd, per: | |||
cwnd += min (N, SMSS) (2) | cwnd += min (N, SMSS) (2) | |||
where N is the number of previously unacknowledged bytes | where N is the number of previously unacknowledged bytes | |||
acknowledged in the incoming ACK. This adjustment is part of | acknowledged in the incoming ACK. This adjustment is part of | |||
Appropriate Byte Counting [RFC3465] and provides robustness against | Appropriate Byte Counting [RFC3465] and provides robustness against | |||
misbehaving receivers which may attempt to induce a sender to | misbehaving receivers which may attempt to induce a sender to | |||
artificially inflate cwnd using a mechanism known as "ACK Division" | artificially inflate cwnd using a mechanism known as "ACK Division" | |||
[SCWA99]. ACK Division consists of a receiver sending multiple ACKs | [SCWA99]. ACK Division consists of a receiver sending multiple ACKs | |||
for a single TCP data segment, each acknowledging only a portion of | for a single TCP data segment, each acknowledging only a portion of | |||
skipping to change at page 5, line 29 | skipping to change at page 5, line 35 | |||
inappropriately inflate the amount of data injected into the | inappropriately inflate the amount of data injected into the | |||
network. | network. | |||
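Equation (2) can be sketched in a few lines of Python. The function name and the `newly_acked` parameter (N in the equation) are illustrative; the caller is assumed to invoke it only while cwnd < ssthresh.

```python
def slow_start_on_ack(cwnd: int, smss: int, newly_acked: int) -> int:
    """Apply equation (2): cwnd += min(N, SMSS).

    newly_acked is N, the number of previously unacknowledged bytes
    covered by the incoming ACK. ACKs that cover no new data leave
    cwnd unchanged.
    """
    if newly_acked <= 0:
        return cwnd
    return cwnd + min(newly_acked, smss)
```

Under ACK Division, an ACK that covers only 100 new bytes grows cwnd by 100 bytes rather than by a full SMSS, so splitting one segment's acknowledgment into many ACKs gains the receiver nothing.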
During congestion avoidance, cwnd is incremented by roughly 1 | During congestion avoidance, cwnd is incremented by roughly 1 | |||
full-sized segment per round-trip time (RTT). Congestion avoidance | full-sized segment per round-trip time (RTT). Congestion avoidance | |||
continues until congestion is detected. The basic guidelines for | continues until congestion is detected. The basic guidelines for | |||
incrementing cwnd during congestion avoidance are: | incrementing cwnd during congestion avoidance are: | |||
* MAY increment cwnd by SMSS bytes | * MAY increment cwnd by SMSS bytes | |||
* SHOULD increment cwnd per equation (2) | * SHOULD increment cwnd per equation (2) once per RTT | |||
* MUST NOT increment cwnd by more than SMSS bytes | * MUST NOT increment cwnd by more than SMSS bytes | |||
We note that [RFC3465] allows for cwnd increases of more than SMSS | We note that [RFC3465] allows for cwnd increases of more than SMSS | |||
bytes for incoming acknowledgments during slow start on an | bytes for incoming acknowledgments during slow start on an | |||
experimental basis, however such behavior is not allowed as part of | experimental basis, however such behavior is not allowed as part of | |||
the standard. | the standard. | |||
The RECOMMENDED way to increase cwnd during congestion avoidance is | The RECOMMENDED way to increase cwnd during congestion avoidance is | |||
to count the number of bytes that have been acknowledged by ACKs for | to count the number of bytes that have been acknowledged by ACKs for | |||
skipping to change at page 5, line 52 | skipping to change at page 6, line 4 | |||
acknowledged reaches cwnd, then cwnd can be incremented by up to | acknowledged reaches cwnd, then cwnd can be incremented by up to | |||
SMSS bytes. Note that during congestion avoidance, cwnd MUST NOT be | SMSS bytes. Note that during congestion avoidance, cwnd MUST NOT be | |||
increased by more than SMSS bytes per RTT. This method both allows | increased by more than SMSS bytes per RTT. This method both allows | |||
TCPs to increase cwnd by one segment per RTT in the face of delayed | TCPs to increase cwnd by one segment per RTT in the face of delayed | |||
ACKs and provides robustness against ACK Division attacks. | ACKs and provides robustness against ACK Division attacks. | |||
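The byte-counting method described above can be sketched as follows. This is a minimal Python illustration, not a definitive implementation: the class name and the `bytes_acked` counter are hypothetical, and reducing the counter by cwnd after each increase follows the approach in [RFC3465] rather than anything stated in this section.

```python
class CongestionAvoidance:
    """Byte-counting cwnd growth during congestion avoidance (sketch)."""

    def __init__(self, cwnd: int, smss: int) -> None:
        self.cwnd = cwnd        # congestion window, in bytes
        self.smss = smss        # sender maximum segment size, in bytes
        self.bytes_acked = 0    # new bytes acknowledged since last increase

    def on_ack(self, newly_acked: int) -> int:
        """Account for an ACK covering newly_acked new bytes; return cwnd."""
        self.bytes_acked += newly_acked
        if self.bytes_acked >= self.cwnd:
            # Roughly one RTT's worth of data has been acknowledged.
            self.bytes_acked -= self.cwnd
            self.cwnd += self.smss  # never more than SMSS per increase
        return self.cwnd
```

Because growth is driven by bytes rather than by ACK count, a receiver that delays ACKs (acknowledging two segments at a time) still yields one SMSS of growth per window of data.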
Another common formula that a TCP MAY use to update cwnd during | Another common formula that a TCP MAY use to update cwnd during | |||
congestion avoidance is given in equation 3: | congestion avoidance is given in equation 3: | |||
cwnd += SMSS*SMSS/cwnd (3) | cwnd += SMSS*SMSS/cwnd (3) | |||
This adjustment is executed on every incoming ACK that acknowledges | This adjustment is executed on every incoming ACK that acknowledges | |||
new data. | new data. Equation (3) provides an acceptable approximation to the | |||
Equation (3) provides an acceptable approximation to the underlying | underlying principle of increasing cwnd by 1 full-sized segment per | |||
principle of increasing cwnd by 1 full-sized segment per RTT. (Note | RTT. (Note that for a connection in which the receiver is | |||
that for a connection in which the receiver is acknowledging | acknowledging every other packet, (3) is less aggressive than | |||
every other packet, (3) is less aggressive than allowed -- roughly | allowed -- roughly increasing cwnd every second RTT.) | |||
increasing cwnd every second RTT.) | ||||
Implementation Note: Since integer arithmetic is usually used in TCP | Implementation Note: Since integer arithmetic is usually used in TCP | |||
implementations, the formula given in equation 3 can fail to | implementations, the formula given in equation 3 can fail to | |||
increase cwnd when the congestion window is larger than SMSS*SMSS. | increase cwnd when the congestion window is larger than SMSS*SMSS. | |||
If the above formula yields 0, the result SHOULD be rounded up to 1 | If the above formula yields 0, the result SHOULD be rounded up to 1 | |||
byte. | byte. | |||
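The rounding rule in this note is easy to get wrong with integer arithmetic, so a short Python sketch may help. The function name `ca_increment` is illustrative; it returns the per-ACK increment from equation (3), rounded up to 1 byte when integer division yields 0.

```python
def ca_increment(cwnd: int, smss: int) -> int:
    """Per-ACK cwnd increment from equation (3), using integer math."""
    increment = (smss * smss) // cwnd
    # Integer division yields 0 once cwnd exceeds SMSS*SMSS; round up
    # to 1 byte so the window can still grow.
    return max(increment, 1)
```

With SMSS = 1460 and cwnd = 14600 (ten segments), each ACK adds 146 bytes, so ten ACKs add roughly one segment per RTT.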
Implementation Note: Older implementations have an additional | Implementation Note: Older implementations have an additional | |||
additive constant on the right-hand side of equation (3). This is | additive constant on the right-hand side of equation (3). This is | |||
incorrect and can actually lead to diminished performance [RFC2525]. | incorrect and can actually lead to diminished performance [RFC2525]. | |||
skipping to change at page 6, line 34 | skipping to change at page 6, line 38 | |||
value of ssthresh MUST be set to no more than the value given in | value of ssthresh MUST be set to no more than the value given in | |||
equation 4: | equation 4: | |||
ssthresh = max (FlightSize / 2, 2*SMSS) (4) | ssthresh = max (FlightSize / 2, 2*SMSS) (4) | |||
where, as discussed above, FlightSize is the amount of outstanding | where, as discussed above, FlightSize is the amount of outstanding | |||
data in the network. | data in the network. | |||
On the other hand, when a TCP sender detects segment loss using the | On the other hand, when a TCP sender detects segment loss using the | |||
retransmission timer and the given segment has already been | retransmission timer and the given segment has already been | |||
retransmitted at least once, the value of ssthresh is held | retransmitted by way of the retransmission timer at least once, the | |||
constant. | value of ssthresh is held constant. | |||
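Equation (4) and the hold-constant rule for repeated timeouts can be sketched together. This is a minimal Python illustration; the function names and the `already_retransmitted_by_timer` flag are hypothetical, and the value returned by equation (4) is an upper bound that an implementation may undercut.

```python
def ssthresh_on_loss(flight_size: int, smss: int) -> int:
    """Equation (4): upper bound on ssthresh after loss is detected."""
    return max(flight_size // 2, 2 * smss)


def ssthresh_on_rto(ssthresh: int, flight_size: int, smss: int,
                    already_retransmitted_by_timer: bool) -> int:
    """Return the ssthresh to use when the retransmission timer fires."""
    if already_retransmitted_by_timer:
        # A repeated timeout for the same segment leaves ssthresh alone.
        return ssthresh
    return ssthresh_on_loss(flight_size, smss)
```

Note that the input is FlightSize, not cwnd, for the reason given in the implementation note that follows.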
Implementation Note: An easy mistake to make is to simply use cwnd, | Implementation Note: An easy mistake to make is to simply use cwnd, | |||
rather than FlightSize, which in some implementations may | rather than FlightSize, which in some implementations may | |||
incidentally increase well beyond rwnd. | incidentally increase well beyond rwnd. | |||
Furthermore, upon a timeout (as specified in [RFC2988]) cwnd MUST be | Furthermore, upon a timeout (as specified in [RFC2988]) cwnd MUST be | |||
set to no more than the loss window, LW, which equals 1 full-sized | set to no more than the loss window, LW, which equals 1 full-sized | |||
segment (regardless of the value of IW). Therefore, after | segment (regardless of the value of IW). Therefore, after | |||
retransmitting the dropped segment the TCP sender uses the slow | retransmitting the dropped segment the TCP sender uses the slow | |||
start algorithm to increase the window from 1 full-sized segment to | start algorithm to increase the window from 1 full-sized segment to | |||
skipping to change at page 8, line 6 | skipping to change at page 8, line 11 | |||
TCP SHOULD send a segment of previously unsent data per | TCP SHOULD send a segment of previously unsent data per | |||
[RFC3042] provided that the receiver's advertised window allows, | [RFC3042] provided that the receiver's advertised window allows, | |||
the total FlightSize would remain less than or equal to cwnd | the total FlightSize would remain less than or equal to cwnd | |||
plus 2*SMSS, and that new data is available for transmission. | plus 2*SMSS, and that new data is available for transmission. | |||
Further, the TCP sender MUST NOT change cwnd to reflect these | Further, the TCP sender MUST NOT change cwnd to reflect these | |||
two segments [RFC3042]. Note that a sender using SACK [RFC2018] | two segments [RFC3042]. Note that a sender using SACK [RFC2018] | |||
MUST NOT send new data unless the incoming duplicate | MUST NOT send new data unless the incoming duplicate | |||
acknowledgment contains new SACK information. | acknowledgment contains new SACK information. | |||
2. When the third duplicate ACK is received, a TCP MUST set | 2. When the third duplicate ACK is received, a TCP MUST set | |||
ssthresh to no more than the value given in equation 4. | ssthresh to no more than the value given in equation 4. When | |||
[RFC3042] is in use, additional data sent in limited transmit | ||||
MUST NOT be included in this calculation. | ||||
3. The lost segment MUST be retransmitted and cwnd set to | 3. The lost segment starting at SND.UNA MUST be retransmitted and | |||
ssthresh plus 3*SMSS. This artificially "inflates" the | cwnd set to ssthresh plus 3*SMSS. This artificially "inflates" | |||
congestion window by the number of segments (three) that have | the congestion window by the number of segments (three) that | |||
left the network and which the receiver has buffered. | have left the network and which the receiver has buffered. | |||
4. For each additional duplicate ACK received (after the third), | 4. For each additional duplicate ACK received (after the third), | |||
cwnd MUST be incremented by SMSS. This artificially inflates | cwnd MUST be incremented by SMSS. This artificially inflates | |||
the congestion window in order to reflect the additional segment | the congestion window in order to reflect the additional segment | |||
that has left the network. | that has left the network. | |||
Note: [SCWA99] discusses a receiver-based attack whereby many | Note: [SCWA99] discusses a receiver-based attack whereby many | |||
bogus duplicate ACKs are sent to the data sender in order to | bogus duplicate ACKs are sent to the data sender in order to | |||
artificially inflate cwnd and cause a higher than appropriate | artificially inflate cwnd and cause a higher than appropriate | |||
sending rate to be used. A TCP MAY therefore limit the number | sending rate to be used. A TCP MAY therefore limit the number | |||
of times cwnd is artificially inflated during loss recovery | of times cwnd is artificially inflated during loss recovery | |||
to the number of outstanding segments (or, an approximation | to the number of outstanding segments (or, an approximation | |||
thereof). | thereof). | |||
5. Transmit a segment, if allowed by the new value of cwnd and the | 5. When previously unsent data is available and the new value of | |||
receiver's advertised window. | cwnd and the receiver's advertised window allow, a TCP SHOULD | |||
send 1*SMSS bytes of previously unsent data. | ||||
6. When the next ACK arrives that acknowledges previously | 6. When the next ACK arrives that acknowledges previously | |||
unacknowledged data, a TCP MUST set cwnd to ssthresh (the value | unacknowledged data, a TCP MUST set cwnd to ssthresh (the value | |||
set in step 2). This is termed "deflating" the window. | set in step 2). This is termed "deflating" the window. | |||
This ACK should be the acknowledgment elicited by the | This ACK should be the acknowledgment elicited by the | |||
retransmission from step 3, one RTT after the retransmission | retransmission from step 3, one RTT after the retransmission | |||
(though it may arrive sooner in the presence of significant out- | (though it may arrive sooner in the presence of significant out- | |||
of-order delivery of data segments at the receiver). | of-order delivery of data segments at the receiver). | |||
Additionally, this ACK should acknowledge all the intermediate | Additionally, this ACK should acknowledge all the intermediate | |||
skipping to change at page 8, line 56 | skipping to change at page 9, line 10 | |||
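Steps 2 through 6 of the fast retransmit and fast recovery procedure listed above can be sketched as a small state machine. This is a hedged Python illustration, not a definitive implementation: the class and method names are hypothetical, limited transmit (step 1) and the actual sending of data (step 5) are left to the caller, and `flight_size` is assumed to already exclude any data sent under limited transmit.

```python
class FastRecovery:
    """Window bookkeeping for fast retransmit/fast recovery (sketch)."""

    def __init__(self, cwnd: int, ssthresh: int, smss: int) -> None:
        self.cwnd = cwnd
        self.ssthresh = ssthresh
        self.smss = smss
        self.dup_acks = 0
        self.in_recovery = False

    def on_duplicate_ack(self, flight_size: int) -> bool:
        """Process one duplicate ACK.

        Returns True when the segment starting at SND.UNA should be
        retransmitted (that is, on the third duplicate ACK).
        """
        self.dup_acks += 1
        if self.dup_acks < 3:
            return False  # step 1 (limited transmit) is not modelled here
        if self.dup_acks == 3:
            # Step 2: ssthresh per equation (4).
            self.ssthresh = max(flight_size // 2, 2 * self.smss)
            # Step 3: retransmit and inflate by the three buffered segments.
            self.cwnd = self.ssthresh + 3 * self.smss
            self.in_recovery = True
            return True
        # Step 4: one more segment has left the network.
        self.cwnd += self.smss
        return False

    def on_new_ack(self) -> None:
        """Process an ACK that acknowledges previously unacknowledged data."""
        if self.in_recovery:
            self.cwnd = self.ssthresh  # step 6: deflate the window
            self.in_recovery = False
        self.dup_acks = 0
```

After each call the caller would check whether the new cwnd and the receiver's advertised window permit sending previously unsent data (step 5).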
4.1 Re-starting Idle Connections | 4.1 Re-starting Idle Connections | |||
A known problem with the TCP congestion control algorithms described | A known problem with the TCP congestion control algorithms described | |||
above is that they allow a potentially inappropriate burst of | above is that they allow a potentially inappropriate burst of | |||
traffic to be transmitted after TCP has been idle for a relatively | traffic to be transmitted after TCP has been idle for a relatively | |||
long period of time. After an idle period, TCP cannot use the ACK | long period of time. After an idle period, TCP cannot use the ACK | |||
clock to strobe new segments into the network, as all the ACKs have | clock to strobe new segments into the network, as all the ACKs have | |||
drained from the network. Therefore, as specified above, TCP can | drained from the network. Therefore, as specified above, TCP can | |||
potentially send a cwnd-size line-rate burst into the network after | potentially send a cwnd-size line-rate burst into the network after | |||
an idle period. | an idle period. In addition, changing network conditions may have | |||
rendered TCP's notion of the available end-to-end network capacity | ||||
between two endpoints, as estimated by cwnd, inaccurate during the | ||||
course of a long idle period. | ||||
[Jac88] recommends that a TCP use slow start to restart | [Jac88] recommends that a TCP use slow start to restart | |||
transmission after a relatively long idle period. Slow start | transmission after a relatively long idle period. Slow start | |||
serves to restart the ACK clock, just as it does at the beginning | serves to restart the ACK clock, just as it does at the beginning | |||
of a transfer. This mechanism has been widely deployed in the | of a transfer. This mechanism has been widely deployed in the | |||
following manner. When TCP has not received a segment for more | following manner. When TCP has not received a segment for more | |||
than one retransmission timeout, cwnd is reduced to the value of | than one retransmission timeout, cwnd is reduced to the value of | |||
the restart window (RW) before transmission begins. | the restart window (RW) before transmission begins. | |||
For the purposes of this standard, we define RW = min(IW,cwnd). | For the purposes of this standard, we define RW = min(IW,cwnd). | |||
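The idle-restart rule above can be sketched as follows. This is a minimal illustration, not text from either draft: the function and parameter names are hypothetical, all window sizes are assumed to be in bytes, and "idle" is modeled as the time since the last segment was received.

```python
def restart_window(iw: int, cwnd: int) -> int:
    """RW = min(IW, cwnd), with both values in bytes."""
    return min(iw, cwnd)


def cwnd_after_idle(cwnd: int, iw: int, idle_time: float, rto: float) -> int:
    """Return the cwnd to use when transmission resumes.

    If no segment has been received for more than one retransmission
    timeout (RTO), cwnd is reduced to RW. Otherwise it is unchanged.
    idle_time and rto are in seconds (an assumption of this sketch).
    """
    if idle_time > rto:
        return restart_window(iw, cwnd)
    return cwnd
```

For example, with cwnd = 20 * 1460, IW = 3 * 1460, and an idle time longer than the RTO, the sketch returns 3 * 1460 bytes. Because RW is a minimum, a connection whose cwnd is already below IW is not inflated by an idle period.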
skipping to change at page 9, line 42 | skipping to change at page 9, line 52 | |||
generated for at least every second full-sized segment, and MUST be | generated for at least every second full-sized segment, and MUST be | |||
generated within 500 ms of the arrival of the first unacknowledged | generated within 500 ms of the arrival of the first unacknowledged | |||
packet. | packet. | |||
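As a rough illustration of the delayed ACK bounds, the sketch below tracks unacknowledged data at a receiver. It is an assumption-laden example rather than a reference implementation: the class and method names are hypothetical, "every second full-sized segment" is approximated by counting bytes against 2*RMSS, and a real stack would arm a timer so that the 500 ms deadline is never exceeded instead of polling.

```python
from dataclasses import dataclass
from typing import Optional

ACK_DELAY_LIMIT = 0.5  # seconds; the 500 ms upper bound on delaying an ACK


@dataclass
class DelayedAckState:
    rmss: int = 536                           # bytes; assumed default MSS
    unacked_bytes: int = 0                    # new data not yet acknowledged
    first_unacked_at: Optional[float] = None  # arrival time of that data

    def on_data(self, nbytes: int, now: float) -> bool:
        """Record newly arrived data; return True if an ACK is due now."""
        if self.first_unacked_at is None:
            self.first_unacked_at = now
        self.unacked_bytes += nbytes
        if self.unacked_bytes >= 2 * self.rmss:
            self._ack_sent()
            return True
        return False

    def on_timer(self, now: float) -> bool:
        """Return True if the delay bound forces an ACK at time `now`."""
        if self.first_unacked_at is None:
            return False
        if now - self.first_unacked_at >= ACK_DELAY_LIMIT:
            self._ack_sent()
            return True
        return False

    def _ack_sent(self) -> None:
        """Clear the pending state once an ACK has been generated."""
        self.unacked_bytes = 0
        self.first_unacked_at = None
```

The byte-count path corresponds to the SHOULD and the timer path to the MUST, which is why the sketch keeps them as separate methods.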
The requirement that an ACK "SHOULD" be generated for at least every | The requirement that an ACK "SHOULD" be generated for at least every | |||
second full-sized segment is listed in [RFC1122] in one place as a | second full-sized segment is listed in [RFC1122] in one place as a | |||
SHOULD and another as a MUST. Here we unambiguously state it is a | SHOULD and another as a MUST. Here we unambiguously state it is a | |||
SHOULD. We also emphasize that this is a SHOULD, meaning that an | SHOULD. We also emphasize that this is a SHOULD, meaning that an | |||
implementor should indeed only deviate from this requirement after | implementor should indeed only deviate from this requirement after | |||
careful consideration of the implications. See the discussion of | careful consideration of the implications. See the discussion of | |||
"Stretch ACK violation" in [RFC2525] and the references therein for a | "Stretch ACK violation" in [RFC2525] and the references therein for | |||
discussion of the possible performance problems with generating ACKs | a discussion of the possible performance problems with generating | |||
less frequently than every second full-sized segment. | ACKs less frequently than every second full-sized segment. | |||
In some cases, the sender and receiver may not agree on what | In some cases, the sender and receiver may not agree on what | |||
constitutes a full-sized segment. An implementation is deemed to | constitutes a full-sized segment. An implementation is deemed to | |||
comply with this requirement if it sends at least one acknowledgment | comply with this requirement if it sends at least one acknowledgment | |||
every time it receives 2*RMSS bytes of new data from the sender, | every time it receives 2*RMSS bytes of new data from the sender, | |||
where RMSS is the Maximum Segment Size specified by the receiver to | where RMSS is the Maximum Segment Size specified by the receiver to | |||
the sender (or the default value of 536 bytes, per [RFC1122], if the | the sender (or the default value of 536 bytes, per [RFC1122], if the | |||
receiver does not specify an MSS option during connection | receiver does not specify an MSS option during connection | |||
establishment). The sender may be forced to use a segment size less | establishment). The sender may be forced to use a segment size less | |||
than RMSS due to the maximum transmission unit (MTU), the path MTU | than RMSS due to the maximum transmission unit (MTU), the path MTU | |||
skipping to change at page 11, line 35 | skipping to change at page 11, line 46 | |||
The Internet to a considerable degree relies on the correct | The Internet to a considerable degree relies on the correct | |||
implementation of these algorithms in order to preserve network | implementation of these algorithms in order to preserve network | |||
stability and avoid congestion collapse. An attacker could cause | stability and avoid congestion collapse. An attacker could cause | |||
TCP endpoints to respond more aggressively in the face of congestion | TCP endpoints to respond more aggressively in the face of congestion | |||
by forging excessive duplicate acknowledgments or excessive | by forging excessive duplicate acknowledgments or excessive | |||
acknowledgments for new data. Conceivably, such an attack could | acknowledgments for new data. Conceivably, such an attack could | |||
drive a portion of the network into congestion collapse. | drive a portion of the network into congestion collapse. | |||
6. Changes Between RFC 2001 and RFC 2581 | 6. Changes Between RFC 2001 and RFC 2581 | |||
This document has been extensively rewritten editorially and it is | [RFC2001] has been extensively rewritten editorially and it is not | |||
not feasible to itemize the list of changes between the two | feasible to itemize the list of changes between [RFC2001] and | |||
documents. The intention of this document is not to change any of | [RFC2581]. The intention of [RFC2581] is to not change any of the | |||
the recommendations given in RFC 2001, but to further clarify cases | recommendations given in [RFC2001], but to further clarify cases | |||
that were not discussed in detail in 2001. Specifically, this | that were not discussed in detail in [RFC2001]. Specifically, | |||
document suggests what TCP connections should do after a relatively | [RFC2581] suggests what TCP connections should do after a relatively | |||
long idle period, as well as specifying and clarifying some of the | long idle period, as well as specifying and clarifying some of the | |||
issues pertaining to TCP ACK generation. Finally, the allowable | issues pertaining to TCP ACK generation. Finally, the allowable | |||
upper bound for the initial congestion window has also been raised | upper bound for the initial congestion window has also been raised | |||
from one to two segments. | from one to two segments. | |||
7. Changes Relative to RFC 2581 | 7. Changes Relative to RFC 2581 | |||
A specific definition for "duplicate acknowledgment" has been | A specific definition for "duplicate acknowledgment" has been | |||
added, based on the definition used by BSD TCP. | added, based on the definition used by BSD TCP. | |||
The document now notes that what to do with duplicate ACKs after the | The document now notes that what to do with duplicate ACKs after the | |||
retransmission timer has fired is future work and explicitly | retransmission timer has fired is future work and explicitly | |||
unspecified in this document. | unspecified in this document. | |||
The initial window requirements were changed to allow Larger | The initial window requirements were changed to allow Larger | |||
Initial Windows as standardized in [RFC3390]. Additionally, the | Initial Windows as standardized in [RFC3390]. Additionally, the | |||
steps to take when an initial window is discovered to be too large | steps to take when an initial window is discovered to be too large | |||
skipping to change at page 12, line 41 | skipping to change at page 12, line 51 | |||
The restart window has been changed to min(IW,cwnd) from IW. This | The restart window has been changed to min(IW,cwnd) from IW. This | |||
behavior was described as "experimental" in [RFC2581]. | behavior was described as "experimental" in [RFC2581]. | |||
It is now recommended that TCP implementors implement an advanced | It is now recommended that TCP implementors implement an advanced | |||
loss recovery algorithm conforming to the principles outlined in | loss recovery algorithm conforming to the principles outlined in | |||
this document. | this document. | |||
The security considerations have been updated to discuss ACK | The security considerations have been updated to discuss ACK | |||
division and recommend byte counting as a counter to this attack. | division and recommend byte counting as a counter to this attack. | |||
Acknowledgments | 8. IANA Considerations | |||
This document contains no IANA considerations, but apparently an | ||||
Internet *Draft* can no longer be published without this section. | ||||
Acknowledgments | ||||
The core algorithms we describe were developed by Van Jacobson | The core algorithms we describe were developed by Van Jacobson | |||
[Jac88, Jac90]. In addition, Limited Transmit [RFC3042] was | [Jac88, Jac90]. In addition, Limited Transmit [RFC3042] was | |||
developed in conjunction with Hari Balakrishnan and Sally Floyd. | developed in conjunction with Hari Balakrishnan and Sally Floyd. | |||
The initial congestion window size specified in this document is a | The initial congestion window size specified in this document is a | |||
result of work with Sally Floyd and Craig Partridge | result of work with Sally Floyd and Craig Partridge | |||
[RFC2414,RFC3390]. | [RFC2414,RFC3390]. | |||
W. Richard ("Rich") Stevens wrote the first version of this document | W. Richard ("Rich") Stevens wrote the first version of this document | |||
[RFC2001] and co-authored the second version [RFC2581]. This | [RFC2001] and co-authored the second version [RFC2581]. This | |||
present version much benefits from his clarity and thoughtfulness of | present version much benefits from his clarity and thoughtfulness of | |||
skipping to change at page 13, line 13 | skipping to change at page 13, line 28 | |||
We wish to emphasize that the shortcomings and mistakes of this | We wish to emphasize that the shortcomings and mistakes of this | |||
document are solely the responsibility of the current authors. | document are solely the responsibility of the current authors. | |||
Some of the text from this document is taken from "TCP/IP | Some of the text from this document is taken from "TCP/IP | |||
Illustrated, Volume 1: The Protocols" by W. Richard Stevens | Illustrated, Volume 1: The Protocols" by W. Richard Stevens | |||
(Addison-Wesley, 1994) and "TCP/IP Illustrated, Volume 2: The | (Addison-Wesley, 1994) and "TCP/IP Illustrated, Volume 2: The | |||
Implementation" by Gary R. Wright and W. Richard Stevens (Addison- | Implementation" by Gary R. Wright and W. Richard Stevens (Addison- | |||
Wesley, 1995). This material is used with the permission of | Wesley, 1995). This material is used with the permission of | |||
Addison-Wesley. | Addison-Wesley. | |||
Steve Arden, Neal Cardwell, Noritoshi Demizu, Kevin Fall, John | Anil Agarwal, Steve Arden, Neal Cardwell, Noritoshi Demizu, Gorry | |||
Heffner, Alfred Hoenes, Sally Floyd, Reiner Ludwig, Matt Mathis, | Fairhurst, Kevin Fall, John Heffner, Alfred Hoenes, Sally Floyd, | |||
Craig Partridge and Joe Touch contributed a number of helpful | Reiner Ludwig, Matt Mathis, Craig Partridge and Joe Touch | |||
suggestions. | contributed a number of helpful suggestions. | |||
Normative References | Normative References | |||
[RFC793] Postel, J., "Transmission Control Protocol", STD 7, RFC | [RFC793] Postel, J., "Transmission Control Protocol", STD 7, RFC | |||
793, September 1981. | 793, September 1981. | |||
[RFC1122] Braden, R., "Requirements for Internet Hosts -- | [RFC1122] Braden, R., "Requirements for Internet Hosts -- | |||
Communication Layers", STD 3, RFC 1122, October 1989. | Communication Layers", STD 3, RFC 1122, October 1989. | |||
[RFC1191] Mogul, J. and S. Deering, "Path MTU Discovery", RFC 1191, | [RFC1191] Mogul, J. and S. Deering, "Path MTU Discovery", RFC 1191, | |||
skipping to change at page 14, line 33 | skipping to change at page 14, line 49 | |||
[RFC2018] Mathis, M., Mahdavi, J., Floyd, S. and A. Romanow, "TCP | [RFC2018] Mathis, M., Mahdavi, J., Floyd, S. and A. Romanow, "TCP | |||
Selective Acknowledgement Options", RFC 2018, October 1996. | Selective Acknowledgement Options", RFC 2018, October 1996. | |||
[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate | [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate | |||
Requirement Levels", BCP 14, RFC 2119, March 1997. | Requirement Levels", BCP 14, RFC 2119, March 1997. | |||
[RFC2414] Allman, M., Floyd, S. and C. Partridge, "Increasing TCP's | [RFC2414] Allman, M., Floyd, S. and C. Partridge, "Increasing TCP's | |||
Initial Window Size", RFC 2414, September 1998. | Initial Window Size", RFC 2414, September 1998. | |||
[RFC2525] Paxson, V., Allman, M., Dawson, S., Fenner, W., Griner, J., | [RFC2525] Paxson, V., Allman, M., Dawson, S., Fenner, W., Griner, | |||
Heavens, I., Lahey, K., Semke, J. and B. Volz, "Known TCP | J., Heavens, I., Lahey, K., Semke, J. and B. Volz, "Known TCP | |||
Implementation Problems", RFC 2525, March 1999. | Implementation Problems", RFC 2525, March 1999. | |||
[RFC2581] Allman, M., Paxson, V., W. Stevens, TCP Congestion | [RFC2581] Allman, M., Paxson, V., W. Stevens, TCP Congestion | |||
Control, RFC 2581, April 1999. | Control, RFC 2581, April 1999. | |||
[RFC2883] Floyd, S., J. Mahdavi, M. Mathis, M. Podolsky, An | [RFC2883] Floyd, S., J. Mahdavi, M. Mathis, M. Podolsky, An | |||
Extension to the Selective Acknowledgement (SACK) Option for | Extension to the Selective Acknowledgement (SACK) Option for | |||
TCP, RFC 2883, July 2000. | TCP, RFC 2883, July 2000. | |||
[RFC2988] V. Paxson and M. Allman, "Computing TCP's Retransmission | [RFC2988] V. Paxson and M. Allman, "Computing TCP's Retransmission | |||
Timer", RFC 2988, November 2000. | Timer", RFC 2988, November 2000. | |||
[RFC3042] Allman, M., Balakrishnan, H. and S. Floyd, "Enhancing | [RFC3042] Allman, M., Balakrishnan, H. and S. Floyd, "Enhancing | |||
TCP's Loss Recovery Using Limited Transmit", RFC 3042, January | TCP's Loss Recovery Using Limited Transmit", RFC 3042, January | |||
2001. | 2001. | |||
[RFC3168] K. Ramakrishnan, S. Floyd, D. Black, "The Addition of | ||||
Explicit Congestion Notification (ECN) to IP", RFC 3168, | ||||
September 2001. | ||||
[RFC3390] Allman, M., Floyd, S., C. Partridge, "Increasing TCP's | [RFC3390] Allman, M., Floyd, S., C. Partridge, "Increasing TCP's | |||
Initial Window", RFC 3390, October 2002. | Initial Window", RFC 3390, October 2002. | |||
[RFC3465] Mark Allman, TCP Congestion Control with Appropriate Byte | [RFC3465] Mark Allman, TCP Congestion Control with Appropriate Byte | |||
Counting (ABC), RFC 3465, February 2003. | Counting (ABC), RFC 3465, February 2003. | |||
[RFC3517] Ethan Blanton, Mark Allman, Kevin Fall, Lili Wang, A | [RFC3517] Ethan Blanton, Mark Allman, Kevin Fall, Lili Wang, A | |||
Conservative Selective Acknowledgment (SACK)-based Loss Recovery | Conservative Selective Acknowledgment (SACK)-based Loss Recovery | |||
Algorithm for TCP, RFC 3517, April 2003. | Algorithm for TCP, RFC 3517, April 2003. | |||
[RFC3782] Sally Floyd, Tom Henderson, Andrei Gurtov, The NewReno | [RFC3782] Sally Floyd, Tom Henderson, Andrei Gurtov, The NewReno | |||
Modification to TCP's Fast Recovery Algorithm, RFC 3782, April | Modification to TCP's Fast Recovery Algorithm, RFC 3782, April | |||
2004. | 2004. | |||
[RFC4821] Matt Mathis, John Heffner, Packetization Layer Path MTU | ||||
Discovery, RFC 4821, March 2007. | ||||
[SCWA99] Savage, S., Cardwell, N., Wetherall, D., and T. Anderson, | [SCWA99] Savage, S., Cardwell, N., Wetherall, D., and T. Anderson, | |||
"TCP Congestion Control With a Misbehaving Receiver", ACM | "TCP Congestion Control With a Misbehaving Receiver", ACM | |||
Computer Communication Review, 29(5), October 1999. | Computer Communication Review, 29(5), October 1999. | |||
[Ste94] Stevens, W., "TCP/IP Illustrated, Volume 1: The Protocols", | [Ste94] Stevens, W., "TCP/IP Illustrated, Volume 1: The Protocols", | |||
Addison-Wesley, 1994. | Addison-Wesley, 1994. | |||
[WS95] Wright, G. and W. Stevens, "TCP/IP Illustrated, Volume 2: The | [WS95] Wright, G. and W. Stevens, "TCP/IP Illustrated, Volume 2: The | |||
Implementation", Addison-Wesley, 1995. | Implementation", Addison-Wesley, 1995. | |||
Authors' Addresses | Authors' Addresses | |||
Mark Allman | Mark Allman | |||
ICIR / ICSI | International Computer Science Institute (ICSI) | |||
1947 Center Street | 1947 Center Street | |||
Suite 600 | Suite 600 | |||
Berkeley, CA 94704-1198 | Berkeley, CA 94704-1198 | |||
Phone: +1 440 235 1792 | Phone: +1 440 235 1792 | |||
EMail: mallman@icir.org | EMail: mallman@icir.org | |||
http://www.icir.org/mallman/ | http://www.icir.org/mallman/ | |||
Vern Paxson | Vern Paxson | |||
ICIR / ICSI | International Computer Science Institute (ICSI) | |||
1947 Center Street | 1947 Center Street | |||
Suite 600 | Suite 600 | |||
Berkeley, CA 94704-1198 | Berkeley, CA 94704-1198 | |||
Phone: +1 510/642-4274 x302 | Phone: +1 510/642-4274 x302 | |||
EMail: vern@icir.org | EMail: vern@icir.org | |||
http://www.icir.org/vern/ | http://www.icir.org/vern/ | |||
Ethan Blanton | Ethan Blanton | |||
Purdue University Computer Sciences | Purdue University Computer Sciences | |||
1398 Computer Science Building | 1398 Computer Science Building | |||
skipping to change at page 16, line 32 | skipping to change at page 16, line 54 | |||
on an "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE | on an "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE | |||
REPRESENTS OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY, THE | REPRESENTS OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY, THE | |||
IETF TRUST AND THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL | IETF TRUST AND THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL | |||
WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY | WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY | |||
WARRANTY THAT THE USE OF THE INFORMATION HEREIN WILL NOT INFRINGE | WARRANTY THAT THE USE OF THE INFORMATION HEREIN WILL NOT INFRINGE | |||
ANY RIGHTS OR ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS | ANY RIGHTS OR ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS | |||
FOR A PARTICULAR PURPOSE. | FOR A PARTICULAR PURPOSE. | |||
Copyright Statement | Copyright Statement | |||
Copyright (C) The IETF Trust (2007). This document is subject to | Copyright (C) The IETF Trust (2008). This document is subject to | |||
the rights, licenses and restrictions contained in BCP 78, and | the rights, licenses and restrictions contained in BCP 78, and | |||
except as set forth therein, the authors retain all their rights. | except as set forth therein, the authors retain all their rights. | |||
Acknowledgment | Acknowledgment | |||
Funding for the RFC Editor function is currently provided by the | Funding for the RFC Editor function is currently provided by the | |||
Internet Society. | Internet Society. | |||
End of changes. 34 change blocks. | ||||
70 lines changed or deleted | 88 lines changed or added | |||