draft-ietf-tcpm-early-rexmt-01.txt   draft-ietf-tcpm-early-rexmt-02.txt 
Internet Engineering Task Force Mark Allman Internet Engineering Task Force Mark Allman
INTERNET DRAFT ICSI INTERNET DRAFT ICSI
File: draft-ietf-tcpm-early-rexmt-01.txt Konstantin Avrachenkov File: draft-ietf-tcpm-early-rexmt-02.txt Konstantin Avrachenkov
INRIA Intended Status: Experimental INRIA
Urtzi Ayesta Urtzi Ayesta
LAAS-CNRS LAAS-CNRS
Josh Blanton Josh Blanton
Ohio University Ohio University
Per Hurtig Per Hurtig
Karlstad University Karlstad University
January 2009 October 2009
Expires: July 2009 Expires: April 2010
Early Retransmit for TCP and SCTP Early Retransmit for TCP and SCTP
Status of this Memo Status of this Memo
This Internet-Draft is submitted to IETF in full conformance with This Internet-Draft is submitted to IETF in full conformance with
the provisions of BCP 78 and BCP 79. the provisions of BCP 78 and BCP 79.
Internet-Drafts are working documents of the Internet Engineering Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF), its areas, and its working groups. Note that Task Force (IETF), its areas, and its working groups. Note that
skipping to change at page 1, line 39 skipping to change at page 1, line 39
months and may be updated, replaced, or obsoleted by other documents months and may be updated, replaced, or obsoleted by other documents
at any time. It is inappropriate to use Internet-Drafts as at any time. It is inappropriate to use Internet-Drafts as
reference material or to cite them other than as "work in progress." reference material or to cite them other than as "work in progress."
The list of current Internet-Drafts can be accessed at The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt. http://www.ietf.org/ietf/1id-abstracts.txt.
The list of Internet-Draft Shadow Directories can be accessed at The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html. http://www.ietf.org/shadow.html.
This Internet-Draft will expire on July 13, 2009. This Internet-Draft will expire on April 27, 2010.
Copyright Notice Copyright Notice
Copyright (c) 2009 IETF Trust and the persons identified as the Copyright (c) 2009 IETF Trust and the persons identified as the
document authors. All rights reserved. document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents Provisions Relating to IETF Documents
(http://trustee.ietf.org/license-info) in effect on the date of (http://trustee.ietf.org/license-info) in effect on the date of
publication of this document. Please review these documents publication of this document. Please review these documents
carefully, as they describe your rights and restrictions with carefully, as they describe your rights and restrictions with
respect to this document. respect to this document. Code Components extracted from this
document must include Simplified BSD License text as described in
Section 4.e of the Trust Legal Provisions and are provided without
warranty as described in the BSD License.
Abstract Abstract
This document proposes a new mechanism for TCP and SCTP that can be This document proposes a new mechanism for TCP and SCTP that can be
used to recover lost segments when a connection's congestion window used to recover lost segments when a connection's congestion window
is small. The "Early Retransmit" mechanism allows the transport to is small. The "Early Retransmit" mechanism allows the transport to
reduce, in certain special circumstances, the number of duplicate reduce, in certain special circumstances, the number of duplicate
acknowledgments required to trigger a fast retransmission. This acknowledgments required to trigger a fast retransmission. This
allows the transport to use fast retransmit to recover packet losses allows the transport to use fast retransmit to recover segment
that would otherwise require a lengthy retransmission timeout. losses that would otherwise require a lengthy retransmission
timeout.
Terminology Terminology
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in RFC 2119 [RFC2119]. document are to be interpreted as described in RFC 2119 [RFC2119].
The reader is expected to be familiar with the definitions given in
[RFC5681].
1 Introduction 1 Introduction
Many researchers have studied problems with TCP [RFC793,RFC2581] Many researchers have studied problems with TCP [RFC793,RFC5681]
when the congestion window is small and have outlined possible when the congestion window is small and have outlined possible
mechanisms to mitigate these problems mechanisms to mitigate these problems
[Mor97,BPS+98,Bal98,LK98,RFC3150,AA02]. SCTP's [RFC4960] loss [Mor97,BPS+98,Bal98,LK98,RFC3150,AA02]. SCTP's [RFC4960] loss
recovery and congestion control mechanisms are based on TCP and recovery and congestion control mechanisms are based on TCP and
therefore the same problems impact the performance of SCTP therefore the same problems impact the performance of SCTP
connections. When the transport detects a missing segment, the connections. When the transport detects a missing segment, the
connection enters a loss recovery phase. There are several variants connection enters a loss recovery phase. There are several variants
of the loss recovery phase depending on the TCP implemention. TCP of the loss recovery phase depending on the TCP implemention. TCP
can use slow start based recovery or Fast Recovery [RFC2581], can use slow start based recovery or Fast Recovery [RFC5681],
NewReno [RFC2582], and loss recovery based on selective NewReno [RFC3782], and loss recovery based on selective
acknowledgments (SACKs) [RFC2018,FF96,RFC3517]. SCTP's loss acknowledgments (SACKs) [RFC2018,FF96,RFC3517]. SCTP's loss
recovery is not as varied due to the built-in selective recovery is not as varied due to the built-in selective
acknowledgments. acknowledgments.
All the above variants have two methods for invoking loss recovery. All the above variants have two methods for invoking loss recovery.
First, if an acknowledgment (ACK) for a given segment is not First, if an acknowledgment (ACK) for a given segment is not
received in a certain amount of time a retransmission timer fires received in a certain amount of time a retransmission timer fires
and the segment is resent [RFC2988,RFC4960]. Second, the ``Fast and the segment is resent [RFC2988,RFC4960]. Second, the "Fast
Retransmit'' algorithm resends a segment when three duplicate ACKs Retransmit" algorithm resends a segment when three duplicate ACKs
arrive at the sender [Jac88,RFC2581]. Duplicate ACKs are triggered arrive at the sender [Jac88,RFC5681]. Duplicate ACKs are triggered by
by out-of-order arrivals at the receiver. However, because out-of-order arrivals at the receiver. However, because duplicate
duplicate ACKs from the receiver are triggered by both packet loss ACKs from the receiver are triggered by both segment loss and
and packet reordering in the network path, the sender waits for segment reordering in the network path, the sender waits for three
three duplicate ACKs in an attempt to disambiguate packet loss from duplicate ACKs in an attempt to disambiguate segment loss from
packet reordering. When using small congestion windows it may not segment reordering. When the congestion window is small it may not
be possible to generate the required number of duplicate ACKs to be possible to generate the required number of duplicate ACKs to
trigger Fast Retransmit when a loss does happen. trigger Fast Retransmit when a loss does happen.
Small windows can occur in a number of situations, such as: Small congestion windows can occur in a number of situations, such
as:
(1) The connection is constrained by end-to-end congestion control (1) The connection is constrained by end-to-end congestion control
when the connection's share of the path is small, the path has a when the connection's share of the path is small, the path has a
small bandwidth-delay product or the transport is ascertaining small bandwidth-delay product or the transport is ascertaining
the available bandwidth in the first few round-trip times of the available bandwidth in the first few round-trip times of
slow start. slow start.
(2) The connection is "application limited" and has only a limited (2) The connection is "application limited" and has only a limited
amount of data to send. This can happen any time the amount of data to send. This can happen any time the
application does not produce enough data to fill the congestion application does not produce enough data to fill the congestion
skipping to change at page 3, line 21 skipping to change at page 3, line 29
in [RFC2988] (for TCP) and [RFC4960] (for SCTP). To prevent in [RFC2988] (for TCP) and [RFC4960] (for SCTP). To prevent
spurious retransmissions of segments that are only delayed and not spurious retransmissions of segments that are only delayed and not
lost, the minimum RTO is conservatively chosen to be 1 second. lost, the minimum RTO is conservatively chosen to be 1 second.
Therefore, it behooves TCP senders to detect and recover from as Therefore, it behooves TCP senders to detect and recover from as
many losses as possible without incurring a lengthy timeout during many losses as possible without incurring a lengthy timeout during
which the connection remains idle. However, if not enough duplicate which the connection remains idle. However, if not enough duplicate
ACKs arrive from the receiver, the Fast Retransmit algorithm is ACKs arrive from the receiver, the Fast Retransmit algorithm is
never triggered---this situation occurs when the congestion window never triggered---this situation occurs when the congestion window
is small, if a large number of segments in a window are lost or at is small, if a large number of segments in a window are lost or at
the end of a transfer as data drains from the network. For the end of a transfer as data drains from the network. For
instance, consider a congestion window (cwnd) of three segments. If instance, consider a congestion window of three segments worth of
one segment is dropped by the network, then at most two duplicate data. If one segment is dropped by the network, then at most two
ACKs will arrive at the sender. Since three duplicate ACKs are duplicate ACKs will arrive at the sender. Since three duplicate
required to trigger Fast Retransmit, a timeout will be required to ACKs are required to trigger Fast Retransmit, a timeout will be
resend the dropped packet. required to resend the dropped segment. Note, delayed ACKs
[RFC5681] may further reduce the number of duplicate ACKs a receiver
sends. However, we assume that receivers send immediate ACKs when
there is a gap in the received sequence space per [RFC5681].
[BPS+98] shows that roughly 56% of retransmissions sent by a busy [BPS+98] shows that roughly 56% of retransmissions sent by a busy
web server are sent after the RTO timer expires, while only 44% are web server are sent after the RTO timer expires, while only 44% are
handled by Fast Retransmit. In addition, only 4% of the RTO handled by Fast Retransmit. In addition, only 4% of the RTO
timer-based retransmissions could have been avoided with SACK, which timer-based retransmissions could have been avoided with SACK, which
has to continue to disambiguate reordering from genuine loss. has to continue to disambiguate reordering from genuine loss.
Furthermore, [All00] shows that for one particular web server the Furthermore, [All00] shows that for one particular web server the
median transfer size is less than four segments, indicating that median number of bytes carried by a connection is less than four
more than half of the connections will be forced to rely on the RTO segments, indicating that more than half of the connections will be
timer to recover from any losses that occur. Thus, loss recovery forced to rely on the RTO timer to recover from any losses that
that does not rely on the conservative RTO is likely to be occur. Thus, loss recovery that does not rely on the conservative
beneficial for short TCP transfers. RTO is likely to be beneficial for short TCP transfers.
The Limited Transmit mechanism introduced in [RFC3042] allows a TCP The Limited Transmit mechanism introduced in [RFC3042] and currently
sender to transmit previously unsent data upon the reception of each codified in [RFC5681] allows a TCP sender to transmit previously
of the two duplicate ACKs that precede a Fast Retransmit. SCTP unsent data upon the reception of each of the two duplicate ACKs
[RFC4960] uses SACK information to calculate the number of that precede a Fast Retransmit. SCTP [RFC4960] uses SACK
outstanding segments in the network. Hence, when the first two information to calculate the number of outstanding segments in the
duplicate ACKs arrive at the sender they will indicate that data has network. Hence, when the first two duplicate ACKs arrive at the
left the network and allow the sender to transmit new data (if sender they will indicate that data has left the network and allow
available) similar to TCP's Limited Transmit algorithm. In the the sender to transmit new data (if available) similar to TCP's
remainder of this document we use "Limited Transmit" to include both Limited Transmit algorithm. In the remainder of this document we
TCP and SCTP mechanisms for sending in response to the first two use "Limited Transmit" to include both TCP and SCTP mechanisms for
duplicate ACKs. By sending these two new segments the sender is sending in response to the first two duplicate ACKs. By sending
attempting to induce additional duplicate ACKs (if appropriate) so these two new segments the sender is attempting to induce additional
that Fast Retransmit will be triggered before the retransmission duplicate ACKs (if appropriate) so that Fast Retransmit will be
timeout expires. The "Early Retransmit" mechanism outlined in this triggered before the retransmission timeout expires. The
document covers the case when previously unsent data is not sender-side "Early Retransmit" mechanism outlined in this document
available for transmission or cannot be transmitted due to an covers the case when previously unsent data is not available for
advertised window limitation. transmission (case (2) above) or cannot be transmitted due to an
advertised window limitation (case (3) above).
2 Early Retransmit Algorithm 2 Early Retransmit Algorithm
The Early Retransmit algorithm calls for lowering the threshold for The Early Retransmit algorithm calls for lowering the threshold for
triggering Fast Retransmit when the amount of outstanding data is triggering Fast Retransmit when the amount of outstanding data is
small and when no previously unsent data can be transmitted (such small and when no previously unsent data can be transmitted (such
that Limited Transmit could be used). Duplicate ACKs are triggered that Limited Transmit could be used). Duplicate ACKs are triggered
by each arriving out-of-order segment. Therefore, Fast Retransmit by each arriving out-of-order segment. Therefore, Fast Retransmit
will not be invoked when there are less than four outstanding will not be invoked when there are less than four outstanding
segments (assuming only one segment loss in the window). However, segments (assuming only one segment loss in the window). However,
TCP and SCTP are not required to track the number of outstanding TCP and SCTP are not required to track the number of outstanding
segments, but rather the number of outstanding bytes or messages. segments, but rather the number of outstanding bytes or messages.
Therefore, applying the intuitive notion of a transport with less (Note, SCTP's message boundaries do not necessarily correspond to
than four segments outstanding is more complicated than it first segment boundaries.) Therefore, applying the intuitive notion of a
appears. In section 2.1 we describe a "byte-based" variant of Early transport with less than four segments outstanding is more
Retransmit that attempts to roughly map the number of outstanding complicated than it first appears. In section 2.1 we describe a
bytes to a number of outstanding packets that is then used when "byte-based" variant of Early Retransmit that attempts to roughly
deciding whether to trigger Early Retransmit. In section 2.2 we map the number of outstanding bytes to a number of outstanding
describe a "packet-based" variant that represents a more precise segments that is then used when deciding whether to trigger Early
algorithm for triggering Early Retransmit. The precision comes at Retransmit. In section 2.2 we describe a "segment-based" variant
the cost of requiring additional state to be kept by the TCP sender. that represents a more precise algorithm for triggering Early
In both cases we describe SACK-based and non-SACK-based versions of Retransmit. The precision comes at the cost of requiring additional
the scheme (of course, the non-SACK version will not apply to SCTP). state to be kept by the TCP sender. In both cases we describe
SACK-based and non-SACK-based versions of the scheme (of course, the
non-SACK version will not apply to SCTP). This document explicitly
does not prefer one variant over the other, but leaves the choice to
the implementer.
2.1 Byte-based Early Retransmit 2.1 Byte-based Early Retransmit
A TCP or SCTP sender MAY use byte-based Early Retransmit. A TCP or SCTP sender MAY use byte-based Early Retransmit.
A sender employing byte-based Early Retransmit MUST use the A sender employing byte-based Early Retransmit MUST use the
following two conditions to determine when an Early Retransmit is following two conditions to determine when an Early Retransmit is
sent: sent:
(2.a) The amount of outstanding data (ownd)---data sent but not yet (2.a) The amount of outstanding data (ownd)---data sent but not yet
acknowledged---is less than 4*SMSS bytes. acknowledged---is less than 4*SMSS bytes.
(Note that in the byte-based variant of Early Retransmit Note that in the byte-based variant of Early Retransmit
'ownd' is equivalent to 'FlightSize' defined in [RFC2581]. We 'ownd' is equivalent to 'FlightSize' defined in [RFC5681]. We
use different notation because 'ownd' is not consistent with use different notation because 'ownd' is not consistent with
FlightSize through this document.) FlightSize through this document.
Also note that in SCTP messages will have to be converted to
bytes to make this variant of Early Retransmit work.
(2.b) There is either no unsent data ready for transmission at the (2.b) There is either no unsent data ready for transmission at the
sender or the advertised window does not permit new segments sender or the advertised receive window does not permit new
to be transmitted. segments to be transmitted.
When the above two conditions hold and the connection does not When the above two conditions hold and a TCP connection does not
support SACK the duplicate ACK threshold used to trigger a support SACK the duplicate ACK threshold used to trigger a
retransmission MUST be reduced to: retransmission MUST be reduced to:
ER_thresh = ceiling (ownd/SMSS) - 1 (1) ER_thresh = ceiling (ownd/SMSS) - 1 (1)
duplicate ACKs, where ownd is in terms of bytes. duplicate ACKs, where ownd is in terms of bytes. We call this
reduced ACK threshold enabling "Early Retransimission".
When conditions (2.a) and (2.b) hold and the connection does support When conditions (2.a) and (2.b) hold and a TCP connection does
SACK, Early Retransmit MUST be used only when "ownd - SMSS" bytes support SACK or SCTP is in use, Early Retransmit MUST be used only
have been SACKed. when "ownd - SMSS" bytes have been SACKed.
When conditions (2.a) and (2.b) do not hold, the transport MUST NOT When conditions (2.a) and (2.b) do not hold, the transport MUST NOT
use Early Retransmit, but rather prefer the standard mechanisms, use Early Retransmit, but rather prefer the standard mechanisms,
including Limited Transmit. including Fast Retransmit and Limited Transmit.
As noted above, the drawback of this byte-based variant is precision As noted above, the drawback of this byte-based variant is precision
[HB08]. We illustrate this with two examples: [HB08]. We illustrate this with two examples:
+ Consider a non-SACK TCP sender that uses an SMSS of 1460 bytes + Consider a non-SACK TCP sender that uses an SMSS of 1460 bytes
and transmits three segments each with 400 bytes of payload. and transmits three segments each with 400 bytes of payload.
This is a case where Early Retransmit could aid loss recovery if This is a case where Early Retransmit could aid loss recovery if
one segment is lost. However, in this case ER_thresh will one segment is lost. However, in this case ER_thresh will
become zero, per equation (1), because the number of outstanding become zero, per equation (1), because the number of outstanding
bytes is a poor estimate of the number of outstanding packets. bytes is a poor estimate of the number of outstanding segments.
A similar problem occurs for senders that employ SACK as the A similar problem occurs for senders that employ SACK as the
expression "ownd - SMSS" will become negative. expression "ownd - SMSS" will become negative.
+ Next, consider a non-SACK TCP sender that uses an SMSS of 1460 + Next, consider a non-SACK TCP sender that uses an SMSS of 1460
bytes and transmits 10 segments each with 400 bytes of payload. bytes and transmits 10 segments each with 400 bytes of payload.
In this case ER_thresh will be two, per equation (1). Thus, In this case ER_thresh will be two, per equation (1). Thus,
even though there are enough segments outstanding to trigger even though there are enough segments outstanding to trigger
Fast Retransmit with the standard duplicate ACK threshold Early Fast Retransmit with the standard duplicate ACK threshold Early
Retransmit will be triggered. This could cause or exacerbate Retransmit will be triggered. This could cause or exacerbate
performance problems caused by packet reordering in the network. performance problems caused by segment reordering in the network.
2.2 Packet-based Early Retransmit 2.2 Segment-based Early Retransmit
A TCP or SCTP sender MAY use packet-based Early Retransmit. A TCP or SCTP sender MAY use segment-based Early Retransmit.
A sender employing packet-based Early Retransmit MUST use the A sender employing segment-based Early Retransmit MUST use the
following two conditions to determine when an Early Retransmit is following two conditions to determine when an Early Retransmit is
sent: sent:
(3.a) The number of outstanding segments (oseg)---segments sent but (3.a) The number of outstanding segments (oseg)---segments sent but
not yet acknowledged---is less than four. not yet acknowledged---is less than four.
(3.b) There is either no unsent data ready for transmission at the (3.b) There is either no unsent data ready for transmission at the
sender or the advertised window does not permit new segments sender or the advertised receive window does not permit new
to be transmitted. segments to be transmitted.
When the above two conditions hold and the connection does not When the above two conditions hold and a TCP connection does not
support SACK the duplicate ACK threshold used to trigger a support SACK the duplicate ACK threshold used to trigger a
retransmission MUST be reduced to: retransmission MUST be reduced to:
ER_thresh = oseg - 1 (2) ER_thresh = oseg - 1 (2)
duplicate ACKs, where oseg represents the number of outstanding duplicate ACKs, where oseg represents the number of outstanding
segments. (We discuss tracking the number of outstanding segments segments. (We discuss tracking the number of outstanding segments
below.) below.) We call this reduced ACK threshold enabling "Early
Retransimission".
When conditions (3.a) and (3.b) hold and the connection does support When conditions (3.a) and (3.b) hold and a TCP connection does
SACK, Early Retransmit MUST be used only when "oseg - 1" segments support SACK or SCTP is in use, Early Retransmit MUST be used only
have been SACKed. when "oseg - 1" segments have been SACKed. A segment is considered
to be SACKed when all its data bytes (TCP) or data chunks (SCTP)
have been indicated as arrived by the receiver.
When conditions (3.a) and (3.b) do not hold, the transport MUST NOT When conditions (3.a) and (3.b) do not hold, the transport MUST NOT
use Early Retransmit, but rather prefer the standard mechanisms, use Early Retransmit, but rather prefer the standard mechanisms,
including Limited Transmit. including Fast Retransmit and Limited Transmit.
This version of Early Retransmit solves the precision issues This version of Early Retransmit solves the precision issues
discussed in the previous section. As noted previously, the cost is discussed in the previous section. As noted previously, the cost is
that the implementation will have to track packet boundaries to form that the implementation will have to track segment boundaries to
an understanding as to how many actual segments have been form an understanding as to how many actual segments have been
transmitted, but not acknowledged. This can be done by tracking the transmitted, but not acknowledged. This can be done by the sender
boundaries of the three segments on the right side of the current tracking the boundaries of the three segments on the right side of
window (which involves tracking four sequence numbers in TCP). This the current window (which involves tracking four sequence numbers in
could be done by keeping a circular list of the packet boundaries, TCP). This could be done by keeping a circular list of the segment
for instance. Cumulative ACKs that do not fall within this region boundaries, for instance. Cumulative ACKs that do not fall within
indicate that at least four segments are outstanding and therefore this region indicate that at least four segments are outstanding and
Early Retransmit MUST NOT be used. When the outstanding window therefore Early Retransmit MUST NOT be used. When the outstanding
becomes small enough that Early Retransmit can be invoked, a full window becomes small enough that Early Retransmit can be invoked, a
understanding of the number of outstanding packets will be full understanding of the number of outstanding segments will be
available from the four sequence numbers retained. available from the four sequence numbers retained.
3 Discussion 3 Discussion
In this section we discuss a number of issues surrounding the Early
Retransmit algorithm.
3.1 SACK vs. non-SACK
The SACK variant of the Early Retransmit algorithm is preferred to The SACK variant of the Early Retransmit algorithm is preferred to
the non-SACK variant in TCP due to its robustness in the face of ACK the non-SACK variant in TCP due to its robustness in the face of ACK
loss (since SACKs are sent redundantly) and due to interactions with loss (since SACKs are sent redundantly) and due to interactions with
the delayed ACK timer (SCTP does not have a non-SACK mode and the delayed ACK timer (SCTP does not have a non-SACK mode and
therefore naturally supports SACK-based Early Retransmit). Consider therefore naturally supports SACK-based Early Retransmit). Consider
a flight of three segments, S1...S3, with S2 being dropped by the a flight of three segments, S1...S3, with S2 being dropped by the
network. When S1 arrives it is in-order and so the receiver may or network. When S1 arrives it is in-order and so the receiver may or
may not delay the ACK, leading to two scenarios: may not delay the ACK, leading to two scenarios:
(A) The ACK for S1 is delayed: In this case the arrival of S3 will (A) The ACK for S1 is delayed: In this case the arrival of S3 will
trigger an ACK to be transmitted covering segment S1 (which was trigger an ACK to be transmitted covering segment S1 (which was
previously unacknowledged). In this case Early Retransmit previously unacknowledged). In this case Early Retransmit
without SACK will not prevent an RTO because no duplicate ACKs without SACK will not prevent an RTO because no duplicate ACKs
will arrive. However, with SACK the ACK for S1 will also will arrive. However, with SACK the ACK for S1 will also
include SACK information indicating that S3 has arrived at the include SACK information indicating that S3 has arrived at the
receiver. The sender can then invoke Early Retransmit on this receiver. The sender can then invoke Early Retransmit on this
ACK because only one packet remains outstanding. ACK because only one segment remains outstanding.
(B) The ACK for S1 is not delayed: In this case the arrival of S1 (B) The ACK for S1 is not delayed: In this case the arrival of S1
triggers an ACK of previously unacknowledged data. The arrival triggers an ACK of previously unacknowledged data. The arrival
of S3 triggers a duplicate ACK (because it is out-of-order). of S3 triggers a duplicate ACK (because it is out-of-order).
Both ACKs will cover the same segment (S1). Therefore, Both ACKs will cover the same segment (S1). Therefore,
regardless of whether SACK is used Early Retransmit can be regardless of whether SACK is used Early Retransmit can be
performed by the sender (assuming no ACK loss). performed by the sender (assuming no ACK loss).
3.2 Segment Reordering
Early Retransmit is less robust in the face of reordered segments Early Retransmit is less robust in the face of reordered segments
than when using the standard Fast Retransmit threshold. Research than when using the standard Fast Retransmit threshold. Research
shows that a general reduction in the number of duplicate ACKs shows that a general reduction in the number of duplicate ACKs
required to trigger Fast Retransmit to two (rather than three) leads required to trigger Fast Retransmit to two (rather than three) leads
to a reduction in the ratio of good to bad retransmits by a factor to a reduction in the ratio of good to bad retransmits by a factor
of three [Pax97]. However, this analysis did not include the of three [Pax97]. However, this analysis did not include the
additional conditioning on the event that the ownd was smaller than additional conditioning on the event that the ownd was smaller than
4 segments and that no new data was available for transmission. 4 segments and that no new data was available for transmission.
A number of studies have shown that network reordering is not a rare A number of studies have shown that network reordering is not a rare
event across some network paths. Various measurement studies have event across some network paths. Various measurement studies have
shown that reordering along most paths is negligible, but along shown that reordering along most paths is negligible, but along
certain paths can be quite prevalent [Pax97,BPS99,BS02,Pir05]. certain paths can be quite prevalent [Pax97,BPS99,BS02,Pir05].
Evaluating Early Retransmit in the face of real packet reordering is Evaluating Early Retransmit in the face of real segment reordering is
part of the experiment we hope to instigate with this document. part of the experiment we hope to instigate with this document.
3.3 Worst Case
Next, we note two "worst case" scenarios for Early Retransmit: Next, we note two "worst case" scenarios for Early Retransmit:
(1) Persistent reordering of segments coupled with an application (1) Persistent reordering of segments coupled with an application
that does not constantly send data can result in large numbers that does not constantly send data can result in large numbers
of needless retransmissions when using Early Retransmit. For of needless retransmissions when using Early Retransmit. For
instance, consider an application that sends data two segments instance, consider an application that sends data two segments
at a time, followed by an idle period when no data is queued for at a time, followed by an idle period when no data is queued for
delivery. If the network consistently reorders the two delivery. If the network consistently reorders the two
segments, the sender will needlessly retransmit one out of every segments, the sender will needlessly retransmit one out of every
two unique segments transmitted when using the above algorithm two unique segments transmitted when using the above algorithm
skipping to change at page 7, line 44 skipping to change at page 8, line 22
and part of the experiments that this document hopes to trigger and part of the experiments that this document hopes to trigger
would involve better understanding of whether such theoretical worst would involve better understanding of whether such theoretical worst
case scenarios are prevalent in the network and in general to case scenarios are prevalent in the network and in general to
explore the tradeoff between spurious fast retransmits and the delay explore the tradeoff between spurious fast retransmits and the delay
imposed by the RTO. Appendix A does offer a survey of possible imposed by the RTO. Appendix A does offer a survey of possible
mitigations that call for curtailing the use of Early Retransmit mitigations that call for curtailing the use of Early Retransmit
when it is making poor retransmission decisions. when it is making poor retransmission decisions.
4 Related Work 4 Related Work
There are a number of similar proposals in the literature that
attempt to mitigate the same problem Early Retransmit addresses.
Deployment of Explicit Congestion Notification (ECN) [Flo94,RFC3168] Deployment of Explicit Congestion Notification (ECN) [Flo94,RFC3168]
may benefit connections with small congestion window sizes may benefit connections with small congestion window sizes
[RFC2884]. ECN provides a method for indicating congestion to the [RFC2884]. ECN provides a method for indicating congestion to the
end-host without dropping segments. While some segment drops may end-host without dropping segments. While some segment drops may
still occur, ECN may allow a transport to perform better with small still occur, ECN may allow a transport to perform better with small
cwnd sizes because the sender will be required to detect less congestion window sizes because the sender will be required to
segment loss [RFC2884]. detect less segment loss [RFC2884].
[Bal98] outlines another solution to the problem of having no new [Bal98] outlines another solution to the problem of having no new
segments to transmit into the network when the first two duplicate segments to transmit into the network when the first two duplicate
ACKs arrive. In response to these duplicate ACKs, a TCP sender ACKs arrive. In response to these duplicate ACKs, a TCP sender
transmits zero-byte segments to induce additional duplicate ACKs. transmits zero-byte segments to induce additional duplicate ACKs.
This method preserves the robustness of the standard Fast Retransmit This method preserves the robustness of the standard Fast Retransmit
algorithm at the cost of injecting segments into the network that do algorithm at the cost of injecting segments into the network that do
not deliver any data, and therefore are potentially wasting network not deliver any data, and therefore are potentially wasting network
resources (at a time when there is a reasonable chance that the resources (at a time when there is a reasonable chance that the
resources are scarce). resources are scarce).
[RFC4653] also defines an orthogonal method for altering the
duplicate ACK threshold. The mechanisms proposed in this document
decrease the duplicate ACK threshold when a small amount of data is
outstanding. Meanwhile, the mechanisms in [RFC4653] increase the
duplicate ACK threshold (over the standard of 3) when the congestion
window is large in an effort to increase robustness to segment
reordering.
5 Security Considerations 5 Security Considerations
The security considerations found in [RFC2581] apply to this The security considerations found in [RFC5681] apply to this
document. No additional security problems have been identified with document. No additional security problems have been identified with
Early Retransmit at this time. Early Retransmit at this time.
6 IANA Considerations
None
Acknowledgments Acknowledgments
We thank Sally Floyd for her feedback in discussions about Early We thank Sally Floyd for her feedback in discussions about Early
Retransmit. The notion of Early Transmit was originally sketched in Retransmit. The notion of Early Transmit was originally sketched in
an Internet-Draft co-authored by Sally Floyd and Hari Balakrishnan. an Internet-Draft co-authored by Sally Floyd and Hari Balakrishnan.
Armando Caro and many members of the TSVWG and TCPM working groups Armando Caro, Joe Touch and Alexander Zimmermann and many members of
provided good discussions that helped shape this document. Our the TSVWG and TCPM working groups provided good discussions that
thanks to all! helped shape this document. Our thanks to all!
Normative References Normative References
[RFC793] Jon Postel. Transmission Control Protocol. Std 7, RFC [RFC793] Jon Postel. Transmission Control Protocol. Std 7, RFC
793. September 1981. 793. September 1981.
[RFC2018] Matt Mathis, Jamshid Mahdavi, Sally Floyd, Allyn Romanow. [RFC2018] Matt Mathis, Jamshid Mahdavi, Sally Floyd, Allyn Romanow.
TCP Selective Acknowledgement Options. RFC 2018, October 1996. TCP Selective Acknowledgement Options. RFC 2018, October 1996.
[RFC2581] Mark Allman, Vern Paxson, W. Richard Stevens. TCP [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
Congestion Control. RFC 2581, April 1999. Requirement Levels", BCP 14, RFC 2119, March 1997.
[RFC2883] Sally Floyd, Jamshid Mahdavi, Matt Mathis, Matt Podolsky. [RFC2883] Sally Floyd, Jamshid Mahdavi, Matt Mathis, Matt Podolsky.
An Extension to the Selective Acknowledgement (SACK) Option for An Extension to the Selective Acknowledgement (SACK) Option for
TCP. RFC 2883, July 2000. TCP. RFC 2883, July 2000.
[RFC2988] Vern Paxson, Mark Allman. Computing TCP's Retransmission [RFC2988] Vern Paxson, Mark Allman. Computing TCP's Retransmission
Timer. RFC 2988, April 2000. Timer. RFC 2988, April 2000.
[RFC3042] Mark Allman, Hari Balakrishnan, Sally Floyd. Enhancing [RFC3042] Mark Allman, Hari Balakrishnan, Sally Floyd. Enhancing
TCP's Loss Recovery Using Limited Transmit. RFC 3042, January TCP's Loss Recovery Using Limited Transmit. RFC 3042, January
2001. 2001.
[RFC3522] Reiner Ludwig, Michael Meyer. The Eifel Detection
Algorithm for TCP. RFC 3522, April 2003.
[RFC4960] R. Stewart. Stream Control Transmission Protocol. RFC [RFC4960] R. Stewart. Stream Control Transmission Protocol. RFC
4960, September 2007. 4960, September 2007.
[RFC5681] Mark Allman, Vern Paxson, Ethan Blanton. TCP Congestion
Control. RFC 5681, May 2009.
Informative References Informative References
[AA02] Urtzi Ayesta, Konstantin Avrachenkov, "The Effect of the [AA02] Urtzi Ayesta, Konstantin Avrachenkov, "The Effect of the
Initial Window Size and Limited Transmit Algorithm on the Initial Window Size and Limited Transmit Algorithm on the
Transient Behavior of TCP Transfers", In Proc. of the 15th ITC Transient Behavior of TCP Transfers", In Proc. of the 15th ITC
Internet Specialist Seminar, Wurzburg, July 2002. Internet Specialist Seminar, Wurzburg, July 2002.
[All00] Mark Allman. A Web Server's View of the Transport Layer. [All00] Mark Allman. A Web Server's View of the Transport Layer.
ACM Computer Communications Review, October 2000. ACM Computer Communications Review, October 2000.
[Bal98] Hari Balakrishnan. Challenges to Reliable Data Transport [Bal98] Hari Balakrishnan. Challenges to Reliable Data Transport
over Heterogeneous Wireless Networks. Ph.D. Thesis, University over Heterogeneous Wireless Networks. Ph.D. Thesis, University
of California at Berkeley, August 1998. of California at Berkeley, August 1998.
[BPS+98] Hari Balakrishnan, Venkata Padmanabhan, Srinivasan Seshan, [BPS+98] Hari Balakrishnan, Venkata Padmanabhan, Srinivasan Seshan,
Mark Stemm, and Randy Katz. TCP Behavior of a Busy Web Server: Mark Stemm, and Randy Katz. TCP Behavior of a Busy Web Server:
Analysis and Improvements. Proc. IEEE INFOCOM Conf., San Analysis and Improvements. Proc. IEEE INFOCOM Conf., San
Francisco, CA, March 1998. Francisco, CA, March 1998.
skipping to change at page 9, line 16 skipping to change at page 10, line 8
[Bal98] Hari Balakrishnan. Challenges to Reliable Data Transport [Bal98] Hari Balakrishnan. Challenges to Reliable Data Transport
over Heterogeneous Wireless Networks. Ph.D. Thesis, University over Heterogeneous Wireless Networks. Ph.D. Thesis, University
of California at Berkeley, August 1998. of California at Berkeley, August 1998.
[BPS+98] Hari Balakrishnan, Venkata Padmanabhan, Srinivasan Seshan, [BPS+98] Hari Balakrishnan, Venkata Padmanabhan, Srinivasan Seshan,
Mark Stemm, and Randy Katz. TCP Behavior of a Busy Web Server: Mark Stemm, and Randy Katz. TCP Behavior of a Busy Web Server:
Analysis and Improvements. Proc. IEEE INFOCOM Conf., San Analysis and Improvements. Proc. IEEE INFOCOM Conf., San
Francisco, CA, March 1998. Francisco, CA, March 1998.
[BPS99] Jon Bennett, Craig Partridge, Nicholas Shectman. Packet
Reordering is Not Pathological Network Behavior. IEEE/ACM
Transactions on Networking, December 1999.
[BS02] John Bellardo, Stefan Savage. Measuring Packet Reordering, [BS02] John Bellardo, Stefan Savage. Measuring Packet Reordering,
ACM/USENIX Internet Measurement Workshop, November 2002. ACM/USENIX Internet Measurement Workshop, November 2002.
[FF96] Kevin Fall, Sally Floyd. Simulation-based Comparisons of [FF96] Kevin Fall, Sally Floyd. Simulation-based Comparisons of
Tahoe, Reno, and SACK TCP. ACM Computer Communication Review, Tahoe, Reno, and SACK TCP. ACM Computer Communication Review,
July 1996. July 1996.
[Flo94] Sally Floyd. TCP and Explicit Congestion Notification. ACM [Flo94] Sally Floyd. TCP and Explicit Congestion Notification. ACM
Computer Communication Review, October 1994. Computer Communication Review, October 1994.
[HB08] Per Hurtig, Anna Brunstrom. Enhancing SCTP Loss Recovery: An [HB08] Per Hurtig, Anna Brunstrom. Enhancing SCTP Loss Recovery: An
Experimental Evaluation of Early Retransmit. Elsevier Computer Experimental Evaluation of Early Retransmit. Elsevier Computer
Communication, 2008, to appear. Communications, Vol. 31(16), October 2008, pp. 3778-3788.
[Jac88] Van Jacobson. Congestion Avoidance and Control. ACM [Jac88] Van Jacobson. Congestion Avoidance and Control. ACM
SIGCOMM 1988. SIGCOMM 1988.
[LK98] Dong Lin, H.T. Kung. TCP Fast Recovery Strategies: Analysis [LK98] Dong Lin, H.T. Kung. TCP Fast Recovery Strategies: Analysis
and Improvements. Proceedings of InfoCom, San Francisco, CA, and Improvements. Proceedings of InfoCom, San Francisco, CA,
March 1998. March 1998.
[Mor97] Robert Morris. TCP Behavior with Many Flows. Proceedings [Mor97] Robert Morris. TCP Behavior with Many Flows. Proceedings
of the Fifth IEEE International Conference on Network Protocols. of the Fifth IEEE International Conference on Network Protocols.
skipping to change at page 9, line 50 skipping to change at page 10, line 46
[Pax97] Vern Paxson. End-to-End Internet Packet Dynamics. ACM [Pax97] Vern Paxson. End-to-End Internet Packet Dynamics. ACM
SIGCOMM, September 1997. SIGCOMM, September 1997.
[Pir05] N. M. Piratla, "A Theoretical Foundation, Metrics and [Pir05] N. M. Piratla, "A Theoretical Foundation, Metrics and
Modeling of Packet Reordering and Methodology of Delay Modeling Modeling of Packet Reordering and Methodology of Delay Modeling
using Inter-packet Gaps," Ph.D. Dissertation, Department of using Inter-packet Gaps," Ph.D. Dissertation, Department of
Electrical and Computer Engineering, Colorado State University, Electrical and Computer Engineering, Colorado State University,
Fort Collins, CO, Fall 2005. Fort Collins, CO, Fall 2005.
[RFC2582] Sally Floyd, Tom Henderson. The NewReno Modification to
TCP's Fast Recovery Algorithm. RFC 2582, April 1999.
[RFC2884] Jamal Hadi Salim and Uvaiz Ahmed. Performance Evaluation [RFC2884] Jamal Hadi Salim and Uvaiz Ahmed. Performance Evaluation
of Explicit Congestion Notification (ECN) in IP Networks. RFC of Explicit Congestion Notification (ECN) in IP Networks. RFC
2884, July 2000. 2884, July 2000.
[RFC3150] Spencer Dawkins, Gabriel Montenegro, Markku Kojo, Vincent [RFC3150] Spencer Dawkins, Gabriel Montenegro, Markku Kojo, Vincent
Magret. End-to-end Performance Implications of Slow Links. RFC Magret. End-to-end Performance Implications of Slow Links. RFC
3150, July 2001. 3150, July 2001.
[RFC3168] K. K. Ramakrishnan, Sally Floyd, David Black. The [RFC3168] K. K. Ramakrishnan, Sally Floyd, David Black. The
Addition of Explicit Congestion Notification (ECN) to IP. RFC Addition of Explicit Congestion Notification (ECN) to IP. RFC
3168, September 2001. 3168, September 2001.
[RFC3517] Ethan Blanton, Mark Allman, Kevin Fall, Lili Wang. A [RFC3517] Ethan Blanton, Mark Allman, Kevin Fall, Lili Wang. A
Conservative Selective Acknowledgment (SACK)-based Loss Recovery Conservative Selective Acknowledgment (SACK)-based Loss Recovery
Algorithm for TCP. RFC 3517, April 2003. Algorithm for TCP. RFC 3517, April 2003.
[RFC3522] Reiner Ludwig, Michael Meyer. The Eifel Detection
Algorithm for TCP. RFC 3522, April 2003.
[RFC3782] Sally Floyd, Tom Henderson, Andrei Gurtov. The NewReno
Modification to TCP's Fast Recovery Algorithm. RFC 3782, April
2004.
Author's Addresses: Author's Addresses:
Mark Allman Mark Allman
International Computer Science Institute International Computer Science Institute
1947 Center Street, Suite 600 1947 Center Street, Suite 600
Berkeley, CA 94704-1198 Berkeley, CA 94704-1198
Phone: 440-235-1792 Phone: 440-235-1792
mallman@icir.org mallman@icir.org
http://www.icir.org/mallman/ http://www.icir.org/mallman/
skipping to change at page 11, line 8 skipping to change at page 12, line 9
per.hurtig@kau.se per.hurtig@kau.se
Appendix A: Research Issues in Adjusting the Duplicate ACK Threshold Appendix A: Research Issues in Adjusting the Duplicate ACK Threshold
Decreasing the number of duplicate ACKs required to trigger Fast Decreasing the number of duplicate ACKs required to trigger Fast
Retransmit, as suggested in section 2, has the drawback of making Retransmit, as suggested in section 2, has the drawback of making
Fast Retransmit less robust in the face of minor network reordering. Fast Retransmit less robust in the face of minor network reordering.
Two egregious examples of problems caused by reordering are given in Two egregious examples of problems caused by reordering are given in
section 3. This appendix outlines several schemes that have been section 3. This appendix outlines several schemes that have been
suggested to mitigate the problems caused by Early Retransmit in the suggested to mitigate the problems caused by Early Retransmit in the
face of packet reordering. These methods need further research face of segment reordering. These methods need further research
before they are suggested for general use (and, current consensus is before they are suggested for general use (and, current consensus is
that the cases that make Early Retransmit unnecessarily retransmit a that the cases that make Early Retransmit unnecessarily retransmit a
large amount of data are pathological and therefore these large amount of data are pathological and therefore these
mitigations are not generally required). mitigations are not generally required).
MITIGATION A.1: Allow a connection to use Early Retransmit as long MITIGATION A.1: Allow a connection to use Early Retransmit as long
as the algorithm is not injecting "too much" spurious data into as the algorithm is not injecting "too much" spurious data into
the network. For instance, using the information provided by the network. For instance, using the information provided by
TCP's DSACK option [RFC2883] or SCTP's Duplicate-TSN TCP's DSACK option [RFC2883] or SCTP's Duplicate-TSN
notification, a sender can determine when segments sent via notification, a sender can determine when segments sent via
 End of changes. 54 change blocks. 
113 lines changed or deleted 166 lines changed or added

This html diff was produced by rfcdiff 1.37a. The latest version is available from http://tools.ietf.org/tools/rfcdiff/