draft-ietf-tcpm-3517bis-02.txt   rfc6675.txt 
TCPM Working Group E. Blanton Internet Engineering Task Force (IETF) E. Blanton
INTERNET-DRAFT Purdue University Request for Comments: 6675 Purdue University
draft-ietf-tcpm-3517bis-02.txt M. Allman Obsoletes: 3517 M. Allman
Obsoletes: 3517 ICSI Category: Standards Track ICSI
Intended status: Standards Track L. Wang ISSN: 2070-1721 L. Wang
Expires: September 2012 Juniper Networks Juniper Networks
I. Jarvinen I. Jarvinen
M. Kojo M. Kojo
University of Helsinki University of Helsinki
Y. Nishida Y. Nishida
WIDE Project WIDE Project
March 26, 2012 August 2012
A Conservative Selective Acknowledgment (SACK)-based
Loss Recovery Algorithm for TCP
Status of this Memo A Conservative Loss Recovery Algorithm Based on
Selective Acknowledgment (SACK) for TCP
This Internet-Draft is submitted to IETF in full conformance with Abstract
the provisions of BCP 78 and BCP 79.
Internet-Drafts are working documents of the Internet Engineering This document presents a conservative loss recovery algorithm for TCP
Task Force (IETF), its areas, and its working groups. Note that that is based on the use of the selective acknowledgment (SACK) TCP
other groups may also distribute working documents as Internet- option. The algorithm presented in this document conforms to the
Drafts. spirit of the current congestion control specification (RFC 5681),
but allows TCP senders to recover more effectively when multiple
segments are lost from a single flight of data. This document
obsoletes RFC 3517 and describes changes from it.
Internet-Drafts are draft documents valid for a maximum of six Status of This Memo
months and may be updated, replaced, or obsoleted by other documents
at any time. It is inappropriate to use Internet-Drafts as
reference material or to cite them other than as "work in progress."
The list of current Internet-Drafts can be accessed at This is an Internet Standards Track document.
http://www.ietf.org/ietf/1id-abstracts.txt.
The list of Internet-Draft Shadow Directories can be accessed at This document is a product of the Internet Engineering Task Force
http://www.ietf.org/shadow.html. (IETF). It represents the consensus of the IETF community. It has
received public review and has been approved for publication by
the Internet Engineering Steering Group (IESG). Further
information on Internet Standards is available in Section 2 of
RFC 5741.
This Internet-Draft will expire on September 23, 2012. Information about the current status of this document, any
errata, and how to provide feedback on it may be obtained at
http://www.rfc-editor.org/info/rfc6675.
Copyright Notice Copyright Notice
Copyright (c) 2012 IETF Trust and the persons identified as the Copyright (c) 2012 IETF Trust and the persons identified as the
document authors. All rights reserved. document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents
(http://trustee.ietf.org/license-info) in effect on the date of
publication of this document. Please review these documents
carefully, as they describe your rights and restrictions with
respect to this document. Code Components extracted from this
document must include Simplified BSD License text as described in
Section 4.e of the Trust Legal Provisions and are provided without
warranty as described in the Simplified BSD License.
Abstract
This document presents a conservative loss recovery algorithm for TCP This document is subject to BCP 78 and the IETF Trust's Legal
that is based on the use of the selective acknowledgment (SACK) TCP Provisions Relating to IETF Documents
option. The algorithm presented in this document conforms to the (http://trustee.ietf.org/license-info) in effect on the date of
spirit of the current congestion control specification (RFC 5681), publication of this document. Please review these documents
but allows TCP senders to recover more effectively when multiple carefully, as they describe your rights and restrictions with respect
segments are lost from a single flight of data. to this document. Code Components extracted from this document must
include Simplified BSD License text as described in Section 4.e of
the Trust Legal Provisions and are provided without warranty as
described in the Simplified BSD License.
1 Introduction 1. Introduction
This document presents a conservative loss recovery algorithm for TCP This document presents a conservative loss recovery algorithm for TCP
that is based on the use of the selective acknowledgment (SACK) TCP that is based on the use of the selective acknowledgment (SACK) TCP
option. While the TCP SACK [RFC2018] is being steadily deployed in option. While the TCP SACK option [RFC2018] is being steadily
the Internet [All00], there is evidence that hosts are not using the deployed in the Internet [All00], there is evidence that hosts are
SACK information when making retransmission and congestion control not using the SACK information when making retransmission and
decisions [PF01]. The goal of this document is to outline one congestion control decisions [PF01]. The goal of this document is to
straightforward method for TCP implementations to use SACK outline one straightforward method for TCP implementations to use
information to increase performance. SACK information to increase performance.
[RFC5681] allows advanced loss recovery algorithms to be used by TCP [RFC5681] allows advanced loss recovery algorithms to be used by TCP
[RFC793] provided that they follow the spirit of TCP's congestion [RFC793] provided that they follow the spirit of TCP's congestion
control algorithms [RFC5681, RFC2914]. [RFC3782] outlines one such control algorithms [RFC5681] [RFC2914]. [RFC6582] outlines one such
advanced recovery algorithm called NewReno. This document outlines a advanced recovery algorithm called NewReno. This document outlines a
loss recovery algorithm that uses the SACK [RFC2018] TCP option to loss recovery algorithm that uses the SACK TCP option [RFC2018] to
enhance TCP's loss recovery. The algorithm outlined in this enhance TCP's loss recovery. The algorithm outlined in this
document, heavily based on the algorithm detailed in [FF96], is a document, heavily based on the algorithm detailed in [FF96], is a
conservative replacement of the fast recovery algorithm [Jac90, conservative replacement of the fast recovery algorithm [Jac90]
RFC5681]. The algorithm specified in this document is a [RFC5681]. The algorithm specified in this document is a
straightforward SACK-based loss recovery strategy that follows the straightforward SACK-based loss recovery strategy that follows the
guidelines set in [RFC5681] and can safely be used in TCP guidelines set in [RFC5681] and can safely be used in TCP
implementations. Alternate SACK-based loss recovery methods can be implementations. Alternate SACK-based loss recovery methods can be
used in TCP as implementers see fit (as long as the alternate used in TCP as implementers see fit (as long as the alternate
algorithms follow the guidelines provided in [RFC5681]). Please algorithms follow the guidelines provided in [RFC5681]). Please
note, however, that the SACK-based decisions in this document (such note, however, that the SACK-based decisions in this document (such
as what segments are to be sent at what time) are largely decoupled as what segments are to be sent at what time) are largely decoupled
from the congestion control algorithms, and as such can be treated as from the congestion control algorithms, and as such can be treated as
separate issues if so desired. separate issues if so desired.
This document represents a revision of [RFC3517] to address several This document represents a revision of [RFC3517] to address several
situations that are not handled explicitly in that document. A situations that are not handled explicitly in that document. A
summary of the changes between this document and [RFC3517] can be summary of the changes between this document and [RFC3517] can be
found in Section 9. found in Section 9.
2. Definitions
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in BCP 14, RFC 2119 document are to be interpreted as described in BCP 14, RFC 2119
[RFC2119]. [RFC2119].
2 Definitions
The reader is expected to be familiar with the definitions given in The reader is expected to be familiar with the definitions given in
[RFC5681]. [RFC5681].
The reader is assumed to be familiar with selective acknowledgments The reader is assumed to be familiar with selective acknowledgments
as specified in [RFC2018]. as specified in [RFC2018].
For the purposes of explaining the SACK-based loss recovery algorithm For the purposes of explaining the SACK-based loss recovery
we define six variables that a TCP sender stores: algorithm, we define six variables that a TCP sender stores:
"HighACK" is the sequence number of the highest byte of data that "HighACK" is the sequence number of the highest byte of data that
has been cumulatively ACKed at a given point. has been cumulatively ACKed at a given point.
"HighData" is the highest sequence number transmitted at a given "HighData" is the highest sequence number transmitted at a given
point. point.
"HighRxt" is the highest sequence number which has been "HighRxt" is the highest sequence number which has been
retransmitted during the current loss recovery phase. retransmitted during the current loss recovery phase.
"RescueRxt" is the highest sequence number which has been "RescueRxt" is the highest sequence number which has been
retransmitted optimistically to prevent stalling of the ACK clock optimistically retransmitted to prevent stalling of the ACK clock
when there is loss at the end of the window and no new data is when there is loss at the end of the window and no new data is
available for transmission. available for transmission.
"Pipe" is a sender's estimate of the number of bytes outstanding "Pipe" is a sender's estimate of the number of bytes outstanding
in the network. This is used during recovery for limiting the in the network. This is used during recovery for limiting the
sender's sending rate. The pipe variable allows TCP to use a sender's sending rate. The pipe variable allows TCP to use
fundamentally different congestion control than specified in fundamentally different congestion control than the algorithm
[RFC5681]. The algorithm is often referred to as the "pipe specified in [RFC5681]. The congestion control algorithm using
algorithm". the pipe estimate is often referred to as the "pipe algorithm".
"DupAcks" is the number of duplicate acknowledgments received "DupAcks" is the number of duplicate acknowledgments received
since the last cumulative acknowledgment. since the last cumulative acknowledgment.
For the purposes of this specification we define a "duplicate For the purposes of this specification, we define a "duplicate
acknowledgment" as a segment that arrives carrying a SACK block that acknowledgment" as a segment that arrives carrying a SACK block that
identifies previously unacknowledged and un-SACKed octets between identifies previously unacknowledged and un-SACKed octets between
HighACK and HighData. Note that an ACK which carries new HighACK and HighData. Note that an ACK which carries new SACK data
SACK data is counted as a duplicate acknowledgment under this is counted as a duplicate acknowledgment under this definition even
definition even if it carries new data, changes the advertised if it carries new data, changes the advertised window, or moves the
window, or moves the cumulative acknowledgment point, which is cumulative acknowledgment point, which is different from the
different from the definition of duplicate acknowledgment definition of duplicate acknowledgment in [RFC5681].
in [RFC5681].
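As an illustration only, the duplicate acknowledgment test defined above can be sketched in Python. The sketch is not part of the specification. It assumes a hypothetical per-octet set `sacked` in place of a real scoreboard, treats SACK blocks as inclusive (start, end) octet ranges, and ignores sequence number wraparound.

```python
def is_duplicate_ack(sack_blocks, sacked, high_ack, high_data):
    """Return True if the ACK carries new SACK information.

    sack_blocks: list of inclusive (start, end) octet ranges from the ACK
                 (inclusive ranges are an assumption of this sketch).
    sacked:      set of octets already SACKed (hypothetical scoreboard).
    high_ack:    highest octet cumulatively ACKed so far (HighACK).
    high_data:   highest octet transmitted so far (HighData).
    """
    for start, end in sack_blocks:
        # Only octets above HighACK and up to HighData are of interest.
        lo = max(start, high_ack + 1)
        hi = min(end, high_data)
        for seq in range(lo, hi + 1):
            if seq not in sacked:
                return True  # previously unacknowledged and un-SACKed
    return False
```

Note that the function does not look at the cumulative acknowledgment field at all, which mirrors the statement that an ACK with new SACK data counts as a duplicate even if it moves the cumulative acknowledgment point.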
We define a variable "DupThresh" that holds the number of duplicate We define a variable "DupThresh" that holds the number of duplicate
acknowledgments required to trigger a retransmission. Per [RFC5681] acknowledgments required to trigger a retransmission. Per [RFC5681],
this threshold is defined to be 3 duplicate acknowledgments. this threshold is defined to be 3 duplicate acknowledgments.
However, implementers should consult any updates to [RFC5681] to However, implementers should consult any updates to [RFC5681] to
determine the current value for DupThresh (or method for determining determine the current value for DupThresh (or method for determining
its value). its value).
Finally, a range of sequence numbers [A,B] is said to "cover" Finally, a range of sequence numbers [A,B] is said to "cover"
sequence number S if A <= S <= B. sequence number S if A <= S <= B.
3 Keeping Track of SACK Information 3. Keeping Track of SACK Information
For a TCP sender to implement the algorithm defined in the next For a TCP sender to implement the algorithm defined in the next
section it must keep a data structure to store incoming selective section, it must keep a data structure to store incoming selective
acknowledgment information on a per connection basis. Such a data acknowledgment information on a per connection basis. Such a data
structure is commonly called the "scoreboard". The specifics of the structure is commonly called the "scoreboard". The specifics of the
scoreboard data structure are out of scope for this document (as long scoreboard data structure are out of scope for this document (as long
as the implementation can perform all functions required by this as the implementation can perform all functions required by this
specification). specification).
Note that this document refers to keeping account of (marking) Note that this document refers to keeping account of (marking)
individual octets of data transferred across a TCP connection. A individual octets of data transferred across a TCP connection. A
real-world implementation of the scoreboard would likely prefer to real-world implementation of the scoreboard would likely prefer to
manage this data as sequence number ranges. The algorithms presented manage this data as sequence number ranges. The algorithms presented
here allow this, but require the ability to mark arbitrary sequence here allow this, but require the ability to mark arbitrary sequence
number ranges as having been selectively acknowledged. number ranges as having been selectively acknowledged.
Finally, note that the algorithm in this document assumes a Finally, note that the algorithm in this document assumes a sender
sender that is not keeping track of segment boundaries after that is not keeping track of segment boundaries after transmitting a
transmitting a segment. It is possible that a sender that did segment. It is possible that there is a more refined and precise
keep this extra state may be able to use a more refined and algorithm available to a sender that keeps this extra state than the
precise algorithm than the one presented herein, however, we algorithm presented herein; however, we leave this as future work.
leave this as future work.
4 Processing and Acting Upon SACK Information 4. Processing and Acting Upon SACK Information
For the purposes of the algorithm defined in this document the This section describes a specific structure and control flow for
implementing the TCP behavior described by this standard. The
behavior is what is standardized, and this particular collection of
functions is the strongly recommended means of implementing that
behavior, though other approaches to achieving that behavior are
feasible.
The definition of Sender Maximum Segment Size (SMSS) used in this
section is provided in [RFC5681].
For the purposes of the algorithm defined in this document, the
scoreboard SHOULD implement the following functions: scoreboard SHOULD implement the following functions:
Update (): Update ():
Given the information provided in an ACK, each octet that is Given the information provided in an ACK, each octet that is
cumulatively ACKed or SACKed should be marked accordingly in the cumulatively ACKed or SACKed should be marked accordingly in the
scoreboard data structure, and the total number of octets SACKed scoreboard data structure, and the total number of octets SACKed
should be recorded. should be recorded.
Note: SACK information is advisory and therefore SACKed data MUST Note: SACK information is advisory and therefore SACKed data MUST
NOT be removed from TCP's retransmission buffer until the data is NOT be removed from the TCP's retransmission buffer until the data
cumulatively acknowledged [RFC2018]. is cumulatively acknowledged [RFC2018].
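A minimal sketch of Update () follows, for illustration only. The class name, the per-octet set, and the inclusive (start, end) block format are assumptions of this sketch; a real implementation would more likely store sequence number ranges, as Section 3 notes. Sequence number wraparound is ignored, and the retransmission buffer is not modeled.

```python
class Scoreboard:
    """Hypothetical per-octet scoreboard, chosen for clarity over speed."""

    def __init__(self, high_ack):
        self.high_ack = high_ack  # highest octet cumulatively ACKed
        self.sacked = set()       # octets SACKed above high_ack

    def update(self, cum_ack, sack_blocks):
        """Record one ACK and return the total number of octets SACKed.

        cum_ack:     highest octet cumulatively ACKed by this ACK.
        sack_blocks: inclusive (start, end) octet ranges.
        """
        if cum_ack > self.high_ack:
            self.high_ack = cum_ack
        for start, end in sack_blocks:
            first = max(start, self.high_ack + 1)
            self.sacked.update(range(first, end + 1))
        # Octets at or below high_ack are now cumulatively ACKed, so
        # their SACK marks are no longer needed.
        self.sacked = {s for s in self.sacked if s > self.high_ack}
        return len(self.sacked)
```

For example, starting from `Scoreboard(high_ack=999)`, an ACK with the single block (2000, 2999) leaves 1000 octets marked as SACKed.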
IsLost (SeqNum): IsLost (SeqNum):
This routine returns whether the given sequence number is This routine returns whether the given sequence number is
considered to be lost. The routine returns true when either considered to be lost. The routine returns true when either
DupThresh discontiguous SACKed sequences have arrived above DupThresh discontiguous SACKed sequences have arrived above
'SeqNum' or more than (DupThresh - 1) * SMSS bytes with sequence 'SeqNum' or more than (DupThresh - 1) * SMSS bytes with sequence
numbers greater than 'SeqNum' have been SACKed. Otherwise, the numbers greater than 'SeqNum' have been SACKed. Otherwise, the
routine returns false. routine returns false.
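The two conditions of IsLost () can be sketched as follows, for illustration only. The per-octet set `sacked` is an assumption of this sketch, as is the default of 3 for `dup_thresh` (the value Section 2 gives per [RFC5681]). Sequence number wraparound is ignored.

```python
def is_lost(seq, sacked, smss, dup_thresh=3):
    """Return True if 'seq' is considered lost.

    sacked: set of SACKed octets (hypothetical per-octet scoreboard).
    smss:   sender maximum segment size in octets.
    """
    above = sorted(s for s in sacked if s > seq)

    # Count discontiguous SACKed sequences above 'seq': a new sequence
    # starts wherever an octet does not directly follow the previous one.
    runs = 0
    prev = None
    for s in above:
        if prev is None or s != prev + 1:
            runs += 1
        prev = s

    # Either condition is sufficient.
    return runs >= dup_thresh or len(above) > (dup_thresh - 1) * smss
```

With an SMSS of 1000, two full contiguous segments SACKed above `seq` (2000 octets) are not enough, because the byte test requires strictly more than (DupThresh - 1) * SMSS octets. A single additional SACKed octet tips the result to true.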
SetPipe (): SetPipe ():
This routine traverses the sequence space from HighACK to HighData This routine traverses the sequence space from HighACK to HighData
and MUST set the "pipe" variable to an estimate of the number of and MUST set the "pipe" variable to an estimate of the number of
octets that are currently in transit between the TCP sender and octets that are currently in transit between the TCP sender and
the TCP receiver. After initializing pipe to zero the following the TCP receiver. After initializing pipe to zero, the following
steps are taken for each octet 'S1' in the sequence space between steps are taken for each octet 'S1' in the sequence space between
HighACK and HighData that has not been SACKed: HighACK and HighData that has not been SACKed:
(a) If IsLost (S1) returns false: (a) If IsLost (S1) returns false:
Pipe is incremented by 1 octet. Pipe is incremented by 1 octet.
The effect of this condition is that pipe is incremented for The effect of this condition is that pipe is incremented for
packets that have not been SACKed and have not been determined packets that have not been SACKed and have not been determined
to have been lost (i.e., those segments that are still assumed to have been lost (i.e., those segments that are still assumed
[skipping to change at page 5, line 55 (draft) / page 6, line 38 (RFC 6675)]
(2) If no sequence number 'S2' per rule (1) exists but there (2) If no sequence number 'S2' per rule (1) exists but there
exists available unsent data and the receiver's advertised exists available unsent data and the receiver's advertised
window allows, the sequence range of one segment of up to SMSS window allows, the sequence range of one segment of up to SMSS
octets of previously unsent data starting with sequence number octets of previously unsent data starting with sequence number
HighData+1 MUST be returned. HighData+1 MUST be returned.
(3) If the conditions for rules (1) and (2) fail, but there exists (3) If the conditions for rules (1) and (2) fail, but there exists
an unSACKed sequence number 'S3' that meets the criteria for an unSACKed sequence number 'S3' that meets the criteria for
detecting loss given in steps (1.a) and (1.b) above detecting loss given in steps (1.a) and (1.b) above
(specifically excluding step (1.c)) then one segment of up to (specifically excluding step (1.c)), then one segment of up to
SMSS octets starting with S3 SHOULD be returned. SMSS octets starting with S3 SHOULD be returned.
(4) If the conditions for (1), (2), and (3) fail, but there (4) If the conditions for (1), (2), and (3) fail, but there exists
exists outstanding unSACKed data, we provide the outstanding unSACKed data, we provide the opportunity for a
opportunity for a single "rescue" retransmission per entry single "rescue" retransmission per entry into loss recovery.
into loss recovery. If HighACK is greater than RescueRxt If HighACK is greater than RescueRxt (or RescueRxt is
(or RescueRxt is undefined), then one segment of up to undefined), then one segment of up to SMSS octets that MUST
SMSS octets which MUST include the highest outstanding include the highest outstanding unSACKed sequence number
unSACKed sequence number SHOULD be returned, and RescueRxt SHOULD be returned, and RescueRxt set to RecoveryPoint.
set to RecoveryPoint. HighRxt MUST NOT be updated. HighRxt MUST NOT be updated.
Note that rules (3) and (4) are a sort of retransmission "last Note that rules (3) and (4) are a sort of retransmission "last
resort". They allow for retransmission of sequence numbers resort". They allow for retransmission of sequence numbers
even when the sender has less certainty a segment has been even when the sender has less certainty a segment has been
lost than as with rule (1). Retransmitting segments via rule lost than as with rule (1). Retransmitting segments via rule
(3) and (4) will help sustain TCP's ACK clock and therefore (3) and (4) will help sustain the TCP's ACK clock and
can potentially help avoid retransmission timeouts. However, therefore can potentially help avoid retransmission timeouts.
in sending these segments the sender has two copies of the However, in sending these segments, the sender has two copies
same data considered to be in the network (and also in the of the same data considered to be in the network (and also in
Pipe estimate, in the case of (3)). When an ACK or SACK the pipe estimate, in the case of (3)). When an ACK or SACK
arrives covering this retransmitted segment, the sender cannot arrives covering this retransmitted segment, the sender cannot
be sure exactly how much data left the network (one of the two be sure exactly how much data left the network (one of the two
transmissions of the packet or both transmissions of the transmissions of the packet or both transmissions of the
packet). Therefore the sender may underestimate Pipe by packet). Therefore, the sender may underestimate pipe by
considering both segments to have left the network when it is considering both segments to have left the network when it is
possible that only one of the two has. possible that only one of the two has.
(5) If the conditions for each of (1), (2), (3), and (4) are not (5) If the conditions for each of (1), (2), (3), and (4) are not
met, then NextSeg () MUST indicate failure, and no segment is met, then NextSeg () MUST indicate failure, and no segment is
returned. returned.
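The ordering of rules (2) through (5) can be sketched as below, for illustration only. Rule (1) and its criteria (1.a) through (1.c) fall in the portion of the text not shown here, so the sketch does not implement them: it is meant to be called only after rule (1) found nothing, and it takes the rule (3) candidate `s3` from the caller. The function name, the parameter names, and the choice to end the rescue segment at the highest unSACKed octet are assumptions of this sketch.

```python
def next_seg_after_rule1(s3, unsent_octets, rwnd_allows, highest_unsacked,
                         high_ack, high_data, rescue_rxt, smss):
    """Apply NextSeg () rules (2)-(5) once rule (1) has found nothing.

    s3:               unSACKed octet meeting (1.a) and (1.b), or None
                      (found by the caller; criteria not shown here).
    unsent_octets:    amount of available, previously unsent data.
    rwnd_allows:      True if the advertised window permits new data.
    highest_unsacked: highest outstanding unSACKed octet, or None.
    rescue_rxt:       RescueRxt, or None if undefined.
    Returns (rule, first_octet, last_octet), or None for failure.
    """
    # Rule (2): new data starting with HighData+1.
    if unsent_octets > 0 and rwnd_allows:
        return (2, high_data + 1, high_data + min(smss, unsent_octets))

    # Rule (3): retransmit starting with S3.
    if s3 is not None:
        return (3, s3, min(s3 + smss - 1, high_data))

    # Rule (4): a single rescue retransmission that includes the highest
    # outstanding unSACKed octet. The caller sets RescueRxt to
    # RecoveryPoint afterwards and leaves HighRxt unchanged.
    if highest_unsacked is not None and (
            rescue_rxt is None or high_ack > rescue_rxt):
        first = max(high_ack + 1, highest_unsacked - smss + 1)
        return (4, first, highest_unsacked)

    # Rule (5): nothing to send.
    return None
```

Returning the rule number alongside the range lets the caller apply the rule (4) exception when updating HighRxt in step (C.2).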
Note: The SACK-based loss recovery algorithm outlined in this Note: The SACK-based loss recovery algorithm outlined in this
document requires more computational resources than previous TCP loss document requires more computational resources than previous TCP loss
recovery strategies. However, we believe the scoreboard data recovery strategies. However, we believe the scoreboard data
structure can be implemented in a reasonably efficient manner (both structure can be implemented in a reasonably efficient manner (both
in terms of computation complexity and memory usage) in most TCP in terms of computation complexity and memory usage) in most TCP
implementations. implementations.
5 Algorithm Details 5. Algorithm Details
Upon the receipt of any ACK containing SACK information, the Upon the receipt of any ACK containing SACK information, the
scoreboard MUST be updated via the Update () routine. scoreboard MUST be updated via the Update () routine.
If the incoming ACK is a cumulative acknowledgment, the TCP MUST If the incoming ACK is a cumulative acknowledgment, the TCP MUST
reset DupAcks to zero. reset DupAcks to zero.
If the incoming ACK is a duplicate acknowledgment per the definition If the incoming ACK is a duplicate acknowledgment per the definition
in Section 2 (regardless of its status as a cumulative in Section 2 (regardless of its status as a cumulative
acknowledgment), and the TCP is not currently in loss recovery, the acknowledgment), and the TCP is not currently in loss recovery, the
TCP MUST increase DupAcks by one and take the following steps: TCP MUST increase DupAcks by one and take the following steps:
(1) If DupAcks >= DupThresh, go to step (4). (1) If DupAcks >= DupThresh, go to step (4).
Note: This check covers the case when a TCP receives SACK Note: This check covers the case when a TCP receives SACK
information for multiple segments smaller than SMSS, which can information for multiple segments smaller than SMSS, which can
potentially prevent IsLost() (next step) from declaring a segment potentially prevent IsLost() (next step) from declaring a segment
as lost. as lost.
(2) If DupAcks < DupThresh but IsLost (HighACK + 1) returns (2) If DupAcks < DupThresh but IsLost (HighACK + 1) returns true --
true---indicating at least three segments have arrived above indicating at least three segments have arrived above the current
the current cumulative acknowledgment point, which is taken cumulative acknowledgment point, which is taken to indicate loss
to indicate loss---go to step (4). -- go to step (4).
(3) The TCP MAY transmit previously unsent data segments as per (3) The TCP MAY transmit previously unsent data segments as per
Limited Transmit [RFC5681], except that the number of octets Limited Transmit [RFC5681], except that the number of octets
which may be sent is governed by Pipe and cwnd as follows: which may be sent is governed by pipe and cwnd as follows:
(3.1) Set HighRxt to HighACK. (3.1) Set HighRxt to HighACK.
(3.2) Run SetPipe (). (3.2) Run SetPipe ().
(3.3) If (cwnd - pipe) >= 1 SMSS, there exists previously (3.3) If (cwnd - pipe) >= 1 SMSS, there exists previously unsent
unsent data, and the receiver's advertised window data, and the receiver's advertised window allows, transmit
allows, transmit up to 1 SMSS of data starting with the up to 1 SMSS of data starting with the octet HighData+1 and
octet HighData+1 and update HighData to reflect this update HighData to reflect this transmission, then return
transmission, then return to (3.2). to (3.2).
(3.4) Terminate processing of this ACK. (3.4) Terminate processing of this ACK.
(4) Invoke Fast Retransmit and enter loss recovery as follows: (4) Invoke fast retransmit and enter loss recovery as follows:
(4.1) RecoveryPoint = HighData (4.1) RecoveryPoint = HighData
When the TCP sender receives a cumulative ACK for this When the TCP sender receives a cumulative ACK for this data
data octet the loss recovery phase is terminated. octet, the loss recovery phase is terminated.
(4.2) ssthresh = cwnd = (FlightSize / 2) (4.2) ssthresh = cwnd = (FlightSize / 2)
The congestion window (cwnd) and slow start threshold The congestion window (cwnd) and slow start threshold
(ssthresh) are reduced to half of FlightSize per (ssthresh) are reduced to half of FlightSize per [RFC5681].
[RFC5681]. Additionally, note that [RFC5681] requires Additionally, note that [RFC5681] requires that any
any segments sent as part of the Limited Transmit segments sent as part of the Limited Transmit mechanism not
mechanism not be counted in FlightSize for the purpose be counted in FlightSize for the purpose of the above
of the above equation. equation.
(4.3) Retransmit the first data segment presumed dropped -- the (4.3) Retransmit the first data segment presumed dropped -- the
segment starting with sequence number HighACK + 1. To segment starting with sequence number HighACK + 1. To
prevent repeated retransmission of the same data or a prevent repeated retransmission of the same data or a
premature rescue retransmission, set both HighRxt and premature rescue retransmission, set both HighRxt and
RescueRxt to the highest sequence number in the RescueRxt to the highest sequence number in the
retransmitted segment. retransmitted segment.
(4.4) Run SetPipe () (4.4) Run SetPipe ()
Set a "pipe" variable to the number of outstanding Set a "pipe" variable to the number of outstanding octets
octets currently "in the pipe"; this is the data which currently "in the pipe"; this is the data which has been
has been sent by the TCP sender but for which no sent by the TCP sender but for which no cumulative or
cumulative or selective acknowledgment has been selective acknowledgment has been received and the data has
received and the data has not been determined to have not been determined to have been dropped in the network.
been dropped in the network. It is assumed that the It is assumed that the data is still traversing the network
data is still traversing the network path. path.
(4.5) In order to take advantage of potential additional (4.5) In order to take advantage of potential additional
available cwnd, proceed to step (C) below. available cwnd, proceed to step (C) below.
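The decision made in steps (1) through (4) can be sketched as follows, for illustration only. The `SenderState` class, the returned strings, and the use of integer division for FlightSize / 2 are assumptions of this sketch. The sketch covers only the state changes; running SetPipe (), sending Limited Transmit data in steps (3.2) through (3.4), and continuing with step (C) are left to the caller, and `is_lost` is passed in as a callback.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class SenderState:
    """Hypothetical container for the variables of Section 2."""
    high_ack: int
    high_data: int
    cwnd: int
    ssthresh: int
    flight_size: int
    smss: int
    dup_acks: int = 0
    dup_thresh: int = 3
    high_rxt: Optional[int] = None
    rescue_rxt: Optional[int] = None
    recovery_point: Optional[int] = None


def on_duplicate_ack(st: SenderState, is_lost: Callable[[int], bool]) -> str:
    """Handle a duplicate ACK that arrives outside loss recovery."""
    st.dup_acks += 1

    # Steps (1) and (2): either test sends the sender to step (4).
    if st.dup_acks >= st.dup_thresh or is_lost(st.high_ack + 1):
        st.recovery_point = st.high_data             # (4.1)
        st.ssthresh = st.cwnd = st.flight_size // 2  # (4.2)
        # (4.3): the segment starting at HighACK + 1 is retransmitted;
        # record its highest octet in both HighRxt and RescueRxt.
        last = min(st.high_ack + st.smss, st.high_data)
        st.high_rxt = st.rescue_rxt = last
        return "enter_recovery"  # caller runs (4.4), (4.5)

    # Step (3.1): Limited Transmit; caller runs (3.2)-(3.4).
    st.high_rxt = st.high_ack
    return "limited_transmit"
```

With DupThresh at 3 and an `is_lost` that always returns false, the first two duplicate ACKs yield "limited_transmit" and the third yields "enter_recovery".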
Once a TCP is in the loss recovery phase the following procedure MUST Once a TCP is in the loss recovery phase, the following procedure
be used for each arriving ACK: MUST be used for each arriving ACK:
(A) An incoming cumulative ACK for a sequence number greater than (A) An incoming cumulative ACK for a sequence number greater than
RecoveryPoint signals the end of loss recovery and the loss RecoveryPoint signals the end of loss recovery, and the loss
recovery phase MUST be terminated. Any information contained in recovery phase MUST be terminated. Any information contained in
the scoreboard for sequence numbers greater than the new value of the scoreboard for sequence numbers greater than the new value of
HighACK SHOULD NOT be cleared when leaving the loss recovery HighACK SHOULD NOT be cleared when leaving the loss recovery
phase. phase.
(B) Upon receipt of an ACK that does not cover RecoveryPoint the (B) Upon receipt of an ACK that does not cover RecoveryPoint, the
following actions MUST be taken: following actions MUST be taken:
(B.1) Use Update () to record the new SACK information conveyed (B.1) Use Update () to record the new SACK information conveyed
by the incoming ACK. by the incoming ACK.
(B.2) Use SetPipe () to re-calculate the number of octets still (B.2) Use SetPipe () to re-calculate the number of octets still
in the network. in the network.
(C) If cwnd - pipe >= 1 SMSS the sender SHOULD transmit one or more (C) If cwnd - pipe >= 1 SMSS, the sender SHOULD transmit one or more
segments as follows: segments as follows:
(C.1) The scoreboard MUST be queried via NextSeg () for the (C.1) The scoreboard MUST be queried via NextSeg () for the
sequence number range of the next segment to transmit (if any), sequence number range of the next segment to transmit (if
and the given segment sent. If NextSeg () returns failure (no any), and the given segment sent. If NextSeg () returns
data to send) return without sending anything (i.e., terminate failure (no data to send), return without sending anything
steps C.1 -- C.5). (i.e., terminate steps C.1 -- C.5).
(C.2) If any of the data octets sent in (C.1) are below HighData, (C.2) If any of the data octets sent in (C.1) are below HighData,
HighRxt MUST be set to the highest sequence number of the HighRxt MUST be set to the highest sequence number of the
retransmitted segment unless NextSeg () rule (4) was invoked for retransmitted segment unless NextSeg () rule (4) was
this retransmission. invoked for this retransmission.
(C.3) If any of the data octets sent in (C.1) are above HighData, (C.3) If any of the data octets sent in (C.1) are above HighData,
HighData must be updated to reflect the transmission of HighData must be updated to reflect the transmission of
previously unsent data. previously unsent data.
(C.4) The estimate of the amount of data outstanding in the (C.4) The estimate of the amount of data outstanding in the
network must be updated by incrementing pipe by the number of network must be updated by incrementing pipe by the number
octets transmitted in (C.1). of octets transmitted in (C.1).
(C.5) If cwnd - pipe >= 1 SMSS, return to (C.1) (C.5) If cwnd - pipe >= 1 SMSS, return to (C.1)
Note that steps (A) and (C) can potentially send a burst of Note that steps (A) and (C) can potentially send a burst of
back-to-back segments into the network if the incoming cumulative back-to-back segments into the network if the incoming cumulative
acknowledgment is for more than SMSS octets of data, or if incoming acknowledgment is for more than SMSS octets of data, or if incoming
SACK blocks indicate that more than SMSS octets of data have been SACK blocks indicate that more than SMSS octets of data have been
lost in the second half of the window. lost in the second half of the window.
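The transmit loop in steps (C.1) through (C.5) can be sketched as follows. This is a minimal illustration rather than a reference implementation. The `RecoveryState` container, the `next_seg` and `send` callbacks, the inclusive `(first, last, rule)` return shape, and the 1000-octet SMSS are all assumptions made for the example; only the names cwnd, pipe, HighData, HighRxt, and NextSeg () come from the text above.

```python
from dataclasses import dataclass

SMSS = 1000  # assumed sender maximum segment size, in octets


@dataclass
class RecoveryState:
    cwnd: int       # congestion window, in octets
    pipe: int       # estimate of octets still in the network (from SetPipe ())
    high_data: int  # HighData: highest sequence number transmitted so far
    high_rxt: int   # HighRxt: highest sequence number retransmitted so far


def step_c(state, next_seg, send):
    """Run steps (C.1)-(C.5) and return the number of segments sent.

    next_seg(state) is a hypothetical stand-in for NextSeg (): it returns
    (first, last, rule) with inclusive sequence numbers and the NextSeg ()
    rule that selected the segment, or None when there is no data to send.
    """
    sent = 0
    while state.cwnd - state.pipe >= SMSS:       # test in (C) and (C.5)
        seg = next_seg(state)                    # (C.1)
        if seg is None:
            break                                # failure: stop without sending
        first, last, rule = seg
        send(first, last)
        if first < state.high_data and rule != 4:
            state.high_rxt = last                # (C.2), skipped for rule (4)
        if last > state.high_data:
            state.high_data = last               # (C.3) previously unsent data
        state.pipe += last - first + 1           # (C.4)
        sent += 1
    return sent
```

Because the loop re-tests cwnd - pipe after every transmission, a single ACK that frees several SMSS of window yields the back-to-back burst described in the note above.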
5.1 Retransmission Timeouts 5.1. Retransmission Timeouts
In order to avoid memory deadlocks, the TCP receiver is allowed In order to avoid memory deadlocks, the TCP receiver is allowed to
to discard data that has already been selectively acknowledged. discard data that has already been selectively acknowledged. As a
As a result, [RFC2018] suggests that a TCP sender SHOULD expunge result, [RFC2018] suggests that a TCP sender SHOULD expunge the SACK
the SACK information gathered from a receiver upon a information gathered from a receiver upon a retransmission timeout
retransmission timeout "since the timeout might indicate that the (RTO) "since the timeout might indicate that the data receiver has
data receiver has reneged." Additionally, a TCP sender MUST reneged." Additionally, a TCP sender MUST "ignore prior SACK
"ignore prior SACK information in determining which data to information in determining which data to retransmit." However, since
retransmit." However, since the publication of [RFC2018] this the publication of [RFC2018], this has come to be viewed by some as
has come to be viewed by some as too strong. It has been too strong. It has been suggested that, as long as robust tests for
suggested that, as long as robust tests for reneging are present, reneging are present, an implementation can retain and use SACK
an implementation can retain and use SACK information across a information across a timeout event [Errata1610]. While this document
timeout event [Errata1610]. While this document does not change does not change the specification in [RFC2018], we note that
the specification in [RFC2018], we note that implementers should implementers should consult any updates to [RFC2018] on this subject.
consult any updates to [RFC2018] on this subject. Further, a Further, a SACK TCP sender SHOULD utilize all SACK information made
SACK TCP sender SHOULD utilize all SACK information made
available during the loss recovery following an RTO. available during the loss recovery following an RTO.
If an RTO occurs during loss recovery as specified in this document, If an RTO occurs during loss recovery as specified in this document,
RecoveryPoint MUST be set to HighData. Further, the new value of RecoveryPoint MUST be set to HighData. Further, the new value of
RecoveryPoint MUST be preserved and the loss recovery algorithm RecoveryPoint MUST be preserved and the loss recovery algorithm
outlined in this document MUST be terminated. In addition, a new outlined in this document MUST be terminated. In addition, a new
recovery phase (as described in section 5) MUST NOT be initiated recovery phase (as described in Section 5) MUST NOT be initiated
until HighACK is greater than or equal to the new value of until HighACK is greater than or equal to the new value of
RecoveryPoint. RecoveryPoint.
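The RTO rule above might be expressed as the following sketch. The `SenderState` container and its `in_recovery` flag are hypothetical bookkeeping introduced for illustration; the text only names HighACK, HighData, and RecoveryPoint. The sketch also assumes plain integer sequence numbers, ignoring 32-bit wraparound.

```python
from dataclasses import dataclass


@dataclass
class SenderState:
    high_ack: int        # HighACK: highest cumulatively ACKed sequence number
    high_data: int       # HighData: highest sequence number transmitted
    recovery_point: int  # RecoveryPoint
    in_recovery: bool    # hypothetical flag: loss recovery is in progress


def on_rto(state):
    """Apply the rule for an RTO that fires during loss recovery."""
    if state.in_recovery:
        state.recovery_point = state.high_data  # MUST be set to HighData
        state.in_recovery = False               # recovery MUST be terminated


def may_enter_recovery(state):
    """A new recovery phase waits until HighACK reaches RecoveryPoint."""
    return (not state.in_recovery) and state.high_ack >= state.recovery_point
```

Preserving the new RecoveryPoint after termination is what keeps the gate closed: any trigger that arrives while HighACK is still below it is ignored.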
As described in Sections 4 and 5, Update () SHOULD continue to be As described in Sections 4 and 5, Update () SHOULD continue to be
used appropriately upon receipt of ACKs. This will allow the used appropriately upon receipt of ACKs. This will allow the
recovery period after an RTO to benefit from all available recovery period after an RTO to benefit from all available
information provided by the receiver, even if SACK information information provided by the receiver, even if SACK information was
was expunged due to the RTO. expunged due to the RTO.
If there are segments missing from the receiver's buffer If there are segments missing from the receiver's buffer following
following processing of the retransmitted segment, the processing of the retransmitted segment, the corresponding ACK will
corresponding ACK will contain SACK information. In this case, a contain SACK information. In this case, a TCP sender SHOULD use this
TCP sender SHOULD use this SACK information when determining what SACK information when determining what data should be sent in each
data should be sent in each segment following an RTO. The exact segment following an RTO. The exact algorithm for this selection is
algorithm for this selection is not specified in this document not specified in this document (specifically NextSeg () is
(specifically NextSeg () is inappropriate during loss recovery inappropriate during loss recovery after an RTO). A relatively
after an RTO). A relatively straightforward approach to "filling straightforward approach to "filling in" the sequence space reported
in" the sequence space reported as missing should be a reasonable as missing should be a reasonable approach.
approach.
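Since the selection algorithm after an RTO is left unspecified, the following is only one possible reading of "filling in" the missing sequence space, not a method defined by this document. It assumes half-open [start, end) ranges, SACK blocks given as (left edge, right edge) pairs in the style of [RFC2018], and the hypothetical names `snd_una` (first unacknowledged octet) and `snd_max` (one past the highest octet sent).

```python
def missing_ranges(snd_una, snd_max, sack_blocks):
    """Return the [start, end) ranges in [snd_una, snd_max) not covered by SACK.

    sack_blocks is an iterable of (left, right) pairs, where right is the
    sequence number immediately following the last octet of the block.
    """
    holes = []
    cursor = snd_una
    for left, right in sorted(sack_blocks):
        if right <= cursor:
            continue                       # block lies entirely below the cursor
        if left > cursor:
            holes.append((cursor, min(left, snd_max)))
        cursor = max(cursor, right)
        if cursor >= snd_max:
            break
    if cursor < snd_max:
        holes.append((cursor, snd_max))    # unSACKed tail of the window
    return holes
```

A sender following this sketch would retransmit the returned ranges in order, lowest sequence number first, as the window allows.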
6 Managing the RTO Timer 6. Managing the RTO Timer
The standard TCP RTO estimator is defined in [RFC6298]. Due to the The standard TCP RTO estimator is defined in [RFC6298]. Due to the
fact that the SACK algorithm in this document can have an impact on fact that the SACK algorithm in this document can have an impact on
the behavior of the estimator, implementers may wish to consider how the behavior of the estimator, implementers may wish to consider how
the timer is managed. [RFC6298] calls for the RTO timer to be the timer is managed. [RFC6298] calls for the RTO timer to be
re-armed each time an ACK arrives that advances the cumulative ACK re-armed each time an ACK arrives that advances the cumulative ACK
point. Because the algorithm presented in this document can keep the point. Because the algorithm presented in this document can keep the
ACK clock going through a fairly significant loss event, ACK clock going through a fairly significant loss event
(comparatively longer than the algorithm described in [RFC5681]), on (comparatively longer than the algorithm described in [RFC5681]), on
some networks the loss event could last longer than the RTO. In this some networks the loss event could last longer than the RTO. In this
case the RTO timer would expire prematurely and a segment that need case the RTO timer would expire prematurely and a segment that need
not be retransmitted would be resent. not be retransmitted would be resent.
Therefore we give implementers the latitude to use the standard Therefore, we give implementers the latitude to use the standard
[RFC6298] style RTO management or, optionally, a more careful variant [RFC6298]-style RTO management or, optionally, a more careful variant
that re-arms the RTO timer on each retransmission that is sent during that re-arms the RTO timer on each retransmission that is sent during
recovery MAY be used. This provides a more conservative timer than recovery MAY be used. This provides a more conservative timer than
specified in [RFC6298], and so may not always be an attractive specified in [RFC6298], and so may not always be an attractive
alternative. However, in some cases it may prevent needless alternative. However, in some cases it may prevent needless
retransmissions, go-back-N transmission and further reduction of the retransmissions, go-back-N transmission, and further reduction of the
congestion window. congestion window.
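The difference between the two timer policies can be illustrated with a small sketch. The `RtoTimer` class, its method names, and the use of floating-point seconds are assumptions made for the example; stopping the timer when all data is acknowledged, RTO backoff, and RTT sampling are deliberately omitted.

```python
class RtoTimer:
    """Toy RTO timer contrasting the standard and the more careful variant."""

    def __init__(self, rto, rearm_on_retransmit=False):
        self.rto = rto                                  # current RTO, seconds
        self.rearm_on_retransmit = rearm_on_retransmit  # True: careful variant
        self.deadline = None

    def arm(self, now):
        self.deadline = now + self.rto

    def on_ack_advances(self, now):
        # Both variants re-arm when the cumulative ACK point advances.
        self.arm(now)

    def on_recovery_retransmit(self, now):
        # Only the optional variant re-arms on each recovery retransmission.
        if self.rearm_on_retransmit:
            self.arm(now)

    def expired(self, now):
        return self.deadline is not None and now >= self.deadline
```

With a 1-second RTO armed at time 0 and a recovery retransmission at 0.8 seconds, the standard timer has expired by 1.2 seconds, while the careful variant runs until 1.8 seconds.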
7 Research 7. Research
The algorithm specified in this document is analyzed in [FF96], which The algorithm specified in this document is analyzed in [FF96], which
shows that the above algorithm is effective in reducing transfer time shows that the above algorithm is effective in reducing transfer time
over standard TCP Reno [RFC5681] when multiple segments are dropped over standard TCP Reno [RFC5681] when multiple segments are dropped
from a window of data (especially as the number of drops increases). from a window of data (especially as the number of drops increases).
[AHKO97] shows that the algorithm defined in this document can [AHKO97] shows that the algorithm defined in this document can
greatly improve throughput in connections traversing satellite greatly improve throughput in connections traversing satellite
channels. channels.
8 Security Considerations 8. Security Considerations
The algorithm presented in this paper shares security considerations The algorithm presented in this paper shares security considerations
with [RFC5681]. A key difference is that an algorithm based on SACKs with [RFC5681]. A key difference is that an algorithm based on SACKs
is more robust against attackers forging duplicate ACKs to force the is more robust against attackers forging duplicate ACKs to force the
TCP sender to reduce cwnd. With SACKs, TCP senders have an TCP sender to reduce cwnd. With SACKs, TCP senders have an
additional check on whether or not a particular ACK is legitimate. additional check on whether or not a particular ACK is legitimate.
While not fool-proof, SACK does provide some amount of protection in While not fool-proof, SACK does provide some amount of protection in
this area. this area.
Similarly, [CPNI309] sketches a variant of a blind attack [RFC5961] Similarly, [CPNI309] sketches a variant of a blind attack [RFC5961]
whereby an attacker can spoof out-of-window data to a TCP endpoint, whereby an attacker can spoof out-of-window data to a TCP endpoint,
causing it to respond to the legitimate peer with a duplicate causing it to respond to the legitimate peer with a duplicate
cumulative ACK, per [RFC793]. Adding a SACK-based requirement to cumulative ACK, per [RFC793]. Adding a SACK-based requirement to
trigger loss recovery effectively mitigates this attack, as the trigger loss recovery effectively mitigates this attack, as the
duplicate ACKs caused by out-of-window segments will not contain SACK duplicate ACKs caused by out-of-window segments will not contain SACK
information indicating reception of previously un-SACKED in-window information indicating reception of previously un-SACKED in-window
data. data.
9 Changes Relative to RFC 3517 9. Changes Relative to RFC 3517
The state variable "DupAcks" has been added to the list of variables The state variable "DupAcks" has been added to the list of variables
maintained by this algorithm, and its usage specified. maintained by this algorithm, and its usage specified.
The function IsLost () has been modified to require that more than The function IsLost () has been modified to require that more than
(DupThresh - 1) * SMSS octets have been SACKed above a given sequence (DupThresh - 1) * SMSS octets have been SACKed above a given sequence
number as indication that it is lost, changed from at least number as indication that it is lost, which is changed from the
(DupThresh * SMSS). This retains the requirement that at least three minimum requirement of (DupThresh * SMSS) described in [RFC3517].
segments following the sequence number in question have been SACKed, This retains the requirement that at least three segments following
while improving detection in the event that the sender has the sequence number in question have been SACKed, while improving
outstanding segments which are smaller than SMSS. detection in the event that the sender has outstanding segments which
are smaller than SMSS.
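A short worked comparison shows the effect of the threshold change. It models only the octet-count test described in this paragraph, and assumes SMSS = 1000 octets and DupThresh = 3; the function names are illustrative.

```python
SMSS = 1000      # assumed sender maximum segment size, in octets
DUP_THRESH = 3   # assumed DupThresh


def octet_test_rfc3517(sacked_above):
    """Older test: at least DupThresh * SMSS octets SACKed above the sequence."""
    return sacked_above >= DUP_THRESH * SMSS


def octet_test_revised(sacked_above):
    """Revised test: more than (DupThresh - 1) * SMSS octets SACKed above it."""
    return sacked_above > (DUP_THRESH - 1) * SMSS


# Three SACKed segments above the hole, the last one smaller than SMSS.
sacked = 1000 + 1000 + 500
```

Here 2500 octets fail the older test (2500 < 3000) but pass the revised one (2500 > 2000). Two full-sized segments (2000 octets) still pass neither test, so at least three segments are required in both cases.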
The definition of a "duplicate acknowledgment" has been modified to The definition of a "duplicate acknowledgment" has been modified to
utilize the SACK information in detecting loss. Duplicate cumulative utilize the SACK information in detecting loss. Duplicate cumulative
acknowledgments can be caused by either loss or reordering in the acknowledgments can be caused by either loss or reordering in the
network. To disambiguate loss and reordering TCP's fast retransmit network. To disambiguate loss and reordering, TCP's fast retransmit
algorithm [RFC5681] waits until three duplicate ACKs arrive to algorithm [RFC5681] waits until three duplicate ACKs arrive to
trigger loss recovery. This notion was then the basis for the trigger loss recovery. This notion was then the basis for the
algorithm specified in [RFC3517]. However, with SACK information algorithm specified in [RFC3517]. However, with SACK information
there is no need to rely blindly on the cumulative acknowledgment there is no need to rely blindly on the cumulative acknowledgment
field. We can leverage the additional information present in the field. We can leverage the additional information present in the
SACK blocks to understand that three segments have arrived at the SACK blocks to understand that three segments lying above a gap in
receiver which lie above a gap in the sequence space, and can use the sequence space have arrived at the receiver, and can use this
that to trigger loss recovery. This notion was used in [RFC3517] understanding to trigger loss recovery. This notion was used in
during loss recovery, and the change in this document is that the [RFC3517] during loss recovery, and the change in this document is
notion is also used to enter a loss recovery phase. that the notion is also used to enter a loss recovery phase.
The state variable "RescueRxt" has been added to the list of The state variable "RescueRxt" has been added to the list of
variables maintained by the algorithm, and its usage specified. This variables maintained by the algorithm, and its usage specified. This
variable is used to allow for one extra retransmission per entry into variable is used to allow for one extra retransmission per entry into
loss recovery, in order to keep the ACK clock going under certain loss recovery, in order to keep the ACK clock going under certain
circumstances involving loss at the end of the window. This circumstances involving loss at the end of the window. This
mechanism allows for no more than one segment of no larger than 1 mechanism allows for no more than one segment of no larger than 1
SMSS to be optimistically retransmitted per loss recovery. SMSS to be optimistically retransmitted per loss recovery.
Rule (3) of NextSeg() has been changed from MAY to SHOULD, to Rule (3) of NextSeg() has been changed from MAY to SHOULD, to
appropriately reflect the opinion of the authors and working group appropriately reflect the opinion of the authors and working group
that it should be left in, rather than out, if an implementor does that it should be left in, rather than out, if an implementor does
not have a compelling reason to do otherwise. not have a compelling reason to do otherwise.
10 IANA Considerations 10. Acknowledgments
This document has no actions for IANA.
Acknowledgments
The authors wish to thank Sally Floyd for encouraging [RFC3517] The authors wish to thank Sally Floyd for encouraging [RFC3517] and
and commenting on early drafts. The algorithm described in this commenting on early drafts. The algorithm described in this document
document is loosely based on an algorithm outlined by Kevin Fall is loosely based on an algorithm outlined by Kevin Fall and Sally
and Sally Floyd in [FF96], although the authors of this document Floyd in [FF96], although the authors of this document assume
assume responsibility for any mistakes in the above text. responsibility for any mistakes in the above text.
[RFC3517] was co-authored by Kevin Fall, who provided crucial input [RFC3517] was co-authored by Kevin Fall, who provided crucial input
to that document and hence this follow-on work. to that document and hence this follow-on work.
Murali Bashyam, Ken Calvert, Tom Henderson, Reiner Ludwig, Murali Bashyam, Ken Calvert, Tom Henderson, Reiner Ludwig, Jamshid
Jamshid Mahdavi, Matt Mathis, Shawn Ostermann, Vern Paxson and Mahdavi, Matt Mathis, Shawn Ostermann, Vern Paxson, and Venkat
Venkat Venkatsubra provided valuable feedback on earlier versions Venkatsubra provided valuable feedback on earlier versions of this
of this document. document.
We thank Matt Mathis and Jamshid Mahdavi for implementing the We thank Matt Mathis and Jamshid Mahdavi for implementing the
scoreboard in ns and hence guiding our thinking in keeping track scoreboard in ns and hence guiding our thinking in keeping track of
of SACK state. SACK state.
The first author would like to thank Ohio University and the Ohio The first author would like to thank Ohio University and the Ohio
University Internetworking Research Group for supporting the bulk of University Internetworking Research Group for supporting the bulk of
his work on RFC 3517, from which this document is derived. his work on RFC 3517, from which this document is derived.
Normative References 11. References
[RFC793] Postel, J., "Transmission Control Protocol", STD 7, RFC 11.1. Normative References
793, September 1981.
[RFC2018] Mathis, M., Mahdavi, J., Floyd, S. and A. Romanow, "TCP [RFC793] Postel, J., "Transmission Control Protocol", STD 7, RFC
Selective Acknowledgment Options", RFC 2018, October 1996. 793, September 1981.
[RFC2026] Bradner, S., "The Internet Standards Process -- Revision [RFC2018] Mathis, M., Mahdavi, J., Floyd, S., and A. Romanow, "TCP
3", BCP 9, RFC 2026, October 1996. Selective Acknowledgment Options", RFC 2018, October 1996.
[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
Requirement Levels", BCP 14, RFC 2119, March 1997. Requirement Levels", BCP 14, RFC 2119, March 1997.
[RFC5681] Allman, M., Paxson, V. and E. Blanton, "TCP Congestion [RFC5681] Allman, M., Paxson, V., and E. Blanton, "TCP Congestion
Control", RFC 5681, September 2009. Control", RFC 5681, September 2009.
Informative References 11.2. Informative References
[AHKO97] Mark Allman, Chris Hayes, Hans Kruse, Shawn Ostermann, "TCP [AHKO97] Mark Allman, Chris Hayes, Hans Kruse, Shawn Ostermann,
Performance Over Satellite Links", Proceedings of the Fifth "TCP Performance Over Satellite Links", Proceedings of the
International Conference on Telecommunications Systems, Fifth International Conference on Telecommunications
Nashville, TN, March, 1997. Systems, Nashville, TN, March, 1997.
[All00] Mark Allman, "A Web Server's View of the Transport Layer", [All00] Mark Allman, "A Web Server's View of the Transport Layer",
ACM Computer Communication Review, 30(5), October 2000. ACM Computer Communication Review, 30(5), October 2000.
[CPNI309] Fernando Gont, "Security Assessment of the Transmission [CPNI309] Fernando Gont, "Security Assessment of the Transmission
Control Protocol (TCP)", CPNI Technical Note 3/2009, Control Protocol (TCP)", CPNI Technical Note 3/2009,
http://www.cpni.gov.uk/Docs/tn-03-09-security-assessment-TCP.pdf, <http://www.gont.com.ar/papers/
February 2009. tn-03-09-security-assessment-TCP.pdf>, February 2009.
[Errata1610] Matt Mathis, "RFC Errata Report 1610 for RFC 2018", [Errata1610]
http://www.rfc-editor.org/errata_search.php?eid=1610, RFC Errata, Errata ID 1610, RFC 2018,
Verified 2008-12-09. <http://www.rfc-editor.org>.
[FF96] Kevin Fall and Sally Floyd, "Simulation-based Comparisons [FF96] Kevin Fall and Sally Floyd, "Simulation-based Comparisons
of Tahoe, Reno and SACK TCP", Computer Communication of Tahoe, Reno and SACK TCP", Computer Communication
Review, July 1996. Review, July 1996.
[Jac90] Van Jacobson, "Modified TCP Congestion Avoidance [Jac90] Van Jacobson, "Modified TCP Congestion Avoidance
Algorithm", Technical Report, LBL, April 1990. Algorithm", Technical Report, LBL, April 1990.
[PF01] Jitendra Padhye, Sally Floyd "Identifying the TCP Behavior [PF01] Jitendra Padhye, Sally Floyd "Identifying the TCP Behavior
of Web Servers", ACM SIGCOMM, August 2001. of Web Servers", ACM SIGCOMM, August 2001.
[RFC3782] Floyd, S., Henderson, T., and A. Gurtov, "The NewReno [RFC6582] Henderson, T., Floyd, S., Gurtov, A., and Y. Nishida, "The
Modification to TCP's Fast Recovery Algorithm", RFC 3782, NewReno Modification to TCP's Fast Recovery Algorithm",
April 2004. RFC 6582, April 2012.
[RFC2914] Floyd, S., "Congestion Control Principles", BCP 41, RFC [RFC2914] Floyd, S., "Congestion Control Principles", BCP 41, RFC
2914, September 2000. 2914, September 2000.
[RFC6298] Paxson, V., Allman, M., Chu, J., and M. Sargent, "Computing [RFC6298] Paxson, V., Allman, M., Chu, J., and M. Sargent,
TCP's Retransmission Timer", RFC 6298, June 2011. "Computing TCP's Retransmission Timer", RFC 6298, June
2011.
[RFC3517] Blanton, E., Allman, M., Fall, K., and L. Wang, "A [RFC3517] Blanton, E., Allman, M., Fall, K., and L. Wang, "A
Conservative Selective Acknowledgment (SACK)-based Loss Conservative Selective Acknowledgment (SACK)-based Loss
Recovery Algorithm for TCP", RFC 3517, April 2003. Recovery Algorithm for TCP", RFC 3517, April 2003.
[RFC5961] Ramaiah, A., Stewart, R., and M. Dalal, "Improving TCP's [RFC5961] Ramaiah, A., Stewart, R., and M. Dalal, "Improving TCP's
Robustness to Blind In-Window Attacks", RFC 5961, August Robustness to Blind In-Window Attacks", RFC 5961, August
2010. 2010.
Authors' Addresses Authors' Addresses
Ethan Blanton Ethan Blanton
Purdue University Computer Sciences Purdue University Computer Sciences
305 N. University St. 305 N. University St.
West Lafayette, IN 47907 West Lafayette, IN 47907
United States
EMail: elb@psg.com EMail: elb@psg.com
Mark Allman Mark Allman
International Computer Science Institute International Computer Science Institute
1947 Center St. Suite 600 1947 Center St. Suite 600
Berkeley, CA 94704 Berkeley, CA 94704
United States
Phone: 440-235-1792
EMail: mallman@icir.org EMail: mallman@icir.org
http://www.icir.org/mallman http://www.icir.org/mallman
Lili Wang Lili Wang
Juniper Networks Juniper Networks
10 Technology Park Drive 10 Technology Park Drive
Westford, MA 01886 Westford, MA 01886
United States
EMail: liliw@juniper.net EMail: liliw@juniper.net
Ilpo Jarvinen Ilpo Jarvinen
University of Helsinki University of Helsinki
P.O. Box 68 P.O. Box 68
FI-00014 UNIVERSITY OF HELSINKI FI-00014 UNIVERSITY OF HELSINKI
Finland Finland
EMail: ilpo.jarvinen@helsinki.fi
Email: ilpo.jarvinen@helsinki.fi
Markku Kojo Markku Kojo
University of Helsinki University of Helsinki
P.O. Box 68 P.O. Box 68
FI-00014 UNIVERSITY OF HELSINKI FI-00014 UNIVERSITY OF HELSINKI
Finland Finland
EMail: kojo@cs.helsinki.fi
Email: kojo@cs.helsinki.fi
Yoshifumi Nishida Yoshifumi Nishida
WIDE Project WIDE Project
Endo 5322 Endo 5322
Fujisawa, Kanagawa 252-8520 Fujisawa, Kanagawa 252-8520
Japan Japan
EMail: nishida@wide.ad.jp
Email: nishida@wide.ad.jp
 End of changes. 99 change blocks. 
269 lines changed or deleted 262 lines changed or added
