draft-ietf-tcpm-rack-07.txt   draft-ietf-tcpm-rack-08.txt 
TCP Maintenance Working Group Y. Cheng TCP Maintenance Working Group Y. Cheng
Internet-Draft N. Cardwell Internet-Draft N. Cardwell
Intended status: Standards Track N. Dukkipati Intended status: Standards Track N. Dukkipati
Expires: July 20, 2020 P. Jha Expires: September 10, 2020 P. Jha
Google, Inc Google, Inc
Jan 17, 2020 March 9, 2020
RACK: a time-based fast loss detection algorithm for TCP RACK: a time-based fast loss detection algorithm for TCP
draft-ietf-tcpm-rack-07 draft-ietf-tcpm-rack-08
Abstract Abstract
This document presents a new TCP loss detection algorithm called RACK This document presents a new TCP loss detection algorithm called RACK
("Recent ACKnowledgment"). RACK uses the notion of time, instead of ("Recent ACKnowledgment"). RACK uses the notion of time, instead of
packet or sequence counts, to detect losses, for modern TCP packet or sequence counts, to detect losses, for modern TCP
implementations that can support per-packet timestamps and the implementations that can support per-packet timestamps and the
selective acknowledgment (SACK) option. It is intended to replace selective acknowledgment (SACK) option. It is intended to be an
the conventional DUPACK threshold approach and its variants, as well alternative to the DUPACK threshold approach [RFC6675], as well as
as other nonstandard approaches. other nonstandard approaches such as FACK [FACK].
Status of This Memo Status of This Memo
This Internet-Draft is submitted in full conformance with the This Internet-Draft is submitted in full conformance with the
provisions of BCP 78 and BCP 79. provisions of BCP 78 and BCP 79.
Internet-Drafts are working documents of the Internet Engineering Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF). Note that other groups may also distribute Task Force (IETF). Note that other groups may also distribute
working documents as Internet-Drafts. The list of current Internet- working documents as Internet-Drafts. The list of current Internet-
Drafts is at https://datatracker.ietf.org/drafts/current/. Drafts is at https://datatracker.ietf.org/drafts/current/.
Internet-Drafts are draft documents valid for a maximum of six months Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress." material or to cite them other than as "work in progress."
This Internet-Draft will expire on July 20, 2020. This Internet-Draft will expire on September 10, 2020.
Copyright Notice Copyright Notice
Copyright (c) 2020 IETF Trust and the persons identified as the Copyright (c) 2020 IETF Trust and the persons identified as the
document authors. All rights reserved. document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents Provisions Relating to IETF Documents
(https://trustee.ietf.org/license-info) in effect on the date of (https://trustee.ietf.org/license-info) in effect on the date of
publication of this document. Please review these documents publication of this document. Please review these documents
skipping to change at page 6, line 10 skipping to change at page 6, line 10
alternative solutions, such as MPTCP, for such networks. alternative solutions, such as MPTCP, for such networks.
To conclude, the RACK algorithm aims to adapt to small degrees of To conclude, the RACK algorithm aims to adapt to small degrees of
reordering, quickly recover most losses within one to two round reordering, quickly recover most losses within one to two round
trips, and avoid costly retransmission timeouts (RTOs). In the trips, and avoid costly retransmission timeouts (RTOs). In the
presence of reordering, the adaptation algorithm can impose presence of reordering, the adaptation algorithm can impose
sometimes-needless delays when it waits to disambiguate loss from sometimes-needless delays when it waits to disambiguate loss from
reordering, but the penalty for waiting is bounded to one round trip reordering, but the penalty for waiting is bounded to one round trip
and such delays are confined to longer-running flows. and such delays are confined to longer-running flows.
This document provides a concrete and detailed reordering window
adaptation algorithm for implementers. We note that the specifics of
the algorithm are likely to evolve over time. But that is a separate
engineering optimization that's out of scope for this document.
5. Requirements 5. Requirements
The reader is expected to be familiar with the definitions given in The reader is expected to be familiar with the definitions given in
the TCP congestion control [RFC5681] and selective acknowledgment the TCP congestion control [RFC5681] and selective acknowledgment
[RFC2018] RFCs. Familiarity with the conservative SACK-based [RFC2018] RFCs. Familiarity with the conservative SACK-based
recovery for TCP [RFC6675] is not expected but helps. recovery for TCP [RFC6675] is not expected but helps.
RACK has three requirements: RACK has three requirements:
1. The connection MUST use selective acknowledgment (SACK) options 1. The connection MUST use selective acknowledgment (SACK) options
skipping to change at page 8, line 4 skipping to change at page 7, line 42
"RACK.min_RTT" is the estimated minimum round-trip time (RTT) of the "RACK.min_RTT" is the estimated minimum round-trip time (RTT) of the
connection. connection.
"RACK.ack_ts" is the time when all the sequences in RACK.packet were "RACK.ack_ts" is the time when all the sequences in RACK.packet were
selectively or cumulatively acknowledged. selectively or cumulatively acknowledged.
"RACK.reo_wnd_incr" is the multiplier applied to adjust RACK.reo_wnd "RACK.reo_wnd_incr" is the multiplier applied to adjust RACK.reo_wnd
"RACK.reo_wnd_persist" is the number of loss recoveries before "RACK.reo_wnd_persist" is the number of loss recoveries before
resetting RACK.reo_wnd resetting RACK.reo_wnd
"RACK.dsack" indicates if a DSACK option has been received since last "RACK.dsack" indicates if a DSACK option has been received since last
RACK.reo_wnd change "RACK.pkts_sacked" returns the total number of RACK.reo_wnd change
packets selectively acknowledged in the SACK scoreboard.
"RACK.pkts_sacked" returns the total number of packets selectively
acknowledged in the SACK scoreboard.
"RACK.reord" indicates the connection has detected packet reordering "RACK.reord" indicates the connection has detected packet reordering
event(s) event(s)
"RACK.fack" is the highest selectively or cumulatively acknowledged "RACK.fack" is the highest selectively or cumulatively acknowledged
sequence sequence
Note that the Packet.xmit_ts variable is per packet in flight. The Note that the Packet.xmit_ts variable is per packet in flight. The
RACK.xmit_ts, RACK.end_seq, RACK.rtt, RACK.reo_wnd, and RACK.min_RTT RACK.xmit_ts, RACK.end_seq, RACK.rtt, RACK.reo_wnd, and RACK.min_RTT
variables are kept in the per-connection TCP control block. variables are kept in the per-connection TCP control block.
RACK.packet and RACK.ack_ts are used as local variables in the RACK.packet and RACK.ack_ts are used as local variables in the
algorithm. algorithm.
7. Algorithm Details 7. Algorithm Details
skipping to change at page 9, line 28 skipping to change at page 9, line 22
2. The packet was last retransmitted less than RACK.min_rtt ago. 2. The packet was last retransmitted less than RACK.min_rtt ago.
If the ACK is not ignored as invalid, update the RACK.rtt to be the If the ACK is not ignored as invalid, update the RACK.rtt to be the
RTT sample calculated using this ACK, and continue. If this ACK or RTT sample calculated using this ACK, and continue. If this ACK or
SACK was for the most recently sent packet, then record the SACK was for the most recently sent packet, then record the
RACK.xmit_ts timestamp and RACK.end_seq sequence implied by this ACK. RACK.xmit_ts timestamp and RACK.end_seq sequence implied by this ACK.
Otherwise exit here and omit the following steps. Otherwise exit here and omit the following steps.
Notice that the second condition above is a heuristic. This Notice that the second condition above is a heuristic. This
heuristic would fail to update RACK stats if the packet is spuriously heuristic would fail to update RACK stats if a data packet is
retransmitted because of a recent minimum RTT decrease (e.g. path spuriously retransmitted because of a recent minimum RTT decrease
change). Consequentially RACK may not detect losses from ACK events (e.g. path change). For example, in cases with a TCP connection
and the recovery resorts to the (slower) TLP or RTO timer-based without TCP timestamps, and where the first M packets in a flight of
events. However such events should be rare and the connection would data packets travel an old (longer) original path, and the remaining
pick up the new minimum RTT when the recovery ends to avoid repeated N packets in that flight travel a new (shorter) path and arrive out
similar failures. of order and elicit SACKs, then those SACKs for the N packets can
initiate a spurious retransmission of the first M packets. In such
scenarios, the sender would not be able to update its RACK.min_rtt
using the (ambiguous) RTT samples from retransmissions, so during
recovery all RTT samples may be less than RACK.min_rtt, and thus meet
the second condition. In such cases RACK may not detect losses from
ACK events and the recovery would then resort to the (slower) TLP or
RTO timer-based recovery. However, such events should be rare and
the connection would pick up the new minimum RTT when the recovery
ends, so the sender can avoid repeated similar failures.
Step 2 may be summarized in pseudocode as: Step 2 may be summarized in pseudocode as:
RACK_sent_after(t1, seq1, t2, seq2): RACK_sent_after(t1, seq1, t2, seq2):
If t1 > t2: If t1 > t2:
Return true Return true
Else if t1 == t2 AND seq1 > seq2: Else if t1 == t2 AND seq1 > seq2:
Return true Return true
Else: Else:
Return false Return false
skipping to change at page 11, line 26 skipping to change at page 11, line 26
Step 4: Update RACK reordering window Step 4: Update RACK reordering window
To handle the prevalent small degree of reordering, RACK.reo_wnd To handle the prevalent small degree of reordering, RACK.reo_wnd
serves as an allowance for settling time before marking a packet serves as an allowance for settling time before marking a packet
lost. This section documents a detailed algorithm following the lost. This section documents a detailed algorithm following the
design rationale section. RACK starts initially with a conservative design rationale section. RACK starts initially with a conservative
window of min_RTT/4. If no reordering has been observed, RACK uses window of min_RTT/4. If no reordering has been observed, RACK uses
RACK.reo_wnd of 0 during loss recovery, in order to retransmit RACK.reo_wnd of 0 during loss recovery, in order to retransmit
quickly, or when the number of DUPACKs exceeds the classic DUPACK quickly, or when the number of DUPACKs exceeds the classic DUPACK
threshold. The subtle difference of this approach and conventional threshold. The subtle difference between this approach and the
one [RFC5681][RFC6675] is dicussed later in the section of "RACK and conventional one [RFC5681][RFC6675] is discussed later in the section
TLP discussions". "RACK and TLP Discussion".
Further, RACK MAY use DSACK [RFC3708] to adapt the reordering window, Further, RACK MAY use DSACK [RFC3708] to adapt the reordering window,
to higher degrees of reordering, if DSACK is supported. Receiving an to higher degrees of reordering, if DSACK is supported. Receiving an
ACK with a DSACK indicates a spurious retransmission, which in turn ACK with a DSACK indicates a spurious retransmission, which in turn
suggests that the RACK reordering window, RACK.reo_wnd, is likely too suggests that the RACK reordering window, RACK.reo_wnd, is likely too
small. The sender MAY increase the RACK.reo_wnd window linearly for small. The sender MAY increase the RACK.reo_wnd window linearly for
every round trip in which the sender receives a DSACK, so that after every round trip in which the sender receives a DSACK, so that after
N distinct round trips in which a DSACK is received, the RACK.reo_wnd N distinct round trips in which a DSACK is received, the RACK.reo_wnd
becomes (N+1) * min_RTT / 4, with an upper-bound of SRTT. The becomes (N+1) * min_RTT / 4, with an upper-bound of SRTT. The
inflated RACK.reo_wnd would persist for 16 loss recoveries and after inflated RACK.reo_wnd would persist for 16 loss recoveries and after
which it resets to its starting value, min_RTT / 4. which it resets to its starting value, min_RTT / 4.
The following pseudocode implements above algorithm. Note that The following pseudocode implements the above algorithm. Note that
extensions that require additional TCP features (e.g. DSACK) would extensions that require additional TCP features (e.g. DSACK) would
work if the feature functions simply return false. work if the feature functions simply return false.
RACK_update_reo_wnd(): RACK_update_reo_wnd():
RACK.min_RTT = TCP_min_RTT() RACK.min_RTT = TCP_min_RTT()
If DSACK option is present: If DSACK option is present:
RACK.dsack = true RACK.dsack = true
If SND.UNA < RACK.rtt_seq: If SND.UNA < RACK.rtt_seq:
RACK.dsack = false /* React to DSACK once per round trip */ RACK.dsack = false /* React to DSACK once per round trip */
skipping to change at page 14, line 9 skipping to change at page 14, line 9
In turn, that is equivalent to stating that a RACK sender should In turn, that is equivalent to stating that a RACK sender should
declare a packet lost when: declare a packet lost when:
Packet.xmit_ts + RACK.rtt + RACK.reo_wnd - now <= 0 Packet.xmit_ts + RACK.rtt + RACK.reo_wnd - now <= 0
The following pseudocode implements the algorithm above. When an ACK The following pseudocode implements the algorithm above. When an ACK
is received or the RACK timer expires, call RACK_detect_loss(). The is received or the RACK timer expires, call RACK_detect_loss(). The
algorithm includes an additional optimization to break timestamp ties algorithm includes an additional optimization to break timestamp ties
by using the TCP sequence space. The optimization is particularly by using the TCP sequence space. The optimization is particularly
useful to detect losses in a timely manner with TCP Segmentation useful to detect losses in a timely manner with TCP Segmentation
Offload, where multiple packets in one TSO blob have identical Offload, where multiple packets in one TCP Segmentation Offload (TSO)
timestamps. It is also useful when the timestamp clock granularity blob have identical timestamps.
is close to or longer than the actual round trip time.
RACK_detect_loss(): RACK_detect_loss():
timeout = 0 timeout = 0
For each packet, Packet, not acknowledged yet: For each packet, Packet, not acknowledged yet:
If Packet.lost is TRUE AND Packet.retransmitted is FALSE: If Packet.lost is TRUE AND Packet.retransmitted is FALSE:
Continue /* Packet lost but not yet retransmitted */ Continue /* Packet lost but not yet retransmitted */
If RACK_sent_after(RACK.xmit_ts, RACK.end_seq, If RACK_sent_after(RACK.xmit_ts, RACK.end_seq,
Packet.xmit_ts, Packet.end_seq): Packet.xmit_ts, Packet.end_seq):
skipping to change at page 14, line 51 skipping to change at page 14, line 50
list when it is delivered or marked lost. We RECOMMEND such an list when it is delivered or marked lost. We RECOMMEND such an
optimization because it enables implementations to support high-BDP optimization because it enables implementations to support high-BDP
networks. This optimization is implemented in Linux and sees orders networks. This optimization is implemented in Linux and sees orders
of magnitude improvement in CPU usage on high-speed WAN networks. of magnitude improvement in CPU usage on high-speed WAN networks.
7.3. Tail Loss Probe: fast recovery for tail losses 7.3. Tail Loss Probe: fast recovery for tail losses
This section describes a supplemental algorithm, Tail Loss Probe This section describes a supplemental algorithm, Tail Loss Probe
(TLP), which leverages RACK to further reduce RTO recoveries. TLP (TLP), which leverages RACK to further reduce RTO recoveries. TLP
triggers fast recovery to quickly repair tail losses that can triggers fast recovery to quickly repair tail losses that can
otherwise be recovered by RTOs only. After an original data otherwise be recovered only via RTOs. After an original data
transmission, TLP sends a probe data segment within one to two RTTs. transmission, TLP sends a probe data segment within one to two RTTs.
The probe data segment can either be new, previously unsent data, or The probe data segment can either be new, previously unsent data, or
a retransmission of previously sent data just below SND.NXT. In a retransmission of previously sent data just below SND.NXT. In
either case the goal is to elicit more feedback from the receiver, in either case the goal is to elicit more feedback from the receiver, in
the form of an ACK (potentially with SACK blocks), to allow RACK to the form of an ACK (potentially with SACK blocks), to allow RACK to
trigger fast recovery instead of an RTO. trigger fast recovery instead of an RTO.
An RTO occurs when the first unacknowledged sequence number is not An RTO occurs when the first unacknowledged sequence number is not
acknowledged after a conservative period of time has elapsed acknowledged after a conservative period of time has elapsed
[RFC6298]. Common causes of RTOs include: [RFC6298]. Common causes of RTOs include:
skipping to change at page 15, line 42 skipping to change at page 15, line 42
[RFC5682] is designed to detect such spurious retransmission [RFC5682] is designed to detect such spurious retransmission
timeouts and at least partially undo the consequences of such timeouts and at least partially undo the consequences of such
events, but F-RTO cannot be used in many situations. events, but F-RTO cannot be used in many situations.
7.4. Tail Loss Probe: An Example 7.4. Tail Loss Probe: An Example
Following is an example of TLP. All events listed are at a TCP Following is an example of TLP. All events listed are at a TCP
sender. sender.
1. Sender transmits segments 1-10: 1, 2, 3, ..., 8, 9, 10. There is 1. Sender transmits segments 1-10: 1, 2, 3, ..., 8, 9, 10. There is
no more new data to transmit. A PTO is scheduled to fire in 2 no more new data to transmit. A TLP is scheduled to be sent 2
RTTs, after the transmission of the 10th segment. RTTs after the transmission of the 10th segment.
2. Sender receives acknowledgements (ACKs) for segments 1-5; 2. Sender receives acknowledgements (ACKs) for segments 1-5;
segments 6-10 are lost and no ACKs are received. The sender segments 6-10 are lost and no ACKs are received. The sender
reschedules its PTO timer relative to the last received ACK, reschedules its TLP at a time relative to the last received ACK,
which is the ACK for segment 5 in this case. The sender sets the which is the ACK for segment 5 in this case. The sender sets the
PTO interval using the calculation described in step (2) of the time for the TLP using the calculation described in step (2) of
algorithm. the algorithm.
3. When PTO fires, sender retransmits segment 10. 3. When the TLP timer fires, sender retransmits segment 10.
4. After an RTT, a SACK for packet 10 arrives. The ACK also carries 4. After an RTT, a SACK for packet 10 arrives. The ACK also carries
SACK holes for segments 6, 7, 8 and 9. This triggers RACK-based SACK holes for segments 6, 7, 8 and 9. This triggers RACK-based
loss recovery. loss recovery.
5. The connection enters fast recovery and retransmits the remaining 5. The connection enters fast recovery and retransmits the remaining
lost segments. lost segments.
7.5. Tail Loss Probe Algorithm Details 7.5. Tail Loss Probe Algorithm Details
We define the terminology used in specifying the TLP algorithm: We define the terminology used in specifying the TLP algorithm:
FlightSize: amount of outstanding data in the network, as defined in FlightSize: amount of outstanding data in the network, as defined in
[RFC5681]. [RFC5681].
RTO: The transport's retransmission timeout (RTO) is based on RTO: The transport's retransmission timeout (RTO) is based on
measured round-trip times (RTT) between the sender and receiver, as measured round-trip times (RTT) between the sender and receiver, as
specified in [RFC6298] for TCP. PTO: Probe timeout (PTO) is a timer specified in [RFC6298] for TCP. PTO: Probe timeout (PTO) is a timer
event indicating that an ACK is overdue. Its value is constrained to event indicating that an ACK is overdue and the sender should try to
be smaller than or equal to an RTO. transmit a TLP. Its value is constrained to be smaller than or equal
to an RTO.
SRTT: smoothed round-trip time, computed as specified in [RFC6298]. SRTT: smoothed round-trip time, computed as specified in [RFC6298].
TLPRxtOut: a boolean indicating whether there is an unacknowledged TLPRxtOut: a boolean indicating whether there is an unacknowledged
TLP retransmission. TLP retransmission.
TLPHighRxt: the value of SND.NXT at the time of sending a TLP TLPHighRxt: the value of SND.NXT at the time of sending a TLP
retransmission. retransmission.
WCDelAckT: maximum delayed ACK timer value.
The TLP algorithm has three phases, which we discuss in turn. The TLP algorithm has three phases, which we discuss in turn.
7.5.1. Phase 1: Scheduling a loss probe 7.5.1. Phase 1: Scheduling a loss probe
Step 1: Check conditions for scheduling a PTO. Step 1: Check conditions for scheduling a PTO.
A sender should check to see if it should schedule a PTO in the A sender should check to see if it should schedule a PTO in the
following situations: following situations:
1. After transmitting new data that was not itself a TLP probe 1. After transmitting new data that was not itself a TLP probe
2. Upon receiving an ACK that cumulatively acknowledges data 2. Upon receiving an ACK that cumulatively acknowledges data
3. Upon receiving a SACK that selectively acknowledges data that was
last sent before the segment with SEG.SEQ=SND.UNA was last
(re)transmitted
A sender should schedule a PTO only if all of the following A sender should schedule a PTO only if all of the following
conditions are met: conditions are met:
1. The connection supports SACK [RFC2018] 1. The connection supports SACK [RFC2018]
2. The connection has no SACKed sequences in the SACK scoreboard 2. The connection has no SACKed sequences in the SACK scoreboard
3. The connection is not in loss recovery 3. The connection is not in loss recovery
If a PTO can be scheduled according to these conditions, the sender If a PTO can be scheduled according to these conditions, the sender
should schedule a PTO. If there was a previously scheduled PTO or should schedule a PTO. If there was a previously scheduled PTO or
RTO pending, then that pending PTO or RTO should first be cancelled, RTO pending, then that pending PTO or RTO should first be cancelled,
and then the new PTO should be scheduled. and then the new PTO should be scheduled.
If a PTO cannot be scheduled according to these conditions, then the If a PTO cannot be scheduled according to these conditions, then the
skipping to change at page 17, line 30 skipping to change at page 17, line 27
Step 2: Select the duration of the PTO. Step 2: Select the duration of the PTO.
A sender SHOULD use the following logic to select the duration of a A sender SHOULD use the following logic to select the duration of a
PTO: PTO:
TLP_timeout(): TLP_timeout():
If SRTT is available: If SRTT is available:
PTO = 2 * SRTT PTO = 2 * SRTT
If FlightSize = 1: If FlightSize = 1:
PTO += WCDelAckT PTO += WCDelAckT
Else:
PTO += 2ms
Else: Else:
PTO = 1 sec PTO = 1 sec
If Now() + PTO > TCP_RTO_expire(): If Now() + PTO > TCP_RTO_expire():
PTO = TCP_RTO_expire() - Now() PTO = TCP_RTO_expire() - Now()
Aiming for a PTO value of 2*SRTT allows a sender to wait long enough Aiming for a PTO value of 2*SRTT allows a sender to wait long enough
to know that an ACK is overdue. Under normal circumstances, i.e. no to know that an ACK is overdue. Under normal circumstances, i.e. no
losses, an ACK typically arrives in one SRTT. But choosing PTO to be losses, an ACK typically arrives in one SRTT. But choosing PTO to be
exactly an SRTT is likely to generate spurious probes given that exactly an SRTT is likely to generate spurious probes given that
network delay variance and even end-system timings can easily push an network delay variance and even end-system timings can easily push an
ACK to be above an SRTT. We chose PTO to be the next integral ACK to be above an SRTT. We chose PTO to be the next integral
multiple of SRTT. multiple of SRTT.
Similarly, network delay variations and end-system processing
latencies and timer granularities can easily delay ACKs beyond
2*SRTT, so senders SHOULD add at least 2ms to a computed PTO value
(and MAY add more if the sending host OS timer granularity is more
coarse than 1ms).
WCDelAckT stands for worst case delayed ACK timer. When FlightSize WCDelAckT stands for worst case delayed ACK timer. When FlightSize
is 1, PTO is inflated by WCDelAckT time to compensate for a potential is 1, PTO is inflated by WCDelAckT time to compensate for a potential
long delayed ACK timer at the receiver. The RECOMMENDED value for long delayed ACK timer at the receiver. The RECOMMENDED value for
WCDelAckT is 200ms. WCDelAckT is 200ms.
Finally, if the time at which an RTO would fire (here denoted Finally, if the time at which an RTO would fire (here denoted
"TCP_RTO_expire") is sooner than the computed time for the PTO, then "TCP_RTO_expire") is sooner than the computed time for the PTO, then
a probe is scheduled to be sent at that earlier time. a probe is scheduled to be sent at that earlier time.
7.5.2. Phase 2: Sending a loss probe 7.5.2. Phase 2: Sending a loss probe
skipping to change at page 18, line 26 skipping to change at page 18, line 17
When the PTO fires, transmit a probe data segment: When the PTO fires, transmit a probe data segment:
TLP_send_probe(): TLP_send_probe():
If an unsent segment exists AND If an unsent segment exists AND
the receive window allows new data to be sent: the receive window allows new data to be sent:
Transmit the lowest-sequence unsent segment of up to SMSS Transmit the lowest-sequence unsent segment of up to SMSS
Increment FlightSize by the size of the newly-sent segment Increment FlightSize by the size of the newly-sent segment
Else if TLPRxtOut is not set: Else if TLPRxtOut is not set:
Retransmit the highest-sequence segment sent so far Retransmit the highest-sequence segment sent so far
TLPRxtOut = true
TLPHighRxt = SND.NXT
The cwnd remains unchanged The cwnd remains unchanged
When the loss probe is a retransmission, the sender uses the highest- When the loss probe is a retransmission, the sender uses the highest-
sequence segment sent so far. This is in order to deal with the sequence segment sent so far. This is in order to deal with the
retransmission ambiguity problem in TCP. Suppose a sender sends N retransmission ambiguity problem in TCP. Suppose a sender sends N
segments, and then retransmits the last segment (segment N) as a loss segments, and then retransmits the last segment (segment N) as a loss
probe, and then the sender receives a SACK for segment N. As long as probe, and then the sender receives a SACK for segment N. As long as
the sender waits for any required RACK reordering settling timer to the sender waits for any required RACK reordering settling timer to
then expire, it doesn't matter if that SACK was for the original then expire, it doesn't matter if that SACK was for the original
transmission of segment N or the TLP retransmission; in either case transmission of segment N or the TLP retransmission; in either case
skipping to change at page 19, line 23 skipping to change at page 19, line 17
If the only loss in an outstanding window of data was the last If the only loss in an outstanding window of data was the last
segment, then a TLP loss probe retransmission of that data segment segment, then a TLP loss probe retransmission of that data segment
might repair the loss. TLP recovery detection examines ACKs to might repair the loss. TLP recovery detection examines ACKs to
detect when the probe might have repaired a loss, and thus allows detect when the probe might have repaired a loss, and thus allows
congestion control to properly reduce the congestion window (cwnd) congestion control to properly reduce the congestion window (cwnd)
[RFC5681]. [RFC5681].
Consider a TLP retransmission episode where a sender retransmits a Consider a TLP retransmission episode where a sender retransmits a
tail packet in a flight. The TLP retransmission episode ends when tail packet in a flight. The TLP retransmission episode ends when
the sender receives an ACK with a SEG.ACK above the SND.NXT at the the sender receives an ACK with a SEG.ACK above the SND.NXT at the
time the episode started. During the TLP retransmission episode the time the episode started (i.e. TLPHighRxt). During the TLP
sender checks for a duplicate ACK or D-SACK indicating that both the retransmission episode the sender checks for a duplicate ACK or
original segment and TLP retransmission arrived at the receiver, D-SACK indicating that both the original segment and TLP
meaning there was no loss that needed repairing. If the TLP sender retransmission arrived at the receiver, meaning there was no loss
does not receive such an indication before the end of the TLP that needed repairing. If the TLP sender does not receive such an
retransmission episode, then it MUST estimate that either the indication before the end of the TLP retransmission episode, then it
original data segment or the TLP retransmission were lost, and MUST estimate that either the original data segment or the TLP
congestion control MUST react appropriately to that loss as it would retransmission were lost, and congestion control MUST react
any other loss. appropriately to that loss as it would any other loss.
Since a significant fraction of the hosts that support SACK do not Since a significant fraction of the hosts that support SACK do not
support duplicate selective acknowledgments (D-SACKs) [RFC2883] the support duplicate selective acknowledgments (D-SACKs) [RFC2883] the
TLP algorithm for detecting such lost segments relies only on basic TLP algorithm for detecting such lost segments relies only on basic
SACK support [RFC2018]. SACK support [RFC2018].
7.6.1. Initializing and resetting state 7.6.1. Initializing and resetting state
When a connection is created, or suffers a retransmission timeout, or When a connection is created, or suffers a retransmission timeout, or
enters fast recovery, it executes the following: enters fast recovery, it executes the following:
skipping to change at page 20, line 14 skipping to change at page 20, line 8
Note that this condition only restricts TLP loss probes that are Note that this condition only restricts TLP loss probes that are
retransmissions. There may be an arbitrary number of outstanding retransmissions. There may be an arbitrary number of outstanding
unacknowledged TLP loss probes that consist of new, previously-unsent unacknowledged TLP loss probes that consist of new, previously-unsent
data, since the retransmission timeout and fast recovery algorithms data, since the retransmission timeout and fast recovery algorithms
are sufficient to detect losses of such probe segments. are sufficient to detect losses of such probe segments.
Upon sending a TLP probe that is a retransmission, the sender sets Upon sending a TLP probe that is a retransmission, the sender sets
TLPRxtOut to true and TLPHighRxt to SND.NXT. TLPRxtOut to true and TLPHighRxt to SND.NXT.
Detecting recoveries accomplished by loss probes 7.6.3. Detecting recoveries accomplished by loss probes
Step 1: Track ACKs indicating receipt of original and retransmitted Step 1: Track ACKs indicating receipt of original and retransmitted
segments segments
A sender considers both the original segment and TLP probe A sender considers both the original segment and TLP probe
retransmission segment as acknowledged if either 1 or 2 are true: retransmission segment as acknowledged if either 1 or 2 are true:
1. This is a duplicate acknowledgment (as defined in [RFC5681], 1. This is a duplicate acknowledgment (as defined in [RFC5681],
section 2), and all of the following conditions are met: section 2), and all of the following conditions are met:
skipping to change at page 21, line 25 skipping to change at page 21, line 18
a congestion control response equivalent to fast recovery. a congestion control response equivalent to fast recovery.
More precisely, on each ACK the sender executes the following: More precisely, on each ACK the sender executes the following:
if (TLPRxtOut and SEG.ACK >= TLPHighRxt) { if (TLPRxtOut and SEG.ACK >= TLPHighRxt) {
TLPRxtOut = false TLPRxtOut = false
EnterRecovery() EnterRecovery()
ExitRecovery() ExitRecovery()
} }
8. RACK and TLP discussions 8. RACK and TLP Discussion
8.1. Advantages 8.1. Advantages
The biggest advantage of RACK is that every data packet, whether it The biggest advantage of RACK is that every data packet, whether it
is an original data transmission or a retransmission, can be used to is an original data transmission or a retransmission, can be used to
detect losses of the packets sent chronologically prior to it. detect losses of the packets sent chronologically prior to it.
Example: TAIL DROP. Consider a sender that transmits a window of Example: TAIL DROP. Consider a sender that transmits a window of
three data packets (P1, P2, P3), and P1 and P3 are lost. Suppose the three data packets (P1, P2, P3), and P1 and P3 are lost. Suppose the
transmission of each packet is at least RACK.reo_wnd (1 millisecond transmission of each packet is at least RACK.reo_wnd (1 millisecond
skipping to change at page 22, line 51 skipping to change at page 22, line 45
RACK requires the sender to record the transmission time of each RACK requires the sender to record the transmission time of each
packet sent at a clock granularity of one millisecond or finer. TCP packet sent at a clock granularity of one millisecond or finer. TCP
implementations that record this already for RTT estimation do not implementations that record this already for RTT estimation do not
require any new per-packet state. But implementations that are not require any new per-packet state. But implementations that are not
yet recording packet transmission times will need to add per-packet yet recording packet transmission times will need to add per-packet
internal state (commonly either 4 or 8 octets per packet or TSO blob) internal state (commonly either 4 or 8 octets per packet or TSO blob)
to track transmission times. In contrast, the conventional [RFC6675] to track transmission times. In contrast, the conventional [RFC6675]
loss detection approach does not require any per-packet state beyond loss detection approach does not require any per-packet state beyond
the SACK scoreboard. This is particularly useful on ultra-low RTT the SACK scoreboard. This is particularly useful on ultra-low RTT
networks where the RTT is far less than the sender TCP clock networks where the RTT is far less than the sender TCP clock
grainularity (e.g. inside data-centers). granularity (e.g. inside data-centers).
RACK can easily and optionally support the conventional approach in RACK can easily and optionally support the conventional approach in
[RFC6675][RFC5681] by resetting the reordering window to zero when [RFC6675][RFC5681] by resetting the reordering window to zero when
the threshold is met. Note that this approach differs slightly from the threshold is met. Note that this approach differs slightly from
[RFC6675] which considers a packet lost when at least #DupThresh [RFC6675] which considers a packet lost when at least DupThresh
higher-sequenc packets are SACKed. RACK's approach considers a higher-sequence packets are SACKed. RACK's approach considers a
packet lost when at least one higher sequence packet is SACKed and packet lost when at least one higher sequence packet is SACKed and
the total number of SACKed packets is at least DupThresh. For the total number of SACKed packets is at least DupThresh. For
example, suppose a connection sends 10 packets, and packets 3, 5, 7 example, suppose a connection sends 10 packets, and packets 3, 5, 7
are SACKed. [RFC6675] considers packets 1 and 2 lost. RACK are SACKed. [RFC6675] considers packets 1 and 2 lost. RACK
considers packets 1, 2, 4, 6 lost. considers packets 1, 2, 4, 6 lost.
8.3. Adjusting the reordering window 8.3. Adjusting the reordering window
When the sender detects packet reordering, RACK uses a reordering When the sender detects packet reordering, RACK uses a reordering
window of min_rtt / 4. It uses the minimum RTT to accommodate window of min_rtt / 4. It uses the minimum RTT to accommodate
skipping to change at page 24, line 5 skipping to change at page 23, line 47
threshold based on the current or previous flight sizes. RACK takes threshold based on the current or previous flight sizes. RACK takes
a different approach, by using only one ACK event and a reordering a different approach, by using only one ACK event and a reordering
window. RACK can be seen as an extended Early Retransmit [RFC5827] window. RACK can be seen as an extended Early Retransmit [RFC5827]
without a FlightSize limit but with an additional reordering window. without a FlightSize limit but with an additional reordering window.
[FACK] considers an original packet to be lost when its sequence [FACK] considers an original packet to be lost when its sequence
range is sufficiently far below the highest SACKed sequence. In some range is sufficiently far below the highest SACKed sequence. In some
sense RACK can be seen as a generalized form of FACK that operates in sense RACK can be seen as a generalized form of FACK that operates in
time space instead of sequence space, enabling it to better handle time space instead of sequence space, enabling it to better handle
reordering, application-limited traffic, and lost retransmissions. reordering, application-limited traffic, and lost retransmissions.
Nevertheless RACK is still an experimental algorithm. Since the Since the 3 duplicate ACK threshold for triggering fast recovery
oldest loss detection algorithm, the 3 duplicate ACK threshold [RFC5681] has been widely deployed and usually works well in the
[RFC5681], has been standardized and widely deployed. RACK can absence of reordering, RACK uses this signal to trigger fast recovery
easily and optionally support the conventional approach for if a connection has not observed reordering.
compatibility.
RACK is compatible with and does not interfere with the the standard RACK is compatible with and does not interfere with the standard RTO
RTO [RFC6298], RTO-restart [RFC7765], F-RTO [RFC5682] and Eifel [RFC6298], RTO-restart [RFC7765], F-RTO [RFC5682] and Eifel
algorithms [RFC3522]. This is because RACK only detects loss by algorithms [RFC3522]. This is because RACK only detects loss by
using ACK events. It neither changes the RTO timer calculation nor using ACK events. It neither changes the RTO timer calculation nor
detects spurious timeouts. detects spurious timeouts.
Furthermore, RACK naturally works well with Tail Loss Probe [TLP] Furthermore, RACK naturally works well with Tail Loss Probe [TLP]
because a tail loss probe solicits either an ACK or SACK, which can because a tail loss probe solicits either an ACK or SACK, which can
be used by RACK to detect more losses. RACK can be used to relax be used by RACK to detect more losses. RACK can be used to relax
TLP's requirement for using FACK and retransmitting the the highest- TLP's requirement for using FACK and retransmitting the the highest-
sequenced packet, because RACK is agnostic to packet sequence sequenced packet, because RACK is agnostic to packet sequence
numbers, and uses transmission time instead. Thus TLP could be numbers, and uses transmission time instead. Thus TLP could be
modified to retransmit the first unacknowledged packet, which could modified to retransmit the first unacknowledged packet, which could
improve application latency. improve application latency.
8.5. Interaction with congestion control 8.5. Interaction with congestion control
RACK intentionally decouples loss detection from congestion control. RACK intentionally decouples loss detection from congestion control.
RACK only detects losses; it does not modify the congestion control RACK only detects losses; it does not modify the congestion control
algorithm [RFC5681][RFC6937]. A packet marked lost by RACK SHOULD algorithm [RFC5681][RFC6937]. A packet marked lost by RACK SHOULD
NOT be retransmitted until congestion control deems this appropriate. NOT be retransmitted until congestion control deems this appropriate.
Specifically, Proportional Rate Reduction [RFC6937] SHOULD be used
when using RACK.
RACK is applicable for both fast recovery and recovery after a RACK is applicable for both fast recovery and recovery after a
retransmission timeout (RTO) in [RFC5681]. RACK applies equally to retransmission timeout (RTO) in [RFC5681]. RACK applies equally to
fast recovery and RTO recovery because RACK is purely based on the fast recovery and RTO recovery because RACK is purely based on the
transmission time order of packets. When a packet retransmitted by transmission time order of packets. When a packet retransmitted by
RTO is acknowledged, RACK will mark any unacked packet sent RTO is acknowledged, RACK will mark any unacked packet sent
sufficiently prior to the RTO as lost, because at least one RTT has sufficiently prior to the RTO as lost, because at least one RTT has
elapsed since these packets were sent. elapsed since these packets were sent.
RACK may detect losses faster or slower than the conventional RACK may detect losses faster or slower than the conventional
skipping to change at page 25, line 33 skipping to change at page 25, line 26
+ 2 + 4 + 3 = 10). The recovery latency would be 2*RTT + 4*RTT, with + 2 + 4 + 3 = 10). The recovery latency would be 2*RTT + 4*RTT, with
an ending cwnd set to the slow start threshold of 10 packets. an ending cwnd set to the slow start threshold of 10 packets.
In both cases, the sender after the recovery would be in congestion In both cases, the sender after the recovery would be in congestion
avoidance. The difference in recovery latency (RTO + 4*RTT vs 6*RTT) avoidance. The difference in recovery latency (RTO + 4*RTT vs 6*RTT)
can be significant if the RTT is much smaller than the minimum RTO (1 can be significant if the RTT is much smaller than the minimum RTO (1
second in RFC6298) or if the RTT is large. The former case is common second in RFC6298) or if the RTT is large. The former case is common
in local area networks, data-center networks, or content distribution in local area networks, data-center networks, or content distribution
networks with deep deployments. The latter case is more common in networks with deep deployments. The latter case is more common in
developing regions with highly congested and/or high-latency developing regions with highly congested and/or high-latency
networks. The ending congestion window after recovery also impacts networks.
subsequent data transfer.
8.6. TLP recovery detection with delayed ACKs 8.6. TLP recovery detection with delayed ACKs
Delayed ACKs complicate the detection of repairs done by TLP, since Delayed ACKs complicate the detection of repairs done by TLP, since
with a delayed ACK the sender receives one fewer ACK than would with a delayed ACK the sender receives one fewer ACK than would
normally be expected. To mitigate this complication, before sending normally be expected. To mitigate this complication, before sending
a TLP loss probe retransmission, the sender should attempt to wait a TLP loss probe retransmission, the sender should attempt to wait
long enough that the receiver has sent any delayed ACKs that it is long enough that the receiver has sent any delayed ACKs that it is
withholding. The sender algorithm described above features such a withholding. The sender algorithm described above features such a
delay, in the form of WCDelAckT. Furthermore, if the receiver delay, in the form of WCDelAckT. Furthermore, if the receiver
skipping to change at page 27, line 48 skipping to change at page 27, line 40
Note to RFC Editor: this section may be removed on publication as an Note to RFC Editor: this section may be removed on publication as an
RFC. RFC.
12. Acknowledgments 12. Acknowledgments
The authors thank Matt Mathis for his insights in FACK and Michael The authors thank Matt Mathis for his insights in FACK and Michael
Welzl for his per-packet timer idea that inspired this work. Eric Welzl for his per-packet timer idea that inspired this work. Eric
Dumazet, Randy Stewart, Van Jacobson, Ian Swett, Rick Jones, Jana Dumazet, Randy Stewart, Van Jacobson, Ian Swett, Rick Jones, Jana
Iyengar, Hiren Panchasara, Praveen Balasubramanian, Yoshifumi Iyengar, Hiren Panchasara, Praveen Balasubramanian, Yoshifumi
Nishida, Bob Briscoe, Felix Weinrank, Michael Tuexen, and Martin Duke Nishida, Bob Briscoe, Felix Weinrank, Michael Tuexen, Martin Duke,
contributed to the draft or the implementations in Linux, FreeBSD, and Ilpo Jarvinen contributed to the draft or the implementations in
Windows and QUIC. Linux, FreeBSD, Windows and QUIC.
13. References 13. References
13.1. Normative References 13.1. Normative References
[RFC2018] Mathis, M. and J. Mahdavi, "TCP Selective Acknowledgment [RFC2018] Mathis, M. and J. Mahdavi, "TCP Selective Acknowledgment
Options", RFC 2018, October 1996. Options", RFC 2018, October 1996.
[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
Requirement Levels", RFC 2119, March 1997. Requirement Levels", RFC 2119, March 1997.
skipping to change at page 29, line 19 skipping to change at page 29, line 12
Communication Review, Volume 26, Issue 4, Oct. 1996. , Communication Review, Volume 26, Issue 4, Oct. 1996. ,
1996. 1996.
[POLICER16] [POLICER16]
Flach, T., Papageorge, P., Terzis, A., Pedrosa, L., Cheng, Flach, T., Papageorge, P., Terzis, A., Pedrosa, L., Cheng,
Y., Karim, T., Katz-Bassett, E., and R. Govindan, "An Y., Karim, T., Katz-Bassett, E., and R. Govindan, "An
Analysis of Traffic Policing in the Web", ACM SIGCOMM , Analysis of Traffic Policing in the Web", ACM SIGCOMM ,
2016. 2016.
[QUIC-LR] Iyengar, J. and I. Swett, "QUIC Loss Recovery And [QUIC-LR] Iyengar, J. and I. Swett, "QUIC Loss Recovery And
Congestion Control", draft-tsvwg-quic-loss-recovery-01 Congestion Control", draft-ietf-quic-recovery-latest (work
(work in progress), June 2016. in progress), March 2020.
[RFC7765] Hurtig, P., Brunstrom, A., Petlund, A., and M. Welzl, "TCP [RFC7765] Hurtig, P., Brunstrom, A., Petlund, A., and M. Welzl, "TCP
and SCTP RTO Restart", February 2016. and SCTP RTO Restart", February 2016.
[SCWA99] Savage, S., Cardwell, N., Wetherall, D., and T. Anderson, [SCWA99] Savage, S., Cardwell, N., Wetherall, D., and T. Anderson,
"TCP Congestion Control With a Misbehaving Receiver", ACM "TCP Congestion Control With a Misbehaving Receiver", ACM
Computer Communication Review, 29(5) , 1999. Computer Communication Review, 29(5) , 1999.
[Sprout] Winstein, K., Sivaraman, A., and H. Balakrishnan, [Sprout] Winstein, K., Sivaraman, A., and H. Balakrishnan,
"Stochastic Forecasts Achieve High Throughput and Low "Stochastic Forecasts Achieve High Throughput and Low
 End of changes. 37 change blocks. 
81 lines changed or deleted 75 lines changed or added

This html diff was produced by rfcdiff 1.47. The latest version is available from http://tools.ietf.org/tools/rfcdiff/