draft-ietf-tcpm-1323bis-16.txt   draft-ietf-tcpm-1323bis-17.txt 
TCP Maintenance (TCPM) D. Borman TCP Maintenance (TCPM) D. Borman
Internet-Draft Quantum Corporation Internet-Draft Quantum Corporation
Intended status: Standards Track B. Braden Intended status: Standards Track B. Braden
Expires: May 16, 2014 University of Southern Expires: May 19, 2014 University of Southern
California California
V. Jacobson V. Jacobson
Google, Inc. Google, Inc.
R. Scheffenegger, Ed. R. Scheffenegger, Ed.
NetApp, Inc. NetApp, Inc.
November 12, 2013 November 15, 2013
TCP Extensions for High Performance TCP Extensions for High Performance
draft-ietf-tcpm-1323bis-16 draft-ietf-tcpm-1323bis-17
Abstract Abstract
This document specifies a set of TCP extensions to improve This document specifies a set of TCP extensions to improve
performance over paths with a large bandwidth * delay product and to performance over paths with a large bandwidth * delay product and to
provide reliable operation over very high-speed paths. It defines provide reliable operation over very high-speed paths. It defines
TCP options for scaled windows and timestamps. The timestamps can be the TCP Window Scale (WS) option and the TCP Timestamps (TS) option
used for two distinct mechanisms, PAWS (Protection Against Wrapped and their semantics. The Window Scale option is used to support
Sequences) and RTTM (Round Trip Time Measurement). larger receive windows, while the Timestamps option can be used for
at least two distinct mechanisms, PAWS (Protection Against Wrapped
Sequences) and RTTM (Round Trip Time Measurement), that are also
described herein.
This document obsoletes RFC 1323 and describes changes from it. This document obsoletes RFC1323 and describes changes from it.
Status of this Memo Status of this Memo
This Internet-Draft is submitted in full conformance with the This Internet-Draft is submitted in full conformance with the
provisions of BCP 78 and BCP 79. provisions of BCP 78 and BCP 79.
Internet-Drafts are working documents of the Internet Engineering Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF). Note that other groups may also distribute Task Force (IETF). Note that other groups may also distribute
working documents as Internet-Drafts. The list of current Internet- working documents as Internet-Drafts. The list of current Internet-
Drafts is at http://datatracker.ietf.org/drafts/current/. Drafts is at http://datatracker.ietf.org/drafts/current/.
Internet-Drafts are draft documents valid for a maximum of six months Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress." material or to cite them other than as "work in progress."
This Internet-Draft will expire on May 16, 2014. This Internet-Draft will expire on May 19, 2014.
Copyright Notice Copyright Notice
Copyright (c) 2013 IETF Trust and the persons identified as the Copyright (c) 2013 IETF Trust and the persons identified as the
document authors. All rights reserved. document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents Provisions Relating to IETF Documents
(http://trustee.ietf.org/license-info) in effect on the date of (http://trustee.ietf.org/license-info) in effect on the date of
publication of this document. Please review these documents publication of this document. Please review these documents
carefully, as they describe your rights and restrictions with respect carefully, as they describe your rights and restrictions with respect
to this document. Code Components extracted from this document must to this document. Code Components extracted from this document must
include Simplified BSD License text as described in Section 4.e of include Simplified BSD License text as described in Section 4.e of
skipping to change at page 3, line 12 skipping to change at page 3, line 12
the Trust Legal Provisions and are provided without warranty as the Trust Legal Provisions and are provided without warranty as
described in the Simplified BSD License. described in the Simplified BSD License.
Table of Contents Table of Contents
1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 4 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.1. TCP Performance . . . . . . . . . . . . . . . . . . . . . 4 1.1. TCP Performance . . . . . . . . . . . . . . . . . . . . . 4
1.2. TCP Reliability . . . . . . . . . . . . . . . . . . . . . 5 1.2. TCP Reliability . . . . . . . . . . . . . . . . . . . . . 5
1.3. Using TCP options . . . . . . . . . . . . . . . . . . . . 6 1.3. Using TCP options . . . . . . . . . . . . . . . . . . . . 6
1.4. Terminology . . . . . . . . . . . . . . . . . . . . . . . 7 1.4. Terminology . . . . . . . . . . . . . . . . . . . . . . . 7
2. TCP Window Scale Option . . . . . . . . . . . . . . . . . . . 8 2. TCP Window Scale option . . . . . . . . . . . . . . . . . . . 8
2.1. Introduction . . . . . . . . . . . . . . . . . . . . . . . 8 2.1. Introduction . . . . . . . . . . . . . . . . . . . . . . . 8
2.2. Window Scale Option . . . . . . . . . . . . . . . . . . . 8 2.2. Window Scale option . . . . . . . . . . . . . . . . . . . 8
2.3. Using the Window Scale Option . . . . . . . . . . . . . . 9 2.3. Using the Window Scale option . . . . . . . . . . . . . . 9
2.4. Addressing Window Retraction . . . . . . . . . . . . . . . 10 2.4. Addressing Window Retraction . . . . . . . . . . . . . . . 10
3. TCP Timestamps option . . . . . . . . . . . . . . . . . . . . 12 3. TCP Timestamps option . . . . . . . . . . . . . . . . . . . . 12
3.1. Introduction . . . . . . . . . . . . . . . . . . . . . . . 12 3.1. Introduction . . . . . . . . . . . . . . . . . . . . . . . 12
3.2. Timestamps option . . . . . . . . . . . . . . . . . . . . 12 3.2. Timestamps option . . . . . . . . . . . . . . . . . . . . 12
4. The RTTM Mechanism . . . . . . . . . . . . . . . . . . . . . . 15 4. The RTTM Mechanism . . . . . . . . . . . . . . . . . . . . . . 15
4.1. Introduction . . . . . . . . . . . . . . . . . . . . . . . 15 4.1. Introduction . . . . . . . . . . . . . . . . . . . . . . . 15
4.2. Updating the RTO value . . . . . . . . . . . . . . . . . . 16 4.2. Updating the RTO value . . . . . . . . . . . . . . . . . . 16
4.3. Which Timestamp to Echo . . . . . . . . . . . . . . . . . 16 4.3. Which Timestamp to Echo . . . . . . . . . . . . . . . . . 16
5. PAWS - Protection Against Wrapped Sequence Numbers . . . . . . 20 5. PAWS - Protection Against Wrapped Sequence Numbers . . . . . . 20
5.1. Introduction . . . . . . . . . . . . . . . . . . . . . . . 20 5.1. Introduction . . . . . . . . . . . . . . . . . . . . . . . 20
skipping to change at page 5, line 23 skipping to change at page 5, line 23
throughput. throughput.
To generalize the Fast Retransmit / Fast Recovery mechanism to To generalize the Fast Retransmit / Fast Recovery mechanism to
handle multiple packets dropped per window, Selective handle multiple packets dropped per window, Selective
Acknowledgments are required. Unlike the normal cumulative Acknowledgments are required. Unlike the normal cumulative
acknowledgments of TCP, Selective Acknowledgments give the acknowledgments of TCP, Selective Acknowledgments give the
sender a complete picture of which segments are queued at the sender a complete picture of which segments are queued at the
receiver and which have not yet arrived. receiver and which have not yet arrived.
Selective acknowledgements and their use are specified in Selective acknowledgements and their use are specified in
separate documents, "TCP Selective Acknowledgment Options" separate documents, "TCP Selective Acknowledgment options"
[RFC2018], "An Extension to the Selective Acknowledgement (SACK) [RFC2018], "An Extension to the Selective Acknowledgement (SACK)
Option for TCP" [RFC2883], and "A Conservative Selective option for TCP" [RFC2883], and "A Conservative Selective
Acknowledgment (SACK)-based Loss Recovery Algorithm for TCP" Acknowledgment (SACK)-based Loss Recovery Algorithm for TCP"
[RFC6675], and not further discussed in this document. [RFC6675], and not further discussed in this document.
1.2. TCP Reliability 1.2. TCP Reliability
An especially serious kind of error may result from an accidental An especially serious kind of error may result from an accidental
reuse of TCP sequence numbers in data segments. TCP reliability reuse of TCP sequence numbers in data segments. TCP reliability
depends upon the existence of a bound on the lifetime of a segment: depends upon the existence of a bound on the lifetime of a segment:
the "Maximum Segment Lifetime" or MSL. the "Maximum Segment Lifetime" or MSL.
skipping to change at page 8, line 5 skipping to change at page 8, line 5
1.4. Terminology 1.4. Terminology
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in [RFC2119]. document are to be interpreted as described in [RFC2119].
In this document, these words will appear with that interpretation In this document, these words will appear with that interpretation
only when in UPPER CASE. Lower case uses of these words are not to only when in UPPER CASE. Lower case uses of these words are not to
be interpreted as carrying [RFC2119] significance. be interpreted as carrying [RFC2119] significance.
2. TCP Window Scale Option 2. TCP Window Scale option
2.1. Introduction 2.1. Introduction
The window scale extension expands the definition of the TCP window The window scale extension expands the definition of the TCP window
to 30 bits and then uses an implicit scale factor to carry this 30- to 30 bits and then uses an implicit scale factor to carry this 30-
bit value in the 16-bit Window field of the TCP header (SEG.WND in bit value in the 16-bit Window field of the TCP header (SEG.WND in
[RFC0793]). The exponent of the scale factor is carried in a TCP [RFC0793]). The exponent of the scale factor is carried in a TCP
option, Window Scale. This option is sent only in a <SYN> segment (a option, Window Scale. This option is sent only in a <SYN> segment (a
segment with the SYN bit on), hence the window scale is fixed in each segment with the SYN bit on), hence the window scale is fixed in each
direction when a connection is opened. direction when a connection is opened.
The maximum receive window, and therefore the scale factor, is The maximum receive window, and therefore the scale factor, is
determined by the maximum receive buffer space. In a typical modern determined by the maximum receive buffer space. In a typical modern
implementation, this maximum buffer space is set by default but can implementation, this maximum buffer space is set by default but can
be overridden by a user program before a TCP connection is opened. be overridden by a user program before a TCP connection is opened.
This determines the scale factor, and therefore no new user interface This determines the scale factor, and therefore no new user interface
is needed for window scaling. is needed for window scaling.
2.2. Window Scale Option 2.2. Window Scale option
The three-byte Window Scale option MAY be sent in a <SYN> segment by The three-byte Window Scale option MAY be sent in a <SYN> segment by
a TCP. It has two purposes: (1) indicate that the TCP is prepared to a TCP. It has two purposes: (1) indicate that the TCP is prepared to
both send and receive window scaling, and (2) communicate the both send and receive window scaling, and (2) communicate the
exponent of a scale factor to be applied to its receive window. exponent of a scale factor to be applied to its receive window.
Thus, a TCP that is prepared to scale windows SHOULD send the option, Thus, a TCP that is prepared to scale windows SHOULD send the option,
even if its own scale factor is 1 and the exponent 0. The scale even if its own scale factor is 1 and the exponent 0. The scale
factor is limited to a power of two and encoded logarithmically, so factor is limited to a power of two and encoded logarithmically, so
it may be implemented by binary shift operations. The maximum scale it may be implemented by binary shift operations. The maximum scale
exponent is limited to 14 for a maximum permissible receive window exponent is limited to 14 for a maximum permissible receive window
size of 1 GiB (2^(14+16)). size of 1 GiB (2^(14+16)).
TCP Window Scale Option (WSopt): TCP Window Scale option (WSopt):
Kind: 3 Kind: 3
Length: 3 bytes Length: 3 bytes
+---------+---------+---------+ +---------+---------+---------+
| Kind=3 |Length=3 |shift.cnt| | Kind=3 |Length=3 |shift.cnt|
+---------+---------+---------+ +---------+---------+---------+
1 1 1 1 1 1
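
The three-byte layout above can be illustrated with a minimal Python sketch. The function names and the constant names are hypothetical and not part of either draft. Clamping a received shift.cnt above 14 to 14 is an assumption of this sketch, based on the stated maximum exponent.

```python
import struct

WSOPT_KIND = 3      # Kind value from the option layout
WSOPT_LENGTH = 3    # total option length in bytes
MAX_SHIFT_CNT = 14  # maximum scale exponent stated in Section 2.2


def encode_wsopt(shift_cnt: int) -> bytes:
    """Build the three-byte Window Scale option (hypothetical helper)."""
    if not 0 <= shift_cnt <= MAX_SHIFT_CNT:
        raise ValueError("shift.cnt must be in the range 0..14")
    return struct.pack("!BBB", WSOPT_KIND, WSOPT_LENGTH, shift_cnt)


def decode_wsopt(option: bytes) -> int:
    """Return shift.cnt from a received option (hypothetical helper)."""
    kind, length, shift_cnt = struct.unpack("!BBB", option)
    if kind != WSOPT_KIND or length != WSOPT_LENGTH:
        raise ValueError("not a Window Scale option")
    # Assumption of this sketch: values above 14 are clamped to 14.
    return min(shift_cnt, MAX_SHIFT_CNT)
```

With the largest exponent, the largest 16-bit window value shifted left by 14 stays below 2^30, which matches the 1 GiB limit given in the text.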
skipping to change at page 9, line 16 skipping to change at page 9, line 16
This option MAY be sent in an initial <SYN> segment (i.e., a segment This option MAY be sent in an initial <SYN> segment (i.e., a segment
with the SYN bit on and the ACK bit off). It MAY also be sent in a with the SYN bit on and the ACK bit off). It MAY also be sent in a
<SYN,ACK> segment, but only if a Window Scale option was received in <SYN,ACK> segment, but only if a Window Scale option was received in
the initial <SYN> segment. A Window Scale option in a segment the initial <SYN> segment. A Window Scale option in a segment
without a SYN bit MUST be ignored. without a SYN bit MUST be ignored.
The window field in a segment where the SYN bit is set (i.e., a <SYN> The window field in a segment where the SYN bit is set (i.e., a <SYN>
or <SYN,ACK>) MUST NOT be scaled. or <SYN,ACK>) MUST NOT be scaled.
2.3. Using the Window Scale Option 2.3. Using the Window Scale option
A model implementation of window scaling is as follows, using the A model implementation of window scaling is as follows, using the
notation of [RFC0793]: notation of [RFC0793]:
o The connection state MUST be augmented by two window shift o The connection state is augmented by two window shift counters,
counters, Snd.Wind.Shift and Rcv.Wind.Shift, to be applied to the Snd.Wind.Shift and Rcv.Wind.Shift, to be applied to the incoming
incoming and outgoing window fields, respectively. and outgoing window fields, respectively.
o If a TCP receives a <SYN> segment containing a Window Scale o If a TCP receives a <SYN> segment containing a Window Scale
option, it SHOULD send its own Window Scale option in the option, it SHOULD send its own Window Scale option in the
<SYN,ACK> segment. <SYN,ACK> segment.
o The Window Scale option MUST be sent with shift.cnt = R, where R o The Window Scale option MUST be sent with shift.cnt = R, where R
is the value that the TCP would like to use for its receive is the value that the TCP would like to use for its receive
window. window.
o Upon receiving a <SYN> segment with a Window Scale option o Upon receiving a <SYN> segment with a Window Scale option
skipping to change at page 12, line 16 skipping to change at page 12, line 16
3.1. Introduction 3.1. Introduction
The Timestamps option is introduced to address some of the issues The Timestamps option is introduced to address some of the issues
mentioned in Section 1.1 and Section 1.2. The Timestamps option is mentioned in Section 1.1 and Section 1.2. The Timestamps option is
specified in a symmetrical manner, so that TSval timestamps are specified in a symmetrical manner, so that TSval timestamps are
carried in both data and <ACK> segments and are echoed in TSecr carried in both data and <ACK> segments and are echoed in TSecr
fields carried in returning <ACK> or data segments. Originally used fields carried in returning <ACK> or data segments. Originally used
primarily for timestamping individual segments, the properties of the primarily for timestamping individual segments, the properties of the
Timestamps option allow not only the use for taking time measurements Timestamps option allow not only the use for taking time measurements
(Section 4), but additional uses as well (xref target="sec4"/>). (Section 4), but additional uses as well (Section 5).
It is necessary to remember that there is a distinction between the It is necessary to remember that there is a distinction between the
Timestamps option conveying timestamp information, and the use of Timestamps option conveying timestamp information, and the use of
that information. In particular, the Round Trip Time Measurement that information. In particular, the Round Trip Time Measurement
(RTTM) mechanism must be viewed independently from updating the (RTTM) mechanism must be viewed independently from updating the
Retransmission Timeout (RTO) (see Section 4.2). In this case, the Retransmission Timeout (RTO) (see Section 4.2). In this case, the
sample granularity also needs to be taken into account. Other sample granularity also needs to be taken into account. Other
mechanisms, such as PAWS, or Eifel, are not built upon the timestamp mechanisms, such as PAWS, or Eifel, are not built upon the timestamp
information itself, but are based on the intrinsic property of information itself, but are based on the intrinsic property of
monotonically increasing values. monotonically non-decreasing values.
The Timestamps option is important when large receive windows are The Timestamps option is important when large receive windows are
used, to allow the use of the PAWS mechanism (see Section 5). used, to allow the use of the PAWS mechanism (see Section 5).
Furthermore, the option may be useful for all TCP's, since it Furthermore, the option may be useful for all TCP's, since it
simplifies the sender and allows the use of additional optimizations simplifies the sender and allows the use of additional optimizations
such as Eifel ([RFC3522], [RFC4015]) and others ([RFC6817], such as Eifel ([RFC3522], [RFC4015]) and others ([RFC6817],
[Kuzmanovic03], [Kuehlewind10]). [Kuzmanovic03], [Kuehlewind10]).
3.2. Timestamps option 3.2. Timestamps option
skipping to change at page 13, line 20 skipping to change at page 13, line 20
+-------+-------+---------------------+---------------------+ +-------+-------+---------------------+---------------------+
|Kind=8 | 10 | TS Value (TSval) |TS Echo Reply (TSecr)| |Kind=8 | 10 | TS Value (TSval) |TS Echo Reply (TSecr)|
+-------+-------+---------------------+---------------------+ +-------+-------+---------------------+---------------------+
1 1 4 4 1 1 4 4
The Timestamps option carries two four-byte timestamp fields. The The Timestamps option carries two four-byte timestamp fields. The
Timestamp Value field (TSval) contains the current value of the Timestamp Value field (TSval) contains the current value of the
timestamp clock of the TCP sending the option. timestamp clock of the TCP sending the option.
The Timestamp Echo Reply field (TSecr) is valid if the ACK bit is set The Timestamp Echo Reply (TSecr) field is valid if the ACK bit is set
in the TCP header; if it is valid, it echoes a timestamp value that in the TCP header. If the ACK bit is not set in the outgoing TCP
was sent by the remote TCP in the TSval field of a Timestamps option. header, the sender of that segment SHOULD set the TSecr field to
When TSecr is not valid, its value MUST be zero. However, a value of zero. When the ACK bit is set in an outgoing segment, the sender
zero does not imply TSecr being invalid. The TSecr value will MUST echo a recently received Timestamp Value (TSval) sent by the
generally be from the most recent Timestamps option that was remote TCP in the TSval field of a Timestamps option. The exact
received; however, there are exceptions that are explained below. rules on which TSval MUST be echoed are given in Section 4.3. When
the ACK bit is not set, the receiver MUST ignore the value of the
TSecr field.
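
The TSecr rules in the newer text (right-hand column) can be sketched as follows. This is an illustration only: the function names are hypothetical, and the value to echo is passed in as ts_recent without modelling the selection rules of Section 4.3.

```python
import struct
from typing import Optional, Tuple

TSOPT_KIND = 8     # Kind value from the option layout
TSOPT_LENGTH = 10  # 1 + 1 + 4 + 4 bytes


def encode_tsopt(tsval: int, ts_recent: int, ack_set: bool) -> bytes:
    """Build a Timestamps option for an outgoing segment (hypothetical helper)."""
    # Without the ACK bit the sender SHOULD set TSecr to zero.
    tsecr = ts_recent if ack_set else 0
    return struct.pack("!BBII", TSOPT_KIND, TSOPT_LENGTH,
                       tsval & 0xFFFFFFFF, tsecr & 0xFFFFFFFF)


def decode_tsopt(option: bytes, ack_set: bool) -> Tuple[int, Optional[int]]:
    """Return (TSval, TSecr); TSecr is None when it must be ignored."""
    kind, length, tsval, tsecr = struct.unpack("!BBII", option)
    if kind != TSOPT_KIND or length != TSOPT_LENGTH:
        raise ValueError("not a Timestamps option")
    # Without the ACK bit the receiver MUST ignore TSecr.
    return tsval, (tsecr if ack_set else None)
```

Returning None for an ignored TSecr is a design choice of this sketch, so that a caller cannot mistake a zero field for a valid echo.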
A TCP MAY send the Timestamps option (TSopt) in an initial <SYN> A TCP MAY send the Timestamps option (TSopt) in an initial <SYN>
segment (i.e., segment containing a SYN bit and no ACK bit), and MAY segment (i.e., segment containing a SYN bit and no ACK bit), and MAY
send a TSopt in <SYN,ACK> only if it received a TSopt in the initial send a TSopt in <SYN,ACK> only if it received a TSopt in the initial
<SYN> segment for the connection. <SYN> segment for the connection.
Once TSopt has been successfully negotiated, that is both <SYN>, and Once TSopt has been successfully negotiated, that is both <SYN>, and
<SYN,ACK> contain TSopt, the TSopt MUST be sent in every non-<RST> <SYN,ACK> contain TSopt, the TSopt MUST be sent in every non-<RST>
segment for the duration of the connection, and SHOULD be sent in an segment for the duration of the connection, and SHOULD be sent in an
<RST> segment (see Section 5.2 for details). The TCP SHOULD remember <RST> segment (see Section 5.2 for details). The TCP SHOULD remember
skipping to change at page 38, line 7 skipping to change at page 38, line 7
numbers on every connection. Using timestamps instead, it is numbers on every connection. Using timestamps instead, it is
only necessary to keep one quantity per remote host, regardless only necessary to keep one quantity per remote host, regardless
of the number of simultaneous connections to that host. of the number of simultaneous connections to that host.
Appendix C. Summary of Notation Appendix C. Summary of Notation
The following notation has been used in this document. The following notation has been used in this document.
Options Options
WSopt: TCP Window Scale Option WSopt: TCP Window Scale option
TSopt: TCP Timestamps option TSopt: TCP Timestamps option
Option Fields Option Fields
shift.cnt: Window scale byte in WSopt shift.cnt: Window scale byte in WSopt
TSval: 32-bit Timestamp Value field in TSopt TSval: 32-bit Timestamp Value field in TSopt
TSecr: 32-bit Timestamp Reply field in TSopt TSecr: 32-bit Timestamp Reply field in TSopt
Option Fields in Current Segment Option Fields in Current Segment
skipping to change at page 42, line 44 skipping to change at page 42, line 44
contents straddle the boundary between old and new, only the contents straddle the boundary between old and new, only the
new parts should be processed. new parts should be processed.
Rescale the received window field: Rescale the received window field:
TrueWindow = SEG.WND << Snd.Wind.Shift, TrueWindow = SEG.WND << Snd.Wind.Shift,
and use "TrueWindow" in place of SEG.WND in the following and use "TrueWindow" in place of SEG.WND in the following
steps. steps.
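
The rescaling step can be written as a short Python sketch. The function name and the syn parameter are hypothetical. The unscaled path for SYN segments follows the rule in Section 2.2 that the window field in a segment with the SYN bit set is not scaled.

```python
def true_window(seg_wnd: int, snd_wind_shift: int, syn: bool = False) -> int:
    """Return the window to use in place of SEG.WND (hypothetical helper)."""
    if not 0 <= seg_wnd <= 0xFFFF:
        raise ValueError("SEG.WND is a 16-bit field")
    if syn:
        # The window field of a <SYN> or <SYN,ACK> is never scaled.
        return seg_wnd
    return seg_wnd << snd_wind_shift
```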
Check whether the segment contains a Timestamp Option and Check whether the segment contains a Timestamps option and
bit Snd.TS.OK is on. If so: bit Snd.TS.OK is on. If so:
If SEG.TSval < TS.Recent and the RST bit is off, then If SEG.TSval < TS.Recent and the RST bit is off, then
test whether connection has been idle less than 24 days; test whether connection has been idle less than 24 days;
if all are true, then the segment is not acceptable; if all are true, then the segment is not acceptable;
follow steps below for an unacceptable segment. follow steps below for an unacceptable segment.
If SEG.SEQ is less than or equal to Last.ACK.sent, then If SEG.SEQ is less than or equal to Last.ACK.sent, then
save SEG.TSval in variable TS.Recent. save SEG.TSval in variable TS.Recent.
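
The two timestamp checks above can be sketched in Python as shown here. All function names and the idle_seconds parameter are hypothetical. The comparisons are done modulo 2^32, which is an assumption drawn from the PAWS section of the full specification and is not shown in this excerpt.

```python
MOD32 = 0xFFFFFFFF
HALF_SPACE = 0x80000000
IDLE_LIMIT_SECONDS = 24 * 24 * 60 * 60  # 24 days, as in the text above


def ts_less_than(s: int, t: int) -> bool:
    """True if timestamp s is older than t in 32-bit modular arithmetic."""
    return 0 < ((t - s) & MOD32) < HALF_SPACE


def seq_leq(a: int, b: int) -> bool:
    """True if sequence number a is at or before b, modulo 2^32."""
    return ((b - a) & MOD32) < HALF_SPACE


def paws_rejects(seg_tsval: int, ts_recent: int, rst: bool,
                 idle_seconds: float) -> bool:
    """Segment is not acceptable if all three conditions hold."""
    return (ts_less_than(seg_tsval, ts_recent)
            and not rst
            and idle_seconds < IDLE_LIMIT_SECONDS)


def update_ts_recent(seg_tsval: int, seg_seq: int, last_ack_sent: int,
                     ts_recent: int) -> int:
    """Return the new TS.Recent after the SEG.SEQ <= Last.ACK.sent test."""
    return seg_tsval if seq_leq(seg_seq, last_ack_sent) else ts_recent
```

The modular comparison lets a TSval that has just wrapped past zero still compare as newer than a TS.Recent near the top of the 32-bit range.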
skipping to change at page 44, line 10 skipping to change at page 44, line 10
ESTABLISHED STATE ESTABLISHED STATE
FIN-WAIT-1 STATE FIN-WAIT-1 STATE
FIN-WAIT-2 STATE FIN-WAIT-2 STATE
... ...
Send an acknowledgment of the form: Send an acknowledgment of the form:
<SEQ=SND.NXT><ACK=RCV.NXT><CTL=ACK> <SEQ=SND.NXT><ACK=RCV.NXT><CTL=ACK>
If the Snd.TS.OK bit is on, include Timestamp Option If the Snd.TS.OK bit is on, include Timestamps option
<TSval=Snd.TSclock,TSecr=TS.Recent> in this <ACK> segment. <TSval=Snd.TSclock,TSecr=TS.Recent> in this <ACK> segment.
Set Last.ACK.sent to SEG.ACK of the acknowledgment, and send Set Last.ACK.sent to SEG.ACK of the acknowledgment, and send
it. This acknowledgment should be piggy-backed on a segment it. This acknowledgment should be piggy-backed on a segment
being transmitted if possible without incurring undue delay. being transmitted if possible without incurring undue delay.
... ...
Appendix E. Timestamps Edge Cases Appendix E. Timestamps Edge Cases
While the rules laid out for when to calculate RTTM produce the While the rules laid out for when to calculate RTTM produce the
 End of changes. 22 change blocks. 
33 lines changed or deleted 37 lines changed or added
