draft-ietf-tcpm-1323bis-11.txt   draft-ietf-tcpm-1323bis-12.txt 
TCP Maintenance (TCPM) D. Borman TCP Maintenance (TCPM) D. Borman
Internet-Draft Quantum Corporation Internet-Draft Quantum Corporation
Intended status: Standards Track B. Braden Intended status: Standards Track B. Braden
Expires: October 24, 2013 University of Southern Expires: November 15, 2013 University of Southern
California California
V. Jacobson V. Jacobson
Packet Design Packet Design
R. Scheffenegger, Ed. R. Scheffenegger, Ed.
NetApp, Inc. NetApp, Inc.
April 22, 2013 May 14, 2013
TCP Extensions for High Performance TCP Extensions for High Performance
draft-ietf-tcpm-1323bis-11 draft-ietf-tcpm-1323bis-12
Abstract Abstract
This document specifies a set of TCP extensions to improve This document specifies a set of TCP extensions to improve
performance over paths with a large bandwidth * delay product and to performance over paths with a large bandwidth * delay product and to
provide reliable operation over very high-speed paths. It defines provide reliable operation over very high-speed paths. It defines
TCP options for scaled windows and timestamps. The timestamps are TCP options for scaled windows and timestamps. The timestamps are
used for two distinct mechanisms, RTTM (Round Trip Time Measurement) used for two distinct mechanisms, RTTM (Round Trip Time Measurement)
and PAWS (Protection Against Wrapped Sequences). and PAWS (Protection Against Wrapped Sequences).
skipping to change at page 1, line 43 skipping to change at page 1, line 43
Internet-Drafts are working documents of the Internet Engineering Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF). Note that other groups may also distribute Task Force (IETF). Note that other groups may also distribute
working documents as Internet-Drafts. The list of current Internet- working documents as Internet-Drafts. The list of current Internet-
Drafts is at http://datatracker.ietf.org/drafts/current/. Drafts is at http://datatracker.ietf.org/drafts/current/.
Internet-Drafts are draft documents valid for a maximum of six months Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress." material or to cite them other than as "work in progress."
This Internet-Draft will expire on October 24, 2013. This Internet-Draft will expire on November 15, 2013.
Copyright Notice Copyright Notice
Copyright (c) 2013 IETF Trust and the persons identified as the Copyright (c) 2013 IETF Trust and the persons identified as the
document authors. All rights reserved. document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents Provisions Relating to IETF Documents
(http://trustee.ietf.org/license-info) in effect on the date of (http://trustee.ietf.org/license-info) in effect on the date of
publication of this document. Please review these documents publication of this document. Please review these documents
skipping to change at page 3, line 17 skipping to change at page 3, line 17
1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 4 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.1. TCP Performance . . . . . . . . . . . . . . . . . . . . . 4 1.1. TCP Performance . . . . . . . . . . . . . . . . . . . . . 4
1.2. TCP Reliability . . . . . . . . . . . . . . . . . . . . . 5 1.2. TCP Reliability . . . . . . . . . . . . . . . . . . . . . 5
1.3. Using TCP options . . . . . . . . . . . . . . . . . . . . 6 1.3. Using TCP options . . . . . . . . . . . . . . . . . . . . 6
1.4. Terminology . . . . . . . . . . . . . . . . . . . . . . . 7 1.4. Terminology . . . . . . . . . . . . . . . . . . . . . . . 7
2. TCP Window Scale Option . . . . . . . . . . . . . . . . . . . 8 2. TCP Window Scale Option . . . . . . . . . . . . . . . . . . . 8
2.1. Introduction . . . . . . . . . . . . . . . . . . . . . . . 8 2.1. Introduction . . . . . . . . . . . . . . . . . . . . . . . 8
2.2. Window Scale Option . . . . . . . . . . . . . . . . . . . 8 2.2. Window Scale Option . . . . . . . . . . . . . . . . . . . 8
2.3. Using the Window Scale Option . . . . . . . . . . . . . . 9 2.3. Using the Window Scale Option . . . . . . . . . . . . . . 9
2.4. Addressing Window Retraction . . . . . . . . . . . . . . . 10 2.4. Addressing Window Retraction . . . . . . . . . . . . . . . 10
3. RTTM -- Round-Trip Time Measurement . . . . . . . . . . . . . 12 3. TCP Timestamp Option . . . . . . . . . . . . . . . . . . . . . 12
3.1. Introduction . . . . . . . . . . . . . . . . . . . . . . . 12 3.1. Introduction . . . . . . . . . . . . . . . . . . . . . . . 12
3.2. TCP Timestamp Option . . . . . . . . . . . . . . . . . . . 13 3.2. Timestamp Option . . . . . . . . . . . . . . . . . . . . . 12
3.3. The RTTM Mechanism . . . . . . . . . . . . . . . . . . . . 14 3.3. The RTTM Mechanism . . . . . . . . . . . . . . . . . . . . 13
3.4. Which Timestamp to Echo . . . . . . . . . . . . . . . . . 16 3.4. Updating the RTO value . . . . . . . . . . . . . . . . . . 15
3.5. Which Timestamp to Echo . . . . . . . . . . . . . . . . . 15
4. PAWS -- Protection Against Wrapped Sequence Numbers . . . . . 18 4. PAWS -- Protection Against Wrapped Sequence Numbers . . . . . 18
4.1. Introduction . . . . . . . . . . . . . . . . . . . . . . . 18 4.1. Introduction . . . . . . . . . . . . . . . . . . . . . . . 18
4.2. The PAWS Mechanism . . . . . . . . . . . . . . . . . . . . 18 4.2. The PAWS Mechanism . . . . . . . . . . . . . . . . . . . . 18
4.3. Basic PAWS Algorithm . . . . . . . . . . . . . . . . . . . 20 4.3. Basic PAWS Algorithm . . . . . . . . . . . . . . . . . . . 19
4.4. Timestamp Clock . . . . . . . . . . . . . . . . . . . . . 22 4.4. Timestamp Clock . . . . . . . . . . . . . . . . . . . . . 21
4.5. Outdated Timestamps . . . . . . . . . . . . . . . . . . . 23 4.5. Outdated Timestamps . . . . . . . . . . . . . . . . . . . 23
4.6. Header Prediction . . . . . . . . . . . . . . . . . . . . 24 4.6. Header Prediction . . . . . . . . . . . . . . . . . . . . 23
4.7. IP Fragmentation . . . . . . . . . . . . . . . . . . . . . 25 4.7. IP Fragmentation . . . . . . . . . . . . . . . . . . . . . 25
4.8. Duplicates from Earlier Incarnations of Connection . . . . 25 4.8. Duplicates from Earlier Incarnations of Connection . . . . 25
5. Conclusions and Acknowledgements . . . . . . . . . . . . . . . 26 5. Conclusions and Acknowledgements . . . . . . . . . . . . . . . 25
6. Security Considerations . . . . . . . . . . . . . . . . . . . 26 6. Security Considerations . . . . . . . . . . . . . . . . . . . 26
7. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 28 7. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 27
8. References . . . . . . . . . . . . . . . . . . . . . . . . . . 28 8. References . . . . . . . . . . . . . . . . . . . . . . . . . . 28
8.1. Normative References . . . . . . . . . . . . . . . . . . . 28 8.1. Normative References . . . . . . . . . . . . . . . . . . . 28
8.2. Informative References . . . . . . . . . . . . . . . . . . 28 8.2. Informative References . . . . . . . . . . . . . . . . . . 28
Appendix A. Implementation Suggestions . . . . . . . . . . . . . 30 Appendix A. Implementation Suggestions . . . . . . . . . . . . . 31
Appendix B. Duplicates from Earlier Connection Incarnations . . . 31 Appendix B. Duplicates from Earlier Connection Incarnations . . . 32
B.1. System Crash with Loss of State . . . . . . . . . . . . . 32 B.1. System Crash with Loss of State . . . . . . . . . . . . . 32
B.2. Closing and Reopening a Connection . . . . . . . . . . . . 32 B.2. Closing and Reopening a Connection . . . . . . . . . . . . 33
Appendix C. Summary of Notation . . . . . . . . . . . . . . . . . 34 Appendix C. Summary of Notation . . . . . . . . . . . . . . . . . 34
Appendix D. Event Processing Summary . . . . . . . . . . . . . . 35 Appendix D. Event Processing Summary . . . . . . . . . . . . . . 35
Appendix E. Timestamps Edge Cases . . . . . . . . . . . . . . . . 40 Appendix E. Timestamps Edge Cases . . . . . . . . . . . . . . . . 40
Appendix F. Window Retraction Example . . . . . . . . . . . . . . 41 Appendix F. Window Retraction Example . . . . . . . . . . . . . . 41
Appendix G. Changes from RFC 1323 . . . . . . . . . . . . . . . . 41 Appendix G. Changes from RFC 1323 . . . . . . . . . . . . . . . . 42
Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 43 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 44
1. Introduction 1. Introduction
The TCP protocol [RFC0793] was designed to operate reliably over The TCP protocol [RFC0793] was designed to operate reliably over
almost any transmission medium regardless of transmission rate, almost any transmission medium regardless of transmission rate,
delay, corruption, duplication, or reordering of segments. Over the delay, corruption, duplication, or reordering of segments. Over the
years, advances in networking technology have resulted in ever-higher years, advances in networking technology have resulted in ever-higher
transmission speeds, and the fastest paths are well beyond the domain transmission speeds, and the fastest paths are well beyond the domain
for which TCP was originally engineered. for which TCP was originally engineered.
skipping to change at page 4, line 34 skipping to change at page 4, line 34
[RFC1323] should be consulted for reference. It is recommended that [RFC1323] should be consulted for reference. It is recommended that
a modern TCP stack implement and make use of the extensions a modern TCP stack implement and make use of the extensions
described in this document. described in this document.
1.1. TCP Performance 1.1. TCP Performance
TCP performance problems arise when the bandwidth * delay product is TCP performance problems arise when the bandwidth * delay product is
large. A network having such paths is referred to as a "long, fat large. A network having such paths is referred to as a "long, fat
network" (LFN). network" (LFN).
There are three fundamental performance problems with basic TCP over There are two fundamental performance problems with basic TCP over
LFN paths: LFN paths:
(1) Window Size Limit (1) Window Size Limit
The TCP header uses a 16 bit field to report the receive window The TCP header uses a 16 bit field to report the receive window
size to the sender. Therefore, the largest window that can be size to the sender. Therefore, the largest window that can be
used is 2^16 = 64 KiB. used is 2^16 = 64 KiB. For LFN paths where the bandwidth *
delay product exceeds 64 KiB, the receive window limits the
maximum throughput of the TCP connection over the path, i.e.,
the amount of unacknowledged data that TCP can send in order to
keep the pipeline full.
To circumvent this problem, Section 2 of this memo defines a TCP To circumvent this problem, Section 2 of this memo defines a TCP
option, "Window Scale", to allow windows larger than 2^16. This option, "Window Scale", to allow windows larger than 2^16. This
option defines an implicit scale factor, which is used to option defines an implicit scale factor, which is used to
multiply the window size value found in a TCP header to obtain multiply the window size value found in a TCP header to obtain
the true window size. the true window size.
(2) Recovery from Losses (2) Recovery from Losses
Packet losses in an LFN can have a catastrophic effect on Packet losses in an LFN can have a catastrophic effect on
throughput. throughput.
To generalize the Fast Retransmit/Fast Recovery mechanism to To generalize the Fast Retransmit/Fast Recovery mechanism to
handle multiple packets dropped per window, selective handle multiple packets dropped per window, selective
acknowledgments are required. Unlike the normal cumulative acknowledgments are required. Unlike the normal cumulative
acknowledgments of TCP, selective acknowledgments give the acknowledgments of TCP, selective acknowledgments give the
sender a complete picture of which segments are queued at the sender a complete picture of which segments are queued at the
receiver and which have not yet arrived. receiver and which have not yet arrived.
Selective acknowledgements are specified in a separate document, Selective acknowledgements and their use are specified in
"A Conservative Selective Acknowledgment (SACK)-based Loss separate documents, "TCP Selective Acknowledgment Options"
Recovery Algorithm for TCP" [RFC6675], and not further discussed [RFC2018], "An Extension to the Selective Acknowledgement (SACK)
in this document. Option for TCP" [RFC2883], and "A Conservative Selective
Acknowledgment (SACK)-based Loss Recovery Algorithm for TCP"
(3) Round-Trip Measurement [RFC6675], and not further discussed in this document.
TCP implements reliable data delivery by retransmitting segments
that are not acknowledged within some retransmission timeout
(RTO) interval. Accurate dynamic determination of an
appropriate RTO is essential to TCP performance. RTO is
determined by estimating the mean and variance of the measured
round-trip time (RTT), i.e., the time interval between sending a
segment and receiving an acknowledgment for it [Jacobson88a].
Section 3.2 defines a TCP option, "Timestamp", and then
specifies a mechanism using this option that allows nearly every
segment, including retransmissions, to be timed at negligible
computational cost. We use the mnemonic RTTM (Round Trip Time
Measurement) for this mechanism, to distinguish it from other
uses of the Timestamp Option.
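As a rough illustration of the window size limit in item (1), the following Python sketch compares the bandwidth * delay product of a path with the throughput ceiling imposed by an unscaled 16-bit window. The path parameters (1 Gbit/s, 100 ms RTT) and the function names are assumptions chosen for the example, not values taken from this document.

```python
# Largest value the 16-bit Window field can carry without scaling.
UNSCALED_MAX_WINDOW = 2**16 - 1


def bdp_bytes(bandwidth_bps: int, rtt_ms: int) -> float:
    """Bandwidth * delay product of a path, in bytes."""
    return bandwidth_bps * rtt_ms / 8000


def max_throughput_bps(window_bytes: int, rtt_ms: int) -> float:
    """Throughput ceiling: at most one full window per round trip."""
    return window_bytes * 8 * 1000 / rtt_ms


# Hypothetical LFN path: 1 Gbit/s with a 100 ms round-trip time.
path_bdp = bdp_bytes(10**9, 100)
ceiling = max_throughput_bps(UNSCALED_MAX_WINDOW, 100)

print(f"bandwidth*delay product: {path_bdp:.0f} bytes")
print(f"ceiling with unscaled window: {ceiling:.0f} bit/s")
```

On this assumed path the unscaled window allows only a small fraction of the link rate, which is the situation the Window Scale option in Section 2 is meant to address.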
1.2. TCP Reliability 1.2. TCP Reliability
An especially serious kind of error may result from an accidental An especially serious kind of error may result from an accidental
reuse of TCP sequence numbers in data segments. TCP reliability reuse of TCP sequence numbers in data segments. TCP reliability
depends upon the existence of a bound on the lifetime of a segment: depends upon the existence of a bound on the lifetime of a segment:
the "Maximum Segment Lifetime" or MSL. the "Maximum Segment Lifetime" or MSL.
Duplication of sequence numbers might happen in either of two ways: Duplication of sequence numbers might happen in either of two ways:
skipping to change at page 6, line 36 skipping to change at page 6, line 26
Section 3.2 to protect against old duplicates from the same Section 3.2 to protect against old duplicates from the same
connection. connection.
1.3. Using TCP options 1.3. Using TCP options
The extensions defined in this document all use TCP options. The extensions defined in this document all use TCP options.
When [RFC1323] was published, there was concern that some buggy TCP When [RFC1323] was published, there was concern that some buggy TCP
implementation might be crashed by the first appearance of an option implementation might be crashed by the first appearance of an option
on a non-<SYN> segment. However, bugs like that can lead to DOS on a non-<SYN> segment. However, bugs like that can lead to DOS
attacks against a TCP, so it is now expected that most TCP attacks against a TCP. Research has shown that most TCP
implementations will properly handle unknown options on non-<SYN> implementations will properly handle unknown options on non-<SYN>
segments. But it is still prudent to be conservative in what you segments ([Medina04], [Medina05]). But it is still prudent to be
send, and avoiding buggy TCP implementation is not the only reason conservative in what you send, and avoiding buggy TCP implementation
for negotiating TCP options on <SYN> segments. is not the only reason for negotiating TCP options on <SYN> segments.
The window scale option negotiates fundamental parameters of the TCP The window scale option negotiates fundamental parameters of the TCP
session. Therefore, it is only sent during the initial handshake. session. Therefore, it is only sent during the initial handshake.
Furthermore, the window scale option will be sent in a <SYN,ACK> Furthermore, the window scale option will be sent in a <SYN,ACK>
segment only if the corresponding option was received in the initial segment only if the corresponding option was received in the initial
<SYN> segment. <SYN> segment.
The timestamp option may appear in any data or <ACK> segment, adding The timestamp option may appear in any data or <ACK> segment, adding
12 bytes to the 20-byte TCP header. We recognize there is a trade- 12 bytes to the 20-byte TCP header. It is required that this TCP
off between the bandwidth saved by reducing unnecessary option will be sent on all non-<SYN> segments after an exchange of
retransmission timeouts, and the extra header bandwidth used by this options on the <SYN> segments has indicated that both sides
option. It is required that this TCP option will be sent on non- understand this extension.
<SYN> segments only after an exchange of options on the <SYN>
segments has indicated that both sides understand this extension. Research has shown that the use of the Timestamp option to arrive at
an optimal retransmission timeout value has only limited benefit
([Allman99]). However, there are other uses of the Timestamp option,
such as the Eifel mechanism [RFC3522], [RFC4015], and PAWS (see
Section 4) which improve overall TCP security and performance. The
extra header bandwidth used by this option should be evaluated for
the gains in performance and security in an actual deployment.
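One simple way to gauge the extra header bandwidth is to express the 12 bytes as a fraction of each segment. The Python sketch below does this using the 12-byte and 20-byte figures from the text; the payload sizes used in the example (1448 bytes and 0 bytes) are assumptions, and IP and link-layer headers are deliberately left out.

```python
TCP_BASE_HEADER = 20    # bytes, basic TCP header (from the text)
TIMESTAMP_BYTES = 12    # bytes added per segment by the option (from the text)


def timestamp_overhead_fraction(payload_bytes: int) -> float:
    """Share of the TCP segment (header plus payload) used by the option."""
    segment = TCP_BASE_HEADER + TIMESTAMP_BYTES + payload_bytes
    return TIMESTAMP_BYTES / segment


# Assumed payload sizes: a full-sized data segment and a pure <ACK>.
for payload in (1448, 0):
    share = timestamp_overhead_fraction(payload)
    print(f"payload {payload:4d} bytes: {share:.2%} of the segment")
```

Under these assumptions the relative cost is small for full-sized data segments and much larger for pure acknowledgments, so the evaluation depends on the traffic mix of the deployment.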
Appendix A contains a recommended layout of the options in TCP Appendix A contains a recommended layout of the options in TCP
headers to achieve reasonable data field alignment. headers to achieve reasonable data field alignment.
Finally, we observe that most of the mechanisms defined in this Finally, we observe that most of the mechanisms defined in this
document are important for LFN's and/or very high-speed networks. document are important for LFN's and/or very high-speed networks.
For low-speed networks, it might be a performance optimization to NOT For low-speed networks, it might be a performance optimization to NOT
use these mechanisms. A TCP vendor concerned about optimal use these mechanisms. A TCP vendor concerned about optimal
performance over low-speed paths might consider turning these performance over low-speed paths might consider turning these
extensions off for low-speed paths, or allow a user or installation extensions off for low-speed paths, or allow a user or installation
skipping to change at page 8, line 10 skipping to change at page 8, line 10
In this document, these words will appear with that interpretation In this document, these words will appear with that interpretation
only when in UPPER CASE. Lower case uses of these words are not to only when in UPPER CASE. Lower case uses of these words are not to
be interpreted as carrying [RFC2119] significance. be interpreted as carrying [RFC2119] significance.
2. TCP Window Scale Option 2. TCP Window Scale Option
2.1. Introduction 2.1. Introduction
The window scale extension expands the definition of the TCP window The window scale extension expands the definition of the TCP window
to 32 bits and then uses a scale factor to carry this 32-bit value in to 30 bits and then uses an implicit scale factor to carry this 30-
the 16-bit Window field of the TCP header (SEG.WND in RFC 793). The bit value in the 16-bit Window field of the TCP header (SEG.WND in
scale factor is carried in a TCP option, Window Scale. This option [RFC0793]). The exponent of the scale factor is carried in a TCP
is sent only in a <SYN> segment (a segment with the SYN bit on), option, Window Scale. This option is sent only in a <SYN> segment (a
hence the window scale is fixed in each direction when a connection segment with the SYN bit on), hence the window scale is fixed in each
is opened. direction when a connection is opened.
The maximum receive window, and therefore the scale factor, is The maximum receive window, and therefore the scale factor, is
determined by the maximum receive buffer space. In a typical modern determined by the maximum receive buffer space. In a typical modern
implementation, this maximum buffer space is set by default but can implementation, this maximum buffer space is set by default but can
be overridden by a user program before a TCP connection is opened. be overridden by a user program before a TCP connection is opened.
This determines the scale factor, and therefore no new user interface This determines the scale factor, and therefore no new user interface
is needed for window scaling. is needed for window scaling.
2.2. Window Scale Option 2.2. Window Scale Option
The three-byte Window Scale option MAY be sent in a <SYN> segment by The three-byte Window Scale option MAY be sent in a <SYN> segment by
a TCP. It has two purposes: (1) indicate that the TCP is prepared to a TCP. It has two purposes: (1) indicate that the TCP is prepared to
do both send and receive window scaling, and (2) communicate a scale do both send and receive window scaling, and (2) communicate the
factor to be applied to its receive window. Thus, a TCP that is exponent of a scale factor to be applied to its receive window.
prepared to scale windows SHOULD send the option, even if its own Thus, a TCP that is prepared to scale windows SHOULD send the option,
scale factor is 1. The scale factor is limited to a power of two and even if its own scale factor is 1 and the exponent 0. The scale
encoded logarithmically, so it may be implemented by binary shift factor is limited to a power of two and encoded logarithmically, so
operations. it may be implemented by binary shift operations. The maximum scale
exponent is limited to 14 for a maximum permissible receive window
size of 1 GiB (2^(14+16)).
TCP Window Scale Option (WSopt): TCP Window Scale Option (WSopt):
Kind: 3 Kind: 3
Length: 3 bytes Length: 3 bytes
+---------+---------+---------+ +---------+---------+---------+
| Kind=3 |Length=3 |shift.cnt| | Kind=3 |Length=3 |shift.cnt|
+---------+---------+---------+ +---------+---------+---------+
skipping to change at page 9, line 9 skipping to change at page 9, line 11
either direction. If window scaling is enabled, then the TCP that either direction. If window scaling is enabled, then the TCP that
sent this option will right-shift its true receive-window values by sent this option will right-shift its true receive-window values by
'shift.cnt' bits for transmission in SEG.WND. The value 'shift.cnt' 'shift.cnt' bits for transmission in SEG.WND. The value 'shift.cnt'
MAY be zero (offering to scale, while applying a scale factor of 1 to MAY be zero (offering to scale, while applying a scale factor of 1 to
the receive window). the receive window).
This option MAY be sent in an initial <SYN> segment (i.e., a segment This option MAY be sent in an initial <SYN> segment (i.e., a segment
with the SYN bit on and the ACK bit off). It MAY also be sent in a with the SYN bit on and the ACK bit off). It MAY also be sent in a
<SYN,ACK> segment, but only if a Window Scale option was received in <SYN,ACK> segment, but only if a Window Scale option was received in
the initial <SYN> segment. A Window Scale option in a segment the initial <SYN> segment. A Window Scale option in a segment
without a SYN bit SHOULD be ignored. without a SYN bit MUST be ignored.
The window field in a segment where the SYN bit is set (i.e., a <SYN> The window field in a segment where the SYN bit is set (i.e., a <SYN>
or <SYN,ACK>) is never scaled. or <SYN,ACK>) is never scaled.
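The option format above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the function names are hypothetical, the option is assumed to have already been isolated from the other TCP options, and malformed options are simply treated as absent.

```python
from typing import Optional

WSOPT_KIND = 3
WSOPT_LENGTH = 3
MAX_SHIFT_CNT = 14  # largest exponent a sender should offer


def encode_wsopt(shift_cnt: int) -> bytes:
    """Build the three-byte option: Kind=3, Length=3, shift.cnt."""
    if not 0 <= shift_cnt <= MAX_SHIFT_CNT:
        raise ValueError("shift.cnt must be in the range 0..14")
    return bytes([WSOPT_KIND, WSOPT_LENGTH, shift_cnt])


def parse_wsopt(option: bytes, syn_bit_set: bool) -> Optional[int]:
    """Return shift.cnt, or None when the option is ignored."""
    if not syn_bit_set:
        return None  # option on a segment without the SYN bit is ignored
    if len(option) != WSOPT_LENGTH:
        return None  # assumption: malformed options are treated as absent
    if option[0] != WSOPT_KIND or option[1] != WSOPT_LENGTH:
        return None
    return option[2]


print(encode_wsopt(7).hex())
print(parse_wsopt(encode_wsopt(7), syn_bit_set=True))
print(parse_wsopt(encode_wsopt(7), syn_bit_set=False))
```

The parser returns the raw shift.cnt value; limiting a received value to 14 is a separate step described in Section 2.3.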
2.3. Using the Window Scale Option 2.3. Using the Window Scale Option
A model implementation of window scaling is as follows, using the A model implementation of window scaling is as follows, using the
notation of [RFC0793]: notation of [RFC0793]:
o All windows are treated as 32-bit quantities for storage in the o All windows are treated as 32-bit quantities for storage in the
connection control block and for local calculations. This connection control block and for local calculations. This
includes the send-window (SND.WND) and the receive-window includes the send-window (SND.WND) and the receive-window
(RCV.WND) values, as well as the congestion window. (RCV.WND) values, as well as the congestion window.
o The connection state is augmented by two window shift counts, o The connection state is augmented by two window shift counters,
Snd.Wind.Scale and Rcv.Wind.Scale, to be applied to the incoming Snd.Wind.Shift and Rcv.Wind.Shift, to be applied to the incoming
and outgoing window fields, respectively. and outgoing window fields, respectively.
o If a TCP receives a <SYN> segment containing a Window Scale o If a TCP receives a <SYN> segment containing a Window Scale
option, it sends its own Window Scale option in the <SYN,ACK> option, it sends its own Window Scale option in the <SYN,ACK>
segment. segment.
o The Window Scale option is sent with shift.cnt = R, where R is the o The Window Scale option is sent with shift.cnt = R, where R is the
value that the TCP would like to use for its receive window. value that the TCP would like to use for its receive window.
o Upon receiving a <SYN> segment with a Window Scale option o Upon receiving a <SYN> segment with a Window Scale option
containing shift.cnt = S, a TCP sets Snd.Wind.Scale to S and sets containing shift.cnt = S, a TCP sets Snd.Wind.Shift to S and sets
Rcv.Wind.Scale to R; otherwise, it sets both Snd.Wind.Scale and Rcv.Wind.Shift to R; otherwise, it sets both Snd.Wind.Shift and
Rcv.Wind.Scale to zero. Rcv.Wind.Shift to zero.
o The window field (SEG.WND) in the header of every incoming o The window field (SEG.WND) in the header of every incoming
segment, with the exception of <SYN> segments, is left-shifted by segment, with the exception of <SYN> segments, is left-shifted by
Snd.Wind.Scale bits before updating SND.WND: Snd.Wind.Shift bits before updating SND.WND:
SND.WND = SEG.WND << Snd.Wind.Scale SND.WND = SEG.WND << Snd.Wind.Shift
(assuming the other conditions of [RFC0793] are met, and using the (assuming the other conditions of [RFC0793] are met, and using the
"C" notation "<<" for left-shift). "C" notation "<<" for left-shift).
o The window field (SEG.WND) of every outgoing segment, with the o The window field (SEG.WND) of every outgoing segment, with the
exception of <SYN> segments, is right-shifted by Rcv.Wind.Scale exception of <SYN> segments, is right-shifted by Rcv.Wind.Shift
bits: bits:
SEG.WND = RCV.WND >> Rcv.Wind.Scale SEG.WND = RCV.WND >> Rcv.Wind.Shift
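The model implementation in the list above can be sketched in Python as shown below. The class and method names are hypothetical; the attributes mirror Snd.Wind.Shift and Rcv.Wind.Shift (named Snd.Wind.Scale and Rcv.Wind.Scale in the -11 draft). The sketch covers only the shift bookkeeping and assumes the other [RFC0793] conditions for updating SND.WND are checked elsewhere.

```python
class WindowScaleState:
    """Per-connection window scaling state (illustrative only)."""

    def __init__(self, rcv_wnd: int = 0):
        self.snd_wind_shift = 0  # applied to incoming window fields
        self.rcv_wind_shift = 0  # applied to outgoing window fields
        self.snd_wnd = 0         # held as a full-width integer
        self.rcv_wnd = rcv_wnd

    def on_syn_received(self, peer_shift, own_shift):
        """peer_shift is S from the peer's option, or None if absent."""
        if peer_shift is None:
            self.snd_wind_shift = 0
            self.rcv_wind_shift = 0
        else:
            self.snd_wind_shift = peer_shift
            self.rcv_wind_shift = own_shift

    def on_segment_received(self, seg_wnd: int, syn: bool):
        """Update SND.WND; the field of a <SYN> segment is never scaled."""
        self.snd_wnd = seg_wnd if syn else seg_wnd << self.snd_wind_shift

    def window_field_to_send(self, syn: bool) -> int:
        """Value to place in the 16-bit Window field of an outgoing segment."""
        wnd = self.rcv_wnd if syn else self.rcv_wnd >> self.rcv_wind_shift
        return min(wnd, 0xFFFF)


conn = WindowScaleState(rcv_wnd=1 << 20)
conn.on_syn_received(peer_shift=7, own_shift=5)
conn.on_segment_received(seg_wnd=512, syn=False)
print(conn.snd_wnd, conn.window_field_to_send(syn=False))
```

Keeping the window values as full-width integers and shifting only at the header boundary follows the first bullet of the model.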
TCP determines if a data segment is "old" or "new" by testing whether TCP determines if a data segment is "old" or "new" by testing whether
its sequence number is within 2^31 bytes of the left edge of the its sequence number is within 2^31 bytes of the left edge of the
window, and if it is not, discarding the data as "old". To ensure window, and if it is not, discarding the data as "old". To ensure
that new data is never mistakenly considered old and vice versa, the that new data is never mistakenly considered old and vice versa, the
left edge of the sender's window has to be at most 2^31 away from the left edge of the sender's window has to be at most 2^31 away from the
right edge of the receiver's window. Similarly with the sender's right edge of the receiver's window. Similarly with the sender's
right edge and receiver's left edge. Since the right and left edges right edge and receiver's left edge. Since the right and left edges
of either the sender's or receiver's window differ by the window of either the sender's or receiver's window differ by the window
size, and since the sender and receiver windows can be out of phase size, and since the sender and receiver windows can be out of phase
by at most the window size, the above constraints imply that two by at most the window size, the above constraints imply that two
times the maximum window size must be less than 2^31, or times the maximum window size must be less than 2^31, or
max window < 2^30 max window < 2^30
Since the max window is 2^S (where S is the scaling shift count) Since the max window is 2^S (where S is the scaling shift count)
times at most 2^16 - 1 (the maximum unscaled window), the maximum times at most 2^16 - 1 (the maximum unscaled window), the maximum
window is guaranteed to be < 2^30 if S <= 14. Thus, the shift count window is guaranteed to be < 2^30 if S <= 14. Thus, the shift count
MUST be limited to 14 (which allows windows of 2^30 = 1 GiB). If a MUST be limited to 14 (which allows windows of 2^30 = 1 GiB). If a
Window Scale option is received with a shift.cnt value exceeding 14, Window Scale option is received with a shift.cnt value larger than
the TCP SHOULD log the error but use 14 instead of the specified 14, the TCP SHOULD log the error but MUST use 14 instead of the
value. specified value. This is safe as a sender can always choose to only
partially use any signaled receive window.
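The bound derived above, and the handling of an oversized shift.cnt, can be checked with a short Python sketch. The function names are hypothetical, and the way the error is reported (a boolean flag left to the caller to log) is an assumption made for the example.

```python
MAX_SHIFT = 14
MAX_UNSCALED_WINDOW = 2**16 - 1


def max_scaled_window(shift: int) -> int:
    """Largest window expressible with the given shift count."""
    return MAX_UNSCALED_WINDOW << shift


def effective_shift(received_shift_cnt: int):
    """Return (shift to use, whether the received value was out of range)."""
    if received_shift_cnt > MAX_SHIFT:
        return MAX_SHIFT, True  # caller may log the error
    return received_shift_cnt, False


# S = 14 keeps the window below 2^30; S = 15 would not.
print(max_scaled_window(14) < 2**30)
print(max_scaled_window(15) < 2**30)
print(effective_shift(15))
```

Using 14 in place of a larger received value only makes the sender use less of the window than the peer signaled.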
The scale factor applies only to the Window field as transmitted in The scale factor applies only to the Window field as transmitted in
the TCP header; each TCP using extended windows will maintain the the TCP header; each TCP using extended windows will maintain the
window values locally as 32-bit numbers. For example, the window values locally as 32-bit numbers. For example, the
"congestion window" computed by Slow Start and Congestion Avoidance "congestion window" computed by Slow Start and Congestion Avoidance
is not affected by the scale factor, so window scaling will not (see [RFC5681]) is not affected by the scale factor, so window
introduce quantization into the congestion window. scaling will not introduce quantization into the congestion window.
2.4. Addressing Window Retraction 2.4. Addressing Window Retraction
When a non-zero scale factor is in use, there are instances when a When a non-zero scale factor is in use, there are instances when a
retracted window can be offered - see Appendix F for a detailed retracted window can be offered - see Appendix F for a detailed
example. The end of the window will be on a boundary based on the example. The end of the window will be on a boundary based on the
granularity of the scale factor being used. If the sequence number granularity of the scale factor being used. If the sequence number
is then updated by a number of bytes smaller than that granularity, is then updated by a number of bytes smaller than that granularity,
the TCP will have to either advertise a new window that is beyond the TCP will have to either advertise a new window that is beyond
what it previously advertised (and perhaps beyond the buffer), or what it previously advertised (and perhaps beyond the buffer), or
skipping to change at page 11, line 20 skipping to change at page 11, line 22
greater than the window announced by the most recent <ACK>, if greater than the window announced by the most recent <ACK>, if
more than one segment has arrived since the application consumed more than one segment has arrived since the application consumed
any data in the receive buffer). any data in the receive buffer).
On the sender side: On the sender side:
3) The initial transmission MUST be within the window announced by 3) The initial transmission MUST be within the window announced by
the most recent <ACK>. the most recent <ACK>.
4) On first retransmission, or if the sequence number is out-of- 4) On first retransmission, or if the sequence number is out-of-
window by less than (2^Rcv.Wind.Scale) then do normal window by less than 2^Rcv.Wind.Shift then do normal
retransmission(s) without regard to receiver window as long as retransmission(s) without regard to receiver window as long as
the original segment was in window when it was sent. the original segment was in window when it was sent.
5) Subsequent retransmissions MAY only be sent, if they are within 5) Subsequent retransmissions MAY only be sent, if they are within
the window announced by the most recent <ACK>. the window announced by the most recent <ACK>.
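The sender-side rules 3) to 5) can be read as a small decision procedure. The following is a minimal sketch of one possible reading, not a definitive implementation. The function names, the `attempt` counter (0 for the initial transmission, 1 for the first retransmission, 2 or more for later ones) and the `was_in_window_when_sent` flag are hypothetical and do not appear in the draft; `wind_shift` stands for the scale exponent named in rule 4.

```python
MOD = 1 << 32  # TCP sequence numbers are 32-bit and wrap around


def bytes_beyond_window(seg_end, snd_una, snd_wnd):
    """Bytes by which seg_end lies past the right window edge (0 if inside).

    seg_end is the sequence number just past the last byte of the segment.
    """
    right_edge = (snd_una + snd_wnd) % MOD
    over = (seg_end - right_edge) % MOD
    # Differences of 2^31 or more mean seg_end is at or before the edge.
    return over if 0 < over < (1 << 31) else 0


def may_transmit(seg_end, snd_una, snd_wnd, wind_shift,
                 attempt, was_in_window_when_sent):
    """Decide whether a segment may be (re)transmitted under rules 3) to 5)."""
    over = bytes_beyond_window(seg_end, snd_una, snd_wnd)
    if over == 0:
        return True                      # inside the announced window
    if attempt == 0:
        return False                     # rule 3: initial send must be in window
    if not was_in_window_when_sent:
        return False                     # rule 4 precondition not met
    if attempt == 1 or over < (1 << wind_shift):
        return True                      # rule 4: ignore the retracted window
    return False                         # rule 5: later retransmissions must fit
```

For instance, a segment that ends 100 bytes past a retracted window may still be retransmitted repeatedly when the scale exponent is 7 (granularity 128 bytes), but only once when the exponent is 0.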
3. RTTM -- Round-Trip Time Measurement 3. TCP Timestamp Option
3.1. Introduction 3.1. Introduction
Accurate and current RTT estimates are necessary to adapt to changing TCP measures the round trip time (RTT), primarily for the purpose of
traffic conditions and to avoid an instability known as "congestion arriving at a reasonable value for the Retransmission Timeout (RTO)
collapse" [RFC0896] in a busy network. However, accurate measurement timer interval. Accurate and current RTT estimates are necessary to
of RTT may be difficult both in theory and in implementation. adapt to changing traffic conditions, while a conservative estimate
of the RTO interval is necessary to minimize spurious RTOs.
Many TCP implementations base their RTT measurements upon a sample of
one segment per window or less. While this yields an adequate
approximation to the RTT for small windows, it results in an
unacceptably poor RTT estimate for a LFN. If we look at RTT
estimation as a signal processing problem (which it is), a data
signal at some frequency, the packet rate, is being sampled at a
lower frequency, the window rate. This lower sampling frequency
violates Nyquist's criterion and may therefore introduce "aliasing"
artifacts into the estimated RTT [Hamming77].
A good RTT estimator with a conservative retransmission timeout
calculation can tolerate aliasing when the sampling frequency is
"close" to the data frequency. For example, with a window of 8
segments, the sample rate is 1/8 the data frequency -- less than an
order of magnitude different. However, when the window is tens or
hundreds of segments, the RTT estimator may be seriously in error,
resulting in spurious retransmissions.
If there are dropped segments, the problem becomes worse. Zhang When [RFC1323] was originally written, it was perceived that taking
[Zhang86], Jain [Jain86] and Karn [Karn87] have shown that it is not RTT measurements for each segment, and also during retransmissions,
possible to accumulate reliable RTT estimates if retransmitted would help reduce spurious RTOs, while maintaining the
segments are included in the estimate. Since a full window of data timeliness of necessary RTOs. At the time, RTO was also the only
will have been transmitted prior to a retransmission, all of the mechanism to make use of the measured RTT. It has been shown that
segments in that window will have to be ACKed before the next RTT taking more RTT samples has only a very limited effect on optimizing
sample can be taken. This means at least an additional window's RTOs [Allman99].
worth of time between RTT measurements and, as the error rate
approaches one per window of data (e.g., 10^-6 errors per bit for the
Wideband satellite network), it becomes effectively impossible to
obtain a valid RTT measurement.
A solution to these problems, which actually simplifies the sender This document makes a clear distinction between the round trip time
substantially, is as follows: using TCP options, the sender places a measurement (RTTM) mechanism, and subsequent mechanisms using the RTT
timestamp in each data segment, and the receiver reflects these signal as input, such as RTO (see Section 3.4).
timestamps back in <ACK> segments. Then a single subtract gives the
sender an accurate RTT measurement for every <ACK> segment (which
will correspond to every other data segment, with a sensible
receiver). We call this the RTTM (Round-Trip Time Measurement)
mechanism.
It is vitally important to use the RTTM mechanism with big windows; The timestamp option is important when large receive windows are
otherwise, the door is opened to some dangerous instabilities due to used, to allow the use of the PAWS mechanism (see Section 4).
aliasing. Furthermore, the option is probably useful for all TCP's, Furthermore, the option is useful for all TCP's, since it simplifies
since it simplifies the sender. the sender and allows the use of additional optimizations such as
Eifel ([RFC3522], [RFC4015]) and others.
3.2. TCP Timestamp Option 3.2. Timestamp Option
TCP is a symmetric protocol, allowing data to be sent at any time in TCP is a symmetric protocol, allowing data to be sent at any time in
either direction, and therefore timestamp echoing may occur in either either direction, and therefore timestamp echoing may occur in either
direction. For simplicity and symmetry, we specify that timestamps direction. For simplicity and symmetry, we specify that timestamps
always be sent and echoed in both directions. For efficiency, we always be sent and echoed in both directions. For efficiency, we
combine the timestamp and timestamp reply fields into a single TCP combine the timestamp and timestamp reply fields into a single TCP
Timestamp Option. Timestamp Option.
TCP Timestamp Option (TSopt): TCP Timestamp Option (TSopt):
skipping to change at page 13, line 49 skipping to change at page 13, line 24
generally be from the most recent Timestamp Option that was received; generally be from the most recent Timestamp Option that was received;
however, there are exceptions that are explained below. however, there are exceptions that are explained below.
A TCP MAY send the Timestamp option (TSopt) in an initial <SYN> A TCP MAY send the Timestamp option (TSopt) in an initial <SYN>
segment (i.e., segment containing a SYN bit and no ACK bit), and MAY segment (i.e., segment containing a SYN bit and no ACK bit), and MAY
send a TSopt in other segments only if it received a TSopt in the send a TSopt in other segments only if it received a TSopt in the
initial <SYN> or <SYN,ACK> segment for the connection. initial <SYN> or <SYN,ACK> segment for the connection.
Once TSopt has been successfully negotiated (sent and received) Once TSopt has been successfully negotiated (sent and received)
during the <SYN>, <SYN,ACK> exchange, TSopt MUST be sent in every during the <SYN>, <SYN,ACK> exchange, TSopt MUST be sent in every
non-<RST> segment for the duration of the connection. If a non-<RST> non-<RST> segment for the duration of the connection, and SHOULD be
segment is received without a TSopt, a TCP MAY drop the segment and sent in a <RST> segment (see Section 4.2 for details). If a non-
send an <ACK> for the last in-sequence segment. A TCP MUST NOT abort <RST> segment is received without a TSopt, a TCP MAY drop the segment
a TCP connection if a non-<RST> segment is received without a TSopt. and send an <ACK> for the last in-sequence segment. A TCP MUST NOT
abort a TCP connection if a non-<RST> segment is received without a
TSopt.
If a TSopt is received on a connection where TSopt was not negotiated If a TSopt is received on a connection where TSopt was not negotiated
in the initial three-way handshake, the TSopt MUST be ignored and the in the initial three-way handshake, the TSopt MUST be ignored and the
packet processed normally. packet processed normally.
In the case of crossing <SYN> segments where one <SYN> contains a In the case of crossing <SYN> segments where one <SYN> contains a
TSopt and the other doesn't, both sides MAY send a TSopt in the TSopt and the other doesn't, both sides MAY send a TSopt in the
<SYN,ACK> segment. <SYN,ACK> segment.
TSopt is required for the two mechanisms described in sections 3.3 TSopt is required for the two mechanisms described in sections 3.3
skipping to change at page 15, line 39 skipping to change at page 15, line 5
RTTM Rule: A TSecr value received in a segment MAY be used to update RTTM Rule: A TSecr value received in a segment MAY be used to update
the averaged RTT measurement only if the segment advances the averaged RTT measurement only if the segment advances
the left edge of the send window, i.e. SND.UNA is the left edge of the send window, i.e. SND.UNA is
increased. increased.
Since TCP B is not sending data, the data segment C does not Since TCP B is not sending data, the data segment C does not
acknowledge any new data when it arrives at B. Thus, the inflated acknowledge any new data when it arrives at B. Thus, the inflated
RTTM measurement is not used to update B's RTTM measurement. RTTM measurement is not used to update B's RTTM measurement.
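The RTTM Rule can be illustrated with a short sketch, assuming timestamps and sequence numbers are handled as 32-bit wrapping integers. The function name `rtt_sample` and its parameters are hypothetical; the result is expressed in timestamp clock ticks, and converting it to seconds depends on the sender's clock rate.

```python
MOD = 1 << 32  # sequence numbers and timestamps are 32-bit and wrap


def rtt_sample(now_ts, seg_tsecr, seg_ack, snd_una):
    """Return an RTT sample in timestamp ticks, or None if not usable.

    The sample is taken only when the segment advances the left edge of
    the send window, i.e. when SEG.ACK is strictly beyond SND.UNA.
    """
    advances = 0 < ((seg_ack - snd_una) % MOD) < (1 << 31)
    if not advances:
        return None
    # A single subtraction in modulo-2^32 arithmetic yields the sample.
    return (now_ts - seg_tsecr) % MOD
```

A segment that carries data but acknowledges nothing new, such as segment C in the example above, yields `None` and leaves the averaged RTT untouched.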
3.4. Updating the RTO value
[Ekstroem04] and [Floyd05] have highlighted the problem that an
unmodified RTO calculation, which is updated with per-packet RTT
samples, will truncate the path history too soon. This can lead to
an increase in spurious retransmissions when the path properties
vary on the order of a few RTTs but a high number of RTT samples are
taken on a much shorter timescale.
Implementers should note that with timestamps multiple RTTMs can be Implementers should note that with timestamps multiple RTTMs can be
taken per RTT. Many RTO estimators have a weighting factor based on taken per RTT. The [RFC6298] RTO estimator has weighting factors,
an implicit assumption that at most one RTTM will be sampled per RTT. alpha and beta, based on an implicit assumption that at most one RTTM
When using multiple RTTMs per RTT to update the RTO estimator, the will be sampled per RTT. When using multiple RTTMs per RTT to update
weighting factor needs to be decreased to take into account the more the RTO estimator, the weighting factor SHOULD be decreased to take
frequent RTTMs. For example, an implementation could choose to just into account the more frequent RTTMs.
use one sample per RTT to update the RTO estimator, or vary the gain
based on the congestion window, or take an average of all the RTT
measurements received over one RTT, and then use that value to update
the RTO estimator. This document does not prescribe any particular
method for modifying the RTO estimator.
3.4. Which Timestamp to Echo For example, an implementation could choose to
o just use one sample per RTT to update the RTO estimator, or
o vary the gain based on the congestion window, or
o take an average of all the RTT measurements (and the maximum of
the variance) received over one RTT,
and then use that value to update the RTO estimator. This document
does not prescribe any particular method for modifying the RTO
estimator.
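As an illustration of the third option, the sketch below averages all RTT samples collected during one RTT and feeds that single value into the [RFC6298] update (alpha = 1/8, beta = 1/4, K = 4). It is one possible approach under stated assumptions, not a prescribed method: the class name, the `end_of_rtt` trigger, and the default clock granularity are hypothetical, the samples are in seconds, and the "maximum of the variance" refinement mentioned above is deliberately left out for brevity.

```python
class RtoEstimator:
    ALPHA = 1 / 8   # SRTT gain from RFC 6298
    BETA = 1 / 4    # RTTVAR gain from RFC 6298
    K = 4           # variance multiplier from RFC 6298

    def __init__(self, clock_granularity=0.001, min_rto=1.0):
        self.g = clock_granularity
        self.min_rto = min_rto
        self.srtt = None
        self.rttvar = None
        self.rto = 1.0          # initial RTO before any measurement
        self._pending = []      # samples gathered during the current RTT

    def add_sample(self, rtt):
        """Record one RTTM sample (seconds); the estimator is not yet updated."""
        self._pending.append(rtt)

    def end_of_rtt(self):
        """Fold the samples of the past RTT into the estimator exactly once."""
        if not self._pending:
            return self.rto
        r = sum(self._pending) / len(self._pending)
        self._pending.clear()
        if self.srtt is None:
            # First measurement: SRTT = R, RTTVAR = R / 2
            self.srtt = r
            self.rttvar = r / 2
        else:
            # RTTVAR is updated before SRTT, using the old SRTT
            self.rttvar = (1 - self.BETA) * self.rttvar + self.BETA * abs(self.srtt - r)
            self.srtt = (1 - self.ALPHA) * self.srtt + self.ALPHA * r
        self.rto = max(self.min_rto, self.srtt + max(self.g, self.K * self.rttvar))
        return self.rto
```

Because the estimator sees one value per RTT, the alpha and beta gains keep the meaning they have in [RFC6298] regardless of how many timestamped ACKs arrive.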
3.5. Which Timestamp to Echo
If more than one Timestamp Option is received before a reply segment If more than one Timestamp Option is received before a reply segment
is sent, the TCP must choose only one of the TSvals to echo, ignoring is sent, the TCP must choose only one of the TSvals to echo, ignoring
the others. To minimize the state kept in the receiver (i.e., the the others. To minimize the state kept in the receiver (i.e., the
number of unprocessed TSvals), the receiver should be required to number of unprocessed TSvals), the receiver should be required to
retain at most one timestamp in the connection control block. retain at most one timestamp in the connection control block.
There are three situations to consider: There are three situations to consider:
(A) Delayed ACKs. (A) Delayed ACKs.
Many TCP's acknowledge only every Kth segment out of a group of Many TCP's acknowledge only every second segment out of a group
segments arriving within a short time interval; this policy is of segments arriving within a short time interval; this policy
known generally as "delayed ACKs". The data-sender TCP must is known generally as "delayed ACKs". The data-sender TCP must
measure the effective RTT, including the additional time due to measure the effective RTT, including the additional time due to
delayed ACKs, or else it will retransmit unnecessarily. Thus, delayed ACKs, or else it will retransmit unnecessarily. Thus,
when delayed ACKs are in use, the receiver SHOULD reply with the when delayed ACKs are in use, the receiver SHOULD reply with the
TSval field from the earliest unacknowledged segment. TSval field from the earliest unacknowledged segment.
(B) A hole in the sequence space (segment(s) have been lost). (B) A hole in the sequence space (segment(s) have been lost).
The sender will continue sending until the window is filled, and The sender will continue sending until the window is filled, and
the receiver may be generating <ACK>s as these out-of-order the receiver may be generating <ACK>s as these out-of-order
segments arrive (e.g., to aid "fast retransmit"). segments arrive (e.g., to aid "fast retransmit").
skipping to change at page 16, line 43 skipping to change at page 16, line 23
retransmission. Furthermore, it is better to overestimate than retransmission. Furthermore, it is better to overestimate than
underestimate the RTT. An <ACK> for an out-of-order segment underestimate the RTT. An <ACK> for an out-of-order segment
SHOULD therefore contain the timestamp from the most recent SHOULD therefore contain the timestamp from the most recent
segment that advanced the window. segment that advanced the window.
The same situation occurs if segments are re-ordered by the The same situation occurs if segments are re-ordered by the
network. network.
(C) A filled hole in the sequence space. (C) A filled hole in the sequence space.
The segment that fills the hole represents the most recent The segment that fills the hole and advances the window
measurement of the network characteristics. A RTT computed from represents the most recent measurement of the network
an earlier segment would probably include the sender's characteristics. A RTT computed from an earlier segment would
retransmit time-out, badly biasing the sender's average RTT probably include the sender's retransmit time-out, badly biasing
estimate. Thus, the timestamp from the latest segment (which the sender's average RTT estimate. Thus, the timestamp from the
filled the hole) MUST be echoed. latest segment (which filled the hole) MUST be echoed.
An algorithm that covers all three cases is described in the An algorithm that covers all three cases is described in the
following rules for Timestamp Option processing on a synchronized following rules for Timestamp Option processing on a synchronized
connection: connection:
(1) The connection state is augmented with two 32-bit slots: (1) The connection state is augmented with two 32-bit slots:
TS.Recent holds a timestamp to be echoed in TSecr whenever a TS.Recent holds a timestamp to be echoed in TSecr whenever a
segment is sent, and Last.ACK.sent holds the ACK field from the segment is sent, and Last.ACK.sent holds the ACK field from the
last segment sent. Last.ACK.sent will equal RCV.NXT except when last segment sent. Last.ACK.sent will equal RCV.NXT except when
skipping to change at page 19, line 15 skipping to change at page 18, line 41
s < t if 0 < (t - s) < 2^31, s < t if 0 < (t - s) < 2^31,
computed in unsigned 32-bit arithmetic. computed in unsigned 32-bit arithmetic.
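The comparison above can be written directly with a 32-bit mask. This is a minimal sketch; the function name `ts_before` is hypothetical.

```python
def ts_before(s, t):
    """Return True if timestamp s is older than t in modulo-2^32 arithmetic.

    Implements: s < t if 0 < (t - s) < 2^31, with the subtraction
    performed as unsigned 32-bit arithmetic.
    """
    return 0 < ((t - s) & 0xFFFFFFFF) < (1 << 31)
```

The mask emulates unsigned 32-bit subtraction, so a timestamp clock that has wrapped from 0xFFFFFFFF to 0 still compares as newer.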
The choice of incoming timestamps to be saved for this comparison The choice of incoming timestamps to be saved for this comparison
MUST guarantee a value that is monotonically increasing. For MUST guarantee a value that is monotonically increasing. For
example, we might save the timestamp from the segment that last example, we might save the timestamp from the segment that last
advanced the left edge of the receive window, i.e., the most recent advanced the left edge of the receive window, i.e., the most recent
in-sequence segment. Instead, we choose the value TS.Recent in-sequence segment. Instead, we choose the value TS.Recent
introduced in Section 3.4 for the RTTM mechanism, since using a introduced in Section 3.5 for the RTTM mechanism, since using a
common value for both PAWS and RTTM simplifies the implementation of common value for both PAWS and RTTM simplifies the implementation of
both. As Section 3.4 explained, TS.Recent differs from the timestamp both. As Section 3.5 explained, TS.Recent differs from the timestamp
from the last in-sequence segment only in the case of delayed <ACK>s, from the last in-sequence segment only in the case of delayed <ACK>s,
and therefore by less than one window. Either choice will therefore and therefore by less than one window. Either choice will therefore
protect against sequence number wrap-around. protect against sequence number wrap-around.
RTTM was specified in a symmetrical manner, so that TSval timestamps RTTM was specified in a symmetrical manner, so that TSval timestamps
are carried in both data and <ACK> segments and are echoed in TSecr are carried in both data and <ACK> segments and are echoed in TSecr
fields carried in returning <ACK> or data segments. PAWS submits all fields carried in returning <ACK> or data segments. PAWS submits all
incoming segments to the same test, and therefore protects against incoming segments to the same test, and therefore protects against
duplicate <ACK> segments as well as data segments. (An alternative duplicate <ACK> segments as well as data segments. (An alternative
non-symmetric algorithm would protect against old duplicate <ACK>s: non-symmetric algorithm would protect against old duplicate <ACK>s:
skipping to change at page 20, line 37 skipping to change at page 20, line 14
Note: it is necessary to send an <ACK> segment in order to Note: it is necessary to send an <ACK> segment in order to
retain TCP's mechanisms for detecting and recovering from retain TCP's mechanisms for detecting and recovering from
half-open connections. For example, see Figure 10 of half-open connections. For example, see Figure 10 of
[RFC0793]. [RFC0793].
R2) If the segment is outside the window, reject it (normal TCP R2) If the segment is outside the window, reject it (normal TCP
processing) processing)
R3) If an arriving segment satisfies: SEG.SEQ <= Last.ACK.sent (see R3) If an arriving segment satisfies: SEG.SEQ <= Last.ACK.sent (see
Section 3.4), then record its timestamp in TS.Recent. Section 3.5), then record its timestamp in TS.Recent.
R4) If an arriving segment is in-sequence (i.e., at the left window R4) If an arriving segment is in-sequence (i.e., at the left window
edge), then accept it normally. edge), then accept it normally.
R5) Otherwise, treat the segment as a normal in-window, out-of- R5) Otherwise, treat the segment as a normal in-window, out-of-
sequence TCP segment (e.g., queue it for later delivery to the sequence TCP segment (e.g., queue it for later delivery to the
user). user).
Steps R2, R4, and R5 are the normal TCP processing steps specified by Steps R2, R4, and R5 are the normal TCP processing steps specified by
[RFC0793]. [RFC0793].
skipping to change at page 27, line 33 skipping to change at page 27, line 12
than 64 KiB. When larger TCP segments are used, the TCP checksum than 64 KiB. When larger TCP segments are used, the TCP checksum
becomes weaker. becomes weaker.
Mechanisms to protect the TCP header from modification should also Mechanisms to protect the TCP header from modification should also
protect the TCP options. protect the TCP options.
Middleboxes and TCP options: Middleboxes and TCP options:
Some middleboxes have been known to remove the TCP options Some middleboxes have been known to remove the TCP options
described in this document from the <SYN> segment. Middleboxes described in this document from the <SYN> segment. Middleboxes
should not remove TCP options described in this document from the that remove TCP options described in this document from the <SYN>
<SYN> segment, and must not remove any of these options in a segment interfere with the selection of parameters appropriate for
<SYN,ACK> segment. Examples of issues that can arise when the session. Removing any of these options in a <SYN,ACK> segment
middleboxes remove these TCP options include: will leave the end hosts in a state that destroys the proper
operation of the protocol.
* If a Window Scale option is removed from a <SYN,ACK> segment, * If a Window Scale option is removed from a <SYN,ACK> segment,
the end hosts will not negotiate the window scaling factor the end hosts will not negotiate the window scaling factor
correctly. Middleboxes must not remove or modify the Window correctly. Middleboxes must not remove or modify the Window
Scale option from <SYN,ACK> segments. Scale option from <SYN,ACK> segments.
* If a stateful firewall uses the window field to detect whether * If a stateful firewall uses the window field to detect whether
a received segment is inside the current window, and does not a received segment is inside the current window, and does not
support the Window Scale option, it will not be able to support the Window Scale option, it will not be able to
correctly determine whether or not a packet is in the window. correctly determine whether or not a packet is in the window.
skipping to change at page 28, line 36 skipping to change at page 28, line 17
RFC 793, September 1981. RFC 793, September 1981.
[RFC1191] Mogul, J. and S. Deering, "Path MTU discovery", RFC 1191, [RFC1191] Mogul, J. and S. Deering, "Path MTU discovery", RFC 1191,
November 1990. November 1990.
[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
Requirement Levels", BCP 14, RFC 2119, March 1997. Requirement Levels", BCP 14, RFC 2119, March 1997.
8.2. Informative References 8.2. Informative References
[Allman99]
Allman, M. and V. Paxson, "On Estimating End-to-End
Network Path Properties", Proc. ACM SIGCOMM Technical
Symposium, Cambridge, MA, September 1999,
<http://aciri.org/mallman/papers/estimation-la.pdf>.
[Ekstroem04]
Ekstroem, H. and R. Ludwig, "The Peak-Hopper: A New End-
to-End Retransmission Timer for Reliable Unicast
Transport", INFOCOM 2004 IEEE, March 2004, <http://
citeseerx.ist.psu.edu/viewdoc/
download?doi=10.1.1.76.2748&rep=rep1&type=pdf>.
[Floyd05] Floyd, S., "[tcpm] How the RTO should be estimated with
timestamps", Message from 26.Jan.2007 to the tcpm mailing
list, August 2005, <http://www.ietf.org/mail-archive/web/
tcpm/current/msg02508.html>.
[Garlick77] [Garlick77]
Garlick, L., Rom, R., and J. Postel, "Issues in Reliable Garlick, L., Rom, R., and J. Postel, "Issues in Reliable
Host-to-Host Protocols", Proc. Second Berkeley Workshop on Host-to-Host Protocols", Proc. Second Berkeley Workshop on
Distributed Data Management and Computer Networks, Distributed Data Management and Computer Networks,
May 1977, <http://www.rfc-editor.org/ien/ien12.txt>. May 1977, <http://www.rfc-editor.org/ien/ien12.txt>.
[Hamming77] [Hamming77]
Hamming, R., "Digital Filters", Prentice Hall, Englewood Hamming, R., "Digital Filters", Prentice Hall, Englewood
Cliffs, N.J. ISBN 0-13-212571-4, 1977. Cliffs, N.J. ISBN 0-13-212571-4, 1977.
skipping to change at page 29, line 31 skipping to change at page 29, line 28
Reliable Transport Protocols", Proc. SIGCOMM '87, Reliable Transport Protocols", Proc. SIGCOMM '87,
August 1987. August 1987.
[Martin03] [Martin03]
Martin, D., "[Tsvwg] RFC 1323.bis", Message to the tsvwg Martin, D., "[Tsvwg] RFC 1323.bis", Message to the tsvwg
mailing list, September 2003, <http://www.ietf.org/ mailing list, September 2003, <http://www.ietf.org/
mail-archive/web/tsvwg/current/msg04435.html>. mail-archive/web/tsvwg/current/msg04435.html>.
[Mathis08] [Mathis08]
Mathis, M., "[tcpm] Example of 1323 window retraction Mathis, M., "[tcpm] Example of 1323 window retraction
problem", Message to the tcpm mailing list, March 2008, problem", Message to the tcpm mailing list, March 2008,
<http://www.ietf.org/mail-archive/web/tcpm/current/ <http://www.ietf.org/mail-archive/web/tcpm/current/
msg03564.html>. msg03564.html>.
[Medina04]
Medina, A., Allman, M., and S. Floyd, "Measuring
Interactions Between Transport Protocols and Middleboxes",
Proc. ACM SIGCOMM/USENIX Internet Measurement Conference.
October 2004, August 2004,
<http://www.icir.net/tbit/tbit-Aug2004.pdf>.
[Medina05]
Medina, A., Allman, M., and S. Floyd, "Measuring the
Evolution of Transport Protocols in the Internet", ACM
Computer Communication Review 35(2), April 2005,
<http://icir.net/floyd/papers/TCPevolution-Mar2005.pdf>.
[RFC0896] Nagle, J., "Congestion control in IP/TCP internetworks", [RFC0896] Nagle, J., "Congestion control in IP/TCP internetworks",
RFC 896, January 1984. RFC 896, January 1984.
[RFC1072] Jacobson, V. and R. Braden, "TCP extensions for long-delay [RFC1072] Jacobson, V. and R. Braden, "TCP extensions for long-delay
paths", RFC 1072, October 1988. paths", RFC 1072, October 1988.
[RFC1110] McKenzie, A., "Problem with the TCP big window option", [RFC1110] McKenzie, A., "Problem with the TCP big window option",
RFC 1110, August 1989. RFC 1110, August 1989.
[RFC1122] Braden, R., "Requirements for Internet Hosts - [RFC1122] Braden, R., "Requirements for Internet Hosts -
skipping to change at page 30, line 24 skipping to change at page 30, line 34
[RFC2675] Borman, D., Deering, S., and R. Hinden, "IPv6 Jumbograms", [RFC2675] Borman, D., Deering, S., and R. Hinden, "IPv6 Jumbograms",
RFC 2675, August 1999. RFC 2675, August 1999.
[RFC2883] Floyd, S., Mahdavi, J., Mathis, M., and M. Podolsky, "An [RFC2883] Floyd, S., Mahdavi, J., Mathis, M., and M. Podolsky, "An
Extension to the Selective Acknowledgement (SACK) Option Extension to the Selective Acknowledgement (SACK) Option
for TCP", RFC 2883, July 2000. for TCP", RFC 2883, July 2000.
[RFC3522] Ludwig, R. and M. Meyer, "The Eifel Detection Algorithm [RFC3522] Ludwig, R. and M. Meyer, "The Eifel Detection Algorithm
for TCP", RFC 3522, April 2003. for TCP", RFC 3522, April 2003.
[RFC4015] Ludwig, R. and A. Gurtov, "The Eifel Response Algorithm
for TCP", RFC 4015, February 2005.
[RFC4821] Mathis, M. and J. Heffner, "Packetization Layer Path MTU [RFC4821] Mathis, M. and J. Heffner, "Packetization Layer Path MTU
Discovery", RFC 4821, March 2007. Discovery", RFC 4821, March 2007.
[RFC4963] Heffner, J., Mathis, M., and B. Chandler, "IPv4 Reassembly [RFC4963] Heffner, J., Mathis, M., and B. Chandler, "IPv4 Reassembly
Errors at High Data Rates", RFC 4963, July 2007. Errors at High Data Rates", RFC 4963, July 2007.
[RFC5681] Allman, M., Paxson, V., and E. Blanton, "TCP Congestion [RFC5681] Allman, M., Paxson, V., and E. Blanton, "TCP Congestion
Control", RFC 5681, September 2009. Control", RFC 5681, September 2009.
[RFC6298] Paxson, V., Allman, M., Chu, J., and M. Sargent,
"Computing TCP's Retransmission Timer", RFC 6298,
June 2011.
[RFC6675] Blanton, E., Allman, M., Wang, L., Jarvinen, I., Kojo, M., [RFC6675] Blanton, E., Allman, M., Wang, L., Jarvinen, I., Kojo, M.,
and Y. Nishida, "A Conservative Loss Recovery Algorithm and Y. Nishida, "A Conservative Loss Recovery Algorithm
Based on Selective Acknowledgment (SACK) for TCP", Based on Selective Acknowledgment (SACK) for TCP",
RFC 6675, August 2012. RFC 6675, August 2012.
[RFC6691] Borman, D., "TCP Options and Maximum Segment Size (MSS)", [RFC6691] Borman, D., "TCP Options and Maximum Segment Size (MSS)",
RFC 6691, July 2012. RFC 6691, July 2012.
[Watson81] [Watson81]
Watson, R., "Timer-based Mechanisms in Reliable Transport Watson, R., "Timer-based Mechanisms in Reliable Transport
skipping to change at page 34, line 39 skipping to change at page 35, line 14
my.TSclock.rate: Period of my.TSclock (1 ms to 1 sec) my.TSclock.rate: Period of my.TSclock (1 ms to 1 sec)
Snd.TSoffset: An offset for randomizing Snd.TSclock Snd.TSoffset: An offset for randomizing Snd.TSclock
Snd.TSclock: my.TSclock + Snd.TSoffset Snd.TSclock: my.TSclock + Snd.TSoffset
Per-Connection State Variables Per-Connection State Variables
TS.Recent: Latest received Timestamp TS.Recent: Latest received Timestamp
Last.ACK.sent: Last ACK field sent Last.ACK.sent: Last ACK field sent
Snd.TS.OK: 1-bit flag Snd.TS.OK: 1-bit flag
Snd.WS.OK: 1-bit flag Snd.WS.OK: 1-bit flag
Rcv.Wind.Scale: Receive window scale power Rcv.Wind.Shift: Receive window scale exponent
Snd.Wind.Scale: Send window scale power Snd.Wind.Shift: Send window scale exponent
Start.Time: Snd.TSclock value when segment being timed was Start.Time: Snd.TSclock value when segment being timed was
sent (used by pre-1323 code). sent (used by pre-1323 code).
Procedure Procedure
Update_SRTT(m) Procedure to update the smoothed RTT and RTT Update_SRTT(m) Procedure to update the smoothed RTT and RTT
variance estimates, using the rules of variance estimates, using the rules of
[Jacobson88a], given m, a new RTT measurement [Jacobson88a], given m, a new RTT measurement
Appendix D. Event Processing Summary Appendix D. Event Processing Summary
OPEN Call OPEN Call
... ...
An initial send sequence number (ISS) is selected. Send a <SYN> An initial send sequence number (ISS) is selected. Send a <SYN>
segment of the form: segment of the form:
<SEQ=ISS><CTL=SYN><TSval=Snd.TSclock><WSopt=Rcv.Wind.Scale> <SEQ=ISS><CTL=SYN><TSval=Snd.TSclock><WSopt=Rcv.Wind.Shift>
... ...
SEND Call SEND Call
CLOSED STATE (i.e., TCB does not exist) CLOSED STATE (i.e., TCB does not exist)
... ...
LISTEN STATE LISTEN STATE
If the foreign socket is specified, then change the connection If the foreign socket is specified, then change the connection
from passive to active, select an ISS. Send a <SYN> segment from passive to active, select an ISS. Send a <SYN> segment
containing the options: <TSval=Snd.TSclock> and containing the options: <TSval=Snd.TSclock> and
<WSopt=Rcv.Wind.Scale>. Set SND.UNA to ISS, SND.NXT to ISS+1. <WSopt=Rcv.Wind.Shift>. Set SND.UNA to ISS, SND.NXT to ISS+1.
Enter SYN-SENT state. ... Enter SYN-SENT state. ...
SYN-SENT STATE SYN-SENT STATE
SYN-RECEIVED STATE SYN-RECEIVED STATE
... ...
ESTABLISHED STATE ESTABLISHED STATE
CLOSE-WAIT STATE CLOSE-WAIT STATE
skipping to change at page 35, line 52 skipping to change at page 36, line 25
If the urgent flag is set ... If the urgent flag is set ...
If the Snd.TS.OK flag is set, then include the TCP Timestamp If the Snd.TS.OK flag is set, then include the TCP Timestamp
Option <TSval=Snd.TSclock,TSecr=TS.Recent> in each data Option <TSval=Snd.TSclock,TSecr=TS.Recent> in each data
segment. segment.
Scale the receive window for transmission in the segment Scale the receive window for transmission in the segment
header: header:
SEG.WND = (RCV.WND >> Rcv.Wind.Scale). SEG.WND = (RCV.WND >> Rcv.Wind.Shift).
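The right shift discards the low-order bits of RCV.WND, which is the granularity effect behind the window retraction case discussed for non-zero scale factors. The sketch below shows the arithmetic only. The function names are hypothetical, the 0 to 14 range check reflects the limit on the shift count set elsewhere in the specification, and clamping to the 16-bit header field is an assumption added for safety.

```python
def scale_for_header(rcv_wnd, wind_shift):
    """Compute SEG.WND = RCV.WND >> Rcv.Wind.Shift for the 16-bit window field."""
    if not 0 <= wind_shift <= 14:
        raise ValueError("window shift must be between 0 and 14")
    # Clamp to the 16-bit field (assumption; not stated in the text above).
    return min(rcv_wnd >> wind_shift, 0xFFFF)


def window_seen_by_peer(seg_wnd, wind_shift):
    """Window the peer reconstructs by left-shifting the header value."""
    return seg_wnd << wind_shift
```

With a receive window of 100000 bytes and a shift of 7, the header carries 781 and the peer reconstructs 99968 bytes, so 32 bytes of the real window are not advertised.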
SEGMENT ARRIVES SEGMENT ARRIVES
... ...
If the state is LISTEN then If the state is LISTEN then
first check for an RST first check for an RST
... ...
skipping to change at page 36, line 28 skipping to change at page 36, line 50
third check for a SYN third check for a SYN
if the SYN bit is set, check the security. If the ... if the SYN bit is set, check the security. If the ...
... ...
if the SEG.PRC is less than the TCB.PRC then continue. if the SEG.PRC is less than the TCB.PRC then continue.
Check for a Window Scale option (WSopt); if one is found, Check for a Window Scale option (WSopt); if one is found,
save SEG.WSopt in Snd.Wind.Scale and set Snd.WS.OK flag on. save SEG.WSopt in Snd.Wind.Shift and set Snd.WS.OK flag on.
Otherwise, set both Snd.Wind.Scale and Rcv.Wind.Scale to Otherwise, set both Snd.Wind.Shift and Rcv.Wind.Shift to
zero and clear Snd.WS.OK flag. zero and clear Snd.WS.OK flag.
Check for a TSopt option; if one is found, save SEG.TSval in Check for a TSopt option; if one is found, save SEG.TSval in
the variable TS.Recent and turn on the Snd.TS.OK bit. the variable TS.Recent and turn on the Snd.TS.OK bit.
Set RCV.NXT to SEG.SEQ+1, IRS is set to SEG.SEQ and any Set RCV.NXT to SEG.SEQ+1, IRS is set to SEG.SEQ and any
other control or text should be queued for processing later. other control or text should be queued for processing later.
ISS should be selected and a <SYN> segment sent of the form: ISS should be selected and a <SYN> segment sent of the form:
<SEQ=ISS><ACK=RCV.NXT><CTL=SYN,ACK> <SEQ=ISS><ACK=RCV.NXT><CTL=SYN,ACK>
If the Snd.WS.OK bit is on, include a WSopt option If the Snd.WS.OK bit is on, include a WSopt option
<WSopt=Rcv.Wind.Scale> in this segment. If the Snd.TS.OK <WSopt=Rcv.Wind.Shift> in this segment. If the Snd.TS.OK
bit is on, include a TSopt <TSval=Snd.TSclock, bit is on, include a TSopt <TSval=Snd.TSclock,
TSecr=TS.Recent> in this segment. Last.ACK.sent is set to TSecr=TS.Recent> in this segment. Last.ACK.sent is set to
RCV.NXT. RCV.NXT.
SND.NXT is set to ISS+1 and SND.UNA to ISS. The connection SND.NXT is set to ISS+1 and SND.UNA to ISS. The connection
state should be changed to SYN-RECEIVED. Note that any state should be changed to SYN-RECEIVED. Note that any
other incoming control or data (combined with SYN) will be other incoming control or data (combined with SYN) will be
processed in the SYN-RECEIVED state, but processing of SYN processed in the SYN-RECEIVED state, but processing of SYN
and ACK should not be repeated. If the listen was not fully and ACK should not be repeated. If the listen was not fully
specified (i.e., the foreign socket was not fully specified (i.e., the foreign socket was not fully
skipping to change at page 37, line 30 skipping to change at page 37, line 52
... ...
If the SYN bit is on and the security/compartment and If the SYN bit is on and the security/compartment and
precedence are acceptable then, RCV.NXT is set to SEG.SEQ+1, precedence are acceptable then, RCV.NXT is set to SEG.SEQ+1,
IRS is set to SEG.SEQ, and any acknowledgements on the IRS is set to SEG.SEQ, and any acknowledgements on the
retransmission queue which are thereby acknowledged should retransmission queue which are thereby acknowledged should
be removed. be removed.
Check for a Window Scale option (WSopt); if it is found, Check for a Window Scale option (WSopt); if it is found,
save SEG.WSopt in Snd.Wind.Scale; otherwise, set both save SEG.WSopt in Snd.Wind.Shift; otherwise, set both
Snd.Wind.Scale and Rcv.Wind.Scale to zero. Snd.Wind.Shift and Rcv.Wind.Shift to zero.
Check for a TSopt option; if one is found, save SEG.TSval in Check for a TSopt option; if one is found, save SEG.TSval in
variable TS.Recent and turn on the Snd.TS.OK bit in the variable TS.Recent and turn on the Snd.TS.OK bit in the
connection control block. If the ACK bit is set, use connection control block. If the ACK bit is set, use
Snd.TSclock - SEG.TSecr as the initial RTT estimate. Snd.TSclock - SEG.TSecr as the initial RTT estimate.
If SND.UNA > ISS (our <SYN> has been ACKed), change the If SND.UNA > ISS (our <SYN> has been ACKed), change the
connection state to ESTABLISHED, form an <ACK> segment: connection state to ESTABLISHED, form an <ACK> segment:
<SEQ=SND.NXT><ACK=RCV.NXT><CTL=ACK> <SEQ=SND.NXT><ACK=RCV.NXT><CTL=ACK>
skipping to change at page 38, line 12 skipping to change at page 38, line 32
segment then continue processing at the sixth step below segment then continue processing at the sixth step below
where the URG bit is checked, otherwise return. where the URG bit is checked, otherwise return.
Otherwise enter SYN-RECEIVED, form a <SYN,ACK> segment: Otherwise enter SYN-RECEIVED, form a <SYN,ACK> segment:
<SEQ=ISS><ACK=RCV.NXT><CTL=SYN,ACK> <SEQ=ISS><ACK=RCV.NXT><CTL=SYN,ACK>
and send it. If the Snd.Echo.OK bit is on, include a TSopt and send it. If the Snd.Echo.OK bit is on, include a TSopt
option <TSval=Snd.TSclock,TSecr=TS.Recent> in this segment. option <TSval=Snd.TSclock,TSecr=TS.Recent> in this segment.
If the Snd.WS.OK bit is on, include a WSopt option If the Snd.WS.OK bit is on, include a WSopt option
<WSopt=Rcv.Wind.Scale> in this segment. Last.ACK.sent is <WSopt=Rcv.Wind.Shift> in this segment. Last.ACK.sent is
set to RCV.NXT. set to RCV.NXT.
If there are other controls or text in the segment, queue If there are other controls or text in the segment, queue
them for processing after the ESTABLISHED state has been them for processing after the ESTABLISHED state has been
reached, return. reached, return.
fifth, if neither of the SYN or RST bits is set then drop the fifth, if neither of the SYN or RST bits is set then drop the
segment and return. segment and return.
Otherwise, Otherwise,
skipping to change at page 38, line 43 skipping to change at page 39, line 15
TIME-WAIT STATE TIME-WAIT STATE
Segments are processed in sequence. Initial tests on Segments are processed in sequence. Initial tests on
arrival are used to discard old duplicates, but further arrival are used to discard old duplicates, but further
processing is done in SEG.SEQ order. If a segment's processing is done in SEG.SEQ order. If a segment's
contents straddle the boundary between old and new, only the contents straddle the boundary between old and new, only the
new parts should be processed. new parts should be processed.
Rescale the received window field: Rescale the received window field:
TrueWindow = SEG.WND << Snd.Wind.Scale, TrueWindow = SEG.WND << Snd.Wind.Shift,
and use "TrueWindow" in place of SEG.WND in the following and use "TrueWindow" in place of SEG.WND in the following
steps. steps.
Check whether the segment contains a Timestamp Option and Check whether the segment contains a Timestamp Option and
bit Snd.TS.OK is on. If so: bit Snd.TS.OK is on. If so:
If SEG.TSval < TS.Recent and the RST bit is off, then If SEG.TSval < TS.Recent and the RST bit is off, then
test whether connection has been idle less than 24 days; test whether connection has been idle less than 24 days;
if all are true, then the segment is not acceptable; if all are true, then the segment is not acceptable;
skipping to change at page 41, line 9 skipping to change at page 41, line 31
One thing to note about this situation is that it is somewhat bounded One thing to note about this situation is that it is somewhat bounded
by RTO + RTT, limiting how far off the RTTM calculation will be. by RTO + RTT, limiting how far off the RTTM calculation will be.
While more complex scenarios can be constructed that produce larger While more complex scenarios can be constructed that produce larger
inflations (e.g., retransmissions are lost), those scenarios involve inflations (e.g., retransmissions are lost), those scenarios involve
multiple segment losses, and the connection will have other more multiple segment losses, and the connection will have other more
serious operational problems than using an inflated RTTM in the RTO serious operational problems than using an inflated RTTM in the RTO
calculation. calculation.
Appendix F. Window Retraction Example Appendix F. Window Retraction Example
Consider a established TCP connection with WSCALE=7 (128 byte Consider an established TCP connection using a scale factor of 128,
receiver window quantization), that is running with a very small Snd.Wind.Shift=7 and Rcv.Wind.Shift=7, that is running with a very
windows because the receiver is bottlenecked and both ends are doing small window because the receiver is bottlenecked and both ends are
small reads and writes. doing small reads and writes.
Consider the ACKs coming back: Consider the ACKs coming back:
SEG.ACK SEG.WIN computed SND.WIN receiver's actual window SEG.ACK SEG.WIN computed SND.WIN receiver's actual window
1000 2 1256 1300 1000 2 1256 1300
The sender writes 40 bytes and receiver ACKs: The sender writes 40 bytes and receiver ACKs:
1040 2 1296 1300 1040 2 1296 1300
The sender writes 5 additional bytes and the receiver has a problem. The sender writes 5 additional bytes and the receiver has a problem.
Two choices: Two choices:
1045 2 1301 1300 - BEYOND BUFFER 1045 2 1301 1300 - BEYOND BUFFER
1045 1 1173 1300 - RETRACTED WINDOW 1045 1 1173 1300 - RETRACTED WINDOW
This problems is completely general and can in principle happen any This is a general problem and can happen any time the sender does a
time the sender does a write which is smaller than the window scale write which is smaller than the window scale factor.
quanta.
In most stacks it is at least partially obscured when the window size In most stacks it is at least partially obscured when the window size
is larger than some small number of segments because the stacks is larger than some small number of segments because the stacks
prefer to announce windows that are integral numbers of segments prefer to announce windows that are an integral number of segments,
(rounded up to the next window quanta). This plus silly window rounded up to the next scale factor. This plus silly window
suppression tends to cause less frequent, larger window updates. If suppression tends to cause less frequent, larger window updates. If
the window was rounded down to a segment size there is more the window was rounded down to a segment size there is more
opportunity to advance it ("beyond buffer" case above) rather than opportunity to advance the window, the BEYOND BUFFER case above,
retracting it. rather than retracting it.
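The arithmetic in the Appendix F table can be reproduced with a short Python sketch. This is a minimal illustration, not code from the draft: the function name and the round_up parameter are hypothetical, and the "computed SND.WIN" column is read as the right window edge seen by the sender (SEG.ACK plus the rescaled window).

```python
SHIFT = 7  # Snd.Wind.Shift = Rcv.Wind.Shift = 7, i.e. a scale factor of 128


def advertise(seg_ack: int, buffer_edge: int, round_up: bool = False,
              shift: int = SHIFT) -> tuple[int, int]:
    """Return (SEG.WIN, right edge as computed by the sender).

    buffer_edge is the receiver's actual right window edge (1300 in the
    example). round_up selects between the two choices in the table.
    """
    actual_window = buffer_edge - seg_ack
    if round_up:
        # Round up to the next multiple of the scale factor.
        seg_win = (actual_window + (1 << shift) - 1) >> shift
    else:
        # Truncate: the low-order bits are discarded.
        seg_win = actual_window >> shift
    return seg_win, seg_ack + (seg_win << shift)


rows = [
    advertise(1000, 1300),                 # (2, 1256)
    advertise(1040, 1300),                 # (2, 1296)
    advertise(1045, 1300, round_up=True),  # (2, 1301) beyond buffer
    advertise(1045, 1300),                 # (1, 1173) retracted window
]
```

After the 5-byte write only 255 bytes of buffer remain, which is just under two units of 128. Truncating moves the advertised right edge back from 1296 to 1173, while rounding up advertises 1301, one byte past the buffer.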
Appendix G. Changes from RFC 1323 Appendix G. Changes from RFC 1323
Several important updates and clarifications to the specification in Several important updates and clarifications to the specification in
RFC 1323 are made in these document. The technical changes are RFC 1323 are made in these document. The technical changes are
summarized below: summarized below:
(a) Section 2.4 was added describing the unavoidable window (a) Section 2.4 was added describing the unavoidable window
retraction issue, and explicitly describing the mitigation steps retraction issue, and explicitly describing the mitigation steps
necessary. necessary.
skipping to change at page 43, line 39 skipping to change at page 44, line 12
introduced in [RFC1323] already. Changed the text in introduced in [RFC1323] already. Changed the text in
Section 1.3 to specifically address TS and WS options. Section 1.3 to specifically address TS and WS options.
(c) Section 1.4 was added for RFC2119 wording. Normative text was (c) Section 1.4 was added for RFC2119 wording. Normative text was
updated with the appropriate phrases. updated with the appropriate phrases.
(d) Added < > brackets to mark specific types of segments, and (d) Added < > brackets to mark specific types of segments, and
replaced most occurances of "packet" with "segment", where TCP replaced most occurances of "packet" with "segment", where TCP
segments are referred. segments are referred.
(e) Removed the list of changes between RFC 1323 and prior versions. (e) Updated the text in section 3 to take into account what has been
learned since [RFC1323].
(f) Removed the list of changes between RFC 1323 and prior versions.
These changes are mentioned in Appendix C of RFC 1323. These changes are mentioned in Appendix C of RFC 1323.
(f) Moved Appendix "Changes" at the end of the appendices for easier (g) Moved Appendix "Changes" at the end of the appendices for easier
lookup. In addition, the entries were split into a technical lookup. In addition, the entries were split into a technical
and an editorial part, and sorted to roughly correspond with the and an editorial part, and sorted to roughly correspond with the
sections in the text where they apply. sections in the text where they apply.
Authors' Addresses Authors' Addresses
David Borman David Borman
Quantum Corporation Quantum Corporation
Mendota Heights MN 55120 Mendota Heights MN 55120
USA USA
 End of changes. 67 change blocks. 
185 lines changed or deleted 218 lines changed or added
