--- 1/draft-ietf-mptcp-rfc6824bis-11.txt 2018-10-03 08:13:14.064359270 -0700 +++ 2/draft-ietf-mptcp-rfc6824bis-12.txt 2018-10-03 08:13:14.232363326 -0700 @@ -1,25 +1,25 @@ Internet Engineering Task Force A. Ford Internet-Draft Pexip Obsoletes: 6824 (if approved) C. Raiciu Intended status: Standards Track U. Politechnica of Bucharest -Expires: November 16, 2018 M. Handley +Expires: April 6, 2019 M. Handley U. College London O. Bonaventure U. catholique de Louvain C. Paasch Apple, Inc. - May 15, 2018 + October 3, 2018 TCP Extensions for Multipath Operation with Multiple Addresses - draft-ietf-mptcp-rfc6824bis-11 + draft-ietf-mptcp-rfc6824bis-12 Abstract TCP/IP communication is currently restricted to a single path per connection, yet multiple paths often exist between peers. The simultaneous use of these multiple paths for a TCP/IP session would improve resource usage within the network and, thus, improve user experience through higher throughput and improved resilience to network failure. @@ -42,21 +42,21 @@ Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet- Drafts is at https://datatracker.ietf.org/drafts/current/. Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress." - This Internet-Draft will expire on November 16, 2018. + This Internet-Draft will expire on April 6, 2019. Copyright Notice Copyright (c) 2018 IETF Trust and the persons identified as the document authors. All rights reserved. This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (https://trustee.ietf.org/license-info) in effect on the date of publication of this document. 
Please review these documents @@ -70,75 +70,75 @@ 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 3 1.1. Design Assumptions . . . . . . . . . . . . . . . . . . . 4 1.2. Multipath TCP in the Networking Stack . . . . . . . . . . 5 1.3. Terminology . . . . . . . . . . . . . . . . . . . . . . . 6 1.4. MPTCP Concept . . . . . . . . . . . . . . . . . . . . . . 7 1.5. Requirements Language . . . . . . . . . . . . . . . . . . 8 2. Operation Overview . . . . . . . . . . . . . . . . . . . . . 8 2.1. Initiating an MPTCP Connection . . . . . . . . . . . . . 9 2.2. Associating a New Subflow with an Existing MPTCP - Connection . . . . . . . . . . . . . . . . . . . . . . . 9 - 2.3. Informing the Other Host about Another Potential Address 10 - 2.4. Data Transfer Using MPTCP . . . . . . . . . . . . . . . . 11 - 2.5. Requesting a Change in a Path's Priority . . . . . . . . 11 - 2.6. Closing an MPTCP Connection . . . . . . . . . . . . . . . 12 - 2.7. Notable Features . . . . . . . . . . . . . . . . . . . . 12 - 3. MPTCP Protocol . . . . . . . . . . . . . . . . . . . . . . . 12 - 3.1. Connection Initiation . . . . . . . . . . . . . . . . . . 14 - 3.2. Starting a New Subflow . . . . . . . . . . . . . . . . . 20 - 3.3. General MPTCP Operation . . . . . . . . . . . . . . . . . 25 - 3.3.1. Data Sequence Mapping . . . . . . . . . . . . . . . . 27 - 3.3.2. Data Acknowledgments . . . . . . . . . . . . . . . . 30 - 3.3.3. Closing a Connection . . . . . . . . . . . . . . . . 31 - 3.3.4. Receiver Considerations . . . . . . . . . . . . . . . 32 - 3.3.5. Sender Considerations . . . . . . . . . . . . . . . . 33 - 3.3.6. Reliability and Retransmissions . . . . . . . . . . . 34 - 3.3.7. Congestion Control Considerations . . . . . . . . . . 35 - 3.3.8. Subflow Policy . . . . . . . . . . . . . . . . . . . 36 - 3.4. Address Knowledge Exchange (Path Management) . . . . . . 37 - 3.4.1. Address Advertisement . . . . . . . . . . . . . . . . 38 - 3.4.2. Remove Address . . . . . . . . . . . . 
. . . . . . . 42 - 3.5. Fast Close . . . . . . . . . . . . . . . . . . . . . . . 43 - 3.6. Subflow Reset . . . . . . . . . . . . . . . . . . . . . . 44 - 3.7. Fallback . . . . . . . . . . . . . . . . . . . . . . . . 46 - 3.8. Error Handling . . . . . . . . . . . . . . . . . . . . . 50 - 3.9. Heuristics . . . . . . . . . . . . . . . . . . . . . . . 50 - 3.9.1. Port Usage . . . . . . . . . . . . . . . . . . . . . 51 - 3.9.2. Delayed Subflow Start and Subflow Symmetry . . . . . 51 - 3.9.3. Failure Handling . . . . . . . . . . . . . . . . . . 52 - 4. Semantic Issues . . . . . . . . . . . . . . . . . . . . . . . 53 - 5. Security Considerations . . . . . . . . . . . . . . . . . . . 54 - 6. Interactions with Middleboxes . . . . . . . . . . . . . . . . 57 - 7. Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . 60 - 8. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 60 - 8.1. MPTCP Option Subtypes . . . . . . . . . . . . . . . . . . 61 - 8.2. MPTCP Handshake Algorithms . . . . . . . . . . . . . . . 62 - 8.3. MP_TCPRST Reason Codes . . . . . . . . . . . . . . . . . 62 - 9. References . . . . . . . . . . . . . . . . . . . . . . . . . 63 - 9.1. Normative References . . . . . . . . . . . . . . . . . . 63 - 9.2. Informative References . . . . . . . . . . . . . . . . . 63 - Appendix A. Notes on Use of TCP Options . . . . . . . . . . . . 67 - Appendix B. TCP Fast Open . . . . . . . . . . . . . . . . . . . 68 - B.1. TFO cookie request with MPTCP . . . . . . . . . . . . . . 69 - B.2. Data sequence mapping under TFO . . . . . . . . . . . . . 69 - B.3. Connection establishment examples . . . . . . . . . . . . 70 - Appendix C. Control Blocks . . . . . . . . . . . . . . . . . . . 72 - C.1. MPTCP Control Block . . . . . . . . . . . . . . . . . . . 72 - C.1.1. Authentication and Metadata . . . . . . . . . . . . . 72 - C.1.2. Sending Side . . . . . . . . . . . . . . . . . . . . 73 - C.1.3. Receiving Side . . . . . . . . . . . . . . . . . . . 73 - C.2. 
TCP Control Blocks . . . . . . . . . . . . . . . . . . . 73 - C.2.1. Sending Side . . . . . . . . . . . . . . . . . . . . 74 - C.2.2. Receiving Side . . . . . . . . . . . . . . . . . . . 74 - Appendix D. Finite State Machine . . . . . . . . . . . . . . . . 74 - Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 75 + Connection . . . . . . . . . . . . . . . . . . . . . . . 10 + 2.3. Informing the Other Host about Another Potential Address 11 + 2.4. Data Transfer Using MPTCP . . . . . . . . . . . . . . . . 12 + 2.5. Requesting a Change in a Path's Priority . . . . . . . . 13 + 2.6. Closing an MPTCP Connection . . . . . . . . . . . . . . . 13 + 2.7. Notable Features . . . . . . . . . . . . . . . . . . . . 14 + 3. MPTCP Protocol . . . . . . . . . . . . . . . . . . . . . . . 15 + 3.1. Connection Initiation . . . . . . . . . . . . . . . . . . 16 + 3.2. Starting a New Subflow . . . . . . . . . . . . . . . . . 23 + 3.3. General MPTCP Operation . . . . . . . . . . . . . . . . . 28 + 3.3.1. Data Sequence Mapping . . . . . . . . . . . . . . . . 30 + 3.3.2. Data Acknowledgments . . . . . . . . . . . . . . . . 33 + 3.3.3. Closing a Connection . . . . . . . . . . . . . . . . 34 + 3.3.4. Receiver Considerations . . . . . . . . . . . . . . . 36 + 3.3.5. Sender Considerations . . . . . . . . . . . . . . . . 37 + 3.3.6. Reliability and Retransmissions . . . . . . . . . . . 38 + 3.3.7. Congestion Control Considerations . . . . . . . . . . 39 + 3.3.8. Subflow Policy . . . . . . . . . . . . . . . . . . . 39 + 3.4. Address Knowledge Exchange (Path Management) . . . . . . 41 + 3.4.1. Address Advertisement . . . . . . . . . . . . . . . . 42 + 3.4.2. Remove Address . . . . . . . . . . . . . . . . . . . 45 + 3.5. Fast Close . . . . . . . . . . . . . . . . . . . . . . . 46 + 3.6. Subflow Reset . . . . . . . . . . . . . . . . . . . . . . 48 + 3.7. Fallback . . . . . . . . . . . . . . . . . . . . . . . . 50 + 3.8. Error Handling . . . . . . . . . . . . . . . . . . . . . 53 + 3.9. 
Heuristics . . . . . . . . . . . . . . . . . . . . . . . 54 + 3.9.1. Port Usage . . . . . . . . . . . . . . . . . . . . . 54 + 3.9.2. Delayed Subflow Start and Subflow Symmetry . . . . . 54 + 3.9.3. Failure Handling . . . . . . . . . . . . . . . . . . 55 + 4. Semantic Issues . . . . . . . . . . . . . . . . . . . . . . . 56 + 5. Security Considerations . . . . . . . . . . . . . . . . . . . 57 + 6. Interactions with Middleboxes . . . . . . . . . . . . . . . . 60 + 7. Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . 63 + 8. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 64 + 8.1. MPTCP Option Subtypes . . . . . . . . . . . . . . . . . . 64 + 8.2. MPTCP Handshake Algorithms . . . . . . . . . . . . . . . 65 + 8.3. MP_TCPRST Reason Codes . . . . . . . . . . . . . . . . . 66 + 9. References . . . . . . . . . . . . . . . . . . . . . . . . . 67 + 9.1. Normative References . . . . . . . . . . . . . . . . . . 67 + 9.2. Informative References . . . . . . . . . . . . . . . . . 67 + Appendix A. Notes on Use of TCP Options . . . . . . . . . . . . 71 + Appendix B. TCP Fast Open and MPTCP . . . . . . . . . . . . . . 72 + B.1. TFO cookie request with MPTCP . . . . . . . . . . . . . . 73 + B.2. Data sequence mapping under TFO . . . . . . . . . . . . . 73 + B.3. Connection establishment examples . . . . . . . . . . . . 74 + Appendix C. Control Blocks . . . . . . . . . . . . . . . . . . . 76 + C.1. MPTCP Control Block . . . . . . . . . . . . . . . . . . . 76 + C.1.1. Authentication and Metadata . . . . . . . . . . . . . 76 + C.1.2. Sending Side . . . . . . . . . . . . . . . . . . . . 77 + C.1.3. Receiving Side . . . . . . . . . . . . . . . . . . . 77 + C.2. TCP Control Blocks . . . . . . . . . . . . . . . . . . . 77 + C.2.1. Sending Side . . . . . . . . . . . . . . . . . . . . 78 + C.2.2. Receiving Side . . . . . . . . . . . . . . . . . . . 78 + Appendix D. Finite State Machine . . . . . . . . . . . . . . . . 78 + Authors' Addresses . . . . . . . . . . . . . 
. . . . . . . . . . 79 1. Introduction Multipath TCP (MPTCP) is a set of extensions to regular TCP [RFC0793] to provide a Multipath TCP [RFC6182] service, which enables a transport connection to operate across multiple paths simultaneously. This document presents the protocol changes required to add multipath capability to TCP; specifically, those for signaling and setting up multiple paths ("subflows"), managing these subflows, reassembly of data, and termination of sessions. This is not the only information @@ -334,22 +334,24 @@ | |--------------------->| | | |<---------------------| | | | | | | | | | Figure 2: Example MPTCP Usage Scenario 1.5. Requirements Language The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", - "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this - document are to be interpreted as described in RFC 2119 [RFC2119]. + "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and + "OPTIONAL" in this document are to be interpreted as described in + BCP 14 RFC 2119 [RFC2119] RFC 8174 [RFC8174] when, and only when, + they appear in all capitals, as shown here. 2. Operation Overview This section presents a single description of common MPTCP operation, with reference to the protocol operation. This is a high-level overview of the key functions; the full specification follows in Section 3. Extensibility and negotiated features are not discussed here. Considerable reference is made to symbolic names of MPTCP options throughout this section -- these are subtypes of the IANA- assigned MPTCP option (see Section 8), and their formats are defined @@ -369,36 +371,73 @@ during the lifetime of the Multipath TCP connection. All MPTCP operations are signaled with a TCP option -- a single numerical type for MPTCP, with "sub-types" for each MPTCP message. What follows is a summary of the purpose and rationale of these messages. 2.1. 
Initiating an MPTCP Connection This is the same signaling as for initiating a normal TCP connection, - but the SYN, SYN/ACK, and initial ACK packets also carry the - MP_CAPABLE option. This option is variable length and serves + but the SYN, SYN/ACK, and initial ACK (and data) packets also carry + the MP_CAPABLE option. This option has a variable length and serves multiple purposes. Firstly, it verifies whether the remote host supports Multipath TCP; secondly, this option allows the hosts to exchange some information to authenticate the establishment of additional subflows. Further details are given in Section 3.1. Host A Host B ------ ------ MP_CAPABLE -> [flags] <- MP_CAPABLE [B's key, flags] ACK + MP_CAPABLE (+ data) -> [A's key, B's key, flags, (data-level details)] + Retransmission of the ACK + MP_CAPABLE can occur if it is not known + if it has been received. The following diagrams show all possible + exchanges for the initial subflow setup to ensure this reliability. + + Host A (with data to send immediately) Host B + ------ ------ + MP_CAPABLE -> + [flags] + <- MP_CAPABLE + [B's key, flags] + ACK + MP_CAPABLE + data -> + [A's key, B's key, flags, data-level details] + + Host A (with data to send later) Host B + ------ ------ + MP_CAPABLE -> + [flags] + <- MP_CAPABLE + [B's key, flags] + ACK + MP_CAPABLE -> + [A's key, B's key, flags] + + ACK + MP_CAPABLE + data -> + [A's key, B's key, flags, data-level details] + + Host A Host B (sending first) + ------ ------ + MP_CAPABLE -> + [flags] + <- MP_CAPABLE + [B's key, flags] + ACK + MP_CAPABLE -> + [A's key, B's key, flags] + + <- ACK + DSS + data + [data-level details] + 2.2. Associating a New Subflow with an Existing MPTCP Connection The exchange of keys in the MP_CAPABLE handshake provides material that can be used to authenticate the endpoints when new subflows will be set up. 
Additional subflows begin in the same way as initiating a normal TCP connection, but the SYN, SYN/ACK, and ACK packets also carry the MP_JOIN option. Host A initiates a new subflow between one of its addresses and one of Host B's addresses. The token -- generated from the key -- is @@ -432,28 +471,36 @@ port pair IP#-A1 and wants to open a second subflow starting at address/port pair IP#-A2, it simply initiates the establishment of the subflow as explained above. The remote host will then be implicitly informed about the new address. In some circumstances, a host may want to advertise to the remote host the availability of an address without establishing a new subflow, for example, when a NAT prevents setup in one direction. In the example below, Host A informs Host B about its alternative IP address/port pair (IP#-A2). Host B may later send an MP_JOIN to this - new address. This option contains a HMAC to authenticate the address - as having been sent from the originator of the connection. Further - details are in Section 3.4.1. + new address. The ADD_ADDR option contains a HMAC to authenticate the + address as having been sent from the originator of the connection. + The receiver of this option echoes it back to the client to indicate + successful reception. Further details are in Section 3.4.1. Host A Host B ------ ------ ADD_ADDR -> - [IP#-A2, + [Echo-flag=0, + IP#-A2, + IP#-A2's Address ID, + HMAC of IP#-A2] + + <- ADD_ADDR + [Echo-flag=1, + IP#-A2, IP#-A2's Address ID, HMAC of IP#-A2] There is a corresponding signal for address removal, making use of the Address ID that is signaled in the add address handshake. Further details in Section 3.4.2. Host A Host B ------ ------ REMOVE_ADDR -> @@ -463,59 +510,62 @@ To ensure reliable, in-order delivery of data over subflows that may appear and disappear at any time, MPTCP uses a 64-bit data sequence number (DSN) to number all data sent over the MPTCP connection. 
Each subflow has its own 32-bit sequence number space, utilising the regular TCP sequence number header, and an MPTCP option maps the subflow sequence space to the data sequence space. In this way, data can be retransmitted on different subflows (mapped to the same DSN) in the event of failure. - The "Data Sequence Signal" carries the "Data Sequence Mapping". The - data sequence mapping consists of the subflow sequence number, data - sequence number, and length for which this mapping is valid. This - option can also carry a connection-level acknowledgment (the "Data - ACK") for the received DSN. + The Data Sequence Signal (DSS) carries the Data Sequence Mapping. + The Data Sequence Mapping consists of the subflow sequence number, + data sequence number, and length for which this mapping is valid. + This option can also carry a connection-level acknowledgment (the + "Data ACK") for the received DSN. With MPTCP, all subflows share the same receive buffer and advertise the same receive window. There are two levels of acknowledgment in MPTCP. Regular TCP acknowledgments are used on each subflow to acknowledge the reception of the segments sent over the subflow independently of their DSN. In addition, there are connection-level acknowledgments for the data sequence space. These acknowledgments track the advancement of the bytestream and slide the receiving window. Further details are in Section 3.3. Host A Host B ------ ------ - DATA_SEQUENCE_SIGNAL -> + DSS -> [Data Sequence Mapping] [Data ACK] [Checksum] 2.5. Requesting a Change in a Path's Priority Hosts can indicate at initial subflow setup whether they wish the subflow to be used as a regular or backup path -- a backup path only being used if there are no regular paths available. During a connection, Host A can request a change in the priority of a subflow through the MP_PRIO signal to Host B. Further details are in Section 3.3.8. Host A Host B ------ ------ MP_PRIO -> 2.6. 
Closing an MPTCP Connection + When a host wants to close an existing subflow, but not the whole + connection, it can initiate a regular TCP FIN/ACK exchange. + When Host A wants to inform Host B that it has no more data to send, it signals this "Data FIN" as part of the Data Sequence Signal (see above). It has the same semantics and behavior as a regular TCP FIN, but at the connection level. Once all the data on the MPTCP connection has been successfully received, then this message is acknowledged at the connection level with a DATA_ACK. Further details are in Section 3.3.3. Host A Host B ------ ------ @@ -514,55 +564,86 @@ above). It has the same semantics and behavior as a regular TCP FIN, but at the connection level. Once all the data on the MPTCP connection has been successfully received, then this message is acknowledged at the connection level with a DATA_ACK. Further details are in Section 3.3.3. Host A Host B ------ ------ DATA_SEQUENCE_SIGNAL -> [Data FIN] - <- (MPTCP DATA_ACK) + There is an additional method of connection closure, referred to as + "Fast Close", which is analogous to closing a single-path TCP + connection with a RST signal. The MP_FASTCLOSE signal is used to + indicate to the peer that the connection will be abruptly closed and + no data will be accepted anymore. This can be used on an ACK + (ensuring reliability of the signal), or a RST (which is not). Both + examples are shown in the following diagrams. Further details are in + Section 3.5. + + Host A Host B + ------ ------ + ACK + MP_FASTCLOSE -> + [B's key] + + [RST on all other subflows] -> + + <- [RST on all subflows] + + Host A Host B + ------ ------ + RST + MP_FASTCLOSE -> + [B's key] [on all subflows] + + <- [RST on all subflows] + 2.7. 
Notable Features It is worth highlighting that MPTCP's signaling has been designed with several key requirements in mind: o To cope with NATs on the path, addresses are referred to by Address IDs, in case the IP packet's source address gets changed by a NAT. Setting up a new TCP flow is not possible if the - passive opener is behind a NAT; to allow subflows to be created - when either end is behind a NAT, MPTCP uses the ADD_ADDR message. + receiver of the SYN is behind a NAT; to allow subflows to be + created when either end is behind a NAT, MPTCP uses the ADD_ADDR + message. o MPTCP falls back to ordinary TCP if MPTCP operation is not possible, for example, if one host is not MPTCP capable or if a - middlebox alters the payload. + middlebox alters the payload. This is discussed in Section 3.7. - o To meet the threats identified in [RFC6181], the following steps - are taken: keys are sent in the clear in the MP_CAPABLE messages; - MP_JOIN messages are secured with HMAC-SHA256 ([RFC2104], [SHS]) - using those keys; and standard TCP validity checks are made on the - other messages (ensuring sequence numbers are in-window - [RFC5961]). Further information can be found in Section 5. + o To address the threats identified in [RFC6181], the following + steps are taken: keys are sent in the clear in the MP_CAPABLE + messages; MP_JOIN messages are secured with HMAC-SHA256 + ([RFC2104], [SHS]) using those keys; and standard TCP validity + checks are made on the other messages (ensuring sequence numbers + are in-window [RFC5961]). Residual threats to MPTCP v0 [RFC6824] + were identified in [RFC7430], and those affecting the protocol + (i.e. modification to ADD_ADDR) have been incorporated in this + document. Further discussion of security can be found in + Section 5. 3. MPTCP Protocol This section describes the operation of the MPTCP protocol, and is subdivided into sections for each key part of the protocol operation. 
All MPTCP operations are signaled using optional TCP header fields. A single TCP option number ("Kind") has been assigned by IANA for MPTCP (see Section 8), and then individual messages will be determined by a "subtype", the values of which are also stored in an - IANA registry (and are also listed in Section 8). + IANA registry (and are also listed in Section 8). As with all TCP + options, the Length field is specified in bytes, and includes the 2 + bytes of Kind and Length. Throughout this document, when reference is made to an MPTCP option by symbolic name, such as "MP_CAPABLE", this refers to a TCP option with the single MPTCP option type, and with the subtype value of the symbolic name as defined in Section 8. This subtype is a 4-bit field -- the first 4 bits of the option payload, as shown in Figure 3. The MPTCP messages are defined in the following sections. 1 2 3 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 @@ -592,46 +673,48 @@ segment [RFC5681] in regular TCP. Therefore, an MPTCP implementation receiving a duplicate ACK that contains an MPTCP option MUST NOT treat it as a signal of congestion. Additionally, an MPTCP implementation SHOULD NOT send more than two duplicate ACKs in a row for the purposes of sending MPTCP options alone, in order to ensure no middleboxes misinterpret this as a sign of congestion. Furthermore, standard TCP validity checks (such as ensuring the sequence number and acknowledgment number are within window) MUST be undertaken before processing any MPTCP signals, as described in - [RFC5961], and initial subfow sequence numbers SHOULD be generated + [RFC5961], and initial subflow sequence numbers SHOULD be generated according to the recommendations in [RFC6528]. 3.1. Connection Initiation Connection initiation begins with a SYN, SYN/ACK, ACK exchange on a single path. Each packet contains the Multipath Capable (MP_CAPABLE) MPTCP option (Figure 4). 
This option declares its sender is capable of performing Multipath TCP and wishes to do so on this particular connection. The MP_CAPABLE exchange in this specification (v1) is different to that specified in v0 [RFC6824]. If a host supports multiple versions of MPTCP, the sender of the MP_CAPABLE option SHOULD signal the - highest version number it supports. The passive opener, on receipt - of this, will signal the version number it wishes to use, which MUST - be equal to or lower than the version number indicated in the initial - MP_CAPABLE. Given the SYN exchange is different between v1 and v0 - the exchange cannot be immediately downgraded, and therefore if the - far end has requested a lower version then the initiator SHOULD - respond with an ACK without any MP_CAPABLE option, to fall back to - regular TCP. If the initiator supports the requsted version, on - future connections to the target host, the initiator MAY cache the - version preference. Alternatively, the initiator MAY close the - connection with a TCP RST and immediately re-establish with the - requested version of MPTCP. + highest version number it supports. In return, in its MP_CAPABLE + option, the receiver will signal the version number it wishes to use, + which MUST be equal to or lower than the version number indicated in + the initial MP_CAPABLE. There is a caveat though with respect to + this version negotiation with old listeners that only support v0. A + listener that supports v0 expects that the MP_CAPABLE option in the + SYN-segment includes the initiator's key. If the initiator however + already upgraded to v1, it won't include the key in the SYN-segment. + Thus, the listener will ignore the MP_CAPABLE of this SYN-segment and + reply with a SYN/ACK that does not include an MP_CAPABLE, thus + leading to a fallback to regular TCP. 
An initiator MAY cache this + information about a peer and for future connections, MAY choose to + attempt using MPTCP v0, if supported, before recording the host as + not supporting MPTCP. The MP_CAPABLE option is variable-length, with different fields included depending on which packet the option is used on. The full MP_CAPABLE option is shown in Figure 4. 1 2 3 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 +---------------+---------------+-------+-------+---------------+ | Kind | Length |Subtype|Version|A|B|C|D|E|F|G|H| +---------------+---------------+-------+-------+---------------+ @@ -643,78 +726,79 @@ | (if option Length > 12) | | | +-------------------------------+-------------------------------+ | Data-Level Length (16 bits) | Checksum (16 bits, optional) | +-------------------------------+-------------------------------+ Figure 4: Multipath Capable (MP_CAPABLE) Option The MP_CAPABLE option is carried on the SYN, SYN/ACK, and ACK packets that start the first subflow of an MPTCP connection, as well as the - first packet that carries data, if the initiator wishs to send first. - The data carried by each option is as follows, where A = initiator - and B = listener. + first packet that carries data, if the initiator wishes to send + first. The data carried by each option is as follows, where A = + initiator and B = listener. o SYN (A->B): only the first four octets (Length = 4). o SYN/ACK (B->A): B's Key for this connection (Length = 12). o ACK (no data) (A->B): A's Key followed by B's Key (Length = 20). o ACK (with first data) (A->B): A's Key followed by B's Key followed by Data-Level Length, and optional Checksum (Length = 22 or 24). The contents of the option is determined by the SYN and ACK flags of the packet, along with the option's length field. For the diagram shown in Figure 4, "sender" and "receiver" refer to the sender or receiver of the TCP packet (which can be either host). 
The initial SYN, containing just the MP_CAPABLE header, is used to define the version of MPTCP being requested, as well as exchanging flags to negotiate connection features, described later. This option is used to declare the 64-bit keys that the end hosts - have generated for this MPTCP connection. This key is used to + have generated for this MPTCP connection. These keys are used to authenticate the addition of future subflows to this connection. This is the only time the key will be sent in clear on the wire (unless "fast close", Section 3.5, is used); all future subflows will identify the connection using a 32-bit "token". This token is a cryptographic hash of this key. The algorithm for this process is dependent on the authentication algorithm selected; the method of selection is defined later in this section. Upon reception of the initial SYN-segment, a stateful server generates a random key and replies with a SYN/ACK. The key's method of generation is implementation specific. The key MUST be hard to - guess, and it MUST be unique for the sending host at any one time. - Recommendations for generating random numbers for use in keys are - given in [RFC4086]. Connections will be indexed at each host by the - token (a one-way hash of the key). Therefore, an implementation will - require a mapping from each token to the corresponding connection, - and in turn to the keys for the connection. + guess, and it MUST be unique for the sending host across all its + current MPTCP connections. Recommendations for generating random + numbers for use in keys are given in [RFC4086]. Connections will be + indexed at each host by the token (a one-way hash of the key). + Therefore, an implementation will require a mapping from each token + to the corresponding connection, and in turn to the keys for the + connection. There is a risk that two different keys will hash to the same token. 
The risk of hash collisions is usually small, unless the host is handling many tens of thousands of connections. Therefore, an implementation SHOULD check its list of connection tokens to ensure - there is not a collision before sending its key, and if there is, - then it should generate a new key. This would, however, be costly - for a server with thousands of connections. The subflow handshake + there is no collision before sending its key, and if there is, then + it should generate a new key. This would, however, be costly for a + server with thousands of connections. The subflow handshake mechanism (Section 3.2) will ensure that new subflows only join the correct connection, however, through the cryptographic handshake, as well as checking the connection tokens in both directions, and ensuring sequence numbers are in-window. So in the worst case if there was a token collision, the new subflow would not succeed, but the MPTCP connection would continue to provide a regular TCP service. Since key generation is implementation-specific, there is no - requirement that they be simply random numbers. An implemention is + requirement that they be simply random numbers. An implementation is free to exchange cryptographic material out-of-band and generate these keys from this, in order to provide additional mechanisms by which to verify the identity of the communicating entities. For example, an implementation could choose to link its MPTCP keys to those used in higher-layer TLS or SSH connections. If the server behaves in a stateless manner, it has to generate its own key in a verifiable fashion. This verifiable way of generating the key can be done by using a hash of the 4-tuple, sequence number and a local secret (similar to what is done for the TCP-sequence @@ -725,40 +809,41 @@ generate an alternative verifiable key, then the connection MUST fall back to using regular TCP by not sending a MP_CAPABLE in the SYN/ACK. The ACK carries both A's key and B's key. 
This is the first time that A's key is seen on the wire, although it is expected that A will have generated a key locally before the initial SYN. The echoing of B's key allows B to operate statelessly, as described above. Therefore, A's key must be delivered reliably to B, and in order to do this, the transmission of this packet must be made reliable. - If B has data to send first, then the reliable delivery of the ACK - can be inferred by the receipt of this data with a MPTCP Data - Sequence Signal (DSS) option (Section 3.3). If, however, A wishes to - send data first, it would not know whether the ACK has successfully - been received, and thus whether the MPTCP is successfully - established. Therefore, on the first data A has to send (if it has - not received any data from B), it MUST also include a MP_CAPABLE - option, with additional data parameters (the Data-Level Length and - optional Checksum as shown in Figure 4). This packet may be the - third ACK if data is ready to be sent by the application, or may be a - later packet if the application only later has data to send. This + If B has data to send first, then the reliable delivery of the + ACK+MP_CAPABLE can be inferred by the receipt of this data with a + MPTCP Data Sequence Signal (DSS) option (Section 3.3). If, however, + A wishes to send data first, it has two options to ensure the + reliable delivery of the ACK+MP_CAPABLE. If it immediately has data + to send, then the third ACK (with data) would also contain an + MP_CAPABLE option with additional data parameters (the Data-Level + Length and optional Checksum as shown in Figure 4). If A does not + immediately have data to send, it MUST include the MP_CAPABLE on the + third ACK, but without the additional data parameters. When A does + have data to send, it must repeat the sending of the MP_CAPABLE + option from the third ACK, with additional data parameters. 
This MP_CAPABLE option is in place of the DSS, and simply specifies the data-level length of the payload, and the checksum (if the use of checksums is negotiated). This is the minimal data required to establish a MPTCP connection - it allows validation of the payload, and given it is the first data, the Initial Data Sequence Number (IDSN) is also known (as it is generated from the key, as described below). Conveying the keys on the first data packet allows the TCP reliability mechanisms to ensure the packet is successfully - delivered. The receiver will acknowledge this data a the connection + delivered. The receiver will acknowledge this data at the connection level with a Data ACK, as if a DSS option has been received. There could be situations where both A and B attempt to transmit initial data at the same time. For example, if A did not initially have data to send, but then needed to transmit data before it had received anything from B, it would use a MP_CAPABLE option with data parameters (since it would not know if the MP_CAPABLE on the ACK was received). In such a situation, B may also have transmitted data with a DSS option, but it had not yet been received at A. Therefore, B has received data with a MP_CAPABLE mapping after it has sent data @@ -779,39 +864,43 @@ the handshake either party thinks the MPTCP negotiation is compromised, for example by a middlebox corrupting the TCP options, or unexpected ACK numbers being present, the host MUST stop using MPTCP and no longer include MPTCP options in future TCP packets. The other host will then also fall back to regular TCP using the fall back mechanism. Note that new subflows MUST NOT be established (using the process documented in Section 3.2) until a Data Sequence Signal (DSS) option has been successfully received across the path (as documented in Section 3.3). 
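The choice of option that Host A attaches during and just after the handshake, as described in the revised text above, can be sketched as follows. This is only an illustrative reading of the rules, not part of the specification: the names Option, option_for_third_ack, and option_for_first_data_after_handshake are hypothetical, and the use of a DSS once data from B has been received is inferred from the statement that B's data confirms delivery of the ACK+MP_CAPABLE.

```python
from enum import Enum


class Option(Enum):
    """Hypothetical labels for the option Host A attaches to a segment."""
    MP_CAPABLE = "MP_CAPABLE without data parameters"
    MP_CAPABLE_DATA = "MP_CAPABLE with Data-Level Length (and Checksum if negotiated)"
    DSS = "DSS"


def option_for_third_ack(has_payload: bool) -> Option:
    # The third ACK always carries MP_CAPABLE; the data parameters are
    # added only when the ACK also carries payload.
    return Option.MP_CAPABLE_DATA if has_payload else Option.MP_CAPABLE


def option_for_first_data_after_handshake(received_data_from_b: bool) -> Option:
    # Assumption: data from B (with a DSS) shows that A's ACK+MP_CAPABLE
    # arrived, so A can map its data with a DSS. Otherwise A repeats the
    # MP_CAPABLE, now with data parameters.
    return Option.DSS if received_data_from_b else Option.MP_CAPABLE_DATA
```

For example, an initiator with nothing to send sends a bare MP_CAPABLE on the third ACK, and later calls option_for_first_data_after_handshake(False) if it must transmit before hearing from B.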
- The first 4 bits of the first octet in the MP_CAPABLE option - (Figure 4) define the MPTCP option subtype (see Section 8; for - MP_CAPABLE, this is 0), and the remaining 4 bits of this octet - specify the MPTCP version in use (for this specification, this is 1). + Like all MPTCP options, the MP_CAPABLE option starts with the Kind + and Length to specify the TCP-option kind and its length. Followed + by that is the MP_CAPABLE option. The first 4 bits of the first + octet in the MP_CAPABLE option (Figure 4) define the MPTCP option + subtype (see Section 8; for MP_CAPABLE, this is 0x0), and the + remaining 4 bits of this octet specify the MPTCP version in use (for + this specification, this is 1). The second octet is reserved for flags, allocated as follows: A: The leftmost bit, labeled "A", SHOULD be set to 1 to indicate "Checksum Required", unless the system administrator has decided that checksums are not required (for example, if the environment is controlled and no middleboxes exist that might adjust the payload). B: The second bit, labeled "B", is an extensibility flag, and MUST be set to 0 for current implementations. This will be used for an extensibility mechanism in a future specification, and the impact of this flag will be defined at a later date. If receiving a message with the 'B' flag set to 1, and this is not understood, - then this SYN MUST be silently ignored; the sender is expected to + then the MP_CAPABLE in this SYN MUST be silently ignored, which + triggers a fallback to regular TCP; the sender is expected to retry with a format compatible with this legacy specification. Note that the length of the MP_CAPABLE option, and the meanings of bits "C" through "H", may be altered by setting B=1. 
C: The third bit, labeled "C", is set to "1" to indicate that the sender of this option will not accept additional MPTCP subflows to the source address and port, and therefore the receiver MUST NOT try to open any additional subflows towards this address and port. This is an efficiency improvement for situations where the sender knows a restriction is in place, for example if the sender is @@ -867,21 +956,21 @@ load. If a responder does not support (or does not want to support) any of the initiator's proposals, it can respond without an MP_CAPABLE option, thus forcing a fallback to regular TCP. The MP_CAPABLE option is only used in the first subflow of a connection, in order to identify the connection; all following subflows will use the "Join" option (see Section 3.2) to join the existing connection. If a SYN contains an MP_CAPABLE option but the SYN/ACK does not, it - is assumed that the passive opener is not multipath capable; thus, + is assumed that sender of the SYN/ACK is not multipath capable; thus, the MPTCP session MUST operate as a regular, single-path TCP. If a SYN does not contain a MP_CAPABLE option, the SYN/ACK MUST NOT contain one in response. If the third packet (the ACK) does not contain the MP_CAPABLE option, then the session MUST fall back to operating as a regular, single-path TCP. This is to maintain compatibility with middleboxes on the path that drop some or all TCP options. Note that an implementation MAY choose to attempt sending MPTCP options more than one time before making this decision to operate as regular TCP (see Section 3.9). @@ -996,26 +1085,26 @@ When receiving a SYN with an MP_JOIN option that contains a valid token for an existing MPTCP connection, the recipient SHOULD respond with a SYN/ACK also containing an MP_JOIN option containing a random number and a truncated (leftmost 64 bits) Hash-based Message Authentication Code (HMAC). This version of the option is shown in Figure 6. 
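As an illustration of the truncated HMAC carried in the MP_JOIN SYN/ACK, the sketch below uses HMAC-SHA256 with the key and message ordering given in Figure 8 of this section, reading "+" there as concatenation (sender's value first). The function names are hypothetical, and the key and nonce values in the usage note are arbitrary test bytes rather than values taken from the draft.

```python
import hashlib
import hmac


def mp_join_hmac(local_key: bytes, remote_key: bytes,
                 local_nonce: bytes, remote_nonce: bytes) -> bytes:
    """Full HMAC-SHA256 digest, with the sender's key and nonce placed first."""
    return hmac.new(local_key + remote_key,
                    local_nonce + remote_nonce,
                    hashlib.sha256).digest()


def synack_hmac(key_b: bytes, key_a: bytes, r_b: bytes, r_a: bytes) -> bytes:
    """HMAC-B as sent by Host B in the SYN/ACK: leftmost 64 bits only."""
    return mp_join_hmac(key_b, key_a, r_b, r_a)[:8]


def verify_synack_hmac(received: bytes, key_a: bytes, key_b: bytes,
                       r_a: bytes, r_b: bytes) -> bool:
    """Host A's check of the truncated HMAC received in the SYN/ACK."""
    expected = synack_hmac(key_b, key_a, r_b, r_a)
    # Constant-time comparison avoids leaking how many octets matched.
    return hmac.compare_digest(received, expected)
```

A quick check with arbitrary inputs: with key_a = bytes(range(8)), key_b = bytes(range(8, 16)), r_a = b"\x01\x02\x03\x04" and r_b = b"\x05\x06\x07\x08", the value returned by synack_hmac(key_b, key_a, r_b, r_a) is 8 octets long and is accepted by verify_synack_hmac. Swapping the roles of A and B produces a different digest, which is why each host computes its own HMAC.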
If the token is unknown, or the host wants to refuse subflow establishment (for example, due to a limit on the number of subflows it will permit), the receiver will send back a reset (RST) signal, analogous to an unknown port in TCP, containing a MP_TCPRST - option (Section 3.6) with an appropriate reason code. Although - calculating an HMAC requires cryptographic operations, it is believed - that the 32-bit token in the MP_JOIN SYN gives sufficient protection - against blind state exhaustion attacks; therefore, there is no need - to provide mechanisms to allow a responder to operate statelessly at - the MP_JOIN stage. + option (Section 3.6) with a "MPTCP specific error" reason code. + Although calculating an HMAC requires cryptographic operations, it is + believed that the 32-bit token in the MP_JOIN SYN gives sufficient + protection against blind state exhaustion attacks; therefore, there + is no need to provide mechanisms to allow a responder to operate + statelessly at the MP_JOIN stage. An HMAC is sent by both hosts -- by the initiator (Host A) in the third packet (the ACK) and by the responder (Host B) in the second packet (the SYN/ACK). Doing the HMAC exchange at this stage allows both hosts to have first exchanged random data (in the first two SYN packets) that is used as the "message". This specification defines that HMAC as defined in [RFC2104] is used, along with the SHA-256 hash algorithm [SHS] (potentially implemented as in [RFC6234]), thus generating a 160-bit / 20-octet HMAC. 
Due to option space limitations, the HMAC included in the SYN/ACK is truncated to the @@ -1097,22 +1186,23 @@ | |<-------------------------------| | | ACK | HMAC-A = HMAC(Key=(Key-A+Key-B), Msg=(R-A+R-B)) HMAC-B = HMAC(Key=(Key-B+Key-A), Msg=(R-B+R-A)) Figure 8: Example Use of MPTCP Authentication If the token received at Host B is unknown or local policy prohibits the acceptance of the new subflow, the recipient MUST respond with a - TCP RST for the subflow, with a MP_TCPRST option (Section 3.6) with - an appropriate reason code. + TCP RST for the subflow. If appropriate, a MP_TCPRST option with a + "Administratively prohibited" reason code (Section 3.6) should be + included. If the token is accepted at Host B, but the HMAC returned to Host A does not match the one expected, Host A MUST close the subflow with a TCP RST. In this, and all following cases of sending a RST in this section, the sender SHOULD send a MP_TCPRST option (Section 3.6) on this RST packet with the reason code for a "MPTCP specific error". If Host B does not receive the expected HMAC, or the MP_JOIN option is missing from the ACK, it MUST close the subflow with a TCP RST with a MP_TCPRST (Section 3.6) option with the reason code for "MPTCP @@ -1122,27 +1212,27 @@ authenticated each other as being the same peers as existed at the start of the connection, and they have agreed of which connection this subflow will become a part. If the SYN/ACK as received at Host A does not have an MP_JOIN option, Host A MUST close the subflow with a TCP RST with a MP_TCPRST (Section 3.6) option with the reason code for "MPTCP specific error". This covers all cases of the loss of an MP_JOIN. In more detail, if MP_JOIN is stripped from the SYN on the path from A to B, and Host B - does not have a passive opener on the relevant port, it will respond - with a RST in the normal way. 
If in response to a SYN with an - MP_JOIN option, a SYN/ACK is received without the MP_JOIN option - (either since it was stripped on the return path, or it was stripped - on the outgoing path but the passive opener on Host B responded as if - it were a new regular TCP session), then the subflow is unusable and - Host A MUST close it with a RST. + does not have a listener on the relevant port, it will respond with a + RST in the normal way. If in response to a SYN with an MP_JOIN + option, a SYN/ACK is received without the MP_JOIN option (either + since it was stripped on the return path, or it was stripped on the + outgoing path but Host B responded as if it were a new regular TCP + session), then the subflow is unusable and Host A MUST close it with + a RST. Note that additional subflows can be created between any pair of ports (but see Section 3.9 for heuristics); no explicit application- level accept calls or bind calls are required to open additional subflows. To associate a new subflow with an existing connection, the token supplied in the subflow's SYN exchange is used for demultiplexing. This then binds the 5-tuple of the TCP subflow to the local token of the connection. A consequence is that it is possible to allow any port pairs to be used for a connection. @@ -1188,36 +1278,37 @@ Figure 9: Data Sequence Signal (DSS) Option The flags, when set, define the contents of this option, as follows: o A = Data ACK present o a = Data ACK is 8 octets (if not set, Data ACK is 4 octets) o M = Data Sequence Number (DSN), Subflow Sequence Number (SSN), - Data-Level Length, and Checksum present + Data-Level Length, and Checksum (if negotiated) present o m = Data sequence number is 8 octets (if not set, DSN is 4 octets) The flags 'a' and 'm' only have meaning if the corresponding 'A' or 'M' flags are set; otherwise, they will be ignored. The maximum length of this option, with all flags set, is 28 octets. The 'F' flag indicates "DATA_FIN". 
If present, this means that this mapping covers the final data from the sender. This is the connection-level equivalent to the FIN flag in single-path TCP. A - connection is not closed unless there has been a DATA_FIN exchange or - a timeout. The purpose of the DATA_FIN and the interactions between - this flag, the subflow-level FIN flag, and the data sequence mapping - are described in Section 3.3.3. The remaining reserved bits MUST be - set to zero by an implementation of this specification. + connection is not closed unless there has been a DATA_FIN exchange, + or an implementation-specific, connection-level timeout. The purpose + of the DATA_FIN and the interactions between this flag, the subflow- + level FIN flag, and the data sequence mapping are described in + Section 3.3.3. The remaining reserved bits MUST be set to zero by an + implementation of this specification. Note that the checksum is only present in this option if the use of MPTCP checksumming has been negotiated at the MP_CAPABLE handshake (see Section 3.1). The presence of the checksum can be inferred from the length of the option. If a checksum is present, but its use had not been negotiated in the MP_CAPABLE handshake, the checksum field MUST be ignored. If a checksum is not present when its use has been negotiated, the receiver MUST close the subflow with a RST as it is considered broken. This RST SHOULD be accompanied with a MP_TCPRST option (Section 3.6) with the reason code for a "MPTCP specific @@ -1251,27 +1342,29 @@ the data sequence number after the mapping has been processed. A sender MUST NOT change this mapping after it has been declared; however, the same data sequence number can be mapped to by different subflows for retransmission purposes (see Section 3.3.6). This would also permit the same data to be sent simultaneously on multiple subflows for resilience or efficiency purposes, especially in the case of lossy links. 
Although the detailed specification of such operation is outside the scope of this document, an implementation SHOULD treat the first data that is received at a subflow for the data sequence space as that which should be delivered to the - application, and any later data for that sequence space ignored. + application, and any later data for that sequence space should be + ignored. The data sequence number is specified as an absolute value, whereas the subflow sequence numbering is relative (the SYN at the start of the subflow has relative subflow sequence number 0). This is to allow middleboxes to change the initial sequence number of a subflow, - such as firewalls that undertake ISN randomization. + such as firewalls that undertake Initial Sequence Number (ISN) + randomization. The data sequence mapping also contains a checksum of the data that this mapping covers, if use of checksums has been negotiated at the MP_CAPABLE exchange. Checksums are used to detect if the payload has been adjusted in any way by a non-MPTCP-aware middlebox. If this checksum fails, it will trigger a failure of the subflow, or a fallback to regular TCP, as documented in Section 3.7, since MPTCP can no longer reliably know the subflow sequence space at the receiver to build data sequence mappings. @@ -1326,21 +1419,21 @@ not arrive within a receive window of data, that subflow SHOULD be treated as broken, closed with a RST, and any unmapped data silently discarded. Data sequence numbers are always 64-bit quantities, and MUST be maintained as such in implementations. If a connection is progressing at a slow rate, so protection against wrapped sequence numbers is not required, then an implementation MAY include just the lower 32 bits of the data sequence number in the data sequence mapping and/or Data ACK as an optimization, and an implementation can - make this choice independently for each packet. An implementaton + make this choice independently for each packet. 
An implementation MUST be able to receive and process both 64-bit or 32-bit sequence number values, but it is not required that an implementation is able to send both. An implementation MUST send the full 64-bit data sequence number if it is transmitting at a sufficiently high rate that the 32-bit value could wrap within the Maximum Segment Lifetime (MSL) [RFC1323]. The lengths of the DSNs used in these values (which may be different) are declared with flags in the DSS option. Implementations MUST accept a 32-bit DSN and implicitly promote it to a 64-bit quantity by @@ -1469,25 +1562,26 @@ necessary to retransmit data on different subflows. Essentially, a host MUST NOT close all functioning subflows unless it is safe to do so, i.e., until all outstanding data has been DATA_ACKed, or until the segment with the DATA_FIN flag set is the only outstanding segment. Once a DATA_FIN has been acknowledged, all remaining subflows MUST be closed with standard FIN exchanges. Both hosts SHOULD send FINs on all subflows, as a courtesy to allow middleboxes to clean up state even if an individual subflow has failed. It is also encouraged to - reduce the timeouts (Maximum Segment Life) on subflows at end hosts. - In particular, any subflows where there is still outstanding data - queued (which has been retransmitted on other subflows in order to - get the DATA_FIN acknowledged) MAY be closed with a RST with - MP_TCPRST (Section 3.6) error code for "too much outstanding data". + reduce the timeouts (Maximum Segment Lifetime) on subflows at end + hosts after receiving a DATA_FIN. In particular, any subflows where + there is still outstanding data queued (which has been retransmitted + on other subflows in order to get the DATA_FIN acknowledged) MAY be + closed with a RST with MP_TCPRST (Section 3.6) error code for "too + much outstanding data". A connection is considered closed once both hosts' DATA_FINs have been acknowledged by DATA_ACKs. 
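The two closing rules above (when it is safe to close the last functioning subflows, and when the connection counts as closed) can be expressed as simple predicates. The following is a minimal sketch under stated assumptions: the Segment and CloseState types and their field names are hypothetical bookkeeping, not structures defined by this document.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    """Hypothetical record of a sent segment that is not yet DATA_ACKed."""
    dsn: int
    data_fin: bool = False


@dataclass
class CloseState:
    """Hypothetical per-connection flags for the DATA_FIN exchange."""
    local_data_fin_acked: bool = False   # peer has DATA_ACKed our DATA_FIN
    peer_data_fin_acked: bool = False    # we have DATA_ACKed the peer's DATA_FIN


def safe_to_close_all_subflows(outstanding: List[Segment]) -> bool:
    # Safe when nothing is outstanding, or when the only outstanding
    # segment is the one carrying the DATA_FIN.
    if not outstanding:
        return True
    return len(outstanding) == 1 and outstanding[0].data_fin


def connection_closed(state: CloseState) -> bool:
    # Closed once both hosts' DATA_FINs have been acknowledged.
    return state.local_data_fin_acked and state.peer_data_fin_acked
```

The timeout-based closure described elsewhere in this section is deliberately left out of the sketch, since its duration is implementation-specific.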
As specified above, a standard TCP FIN on an individual subflow only shuts down the subflow on which it was sent. If all subflows have been closed with a FIN exchange, but no DATA_FIN has been received and acknowledged, the MPTCP connection is treated as closed only after a timeout. This implies that an implementation will have TIME_WAIT states at both the subflow and connection levels (see @@ -1631,23 +1725,24 @@ and will keep trying to retransmit the data on the failed subflow too. The sender will declare the subflow failed after a predefined upper bound on retransmissions is reached (which MAY be lower than the usual TCP limits of the Maximum Segment Life), or on the receipt of an ICMP error, and only then delete the outstanding data segments. Multiple retransmissions are triggers that will indicate that a subflow performs badly and could lead to a host resetting the subflow with a RST. However, additional research is required to understand the heuristics of how and when to reset underperforming subflows. + For example, a highly asymmetric path may be misdiagnosed as underperforming. A RST for this purpose SHOULD be accompanied with - an appropriate MP_TCPRST option (Section 3.6). + an "Unacceptable performance" MP_TCPRST option (Section 3.6). 3.3.7. Congestion Control Considerations Different subflows in an MPTCP connection have different congestion windows. To achieve fairness at bottlenecks and resource pooling, it is necessary to couple the congestion windows in use on each subflow, in order to push most traffic to uncongested links. One algorithm for achieving this is presented in [RFC6356]; the algorithm does not achieve perfect resource pooling but is "safe" in that it is readily deployable in the current Internet. By this, we mean that it does @@ -1704,20 +1799,25 @@ subflow where the receiver has indicated B=1 SHOULD NOT be used to send data unless there are no usable subflows where B=0). 
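The backup rule above lends itself to a short illustration. The sketch below assumes a hypothetical Subflow record with a usable flag and a backup flag (the latter mirroring B=1 as indicated by the receiver); how a scheduler picks among the returned candidates is outside its scope.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Subflow:
    """Hypothetical view of a subflow as seen by the data sender."""
    name: str
    usable: bool
    backup: bool  # True if the receiver indicated B=1 for this subflow


def candidate_subflows(subflows: List[Subflow]) -> List[Subflow]:
    usable = [s for s in subflows if s.usable]
    regular = [s for s in usable if not s.backup]
    # Backup subflows are considered only when no usable B=0 subflow exists.
    return regular if regular else usable
```

For instance, with a usable Wi-Fi subflow (B=0) and a usable cellular subflow (B=1), only the Wi-Fi subflow is returned; if the Wi-Fi subflow becomes unusable, the cellular subflow becomes the sole candidate.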
In the event that the available set of paths changes, a host may wish to signal a change in priority of subflows to the peer (e.g., a subflow that was previously set as backup should now take priority over all remaining subflows). Therefore, the MP_PRIO option, shown in Figure 11, can be used to change the 'B' flag of the subflow on which it is sent. + Another use of the MP_PRIO option is to set the 'B' flag on a subflow + to cleanly retire its use before closing it and removing it with + REMOVE_ADDR Section 3.4.2, for example to support make-before-break + session continuity. + 1 2 3 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 +---------------+---------------+-------+-----+-+ | Kind | Length |Subtype| |B| +---------------+---------------+-------+-----+-+ Figure 11: Change Subflow Priority (MP_PRIO) Option It should be noted that the backup flag is a request from a data receiver to a data sender only, and the data sender SHOULD adhere to @@ -1788,38 +1888,38 @@ The Add Address (ADD_ADDR) MPTCP option announces additional addresses (and optionally, ports) on which a host can be reached (Figure 12). This option can be used at any time during a connection, depending on when the sender wishes to enable multiple paths and/or when paths become available. As with all MPTCP signals, the receiver MUST undertake standard TCP validity checks, e.g. [RFC5961], before acting upon it. Every address has an Address ID that can be used for uniquely - identifying the address within a connection for address removal. - This is also used to identify MP_JOIN options (see Section 3.2) + identifying the address within a connection for address removal. The + Address ID is also used to identify MP_JOIN options (see Section 3.2) relating to the same address, even when address translators are in - use. 
The Address ID MUST uniquely identify the address to the sender - (within the scope of the connection), but the mechanism for - allocating such IDs is implementation specific. + use. The Address ID MUST uniquely identify the address for the + sender of the option (within the scope of the connection), but the + mechanism for allocating such IDs is implementation specific. All address IDs learned via either MP_JOIN or ADD_ADDR SHOULD be stored by the receiver in a data structure that gathers all the Address ID to address mappings for a connection (identified by a token pair). In this way, there is a stored mapping between Address ID, observed source address, and token pair for future processing of control information for a connection. Note that an implementation MAY discard incoming address advertisements at will, for example, for - avoiding the required mapping state, or because advertised addresses - are of no use to it (for example, IPv6 addresses when it has IPv4 - only). Therefore, a host MUST treat address advertisements as soft - state, and it MAY choose to refresh advertisements periodically. + avoiding updating mapping state, or because advertised addresses are + of no use to it (for example, IPv6 addresses when it has IPv4 only). + Therefore, a host MUST treat address advertisements as soft state, + and it MAY choose to refresh advertisements periodically. This option is shown in Figure 12. The illustration is sized for IPv4 addresses. For IPv6, the length of the address will be 16 octets (instead of 4). The 2 octets that specify the TCP port number to use are optional and their presence can be inferred from the length of the option. Although it is expected that the majority of use cases will use the same port pairs as used for the initial subflow (e.g., port 80 remains port 80 on all subflows, as does the ephemeral port at the @@ -1838,32 +1938,31 @@ implemented as in [RFC6234]. 
In the same way as for MP_JOIN, the key for the HMAC algorithm, in the case of the message transmitted by Host A, will be Key-A followed by Key-B, and in the case of Host B, Key-B followed by Key-A. These are the keys that were exchanged in the original MP_CAPABLE handshake. The message for the HMAC is the Address ID, IP Address, and Port which precede the HMAC in the ADD_ADDR option. If the port is not present in the ADD_ADDR option, the HMAC message will nevertheless include two octets of value zero. The rationale for the HMAC is to prevent unauthorized entities from injecting ADD_ADDR signals in an attempt to hijack a connection. - Note that additionally the presence of this HMAC prevents the address being changed in flight unless the key is known by an intermediary. If a host receives an ADD_ADDR option for which it cannot validate the HMAC, it SHOULD silently ignore the option. A set of four flags are present after the subtype and before the - Address ID. Only the rightmost bit - labelled 'E' - is assinged + Address ID. Only the rightmost bit - labelled 'E' - is assigned today. The other bits are currently unassigned and MUST be set to zero by a sender and MUST be ignored by the receiver. - The 'E' bit exists to provide reliability for this option. Because + The 'E' flag exists to provide reliability for this option. Because this option will often be sent on pure ACKs, there is no guarantee of reliability. Therefore, a receiver receiving a fresh ADD_ADDR option (where E=0), will send the same option back to the sender, but not including the HMAC, and with E=1. The lack of this echo can be used by the initial ADD_ADDR sender to retransmit the ADD_ADDR according to local policy. 1 2 3 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 +---------------+---------------+-------+-------+---------------+ @@ -1882,90 +1981,99 @@ Due to the proliferation of NATs, it is reasonably likely that one host may attempt to advertise private addresses [RFC1918]. 
It is not desirable to prohibit this, since there may be cases where both hosts have additional interfaces on the same private network, and a host MAY want to advertise such addresses. The MP_JOIN handshake to create a new subflow (Section 3.2) provides mechanisms to minimize security risks. The MP_JOIN message contains a 32-bit token that uniquely identifies the connection to the receiving host. If the token is unknown, the host will return with a RST. In the unlikely - event that the token is known, subflow setup will continue, but the - HMAC exchange must occur for authentication. This will fail, and - will provide sufficient protection against two unconnected hosts - accidentally setting up a new subflow upon the signal of a private - address. Further security considerations around the issue of - ADD_ADDR messages that accidentally misdirect, or maliciously direct, - new MP_JOIN attempts are discussed in Section 5. + event that the token is valid at the receiving host, subflow setup + will continue, but the HMAC exchange must occur for authentication. + This will fail, and will provide sufficient protection against two + unconnected hosts accidentally setting up a new subflow upon the + signal of a private address. Further security considerations around + the issue of ADD_ADDR messages that accidentally misdirect, or + maliciously direct, new MP_JOIN attempts are discussed in Section 5. Ideally, ADD_ADDR and REMOVE_ADDR options would be sent reliably, and in order, to the other end. This would ensure that this address management does not unnecessarily cause an outage in the connection when remove/add addresses are processed in reverse order, and also to ensure that all possible paths are used. Note, however, that losing reliability and ordering will not break the multipath connections, it will just reduce the opportunity to open multipath paths and to survive different patterns of path failures. 
Therefore, implementing reliability signals for these MPTCP options is not necessary. In order to minimize the impact of the loss of these options, however, it is RECOMMENDED that a sender should send these options on all available subflows. If these options need to be received in order, an implementation SHOULD only send one ADD_ADDR/ REMOVE_ADDR option per RTT, to minimize the risk of misordering. - A host can send an ADD_ADDR message with an already assigned Address - ID, but the Address MUST be the same as previously assigned to this - Address ID, and the Port MUST be different from one already in use - for this Address ID. If these conditions are not met, the receiver - SHOULD silently ignore the ADD_ADDR. A host wishing to replace an - existing Address ID MUST first remove the existing one - (Section 3.4.2). - A host that receives an ADD_ADDR but finds a connection set up to that IP address and port number is unsuccessful SHOULD NOT perform further connection attempts to this address/port combination for this connection. A sender that wants to trigger a new incoming connection attempt on a previously advertised address/port combination can therefore refresh ADD_ADDR information by sending the option again. + A host can therefore send an ADD_ADDR message with an already + assigned Address ID, but the Address MUST be the same as previously + assigned to this Address ID. A new ADD_ADDR may have the same, or + different, port number. If the port number is different, the + receiving host SHOULD try to set up a new subflow to this new + address/port combination. + + A host wishing to replace an existing Address ID MUST first remove + the existing one (Section 3.4.2). + During normal MPTCP operation, it is unlikely that there will be sufficient TCP option space for ADD_ADDR to be included along with those for data sequence numbering (Section 3.3.1). Therefore, it is expected that an MPTCP implementation will send the ADD_ADDR option on separate ACKs. 
As discussed earlier, however, an MPTCP implementation MUST NOT treat duplicate ACKs with any MPTCP option, with the exception of the DSS option, as indications of congestion [RFC5681], and an MPTCP implementation SHOULD NOT send more than two duplicate ACKs in a row for signaling purposes. 3.4.2. Remove Address If, during the lifetime of an MPTCP connection, a previously announced address becomes invalid (e.g., if the interface disappears), the affected host SHOULD announce this so that the peer - can remove subflows related to this address. + can remove subflows related to this address. A host MAY also choose + to announce that a valid IP address should not be used any longer, + for example for make-before-break session continuity. This is achieved through the Remove Address (REMOVE_ADDR) option (Figure 13), which will remove a previously added address (or list of addresses) from a connection and terminate any subflows currently using that address. For security purposes, if a host receives a REMOVE_ADDR option, it must ensure the affected path(s) are no longer in use before it instigates closure. The receipt of REMOVE_ADDR SHOULD first trigger the sending of a TCP keepalive [RFC1122] on the path, and if a - response is received the path SHOULD NOT be removed. Typical TCP - validity tests on the subflow (e.g., ensuring sequence and ACK - numbers are correct) MUST also be undertaken. An implementation can - use indications of these test failures as part of intrusion detection - or error logging. + response is received the path SHOULD NOT be removed. If the path is + found to still be alive, the receiving host SHOULD no longer use the + specified address for future connections, but it is the + responsibility of the host which sent the REMOVE_ADDR to shut down + the subflow. The requesting host MAY also use MP_PRIO + (Section 3.3.8) to request a path is no longer used, before removal. 
+ Typical TCP validity tests on the subflow (e.g., ensuring sequence + and ACK numbers are correct) MUST also be undertaken. An + implementation can use indications of these test failures as part of + intrusion detection or error logging. The sending and receipt (if no keepalive response was received) of this message SHOULD trigger the sending of RSTs by both hosts on the affected subflow(s) (if possible), as a courtesy to cleaning up middlebox state, before cleaning up any local state. Address removal is undertaken by ID, so as to permit the use of NATs and other middleboxes that rewrite source addresses. If there is no address at the requested ID, the receiver will silently ignore the request. @@ -2022,64 +2130,63 @@ option on one subflow, containing the key of Host B as declared in the initial connection handshake. On all the other subflows, Host A sends a regular TCP RST to close these subflows, and tears them down. Host A now enters FASTCLOSE_WAIT state. o Option R (RST) : Host A sends a RST containing the MP_FASTCLOSE option on all subflows, containing the key of Host B as declared in the initial connection handshake. Host A can tear the subflows and the connection down immediately. - If a host receives a packet with a valid MP_FASTCLOSE option, it - shall process it as follows : + If host A decides to force the closure by using Option A and sending + an ACK with the MP_FASTCLOSE option, the connection shall proceed as + follows: - o Upon receipt of an ACK with MP_FASTCLOSE, containing the valid - key, Host B answers on the same subflow with a TCP RST and tears - down all subflows. Host B can now close the whole MPTCP - connection (it transitions directly to CLOSED state). + o Upon receipt of an ACK with MP_FASTCLOSE by Host B, containing the + valid key, Host B answers on the same subflow with a TCP RST and + tears down all subflows also through sending TCP RST signals. 
+ Host B can now close the whole MPTCP connection (it transitions + directly to CLOSED state). o As soon as Host A has received the TCP RST on the remaining subflow, it can close this subflow and tear down the whole connection (transition from FASTCLOSE_WAIT to CLOSED states). If Host A receives an MP_FASTCLOSE instead of a TCP RST, both hosts attempted fast closure simultaneously. Host A should reply with a TCP RST and tear down the connection. o If Host A does not receive a TCP RST in reply to its MP_FASTCLOSE after one retransmission timeout (RTO) (the RTO of the subflow where the MP_FASTCLOSE has been sent), it SHOULD retransmit the MP_FASTCLOSE. The number of retransmissions SHOULD be limited to avoid this connection from being retained for a long time, but this limit is implementation specific. A RECOMMENDED number is 3. If no TCP RST is received in response, Host A SHOULD send a TCP RST with the MP_FASTCLOSE option itself when it releases state in order to clear any remaining state at middleboxes. - o Upon receipt of a RST with MP_FASTCLOSE, containing the valid key, - Host B tears down all subflows. Host B can now close the whole - MPTCP connection (it transitions directly to CLOSED state). + If however host A decides to force the closure by using Option R and + sending a RST with the MP_FASTCLOSE option, Host B will act as + follows: Upon receipt of a RST with MP_FASTCLOSE, containing the + valid key, Host B tears down all subflows by sending a TCP RST. Host + B can now close the whole MPTCP connection (it transitions directly + to CLOSED state). 3.6. Subflow Reset - As discussed in Section 3.5 above, the MP_FASTCLOSE option provides a - connection-level reset roughly analagous to a TCP RST. Regular TCP - RST options remain used to at the subflow-level to indicate the - receiving host has no knowledge of the MPTCP subflow or TCP - connection to which the packet belongs. 
- - However, in MPTCP, there may be many reasons for rejecting the - opening of a subflow, but these semantics cannot be carried in a - standard TCP RST. It would be beneficial for a host to the reasons - why its subflow has been closed with a RST, and thus whether it - should try to re-establish the subflow immediately, later, or never - again. These semantics are carried in the MP_TCPRST option that can - be included on a TCP RST packet. + An implementation of MPTCP may also need to send a regular TCP RST to + force the closure of a subflow. A host sends a TCP RST in order to + close a subflow or reject an attempt to open a subflow (MP_JOIN). In + order to inform the receiving host why a subflow is being closed or + rejected, the TCP RST packet MAY include the MP_TCPRST Option. The + host MAY use this information to decide, for example, whether it + tries to re-establish the subflow immediately, later, or never. 1 2 3 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 +---------------+---------------+-------+-----------------------+ | Kind | Length |Subtype|U|V|W|T| Reason | +---------------+---------------+-------+-----------------------+ Figure 15: TCP RST Reason (MP_TCPRST) Option The MP_TCPRST option contains a reason code that allows the sender of @@ -2094,39 +2201,39 @@ condition that is reported is Transient (T bit set to 1) or Permanent (T bit set to 0). If the error condition is considered to be Transient by the sender of the RST segment, the recipient of this segment MAY try to reestablish a subflow for this connection over the failed path. The time at which a receiver may try to re-establish this is implementation-specific, but SHOULD take into account the properties of the failure defined by the following reason code. If the error condition is considered to be permanent, the receiver of the RST segment SHOULD NOT try to reestablish a subflow for this connection over this path. 
The "U", "V" and "W" flags are not - defined by this specification and are reserved for future use. + defined by this specification and are reserved for future use. An + implementation of this specification MUST set these flags to 0, and a + receiver MUST ignore them. The "Reason" code is an 8-bit field that indicates the reason for the termination of the subflow. The following codes are defined in this document: o Unspecified error (code 0x0). This is the default error implying - the subflow is not longer available. The receiving host SHOULD - take account of the 'T' bit in deciding whether to re-estbalish - this subflow. The presence of this option shows that the RST was - generated by a MPTCP-aware device. + the subflow is no longer available. The presence of this option + shows that the RST was generated by a MPTCP-aware device. o MPTCP specific error (code 0x01). An error has been detected in the processing of MPTCP options. This is the usual reason code to return in the cases where a RST is being sent to close a subflow for reasons of an invalid response. o Lack of resources (code 0x02). This code indicates that the - sending host does not have enough ressources to support the + sending host does not have enough resources to support the terminated subflow. o Administratively prohibited (code 0x03). This code indicates that the requested subflow is prohibited by the policies of the sending host. o Too much outstanding data (code 0x04). This code indicates that there is an excessive amount of data that need to be transmitted over the terminated subflow while having already been acknowledged over one or more other subflows. This may occur if a path has @@ -2141,21 +2248,21 @@ been detected over this subflow making MPTCP signaling invalid. For example, this may be sent if the checksum does not validate. 3.7. Fallback Sometimes, middleboxes will exist on a path that could prevent the operation of MPTCP. 
MPTCP has been designed in order to cope with many middlebox modifications (see Section 6), but there are still some cases where a subflow could fail to operate within the MPTCP requirements. These cases are notably the following: the loss of - MPTCP options on a path and the modification of payload data. If + MPTCP options on a path, and the modification of payload data. If such an event occurs, it is necessary to "fall back" to the previous, safe operation. This may be either falling back to regular TCP or removing a problematic subflow. At the start of an MPTCP connection (i.e., the first subflow), it is important to ensure that the path is fully MPTCP capable and the necessary MPTCP options can reach each host. The handshake as described in Section 3.1 SHOULD fall back to regular TCP if either of the SYN messages do not have the MPTCP options: this is the same, and desired, behavior in the case where a host is not MPTCP capable, or @@ -2174,159 +2281,150 @@ every segment until one of the sent segments has been acknowledged with a DSS option containing a Data ACK. Upon reception of the acknowledgment, the sender has the confirmation that the DSS option passes in both directions and may choose to send fewer DSS options than once per segment. If, however, an ACK is received for data (not just for the SYN) without a DSS option containing a Data ACK, the sender determines the path is not MPTCP capable. In the case of this occurring on an additional subflow (i.e., one started with MP_JOIN), the host MUST - close the subflow with a RST. In the case of the first subflow - (i.e., that started with MP_CAPABLE), it MUST drop out of an MPTCP - mode back to regular TCP. The sender will send one final data - sequence mapping, with the Data-Level Length value of 0 indicating an - infinite mapping (in case the path drops options in one direction - only), and then revert to sending data on the single subflow without - any MPTCP options. 
+ close the subflow with a RST, which SHOULD contain a MP_TCPRST option + (Section 3.6) with a "Middlebox interference" reason code. - Note that this rule essentially prohibits the sending of data on the - third packet of an MP_CAPABLE or MP_JOIN handshake, since both that - option and a DSS cannot fit in TCP option space. If the initiator is - to send first, another segment must be sent that contains the data - and DSS. Note also that an additional subflow cannot be used until - the initial path has been verified as MPTCP capable. + In the case of such an ACK being received on the first subflow (i.e., + that started with MP_CAPABLE), before any additional subflows are + added, the implementation MUST drop out of an MPTCP mode, back to + regular TCP. The sender will send one final data sequence mapping, + with the Data-Level Length value of 0 indicating an infinite mapping + (to inform the other end in case the path drops options in one + direction only), and then revert to sending data on the single + subflow without any MPTCP options. If a subflow breaks during operation, e.g. if it is re-routed and MPTCP options are no longer permitted, then once this is detected (by the subflow-level receive buffer filling up), the subflow SHOULD be treated as broken and closed with a RST, since no data can be delivered to the application layer, and no fallback signal can be reliably sent. This RST SHOULD include the MP_TCPRST option - (Section 3.6) with an appropriate reason code. + (Section 3.6) with a "Middlebox interference" reason code. These rules should cover all cases where such a failure could happen: whether it's on the forward or reverse path and whether the server or the client first sends data. If lost options on data packets occur on any other subflow apart from the initial subflow, it should be treated as a standard path failure.
The data would not be DATA_ACKed (since there is no mapping for the data), and the subflow can be - closed with a RST, containing a MP_TCPRST option (Section 3.6) with - an appropriate reason code. - - The case described above is a specialized case of fallback, for when - the lack of MPTCP support is detected before any data is acknowledged - at the connection level on a subflow. More generally, fallback - (either closing a subflow, or to regular TCP) can become necessary at - any point during a connection if a non-MPTCP-aware middlebox changes - the data stream. + closed with a RST, containing a MP_TCPRST option (Section 3.6) with a + "Middlebox interference" reason code. - As described in Section 3.3, each portion of data for which there is - a mapping is protected by a checksum, if checksums have been - negotiated. This mechanism is used to detect if middleboxes have - made any adjustments to the payload (added, removed, or changed - data). A checksum will fail if the data has been changed in any way. - This will also detect if the length of data on the subflow is - increased or decreased, and this means the data sequence mapping is - no longer valid. The sender no longer knows what subflow-level - sequence number the receiver is genuinely operating at (the middlebox - will be faking ACKs in return), and it cannot signal any further - mappings. Furthermore, in addition to the possibility of payload - modifications that are valid at the application layer, there is the - possibility that false positives could be hit across MPTCP segment - boundaries, corrupting the data. Therefore, all data from the start - of the segment that failed the checksum onwards is not trustworthy. + So far this section has discussed the loss of MPTCP options, either + initially, or during the course of the connection. As described in + Section 3.3, each portion of data for which there is a mapping is + protected by a checksum, if checksums have been negotiated.
This + mechanism is used to detect if middleboxes have made any adjustments + to the payload (added, removed, or changed data). A checksum will + fail if the data has been changed in any way. This will also detect + if the length of data on the subflow is increased or decreased, and + this means the data sequence mapping is no longer valid. The sender + no longer knows what subflow-level sequence number the receiver is + genuinely operating at (the middlebox will be faking ACKs in return), + and it cannot signal any further mappings. Furthermore, in addition + to the possibility of payload modifications that are valid at the + application layer, there is the possibility that such modifications + could be triggered across MPTCP segment boundaries, corrupting the + data. Therefore, all data from the start of the segment that failed + the checksum onwards is not trustworthy. Note that if checksum usage has not been negotiated, this fallback mechanism cannot be used unless there is some higher or lower layer signal to inform the MPTCP implementation that the payload has been tampered with. When multiple subflows are in use, the data in flight on a subflow will likely involve data that is not contiguously part of the connection-level stream, since segments will be spread across the multiple subflows. Due to the problems identified above, it is not - possible to determine what the adjustment has done to the data - (notably, any changes to the subflow sequence numbering). Therefore, - it is not possible to recover the subflow, and the affected subflow - must be immediately closed with a RST, featuring an MP_FAIL option + possible to determine what the adjustment has done to the data (notably, + any changes to the subflow sequence numbering).
Therefore, it is not + possible to recover the subflow, and the affected subflow must be + immediately closed with a RST, featuring an MP_FAIL option (Figure 16), which defines the data sequence number at the start of the segment (defined by the data sequence mapping) that had the checksum failure. Note that the MP_FAIL option requires the use of the full 64-bit sequence number, even if 32-bit sequence numbers are normally in use in the DSS signals on the path. 1 2 3 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 +---------------+---------------+-------+----------------------+ | Kind | Length=12 |Subtype| (reserved) | +---------------+---------------+-------+----------------------+ | | | Data Sequence Number (8 octets) | | | +--------------------------------------------------------------+ Figure 16: Fallback (MP_FAIL) Option - The receiver MUST discard all data following the data sequence number - specified. Failed data MUST NOT be DATA_ACKed and so will be - retransmitted on other subflows (Section 3.3.6). + The receiver of this option MUST discard all data following the data + sequence number specified. Failed data MUST NOT be DATA_ACKed and so + will be retransmitted on other subflows (Section 3.3.6). A special case is when there is a single subflow and it fails with a checksum error. If it is known that all unacknowledged data in flight is contiguous (which will usually be the case with a single subflow), an infinite mapping can be applied to the subflow without the need to close it first, and essentially turn off all further MPTCP signaling. In this case, if a receiver identifies a checksum failure when there is only one path, it will send back an MP_FAIL option on the subflow-level ACK, referring to the data-level sequence number of the start of the segment on which the checksum error was detected. The sender will receive this, and if all unacknowledged data in flight is contiguous, will signal an infinite mapping. 
This infinite mapping will be a DSS option (Section 3.3) on the first new packet, containing a data sequence mapping that acts retroactively, referring to the start of the subflow sequence number of the most recent segment that was known to be delivered intact (i.e. was successfully DATA_ACKed). From that point onwards, data can be altered by a middlebox without affecting MPTCP, as the data stream is - equivalent to a regular, legacy TCP session. The MP_FAIL signal - affects only one direction of traffic. It is not mandatory for the - reciever of an MP_FAIL to also respond with an MP_FAIL, since the - paths may only be damaged in one direction. However, implementations - MAY choose to send a MP_FAIL in the reverse direction and entirely - revert to a regular TCP session. + equivalent to a regular, legacy TCP session. Whilst in theory paths + may only be damaged in one direction, and the MP_FAIL signal affects + only one direction of traffic, for implementation simplicity, the + receiver of an MP_FAIL MUST also respond with an MP_FAIL in the + reverse direction and entirely revert to a regular TCP session. In the rare case that the data is not contiguous (which could happen when there is only one subflow but it is retransmitting data from a subflow that has recently been uncleanly closed), the receiver MUST close the subflow with a RST with MP_FAIL. The receiver MUST discard all data that follows the data sequence number specified. The sender MAY attempt to create a new subflow belonging to the same connection, and, if it chooses to do so, SHOULD place the single subflow immediately in single-path mode by setting an infinite data sequence mapping. This mapping will begin from the data-level sequence number that was declared in the MP_FAIL. After a sender signals an infinite mapping, it MUST only use subflow ACKs to clear its send buffer. This is because Data ACKs may become misaligned with the subflow ACKs when middleboxes insert or delete data. 
The receiver SHOULD stop generating Data ACKs after it receives an infinite mapping. - When a connection has fallen back, only one subflow can send data; - otherwise, the receiver would not know how to reorder the data. In - practice, this means that all MPTCP subflows will have to be - terminated except one. Once MPTCP falls back to regular TCP, it MUST - NOT revert to MPTCP later in the connection. + When a connection has fallen back with an infinite mapping, only one + subflow can send data; otherwise, the receiver would not know how to + reorder the data. In practice, this means that all MPTCP subflows + will have to be terminated except one. Once MPTCP falls back to + regular TCP, it MUST NOT revert to MPTCP later in the connection. - It should be emphasized that we are not attempting to prevent the use - of middleboxes that want to adjust the payload. An MPTCP-aware + It should be emphasized that MPTCP is not attempting to prevent the + use of middleboxes that want to adjust the payload. An MPTCP-aware middlebox could provide such functionality by also rewriting checksums. 3.8. Error Handling In addition to the fallback mechanism as described above, the standard classes of TCP errors may need to be handled in an MPTCP-specific way. Note that changing semantics -- such as the relevance of a RST -- are covered in Section 4. Where possible, we do not want to deviate from regular TCP behavior. @@ -2358,28 +2456,27 @@ ports as already in use. In other words, the destination port of a SYN containing an MP_JOIN option SHOULD be the same as the remote port of the first subflow in the connection. The local port for such SYNs SHOULD also be the same as for the first subflow (and as such, an implementation SHOULD reserve ephemeral ports across all local IP addresses), although there may be cases where this is infeasible.
This strategy is intended to maximize the probability of the SYN being permitted by a firewall or NAT at the recipient and to avoid confusing any network monitoring software. - There may also be cases, however, where the passive opener wishes to - signal to the other host that a specific port should be used, and - this facility is provided in the Add Address option as documented in - Section 3.4.1. It is therefore feasible to allow multiple subflows - between the same two addresses but using different port pairs, and - such a facility could be used to allow load balancing within the - network based on 5-tuples (e.g., some ECMP implementations - [RFC2992]). + There may also be cases, however, where a host wishes to signal that + a specific port should be used, and this facility is provided in the + ADD_ADDR option as documented in Section 3.4.1. It is therefore + feasible to allow multiple subflows between the same two addresses + but using different port pairs, and such a facility could be used to + allow load balancing within the network based on 5-tuples (e.g., some + ECMP implementations [RFC2992]). 3.9.2. Delayed Subflow Start and Subflow Symmetry Many TCP connections are short-lived and consist only of a few segments, and so the overheads of using MPTCP outweigh any benefits. A heuristic is required, therefore, to decide when to start using additional subflows in an MPTCP connection. We expect that experience gathered from deployments will provide further guidance on this, and will be affected by particular application characteristics (which are likely to change over time). However, a suggested @@ -2392,44 +2489,43 @@ subflow for each initial window's worth of data that is buffered. Consideration should also be given to limiting the rate of adding new subflows, as well as limiting the total number of subflows open for a particular connection. A host may choose to vary these values based on its load or knowledge of traffic and path characteristics. 
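The buffer-based heuristic above, together with the suggested limits on the rate and total number of subflows, can be sketched roughly as follows. This is only an illustration under stated assumptions: the start of the heuristic is elided by the hunk boundary in this diff, so the reading "one additional subflow per initial window's worth of buffered data" is an assumption, and the names SubflowPolicy and extra_subflows_to_open, as well as all default values, are hypothetical and do not come from the draft.

```python
from dataclasses import dataclass


@dataclass
class SubflowPolicy:
    # All fields are illustrative assumptions, not values from the draft.
    initial_window_bytes: int = 10 * 1460  # assumed initial window size
    max_subflows: int = 4                  # assumed cap per connection
    min_open_interval_s: float = 1.0       # assumed minimum gap between opens


def extra_subflows_to_open(policy, buffered_bytes, open_subflows,
                           now_s, last_open_s):
    """Return 1 if one more subflow should be opened now, else 0."""
    # Limit the total number of subflows open for the connection.
    if open_subflows >= policy.max_subflows:
        return 0
    # Limit the rate of adding new subflows.
    if now_s - last_open_s < policy.min_open_interval_s:
        return 0
    # Assumed reading of the heuristic: one additional subflow for each
    # initial window's worth of data currently buffered for sending.
    wanted_additional = buffered_bytes // policy.initial_window_bytes
    target_total = 1 + wanted_additional  # the initial subflow always exists
    return 1 if open_subflows < target_total else 0
```

A host could call such a function whenever its send buffer grows, and vary the policy values according to its load or its knowledge of traffic and path characteristics.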
Note that this heuristic alone is probably insufficient. Traffic for many common applications, such as downloads, is highly asymmetric and the host that is multihomed may well be the client that will never - fill its buffers, and thus never use MPTCP. Advanced APIs that allow - an application to signal its traffic requirements would aid in these - decisions. + fill its buffers, and thus never use MPTCP according to this + heuristic. Advanced APIs that allow an application to signal its + traffic requirements would aid in these decisions. An additional time-based heuristic could be applied, opening additional subflows after a given period of time has passed. This would alleviate the above issue, and also provide resilience for low-bandwidth but long-lived applications. - If the two communicating hosts immediately try to set up subflows - from all available addresses to all available addresses on the other - host, this could end up creating two subflows per path. This is an - inefficient use of resources. + Another issue is that both communicating hosts may simultaneously try + to set up a subflow between the same pair of addresses. This leads + to an inefficient use of resources. If the same ports are used on all subflows, as recommended above, then standard TCP simultaneous open logic should take care of this situation and only one subflow will be established between the address pairs. However, this relies on the same ports being used at both end hosts. If a host does not support TCP simultaneous open, it is RECOMMENDED that some element of randomization is applied to the - time waited before opening new subflows, so that only one subflow - exists between a given address pair. If, however, hosts signal + time to wait before opening new subflows, so that only one subflow is + created between a given address pair. If, however, hosts signal additional ports to use (for example, for leveraging ECMP on-path),
+ this heuristic is not appropriate. This section has shown some of the considerations that an implementer should give when developing MPTCP heuristics, but is not intended to be prescriptive. 3.9.3. Failure Handling Requirements for MPTCP's handling of unexpected signals have been given in Section 3.8. There are other failure cases, however, where a host can choose appropriate behavior. @@ -2463,23 +2559,24 @@ 4. Semantic Issues In order to support multipath operation, the semantics of some TCP components have changed. To aid clarity, this section collects these semantic changes as a reference. Sequence number: The (in-header) TCP sequence number is specific to the subflow. To allow the receiver to reorder application data, an additional data-level sequence space is used. In this data- level sequence space, the initial SYN and the final DATA_FIN - occupy 1 octet of sequence space. There is an explicit mapping of - data sequence space to subflow sequence space, which is signaled - through TCP options in data packets. + occupy 1 octet of sequence space. This is to ensure these signals + are acknowledged at the connection level. There is an explicit + mapping of data sequence space to subflow sequence space, which is + signaled through TCP options in data packets. ACK: The ACK field in the TCP header acknowledges only the subflow sequence number, not the data-level sequence space. Implementations SHOULD NOT attempt to infer a data-level acknowledgment from the subflow ACKs. This separates subflow- and connection-level processing at an end host. Duplicate ACK: A duplicate ACK that includes any MPTCP signaling (with the exception of the DSS option) MUST NOT be treated as a signal of congestion. To limit the chances of non-MPTCP-aware @@ -2554,49 +2651,67 @@ hash of this key as the connection identification "token".
The keys are concatenated and used as keys for creating Hash-based Message Authentication Codes (HMACs) used on subflow setup, in order to verify that the parties in the handshake are the same as in the original connection setup. It also provides verification that the peer can receive traffic at this new address. Replay attacks would still be possible when only keys are used; therefore, the handshakes use single-use random numbers (nonces) at both ends -- this ensures the HMAC will never be the same on two handshakes. Guidance on generating random numbers suitable for use as keys is given in - [RFC4086] and discussed in Section 3.1. + [RFC4086] and discussed in Section 3.1. HMAC is also used to secure + the ADD_ADDR option, due to the threats identified in [RFC7430]. The use of crypto capability bits in the initial connection handshake to negotiate use of a particular algorithm allows the deployment of additional crypto mechanisms in the future. Note that this would be susceptible to bid-down attacks only if the attacker was on-path (and thus would be able to modify the data anyway). The security mechanism presented in this document should therefore protect against all forms of flooding and hijacking attacks discussed in [RFC6181]. + The version negotiation specified in Section 3.1, if differing MPTCP + versions shared a common negotiation format, would allow an on-path + attacker to apply a theoretical bid-down attack. However, since the + v1 and v0 protocols have a different handshake, this is not an attack + that can be applied here. Furthermore, an on-path attacker would + have access to the raw data, negating any other TCP-level security + mechanisms. Also a change from [RFC6824] has removed the subflow + identifier from the MP_PRIO option (Section 3.3.8), to remove the + theoretical attack where a subflow could be placed in "backup" mode + by an attacker. 
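The subflow authentication described at the start of this section (the two keys concatenated to form the HMAC key, with the single-use nonces as the message) can be sketched as below. This is a minimal illustration, not the wire format: it assumes HMAC-SHA256 (flag H in Table 3), and the helper names subflow_hmac and verify_subflow_hmac, the key and nonce sizes used in any call, and the absence of truncation are all assumptions made for this example.

```python
import hashlib
import hmac


def subflow_hmac(local_key: bytes, remote_key: bytes,
                 local_nonce: bytes, remote_nonce: bytes) -> bytes:
    """Compute an HMAC-SHA256 over the two nonces, keyed by both keys."""
    # The keys exchanged in the initial handshake are concatenated.
    key = local_key + remote_key
    # Fresh nonces from both ends form the message, so the HMAC differs
    # on every handshake even though the keys stay the same.
    msg = local_nonce + remote_nonce
    return hmac.new(key, msg, hashlib.sha256).digest()


def verify_subflow_hmac(local_key: bytes, remote_key: bytes,
                        local_nonce: bytes, remote_nonce: bytes,
                        received: bytes) -> bool:
    """Check a received HMAC using a constant-time comparison."""
    expected = subflow_hmac(local_key, remote_key, local_nonce, remote_nonce)
    return hmac.compare_digest(expected, received)
```

Because the nonces change on every handshake, a captured HMAC cannot simply be replayed on a later subflow setup, which is the property the text above relies on.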
+ During normal operation, regular TCP protection mechanisms (such as ensuring sequence numbers are in-window) will provide the same level of protection against attacks on individual TCP subflows as exists for regular TCP today. Implementations will introduce additional buffers compared to regular TCP, to reassemble data at the connection level. The application of window sizing will minimize the risk of denial-of-service attacks consuming resources. As discussed in Section 3.4.1, a host may advertise its private addresses, but these might point to different hosts in the receiver's network. The MP_JOIN handshake (Section 3.2) will ensure that this does not succeed in setting up a subflow to the incorrect host. However, it could still create unwanted TCP handshake traffic. This feature of MPTCP could be a target for denial-of-service exploits, with malicious participants in MPTCP connections encouraging the recipient to target other hosts in the network. Therefore, implementations should consider heuristics (Section 3.9) at both the sender and receiver to reduce the impact of this. + To further protect against malicious ADD_ADDR messages sent by an + off-path attacker, the ADD_ADDR includes an HMAC using the keys + negotiated during the handshake. This effectively prevents an + attacker from diverting an MPTCP connection through an off-path + ADD_ADDR injection into the stream. + A small security risk could theoretically exist with key reuse, but in order to accomplish a replay attack, both the sender and receiver keys, and the sender and receiver random numbers, in the MP_JOIN handshake (Section 3.2) would have to match. Whilst this specification defines a "medium" security solution, meeting the criteria specified at the start of this section and the threat analysis ([RFC6181]), since attacks only ever get worse, it is likely that a future Standards Track version of MPTCP would need to be able to support stronger security. 
There are several ways the @@ -2698,21 +2813,21 @@ Figure 17: Connection Setup with Middleboxes that Strip Options from Packets We now examine data flow with MPTCP, assuming the flow is correctly set up, which implies the options in the SYN packets were allowed through by the relevant middleboxes. If options are allowed through and there is no resegmentation or coalescing to TCP segments, Multipath TCP flows can proceed without problems. The case when options get stripped on data packets has been discussed - in the Fallback section. If a fraction of options are stripped, + in the Fallback section. If only some MPTCP options are stripped, behavior is not deterministic. If some data sequence mappings are lost, the connection can continue so long as mappings exist for the subflow-level data (e.g., if multiple maps have been sent that reinforce each other). If some subflow-level space is left unmapped, however, the subflow is treated as broken and is closed, through the process described in Section 3.7. MPTCP should survive with a loss of some Data ACKs, but performance will degrade as the fraction of stripped options increases. We do not expect such cases to appear in practice, though: most middleboxes will either strip all options or let them all through. @@ -2720,23 +2835,24 @@ We end this section with a list of middlebox classes, their behavior, and the elements in the MPTCP design that allow operation through such middleboxes. Issues surrounding dropping packets with options or stripping options were discussed above, and are not included here: o NATs [RFC3022] (Network Address (and Port) Translators) change the source address (and often source port) of packets. This means that a host will not know its public-facing address for signaling in MPTCP. Therefore, MPTCP permits implicit address addition via the MP_JOIN option, and the handshake mechanism ensures that - connection attempts to private addresses [RFC1918] do not cause - problems. 
Explicit address removal is undertaken by an Address ID - to allow no knowledge of the source address. + connection attempts to private addresses [RFC1918], since they are + authenticated, will only set up subflows to the correct hosts. + Explicit address removal is undertaken by an Address ID to allow + no knowledge of the source address. o Performance Enhancing Proxies (PEPs) [RFC3135] might proactively ACK data to increase performance. MPTCP, however, relies on accurate congestion control signals from the end host, and non- MPTCP-aware PEPs will not be able to provide such signals. MPTCP will, therefore, fall back to single-path TCP, or close the problematic subflow (see Section 3.7). o Traffic Normalizers [norm] may not allow holes in sequence numbers, and may cache packets and retransmit the same data. @@ -2751,23 +2867,23 @@ numbers in data sequence mapping to cope with this. Like NATs, firewalls will not permit many incoming connections, so MPTCP supports address signaling (ADD_ADDR) so that a multiaddressed host can invite its peer behind the firewall/NAT to connect out to its additional interface. o Intrusion Detection Systems look out for traffic patterns and content that could threaten a network. Multipath will mean that such data is potentially spread, so it is more difficult for an IDS to analyze the whole traffic, and potentially increases the - risk of false positives. However, for an MPTCP-aware IDS, tokens - can be read by such systems to correlate multiple subflows and - reassemble for analysis. + risk of false positives. However, a MPTCP-aware IDS can read + tokens to correlate multiple subflows and reassemble them for + analysis. o Application-level middleboxes such as content-aware firewalls may alter the payload within a subflow, such as rewriting URIs in HTTP traffic. MPTCP will detect these using the checksum and close the affected subflow(s), if there are other subflows that can be used. 
If all subflows are affected, multipath will fall back to TCP, allowing such middleboxes to change the payload. MPTCP-aware middleboxes should be able to adjust the payload and MPTCP metadata in order not to break the connection. @@ -2795,26 +2911,26 @@ The authors gratefully acknowledge significant input into this document from Sebastien Barre and Andrew McDonald. The authors also wish to acknowledge reviews and contributions from Iljitsch van Beijnum, Lars Eggert, Marcelo Bagnulo, Robert Hancock, Pasi Sarolahti, Toby Moncaster, Philip Eardley, Sergio Lembo, Lawrence Conroy, Yoshifumi Nishida, Bob Briscoe, Stein Gjessing, Andrew McGregor, Georg Hampel, Anumita Biswas, Wes Eddy, Alexey Melnikov, Francis Dupont, Adrian Farrel, Barry Leiba, Robert Sparks, - Sean Turner, Stephen Farrell, Martin Stiemerling, Gregory Detal, and - Fabien Duchene. + Sean Turner, Stephen Farrell, Martin Stiemerling, Gregory Detal, + Fabien Duchene, Xavier de Foy, and Rahul Jadhav. 8. IANA Considerations - This document updates [RFC6824] and as such IANA is requested to + This document obsoletes [RFC6824] and as such IANA is requested to update the TCP option space registry to point to this document for Multipath TCP, as follows: +------+--------+-----------------------+---------------+ | Kind | Length | Meaning | Reference | +------+--------+-----------------------+---------------+ | 30 | N | Multipath TCP (MPTCP) | This document | +------+--------+-----------------------+---------------+ Table 1: TCP Option Kind Numbers @@ -2868,52 +2984,51 @@ reserved for use by private experiments. Its use may be formalized in a future specification. 8.2. MPTCP Handshake Algorithms IANA has created another sub-registry, "MPTCP Handshake Algorithms" under the "Transmission Control Protocol (TCP) Parameters" registry, based on the flags in MP_CAPABLE (Section 3.1). 
IANA is requested to update the references of this table to this document, as follows: - +---------+----------------------------------+----------------------+ + +-------+----------------------------------------+------------------+ | Flag | Meaning | Reference | | Bit | | | - +---------+----------------------------------+----------------------+ + +-------+----------------------------------------+------------------+ | A | Checksum required | This document, | | | | Section 3.1 | | B | Extensibility | This document, | | | | Section 3.1 | - | C | Do not attempt to connect to | This document, | - | | source address | Section 3.1 | + | C | Do not attempt to establish new | This document, | + | | subflows to the source address. | Section 3.1 | | D-G | Unassigned | | | H | HMAC-SHA256 | This document, | | | | Section 3.2 | - +---------+----------------------------------+----------------------+ + +-------+----------------------------------------+------------------+ Table 3: MPTCP Handshake Algorithms Note that the meanings of bits D through H can be dependent upon bit B, depending on how Extensibility is defined in future specifications; see Section 3.1 for more information. Future assignments in this registry are also to be defined by Standards Action as defined by [RFC5226]. Assignments consist of the value of the flags, a symbolic name for the algorithm, and a reference to its specification. 8.3. MP_TCPRST Reason Codes IANA is requested to create a further sub-registry, "MP_TCPRST Reason Codes" under the "Transmission Control Protocol (TCP) Parameters" - registry, based on the reason code in MP_TCPRST (Section 3.6). 
The - contents of this sub-registry are to to this document, as follows: + registry, based on the reason code in MP_TCPRST (Section 3.6): +------+-----------------------------+----------------------------+ | Code | Meaning | Reference | +------+-----------------------------+----------------------------+ | 0x00 | Unspecified TCP error | This document, Section 3.6 | | 0x01 | MPTCP specific error | This document, Section 3.6 | | 0x02 | Lack of resources | This document, Section 3.6 | | 0x03 | Administratively prohibited | This document, Section 3.6 | | 0x04 | Too much outstanding data | This document, Section 3.6 | | 0x05 | Unacceptable performance | This document, Section 3.6 | @@ -2933,20 +3048,24 @@ [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, DOI 10.17487/RFC2119, March 1997, . [RFC6182] Ford, A., Raiciu, C., Handley, M., Barre, S., and J. Iyengar, "Architectural Guidelines for Multipath TCP Development", RFC 6182, DOI 10.17487/RFC6182, March 2011, . + [RFC8174] Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC + 2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174, + May 2017, . + [SHS] National Institute of Standards and Technology, "Secure Hash Standard", Federal Information Processing Standard (FIPS) 180-4, August 2015, . 9.2. Informative References [howhard] Raiciu, C., Paasch, C., Barre, S., Ford, A., Honda, M., Duchene, F., Bonaventure, O., and M. Handley, "How Hard @@ -3059,20 +3178,26 @@ . [RFC6897] Scharf, M. and A. Ford, "Multipath TCP (MPTCP) Application Interface Considerations", RFC 6897, DOI 10.17487/RFC6897, March 2013, . [RFC7413] Cheng, Y., Chu, J., Radhakrishnan, S., and A. Jain, "TCP Fast Open", RFC 7413, DOI 10.17487/RFC7413, December 2014, . + [RFC7430] Bagnulo, M., Paasch, C., Gont, F., Bonaventure, O., and C. + Raiciu, "Analysis of Residual Threats and Possible Fixes + for Multipath TCP (MPTCP)", RFC 7430, + DOI 10.17487/RFC7430, July 2015, + .
+ [TCPLO] Ramaiah, A., "TCP option space extension", Work in Progress, March 2012. Appendix A. Notes on Use of TCP Options The TCP option space is limited due to the length of the Data Offset field in the TCP header (4 bits), which defines the TCP header length in 32-bit words. With the standard TCP header being 20 bytes, this leaves a maximum of 40 bytes for options, and many of these may already be used by options such as timestamp and SACK. @@ -3144,211 +3269,215 @@ Finally, there are issues with reliable delivery of options. As options can also be sent on pure ACKs, these are not reliably sent. This is not an issue for DATA_ACK due to their cumulative nature, but may be an issue for ADD_ADDR/REMOVE_ADDR options. Here, it is recommended to send these options redundantly (whether on multiple paths or on the same path on a number of ACKs -- but interspersed with data in order to avoid interpretation as congestion). The cases where options are stripped by middleboxes are discussed in Section 6. -Appendix B. TCP Fast Open +Appendix B. TCP Fast Open and MPTCP TCP Fast Open (TFO) is an experimental TCP extension, described in - [RFC7413], which has been introduced with the objective of gaining - one RTT before transmitting data. This is considered a valuable gain - as very short connections are very common, especially for HTTP - request/response schemes. It achieves this by sending the SYN- - segment together with data and allowing the server to reply - immediately with data after the SYN/ACK. [RFC7413] secures this - mechanism, by using a new TCP option that includes a cookie which is - negotiated in a preceding connection. + [RFC7413], which has been introduced to allow sending data one RTT + earlier than with regular TCP. This is considered a valuable gain as + very short connections are very common, especially for HTTP request/ + response schemes. 
It achieves this by sending the SYN-segment + together with the application's data and allowing the listener to + reply immediately with data after the SYN/ACK. [RFC7413] secures + this mechanism, by using a new TCP option that includes a cookie + which is negotiated in a preceding connection. When using TCP Fast Open in conjunction with MPTCP, there are two key points to take into account, detailed hereafter. B.1. TFO cookie request with MPTCP - When a TFO client first connects to a server, it cannot immediately - include data in the SYN for security reasons [RFC7413]. Instead, it - requests a cookie that will be used in subsequent connections. This - is done with the TCP cookie request/response options, of resp. 2 - bytes and 6-18 bytes (depending on the chosen cookie length). + When a TFO initiator first connects to a listener, it cannot + immediately include data in the SYN for security reasons [RFC7413]. + Instead, it requests a cookie that will be used in subsequent + connections. This is done with the TCP cookie request/response + options, of respectively 2 bytes and 6-18 bytes (depending on the + chosen cookie length). - TFO and MPTCP can be combined provided that the total length of their - options does not exceed the maximum 40 bytes possible in TCP: + TFO and MPTCP can be combined provided that the total length of all + the options does not exceed the maximum 40 bytes possible in TCP: o In the SYN: MPTCP uses a 4-bytes long MP_CAPABLE option. The MPTCP and TFO options sum up to 6 bytes. With typical TCP-options using up to 19 bytes in the SYN (24 bytes if options are padded at a word boundary), there is enough space to combine the MP_CAPABLE with the TFO Cookie Request. o In the SYN+ACK: MPTCP uses a 12-bytes long MP_CAPABLE option, but now TFO can be as long as 18 bytes. Since the maximum option - length may be exceeded, it is up to the server to solve this by + length may be exceeded, it is up to the listener to solve this by using a shorter cookie. 
As an example, if we consider that 19 bytes are used for classical TCP options, the maximum possible cookie length would be of 7 bytes. Note that the same limitation applies to subsequent connections, for the SYN packet (because the - client then echoes back the cookie to the server). Finally, if - the security impact of reducing the cookie size is not deemed - acceptable, the server can reduce the amount of other TCP-options - by omitting the TCP timestamps (as outlined in Appendix A). + initiator then echoes back the cookie to the listener). Finally, + if the security impact of reducing the cookie size is not deemed + acceptable, the listener can reduce the amount of other TCP- + options by omitting the TCP timestamps (as outlined in + Appendix A). B.2. Data sequence mapping under TFO MPTCP uses, in the TCP establishment phase, a key exchange that is used to generate the Initial Data Sequence Numbers (IDSNs). In particular, the SYN with MP_CAPABLE occupies the first octet of the data sequence space. With TFO, one way to handle the data sent together with the SYN would be to consider an implicit DSS mapping that covers that SYN segment (since there is not enough space in the SYN to include a DSS option). The problem with that approach is that if a middlebox modifies the TFO data, this will not be noticed by MPTCP because of the absence of a DSS-checksum. For example, a TCP (but not MPTCP)-aware middlebox could insert bytes at the beginning of the stream and adapt the TCP checksum and sequence numbers - accordingly. With an implicit mapping, this would give to client and - server a different view on the DSS-mapping, with no way to detect - this inconsistency as the DSS checksum is not present. + accordingly. With an implicit mapping, this would give to initiator + and listener a different view on the DSS-mapping, with no way to + detect this inconsistency as the DSS checksum is not present. 
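The option-space budget described in B.1 can be checked with a short calculation. The following is only an illustrative sketch: the function and constant names are hypothetical, and the 2-byte TFO option header is an assumption inferred from the 2-byte cookie request and the 6-18 byte cookie response sizes quoted above.

```python
# Illustrative sketch only; names are hypothetical, not taken from the draft.
TCP_MAX_OPTION_BYTES = 40     # limit imposed by the 4-bit Data Offset field
TFO_OPTION_HEADER_BYTES = 2   # assumed kind + length bytes (the 2-byte cookie request)


def fits_in_syn(other_options: int, mp_capable: int = 4, tfo_request: int = 2) -> bool:
    """True if MP_CAPABLE plus a TFO cookie request fit beside the other SYN options."""
    return other_options + mp_capable + tfo_request <= TCP_MAX_OPTION_BYTES


def max_tfo_cookie_len(other_options: int, mp_capable: int = 12) -> int:
    """Largest cookie (in bytes) that still fits in the SYN+ACK option space."""
    remaining = TCP_MAX_OPTION_BYTES - other_options - mp_capable
    return max(0, remaining - TFO_OPTION_HEADER_BYTES)


if __name__ == "__main__":
    print(fits_in_syn(24))         # padded 24-byte case from the SYN bullet
    print(max_tfo_cookie_len(19))  # 19 bytes of classical TCP options
```

With 19 bytes of classical options and a 12-byte MP_CAPABLE, 9 bytes remain for the TFO option, which leaves 7 bytes for the cookie itself, matching the figure given in B.1.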
- To solve this, the TFO data should not be considered part of the Data + To solve this, the TFO data must not be considered part of the Data Sequence Number space: the SYN with MP_CAPABLE still occupies the first octet of data sequence space, but then the first non-TFO data byte occupies the second octet. This guarantees that, if the use of DSS-checksum is negotiated, all data in the data sequence number space is checksummed. We also note that this does not entail a loss - of functionality, because TFO-data is always sent when only one path - is active. + of functionality, because TFO data is only ever sent on the initial + subflow before any attempt to create additional subflows. B.3. Connection establishment examples The following shows a few examples of possible TFO+MPTCP establishment scenarios. - Before a client can send data together with the SYN, it must request - a cookie to the server, as shown in Figure Figure 18. This is done - by simply combining the TFO and MPTCP options. + Before an initiator can send data together with the SYN, it must + request a cookie from the listener, as shown in Figure 18. This + is done by simply combining the TFO and MPTCP options. - client server +initiator listener | | - | S 0(0) , | + | S Seq=0(Length=0) , | | -----------------------------------------------------------> | | | | S. 0(0) ack 1 , | | <----------------------------------------------------------- | | | | . 0(0) ack 1 | | -----------------------------------------------------------> | | | - Figure 18: Cookie request + Figure 18: Cookie request - sequence number and length are annotated + as Seq(Length) and used hereafter in the figures. Once this is done, the received cookie can be used for TFO, as shown - in Figure Figure 19.
In this example, the initiator first sends 20 + bytes in the SYN. The listener immediately replies with 100 bytes + following the SYN-ACK upon which the initiator replies with 20 more bytes. Note that the last segment in the figure has a TCP sequence number of 21, while the DSS subflow sequence number is 1 (because the TFO data is not part of the data sequence number space, as explained in Appendix B.2). - client server +initiator listener | | | S 0(20) , | | -----------------------------------------------------------> | | | | S. 0(0) ack 21 | | <----------------------------------------------------------- | | | | . 1(100) ack 21 | | <----------------------------------------------------------- | | | | . 21(0) ack 1 | | -----------------------------------------------------------> | | | | . 21(20) ack 101 | | -----------------------------------------------------------> | | | - Figure 19: The server supports TFO + Figure 19: The listener supports TFO - In Figure Figure 20, the server does not support TFO. The client - detects that no state is created in the server (as no data is acked), - and now sends the MP_CAPABLE in the third ack, in order for the - server to build its MPTCP context at then end of the establishment. - Now, the tfo data, retransmitted, becomes part of the data sequence - mapping because it is effectively sent (in fact re-sent) after the - establishment. + In Figure 20, the listener does not support TFO. The + initiator detects that no state is created in the listener (as no + data is acked), and now sends the MP_CAPABLE in the third ack, in + order for the listener to build its MPTCP context at the end of the + establishment. Now, the TFO data, retransmitted, becomes part of the + data sequence mapping because it is effectively sent (in fact re- + sent) after the establishment. - client server +initiator listener | | | S 0(20) , | | -----------------------------------------------------------> | | | | S.
0(0) ack 1 | | <----------------------------------------------------------- | | | | . 1(0) ack 1 | | -----------------------------------------------------------> | | | | . 1(20) ack 1 | | -----------------------------------------------------------> | | | | . 0(0) ack 21 | | <----------------------------------------------------------- | | | - Figure 20: The server does not support TFO + Figure 20: The listener does not support TFO - It is also possible that the server acknowledges only part of the TFO - data, as illustrated in Figure Figure 21. The client will simply - retransmit the missing data together with a DSS-mapping. + It is also possible that the listener acknowledges only part of the + TFO data, as illustrated in Figure 21. The initiator will + simply retransmit the missing data together with a DSS-mapping. - client server +initiator listener | | | S 0(1000) , | | -----------------------------------------------------------> | | | | S. 0(0) ack 501 | | <----------------------------------------------------------- | | | | . 501(0) ack 1 | | -----------------------------------------------------------> | | | | . 501(500) ack 1 | | -----------------------------------------------------------> | | | Figure 21: Partial data acknowledgement Appendix C. Control Blocks Conceptually, an MPTCP connection can be represented as an MPTCP - control block that contains several variables that track the progress - and the state of the MPTCP connection and a set of linked TCP control - blocks that correspond to the subflows that have been established. + protocol control block (PCB) that contains several variables that + track the progress and the state of the MPTCP connection and a set of + linked TCP control blocks that correspond to the subflows that have + been established. RFC 793 [RFC0793] specifies several state variables. Whenever possible, we reuse the same terminology as RFC 793 to describe the state variables that are maintained by MPTCP. C.1.
MPTCP Control Block The MPTCP control block contains the following variable per connection. C.1.1. Authentication and Metadata Local.Token (32 bits): This is the token chosen by the local host on - this MPTCP connection. The token MUST be unique among all + this MPTCP connection. The token must be unique among all established MPTCP connections, generated from the local key. Local.Key (64 bits): This is the key sent by the local host on this MPTCP connection. Remote.Token (32 bits): This is the token chosen by the remote host on this MPTCP connection, generated from the remote key. Remote.Key (64 bits): This is the key chosen by the remote host on this MPTCP connection @@ -3383,21 +3512,21 @@ used to specify the DATA_ACK that is sent in the DSS option on all subflows. RCV.WND (32 bits with RFC 1323, 16 bits otherwise): This is the connection-level receive window, which is the maximum of the RCV.WND on all the subflows. C.2. TCP Control Blocks The MPTCP control block also contains a list of the TCP control - blocks that are associated to the MPTCP connection. + blocks that are associated with the MPTCP connection. Note that the TCP control block on the TCP subflows does not contain the RCV.WND and SND.WND state variables as these are maintained at the MPTCP connection level and not at the subflow level. Inside each TCP control block, the following state variables are defined. C.2.1. Sending Side