--- 1/draft-ietf-ipsecme-failure-detection-01.txt 2010-10-25 19:15:46.000000000 +0200 +++ 2/draft-ietf-ipsecme-failure-detection-02.txt 2010-10-25 19:15:46.000000000 +0200 @@ -1,22 +1,22 @@ IPsecME Working Group Y. Nir, Ed. Internet-Draft Check Point Intended status: Standards Track D. Wierbowski -Expires: April 13, 2011 IBM +Expires: April 28, 2011 IBM F. Detienne P. Sethi Cisco - October 10, 2010 + October 25, 2010 A Quick Crash Detection Method for IKE - draft-ietf-ipsecme-failure-detection-01 + draft-ietf-ipsecme-failure-detection-02 Abstract This document describes an extension to the IKEv2 protocol that allows for faster detection of SA desynchronization using a saved token. When an IPsec tunnel between two IKEv2 peers is disconnected due to a restart of one peer, it can take as much as several minutes for the other peer to discover that the reboot has occurred, thus delaying @@ -31,93 +31,94 @@ Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet- Drafts is at http://datatracker.ietf.org/drafts/current/. Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress." - This Internet-Draft will expire on April 13, 2011. + This Internet-Draft will expire on April 28, 2011. Copyright Notice Copyright (c) 2010 IETF Trust and the persons identified as the document authors. All rights reserved. This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. 
Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License. Table of Contents 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 4 1.1. Conventions Used in This Document . . . . . . . . . . . . 4 - 2. RFC 4306 Crash Recovery . . . . . . . . . . . . . . . . . . . 5 - 3. Protocol Outline . . . . . . . . . . . . . . . . . . . . . . . 5 - 4. Formats and Exchanges . . . . . . . . . . . . . . . . . . . . 6 - 4.1. Notification Format . . . . . . . . . . . . . . . . . . . 6 - 4.2. Passing a Token in the AUTH Exchange . . . . . . . . . . . 7 - 4.3. Replacing Tokens After Rekey or Resumption . . . . . . . . 8 + 2. RFC 5996 Crash Recovery . . . . . . . . . . . . . . . . . . . 5 + 3. Protocol Outline . . . . . . . . . . . . . . . . . . . . . . . 6 + 4. Formats and Exchanges . . . . . . . . . . . . . . . . . . . . 7 + 4.1. Notification Format . . . . . . . . . . . . . . . . . . . 7 + 4.2. Passing a Token in the AUTH Exchange . . . . . . . . . . . 8 + 4.3. Replacing Tokens After Rekey or Resumption . . . . . . . . 9 4.4. Replacing the Token for an Existing SA . . . . . . . . . . 9 - 4.5. Presenting the Token in an Unprotected Message . . . . . . 9 - 5. Token Generation and Verification . . . . . . . . . . . . . . 10 - 5.1. A Stateless Method of Token Generation . . . . . . . . . . 10 - 5.2. A Stateless Method with IP addresses . . . . . . . . . . . 11 - 5.3. Token Lifetime . . . . . . . . . . . . . . . . . . . . . . 11 - 6. Backup Gateways . . . . . . . . . . . . . . . . . . . . . . . 11 - 7. Alternative Solutions . . . . . . . . . . . . . . . . . . . . 12 - 7.1. Initiating a new IKE SA . . . . . . . . . . . . . . . . . 12 - 7.2. SIR . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 - 7.3. Birth Certificates . . . . . . . . . . . . . . . . . . . . 12 - 7.4. Reducing Liveness Check Length . . . . . 
. . . . . . . . . 13 - 8. Interaction with Session Resumption . . . . . . . . . . . . . 13 - 9. Operational Considerations . . . . . . . . . . . . . . . . . . 15 - 9.1. Who should implement this specification . . . . . . . . . 15 + 4.5. Presenting the Token in an Unprotected Message . . . . . . 10 + 5. Token Generation and Verification . . . . . . . . . . . . . . 11 + 5.1. A Stateless Method of Token Generation . . . . . . . . . . 11 + 5.2. A Stateless Method with IP addresses . . . . . . . . . . . 12 + 5.3. Token Lifetime . . . . . . . . . . . . . . . . . . . . . . 12 + 6. Backup Gateways . . . . . . . . . . . . . . . . . . . . . . . 12 + 7. Alternative Solutions . . . . . . . . . . . . . . . . . . . . 13 + 7.1. Initiating a new IKE SA . . . . . . . . . . . . . . . . . 13 + 7.2. SIR . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 + 7.3. Birth Certificates . . . . . . . . . . . . . . . . . . . . 13 + 7.4. Reducing Liveness Check Length . . . . . . . . . . . . . . 14 + 8. Interaction with Session Resumption . . . . . . . . . . . . . 14 + 9. Operational Considerations . . . . . . . . . . . . . . . . . . 16 + 9.1. Who should implement this specification . . . . . . . . . 16 9.2. Response to unknown child SPI . . . . . . . . . . . . . . 16 - 10. Security Considerations . . . . . . . . . . . . . . . . . . . 16 + 10. Security Considerations . . . . . . . . . . . . . . . . . . . 17 10.1. QCD Token Generation and Handling . . . . . . . . . . . . 17 - 10.2. QCD Token Transmission . . . . . . . . . . . . . . . . . . 17 + 10.2. QCD Token Transmission . . . . . . . . . . . . . . . . . . 18 10.3. QCD Token Enumeration . . . . . . . . . . . . . . . . . . 18 - 10.4. Selecting an Appropriate Token Generation Method . . . . . 18 - 11. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 19 - 12. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . 19 - 13. Change Log . . . . . . . . . . . . . . . . . . . . . . . . . . 19 - 13.1. 
Changes from draft-ietf-ipsecme-failure-detection-00 . . . 19 - 13.2. Changes from draft-nir-ike-qcd-07 . . . . . . . . . . . . 20 - 13.3. Changes from draft-nir-ike-qcd-03 and -04 . . . . . . . . 20 - 13.4. Changes from draft-nir-ike-qcd-02 . . . . . . . . . . . . 20 - 13.5. Changes from draft-nir-ike-qcd-01 . . . . . . . . . . . . 20 - 13.6. Changes from draft-nir-ike-qcd-00 . . . . . . . . . . . . 20 - 13.7. Changes from draft-nir-qcr-00 . . . . . . . . . . . . . . 20 - 14. References . . . . . . . . . . . . . . . . . . . . . . . . . . 21 - 14.1. Normative References . . . . . . . . . . . . . . . . . . . 21 - 14.2. Informative References . . . . . . . . . . . . . . . . . . 21 - Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 21 + 10.4. Selecting an Appropriate Token Generation Method . . . . . 19 + 11. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 20 + 12. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . 20 + 13. Change Log . . . . . . . . . . . . . . . . . . . . . . . . . . 20 + 13.1. Changes from draft-ietf-ipsecme-failure-detection-01 . . . 20 + 13.2. Changes from draft-ietf-ipsecme-failure-detection-00 . . . 20 + 13.3. Changes from draft-nir-ike-qcd-07 . . . . . . . . . . . . 21 + 13.4. Changes from draft-nir-ike-qcd-03 and -04 . . . . . . . . 21 + 13.5. Changes from draft-nir-ike-qcd-02 . . . . . . . . . . . . 21 + 13.6. Changes from draft-nir-ike-qcd-01 . . . . . . . . . . . . 21 + 13.7. Changes from draft-nir-ike-qcd-00 . . . . . . . . . . . . 21 + 13.8. Changes from draft-nir-qcr-00 . . . . . . . . . . . . . . 21 + 14. References . . . . . . . . . . . . . . . . . . . . . . . . . . 22 + 14.1. Normative References . . . . . . . . . . . . . . . . . . . 22 + 14.2. Informative References . . . . . . . . . . . . . . . . . . 22 + Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 22 1. 
Introduction IKEv2, as described in [RFC5996] and its predecessor RFC 4306, has a method for recovering from a reboot of one peer. As long as traffic flows in both directions, the rebooted peer should re-establish the tunnels immediately. However, in many cases the rebooted peer is a VPN gateway that protects only servers, or else the non-rebooted peer has a dynamic IP address. In such cases, the rebooted peer will not be able to re-establish the tunnels. Section 2 describes how - recovery works under RFC 4306, and explains why it may take several + recovery works under RFC 5996, and explains why it may take several minutes. The method proposed here is to send an octet string, called a "QCD token", in the IKE_AUTH exchange that establishes the tunnel. That token can be stored on the peer as part of the IKE SA. After a reboot, the rebooted implementation can re-generate the token, and send it to the peer, so as to delete the IKE SA. Deleting the IKE SA results in a quick establishment of new IPsec tunnels. This is described in Section 3. @@ -140,21 +141,21 @@ receives is identical to the old token it has stored. The term "non-volatile storage" in this document refers to a data storage module that persists across restarts of the token maker. Examples of such a storage module include an internal disk, an internal flash memory module, an external disk and an external database. A small non-volatile storage module is required for a token maker, but a larger one can be used to enhance performance, as described in Section 9.2. -2. RFC 4306 Crash Recovery +2. RFC 5996 Crash Recovery When one peer loses state or reboots, the other peer does not get any notification, so unidirectional IPsec traffic can still flow. The rebooted peer will not be able to decrypt it, however, and the only remedy is to send an unprotected INVALID_SPI notification as described in section 3.10.1 of [RFC5996]. 
That section also describes the processing of such a notification: "If this Informational Message is sent outside the context of an IKE_SA, it should be used by the recipient @@ -170,28 +171,70 @@ Detection" or DPD. Section 2.4 does not mandate how many times the liveness check message should be retransmitted, or for how long, but does recommend the following: "It is suggested that messages be retransmitted at least a dozen times over a period of at least several minutes before giving up on an SA..." - Those "at least several minutes" are a time during which both peers - are active, but IPsec cannot be used. + Those "at least several minutes" are a time during part of which both + peers are active, but IPsec cannot be used. + + Especially in the case of a reboot (rather than fail-over or + administrative clearing of state), the peer does not recover + immediately. A reboot, depending on the system, may take from a few + seconds to a few minutes. This means that at first the peer just + goes silent, i.e. does not send or respond to any messages. An IKEv2 + implementation can detect this situation and follow the rules given + in section 2.4: + + If there has only been outgoing traffic on all of + the SAs associated with an IKE SA, it is essential to confirm + liveness of the other endpoint to avoid black holes. If no + cryptographically protected messages have been received on an IKE + SA or any of its Child SAs recently, the system needs to perform a + liveness check in order to prevent sending messages to a dead peer. + + [RFC5996] does not mandate any time limits, but it is possible that + the peer will start liveness checks even before the other end is + sending INVALID_SPI notifications, as it detected that the other end + is not sending any packets anymore while it is still rebooting or + recovering from the situation. + + This means that the several minutes recovery period is overlapping the + actual recovery time of the other peer, i.e. 
if the security gateway + requires several minutes to boot up after the crash, then the other + peers have already finished their liveness checks before the crashing + peer even has a chance to send INVALID_SPI notifications. + + There are cases where the peer loses state and is able to recover + immediately; in those cases it might take several minutes to recover. + + Note that the IKEv2 specification specifically leaves the number of retries + and lengths of timeouts out of the specification, as they do not + affect interoperability. This means that implementations are allowed + to use the INVALID_SPI messages as hints that + will shorten those timeouts (i.e. different environment and situation + requiring different rules). + + Good existing IKEv2 implementations already do that (i.e. + shorten timeouts or limit the number of retries) based on such + hints and also start liveness checks quickly after the other end goes + silent. 3. Protocol Outline Supporting implementations will send a notification, called a "QCD token", as described in Section 4.1 in the first IKE_AUTH exchange - messages. These are the final IKE_AUTH request and final IKE_AUTH + messages. These are the first IKE_AUTH request and final IKE_AUTH response that contain the AUTH payloads. The generation of these tokens is a local matter for implementations, but considerations are described in Section 5. Implementations that send such a token will be called "token makers". A supporting implementation receiving such a token MUST store it (or a digest thereof) along with the IKE SA. Implementations that support this part of the protocol will be called "token takers". Section 9.1 has considerations for which implementations need to be token takers, and which should be token makers. Implementation that @@ -535,21 +576,21 @@ the public keys. That requires more storage than does a QCD token. 
Additionally, the public-key operations needed to verify the self- signed certificates are more expensive for Alice. We believe that a symmetric-key operation such as proposed here is more light-weight and simple than that implied by the Birth Certificate idea. 7.4. Reducing Liveness Check Length - Some have suggested that the RFC 4306 procedure described in + Some have suggested that the RFC 5996 procedure described in Section 2 can be tweaked by requiring fewer retransmissions over a shorter period of time for cases of liveness check started because of an INVALID_SPI or INVALID_IKE_SPI notification. We believe that the default retransmission policy should represent a good balance between the need for a timely discovery of a dead peer, and a low probability of false detection. We expect the policy to be set to take the shortest time such that this probability achieves a certain target. Therefore, reducing elapsed time and retransmission count will create an unacceptably high probability of false @@ -575,21 +616,21 @@ this problem by having the clients store an encrypted derivative of the IKE SA for quick re-establishment. What Session Resumption does not help is the problem of detecting that the peer gateway has failed. A failed gateway may go undetected for as long as the lifetime of a child SA, because IPsec does not have packet acknowledgement, and applications cannot signal the IPsec layer that the tunnel "does not work". Before establishing a new IKE SA using Session Resumption, a client should ascertain that the gateway has indeed failed. This could be done using either a - liveness check (as in RFC 4306) or using the QCD tokens described in + liveness check (as in RFC 5996) or using the QCD tokens described in this document. A remote access client conforming to both specifications will store QCD tokens, as well as the Session Resumption ticket, if provided by the gateway. 
A remote access gateway conforming to both specifications will generate a QCD token for the client. When the gateway reboots, the client will discover this in either of two ways: 1. The client does regular liveness checks, or else the time for some other IKE exchange has come. Since the gateway is still down, the IKE exchange times out after several minutes. In this @@ -686,20 +726,35 @@ if it arrived with the IKE SPIs of the parent IKE SA. However, a persistent storage module might not be updated in a timely manner, and could be populated with tokens relating to IKE SPIs that have already been rekeyed. A token taker MUST NOT take an invalid QCD Token sent along with an INVALID_SPI notification as evidence that the peer is either malfunctioning or attacking, but it SHOULD limit the rate at which such notifications are processed. 10. Security Considerations + + The extension described in this document must not reduce the security + of IKEv2 or IPsec. Specifically, an eavesdropper must not learn any + non-public information about the peers. + + The proposed mechanism should be secure against attacks by a passive + MITM (eavesdropper). Such an attacker must not be able to disrupt an + existing IKE session, either by resetting the session or by + introducing significant delays. This requirement is especially + significant, because this document introduces a new way to reset an + IKE SA. + + The mechanism need not be similarly secure against an active MITM, + since this type of attacker is already able to disrupt IKE sessions. + 10.1. QCD Token Generation and Handling Tokens MUST be hard to guess. This is critical, because if an attacker can guess the token associated with an IKE SA, she can tear down the IKE SA and associated tunnels at will. When the token is delivered in the IKE_AUTH exchange, it is encrypted. When it is sent again in an unprotected notification, it is not, but that is the last time this token is ever used. 
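The "hard to guess" requirement can be illustrated with a minimal sketch. It is not the method of Section 5.1, which is not reproduced in this diff; it only assumes that a token maker keeps a long-lived secret in non-volatile storage and derives the token from that secret and the two IKE SPIs with a keyed hash, so that the same token can be re-generated after a restart. The names make_token, token_matches and qcd_secret, and the choice of HMAC-SHA-256, are hypothetical.

```python
import hashlib
import hmac
import secrets


def make_token(qcd_secret: bytes, spi_i: bytes, spi_r: bytes) -> bytes:
    """Derive a token from a local secret and the two 8-octet IKE SPIs.

    Assumption: HMAC-SHA-256 is used here only for illustration.
    """
    if len(spi_i) != 8 or len(spi_r) != 8:
        raise ValueError("IKE SPIs are 8 octets each")
    return hmac.new(qcd_secret, spi_i + spi_r, hashlib.sha256).digest()


def token_matches(stored: bytes, presented: bytes) -> bool:
    """Token-taker check: constant-time comparison with the stored token."""
    return hmac.compare_digest(stored, presented)


# Hypothetical values for illustration only.
qcd_secret = secrets.token_bytes(32)  # assumed to survive a restart
spi_i = secrets.token_bytes(8)
spi_r = secrets.token_bytes(8)

# Token delivered (encrypted) in the IKE_AUTH exchange.
token = make_token(qcd_secret, spi_i, spi_r)

# After a restart the maker re-derives the token from the SPIs of the
# unrecognized packet, without any per-SA state.
regenerated = make_token(qcd_secret, spi_i, spi_r)
```

Without the secret, an attacker who knows the SPIs cannot compute the token in this sketch, and replacing the secret changes every token derived from it.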
An aggregation of some tokens generated by one maker together with @@ -748,26 +803,28 @@ 10.3. QCD Token Enumeration An attacker may try to attack QCD if the generation algorithm described in Section 5.1 is used. The attacker will send several fake IKE requests to the gateway under attack, receiving and recording the QCD Tokens in the responses. This will allow the attacker to create a dictionary mapping IKE SPIs to QCD Tokens, which can later be used to tear down any IKE SA. Three factors mitigate this threat: + o The space of all possible IKE SPI pairs is huge: 2^128, so making such a dictionary is impractical. Even if we assume that one implementation always generates predictable IKE SPIs, the space is still at least 2^64 entries, so making the dictionary is extremely - hard. To ensure this, token makers MUST use a good pseudo-random - number generator to generate the IKE SPIs. + hard. To ensure this, token makers MUST generate unpredictable + IKE SPIs by using a cryptographically strong pseudo-random number + generator. o Throttling the number of QCD_TOKEN notifications sent out, as discussed in Section 9.1, especially when not soon after a crash, will limit the attacker's ability to construct a dictionary. o The methods in Section 5.1 and Section 5.2 allow for a periodic change of the QCD_SECRET. Any such change invalidates the entire dictionary. 10.4. Selecting an Appropriate Token Generation Method This section describes the rationale for token generation methods @@ -790,111 +847,116 @@ state is stored on this member, it will send a QCD token to the attacker. If the QCD token does not depend on IP address, this token can immediately be used to tell the token taker to tear down the IKE SA using an unprotected QCD_TOKEN notification. To thwart this possible attack, such configurations should use a method that considers the taker's IP address, such as the method described in Section 5.2. 
On the other hand, when using this method a change of address - invalidates the tokens, so this method has both advantages and - disadvantages. + invalidates the tokens, so this method is only recommended when the + configuration involves gateways generating the same tokens without + access to all the IKE SAs. 11. IANA Considerations IANA is requested to assign a notify message type from the status types range (16406-40959) of the "IKEv2 Notify Message Types" registry with name "QUICK_CRASH_DETECTION". 12. Acknowledgements We would like to thank Hannes Tschofenig and Yaron Sheffer for their comments about Session Resumption. - Frederic D'etienne and Pratima Sethi contributed the ideas in - Section 10.4 and Section 5.2. - Others who have contributed valuable comments are, in alphabetical - order, Lakshminath Dondeti and Scott C Moonen. + order, Lakshminath Dondeti, Tero Kivinen, and Scott C Moonen. 13. Change Log This section lists all changes in this document NOTE TO RFC EDITOR : Please remove this section in the final RFC -13.1. Changes from draft-ietf-ipsecme-failure-detection-00 +13.1. Changes from draft-ietf-ipsecme-failure-detection-01 + + o Fixed the language requiring random IKE SPIs. + o Some better explanation of the reasons to choose the methods in + Section 5.2 and the method in Section 5.1, to close issue #193. + o Added text to the beginning of Section 10 to accommodate issue + #194. + +13.2. Changes from draft-ietf-ipsecme-failure-detection-00 o Nits pointed out by Scott and Yaron. o Pratima and Frederic are back on board. o Changed IKEv2bis draft reference to RFC 5996. o Resolved issues #189, #190, #191, and #192: * Renamed section 4.5 and removed the requirement to send an acknowledgement for the unprotected message. * Moved the QCD token from the last to the first IKE_AUTH request. - * Added a MUST to Section 10.3 to require that IKE SPIs be randomly generated. * Changed the language in Section 9.1, to not use RFC 2119 terminology. 
* Moved the section describing why one would want the method dependent on IP addresses (in Section 5.2) from operational considerations to security considerations. -13.2. Changes from draft-nir-ike-qcd-07 +13.3. Changes from draft-nir-ike-qcd-07 o First WG version. o Addressed Scott C Moonen's concern about collisions of QCD tokens. o Updated references to point to IKEv2bis instead of RFC 4306 and 4718. Also converted draft reference for resumption to RFC 5723. o Added Dave Wierbowski as author, and removed Pratima and Frederic. -13.3. Changes from draft-nir-ike-qcd-03 and -04 +13.4. Changes from draft-nir-ike-qcd-03 and -04 Mostly editorial changes and cleaning up. -13.4. Changes from draft-nir-ike-qcd-02 +13.5. Changes from draft-nir-ike-qcd-02 o Described QCD token enumeration, following a question by Lakshminath Dondeti. o Added the ability to replace the QCD token for an existing IKE SA. o Added tokens dependent on peer IP address and their interaction with MOBIKE. -13.5. Changes from draft-nir-ike-qcd-01 +13.6. Changes from draft-nir-ike-qcd-01 o Removed stateless method. o Added discussion of rekeying and resumption. o Added discussion of non-synchronized load-balanced clusters of gateways in the security considerations. o Other wording fixes. -13.6. Changes from draft-nir-ike-qcd-00 +13.7. Changes from draft-nir-ike-qcd-00 o Merged proposal with draft-detienne-ikev2-recovery o Changed the protocol so that the rebooted peer generates the token. This has the effect that the need for persistent storage is eliminated. o Added discussion of birth certificates. -13.7. Changes from draft-nir-qcr-00 +13.8. Changes from draft-nir-qcr-00 + o Changed name to reflect that this relates to IKE. Also changed from quick crash recovery to quick crash detection to avoid confusion with IFARE. o Added more operational considerations. o Added interaction with IFARE. o Added discussion of backup gateways. 14. References - 14.1. 
Normative References [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, March 1997. [RFC4555] Eronen, P., "IKEv2 Mobility and Multihoming Protocol (MOBIKE)", RFC 4555, June 2006. [RFC5996] Kaufman, C., Hoffman, P., Nir, Y., and P. Eronen, "Internet Key Exchange Protocol: IKEv2", RFC 5996,