draft-ietf-spring-segment-routing-msdc-09.txt   draft-ietf-spring-segment-routing-msdc-10.txt 
Network Working Group C. Filsfils, Ed. Network Working Group C. Filsfils, Ed.
Internet-Draft S. Previdi Internet-Draft S. Previdi
Intended status: Informational Cisco Systems, Inc. Intended status: Informational Cisco Systems, Inc.
Expires: November 30, 2018 G. Dawra Expires: April 18, 2019 G. Dawra
LinkedIn LinkedIn
E. Aries E. Aries
Juniper Networks Juniper Networks
P. Lapukhov P. Lapukhov
Facebook Facebook
May 29, 2018 October 15, 2018
BGP-Prefix Segment in large-scale data centers BGP-Prefix Segment in large-scale data centers
draft-ietf-spring-segment-routing-msdc-09 draft-ietf-spring-segment-routing-msdc-10
Abstract Abstract
This document describes the motivation and benefits for applying This document describes the motivation and benefits for applying
segment routing in BGP-based large-scale data-centers. It describes segment routing in BGP-based large-scale data-centers. It describes
the design to deploy segment routing in those data-centers, for both the design to deploy segment routing in those data-centers, for both
the MPLS and IPv6 dataplanes. the MPLS and IPv6 dataplanes.
Status of This Memo Status of This Memo
skipping to change at page 1, line 39 skipping to change at page 1, line 39
Internet-Drafts are working documents of the Internet Engineering Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF). Note that other groups may also distribute Task Force (IETF). Note that other groups may also distribute
working documents as Internet-Drafts. The list of current Internet- working documents as Internet-Drafts. The list of current Internet-
Drafts is at https://datatracker.ietf.org/drafts/current/. Drafts is at https://datatracker.ietf.org/drafts/current/.
Internet-Drafts are draft documents valid for a maximum of six months Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress." material or to cite them other than as "work in progress."
This Internet-Draft will expire on November 30, 2018. This Internet-Draft will expire on April 18, 2019.
Copyright Notice Copyright Notice
Copyright (c) 2018 IETF Trust and the persons identified as the Copyright (c) 2018 IETF Trust and the persons identified as the
document authors. All rights reserved. document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents Provisions Relating to IETF Documents
(https://trustee.ietf.org/license-info) in effect on the date of (https://trustee.ietf.org/license-info) in effect on the date of
publication of this document. Please review these documents publication of this document. Please review these documents
skipping to change at page 2, line 38 skipping to change at page 2, line 38
7.1. Per-packet and flowlet switching . . . . . . . . . . . . 14 7.1. Per-packet and flowlet switching . . . . . . . . . . . . 14
7.2. Performance-aware routing . . . . . . . . . . . . . . . . 15 7.2. Performance-aware routing . . . . . . . . . . . . . . . . 15
7.3. Deterministic network probing . . . . . . . . . . . . . . 16 7.3. Deterministic network probing . . . . . . . . . . . . . . 16
8. Additional Benefits . . . . . . . . . . . . . . . . . . . . . 17 8. Additional Benefits . . . . . . . . . . . . . . . . . . . . . 17
8.1. MPLS Dataplane with operational simplicity . . . . . . . 17 8.1. MPLS Dataplane with operational simplicity . . . . . . . 17
8.2. Minimizing the FIB table . . . . . . . . . . . . . . . . 17 8.2. Minimizing the FIB table . . . . . . . . . . . . . . . . 17
8.3. Egress Peer Engineering . . . . . . . . . . . . . . . . . 17 8.3. Egress Peer Engineering . . . . . . . . . . . . . . . . . 17
8.4. Anycast . . . . . . . . . . . . . . . . . . . . . . . . . 18 8.4. Anycast . . . . . . . . . . . . . . . . . . . . . . . . . 18
9. Preferred SRGB Allocation . . . . . . . . . . . . . . . . . . 18 9. Preferred SRGB Allocation . . . . . . . . . . . . . . . . . . 18
10. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 19 10. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 19
11. Manageability Considerations . . . . . . . . . . . . . . . . 19 11. Manageability Considerations . . . . . . . . . . . . . . . . 20
12. Security Considerations . . . . . . . . . . . . . . . . . . . 20 12. Security Considerations . . . . . . . . . . . . . . . . . . . 20
13. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . 20 13. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . 21
14. Contributors . . . . . . . . . . . . . . . . . . . . . . . . 20 14. Contributors . . . . . . . . . . . . . . . . . . . . . . . . 21
15. References . . . . . . . . . . . . . . . . . . . . . . . . . 22 15. References . . . . . . . . . . . . . . . . . . . . . . . . . 22
15.1. Normative References . . . . . . . . . . . . . . . . . . 22 15.1. Normative References . . . . . . . . . . . . . . . . . . 22
15.2. Informative References . . . . . . . . . . . . . . . . . 23 15.2. Informative References . . . . . . . . . . . . . . . . . 23
Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 23 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 23
1. Introduction 1. Introduction
Segment Routing (SR), as described in Segment Routing (SR), as described in
[I-D.ietf-spring-segment-routing] leverages the source routing [I-D.ietf-spring-segment-routing] leverages the source routing
paradigm. A node steers a packet through an ordered list of paradigm. A node steers a packet through an ordered list of
skipping to change at page 14, line 35 skipping to change at page 14, line 35
worth noting that segment routing signaling and data-plane are only worth noting that segment routing signaling and data-plane are only
parts of the solution. Additional enhancements, e.g., such as the parts of the solution. Additional enhancements, e.g., such as the
centralized controller mentioned previously, and host networking centralized controller mentioned previously, and host networking
stack support are required to implement the proposed solutions. The stack support are required to implement the proposed solutions. The
applicability of the solutions described below is not restricted applicability of the solutions described below is not restricted
to the data-center alone; the same could be re-used in the context of to the data-center alone; the same could be re-used in the context of
other domains as well. other domains as well.
7.1. Per-packet and flowlet switching 7.1. Per-packet and flowlet switching
A flowlet is defined as a burst of packets from the same flow A flowlet is defined here as a burst of packets from the 5-tuple flow
followed by an idle interval. followed by a significant idle interval for application to detect
path issues and re-steer the packet.
With some ability to choose paths on the host, one may go from per- With some ability to choose paths on the host, one may go from per-
flow load-sharing in the network to per-packet or per-flowlet. The flow load-sharing in the network to possible per-packet or per-
host may select different segment routing instructions either per flowlet. The host may select different segment routing instructions
packet, or per flowlet, and route them over different paths. This either per packet, or per flowlet, and route them over different
allows for solving the "elephant flow" problem in the data-center and paths. This allows for solving the problem of avoiding link
avoiding link imbalances. imbalances in the data-center.
Note that traditional ECMP routing could be easily simulated with on- Note that traditional ECMP routing could be easily simulated with on-
host path selection, using method proposed in [GREENBERG09]. The host path selection, using method proposed in [GREENBERG09]. The
hosts would randomly pick a Tier-2 or Tier-1 device to "bounce" the hosts would randomly pick a Tier-2 or Tier-1 device to "bounce" the
packet off of, depending on whether the destination is under the same packet off of, depending on whether the destination is under the same
Tier-2 nodes, or has to be reached across Tier-1. The host would use Tier-2 nodes, or has to be reached across Tier-1. The host would use
a hash function that operates on per-flow invariants, to simulate a hash function that operates on per-flow invariants, to simulate
per-flow load-sharing in the network. per-flow load-sharing in the network.
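The per-flow hashing described above can be sketched on the host side as follows. This is a minimal illustration, not the method of [GREENBERG09] or of this draft: the function names, the choice of SHA-256 as the hash, and the 5-tuple encoding are assumptions. The label values (16005 to 16008 for the spine nodes, 16011 for the remote ToR) are the ones used in the example later in this section.

```python
import hashlib

# Assumed values, taken from the example in this section:
# BGP Prefix-SID labels of the four spine nodes and of the remote ToR.
SPINE_SIDS = [16005, 16006, 16007, 16008]
DEST_TOR_SID = 16011


def flow_key(src_ip, dst_ip, proto, src_port, dst_port):
    """Serialize the per-flow invariants (here, the 5-tuple) into bytes."""
    return f"{src_ip}|{dst_ip}|{proto}|{src_port}|{dst_port}".encode()


def label_stack_for_flow(five_tuple, spine_sids=SPINE_SIDS, tor_sid=DEST_TOR_SID):
    """Pick one intermediate spine per flow, deterministically.

    Every packet of the same flow hashes to the same spine, which
    imitates per-flow ECMP load-sharing in the network.
    """
    digest = hashlib.sha256(flow_key(*five_tuple)).digest()
    index = int.from_bytes(digest[:4], "big") % len(spine_sids)
    return [spine_sids[index], tor_sid]


flow_f = ("10.0.0.1", "10.0.9.9", 6, 40000, 443)  # hypothetical addresses and ports
stack = label_stack_for_flow(flow_f)
```

Because the hash input contains only per-flow invariants, repeated calls for the same flow return the same label stack, so packets of one flow are not reordered across paths.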
Using Figure 1 as reference, let us illustrate this concept assuming Using Figure 1 as reference, let us illustrate this concept assuming
that HostA has an elephant flow to HostZ called Flow-f. that HostA has a flow to HostZ called Flow-f.
Normally, a flow is hashed on to a single path. Let's assume HostA Normally, a flow is hashed on to a single path. Let's assume HostA
sends its packets associated with Flow-f with top label 16011 (the sends its packets associated with Flow-f with top label 16011 (the
label for the remote ToR, Node11, where HostZ is connected) and Node1 label for the remote ToR, Node11, where HostZ is connected) and Node1
would hash all the packets of Flow-f via the same next-hop (e.g. would hash all the packets of Flow-f via the same next-hop (e.g.
Node3). Similarly, let's assume that leaf Node3 would hash all the Node3). Similarly, let's assume that leaf Node3 would hash all the
packets of Flow-f via the same next-hop (e.g.: spine node Node5). packets of Flow-f via the same next-hop (e.g.: spine node Node5).
This normal operation would restrict the elephant flow on a small This normal operation would restrict the flow on a small subset of
subset of the ECMP paths to HostZ and potentially create imbalance the ECMP paths to HostZ and potentially create imbalance and
and congestion in the fabric. congestion in the fabric.
Leveraging the flowlet proposal, assuming HostA is made aware of 4 Leveraging the flowlet proposal, assuming HostA is made aware of 4
disjoint paths via intermediate segment 16005, 16006, 16007 and 16008 disjoint paths via intermediate segment 16005, 16006, 16007 and 16008
(the BGP prefix SID's of the 4 spine nodes) and also made aware of (the BGP prefix SID's of the 4 spine nodes) and also made aware of
the prefix segment of the remote ToR connected to the destination the prefix segment of the remote ToR connected to the destination
(16011), then the application can break the elephant flow F into (16011), then the application optimally uses the paths by sending
flowlets F1, F2, F3, F4 and associate each flowlet with one of the flowlets F1, F2, F3, F4 and associate each flowlet with one of the
following 4 label stacks: {16005, 16011}, {16006, 16011}, {16007, following 4 label stacks: {16005, 16011}, {16006, 16011}, {16007,
16011} and {16008, 16011}. This would spread the load of the elephant 16011} and {16008, 16011}. This would spread the load of the flow
flow through all the ECMP paths available in the fabric and re- through all the ECMP paths available in the fabric.
balance the load.
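The flowlet example above can be sketched as a small host-side helper. This is an illustrative sketch under stated assumptions: the class name, the idle-gap threshold, and the round-robin choice of the next label stack are hypothetical and are not specified by the draft. Only the four label stacks come from the text.

```python
# The four label stacks from the example: one per spine node, each
# followed by the prefix segment of the remote ToR (16011).
LABEL_STACKS = [
    [16005, 16011],
    [16006, 16011],
    [16007, 16011],
    [16008, 16011],
]


class FlowletSteerer:
    """Assign packets of one flow to label stacks, one stack per flowlet.

    A new flowlet is assumed to start whenever the gap since the previous
    packet of the flow is at least `idle_gap` seconds (an assumed
    threshold). Each new flowlet moves to the next stack in round-robin
    order (also an assumption).
    """

    def __init__(self, stacks, idle_gap):
        if not stacks:
            raise ValueError("at least one label stack is required")
        self.stacks = stacks
        self.idle_gap = idle_gap
        self._last_seen = None  # timestamp of the previous packet
        self._index = -1        # index of the stack used by the current flowlet

    def stack_for_packet(self, now):
        """Return the label stack for a packet sent at time `now` (seconds)."""
        if self._last_seen is None or now - self._last_seen >= self.idle_gap:
            self._index = (self._index + 1) % len(self.stacks)
        self._last_seen = now
        return self.stacks[self._index]


steerer = FlowletSteerer(LABEL_STACKS, idle_gap=0.5)
timestamps = [0.0, 0.1, 0.2, 1.0, 1.1, 2.0]
chosen = [steerer.stack_for_packet(t) for t in timestamps]
```

Packets inside one burst keep the same stack, so they stay in order on one path. Only the idle interval between bursts triggers a move to another spine.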
7.2. Performance-aware routing 7.2. Performance-aware routing
Knowing the path associated with flows/packets, the end host may Knowing the path associated with flows/packets, the end host may
deduce certain characteristics of the path on its own, and leverage an external mechanism to deduce certain characteristics of
additionally use the information supplied with path information the path, and additionally use the information supplied with path
pushed from the controller or received via pull request. The host information pushed from the controller or received via pull request.
may further share its path observations with the centralized agent, The host may further share its application observations with the
so that the latter may keep up-to-date network health map to assist centralized agent, so that the latter may keep up-to-date network
other hosts with this information. health map to assist other hosts with this information.
For example, an application A.1 at HostA may pin a flow destined to For example, an application A.1 at HostA may pin a flow destined to
HostZ via Spine node Node5 using label stack {16005, 16011}. The HostZ via Spine node Node5 using label stack {16005, 16011}. The
application A.1 may collect information on packet loss or other application A.1 may collect information on packet loss or other
metrics. A.1 may additionally publish this information to a metrics on a particular path using external mechanism. A.1 may also
look locally at the application performance information. The
application A.1 may additionally publish this information to a
centralized agent, e.g. after a flow completes, or periodically for centralized agent, e.g. after a flow completes, or periodically for
longer lived flows. Next, using both local and/or global performance longer lived flows. Next, using both local and/or global performance
data, application A.1 as well as other applications sharing the same data, application A.1 as well as other applications sharing the same
resources in the DC fabric may pick up the best path for the new resources in the DC fabric may pick up the best path for the new
flow, or update an existing path (e.g.: when informed of congestion flow, or update an existing path (e.g.: when informed of congestion
on an existing path). The mechanisms for collecting the flow on an existing path). The mechanisms for collecting the flow
metrics, their publishing to a centralized agent and the decision metrics, their publishing to a centralized agent and the decision
process at the centralized agent and the application/host to pick a process at the centralized agent and the application/host to pick a
path through the network based on this collected information is path through the network based on this collected information is
outside the scope of this document. outside the scope of this document.
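Although the decision process is outside the scope of the draft, a minimal sketch may help show how local and global observations could be combined. Everything here is an assumption made for illustration: the function name, the use of packet-loss ratio as the only metric, and the conservative rule of taking the worse of the local and global value for each path.

```python
def best_path(local_loss, global_loss=None):
    """Return the label stack (as a tuple) with the lowest known loss ratio.

    `local_loss` holds loss ratios observed by the application itself;
    `global_loss` holds ratios published by a centralized agent. When a
    path appears in both, the worse (higher) value is used (an assumed
    policy, not one defined by the draft).
    """
    merged = dict(global_loss or {})
    for path, loss in local_loss.items():
        merged[path] = max(loss, merged.get(path, 0.0))
    if not merged:
        raise ValueError("no path information available")
    return min(merged, key=merged.get)


# Hypothetical measurements for three of the paths to HostZ.
local = {(16005, 16011): 0.02, (16006, 16011): 0.0}
agent = {(16006, 16011): 0.05, (16007, 16011): 0.001}
choice = best_path(local, agent)
```

In this example the path via 16006 looks clean locally but is reported as lossy by the agent, so the sketch prefers the path via 16007.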
One particularly interesting instance of performance-aware routing is One particularly interesting instance of performance-aware routing is
dynamic fault-avoidance. If some links or devices in the network dynamic fault-avoidance. If some links or devices in the network
start discarding packets due to a fault, the end-hosts could probe start discarding packets due to a fault, the end-hosts may receive
updated information from controller, published by external mechanisms
and detect the path(s) that are affected and hence steer the affected and detect the path(s) that are affected and hence steer the affected
flows away from the problem spot. Similar logic applies to failure flows away from the problem spot. Similar logic applies to failure
cases where packets get completely black-holed, e.g., when a link cases where packets get completely black-holed, e.g., when a link
goes down and the failure is detected by the host while probing the goes down and the failure is detected by the host while probing the
path. path.
For example, an application A.1 informed about 5 paths to Z {16005, For example, an application A.1 informed about 5 paths to Z {16005,
16011}, {16006, 16011}, {16007, 16011}, {16008, 16011} and {16011} 16011}, {16006, 16011}, {16007, 16011}, {16008, 16011} and {16011}
might use the last one by default (for simplicity). When performance might use the last one by default (for simplicity). When performance
is degrading, A.1 might then start to pin flows to each of the 4 is degrading, A.1 might then start to pin flows to each of the 4
skipping to change at page 22, line 17 skipping to change at page 22, line 29
Email: jrmitche@puck.nether.net Email: jrmitche@puck.nether.net
15. References 15. References
15.1. Normative References 15.1. Normative References
[I-D.ietf-idr-bgp-prefix-sid] [I-D.ietf-idr-bgp-prefix-sid]
Previdi, S., Filsfils, C., Lindem, A., Sreekantiah, A., Previdi, S., Filsfils, C., Lindem, A., Sreekantiah, A.,
and H. Gredler, "Segment Routing Prefix SID extensions for and H. Gredler, "Segment Routing Prefix SID extensions for
BGP", draft-ietf-idr-bgp-prefix-sid-21 (work in progress), BGP", draft-ietf-idr-bgp-prefix-sid-27 (work in progress),
May 2018. June 2018.
[I-D.ietf-spring-segment-routing] [I-D.ietf-spring-segment-routing]
Filsfils, C., Previdi, S., Ginsberg, L., Decraene, B., Filsfils, C., Previdi, S., Ginsberg, L., Decraene, B.,
Litkowski, S., and R. Shakir, "Segment Routing Litkowski, S., and R. Shakir, "Segment Routing
Architecture", draft-ietf-spring-segment-routing-15 (work Architecture", draft-ietf-spring-segment-routing-15 (work
in progress), January 2018. in progress), January 2018.
[I-D.ietf-spring-segment-routing-central-epe] [I-D.ietf-spring-segment-routing-central-epe]
Filsfils, C., Previdi, S., Dawra, G., Aries, E., and D. Filsfils, C., Previdi, S., Dawra, G., Aries, E., and D.
Afanasiev, "Segment Routing Centralized BGP Egress Peer Afanasiev, "Segment Routing Centralized BGP Egress Peer
skipping to change at page 23, line 13 skipping to change at page 23, line 27
<https://www.rfc-editor.org/info/rfc8277>. <https://www.rfc-editor.org/info/rfc8277>.
15.2. Informative References 15.2. Informative References
[GREENBERG09] [GREENBERG09]
Greenberg, A., Hamilton, J., Jain, N., Kadula, S., Kim, Greenberg, A., Hamilton, J., Jain, N., Kadula, S., Kim,
C., Lahiri, P., Maltz, D., Patel, P., and S. Sengupta, C., Lahiri, P., Maltz, D., Patel, P., and S. Sengupta,
"VL2: A Scalable and Flexible Data Center Network", 2009. "VL2: A Scalable and Flexible Data Center Network", 2009.
[I-D.ietf-6man-segment-routing-header] [I-D.ietf-6man-segment-routing-header]
Previdi, S., Filsfils, C., Leddy, J., Matsushima, S., and Filsfils, C., Previdi, S., Leddy, J., Matsushima, S., and
D. Voyer, "IPv6 Segment Routing Header D. Voyer, "IPv6 Segment Routing Header
(SRH)", draft-ietf-6man-segment-routing-header-13 (work in (SRH)", draft-ietf-6man-segment-routing-header-14 (work in
progress), May 2018. progress), June 2018.
[RFC6793] Vohra, Q. and E. Chen, "BGP Support for Four-Octet [RFC6793] Vohra, Q. and E. Chen, "BGP Support for Four-Octet
Autonomous System (AS) Number Space", RFC 6793, Autonomous System (AS) Number Space", RFC 6793,
DOI 10.17487/RFC6793, December 2012, DOI 10.17487/RFC6793, December 2012,
<https://www.rfc-editor.org/info/rfc6793>. <https://www.rfc-editor.org/info/rfc6793>.
Authors' Addresses Authors' Addresses
Clarence Filsfils (editor) Clarence Filsfils (editor)
Cisco Systems, Inc. Cisco Systems, Inc.
 End of changes. 18 change blocks. 
35 lines changed or deleted 38 lines changed or added
