Posted on May 27, 2020 12:24 am | Asked by Danail Petrov | 214 views

G’day,

I am looking for some guidance as well as real-life experience, gotchas, etc. (if someone has done something similar, it would be great to hear your thoughts). I am working on a design of 50+ DCs, meaning I’ll have 50+ fabrics across the globe. The requirement is to support multi-tenancy across all the sites, and the only “relief” here is that I only need to support Layer 3! That being said, I am considering a multi-site, multi-pod architecture running route-servers within and between all sites.

My current idea is to have three major sites (regions) – US, EMEA & APAC. In each region I am planning to run a pair of route-servers, used to distribute the control plane within the site (region). Each site (region) will run as a multi-pod fabric sharing the same control plane. I think that should be okay from a scaling POV, bearing in mind that no support for Type-2s is needed, so the underlay can be fairly simple. Day 1 I am expecting no more than 20,000 hosts within the entire network and probably no more than 5,000 IP prefixes (Type-5s).
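For context, here is roughly what I picture on each regional route-server, purely as a sketch (EOS-style; the ASNs, IPs and peer-group names are placeholders, I’m assuming a single spine ASN per region for brevity, and the exact syntax may differ by EOS release):

! Regional route-server - control plane only, not a VTEP (no Vxlan1 interface)
service routing protocols model multi-agent
!
router bgp 65000
   router-id 192.0.2.1
   no bgp default ipv4-unicast
   ! Pod spines in this region peer here over their loopbacks
   neighbor POD-SPINES peer group
   neighbor POD-SPINES remote-as 65001
   neighbor POD-SPINES update-source Loopback0
   neighbor POD-SPINES ebgp-multihop 3
   neighbor POD-SPINES send-community extended
   ! Keep the original VTEP next hop so the route-server stays out of the data path
   neighbor POD-SPINES next-hop-unchanged
   neighbor 192.0.2.11 peer group POD-SPINES
   neighbor 192.0.2.12 peer group POD-SPINES
   !
   address-family evpn
      neighbor POD-SPINES activate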

The answers I am looking for are:

  • Topology: From a scalability perspective, does that “hierarchy” make sense – aggregating pods under the same fabric within a site (region), and then another level of hierarchy running draft-sharma-multi-site using BGWs?
  • Scale: I couldn’t find anything about scalability and support when it comes to the number of VTEP addresses supported within a fabric, or the number of sites (draft-sharma) currently supported by both the hardware (7280) and EOS.
  • Route-servers design: VXLAN stitching will be done at the BGWs, but do I need to factor in anything additional if I want to run route-servers? My idea is for all pod spine devices to have MP-BGP sessions with the two regional route-servers, and then a full mesh between those route-servers (the spine side of this is sketched after this list).
    • Do you see any problem with that setup?
    • Also, what would be best here – running a dedicated pair of hardware switches for this function, or is there a good software alternative? I am aware I can run EOS on KVM or another hypervisor, but is that even an option?
  • Filtering: As I’ve mentioned, I am not planning on extending any L2 VNIs between the pods. That being said, I don’t want to export any Type-2 routes outside of the local fabric (DC). Is that technically supported, and if so, do you see any problems doing it? The idea here is to keep Type-2s within the local fabric only and not share them with the route-servers, so other pods won’t need to learn all the MAC addresses of fabrics they won’t have L2 communication with.
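For completeness, the spine side of that peering would look something like the sketch below (again just illustrative, with placeholder ASNs and IPs; syntax to be checked per EOS release):

! Pod spine - overlay peering only toward the two regional route-servers
router bgp 65001
   router-id 192.0.2.11
   no bgp default ipv4-unicast
   neighbor RS peer group
   neighbor RS remote-as 65000
   neighbor RS update-source Loopback0
   neighbor RS ebgp-multihop 3
   neighbor RS send-community extended
   neighbor 192.0.2.1 peer group RS
   neighbor 192.0.2.2 peer group RS
   !
   address-family evpn
      neighbor RS activate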

I hope the above makes sense and thanks in advance for your time.

Posted by Aniket Bhowmick
Answered on May 29, 2020 3:20 pm

Hi Danail,

Thanks for writing such a detailed requirement.

I first want to make sure that what I have understood is correct; please read the two points below, which describe your requirement:

  • Each site will have two route-servers (a.k.a. spines) which will be reflecting/distributing MP-BGP routes between the different leaf pods.
  • Each site will have a Border Gateway (BGW) which will be the entry/exit point of the site and will also perform inter-DC communication. The BGWs of all three sites will be interconnected by a layer of route-servers which will be used to distribute MP-BGP routes between the BGWs of each site. The BGW will also be connected to the route-server that exists within the DC connecting to the leafs. The BGW will stitch the VXLAN fabric from one site to another, which means it will also be a VTEP.

Are the above two points correct?

If yes, then (related to Topology/Route-server Design):

  • I think the best way to stitch the VXLAN fabric is by introducing a DCI switch (which will run VXLAN/EVPN). The BGW can be connected to a DCI switch via an L2 trunk port.
  • The reason I say this is that if a packet originating from a server connected to a leaf in Site-1 gets encapsulated and then decapsulated on the BGW (Site-1), it cannot get encapsulated again (unless it does inter-VLAN routing) to exit the DC via a VXLAN tunnel, because of the split-horizon rule (a packet received on the Vx1 interface will not egress out of Vx1 again). I am saying this considering that the leafs within a DC are not aware of the VXLAN loopback IPs (tunnel IPs) of the leafs in other DCs.
  • So the tunnel must terminate on the BGW (Site-1) and exit via an L2 trunk port towards a DCI switch, where the traffic can get encapsulated again and reach the DCI of Site-2 or Site-3. There it will be decapsulated, go to the BGW (Site-2/Site-3) via an L2 trunk port, get VXLAN-encapsulated again on that BGW, and finally reach the destination VTEP, where it will be decapsulated and delivered to the end host.
  • You can run EVPN between the DCIs and eBGP between the BGW and DCI (between SVIs over the L2 trunk port). The BGW can advertise the Type-5 IP prefixes learnt from the downstream leaf switches as regular BGP updates. The DCIs will convert those BGP updates into Type-5 IP prefixes and advertise them to the other two DCIs. The other two DCIs will convert the received Type-5 IP prefixes back into regular BGP updates and send them to their downstream eBGP neighbour, the BGW. That BGW will again convert the regular BGP updates into Type-5 IP prefixes and advertise them to the spines, and the spines to the leafs. (A rough config sketch follows this list.)
  • Generally, hardware is a good option here (DC traffic), as ASIC forwarding is faster than software lookups, and the 7280 is specifically designed to handle large amounts of traffic with its deep buffering capability.
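To make the hand-off concrete, below is a rough, illustrative sketch of the DCI switch in Site-1 (all names, VLANs, VNIs, ASNs and IPs are made up, and the exact commands should be verified against your EOS release – this is only meant to show the idea, not a tested config):

! DCI switch (Site-1) - EVPN toward the remote DCIs, plain IPv4 eBGP toward the local BGW
service routing protocols model multi-agent
!
vrf instance TENANT-A
ip routing vrf TENANT-A
!
vlan 100
!
interface Ethernet1
   description L2 trunk toward the local BGW
   switchport mode trunk
   switchport trunk allowed vlan 100
!
interface Vlan100
   description per-tenant eBGP hand-off to the local BGW
   vrf TENANT-A
   ip address 10.1.100.1/30
!
interface Loopback0
   ip address 203.0.113.1/32
!
interface Loopback1
   ip address 198.51.100.1/32
!
interface Vxlan1
   vxlan source-interface Loopback1
   vxlan vrf TENANT-A vni 10100
!
router bgp 65101
   router-id 198.51.100.1
   ! EVPN toward the DCIs in the other two sites
   neighbor DCI-PEERS peer group
   neighbor DCI-PEERS remote-as 65201
   neighbor DCI-PEERS update-source Loopback0
   neighbor DCI-PEERS ebgp-multihop 5
   neighbor DCI-PEERS send-community extended
   neighbor 203.0.113.2 peer group DCI-PEERS
   neighbor 203.0.113.3 peer group DCI-PEERS
   !
   address-family evpn
      neighbor DCI-PEERS activate
   !
   vrf TENANT-A
      rd 198.51.100.1:100
      route-target import evpn 100:100
      route-target export evpn 100:100
      ! IPv4 prefixes learnt from the BGW over the SVI are re-originated
      ! toward the remote DCIs as EVPN Type-5 routes
      neighbor 10.1.100.2 remote-as 65100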

However, I would wait for others to comment on this as well, or would suggest getting in touch with an Arista SE for more details on the topology.

Regarding the number of VTEPs supported on the 7280 platform:

  • 7280R platforms can support up to 4,000 VNIs (i.e. 4,000 VLAN-to-VNI mappings) and 2,000 ECMP routes for remote VTEPs, or 14,000 non-ECMP routes.
  • I wasn’t clear about the part where you mentioned “number of sites (draft-sharma) currently supported”.

Regarding the filtering option:

  • This is also achievable by connecting the BGW to the DCI via an L2 port and establishing eBGP between them (using SVIs). When the BGW receives both Type-5 and Type-2 routes, it will only convert the Type-5 prefixes (as that is what regular BGP updates can carry) and send them to its eBGP peer, the DCI, and the DCI will convert them back into Type-5 IP prefixes to advertise to the other DCIs. The Type-2 updates will stay within the local DC itself (see the sketch below).
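A minimal sketch of the BGW side of that, just to show why the Type-2 routes never leave the site (again, all names, ASNs and IPs are placeholders, the underlay and Vxlan1 config is omitted for brevity, and syntax may vary by EOS release):

! Site-1 BGW - EVPN only inside the site; only IPv4 prefixes cross to the DCI
router bgp 65100
   ! EVPN toward the local route-servers (carries Type-2 and Type-5 within the site)
   neighbor RS peer group
   neighbor RS remote-as 65000
   neighbor RS update-source Loopback0
   neighbor RS ebgp-multihop 3
   neighbor RS send-community extended
   neighbor 192.0.2.1 peer group RS
   neighbor 192.0.2.2 peer group RS
   !
   address-family evpn
      neighbor RS activate
   !
   vrf TENANT-A
      rd 203.0.113.10:100
      route-target import evpn 100:100
      route-target export evpn 100:100
      ! Plain IPv4 eBGP to the DCI over the SVI - MAC (Type-2) routes have no
      ! equivalent on this session, so they stay within the local fabric
      neighbor 10.1.100.1 remote-as 65101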

Please let me know if I have understood something incorrectly and I will rephrase my answers accordingly.

Thanks,

Aniket
