VXLAN Routing with MLAG

Introduction

This document describes the operation and configuration of VXLAN routing on an Arista platform in conjunction with MLAG for redundancy. The configuration and guidance within the document, unless specifically noted, is based on the platforms and EOS releases noted in the table below.


Arista’s Multi-Chassis LAG (MLAG) technology provides the ability to build a loop-free, active-active layer 2 topology. The technology operates by allowing two physical Arista switches to appear as a single logical switch (an MLAG domain); third-party switches, servers or neighbouring Arista switches connect to the logical switch via a standard port-channel (static, passive or active), with the physical links of the port-channel split across the two physical switches of the MLAG domain. With this configuration all links and switches within the topology are active and forwarding traffic, and with no loops within the topology the configuration of spanning-tree becomes optional.

The focus of this document is the operation and configuration of an MLAG topology within a routed VXLAN overlay architecture.

Virtual eXtensible LAN (VXLAN) Overview

The VXLAN protocol is an IETF standard (RFC 7348) co-authored by Arista. The standard defines a MAC-in-IP encapsulation protocol allowing the construction of layer 2 domains across a layer 3 IP infrastructure. The protocol is typically deployed as a data center technology to create overlay networks across a transparent layer 3 infrastructure, with two common use cases:

  • Providing layer 2 connectivity between racks or halls of the data center, without requiring an underlying layer 2 infrastructure.
  • Logically connecting geographically dispersed data centers at layer 2, as a Data Center Interconnect (DCI) technology.

Figure 1 illustrates a typical layer 3 leaf-spine architecture deployed in a modern data center to simplify scale while delivering consistent throughput and latency for east-west traffic. In this example eBGP is deployed between the leaf and spine switches for flexible control of the advertised routes, but any standard dynamic routing protocol (OSPF, IS-IS, etc.) could be deployed. To provide traffic load-balancing across the four spine switches, Equal Cost Multi-Pathing (ECMP) is configured. For illustration purposes the topology includes two racks of servers, with the servers within each rack dual-homed to a pair of leaf switches in an MLAG configuration.
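As a point of reference, the following is a minimal sketch of the underlay eBGP and ECMP configuration on a leaf switch; the ASNs and spine peer addresses are hypothetical and would be replaced with the values of the actual deployment.

    router bgp 65101
       ! hypothetical leaf ASN and spine peer addresses
       maximum-paths 4
       neighbor 10.1.1.0 remote-as 65000
       neighbor 10.1.2.0 remote-as 65000
       neighbor 10.1.3.0 remote-as 65000
       neighbor 10.1.4.0 remote-as 65000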

To provide layer 2 connectivity between the racks, VXLAN is introduced as the overlay technology; this is achieved by configuring a VXLAN VTEP on the leaf switches. Given the servers are dual-homed to a pair of leaf switches in an MLAG configuration, a single logical VTEP is created for each MLAG domain. This is achieved by configuring the VTEP on both MLAG peers with the same Virtual Tunnel Interface (VTI) IP address, ensuring both MLAG peers VXLAN encapsulate any locally received traffic with the same source IP address.

VXLAN bridging, providing layer 2 connectivity across the IP fabric

The logical VTEP in combination with MLAG provides an active-active VXLAN topology: local traffic received by either peer switch can be VXLAN encapsulated, and any VXLAN encapsulated traffic (received via a spine switch) can be locally decapsulated and forwarded to the end device. To enable this behaviour, the local MAC table of a peer switch is synchronised across the peer link, along with any remote MACs learnt via VXLAN; this information includes the remote host’s MAC address along with the associated remote VTEP IP address.

For further reading on the configuration and operation of an Arista VTEP within an MLAG topology to provide layer 2 VXLAN bridging, please refer to https://eos.arista.com/vxlan-with-mlag-configuration-guide

VXLAN Routing

The deployment of VXLAN bridging provides layer 2 connectivity across the layer 3 leaf-spine underlay for hosts in the overlay network; to provide layer 3 connectivity between the hosts, VXLAN routing is required. VXLAN routing involves routing traffic based not on the destination IP address of the outer VXLAN header, but on the inner header, or overlay tenant, IP address. The concept is illustrated in the diagram below, where VXLAN routing is used to route traffic between hosts (Serv-1 and Serv-2) residing in different IP subnets of the overlay network.

As the default gateway for Serv-1, leaf-1 routes traffic destined to Serv-2 into Serv-2’s subnet (10.10.20.0/24). The destination MAC for Serv-2 has been learnt behind a remote VTEP (VTEP-2); thus to forward the traffic to Serv-2, leaf-1 will encapsulate and VXLAN bridge the frame to the remote VTEP. Receiving the frame, VTEP-2 decapsulates the VXLAN frame and, based on its local configuration, maps the VNI (1020) to VLAN 20. A MAC address lookup in VLAN 20 results in the packet being forwarded to Serv-2 on interface Ethernet 6. The packet walk of the traffic flow is illustrated below.

The traffic flow from Serv-1 to Serv-2 therefore results in layer 3 routing of the original frame at leaf-1 and VXLAN bridging to the remote VTEP, VTEP-2. Traffic flow in the opposite direction, with the default gateway for Serv-2 being 10.10.20.1, would in this topology result in VXLAN bridging between VTEP-2 and VTEP-1, and routing of the inner frame on leaf-1 for local forwarding to the final destination, Serv-1.

VXLAN Routing Topologies

The introduction of VXLAN routing into the overlay network can be achieved while maintaining separation between the overlay tenant subnets and the underlay routing architecture, via either a direct or an indirect routing model:

  • Direct Routing: The direct routing model provides routing at the first-hop leaf node (“direct routing”) for all subnets within the overlay network. This ensures optimal routing of the overlay traffic at the first-hop leaf switch, regardless of the subnet the host may reside within.
  • Indirect Routing: To reduce the amount of state (ARP/MAC entries and routes) each leaf node holds, allowing for greater scale within the overlay network, the leaf nodes in the indirect routing model only route for a subset of the overlay tenant subnets rather than all subnets within the overlay network.

As the indirect routing model is a derivative of the direct routing model, the following examples in this document illustrate the configuration steps for the direct routing model.

Direct routing model with MLAG

The direct routing model provides routing at the first-hop leaf node (“direct routing”) for all subnets within the overlay network. This ensures optimal routing of the overlay traffic at the first-hop leaf switch.

The model works by creating anycast IP addresses for the tenant subnets across each of the leaf nodes, providing a logical distributed router. Each leaf node acts as the default gateway for all the overlay subnets, allowing VXLAN routing to always occur at the first hop, regardless of the rack or subnet in which a host resides. The leaf nodes own and respond to ARP requests destined to the anycast IP and route traffic destined to the MAC of the anycast IP address, thus providing distributed routing functionality across all the leaf nodes.

In the topology illustrated above, a pair of leaf nodes are deployed in each rack in an MLAG configuration for resiliency, with the servers dual-homed to their local MLAG domain. Given the servers are dual-homed to pairs of leaf switches in an MLAG configuration, a single logical VTEP is created for each MLAG domain (VTEP-1 and VTEP-2).

The anycast gateway addresses (10.10.10.254 and 10.10.20.254) are defined for the two subnets within each MLAG domain. This allows the routing of traffic between Serv-1 and Serv-2, which are hosted in the same rack but different subnets, to occur at the first-hop leaf without the need for any VXLAN bridging. The same holds true for the routing of traffic between Serv-3 and Serv-4, which would occur locally on the directly attached leaf nodes of rack-2. The routing of traffic between hosts attached to different leaf nodes would also occur at the directly attached leaf node, but involve VXLAN bridging across the IP fabric to the remote host. The traffic flow for this type of communication (Serv-1 to Serv-4) is highlighted in the following steps.

Step 1 – With Serv-1’s default gateway being the anycast IP address for VLAN 10, the packet destined to Serv-4 has a destination MAC address of MAC-A, the MAC address of the anycast IP 10.10.10.254.

Step 2 – Due to the load-balancing algorithm of Serv-1’s port-channel, the frame destined for Serv-4 is received by leaf-11. With a destination MAC of MAC-A, the MAC of the anycast IP configured on leaf-11, the leaf switch routes the packet into the destination subnet (10.10.20.0/24), which is directly attached on leaf-11.

Step 3 – An ARP and MAC lookup on leaf-11 for the destination host (Serv-4) on VLAN 20 points to the logical VTEP (2.2.2.2) of the MLAG domain in rack-2, on VNI 1020.

Step 4 – Based on the routing table of leaf-11 for the destination IP address 2.2.2.2 (logical VTEP-2), the packet will be forwarded to a spine switch and subsequently routed by the spine switch to one of the two leaf switches (leaf-21 and leaf-22) which are advertising the logical VTEP-2 IP address.

Step 5 – Assuming the packet is hashed by the ECMP algorithm of the spine switch to leaf-21, on receiving the packet leaf-21 will decapsulate the frame (remove the VXLAN header) and, based on the VTEP’s local VNI-to-VLAN mapping, perform a lookup for the inner destination MAC (MAC-4) in VLAN 20. This results in the packet being forwarded out of interface Ethernet 7 with an 802.1Q tag of VLAN 20 to Serv-4.

Anycast IP address

To provide direct routing within each rack regardless of the host’s subnet, the leaf nodes of an MLAG domain are required to be configured with an IP interface in every subnet. In a configuration model using VARP or VRRP, this would result in each leaf node consuming an IP address within the tenant’s subnet in addition to the shared gateway IP address. For example, with a /24 subnet mask and 253 leaf nodes, all 254 usable IP addresses of the subnet would be consumed just by the leaf switches and the shared gateway IP address.

To conserve IP address space, the “ip address virtual” concept is introduced for the VTEP. When a VLAN interface is configured with the ‘ip address virtual’ option, a virtual IP address is assigned without the need to configure a physical IP address on the interface. With the “ip address virtual” representing the default gateway for the subnet and shared across all leaf nodes, only a single IP address is consumed for each subnet, rather than an IP address per leaf node participating in the routing for the subnet. As outlined in the sample configuration below, the virtual addresses across the leaf nodes are configured with a single virtual router MAC address (00:aa:aa:aa:aa:aa).
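The sample below is a minimal sketch of the model on a leaf node, assuming the two tenant subnets used throughout this document; the same configuration would be applied on every leaf node routing for the subnets.

    ip virtual-router mac-address 00:aa:aa:aa:aa:aa
    !
    interface Vlan10
       ip address virtual 10.10.10.254/24
    !
    interface Vlan20
       ip address virtual 10.10.20.254/24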

The following considerations should be taken into account when using the ‘ip address virtual’ option:

  • A routing adjacency cannot be formed over a VLAN interface configured with ‘ip address virtual’.
  • A VTEP configured with the ‘ip address virtual’ option will forward any ARP responses destined to the virtual router MAC to all neighbouring VTEPs via head-end replication (HER). This ensures neighbouring VTEPs hosting the same virtual IP address will receive ARP responses regardless of whether the host is local or remote from the VTEP.
  • In an MLAG configuration, ARP responses to the ‘ip address virtual’ are synchronised with the MLAG peer, to ensure consistency between the MLAG peers.
  • NOTE: The synchronisation of ARP responses between MLAG peers is achieved via the VXLAN agent, hence the ‘ip address virtual’ feature is only supported with a VXLAN configuration.

Virtual VTEP with the Anycast IP address

With the anycast IP address, an ARP request (an all-FFs broadcast packet) for the virtual MAC could potentially result in all VTEPs in the topology responding to the ARP request. To avoid this situation and ensure only a single VTEP responds to the ARP request, Arista platforms support virtual VTEP functionality (vVTEP); this allows the virtual MAC to sit behind a single virtual VTEP which is shared across the leaf switches owning the same anycast IP address. This means ARP requests to the virtual MAC are only replied to when sent to the virtual VTEP rather than the logical VTEP (i.e. 2.2.2.1 and 2.2.2.2 in the example below). Adding the virtual VTEP to the HER flood list of a VTEP ensures ARP requests are also forwarded to the virtual VTEP (2.2.2.4 in the diagram below).


The virtual VTEP address is specified by configuring a secondary address on the loopback interface designated as the VXLAN’s source interface.
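The following is a minimal sketch of this configuration, assuming the logical VTEP address (2.2.2.1) and virtual VTEP address (2.2.2.4) used in this example:

    interface Loopback1
       ip address 2.2.2.1/32
       ! secondary address used as the virtual VTEP
       ip address 2.2.2.4/32 secondary
    !
    interface Vxlan1
       vxlan source-interface Loopback1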

The virtual VTEP address can then be added to the flood list to ensure any ARP requests to the virtual MAC are sent to the virtual VTEP.
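As a sketch, from the perspective of leaf-11 the flood list for VLAN 10 would carry the remote logical VTEP (2.2.2.2) alongside the virtual VTEP (2.2.2.4):

    interface Vxlan1
       vxlan vlan 10 flood vtep 2.2.2.2 2.2.2.4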

In the topology above, where leaf-11 and leaf-21 are configured with the virtual MAC and virtual VTEP locally, adding the virtual VTEP to the flood list will only result in the ARP request being flooded to the virtual VTEP when the local loopback interface (2.2.2.4) is down. However, in the topology there may be a requirement to introduce a layer 2 only VTEP (VXLAN bridging only), for example a software VTEP within a vSwitch of a virtualised server. The Virtual Machines (VMs) attached to the software VTEP would still have a default gateway of the anycast IP address, but the software VTEP wouldn’t be able to resolve the ARP request to the anycast IP’s MAC directly. Instead the vSwitch would flood the request to all the VTEPs (leaf switches) in the VM’s VNI.

To ensure only one VTEP responds to the ARP request, the flood list of the layer 2 VTEP is configured with the virtual VTEP. As the virtual VTEP is a shared address used to respond to any ARP requests for the virtual MAC, regardless of which leaf switch responds to the ARP request, the vSwitch will learn the response behind the same virtual VTEP address (IP-4 in the above diagram).


Flooded traffic received by the virtual VTEP, apart from ARP packets, is dropped; the forwarding of flooded traffic is handled by the primary VTEP on the switch, which will also receive a copy. In the case of VTEP-1, the primary VTEP would be 2.2.2.1, which therefore needs to be added to the flood list of any remote VTEP.

ARP timer

With the configuration of “ip address virtual”, each VTEP will have a virtual IP for all subnets in the network and will respond to ARP requests locally and head-end replicate ARP replies destined to the virtual MAC. In the direct-routing scenario below, traffic from Serv-1 is routed by VTEP-1 to Serv-4. By performing the routing, VTEP-1 would learn the MAC of Serv-4 via the initial ARP request, but not via the subsequent bi-directional conversation, as the return traffic from Serv-4 would be routed by VTEP-2. Routing the traffic, VTEP-2 rewrites the source MAC of the inner frame (MAC-4 of Serv-4) to the system MAC of VTEP-2; thus VTEP-1 wouldn’t learn the MAC of Serv-4 and refresh its MAC table.

To avoid the MAC table entry for MAC-4 being flushed on leaf-11 after the default timeout (5 minutes) due to a lack of traffic, it is advised to configure the ARP ageing time (default 4 hours) to a value less than the configured MAC timeout. This configuration will force an ARP refresh on leaf-11 and consequently a re-learning of the MAC entry before the MAC is flushed. The ARP ageing timer is configured at the interface level with the CLI command ‘arp timeout <60-65535 seconds>’; the MAC timeout value is a global parameter, configured with the CLI command ‘mac address-table aging-time <10-1000000 seconds>’.
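As an illustrative sketch, with the MAC aging time left at its 300-second default, an ARP timeout of 240 seconds (a hypothetical value, chosen only to be below the MAC timeout) on each tenant SVI would trigger the ARP refresh before the MAC entry expires:

    interface Vlan10
       ! hypothetical ARP timeout, below the MAC aging time
       arp timeout 240
    !
    interface Vlan20
       arp timeout 240
    !
    mac address-table aging-time 300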

Direct Routing configuration

The following example provides the steps to configure the direct routing topology illustrated in the figure below.

In the configuration, two tenant networks have been created, both of which exist in each of the MLAG domains; to provide direct routing for the two subnets, anycast IP addresses have been configured within each of the MLAG domains. With the anycast IP addresses being the default gateways for the servers in each of the racks, traffic can be directly routed for both subnets at either leaf switch.

Within each of the racks a pair of leaf nodes are configured as an MLAG domain, with the relevant servers dual-homed using a port-channel which is split across the two physical leaf nodes for resiliency. The MLAG configuration for rack-1 is illustrated below.

The configuration steps are only illustrated for the switches in rack-1; the same steps would be repeated for the switches in rack-2. The first four steps follow a standard MLAG configuration. For the 7050X and 7250 platforms, please refer to the document “VXLAN routing on the 7050X/7250 platform” for the additional configuration steps required to enable VXLAN routing on these platforms.


Step 1: Create the port-channel (Port-Channel 1000) between the two leaf switches which will be used as the MLAG peer link.
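A minimal sketch of this step, applied on both leaf switches; the member interfaces (Ethernet 1 and 2) and the trunk group name are hypothetical:

    interface Ethernet1
       ! hypothetical peer-link member interfaces
       channel-group 1000 mode active
    !
    interface Ethernet2
       channel-group 1000 mode active
    !
    interface Port-Channel1000
       switchport mode trunk
       switchport trunk group mlagpeer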

Step 2: Create the peer VLAN (4094) and peer link IP addresses on both switches; the peer IP is used for heartbeat and MAC address synchronisation between the peers.
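A sketch of this step on leaf-11, with hypothetical peer addressing (172.16.0.1/30 on leaf-11, 172.16.0.2/30 on leaf-12); the trunk group restricts VLAN 4094 to the peer link:

    vlan 4094
       trunk group mlagpeer
    !
    no spanning-tree vlan 4094
    !
    interface Vlan4094
       ! leaf-12 would be configured with 172.16.0.2/30
       ip address 172.16.0.1/30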

Step 3: Configure the MLAG domain on both peers, using the configured port-channel as the peer link and interface VLAN 4094 as the peer address.
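A sketch of the MLAG domain definition on leaf-11, assuming the hypothetical domain-id and peer addressing above (leaf-12 would use a peer-address of 172.16.0.1):

    mlag configuration
       domain-id rack1
       local-interface Vlan4094
       peer-address 172.16.0.2
       peer-link Port-Channel1000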

Step 4: Configure the MLAG port-channels (Port-Channel 10 and 20) on interfaces Ethernet 6 and Ethernet 7 respectively on both peers.
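A sketch of this step, identical on both peers, assuming the server-facing port-channels are access ports in VLANs 10 and 20 respectively:

    interface Ethernet6
       channel-group 10 mode active
    !
    interface Port-Channel10
       switchport access vlan 10
       mlag 10
    !
    interface Ethernet7
       channel-group 20 mode active
    !
    interface Port-Channel20
       switchport access vlan 20
       mlag 20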


Step 5: With the MLAG domain created, configure the anycast IP addresses for the two tenant subnets; these are created using the ‘ip address virtual’ configuration model. The “ip virtual-router mac-address 00:aa:aa:aa:aa:aa” command defines the virtual MAC address used by the anycast IP addresses.
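A sketch of this step on the rack-1 peers, creating the two tenant VLANs and their anycast gateways:

    vlan 10,20
    !
    ip virtual-router mac-address 00:aa:aa:aa:aa:aa
    !
    interface Vlan10
       ip address virtual 10.10.10.254/24
    !
    interface Vlan20
       ip address virtual 10.10.20.254/24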

Step 6: Create the loopback interface to be used as the source IP address for the logical VTEP of the MLAG domain (2.2.2.1/32); the same IP address will be defined on both MLAG peers. The virtual VTEP IP address (2.2.2.4/32) is configured as a secondary IP under the same loopback interface.

Step 7: Assign the Loopback 1 interface to the Virtual Tunnel Interface (VTI) of the VTEP on both leaf switches.
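A sketch of steps 6 and 7, applied identically on leaf-11 and leaf-12:

    interface Loopback1
       ip address 2.2.2.1/32
       ! secondary address used as the virtual VTEP
       ip address 2.2.2.4/32 secondary
    !
    interface Vxlan1
       vxlan source-interface Loopback1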

Step 8: Map VLAN 10 to VNI 1010 and VLAN 20 to VNI 1020 on both peer switches.
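A sketch of the VLAN-to-VNI mappings:

    interface Vxlan1
       vxlan vlan 10 vni 1010
       vxlan vlan 20 vni 1020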

Step 9: Configure the flood list for the two created VNIs, which will contain the logical VTEP of rack-2 (2.2.2.2), a member of both VNIs 1010 and 1020. This means the logical VTEP (2.2.2.2) on the peers leaf-21 and leaf-22 will receive any broadcast, unknown unicast or multicast (BUM) frames for the VNIs, allowing the learning of MAC addresses and the forwarding of broadcasts between the two VTEPs. In the configuration of the logical VTEP on leaf-21 and leaf-22, the logical VTEP 2.2.2.1 would be added to the flood list on both switches.
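A sketch of the flood list on leaf-11 and leaf-12 (on leaf-21 and leaf-22 the flood list would instead point at 2.2.2.1):

    interface Vxlan1
       vxlan vlan 10 flood vtep 2.2.2.2
       vxlan vlan 20 flood vtep 2.2.2.2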

Step 10: To provide IP connectivity between the two logical VTEPs, the loopback IP address of the VTI and its secondary address need to be advertised into the BGP of the leaf-spine IP fabric. The virtual VTEP (2.2.2.4) only needs to be announced into the BGP underlay network when a layer 2 VTEP is added to the topology.
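A sketch of the loopback advertisement, assuming the hypothetical leaf ASN from the earlier underlay example:

    router bgp 65101
       network 2.2.2.1/32
       ! 2.2.2.4/32 only required once a layer 2 VTEP is present
       network 2.2.2.4/32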

With the leaf switches announcing their respective logical VTEP into the underlay BGP routing topology, each leaf switch learns four equal-cost paths (via the four spine switches) to its neighbouring logical VTEP. This is illustrated in the routing tables of leaf-11 and leaf-12 below, each of which learns four equal-cost paths to the logical VTEP of rack-2 (2.2.2.2). Note that as the virtual VTEP (2.2.2.4) is locally configured on both switches, it is not learnt via eBGP.


With the direct routing model, the tenant subnets exist only on the leaf switches; there is therefore no need to announce the tenant “overlay” subnets into the underlay BGP routing topology. The spine switches are transparent to the overlay subnets and only learn the logical VTEP addresses of the leaf switches. With the VXLAN interface(s) configured with a flood list of the neighbouring leaf’s logical VTEP address and the loopback interfaces announced into BGP, layer 2 and layer 3 connectivity between the servers in the racks is now possible. With traffic successfully flowing between the servers, below is the resultant MAC and VXLAN address table for the leaf switches.