MLAG: Traffic flow for single-homed hosts


Objective

The objective of this document is to explain the traffic flows, best practice designs, and configuration details when single-homed devices are connected to an MLAG domain. 

It is assumed that the reader is familiar with the concept of Leaf-Spine fabrics, MLAG, and VXLAN. More details about these concepts can be found on EOS Central. Recommended articles are:

MLAG – Basic configuration
MLAG – Advanced configuration
VXLAN bridging with MLAG
VXLAN routing with MLAG

Introduction

Arista’s Multi-Chassis LAG (MLAG) technology provides the ability to build a loop-free, active-active Layer 2 topology. The technology operates by allowing two physical Arista switches to appear as a single logical switch (MLAG domain). Third-party switches, servers, or neighbouring Arista switches connect to the logical switch via a standard port-channel (static, LACP passive, or LACP active), with the physical links of the port-channel split across the two physical switches of the MLAG domain. With this configuration, all links and switches within the topology are active and forwarding traffic; because there are no loops in the topology, the configuration of spanning-tree becomes optional.
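
For reference, a minimal sketch of the MLAG domain configuration on one of the two switches (VLAN, addresses, and the domain name are illustrative; Port-Channel10 matches the Po10 peer-link referenced later in this article):

vlan 4094
   trunk group mlag-peer
!
interface Port-Channel10
   switchport mode trunk
   switchport trunk group mlag-peer
!
interface Vlan4094
   ip address 10.0.0.1/30
!
mlag configuration
   domain-id mlag1
   local-interface Vlan4094
   peer-address 10.0.0.2
   peer-link Port-Channel10

The second switch is configured identically apart from its local peering address (10.0.0.2/30 in this sketch).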

A dual-homed host connecting to the MLAG domain will use a standard port-channel. Traffic transmitted by the host will be hashed across the physical links of the port-channel. The hashing algorithm is dependent on the host’s implementation, but will typically use a combination of source and destination IP addresses, source and destination port numbers, and source and destination MAC addresses. The hashing algorithm ensures that a given traffic flow is always hashed to the same physical link of the port-channel. Consequently, certain traffic flows from the host will be forwarded to one of the switches in the MLAG domain, while other flows, based on the hash, will be forwarded to the other switch of the domain. With optimal load sharing, traffic would be split equally between the two switches of the MLAG domain; however, the distribution achieved in a real-world scenario depends on the entropy of the traffic flows and the hashing algorithm implemented.

Through configuration, you can specify the seed for the hashing algorithm that balances load across the ports comprising a port-channel. The command used for this is “port-channel load-balance”. Available seed values vary by switch platform; see the EOS configuration manual for details.
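
The generic form is sketched below; the platform keyword and the range of valid seed values depend on the switch model, so treat this as an illustrative template rather than exact syntax:

switch(config)# port-channel load-balance <platform> seed <value>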

To provide optimal active-active forwarding, hosts should be connected (via a split port-channel) to both physical switches of the MLAG domain. However, not all servers support being dual-homed, so there is sometimes a need to single-home hosts. This scenario is illustrated in the figure below.
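
In configuration terms, the difference between the two attachment types can be sketched as follows (interface, VLAN, and MLAG identifiers follow the example topology and are otherwise illustrative):

! On both Leaf-1 and Leaf-2: MLAG port-channel toward the dual-homed host
interface Ethernet2
   channel-group 1 mode active
!
interface Port-Channel1
   switchport access vlan 10
   mlag 1
!
! On Leaf-1 only: ordinary access port toward the single-homed host
interface Ethernet1
   switchport access vlan 10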


Traffic flows

Bridged traffic within an MLAG domain

In this scenario, a single-homed host (Host-1) communicates with a dual-homed host (Host-2) on the same subnet (VLAN 10). Since Host-2 is dual-homed on VLAN 10, the VLAN has been configured on both switches (Leaf-1 and Leaf-2) of the MLAG domain.

Let’s look at the traffic flow when Host-1 communicates with Host-2. 

1. Host-1 sends a packet with Host-2 as the destination. Leaf-1 receives the packet and learns that the MAC address of Host-1 is reachable on Eth1. As part of the MLAG protocol, the MAC address is synced to Leaf-2; because Host-1 is single-homed, Leaf-2 learns the MAC address as reachable via the peer-link.

2. Assuming Leaf-1 has already learned that the MAC address of Host-2 is reachable through Po1, the packet is sent to Host-2 via the local interface of Po1 on Leaf-1 (Eth2). As a rule, a physical switch of an MLAG domain will always forward traffic to the destination via a local port if one is available. A remote port on the MLAG peer switch, and therefore the peer-link, is only used when no local port is available.

When Host-2 sends traffic to Host-1 it uses its load-balancing algorithm to hash the traffic flow to one of the two physical links of the port-channel, connected to Leaf-1 and Leaf-2. This will create two different traffic flows, one where traffic is sent to Leaf-1 and one where traffic is sent to Leaf-2. In this first use case, we will look at both flows.

Flow 1: Traffic sent to Leaf-1

3. The return traffic flow is hashed to the link connected to Leaf-1. Note that this decision is made by Host-2. The MLAG switches have nothing to do with which link gets chosen.

4. As Host-1 is directly attached to Leaf-1, it will forward traffic out interface Eth1 to Host-1. 

The important thing to understand with this traffic flow is that it will never put any load on the peer-link. In the following use cases, this flow will not be described in detail as it does not add anything to what you need to consider when attaching single-homed hosts to an MLAG pair.

Flow 2: Traffic sent to Leaf-2

3. The return traffic flow is hashed to the link connected to Leaf-2.

4. When Leaf-1 synced the MAC address of Host-1 with Leaf-2, Leaf-2 learned that the MAC address of Host-1 is reachable via the peer-link port-channel (Po10). The traffic is therefore forwarded by Leaf-2 across the peer-link to Leaf-1.

5. Receiving the traffic on the peer-link, Leaf-1 performs a MAC lookup for Host-1, and forwards the traffic out of interface Eth1 to Host-1.

Configuration considerations

Since VLAN 10 is already configured on both leaf switches, no specific configuration needs to be added to Leaf-2 to handle the traffic flows that get hashed by Host-2 to Leaf-2. Leaf-2 will automatically learn that MAC addresses of single-homed hosts connected to Leaf-1 are reachable over the peer link, as per standard MLAG behavior. Similarly, MAC addresses of single-homed hosts connected to Leaf-2 will automatically be learned by Leaf-1.
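
The only configuration this use case requires is the VLAN itself; a minimal sketch, with a verification step (on Leaf-2, the MAC address of Host-1 should be listed against the peer-link port-channel):

! On both Leaf-1 and Leaf-2
vlan 10

! Verify MAC learning over the peer-link (on Leaf-2)
leaf-2# show mac address-table vlan 10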

Routed traffic within a leaf MLAG pair

In this scenario, a single-homed host (Host-1) on one subnet (VLAN 10) communicates with a dual-homed host (Host-2) on another subnet (VLAN 20). For these two hosts to be able to communicate, a routing interface must be set up for each VLAN. Typically, these routing interfaces are configured as gateways on the leaf switches. As per the recommended MLAG design, the gateway for VLAN 20 is set up with VARP and a virtual router IP address that is shared by both leaf switches. But what about the gateway for VLAN 10? Is it enough to configure it on Leaf-1?

 Again, let’s look at the traffic flow when Host-1 communicates with Host-2.


1. Host-1 sends a packet with Host-2’s IP address as the destination. The default gateway for VLAN 10 is 10.10.10.1, which is configured on interface VLAN10 on Leaf-1.

2. Leaf-1 will route the packet to VLAN 20 and send it directly to Host-2 out of the local interface (Eth2) of port-channel Po1.

3. When Host-2 sends a packet back to Host-1 it must first send it to the gateway for VLAN 20. That gateway is the virtual router IP (10.10.20.1) that is shared between the leaf switches. Host-2, like in the bridged use case above, will hash the traffic flow to one of the two links of the port-channel (Po1) which is connected to both Leaf-1 and Leaf-2. If the packet is sent to Leaf-1, traffic will be routed by Leaf-1 to VLAN10 and then forwarded to Host-1 out of Eth1. If, however, the traffic is hashed to the link connected to Leaf-2 the traffic flow is as follows:

4. When the packet arrives at Leaf-2, Leaf-2 must route it to VLAN 10. If the gateway for VLAN 10 has only been configured on Leaf-1, the packet must first be routed to a switch where VLAN 10 has a routing interface. If the data center fabric is Layer 3, the spines would have advertised the VLAN 10 subnet to Leaf-2, so there would be a route for the packet via the spine switches. However, this path is longer than traversing the peer link, so from a latency point of view the best solution is to add a gateway for VLAN 10 on Leaf-2. This also allows single-homed hosts on VLAN 10 to be connected to Leaf-2. The gateway for VLAN 10 is created the same way as for VLAN 20, i.e. with a virtual router IP. With a VLAN 10 interface active on Leaf-2, Leaf-2 routes the packet to VLAN 10. Leaf-2 has learned from Leaf-1 that the MAC/ARP entry for Host-1 points to the peer-link (Po10), so the packet is transmitted across the peer link.

5. Receiving the packet on the peer-link, Leaf-1 bridges the packet to the local interface connected to Host-1 (Eth1).

Configuration considerations

As pointed out in step 4, Host-2 will hash traffic flows to both Leaf-1 and Leaf-2. For Leaf-2 to be able to route traffic to VLAN 10, it should be configured with an IP address on VLAN 10. By configuring VLAN 10 with a virtual router IP that is shared by both leaf switches, single-homed hosts on VLAN 10 can also be connected to Leaf-2. Just as in the previous use case, MLAG functionality will make sure that both leaf switches automatically learn about single-homed hosts connected to the other leaf.
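
A sketch of the resulting gateway configuration, identical on both leaf switches apart from the physical SVI addresses (the virtual router MAC address and physical SVI addresses are illustrative; the gateway addresses follow the example):

ip virtual-router mac-address 00:1c:73:00:00:99
!
interface Vlan10
   ip address 10.10.10.2/24
   ip virtual-router address 10.10.10.1
!
interface Vlan20
   ip address 10.10.20.2/24
   ip virtual-router address 10.10.20.1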

For more information about VARP, see Active-active router redundancy using VARP.

Bridged or routed traffic between racks in a Layer 2 Leaf-Spine fabric

In a Layer 2 Leaf-Spine fabric, the spine switches are configured as an MLAG pair, with port-channels going to each leaf switch. This means a traffic flow between hosts connected to different racks in the data center will first be hashed by a leaf switch to one of the spine switches, and then hashed again by the spine switch to one of the leaf switches in the rack where the destination host resides. With perfect load balancing in the data center, 50 percent of the traffic would arrive at Leaf-1 and 50 percent at Leaf-2. In theory, however, anywhere between 0 and 100 percent of the flows can land on either leaf switch. The load balancing behavior should be monitored and, if needed, the hashing algorithm used on the switches adjusted to get as close to 50 percent as possible.
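
How the distribution is inspected varies by platform; commands along the following lines can help (a sketch, not an exhaustive method):

leaf-1# show port-channel traffic
leaf-1# show interfaces counters rates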

Traffic from a host in a remote rack to a single-homed host in a local rack can arrive at either of the leaf switches in the local MLAG leaf pair. This will create two different traffic flows within the local rack. In the first flow, traffic arrives at Leaf-1 from either Spine-1 or Spine-2, as shown in the picture below.

1. Traffic arrives at Leaf-1 from either Spine-1 (Eth5) or Spine-2 (Eth6).

2. Since Host-1 is directly attached to Leaf-1, it will simply send traffic out the interface (Eth1) connected to Host-1. 

In the second traffic flow, traffic arrives at Leaf-2 from either Spine-1 or Spine-2, as shown in the picture below.


1. Traffic arrives at Leaf-2 from either Spine-1 (Eth5) or Spine-2 (Eth6).

2. As pointed out in the previous use cases for traffic flows inside an MLAG pair, when the traffic arrives at the leaf switch that the single-homed host is not connected to, it has to be sent across the peer-link to the other leaf switch.

3. Leaf-1 sends the traffic to Host-1 via Eth1.

Configuration considerations 

The configuration needed for this use case is the same as for the use cases where the traffic flows are local to the MLAG pair. VLANs must be configured on both leaf switches, and in case gateways for those VLANs are needed they must also be configured on both leaf switches.

VXLAN bridged or routed traffic between racks in a Layer 3 Leaf-Spine fabric

In this scenario, we have a Layer 3 Leaf-Spine fabric. Bridging and routing traffic between hosts in different racks is made possible using a VXLAN overlay.

The current VXLAN design specifies that only one VTEP IP can be configured per MLAG pair. This VTEP IP is a logical VTEP that is shared between the leaf switches in the pair. This IP is distributed via the underlay routing protocol to all spine switches. This way ECMP is achieved for traffic between racks in the data center.
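
A sketch of the shared VTEP configuration, identical on both MLAG peers (2.2.2.1 is the logical VTEP IP used in this example; the VNI number is illustrative):

interface Loopback1
   ip address 2.2.2.1/32
!
interface Vxlan1
   vxlan source-interface Loopback1
   vxlan udp-port 4789
   vxlan vlan 10 vni 10010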

From the perspective of traffic paths for single-homed hosts, it does not matter if the two communicating hosts are on the same or on different subnets. If they are on different subnets, routing between the subnets occurs at the local or remote leaf pair, depending on which VXLAN routing model is used (direct or indirect). In both cases, traffic traverses the fabric in a VXLAN tunnel.

Let’s look at the traffic flow when Host-1 communicates with a host in a remote rack.

1. Host-1 sends a packet destined to a host in a remote rack.

2. If the other host is on another subnet, Leaf-1 may first route the packet to the correct VLAN. During the initial ARP process, when Host-1 requested the MAC address of the remote host, Leaf-1 learned which VTEP the remote host’s MAC address sits behind. This information is also synced to Leaf-2 as per standard MLAG behavior. Leaf-1 will VXLAN-encapsulate the packet and send it to the remote VTEP.

3. When the remote host sends traffic back to Host-1, the remote leaf switch will VXLAN-encapsulate the packet and send it to the logical VTEP of the local leaf pair (2.2.2.1). Due to the ECMP underlay of the Leaf-Spine fabric, the packet can arrive at either Leaf-1 or Leaf-2, as shown in the spine routing table. Just like in the previous use cases, if the packet arrives at Leaf-1, it will be sent directly to Host-1. Let’s look at the other case, where the packet arrives at Leaf-2.

4. When the packet arrives at Leaf-2, Leaf-2 decapsulates the VXLAN packet. If the inner packet is on a different VLAN than Host-1’s, Leaf-2 routes it to the correct VLAN. Through the MLAG sync process, Leaf-2 has learned from Leaf-1 that the MAC address of Host-1 is reachable through Po10, so it sends the packet over the peer link.

5. Leaf-1 delivers the packet to Host-1 via Eth1.

Configuration considerations 

Currently, the VXLAN design in EOS requires the switches in an MLAG pair to act as a single VTEP, and therefore to share the same VTEP IP. The VLAN-to-VNI mappings must also be identical on both switches. Traffic from single-homed hosts is thus encapsulated with the same source VTEP IP regardless of whether they are on the same VLAN as the dual-homed hosts or a different one. Consequently, there is no way to prevent traffic destined to a single-homed host from arriving at the leaf switch the host is not connected to and traversing the peer link.

If routing in the local MLAG pair is a requirement, a shared gateway IP must be configured on both leaf switches. In a direct routing model (routing at every leaf), the “ip virtual-router address” approach would consume a large number of IP addresses in a big data center, since each SVI also needs a unique physical IP address. For this reason, it is instead possible to configure the same anycast IP address on the SVI of all leaf switches, without the need for a unique physical IP address on each SVI.
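
A sketch of the anycast variant, identical on every leaf switch (the subnet follows the earlier example; the virtual router MAC address is illustrative):

ip virtual-router mac-address 00:1c:73:00:00:99
!
interface Vlan10
   ip address virtual 10.10.10.1/24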

This solution is only possible in a VXLAN design. For more information, see Difference between “ip virtual-router address” and “ip address virtual”.

External controller integrated with the VXLAN Control Service (VCS)

When using an external OVSDB controller to manage VTEPs, great care must be taken with non-MLAG interfaces (which the ports connecting single-homed hosts typically are). These interfaces can be configured by the external controller as long as the VLAN-to-VNI mappings of the two MLAG peers are consistent and do not conflict with each other. To ensure the external controller creates consistent VLAN-to-VNI mappings across the peers, the following workaround can be used: create an MLAG interface where the port-channel on the side without the single-homed host has no member interfaces, forcing traffic across the peer link to the side with the device. The MLAG peer connected to the single-homed host is configured with a port-channel with a single member and with LACP disabled. A sample configuration is as follows:
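
(An illustrative reconstruction based on the description above; interface, VLAN, and MLAG identifiers are hypothetical.)

! Leaf-1: connected to the single-homed host; static port-channel (LACP disabled) with one member
interface Ethernet1
   channel-group 5 mode on
!
interface Port-Channel5
   switchport access vlan 10
   mlag 5
!
! Leaf-2: the same MLAG interface, but with no member interfaces
interface Port-Channel5
   switchport access vlan 10
   mlag 5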

Design considerations

As shown in the different traffic flow use cases above, when single-homed hosts are connected to a leaf switch in an MLAG pair, there will be a steady flow of traffic over the MLAG peer link. In comparison, when only dual-homed hosts are connected to an MLAG leaf pair, only control traffic traverses the peer link in a non-failure state. Only in certain failure scenarios will host traffic traverse the peer link.

Worth noting is that under steady-state conditions, traffic sourced from a single-homed host will not traverse the peer link in the local MLAG leaf pair, except when a single-homed host connected to one leaf switch sends traffic to a single-homed host connected to the other leaf switch in the pair. Traffic destined for a single-homed host, on the other hand, will traverse the MLAG peer link whenever it is hashed to the leaf switch the host is not connected to.

For this reason, it is important that the bandwidth of the peer link is high enough to carry the steady flow of host-to-host traffic. When calculating the amount of bandwidth to allocate for the peer link, the following must be taken into consideration:

  • The number of single-homed hosts
  • The bandwidth the single-homed hosts are connected with
  • Which of the traffic flow use cases above are present in the data center
  • Planned oversubscription ratio

Typically, the peer link should be able to carry 50 to 100 percent of the aggregate bandwidth with which your single-homed hosts are connected. On top of this, add the bandwidth needed to accommodate traffic in failure scenarios.
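
As a hypothetical example: an MLAG pair with 20 single-homed hosts, each attached at 10 Gbps, would by this rule of thumb call for roughly 100 to 200 Gbps of peer-link capacity, for instance three to five 40 Gbps member links, before adding the headroom needed for failure scenarios.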
