If you’ve tried implementing VXLAN on Trident VTEPs with a Layer-2 uplink to remote VTEPs, you may have noticed that packets sometimes can’t be forwarded across the VXLAN fabric, or that latency is higher than expected.
Alternatively, maybe you never noticed those symptoms and are here instead because you saw the following log on your switch: STRATA-6-VXLAN_PORT_TO_NEXTHOP_OVERFLOW.
These are the most common observations made when a Trident switch has hit the following limitation, 113722, found in the EOS Release Notes:
Limitations and Restrictions in 4.26.1F
7300X3, 7320X, 7368, CCS-720XP, CCS-750, DCS-7010, DCS-7050X,
DCS-7050X2, DCS-7050X3, DCS-7060X, DCS-7060X2, DCS-7060X4,
DCS-7250X, DCS-7260X, DCS-7260X3 and DCS-7300X Series
<snip to relevant section>
Packets being VXLAN encapsulated are constrained to a single underlay physical nexthop per physical egress port. Topologies involving SVIs over trunk ports, or non point-to-point routed ports towards the core will not work optimally, the encapsulated packets will be sent to the wrong next-hop for all but one of the VTEPs. Topologies involving virtual VTEPs connected via a downstream L2 switch will not work, as the receiving vswitch will not have the capability to route the packet to the right VTEP. (113722)
The purpose of this article is to delve into an example and provide insight into the limitation. Please note that while we will briefly discuss a potential workaround, our design and deployment guides use L3 point-to-points for uplinks because they scale better and offer the benefits of ECMP.
Consider the following topology:
We see that there are three VTEPs in the VXLAN fabric, connected via L2 uplinks to our L2 Spine/Core. This could be a single switch or, more likely, a larger mesh.
BGP peering is performed between SVIs to form our underlay: VTEP1 and VTEP2 peer via VLAN 55, VTEP1 and VTEP3 via VLAN 65, and VTEP2 and VTEP3 via VLAN 52.
Lastly, EVPN is used to dynamically update our control-plane.
Based on the above, and our topology, we expect that if Host1 pings Host2, Host1 need only do an ARP lookup for the remote host, since both exist in the same network, 10.100.10.0/24. The resulting packet will have the source MAC and IP addresses of Host1 and the destination MAC and IP addresses of Host2.
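The same-subnet check that drives Host1's decision can be sketched with Python's standard ipaddress module. The host addresses below come from the packet capture later in this article; the /24 prefix length is an assumption based on the topology description.

```python
import ipaddress

# Addresses taken from the article's packet capture; the /24 mask is
# assumed from the topology description.
network = ipaddress.ip_network("10.100.10.0/24")
host1 = ipaddress.ip_address("10.100.10.101")
host2 = ipaddress.ip_address("10.100.10.36")

# Both hosts fall inside the same prefix, so Host1 resolves Host2's
# MAC directly via ARP rather than sending the packet to a gateway.
same_subnet = host1 in network and host2 in network
print(same_subnet)  # True -> ARP for Host2 directly
```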
This packet will ingress on VTEP1, which will do a MAC lookup for Host2.
VTEP1#show mac address-table vlan 100
          Mac Address Table
------------------------------------------------------------------
Vlan    Mac Address       Type        Ports      Moves   Last Move
----    -----------       ----        -----      -----   ---------
 100    001c.7300.0099    STATIC      Re1
 100    001c.73f9.01e7    DYNAMIC     Vx1        1       3:19:28 ago
 100    7483.ef00.58e1    STATIC      Re1
 100    985d.8276.0dbd    DYNAMIC     Et54/1     1       12:44:11 ago
 100    985d.82c1.21fb    DYNAMIC     Vx1        1       3:19:28 ago
Since VLAN 100 is extended via VXLAN, we’ll see Vx1 listed in the Ports column, which tells us to consult the VXLAN address table for further forwarding details.
VTEP1#show vxlan address-table address 985d.82c1.21fb
          Vxlan Mac Address Table
----------------------------------------------------------------------
VLAN  Mac Address     Type  Prt  VTEP         Moves   Last Move
----  -----------     ----  ---  ----         -----   ---------
 100  985d.82c1.21fb  EVPN  Vx1  10.55.31.1   1       3:20:45 ago
The VXLAN address table shows that Host2’s MAC was learned from VTEP2.
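The two-stage lookup VTEP1 performs can be modelled as a pair of table lookups. The entries below mirror the CLI output above, but the dictionaries and the function name are purely illustrative; they are not an EOS API.

```python
# Illustrative model of VTEP1's forwarding decision, not an EOS API.
# Entries mirror the MAC and VXLAN address tables shown above.
mac_table = {
    ("100", "985d.82c1.21fb"): "Vx1",     # learned behind the fabric
    ("100", "985d.8276.0dbd"): "Et54/1",  # locally attached host
}
vxlan_table = {
    ("100", "985d.82c1.21fb"): "10.55.31.1",  # remote VTEP IP (VTEP2)
}

def lookup(vlan, mac):
    port = mac_table.get((vlan, mac))
    if port == "Vx1":
        # Port Vx1 means the MAC lives behind a remote VTEP: consult the
        # VXLAN address table and encapsulate toward that VTEP's IP.
        return ("vxlan-encap", vxlan_table[(vlan, mac)])
    return ("bridge", port)

print(lookup("100", "985d.82c1.21fb"))  # ('vxlan-encap', '10.55.31.1')
print(lookup("100", "985d.8276.0dbd"))  # ('bridge', 'Et54/1')
```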
At this point, we expect that VTEP1 will encapsulate the packet using its own VTEP IP as the source and VTEP2’s IP, 10.55.31.1, as the destination. Next, we need to do a route lookup for the destination VTEP to continue tracing the forwarding path.
VTEP1#show ip route 10.55.31.1
<snip>
 B E      10.55.31.1/32 [200/0] via 10.55.55.2, Vlan55
Our next-hop is 10.55.55.2 out VLAN 55, so we need to check the ARP entry for 10.55.55.2 to determine our outbound interface and the destination MAC for the VXLAN-encapsulated packet.
Since we have a Layer-2 underlay, we expect to see VTEP2’s MAC address and VLAN 55; EOS also records the outbound interface in the ARP entry, so we don’t need a separate MAC lookup on VLAN 55 to find the interface.
VTEP1#show arp 10.55.55.2
Address         Age(sec)  Hardware Addr   Interface
10.55.55.2      0:00:19   2899.3abe.ea26  Vlan55, Port-Channel1000
Given the above, we expect to forward the packet out interface Port-Channel1000 with a VLAN tag of 55 and a destination MAC of 2899.3abe.ea26. However, a packet capture of this switch’s transmitted traffic shows the following:
10:23:22.984189 74:83:ef:00:58:e1 > 28:99:3a:be:64:06, ethertype 802.1Q (0x8100), length 187: vlan 65, p 0, ethertype IPv4, 10.55.10.11.21881 > 10.55.31.1.4789: VXLAN, flags [I] (0x08), vni 100100
98:5d:82:76:0d:bd > 98:5d:82:c1:21:fb, ethertype IPv4 (0x0800), length 114: 10.100.10.101 > 10.100.10.36: ICMP echo request, id 14396, seq 644, length 80
All of our forwarding tables look right and the destination IP is correct, yet we’re sending to the wrong MAC, in the wrong VLAN!
This is the result of the limitation described above: the switch can only program a single next-hop address for all remote VTEPs, per physical interface. Since our topology uses BGP peering between SVIs, with two BGP peers (one per remote VTEP) but only a single trunk port as our physical uplink, we’ve exceeded the limit and our forwarding is impacted.
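A minimal sketch of the hardware constraint may help: the chip keeps at most one underlay next-hop per physical egress port for VXLAN encapsulation, and the first tunnel created wins. The class and method names below are ours for illustration only; they do not correspond to any EOS or SDK API.

```python
# Illustrative model of the Trident per-port next-hop constraint.
# Class and method names are invented for this sketch.
class PortNextHopTable:
    def __init__(self):
        self.programmed = {}   # port -> the one next-hop MAC that "won"
        self.overflowed = {}   # port -> next-hop MACs that failed to program

    def program(self, port, nh_mac):
        if port not in self.programmed:
            self.programmed[port] = nh_mac  # first tunnel on the port wins
            return True
        # Any later tunnel on the same port overflows: its traffic is
        # still encapsulated, but sent via the already-programmed next-hop.
        self.overflowed.setdefault(port, []).append(nh_mac)
        return False

table = PortNextHopTable()
# In our example VTEP3's tunnel happened to be created first...
ok1 = table.program("Port-Channel1000", "28:99:3a:be:64:06")  # True
# ...so VTEP2's tunnel fails to program ('None' in the platform output).
ok2 = table.program("Port-Channel1000", "28:99:3a:be:ea:26")  # False
```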
How do we know which VTEP will receive all of our VXLAN-encapsulated traffic?
We can use a platform command to see which VXLAN tunnels have been successfully programmed:
VTEP1#show platform trident vxlan port-to-next-hop
Key
  '*'       : '*' after the interface name indicates that the
              Port-To-Next-Hop Table of that interface is in overflow state
  NH ID     : Next Hop Index
  NH Mac    : Next Hop Mac Address
  Prog.NHId : Next Hop ID programmed in H/W, 'None' in case of Error
              Programming the H/W

Interface            NHId   NH Mac              Prog.NHId
Port-Channel1000 *   24     28:99:3A:BE:EA:26   None
Port-Channel1000 *   34     28:99:3A:BE:64:06   34
In the above output, we see that the tunnel with next-hop EA:26 has not been programmed, while the tunnel with next-hop 64:06 has. This means all VXLAN-encapsulated traffic will use the one working tunnel; whichever tunnel is created first is the one that gets successfully programmed.
We’ve dug into what this limitation looks like, but what happens to the traffic?
- In our example, VTEP3 is going to receive a packet with destination IP 10.55.31.1 and a destination MAC address of 28:99:3a:be:64:06. Since the destination MAC is VTEP3’s system MAC, it will process the packet. The destination IP belongs to VTEP2, so VTEP3 will perform a route lookup and route the packet to VTEP2 via their mutual VLAN, 52.
- If instead VTEP2 and VTEP3 didn’t have a common VLAN over which to form a BGP peering, then either:
- VTEP3 wouldn’t have a route to VTEP2, and the traffic would be dropped.
- VTEP3 has a route to VTEP2, but it’s learned via VTEP1. VTEP3 then routes the packet back to VTEP1, which performs a regular route lookup and forwards it to VTEP2.
Now that we understand the limitation better, are there any workarounds?
In a topology as simple as the one described above, the workaround is also simple: instead of using a single physical interface to connect to the L2 underlay, we would need two. For VTEP1, that’s one for VLAN 55 and one for VLAN 65. If the other VTEPs use the same Trident chip, they would also need two uplinks each for this deployment.
However, since the number of links required grows as one per remote VTEP, scale quickly becomes a problem. Using L3 uplinks, or an alternative method of tunneling the Layer-2 traffic, is a better option.
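To make the scaling concern concrete, the link count per VTEP under this workaround is simple arithmetic, assuming every VTEP peers with every other VTEP over the L2 core (the function below is illustrative only):

```python
# Illustrative arithmetic: with an L2 underlay and SVI peering on a
# Trident VTEP, the workaround needs one physical uplink per remote
# VTEP (full-mesh peering assumed).
def uplinks_needed(total_vteps):
    return total_vteps - 1  # one link per remote VTEP

print(uplinks_needed(3))   # the article's 3-VTEP topology: 2 uplinks each
print(uplinks_needed(16))  # 15 uplinks per VTEP, clearly impractical
```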