Virtual IPs in VXLAN and the Need for VVTEP

 
 

Objective

This document contrasts the behaviour of “ip address virtual” and “ip virtual-router address” in VXLAN, and explains why a virtual VTEP IP (VVTEP) is needed with or without an L2 VTEP in the network.

Topology

VXLAN Direct Routing Model

Virtual IP in SVI 100: 100.100.100.50

Virtual IP in SVI 200: 200.200.200.50

Virtual MAC: 0011.2233.4455

Virtual VTEP IP (VVTEP): 5.5.5.5/32

Underlay Protocol used: OSPF

Types of Virtual IPs:

Two types of virtual IP are usually configured with VXLAN:

1) ip virtual-router address

2) ip address virtual

RULE-1: If the Ethernet source MAC of the original (naked) frame is PHYSICAL, then after encapsulation the outer source IP is the primary IP of the VXLAN loopback interface. If the Ethernet source MAC of the original frame is VIRTUAL, then after encapsulation the outer source IP is the VVTEP IP (the secondary loopback IP).
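RULE-1 can be read as a single decision on the inner source MAC. The following is a minimal sketch of that decision, not switch code: the VMAC and VVTEP IP are taken from the topology, while `PRIMARY_VTEP_IP` and the function name `outer_source_ip` are assumptions made for illustration.

```python
# Minimal model of RULE-1. VMAC and VVTEP_IP come from the topology;
# PRIMARY_VTEP_IP is an assumed value for the primary loopback address.
VMAC = "0011.2233.4455"
VVTEP_IP = "5.5.5.5"
PRIMARY_VTEP_IP = "1.1.1.1"


def outer_source_ip(inner_src_mac: str) -> str:
    """Pick the outer source IP from the inner Ethernet source MAC (RULE-1)."""
    if inner_src_mac.lower() == VMAC:
        return VVTEP_IP        # virtual source MAC -> secondary loopback IP
    return PRIMARY_VTEP_IP     # physical source MAC -> primary loopback IP
```

For example, a frame sourced from a system MAC such as 444c.a82e.0ecf would be encapsulated with the primary loopback IP under this model.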

When using “ip virtual-router address”:

RULE-2: When a remote host sends an ARP request for the virtual IP, the routing VTEP generates an ARP reply whose Ethernet source MAC is the virtual MAC address configured on the switch. The same applies to GARP. Correspondingly, all information in the ARP header (sender IP, sender MAC) of the GARP and the ARP reply is VIRTUAL. In all other cases the Ethernet source MAC is physical, and an ARP request carries the physical MAC and the physical IP.

In the above topology, assume SVI 100 and SVI 200 have “ip virtual-router address” configured on all VTEPs. If H1 (VLAN 100) tries to reach H2 (VLAN 200), it first resolves ARP for its gateway (SVI 100 on VTEP-1). VTEP-1 then sends an ARP request for H2 (VLAN 200). The source MAC of that ARP request is the switch’s system MAC (not the virtual MAC). Since the Ethernet source MAC is physical, the outer source IP after encapsulation is the primary loopback IP and not the VVTEP (secondary) IP ← Rule-1

After VTEP-1 learns the ARP entry for H2, it forwards the traffic sent by H1. In that case too, the Ethernet source MAC is physical and, correspondingly, the outer source IP is the physical IP.

When using “ip address virtual”:

RULE-3: In both ARP requests and ARP replies, the Ethernet source MAC and all information in the ARP header are virtual: Ethernet source MAC = VMAC, sender IP = virtual IP, sender MAC = VMAC. So in this case the outer source IP is the VVTEP IP (secondary loopback IP).

RULE-4: When data-plane routing occurs on the routing VTEP, whether the traffic is routed in HW or SW, the source MAC is the switch’s system MAC, and after encapsulation the outer source IP is the physical loopback IP (VTEP IP).

Assume SVI 100 and SVI 200 have “ip address virtual” configured. If H1 tries to reach H2, it resolves ARP for its gateway. When VTEP-1 generates the ARP request for H2 (VLAN 200), it sources it from its virtual MAC. Since the Ethernet source MAC is virtual, the outer source IP is the VVTEP IP.

Once VTEP-1 has resolved ARP for H2, it routes and forwards the traffic to H2. In that case the source MAC is the system MAC and, correspondingly, the outer source IP is the primary VTEP IP.
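The difference between the two commands comes down to which source MAC the VTEP uses for each kind of frame. The sketch below tabulates RULE-2, RULE-3 and RULE-4 and then applies RULE-1 to get the outer source IP. The enum and function names are hypothetical and only serve to summarise the rules as stated in this document.

```python
from enum import Enum


class Mode(Enum):
    VIRTUAL_ROUTER = "ip virtual-router address"
    ADDRESS_VIRTUAL = "ip address virtual"


class Frame(Enum):
    ARP_REQUEST = "ARP request generated by the VTEP"
    ARP_REPLY_FOR_VIP = "ARP reply or GARP for the virtual IP"
    ROUTED_DATA = "routed data-plane traffic"


def inner_source_mac(mode: Mode, frame: Frame) -> str:
    """Return 'virtual' or 'physical' for the inner Ethernet source MAC."""
    if frame is Frame.ROUTED_DATA:
        return "physical"      # RULE-4, and the same holds for virtual-router
    if frame is Frame.ARP_REPLY_FOR_VIP:
        return "virtual"       # RULE-2 and RULE-3
    # ARP request: only "ip address virtual" sources it from the VMAC.
    return "virtual" if mode is Mode.ADDRESS_VIRTUAL else "physical"


def outer_source(mode: Mode, frame: Frame) -> str:
    """Apply RULE-1: virtual inner MAC -> VVTEP IP, else primary loopback IP."""
    return "VVTEP" if inner_source_mac(mode, frame) == "virtual" else "primary"
```

Under this model the only cell where the two commands differ is the ARP request generated by the VTEP.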

RULE-5: Applicable to both “ip address virtual” and “ip virtual-router address”: if a routing VTEP receives an ARP request on the VXLAN interface for a VLAN where “ip address virtual” is configured, it sends the ARP reply only if the outer destination IP of the packet is the VVTEP IP (not its physical IP). This prevents duplicate ARP replies in the network. Consider a host connected to a bridging VTEP whose gateway is across the VXLAN tunnel: without this rule, all routing VTEPs would send an ARP reply, because the same virtual IP is configured on all of them.

Also, the VVTEP must be present in the flood list of the bridging VTEP, and the bridging VTEP must have a route to the VVTEP. The spine has ECMP routes for the VVTEP IP pointing to all routing VTEPs.
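To see why RULE-5 yields exactly one reply, and why the VVTEP has to be in the bridging VTEP’s flood list, consider the following rough sketch. It counts how many head-end-replicated copies of an ARP request would trigger a reply. The physical VTEP IPs 1.1.1.1 and 2.2.2.2 and the function names are assumptions for illustration.

```python
VVTEP_IP = "5.5.5.5"
VIRTUAL_IP = "100.100.100.50"


def replies_to_arp(outer_dst_ip: str, target_ip: str) -> bool:
    """RULE-5: reply only when the copy was addressed to the VVTEP IP."""
    return target_ip == VIRTUAL_IP and outer_dst_ip == VVTEP_IP


def count_replies(flood_list, target_ip: str) -> int:
    """Count the HER copies (one per flood-list entry) that trigger a reply."""
    return sum(replies_to_arp(dst, target_ip) for dst in flood_list)


# With the VVTEP in the flood list, exactly one copy is answered.
with_vvtep = count_replies(["1.1.1.1", "2.2.2.2", "5.5.5.5"], VIRTUAL_IP)
# Without it, no copy satisfies RULE-5 and the gateway ARP goes unanswered.
without_vvtep = count_replies(["1.1.1.1", "2.2.2.2"], VIRTUAL_IP)
```

The copy addressed to 5.5.5.5 is delivered by the spine to one routing VTEP only, so the count of one corresponds to a single ARP reply on the wire.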

ARP REPLY SYNC:

This feature is only supported with “ip address virtual”.

ARP Reply Sync is applicable primarily for the following two cases:

1) An L3 VTEP (named VTEP-1) receives a naked ARP reply from H1 (a directly connected host) whose dst-MAC == VARP-MAC.

2) An L3 VTEP (named VTEP-1) receives a VXLAN-encapsulated ARP reply from H2 (on interface Vx1) whose outer-dst-ip == VVTEP && inner-dst-MAC == VARP-MAC.

– In both of the above cases, VTEP-1 performs SW-HER to send that ARP reply to all VTEPs in its flood set except the following three:

<A> Itself (as physical VTEP — its primary loopback IP)

<B> VVTEP IP (which is also itself)

<C> In <2> case, the VTEP which sent this VXLAN ARP Reply (VTEP3).

       – The HERed ARP Reply has following format:

<A> Outer-src-ip: VTEP-1’s primary VTEP IP

<B> Outer-dst-ip: Copy of the packet per each IP in Flood List

<C> inner-dst-MAC: VMAC

<D> Inner-src-MAC: VTEP-1’s system-MAC. <– Applicable only for second case. Inner SMAC doesn’t change in first case.

       – Since VTEP-1 replaces the inner-src-MAC with its own system MAC, other VTEPs cannot learn the host’s source MAC from this HERed ARP reply packet.

       – This happens only for ARP replies from an end host (source MAC is the host MAC and destination MAC is the VMAC).

       – This helps to synchronize the ARP entry of the end host across all routing VTEPs.

       – This feature is not supported with “ip virtual-router address”. With “ip virtual-router address”, an ARP request generated by the switch is sourced from its own system MAC, and hence the ARP reply from the host reaches the original router (the one that generated the ARP request).
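The ARP reply sync steps above can be sketched as a small function that takes a received ARP reply and returns the HERed copies. This is an illustrative model only: the `ArpReply` type, the function name `her_arp_reply`, and the VTEP IPs 1.1.1.1, 2.2.2.2 and 3.3.3.3 are assumptions, and the real forwarding agent may apply further checks not described here.

```python
from dataclasses import dataclass
from typing import List, Optional

VMAC = "0011.2233.4455"
SYSTEM_MAC = "444c.a82e.0ecf"   # system MAC of the relaying VTEP


@dataclass(frozen=True)
class ArpReply:
    inner_src_mac: str
    inner_dst_mac: str
    outer_src_ip: Optional[str] = None   # None means a naked (local) frame
    outer_dst_ip: Optional[str] = None


def her_arp_reply(reply: ArpReply, flood_list: List[str],
                  primary_ip: str, vvtep_ip: str) -> List[ArpReply]:
    """Return the copies the L3 VTEP relays to the other VTEPs."""
    if reply.inner_dst_mac != VMAC:
        return []                          # only replies addressed to the VMAC
    encapsulated = reply.outer_src_ip is not None
    if encapsulated and reply.outer_dst_ip != vvtep_ip:
        return []                          # case 2 requires outer dst == VVTEP
    excluded = {primary_ip, vvtep_ip}      # itself, as physical and virtual VTEP
    if encapsulated:
        excluded.add(reply.outer_src_ip)   # the VTEP that sent the reply
    # The inner source MAC is rewritten only in the encapsulated case.
    src_mac = SYSTEM_MAC if encapsulated else reply.inner_src_mac
    return [ArpReply(src_mac, VMAC, primary_ip, dst)
            for dst in flood_list if dst not in excluded]
```

With a flood list of 1.1.1.1, 2.2.2.2, 3.3.3.3 and 5.5.5.5 on a VTEP whose primary IP is 1.1.1.1, a naked reply would be relayed to 2.2.2.2 and 3.3.3.3, while an encapsulated reply that arrived from 3.3.3.3 would be relayed to 2.2.2.2 only.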

Source IP NAT feature while using “ip address virtual”

What is the purpose of Source IP NAT ?

Often, when we ping a remote host from a VTEP where “ip address virtual” is configured, the ping fails. This is because the source IP (the virtual IP of the SVI) gets NAT’ed to the highest loopback IP in that VRF, and the end host might not have a route to the NAT’ed IP. If there is no loopback, the highest IP in the VRF is used.

The NAT is needed because the SVI is configured with a virtual IP, and the same SVI IP address is configured on all routing VTEPs. Without it, the ICMP reply from the remote host may reach some other routing VTEP (due to the ECMP routes for the VVTEP on the spine).

To ensure the ICMP reply from the host reaches the original VTEP (the one that initiated the ICMP echo request), the source IP of the ICMP request is changed to the highest loopback IP. The NAT’ed IP is unique, and hence the ICMP reply will not go to any other VTEP.

Typically, the NAT’ed IP is the IP on the loopback interface associated with the VXLAN interface (though it could be any other non-virtual IP on the switch). However, on MLAG VTEPs the VXLAN loopback interface has the same IP on both MLAG peers, so users need to configure an additional loopback in the default VRF for source NAT.
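The selection of the NAT’ed source IP can be approximated as follows. This is a sketch under two assumptions: that “highest” means numerically highest, and that loopbacks can be recognised by interface name. The function name and the example addresses are hypothetical.

```python
import ipaddress
from typing import Dict


def nat_source_ip(vrf_addresses: Dict[str, str]) -> str:
    """Pick the source-NAT IP from the non-virtual addresses of one VRF.

    vrf_addresses maps an interface name to its non-virtual IPv4 address.
    """
    if not vrf_addresses:
        raise ValueError("the VRF needs at least one non-virtual IP")
    loopbacks = [ip for name, ip in vrf_addresses.items()
                 if name.lower().startswith("loopback")]
    # Prefer loopbacks; fall back to every address in the VRF.
    candidates = loopbacks or list(vrf_addresses.values())
    # Compare numerically, not as strings ("9.9.9.9" < "10.0.0.1").
    return str(max(ipaddress.IPv4Address(ip) for ip in candidates))
```

If the chosen address is not reachable from the remote host, the ping fails even though the VXLAN data path is healthy, which is the symptom described above.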

Packet Structure after NAT:

Following is packet structure after NAT (Refer to topology):

Let’s say we ping H2 from SVI 200 on VTEP-1 (which has “ip address virtual” configured). The source IP is NAT’ed, and the packet looks like this:

  • Inner-source IP: Highest loopback IP in the vrf (or highest ip in the vrf if no loopback)
  • Inner-destination IP: IP address of H2
  • Inner source MAC: VTEP-1 System MAC (not vMAC)
  • Inner destination MAC: MAC address of H2
  • Outer Source IP: Physical IP (not VVTEP) ← refer Rule-1
  • Outer destination IP: Physical IP of VTEP-3

Since the outer source IP is the physical IP, the ICMP reply returns to VTEP-1.

Support in Non-default VRF:

This feature is also supported in non-default overlay VRFs which allows users to configure VXLAN SVIs in non-default VRFs.

Note that while the overlay can be in non-default VRFs, the underlay (physical connectivity between VTEPs) must be in default VRF. Also, only IPv4 based VXLAN routing is currently supported.

This feature makes VXLAN routing more deployable by:

  1. Allowing users to configure multiple overlay routing domains using VRFs, which is important in hosted environments.
  2. Allowing users to have a clean separation between underlay and overlay traffic, including simpler and cleaner protocol configuration without having to use complicated route-maps to control distribution of prefixes to peers in the underlay and overlay.

Note that the user is expected to ensure that there is at least one non-virtual IP on some interface in the overlay VRF. More importantly, this IP must be reachable from all other VTEPs in the overlay VRF. There are two ways to do this:

  1. Configure a Loopback interface in the overlay VRF on all VTEPs. Add static routes to reach that Loopback IP on each VTEP from all other VTEPs.
  2. Configure another VXLAN SVI in the overlay VRF on all VTEPs (that is, an SVI with a unique IP plus a VLAN-VNI mapping).

Option 2) is easier to configure. Either way, the user has to configure a unique IP per VTEP per overlay VRF for this to work.

Does NAT occur when we initiate a ping from a remote host to its gateway (virtual IP on the VTEPs)?

The answer is NO. Let’s understand why:

Let’s say we initiate a ping from H2 (refer to the topology) to its gateway (SVI 200 on VTEP-1 and VTEP-2). The ICMP request can reach either VTEP-1 or VTEP-2 (due to the ECMP route for the VVTEP on the spine). Whichever VTEP receives the packet creates the ICMP reply and sends it to H2, and the ping succeeds. When it creates the reply packet, it uses its actual SVI 200 IP as the source. There is no need for NAT in this case, as the ping always succeeds from the perspective of the end host (H2), no matter which VTEP sends the ICMP reply.

Why do we need VVTEP?

The main reason for a virtual VTEP configuration is this: when a host behind an L2 VTEP sends traffic that needs routing, you would like that traffic to be routed by one of many L3 VTEPs. So the L2 VTEP encapsulates the packet and sends it to the virtual VTEP IP. Thanks to ECMP, this packet gets hashed to one of the L3 VTEPs. This gives you distributed gateway functionality.
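The ECMP step can be pictured with a toy hash. The sketch below is an assumption-laden illustration: real spines use their own hardware hash over fields of their choosing, and the CRC32 used here, the function name, and the L3 VTEP IPs 1.1.1.1 and 2.2.2.2 are stand-ins.

```python
import zlib

# Assumed primary IPs of the L3 VTEPs that all advertise the VVTEP IP.
L3_VTEPS = ["1.1.1.1", "2.2.2.2"]


def ecmp_next_hop(outer_src_ip: str, outer_dst_ip: str, udp_src_port: int,
                  next_hops=L3_VTEPS) -> str:
    """Hash the outer header fields onto one of the equal-cost next hops."""
    key = f"{outer_src_ip}|{outer_dst_ip}|{udp_src_port}".encode()
    return next_hops[zlib.crc32(key) % len(next_hops)]


# Packets of one flow always land on the same L3 VTEP; different flows
# (for example, different UDP source ports) may land on different ones.
chosen = ecmp_next_hop("3.3.3.3", "5.5.5.5", 49152)
```

The point of the model is only that a packet addressed to the VVTEP IP reaches exactly one L3 VTEP, and that the choice is stable per flow.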

A few other reasons are listed below:

1. To avoid MAC flaps on L2 vtep:

Ideally, every overlay MAC address should be associated with a single vtep IP. However when we configure virtual MAC on multiple routing vteps (VTEPs with SVI), the virtual MAC gets associated with multiple vtep IPs.

This can cause MAC flaps on bridging vteps as the bridging vtep may receive traffic with source MAC- VMAC from different vteps.

So a shared VTEP IP a.k.a Virtual VTEP IP is configured on all routing vteps (where same vMAC is configured) to avoid the flaps. VVTEP IP is a secondary IP on the vxlan loopback interface.
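The flap can be reproduced with a toy MAC table on the bridging VTEP. This is a minimal sketch, assuming the L2 VTEP learns each inner source MAC against the outer source IP of the last frame it saw. The function name and the physical VTEP IPs are hypothetical.

```python
VMAC = "0011.2233.4455"


def count_mac_moves(learn_events) -> int:
    """Count how often a MAC moves to a different remote VTEP IP.

    learn_events is an iterable of (inner_src_mac, outer_src_ip) pairs in
    the order the bridging VTEP receives the frames.
    """
    table = {}
    moves = 0
    for mac, vtep_ip in learn_events:
        if mac in table and table[mac] != vtep_ip:
            moves += 1                 # same MAC, different VTEP: a flap
        table[mac] = vtep_ip
    return moves


# Without a VVTEP, each routing VTEP uses its own loopback IP.
no_vvtep = count_mac_moves([(VMAC, "1.1.1.1"), (VMAC, "2.2.2.2"), (VMAC, "1.1.1.1")])
# With a VVTEP, frames sourced from the VMAC all carry the shared IP.
with_vvtep = count_mac_moves([(VMAC, "5.5.5.5"), (VMAC, "5.5.5.5"), (VMAC, "5.5.5.5")])
```

In this model the first sequence produces two moves and the second produces none.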

2. To avoid ARP flaps on end hosts:

Consider the following topology:

h1 <=> VTEP A  ==== VTEP B ===== VTEP C

h1 is connected to VTEP A. VTEP B and C are remote VTEPs from h1’s perspective. Let’s say h1 ARPs for its default gateway (which is present on VTEP A, B, C). This ARP request is HERed by VTEP A  to VTEP B and C since it is a broadcast packet.

If VVTEP is configured on all VTEPs:

1) VTEP A replies with VARP MAC.

2) VTEP B drops the request.

3) VTEP C drops the request.

If VVTEP is not configured:

1) VTEP A replies with VARP MAC.

2) VTEP B replies with its bridgeMAC.

3) VTEP C replies with its bridgeMAC.

Now in the latter case the MAC seen by the host for its default gateway depends on which ARP reply reaches the host last. So, depending on that, the host might see the VARP MAC or the bridgeMAC of VTEP B/C.
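The dependence on arrival order can be shown with a small model of the host’s ARP cache, where the last reply wins. The sketch follows the two lists above. The bridge MAC values and the function names are hypothetical.

```python
VMAC = "0011.2233.4455"
# Hypothetical bridge MACs of the remote VTEPs B and C.
BRIDGE_MAC = {"B": "bbbb.bbbb.bbbb", "C": "cccc.cccc.cccc"}


def arp_replies(vvtep_configured: bool) -> dict:
    """Map each replying VTEP to the MAC it puts in its ARP reply."""
    replies = {"A": VMAC}              # the local VTEP always answers with VMAC
    if not vvtep_configured:
        replies.update(BRIDGE_MAC)     # B and C answer with their bridge MACs
    return replies                     # with a VVTEP, B and C drop the request


def gateway_mac_on_host(arrival_order, vvtep_configured: bool):
    """Return the gateway MAC left in the host's ARP cache (last reply wins)."""
    replies = arp_replies(vvtep_configured)
    learned = None
    for vtep in arrival_order:
        if vtep in replies:
            learned = replies[vtep]
    return learned
```

Without a VVTEP, the orders A, B, C and B, C, A leave different gateway MACs in the cache. With a VVTEP, both orders leave the VMAC.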

Also note that the VVTEP IP must be configured only on L3 VTEPs (i.e. VTEPs with the SVIs). On L2 VTEPs (i.e. VTEPs without the SVIs), VVTEP IP must NOT be configured as the secondary address but it MUST be added to the HER flood list.

This is why VVTEP must be configured on all L3 VTEPs even if there is no L2 VTEP in the network.

IMPORTANT NOTE: The ARP flap behaviour does not occur from EOS-4.25.x onwards. When an L3 VTEP (with no VVTEP configured) receives a VXLAN ARP request (on the VXLAN interface) targeted at its own virtual IP, it sends the ARP reply with its VMAC. Therefore, an end host receives multiple ARP replies (sent by multiple L3 VTEPs), all sourced from the same VMAC (hence no ARP flaps). This, however, causes the VMAC-to-VTEP-IP mapping to flap on an L2 VTEP.

3. To avoid multiple ARP replies

When a host connected to an L2 VTEP sends an ARP request for its gateway (virtual IP), the ARP request is flooded to all the VTEPs in the network. All routing VTEPs will send an ARP reply to the host.

However, when VVTEP is configured, only the VTEP that receives the encapsulated ARP request with outer destination IP = VVTEP IP sends the ARP reply.

The spine has an ECMP route for the VVTEP IP, so it can hash a packet with outer destination IP = VVTEP IP to any routing VTEP. Whichever routing VTEP receives that packet generates the ARP reply.

Also, the development team always recommends configuring VVTEP whenever a VMAC and virtual IP are present, as switches may behave abnormally if it is not configured.

Please note: Any packet other than ARP with destination MAC != VMAC and outer destination IP = VVTEP IP is dropped by the switch. This means that any non-ARP BUM traffic received by a routing VTEP with dest-MAC != VMAC and destination IP = VVTEP is dropped. This is to avoid duplicate packets in the network.
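Expressed as a predicate, the drop rule looks roughly like this. It is a sketch of the rule as worded in this document, not of the actual forwarding pipeline, and the function name is an assumption.

```python
VMAC = "0011.2233.4455"
VVTEP_IP = "5.5.5.5"


def accepted_by_routing_vtep(outer_dst_ip: str, inner_dst_mac: str,
                             is_arp: bool) -> bool:
    """Return False for packets dropped by the VVTEP rule."""
    if outer_dst_ip != VVTEP_IP:
        return True                    # the rule only covers VVTEP-addressed packets
    # Addressed to the VVTEP: keep ARP, and anything destined to the VMAC.
    return is_arp or inner_dst_mac.lower() == VMAC
```

A broadcast non-ARP frame sent to 5.5.5.5 is therefore dropped, while the same frame sent to the VTEP’s physical IP is not affected by this rule.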

When is VVTEP not required ?

In EVPN asymmetric IRB (direct routing model) where there is no bridging/L2 VTEP in the network, a VVTEP IP is not required. This is because ARP flaps on end hosts do not occur when EVPN is used as the VXLAN control plane. With EVPN asymmetric IRB, an ARP request sent by an end host to its locally connected gateway is not HER’ed to remote VTEPs. The ARP suppression code performs this optimisation, so gateway ARP requests don’t reach remote VTEPs and no ARP flaps are seen on the end host.

However, if there is an L2 VTEP (a VTEP without an overlay SVI) in the network, VVTEP must be configured. Otherwise, end hosts behind the L2 VTEP will learn an incorrect ARP entry for the virtual IP: the physical MAC of some VTEP instead of the VMAC. This causes loss of redundancy, because traffic destined to the gateway always reaches one particular VTEP (the one whose physical MAC is mapped to the VIP in the end host’s ARP cache).

Packet captures to understand difference in ARP request/replies with and without VVTEP

Lab-Topology:

Virtual IP in SVI 100: 100.100.100.50

Virtual IP in SVI 200: 200.200.200.50

IP Address of H3 (Vlan100): 100.100.100.49

VVTEP IP (VVTEP): 5.5.5.5/32

Virtual MAC: 0011.2233.4455

H3 MAC Address: 001c.731c.1da8

SW1 MAC Address: 444c.a82e.0ecf

Type of Virtual IP used: “ip address virtual”

Underlay used: OSPF

Test cases when VVTEP is configured on SW1 and SW2 (L3VTEP)

1.)  ARP request packet before encapsulation generated by SW1 (SVI100) for H3 (Vlan100):


2.) ARP request packet after Vxlan encapsulation:


Please note: The ARP request generated by the kernel is sourced from the physical MAC, because the kernel binds the SVI (“ip address virtual”) to the switch’s system MAC.

However, before encapsulation, the DMA driver changes the source MAC to the virtual MAC (0011.2233.4455), and the packet is then encapsulated by the VxlanSwFwd agent.

3.) ARP reply by SW1 (VLAN100) after H3 sends an ARP request for its gateway:


4.) ARP relay packet generated by SW1:

When SW1 sends ARP request for H3 and it receives an ARP response where:

Inner destination MAC: 0011.2233.4455 (VMAC)

Outer Destination IP: IP Addresses in the Flood List (Except VVTEP and the VTEP from where ARP reply was received)

The ARP reply gets software HER’ed by SW1 (flooded to all VTEPs in the flood list, except the VTEP from which the VXLAN-encapsulated ARP reply was received) so that the ARP entry is synchronized across all L3 VTEPs:


In the above capture, you will see that the source MAC is changed to the system MAC of SW1 (444c.a82e.0ecf) and the destination MAC is the VMAC (0011.2233.4455).

Test cases when VVTEP is NOT configured:

1.)  ARP request packet before encapsulation generated by SW1 (SVI100) for H3 (Vlan100):


2.) ARP request packet after Vxlan encapsulation:


After encapsulation, outer source IP = physical loopback IP of SW1 (1.1.1.1) and inner source MAC = 0011.2233.4455 (VMAC). This combination can lead to MAC flaps on an L2 VTEP, because other L3 VTEPs will also source ARP packets from their own physical loopback IPs while the inner source MAC remains the same (VMAC).

VVTEP IP should be configured to avoid such flaps.

3.) ARP reply by SW1 (VLAN100) after H3 sends an ARP request for its gateway:


In the above capture, you will see that when SW1 sends the ARP reply, it puts its physical MAC in the ARP header. When H3 receives the ARP reply, it maps 100.100.100.50 (virtual IP) to 444c.a82e.0ecf (physical MAC of SW1).

This can create problems such as ARP flaps on end hosts, and the anycast gateway concept is lost because the physical MAC of the gateway is learnt in the end host’s ARP cache. VVTEP resolves this issue.

4.) Encapsulated ARP reply of SW1 (VLAN100) as a response to H3:
