Basic Troubleshooting – Traffic loss
This document describes general troubleshooting steps that can be performed quickly to diagnose or narrow down network connectivity issues.
An outage can broadly be classified as:
- Complete traffic loss
- Intermittent traffic loss
Complete traffic loss
The following steps help in scenarios where ping or data traffic between two hosts results in 100% traffic loss.
To start with:
Gather the IP address and MAC address of both the hosts under concern.
Now there are two conditions:
Condition 1: Hosts are in the same subnet
The following steps help to isolate the switch between the hosts where the issue may lie.
Step1: Initiate a continuous ping from Host-A to Host-B and also from Host-B to Host-A, then continue with Step2.
Note: A continuous ping in both directions helps the switches learn the MAC addresses of both hosts.
Step2: Identify the forwarding path and the expected reverse path between the hosts. Run the ‘show mac address-table’ command on every intermediate switch in the path to find the port on which the Host-A MAC address is learned and the port on which the Host-B MAC address is learned.
The following command shows this information.
switch#show mac address-table
Make sure that the MAC addresses of Host-A and Host-B are learned on the correct port on every intermediate switch. If an address is not learned, or is learned on the wrong port, proceed with Step3.
If the MAC addresses are learned on the correct interfaces on all the intermediate switches, proceed with Step5.
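The hop-by-hop comparison in Step2 can be sketched in a few lines of Python. This is only an illustration: the switch names, MAC addresses, and port names are hypothetical, and the tables are assumed to have been transcribed by hand from each switch's MAC address table.

```python
from typing import Dict, List, Optional, Tuple

MacTable = Dict[str, str]  # MAC address -> port it was learned on


def first_mislearned_hop(
    learned: List[Tuple[str, MacTable]],
    expected: Dict[str, MacTable],
) -> Optional[Tuple[str, str]]:
    """Return (switch, mac) for the first hop where a host MAC is missing
    or learned on an unexpected port; None if every hop looks correct."""
    for switch, table in learned:
        for mac, port in expected.get(switch, {}).items():
            if table.get(mac) != port:
                return switch, mac
    return None


# Hypothetical values, transcribed by hand from each switch in path order.
HOST_A = "0011.2233.4455"
HOST_B = "0011.2233.6677"
expected = {
    "sw1": {HOST_A: "Et1", HOST_B: "Et2"},
    "sw2": {HOST_A: "Et5", HOST_B: "Et6"},
}
learned = [
    ("sw1", {HOST_A: "Et1", HOST_B: "Et2"}),
    ("sw2", {HOST_A: "Et5"}),  # Host-B is not learned on sw2
]

print(first_mislearned_hop(learned, expected))
```

The switch returned here is the one to examine in Step3. A result of None corresponds to the case where every address is learned correctly and Step5 applies.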
Step3: Step2 identifies the switch where the MAC address is not learned properly.
On that switch, make sure that the VLAN in which the end hosts reside is allowed on both the ingress and egress ports. To confirm, run the following command.
Switch#show vlan vlan-number
Step3 shows whether the VLAN is allowed on both the forward and reverse paths between the hosts. If the VLAN is configured correctly, proceed with Step4.
Step4: Confirm that spanning tree is not blocking a port in the path:
switch#show spanning-tree vlan vlan-number
Run the above command several times and check whether the output changes between runs.
If Step4 identifies no issue, continue with Step5.
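Comparing successive captures by eye is error-prone, so the check in Step4 can be sketched with Python's standard difflib module. The two captures below are simplified, hypothetical text and not actual EOS output; in practice they would be the saved output of two runs of the spanning-tree command.

```python
import difflib
from typing import List


def changed_lines(previous: str, current: str) -> List[str]:
    """Return the lines removed ('- ') or added ('+ ') between two captures."""
    diff = difflib.ndiff(previous.splitlines(), current.splitlines())
    # ndiff also emits unchanged lines and '? ' hint lines; keep only real changes.
    return [line for line in diff if line.startswith(("- ", "+ "))]


# Hypothetical, simplified captures taken a few seconds apart.
first_run = "Et1 root forwarding\nEt2 designated forwarding"
second_run = "Et1 root forwarding\nEt2 alternate discarding"

for line in changed_lines(first_run, second_run):
    print(line)
```

An empty result means the two captures are identical, which suggests the spanning-tree state did not change between runs.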
Step5: If your platform supports advanced mirroring to CPU, it can be used to determine whether the switch is receiving and transmitting packets. The commands are below.
Switch(config)#monitor session 1 source Ethernet Interface-number both
Switch(config)#monitor session 1 destination Cpu
Switch(config)#tcpdump session 1 file flash:FileName.pcap
Collect the pcap file from the switch’s flash and analyze it in a tool such as Wireshark, or read the pcap file on the console using the command below.
Switch#bash tcpdump -r /mnt/flash/FileName.pcap | less
Step6: Check whether each host has a valid ARP entry for the other.
If the hosts don’t have valid ARP entries, proceed to the end of the document, which lists the information required by Arista TAC.
If both hosts have valid ARP entries, proceed with Step7.
Step7: The methods below are additional steps to determine the problem and may not suit all network environments.
If the issue has a production impact and time is critical, please raise a ticket with Arista TAC.
Method 1: Counter bin with a hop by hop check
Log in to the first switch in the path from Host-A’s perspective.
Step2 identified the ingress and egress ports. Using those ports, run the following command.
Switch#show interfaces ethernet <ingress port> counters bins
Traffic flow is in this direction ------------>
Interface Eth27/4 ------- |Switch| ------ Interface Eth23/4
Switch#show interfaces ethernet 27/4 counters bins
Input
Port      512-1023 Byte   1024-1522 Byte   1523-MAX Byte
-------------------------------------------------------------
Et27/4                0                0               0
Identify a bin whose counter is zero or is not incrementing over time, and note the bin’s byte range.
Stop the ping that was started in Step1 and start a new ping with a packet size that falls within that bin’s range.
Check whether the ingress and egress counters for that bin increment.
Switch#show interfaces ethernet 27/4 counters bins
Input
Port      512-1023 Byte   1024-1522 Byte   1523-MAX Byte
-------------------------------------------------------------
Et27/4                5                0               0

Switch#show interfaces ethernet 23/4 counters bins
Output
Port      512-1023 Byte   1024-1522 Byte   1523-MAX Byte
-------------------------------------------------------------
Et23/4                5                0               0
Continue checking all the L2 switches, hop by hop, in the path and identify which switch’s ingress/egress counters are not matching.
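Method 1 requires a ping whose frames land in a chosen bin, and the ping size option normally sets the ICMP payload rather than the frame size. The sketch below converts between the two. It assumes IPv4 without options and assumes that the bin counters count the whole Ethernet frame including the FCS, which should be verified on the platform in use.

```python
ETH_HEADER = 14   # destination MAC + source MAC + EtherType
FCS = 4           # frame check sequence (assumed to be counted by the bins)
IPV4_HEADER = 20  # IPv4 header without options
ICMP_HEADER = 8   # ICMP echo request header
VLAN_TAG = 4      # one 802.1Q tag, present on tagged (trunk) links


def ping_payload_for_frame(frame_bytes: int, tagged: bool = False) -> int:
    """ICMP payload size that produces an Ethernet frame of frame_bytes."""
    overhead = ETH_HEADER + FCS + IPV4_HEADER + ICMP_HEADER
    if tagged:
        overhead += VLAN_TAG
    payload = frame_bytes - overhead
    if payload < 0:
        raise ValueError("frame is too small to carry an ICMP echo request")
    return payload


def payload_for_bin(low: int, high: int, tagged: bool = False) -> int:
    """Payload size that places the frame in the middle of the bin [low, high]."""
    return ping_payload_for_frame((low + high) // 2, tagged)


print(payload_for_bin(512, 1023))               # untagged link
print(payload_for_bin(512, 1023, tagged=True))  # 802.1Q-tagged link
```

On a Linux host the resulting value would typically be passed to ping with the -s option. The path MTU must be large enough to carry the chosen frame size without fragmentation.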
Method 2: IP ACLs with statistics per-entry:
Depending on the platform and EOS code compatibility, the following method could be used to check hop by hop packet propagation.
From previous steps, ingress and egress ports are identified in all intermediate switches.
Apply the ingress port ACL on all incoming interfaces in the forward path.
The sample ACL config is as below:
ip access-list test
   statistics per-entry
   10 permit ip host hostA_IP host hostB_IP
   20 permit ip any any
interface Ethernet1
   ip access-group test in
Initiate a ping from Host-A to Host-B with a count of, say, 5.
On each intermediate node, check whether the ACL statistics show 5 matched packets. If a node shows fewer, go back to the previous hop and check its logs.
Switch#sh ip access-lists test
IP Access List test
        statistics per-entry
        10 permit ip host 220.127.116.11 host 18.104.22.168 [match 5 packets, 0:00:00 ago]
        20 permit ip any any
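When many hops are involved, the per-entry match counts can be extracted from saved output and compared with the number of pings sent. The sketch below parses text in the format of the sample output above, laid out one ACL entry per line (the line breaks are an assumption). The hop names and the second hop's counts are hypothetical.

```python
import re
from typing import Dict, List

ENTRY_RE = re.compile(r"^\s*(\d+)\s+(?:permit|deny)\b(.*)$")
COUNT_RE = re.compile(r"\[match (\d+) packets?")


def acl_match_counts(output: str) -> Dict[int, int]:
    """Map ACL sequence number -> matched packets (0 when no match is shown)."""
    counts: Dict[int, int] = {}
    for line in output.splitlines():
        entry = ENTRY_RE.match(line)
        if not entry:
            continue  # header or 'statistics per-entry' line
        hit = COUNT_RE.search(entry.group(2))
        counts[int(entry.group(1))] = int(hit.group(1)) if hit else 0
    return counts


def hops_missing_packets(
    per_hop: Dict[str, Dict[int, int]], seq: int, sent: int
) -> List[str]:
    """Hops, in path order, whose entry `seq` matched fewer than `sent` packets."""
    return [hop for hop, counts in per_hop.items() if counts.get(seq, 0) < sent]


SAMPLE = """IP Access List test
        statistics per-entry
        10 permit ip host 220.127.116.11 host 18.104.22.168 [match 5 packets, 0:00:00 ago]
        20 permit ip any any
"""

# "sw1" uses the sample above; "sw2" is a hypothetical hop that saw nothing.
per_hop = {"sw1": acl_match_counts(SAMPLE), "sw2": {10: 0, 20: 0}}
print(hops_missing_packets(per_hop, seq=10, sent=5))
```

The first hop in the returned list is where the packets stop arriving, so the hop before it is the one to inspect.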
Condition 2: Hosts are in different subnets
Check if the host is able to ping the gateway.
Run a traceroute from the host and check the point at which it fails.
Check if the failing device has a valid route to the destination.
Switch#show ip route <destination_IP>
Once we identify the section of the network where the problem may lie, we can follow the steps mentioned in Condition 1 for further isolating the issue.
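Whether a pair of hosts falls under Condition 1 or Condition 2 can be checked with Python's standard ipaddress module. This is a minimal sketch that assumes both hosts are configured with the same prefix length; the addresses are hypothetical examples.

```python
import ipaddress


def same_subnet(host_a: str, host_b: str, prefix_len: int) -> bool:
    """True if host_b lies inside the subnet derived from host_a/prefix_len."""
    network = ipaddress.ip_interface(f"{host_a}/{prefix_len}").network
    return ipaddress.ip_address(host_b) in network


# Hypothetical addresses with a /24 mask on both hosts.
print(same_subnet("10.1.1.10", "10.1.1.200", 24))  # Condition 1 applies
print(same_subnet("10.1.1.10", "10.1.2.5", 24))    # Condition 2 applies
```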
Intermittent traffic loss
To begin with:
Gather the IP address and MAC address of both the hosts under concern.
Check if there are interface flaps on the path between the source and destination. If so, the following logs will be repeatedly observed on the switch where the flaps are happening.
Leaf-1 Ebra: %LINEPROTO-5-UPDOWN: Line protocol on Interface Ethernet1, changed state to down
Leaf-1 Ebra: %LINEPROTO-5-UPDOWN: Line protocol on Interface Ethernet1, changed state to up
If the interface flaps are happening at random intervals, we could check interface uptime and the number of state changes.
Switch#show interfaces ethernet 1
Ethernet1 is up, line protocol is up (connected)
  Hardware is Ethernet, address is 5038.0002.0001 (bia 5038.0002.0001)
  Member of Port-Channel1
  Ethernet MTU 9214 bytes
  Full-duplex, Unconfigured, auto negotiation: off, uni-link: n/a
  Up 10 seconds
  Loopback Mode: None
  17 link status changes since last clear
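To see which interfaces flap most often, the LINEPROTO messages can be counted per interface from a saved copy of the switch log. The sketch below matches only the message format shown above and counts each transition to down as one flap; the sample text repeats the two log lines from this section.

```python
import re
from collections import Counter

UPDOWN_RE = re.compile(
    r"%LINEPROTO-5-UPDOWN: Line protocol on Interface (\S+), "
    r"changed state to (up|down)"
)


def down_transitions(log_text: str) -> Counter:
    """Count 'changed state to down' events per interface."""
    flaps: Counter = Counter()
    for match in UPDOWN_RE.finditer(log_text):
        if match.group(2) == "down":
            flaps[match.group(1)] += 1
    return flaps


SAMPLE_LOG = (
    "Leaf-1 Ebra: %LINEPROTO-5-UPDOWN: Line protocol on Interface Ethernet1, "
    "changed state to down\n"
    "Leaf-1 Ebra: %LINEPROTO-5-UPDOWN: Line protocol on Interface Ethernet1, "
    "changed state to up\n"
)

# Interfaces are listed from most to fewest down transitions.
for interface, count in down_transitions(SAMPLE_LOG).most_common():
    print(interface, count)
```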
Check for any spanning tree instability:
Switch#show logging | grep -i "Stp"
Stp: %SPANTREE-6-STABLE_CHANGE: Stp state is now not stable
The following command shows the number of spanning tree topology changes and how long ago the most recent change occurred. Together, these values indicate whether STP is stable on the switch.
Switch#show spanning-tree detail
MST0 is executing the mstp Spanning Tree protocol
<snipped>
Root port is 100 (Port-Channel1), cost of root path is 0 (Ext) 1999 (Int)
Number of topology changes 3 last change occurred 4842 seconds ago
        from Port-Channel1
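The topology change count and the age of the last change can be pulled out of saved output for a quick stability check. The sketch below relies only on the wording of the sample line above. The 300-second window is an arbitrary assumption, not an Arista recommendation, and should be adjusted for the environment.

```python
import re
from typing import Optional, Tuple

TC_RE = re.compile(
    r"Number of topology changes\s+(\d+)\s+"
    r"last change occurred\s+(\d+)\s+seconds ago"
)


def topology_change_info(output: str) -> Optional[Tuple[int, int]]:
    """Return (number_of_changes, seconds_since_last_change), or None if absent."""
    match = TC_RE.search(output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def changed_recently(output: str, window_seconds: int = 300) -> bool:
    """True if the last topology change falls within the (assumed) window."""
    info = topology_change_info(output)
    return info is not None and info[1] <= window_seconds


SAMPLE_STP = (
    "Number of topology changes 3 last change occurred 4842 seconds ago "
    "from Port-Channel1"
)

print(topology_change_info(SAMPLE_STP))
print(changed_recently(SAMPLE_STP))
```

A change count that keeps rising between captures, together with a small age for the last change, points to an unstable topology.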
Check if there are any output discards on interfaces.
The following output can be collected for a couple of iterations to determine if the counters are incrementing over a period of time:
Switch#show interfaces counters discards | nz
Port              InDiscards       OutDiscards
---------------   ----------------   -----------
Et11                           0            15
Et12                           0         10092
Et13                           0         31095
---------         ---------          ---------
Totals                         0         41202
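Two captures of the discard counters can be compared to find the ports whose OutDiscards are still rising. The sketch below parses text in the three-column layout of the sample above. The first capture repeats the sample values, and the second capture is hypothetical.

```python
from typing import Dict


def parse_out_discards(output: str) -> Dict[str, int]:
    """Map port name -> OutDiscards from '<port> <InDiscards> <OutDiscards>' rows."""
    counters: Dict[str, int] = {}
    for line in output.splitlines():
        fields = line.split()
        # Skip the header, separator rows, and the Totals row.
        if len(fields) != 3 or fields[0] == "Totals":
            continue
        if fields[1].isdigit() and fields[2].isdigit():
            counters[fields[0]] = int(fields[2])
    return counters


def incrementing(before: Dict[str, int], after: Dict[str, int]) -> Dict[str, int]:
    """Ports whose counter grew between captures, with the size of the increase."""
    return {
        port: after[port] - before[port]
        for port in after
        if port in before and after[port] > before[port]
    }


FIRST = """Port InDiscards OutDiscards
--------------- ---------------- -----------
Et11 0 15
Et12 0 10092
Et13 0 31095
--------- --------- ---------
Totals 0 41202
"""
# Hypothetical second capture: only Et12 has increased.
SECOND = FIRST.replace("Et12 0 10092", "Et12 0 10150")

print(incrementing(parse_out_discards(FIRST), parse_out_discards(SECOND)))
```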
Check if there are any CRC and Rx errors on interfaces in the path between the hosts
------------- show interfaces counters errors -------------
Port       FCS    Align   Symbol       Rx    Runts   Giants       Tx
Et1       3245        0        0     3245        0        0        0
Et2       1939        0        0     1939        0        0        0
<snipped>
Note: The steps documented in this guide help the engineer trace the issue to the particular switch or networking device where the problem may lie. However, opening a TAC case with Arista is recommended to confirm and determine the nature of the problem.
Please collect the following logs from the suspected switches after performing the troubleshooting and open a TAC case.
show tech-support
show agent log
show agent qtrace
show logging system