
Basic Troubleshooting – Traffic loss

Objective

This document suggests general troubleshooting steps that can be performed quickly to diagnose or narrow down network connectivity issues.

Outages can broadly be classified as:

  • Complete traffic loss
  • Intermittent traffic loss

Complete traffic loss

The following steps help in scenarios where ping or data traffic between two hosts results in 100% traffic loss.

To start with:

Gather the IP address and MAC address of both hosts in question.
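
Whether the two hosts share a subnet decides which of the conditions below applies. The following is a minimal Python sketch of that check, assuming the prefix length of each host is also known (this guide only asks for the IP and MAC addresses). The addresses and the function name same_subnet are illustrative.

```python
import ipaddress

def same_subnet(host_a: str, host_b: str) -> bool:
    """Return True if both interface addresses (IP/prefix) share one network."""
    # ip_interface accepts an address with host bits set and exposes its network.
    net_a = ipaddress.ip_interface(host_a).network
    net_b = ipaddress.ip_interface(host_b).network
    return net_a == net_b

# Hypothetical addresses, for illustration only.
print(same_subnet("10.1.1.12/24", "10.1.1.13/24"))  # same subnet -> Condition 1
print(same_subnet("10.1.1.12/24", "10.1.2.13/24"))  # different subnets -> Condition 2
```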

Now there are two conditions:

Condition 1: Hosts are in the same subnet

The following steps help to isolate the switch between the hosts where the issue may lie.

STEP1:

Initiate a continuous ping from Host-A to Host-B and also from Host-B to Host-A. Continue with Step2.

Note: A continuous ping in both directions helps the switches learn the MAC addresses of both hosts.

 

STEP2:

After Step1, identify the forwarding path and the expected reverse path between the hosts. Run the ‘show mac address-table’ command on every intermediate switch in the path to find the port on which Host-A's MAC address is learned and the port on which Host-B's MAC address is learned.

The following command lists the learned entries.

switch#show mac address-table

Make sure that the MAC addresses of Host-A and Host-B are learned on all intermediate switches, on the correct ports. If an address is not learned, or is learned on the wrong port, please proceed with Step3.

If the MAC addresses are learned on the correct interfaces on all the intermediate switches, proceed with Step5.
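
When many switches have to be checked, the lookup in Step2 can be scripted. The sketch below is one possible approach, not an Arista tool: it assumes the table has the columns Vlan, Mac Address, Type and Ports in that order, and the sample rows and MAC addresses are hypothetical.

```python
import re

# Matches a MAC in dotted-triplet form, e.g. 001c.7300.0001.
MAC_RE = re.compile(r"[0-9a-f]{4}\.[0-9a-f]{4}\.[0-9a-f]{4}", re.IGNORECASE)

def learned_ports(table_text: str, mac: str) -> list[tuple[str, str]]:
    """Return (vlan, port) pairs for every row that lists the given MAC."""
    hits = []
    for line in table_text.splitlines():
        fields = line.split()
        # Assumed column order: Vlan, Mac Address, Type, Ports, ...
        if (len(fields) >= 4
                and MAC_RE.fullmatch(fields[1])
                and fields[1].lower() == mac.lower()):
            hits.append((fields[0], fields[3]))
    return hits

# Hypothetical table text, laid out as assumed above.
SAMPLE = """\
Vlan    Mac Address       Type        Ports
----    -----------       ----        -----
  10    001c.7300.0001    DYNAMIC     Et27/4
  10    001c.7300.0002    DYNAMIC     Et23/4
"""

print(learned_ports(SAMPLE, "001c.7300.0001"))  # Host-A
print(learned_ports(SAMPLE, "001c.7300.0099"))  # empty list: not learned, go to Step3
```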

 

STEP3:

Step2 identifies the switch on which the MAC address is not learned correctly.

Make sure that the VLAN, on which the end hosts are residing, is allowed on both the ports (ingress/egress) of the switch. To confirm, please execute the following command.

Switch#show vlan vlan-number

This confirms whether the VLAN is allowed on both the forward and the reverse path between the hosts. If the VLAN is configured correctly, please proceed with Step4.

 

STEP4:

Confirm that spanning tree is not blocking a particular port in the path:

switch#show spanning-tree vlan vlan-number

Run the above command a few times and check whether the output changes from one run to the next.
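
Comparing two captures by eye is error-prone, so the comparison can be left to a diff. The following is a minimal sketch using Python's standard difflib module. The two captures are simplified placeholder lines, not real switch output, and the function name changed_lines is illustrative.

```python
import difflib

def changed_lines(before: str, after: str) -> list[str]:
    """Return only the added/removed lines between two captures of a command."""
    diff = difflib.unified_diff(
        before.splitlines(), after.splitlines(),
        fromfile="first", tofile="second", lineterm="", n=0,
    )
    # The first two lines of a non-empty unified diff are the file headers.
    body = list(diff)[2:]
    # Keep '+'/'-' lines, drop the '@@' hunk markers.
    return [line for line in body if line.startswith(("+", "-"))]

# Placeholder captures, for illustration only.
FIRST = "Et1 root forwarding\nEt2 designated forwarding"
SECOND = "Et1 root forwarding\nEt2 alternate discarding"

for line in changed_lines(FIRST, SECOND):
    print(line)
```

An empty result means the two captures are identical.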

If Step4 identifies no issue, continue with Step5.

 

STEP5:

If your platform supports advanced mirroring to CPU, it can be used to determine whether the switch is receiving and transmitting packets. The commands are below.

 

Switch(config)#monitor session 1 source Ethernet Interface-number both

Switch(config)#monitor session 1 destination Cpu

Switch(config)#tcpdump session 1 file flash:FileName.pcap

 

Collect the pcap file from the switch’s flash and analyze it in a tool such as Wireshark, or read the pcap file on your console using the command below.

 

Switch#bash tcpdump -r /mnt/flash/FileName.pcap | less

 

STEP6:

Check whether each host has a valid ARP entry for the other.

If the hosts don’t have valid ARP entries, please proceed to the end of the document, which lists the information required by Arista TAC.

If both hosts have valid ARP entries, proceed with Step7.

 

STEP7:

The methods below are additional steps for locating the problem and may not suit every network environment.

If the issue has a production impact and time is critical, please raise a ticket with Arista TAC instead.

 

Method 1: Counter bin with a hop by hop check

Log in to the first switch in the path from Host-A's perspective.

Step2 identified the ingress and egress ports. Using those, execute the following command.

Switch#show interfaces ethernet <ingress port> counters bins

Sample Flow:

Traffic Flow is in this direction ------------> 

Interface Eth27/4 ------- |Switch| ------ Interface Eth23/4

Switch#show interfaces ethernet 27/4 counters bins

Input
Port          512-1023 Byte  1024-1522 Byte  1523-MAX Byte
-------------------------------------------------------------
Et27/4                    0               0              0

 

Identify a bin whose counters are zero or are not incrementing over a period of time. Note down the bin’s byte range.

Stop the ping that was started in Step1 and start a new ping whose packet size falls within the byte range of that bin.
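
The ping size needed to land in a given bin can be worked out from the protocol overhead. The sketch below assumes the bins count the full Ethernet frame including the 4-byte FCS, an IPv4 header without options, and a ping tool that takes the ICMP payload size (as `ping -s` does on Linux). Please verify these assumptions for your platform and host. The function name payload_for_bin is illustrative.

```python
ICMP_HEADER = 8
IPV4_HEADER = 20   # assumes no IP options
ETH_HEADER = 14    # untagged frame
ETH_FCS = 4        # assumes the bin counts the FCS
DOT1Q_TAG = 4

def payload_for_bin(low: int, high: int, vlan_tagged: bool = False) -> int:
    """Pick an ICMP payload size whose Ethernet frame lands mid-bin."""
    overhead = ICMP_HEADER + IPV4_HEADER + ETH_HEADER + ETH_FCS
    if vlan_tagged:
        overhead += DOT1Q_TAG
    target_frame = (low + high) // 2
    return target_frame - overhead

# Bin 512-1023: target frame 767 bytes, overhead 46 bytes.
print(payload_for_bin(512, 1023))  # 721
```

Keep the resulting IP packet (payload plus 28 bytes) within the path MTU, otherwise the ping is fragmented and the frames land in other bins.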

Check if the ingress/egress counters are incrementing.

Switch#show interfaces ethernet 27/4 counters bins
Input
Port          512-1023 Byte  1024-1522 Byte  1523-MAX Byte
-------------------------------------------------------------
Et27/4                    5               0              0

Switch#show interfaces ethernet 23/4 counters bins
Output
Port          512-1023 Byte  1024-1522 Byte  1523-MAX Byte
-------------------------------------------------------------
Et23/4                    5               0              0

Continue checking all the L2 switches, hop by hop, in the path and identify which switch’s ingress/egress counters are not matching.
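
The hop by hop comparison can be sketched in Python as follows. This is an illustration under assumptions rather than a supported tool: it parses only the three-column table layout shown above, and the function names parse_bins and delta are made up for this example. The same delta would be computed on the egress port and compared with the ingress result.

```python
import re

def parse_bins(text: str) -> dict[str, dict[str, int]]:
    """Parse one 'counters bins' table into {port: {bin_name: count}}."""
    bins: list[str] = []
    result: dict[str, dict[str, int]] = {}
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith("Port"):
            # Bin names contain a single space; columns are 2+ spaces apart.
            bins = re.split(r"\s{2,}", stripped)[1:]
        elif bins and stripped and not stripped.startswith("-"):
            port, *values = stripped.split()
            if len(values) == len(bins) and all(v.isdigit() for v in values):
                result[port] = dict(zip(bins, map(int, values)))
    return result

def delta(before: dict[str, int], after: dict[str, int]) -> dict[str, int]:
    """Per-bin increase between two snapshots of the same port."""
    return {name: after[name] - before[name] for name in after}

# The two ingress snapshots shown in this section.
BEFORE = """\
Input
Port          512-1023 Byte  1024-1522 Byte  1523-MAX Byte
-------------------------------------------------------------
Et27/4                    0               0              0
"""
AFTER = BEFORE.replace("    0               0              0",
                       "    5               0              0")

ingress = delta(parse_bins(BEFORE)["Et27/4"], parse_bins(AFTER)["Et27/4"])
print(ingress)
```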

 

Method 2: IP ACLs with statistics per-entry

Depending on the platform and EOS code compatibility, the following method could be used to check hop by hop packet propagation.

From previous steps, ingress and egress ports are identified in all intermediate switches.

Apply the ingress port ACL on all incoming interfaces in the forward path.

The sample ACL config is as below:

ip access-list test
   statistics per-entry
   10 permit ip host hostA_IP host hostB_IP
   20 permit ip any any
interface Ethernet1
   ip access-group test in

 

Initiate a ping from Host-A to Host-B with a fixed count, for example 5.

On each intermediate node, check whether the ACL statistics show a match count of 5 packets. If a node shows fewer, the packets were lost before reaching it: go back to the previous hop and check its logs.

Switch#show ip access-lists test
IP Access List test
        statistics per-entry
        10 permit ip host 1.1.1.12 host 1.1.1.13 [match 5 packets, 0:00:00 ago]
        20 permit ip any any
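
Checking the match count on every hop can be scripted along these lines. The sketch assumes the counter appears in the "[match N packets, ...]" form shown above and treats an entry without a counter as zero matches, which is an assumption. The function name acl_match_counts is illustrative.

```python
import re

ENTRY_RE = re.compile(r"^\s*(\d+)\s+(?:permit|deny)\b(.*)$")
COUNT_RE = re.compile(r"\[match (\d+) packets?")

def acl_match_counts(output: str) -> dict[int, int]:
    """Map ACL sequence number -> matched packet count."""
    counts: dict[int, int] = {}
    for line in output.splitlines():
        entry = ENTRY_RE.match(line)
        if not entry:
            continue  # header or 'statistics per-entry' line
        seq, rest = int(entry.group(1)), entry.group(2)
        hit = COUNT_RE.search(rest)
        # Assumption: no counter shown means no packets matched.
        counts[seq] = int(hit.group(1)) if hit else 0
    return counts

OUTPUT = """\
IP Access List test
        statistics per-entry
        10 permit ip host 1.1.1.12 host 1.1.1.13 [match 5 packets, 0:00:00 ago]
        20 permit ip any any
"""

EXPECTED = 5  # ping count used for the test
counts = acl_match_counts(OUTPUT)
print("hop OK" if counts.get(10) == EXPECTED else "packets lost before this hop")
```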

 

Condition 2: Hosts are in different subnets

Check if the host is able to ping the gateway.

Run a traceroute from the host and check the point at which it fails.

Check if the failing device has a valid route to the destination.

Switch#show ip route <destination_IP>

 

Once we identify the section of the network where the problem may lie, we can follow the steps mentioned in Condition 1 for further isolating the issue.

Intermittent traffic loss

To begin with:

Gather the IP address and MAC address of both hosts in question.

Check if there are interface flaps on the path between the source and destination. If so, the following logs will be repeatedly observed on the switch where the flaps are happening. 

Leaf-1 Ebra: %LINEPROTO-5-UPDOWN: Line protocol on Interface Ethernet1, changed state to down
Leaf-1 Ebra: %LINEPROTO-5-UPDOWN: Line protocol on Interface Ethernet1, changed state to up
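
When the log is long, counting the transitions per interface makes a flapping port stand out. The following is a minimal Python sketch that matches the message format shown above. The function name count_flaps is illustrative.

```python
import re
from collections import Counter

UPDOWN_RE = re.compile(
    r"%LINEPROTO-5-UPDOWN: Line protocol on Interface (\S+), "
    r"changed state to (up|down)"
)

def count_flaps(log_text: str) -> Counter:
    """Count 'down' transitions per interface in syslog text."""
    flaps: Counter = Counter()
    for match in UPDOWN_RE.finditer(log_text):
        interface, state = match.groups()
        if state == "down":
            flaps[interface] += 1
    return flaps

# The two log lines shown above.
LOG = """\
Leaf-1 Ebra: %LINEPROTO-5-UPDOWN: Line protocol on Interface Ethernet1, changed state to down
Leaf-1 Ebra: %LINEPROTO-5-UPDOWN: Line protocol on Interface Ethernet1, changed state to up
"""

for interface, downs in count_flaps(LOG).most_common():
    print(interface, downs)
```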


If the interface flaps are happening at random intervals, we could check interface uptime and the number of state changes.

Switch#show interfaces ethernet 1
Ethernet1 is up, line protocol is up (connected)
  Hardware is Ethernet, address is 5038.0002.0001 (bia 5038.0002.0001)
  Member of Port-Channel1
  Ethernet MTU 9214 bytes
  Full-duplex, Unconfigured, auto negotiation: off, uni-link: n/a
  Up 10 seconds
  Loopback Mode: None
  17 link status changes since last clear

 

Check for any spanning tree instability:


Switch#show logging | grep -i "Stp"
Stp: %SPANTREE-6-STABLE_CHANGE: Stp state is now not stable

 

The following command shows the number of spanning tree topology changes and the time of the most recent one. From the timestamp and the number of changes, we can determine whether STP is stable on the switch.


Switch#show spanning-tree detail
 MST0 is executing the mstp Spanning Tree protocol
 <snipped> 
  Root port is 100 (Port-Channel1), cost of root path is  0 (Ext) 1999 (Int)
  Number of topology changes 3 last change occurred 4842 seconds ago from Port-Channel1
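
The count and age can be pulled out of the output with a short script. The sketch below matches the line format shown above. The 300-second window used to flag a recent change is an arbitrary, hypothetical threshold, and the function names are illustrative.

```python
import re

TC_RE = re.compile(
    r"Number of topology changes (\d+) last change occurred (\d+) seconds ago"
)

def topology_changes(detail_output: str) -> list[tuple[int, int]]:
    """Return (change_count, seconds_since_last_change) for each match."""
    return [(int(c), int(s)) for c, s in TC_RE.findall(detail_output)]

def recently_changed(detail_output: str, window_s: int = 300) -> bool:
    """True if any instance changed topology within the last window_s seconds."""
    return any(age < window_s for _, age in topology_changes(detail_output))

# The line shown in the output above.
DETAIL = (
    "  Number of topology changes 3 last change occurred "
    "4842 seconds ago from Port-Channel1"
)

print(topology_changes(DETAIL))
print(recently_changed(DETAIL))
```

Running it twice a few minutes apart and comparing the change count gives a stronger signal than a single capture.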

Check if there are any output discards on interfaces.

The following output can be collected for a couple of iterations to determine if the counters are incrementing over a period of time:

Switch#show interfaces counters discards | nz
Port              InDiscards  OutDiscards
---------------  -----------  -----------
Et11                       0           15
Et12                       0        10092
Et13                       0        31095
---------------  -----------  -----------
Totals                     0        41202
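
To see which ports are still discarding, two captures of this table can be compared. In the sketch below the first capture uses the values shown above; the second capture is hypothetical and only illustrates an incrementing counter. The function names are illustrative.

```python
def parse_discards(text: str) -> dict[str, tuple[int, int]]:
    """Parse 'counters discards' rows into {port: (in_discards, out_discards)}."""
    rows: dict[str, tuple[int, int]] = {}
    for line in text.splitlines():
        fields = line.split()
        # Data rows look like: <port> <InDiscards> <OutDiscards>
        if (len(fields) == 3 and fields[0] != "Totals"
                and fields[1].isdigit() and fields[2].isdigit()):
            rows[fields[0]] = (int(fields[1]), int(fields[2]))
    return rows

def growing_out_discards(first: dict, second: dict) -> dict[str, int]:
    """Ports whose OutDiscards increased between the two captures."""
    return {
        port: second[port][1] - first[port][1]
        for port in second
        if port in first and second[port][1] > first[port][1]
    }

FIRST = """\
Port              InDiscards  OutDiscards
Et11                       0           15
Et12                       0        10092
Et13                       0        31095
Totals                     0        41202
"""
# Hypothetical later capture: only Et12 keeps incrementing.
SECOND = FIRST.replace("10092", "10150").replace("41202", "41260")

print(growing_out_discards(parse_discards(FIRST), parse_discards(SECOND)))
```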

Check if there are any CRC and Rx errors on interfaces in the path between the hosts.

------------- show interfaces counters errors -------------
Port        FCS  Align  Symbol    Rx  Runts  Giants  Tx
Et1        3245      0       0  3245      0       0   0
Et2        1939      0       0  1939      0       0   0
<snipped>

Note: The steps documented in this guide help the engineer track the issue down to the particular switch or networking device where the problem may lie. However, a TAC case with Arista is recommended to confirm and determine the nature of the problem.

Please collect the following logs from the suspected switches after performing the troubleshooting and open a TAC case.

show tech-support
show agent log
show agent qtrace
show logging system