Troubleshooting Dequeue Deletes on 7280/7500 Devices


Overview

On 7280 / 7500 devices, any known unicast packet arriving on an ingress port is classified by the ingress processor based on its priority level and egress port, and enqueued in the Virtual Output Queue (VOQ). The same applies to BUM traffic if ingress-only replication is enabled. The packet remains in the ingress chip buffers until the egress packet scheduler issues a credit grant for it.

DeqDelete drops indicate stale packets in a VOQ, i.e., packets that have been sitting in the VOQ for more than 500 ms without receiving credits. These packets are deleted, and the drops are reported as DeqDeletePktCnt; they are independent of InDiscards and OutDiscards. This document outlines how to troubleshoot DeqDelete drops.

Known Causes

Below are a couple of common reasons for DeqDeletes:

  • Starvation of low priority queues when strict priority queuing is used. This can affect data traffic (destined to front-panel ports) or control traffic (destined to the CPU).
  • Flow control configuration causing flow control to be asserted on a port.

Syslog

If the device is reporting DeqDelete drops, you will see logs like the following in the output of “show logging”:

switch#show logging | grep DeqDelete

Mar 1 15:57:03 switch EventMgr: %HARDWARE-3-DROP_COUNTER_ALERT: Persistent Internal Drop 'DeqDeletePktCnt': 139906 detected on Jericho5/3

Another way to confirm would be to check the output of the following commands:

switch#show hardware counter drop

Summary:

Total Adverse (A) Drops: 334790742

Total Congestion (C) Drops: 0

Total Packet Processor (P) Drops: 501036382

Type  Chip       CounterName               :        Count :    First Occurrence :     Last Occurrence 

-----------------------------------------------------------------------------------------------------

A  Jericho5/3   DeqDeletePktCnt            :     83794436 : 2019-03-01 15:54:44 : 2019-03-01 16:11:15

A  Jericho3/5   DeqDeletePktCnt            :     82956909 : 2019-03-01 15:54:44 : 2019-03-01 16:10:51

switch#show platform fap interrupts

Jericho5/3

 --------------------------------------------------------------------------------------------------

 | Interrupt Bit                              | Count |    First Occurrence |     Last Occurrence |

 | IPS_QueueEnteredDel[0]                     |  4379 | 2019-03-01 15:54:44 | 2019-03-01 16:11:15 | 

Troubleshooting DeqDeletePktCnt Drops

Step 1: Confirm the device is seeing DeqDelete drops:

switch#show logging | grep DeqDelete

Mar 1 15:57:03 switch EventMgr: %HARDWARE-3-DROP_COUNTER_ALERT: Persistent Internal Drop 'DeqDeletePktCnt': 139906 detected on Jericho5/3
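
If you monitor many devices, this first check can be scripted. Below is a minimal sketch that parses the alert format shown above and extracts the counter name, count, and chip; the regular expression and the sample line are only illustrative, so adjust them if your log format differs.

import re

# Matches the DROP_COUNTER_ALERT line shown above and extracts the counter
# name, the reported count, and the chip that raised it.
ALERT_RE = re.compile(
    r"%HARDWARE-3-DROP_COUNTER_ALERT: Persistent Internal Drop "
    r"'(?P<counter>\w+)': (?P<count>\d+) detected on (?P<chip>\S+)"
)

def parse_drop_alerts(log_lines):
    """Yield (counter, count, chip) for every drop alert found in the log."""
    for line in log_lines:
        match = ALERT_RE.search(line)
        if match:
            yield match["counter"], int(match["count"]), match["chip"]

sample = ("Mar 1 15:57:03 switch EventMgr: %HARDWARE-3-DROP_COUNTER_ALERT: "
          "Persistent Internal Drop 'DeqDeletePktCnt': 139906 detected on Jericho5/3")
for counter, count, chip in parse_drop_alerts([sample]):
    print(f"{counter}: {count} drops on {chip}")   # DeqDeletePktCnt: 139906 drops on Jericho5/3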

 

Step 2: Confirm that the DeqDeletePktCnt event occurred on the reported chip by looking at the timestamp of the most recent event. If the timestamp is not recent and the counter is not incrementing, the alert can be ignored.

switch#show hardware counter drop

Summary:

Total Adverse (A) Drops: 334790742

Total Congestion (C) Drops: 0

Total Packet Processor (P) Drops: 501036382

Type  Chip       CounterName               :        Count :    First Occurrence :     Last Occurrence 

-----------------------------------------------------------------------------------------------------

A  Jericho5/3   DeqDeletePktCnt            :     83794436 : 2019-03-01 15:54:44 : 2019-03-01 16:11:15

Note the chip number (in this case, Jericho5/3).
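
If you want to automate this recency check, the following sketch compares the “Last Occurrence” timestamp on a counter line against the current time. The 10-minute window is an arbitrary illustrative threshold, not an Arista recommendation, and the parsing assumes the line format shown above.

import re
from datetime import datetime, timedelta

TIMESTAMP_RE = re.compile(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}")
RECENT_WINDOW = timedelta(minutes=10)   # illustrative threshold only

def last_occurrence_is_recent(counter_line, now=None):
    """Return True if the last timestamp on the counter line is within RECENT_WINDOW."""
    stamps = TIMESTAMP_RE.findall(counter_line)
    if not stamps:
        return False
    last_seen = datetime.strptime(stamps[-1], "%Y-%m-%d %H:%M:%S")
    return (now or datetime.now()) - last_seen <= RECENT_WINDOW

line = ("A  Jericho5/3   DeqDeletePktCnt            :     83794436 : "
        "2019-03-01 15:54:44 : 2019-03-01 16:11:15")
print(last_occurrence_is_recent(line, now=datetime(2019, 3, 1, 16, 15)))   # True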

 

Step 3: Identify the Queue number that experienced the delete by issuing the following command:

switch#platform fap Jericho5/3 diag get IPS_DEL_QUEUE_NUMBER

IPS_DEL_QUEUE_NUMBER.IPS0[0x124]=0x1501be8: <QUEUE_LAST_CR_TIME=0x15,

   DEL_QUEUE_NUM=0x1be8>

IPS_DEL_QUEUE_NUMBER.IPS1[0x124]=0x1401be8: <QUEUE_LAST_CR_TIME=0x14,

   DEL_QUEUE_NUM=0x1be8>                                    

This identifies the queue that was experiencing the DeqDelete event. 0x1be8 translates to 7144 in decimal. 

To identify which port this maps to, use the following command:

switch#show platform fap mapping

Jericho5/3 (FapId: 24  BaseSystemCoreId: 48)

      Port                SysPhyPort    Voq  Core  FapPort  OtmPort QPairs Xlge NifPort Qsgmii Serdes

-----------------------------------------------------------------------------------------------------
Ethernet5/7/3                    352   7144     0       14      128      8    -      34      -   (34)

If the VOQ is associated with the CPU port, it is likely the DeqDelete events are happening due to expected oversubscription scenarios with CPU bound traffic.

The VOQ number is always a multiple of 8. The decimal number from the diag command is the sum of the VOQ number and the Traffic Class. In the above example of 7144, we are seeing drops for et5/7/3 on TC0 (7144 + 0). If the decimal number had been 7146 (7144 + 2), that would indicate drops on et5/7/3 for TC2.
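
This arithmetic is easy to script. The helper below is a minimal sketch (the function name is hypothetical) that splits a DEL_QUEUE_NUM value into the VOQ number and the traffic class; the VOQ number can then be looked up in “show platform fap mapping”.

def decode_del_queue(del_queue_num):
    """Split a DEL_QUEUE_NUM value into (voq, traffic_class).

    The VOQ number is always a multiple of 8, so the traffic class is
    simply the remainder after dividing by 8.
    """
    traffic_class = del_queue_num % 8
    voq = del_queue_num - traffic_class
    return voq, traffic_class

# 0x1be8 is the DEL_QUEUE_NUM from the register dump above.
voq, tc = decode_del_queue(0x1be8)
print(f"VOQ {voq}, TC{tc}")        # VOQ 7144, TC0

print(decode_del_queue(7146))      # (7144, 2) -> same port, TC2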

Starting with EOS-4.22.0, the “show platform fap voq delete” command can also be used to get this information. Please refer to the following link for more details on this feature:

VOQ Delete Monitoring

Step 4: For a CPU VOQ, look for signs of high levels of IP traffic hitting the CPU on the chip reporting the DeqDelete:

switch#show cpu counters queue | nz

Jericho5/3 :

 CPU Queue                      Pkts        Octets     DropPkts       DropOctets
--------------------------------------------------------------------------------

 CoppSystemL3Ttl1IpOptions    462367     597850052        26708         40393461

 CoppSystemL3Ttl1IpOptions    165924     152731916         1571          2374189

If the DeqDeletes are incrementing due to control plane traffic, we need to find the source interface from which a high volume of control plane packets is being received. The tcpdump shown below can be run to identify the source of the packets.

switch#bash sudo tcpdump -nevvi any

You can also use tcpdump options to filter traffic and view only the packets you are interested in. The following link gives more information about tcpdump:

https://eos.arista.com/using-tcpdump-for-troubleshooting/
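
To quickly see which sources dominate the CPU-bound traffic, the tcpdump output can be piped through a small counter script. The sketch below (the script name top_talkers.py is hypothetical) does a heuristic parse: it takes the token immediately before “>” as the source, so with the -e option it typically reports the source MAC address, which still points at the sender.

# Usage (from the switch bash shell, as an example only):
#   sudo tcpdump -nei any -c 1000 2>/dev/null | python3 top_talkers.py
import re
import sys
from collections import Counter

SRC_RE = re.compile(r"(\S+)\s+>\s+\S+")   # grabs the token before the first '>'

def top_talkers(lines, limit=10):
    """Count how often each source token appears in tcpdump output lines."""
    counts = Counter()
    for line in lines:
        match = SRC_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts.most_common(limit)

if __name__ == "__main__":
    for source, seen in top_talkers(sys.stdin):
        print(f"{seen:8d}  {source}")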

 

Step 5: DeqDeletes can also be due to traffic destined to front panel ports. In that case, use the following command to get details about the different traffic classes queued toward that interface. As mentioned previously, a large amount of traffic in a higher priority traffic class can starve a lower priority traffic class, leading to DeqDeletes in the lower traffic class.

switch#show interfaces counters queue | nz

Aggregate VoQ Counters

Egress                Traffic              Pkts            Octets        DropPkts      DropOctets
Port                  Class  

Et5/7/3                   TC1        1618214102     1401158461686               0               0

Et5/7/3                   TC3         874715849      126071452651               0               0

Take multiple iterations of this command to make sure the counters are incrementing and not stale.
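
Comparing two snapshots by hand is tedious, so a small eAPI script can compute the per-traffic-class deltas for you. The sketch below is a rough example, assuming eAPI is enabled on the switch and the command is collected in text mode; the URL, credentials, poll interval, and parsing regex are placeholders.

import re
import time
from jsonrpclib import Server   # e.g. pip install jsonrpclib-pelix

# Placeholders: point this at your switch and credentials.
SWITCH = Server("https://admin:admin@switch.example.com/command-api")
ROW_RE = re.compile(r"^(?P<port>\S+)\s+(?P<tc>TC\d+)\s+(?P<pkts>\d+)", re.M)

def snapshot():
    """Return {(port, tc): pkts} parsed from 'show interfaces counters queue'."""
    output = SWITCH.runCmds(1, ["show interfaces counters queue"], "text")[0]["output"]
    return {(m["port"], m["tc"]): int(m["pkts"]) for m in ROW_RE.finditer(output)}

first = snapshot()
time.sleep(10)                    # interval between the two iterations
second = snapshot()

for key in sorted(second):
    delta = second[key] - first.get(key, 0)
    if delta:                     # only print traffic classes that are incrementing
        print(f"{key[0]:12s} {key[1]:4s} +{delta} pkts")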

Recommendations

  • In case DeqDeletes are incrementing due to traffic destined to front panel ports, check whether any QoS configuration is applied to the egress port on which DeqDeletePktCnt is incrementing. It is possible that higher priority queues are being preferred at the expense of lower priority queues (TC0, TC1), starving them.
  • Credits can also be blocked to a queue due to a higher priority queue continuously servicing a high amount of traffic (when the egress scheduling is strict-priority).
switch#show interfaces counters queue | nz

Aggregate VoQ Counters

Egress                   Traffic            Pkts             Octets       DropPkts     DropOctets
Port                     Class  

Et5/7/3                      TC1       1618214102     1401158461686              0              0

Et5/7/3                      TC3        874715849      126071452651              0              0

In the example above, it is possible that TC1 traffic does not get credits due to high traffic for higher priority TC3, and results in DeqDeletes in TC1.

This can be avoided by using round-robin scheduling. The following configuration, applied at the interface level, enables it:

switch(config)#interface ethernet 5/7/3

switch(config-if-Et5/7/3)#tx-queue 0-3

switch(config-if-Et5/7/3-txq-0-3)#no priority

This makes the egress scheduling round-robin for TC0-TC3, ensuring that the lower priority queues receive at least some credits. Depending on which lower priority queues are seeing drops, this configuration can be modified accordingly. The following link is the documentation on round-robin scheduling:

https://www.arista.com/en/um-eos/eos-section-27-10-quality-of-service-configuration-commands#ww1162819

 

If the issue you are seeing does not match any of the scenarios described above, please contact Arista Support for further investigation.
