Troubleshooting High CPU Utilization

 
 

Introduction

 

This article explains what CPU load is on an Arista switch, how to tell when the load is high, and how to troubleshoot high CPU utilization.

We will cover different topics related to High CPU utilization:

  1. How do I identify if my switch has a high CPU?
  2. What is considered normal CPU % utilization?
  3. What is an example of high CPU utilization?
  4. How to identify average load on an Arista switch?
  5. What is the difference between CPU load average and CPU utilization?
  6. How to interpret the load numbers?
  7. Can the load average be greater than 1.0 and what does that mean?
  8. Does having a multi-core switch make a difference on load average?
  9. What is a high load on an Arista switch?
  10. What number should I be looking at for the load average, i.e. 1, 5 or 15 minutes?
  11. What are the symptoms of a switch with high CPU?
  12. How to determine root cause of high load average?
  13. What could be causing a CPU-bound load?
  14. What is causing out of memory issues?
  15. Can a single process bring down the switch?

 

 

1) How to view the CPU usage

 

To view the CPU usage, use the top command from either CLI or Bash:

  • CLI: show process top
  • Bash: top

The following shows the CPU utilization on an Arista Switch:

switch# show process top
 
 ***Header***
 top - 15:52:02 up  6:55,  1 user,  load average: 0.08, 0.02, 0.01
 Tasks: 132 total,   1 running, 131 sleeping,   0 stopped,   0 zombie
 Cpu(s): 13.5%us,  0.5%sy,  0.0%ni, 86.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
 Mem:   2043420k total,  1296716k used,   746704k free,   107812k buffers
 Swap:        0k total,        0k used,        0k free,   768644k cached
 
 ***Processes***
 PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
1469 root      20   0  460m  79m  33m S 12.0  4.0  53:54.75 FocalPoint
1483 root      20   0  167m  41m  15m S  2.3  2.1  15:04.18 PhyTn
1491 root      20   0  169m  42m  13m S  2.3  2.1  13:57.29 Mdio
1494 root      20   0  173m  43m  16m S  2.0  2.2   3:44.96 FruSnmp
1444 root      20   0  183m  78m  42m S  1.7  4.0   8:06.73 Sysdb
1442 root      20   0  161m  26m 2120 S  0.7  1.3   1:37.10 ProcMgr-worker
1474 root      20   0  163m  36m  11m S  0.7  1.8   1:49.16 PhyEthtool
1445 root      20   0  182m  65m  33m S  0.3  3.3   0:58.03 Fru
1461 root      20   0  186m  55m  27m S  0.3  2.8   0:50.46 Snmp
1468 root      20   0  183m  52m  25m S  0.3  2.6   0:29.40 Mlag
1470 root      20   0  175m  46m  19m S  0.3  2.3   0:58.93 Lag+LacpAgent
1475 root      20   0  164m  38m  12m S  0.3  1.9   1:04.16 Adt7462Agent
1479 root      20   0  172m  40m  15m S  0.3  2.0   1:22.02 MlagTunnel
1482 root      20   0  164m  37m  12m S  0.3  1.9   1:18.18 Smbus
1503 root      20   0  171m  39m  17m S  0.3  2.0   0:52.25 Sflow
1505 root      20   0  163m  35m  10m S  0.3  1.8   0:31.05 Thermostat
   1 root      20   0  2064  916  688 S  0.0  0.0   0:00.43 init

To exit the output, press CTRL + C.
In Bash, further options for the top command are described on this page:

http://linux.about.com/od/commands/l/blcmdl1_top.htm

 

 

2) What is considered normal CPU % utilization?

 

The following table gives an overview of the CPU utilization ranges and what to do in each:

 

CPU % utilization

CPU % range    Comments
0% to 75%      Normal CPU % utilization
75% to 90%     Review average load interval and ensure it is <20% during the 15-minute period
90%+           Review high process and/or open TAC case immediately
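
As an illustration, the ranges in the table can be expressed as a small lookup. This is a minimal sketch in Python, not an Arista tool: the function name is hypothetical, and treating exactly 75% and 90% as belonging to the higher range is an assumption, since the table does not say on which side the boundaries fall.

```python
def classify_cpu_utilization(cpu_percent: float) -> str:
    """Return the table's guidance for a CPU % utilization reading."""
    if not 0.0 <= cpu_percent <= 100.0:
        raise ValueError("cpu_percent must be between 0 and 100")
    if cpu_percent >= 90.0:
        return "review the busiest process and/or open a TAC case"
    if cpu_percent >= 75.0:
        return "review the 15-minute load average"
    return "normal"


if __name__ == "__main__":
    # Utilization can be taken as 100 minus the idle (id) percentage from top.
    for idle in (86.0, 20.0, 5.0):
        used = 100.0 - idle
        print(f"{used:.1f}% -> {classify_cpu_utilization(used)}")
```

A single reading is only a snapshot, so a result other than "normal" is a prompt to look at the load average rather than a verdict on its own.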

 

Temporary high CPU utilization can be considered normal during topology changes, reconvergence, or other intensive but short-lived activities.

If high CPU usage persists (for example, for more than a few minutes), there could be a problem.

Temporary bursts over 80% that last only a few seconds are normal.

 

 

3) What is an example of high CPU utilization?

 

 [switch~]$ top
 
 top - 21:48:08 up 5 days,  6:57,  2 users,  load average: 1.23, 0.62, 0.63      
 Tasks: 214 total,   7 running, 207 sleeping,   0 stopped,   0 zombie            
 Cpu(s): 53.2%us,  9.0%sy,  0.0%ni, 35.6%id,  0.0%wa,  1.3%hi,  1.0%si,  0.0%st  
 Mem:   4103584k total,  2451264k used,  1652320k free,   115760k buffers        
 Swap:        0k total,        0k used,        0k free,   818920k cached         
 
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND            
 1811 root      20   0  277m 160m  54m S 13.1  4.0 704:47.48 Sysdb              
 2640 root      20   0  177m  46m  17m S 10.2  1.2 843:10.34 PhyAeluros         
 2608 root      20   0  177m  46m  17m S  8.7  1.2 133:07.10 PhyAeluros         
 1947 root      20   0  164m  39m  16m R  7.3  1.0 490:45.58 Lldp
 (Output truncated....)

 

 

4) How to identify average load on an Arista switch?

 

To view the load on an Arista switch, use either the top command or the uptime command in Bash:

 [switch~]$ uptime
  21:48:08 up 5 days,  6:57,  2 users,  load average:  0.30, 0.65, 0.80

The output shows that the switch load average is 0.30, 0.65, 0.80 (or 30%, 65% and 80% of a single CPU). These numbers represent the average system load during the last 1, 5 and 15 minutes, respectively. Technically speaking, the load average represents the average number of processes that were running on, or waiting for, the CPU during the last 1, 5 or 15 minutes (on Linux, processes in uninterruptible sleep, typically waiting on I/O, are counted as well).

For instance, if the switch has a current load of 0, the system is completely idle. If it has a load of 1, then on average one process is using or waiting for the CPU. If you have a load of 1 and then spawn another process that would normally tie up a CPU, the switch load should go to 2. The load average gives you a good idea of how consistently busy the switch has been over the past 1, 5 and 15 minutes.

NOTE – load average is not normalized according to the number of CPUs on the switch. Generally speaking, a consistent load of 1 means one CPU on the system is tied up. In simplified terms, this means that a single-CPU system with a load of 1 is roughly as busy as a two-CPU system with a load of 2.
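
The three figures can also be pulled out of an uptime or top header line programmatically. The following is a minimal Python sketch under stated assumptions: the function and constant names are hypothetical, and the parser only relies on the "load average:" text shown in the outputs above.

```python
import re
from typing import Tuple

# Matches the three figures after "load average:" in uptime or top output.
LOAD_AVERAGE_RE = re.compile(r"load average:\s*([\d.]+),\s*([\d.]+),\s*([\d.]+)")


def parse_load_average(line: str) -> Tuple[float, float, float]:
    """Return the (1-minute, 5-minute, 15-minute) load averages from a line."""
    match = LOAD_AVERAGE_RE.search(line)
    if match is None:
        raise ValueError("no load average found in: %r" % line)
    one, five, fifteen = (float(value) for value in match.groups())
    return one, five, fifteen


if __name__ == "__main__":
    sample = " 21:48:08 up 5 days,  6:57,  2 users,  load average:  0.30, 0.65, 0.80"
    one, five, fifteen = parse_load_average(sample)
    print(f"1 min: {one}, 5 min: {five}, 15 min: {fifteen}")
```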

 

 

5) What is the difference between CPU load average and CPU utilization?

 

CPU load average

  Measures the trend in demand for the CPU over a period of time.

CPU utilization

  The CPU percentage value is an instantaneous snapshot, measured at a single point in time.

The load average includes all demand for the CPU (processes running as well as waiting for the CPU), whereas the CPU percentage takes into account only how much of the CPU was active at the time of measurement.

The load average therefore usually gives the more complete picture of the health of the CPU.

 

 

6) How to interpret the load numbers?

 

To illustrate the interpretation, we are going to use an example with load average: 0.30, 0.65, 0.80, as per the below output:

top - 17:39:57 up  2:18,  1 user,  load average: 0.30, 0.65, 0.80

For this example, we assume that the above output is from a single-CPU system.

Decreasing Load

If an administrator were to log into the switch and observe the above load average, they would probably conclude that the switch had a relatively high load (0.80 = 80%) over the last 15 minutes, that the load was lower over the last 5 minutes (0.65 = 65%), and that it has dropped significantly during the last minute (0.30 = 30%). From these values, one could assume that the cause of the load has subsided.

Increasing Load

If we modify the previous example slightly and reverse the order of the load average values (0.80, 0.65, 0.30), one would conclude that the high load likely started within the last 5 minutes and is getting worse.
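
The two readings above amount to comparing the three averages with one another. A minimal Python sketch of that comparison follows; the function name and the "mixed" label for any other ordering are illustrative assumptions, not part of top or EOS.

```python
def load_trend(one_min: float, five_min: float, fifteen_min: float) -> str:
    """Describe the direction of the load from the 1, 5 and 15 minute averages."""
    if one_min < five_min < fifteen_min:
        return "decreasing"  # most recent average is the lowest
    if one_min > five_min > fifteen_min:
        return "increasing"  # most recent average is the highest
    return "mixed"  # no clear direction, for example a short spike


if __name__ == "__main__":
    print(load_trend(0.30, 0.65, 0.80))  # the "Decreasing Load" example
    print(load_trend(0.80, 0.65, 0.30))  # the "Increasing Load" example
```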

 

 

7) Can the load average be greater than 1.0 and what does that mean?

 

Yes, it can. The load average numbers for 1, 5 and 15 minutes can range from 0 to 8.0 or more. As a general rule of thumb, the load average should be below 1.00. Temporary spikes are fine as long as the 15-minute average stays below 1.00. In more detail, the general rule of thumb for the longer-term (5- and 15-minute) averages is:

 - > 0.70 - You need to start investigating the issues before things get worse.
 - > 1.00 - If the load average stays above 1.00, find the problem and fix it now.
     Otherwise, you're going to get woken up in the middle of the night, and it's
     not going to be fun.
 - > 2.00 - If your load average is above 2.00, there could be serious issues, and the
     switch is either hanging or slowing way down.

On a single-CPU system, a load average of 1.00 means that the CPU is busy all the time but no process is waiting for the CPU (no processing congestion). This is the ideal scenario, in which the CPU is fully and optimally utilized.

A load average value of 0.30 indicates that the CPU remained idle for 70% of the time during that interval, and it was busy only 30% of the time.

A load average value higher than 1.00 indicates that the processor is overloaded. For example, a value of 1.8 means that the processor is busy all the time and that, in addition, there are processes waiting for CPU time. With a value of 1.8, the CPU would need 80% more capacity to satisfy all the requests for CPU cycles and bring the value down to 1.00. Even more capacity would be needed to bring the value below 1.00.
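
The rule of thumb and the "80% more capacity" arithmetic can be sketched in Python as follows. This assumes a single-CPU system, as in the explanation above; the function names and the short labels they return are hypothetical.

```python
def assess_load_average(load_average: float) -> str:
    """Apply the 0.70 / 1.00 / 2.00 rule of thumb to a long-term load average."""
    if load_average < 0.0:
        raise ValueError("load_average cannot be negative")
    if load_average > 2.00:
        return "serious"      # switch may be hanging or slowing way down
    if load_average > 1.00:
        return "fix now"      # CPU is overloaded
    if load_average > 0.70:
        return "investigate"  # act before things get worse
    return "ok"


def extra_capacity_percent(load_average: float) -> float:
    """Extra CPU capacity, in percent, needed to bring the load down to 1.00."""
    return max(0.0, (load_average - 1.0) * 100.0)


if __name__ == "__main__":
    for load in (0.30, 0.80, 1.80, 2.50):
        print(load, assess_load_average(load),
              f"{extra_capacity_percent(load):.0f}% extra capacity needed")
```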

 

 

8) Does having a multi-core switch make a difference on load average?

 

The simple answer is yes: more cores give you more capacity to process more threads. The general rule of thumb is:

 On a multicore system, your load should not exceed the number of cores available.

Therefore, on multi-core switches, the load is relative to the number of processor cores available. For example, on a dual-core CPU, the “100% utilization” mark is 2.00. To view the processor information on an Arista switch, use the following command from Bash:

 $ cat /proc/cpuinfo

The following is the cpuinfo information from an Arista 7050-S64 switch:

[switch ~]$ cat /proc/cpuinfo
 processor	: 0
 vendor_id	: AuthenticAMD
 cpu family	: 16
 model		: 6
 model name	: AMD Turion(tm) II Neo N41H Dual-Core Processor
 stepping	: 3
 cpu MHz	: 1500.109
 cache size	: 1024 KB
 physical id	: 0
 siblings	: 2
 core id	: 0
 cpu cores	: 2
 apicid	: 0
 initial apicid: 0
 fpu		: yes
 fpu_exception	: yes
 cpuid level	: 5
 wp		: yes
 flags		: fpu v
 bogomips	: 3000.21
 TLB size	: 1024 4K pages
 clflush size	: 64
 cache_alignment: 64
 address sizes	: 48 bits physical, 48 bits virtual
 power management: ts ttp tm stc 100mhzsteps hwpstate
 
 processor	: 1
 vendor_id	: AuthenticAMD
 cpu family	: 16
 model		: 6
 model name	: AMD Turion(tm) II Neo N41H Dual-Core Processor
 stepping	: 3
 cpu MHz	: 1500.109
 cache size	: 1024 KB
 physical id	: 0
 siblings	: 2
 core id	: 1
 cpu cores	: 2
 apicid	: 1
 initial apicid: 1
 fpu		: yes
 fpu_exception	: yes
 cpuid level	: 5
 wp		: yes
 flags		: fpu vme 
 bogomips	: 3000.22
 TLB size	: 1024 4K pages
 clflush size	: 64
 cache_alignment: 64
 address sizes	: 48 bits physical, 48 bits virtual
 power management: ts ttp tm stc 100mhzsteps hwpstate
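
To compare a load average against the number of cores, the "processor" entries in the cpuinfo output can be counted and the load divided by that count. The following Python sketch uses a shortened copy of the output above as sample text; the function names are hypothetical, and counting "processor" lines is an assumption that matches the layout shown here.

```python
def count_processors(cpuinfo_text: str) -> int:
    """Count the 'processor : N' entries in /proc/cpuinfo style text."""
    count = 0
    for line in cpuinfo_text.splitlines():
        key = line.split(":", 1)[0].strip()
        if key == "processor":  # ignores keys such as "model name"
            count += 1
    return count


def load_per_core(load_average: float, cores: int) -> float:
    """Normalize a load average by the number of cores."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return load_average / cores


if __name__ == "__main__":
    sample = (
        "processor\t: 0\n"
        "model name\t: AMD Turion(tm) II Neo N41H Dual-Core Processor\n"
        "\n"
        "processor\t: 1\n"
        "model name\t: AMD Turion(tm) II Neo N41H Dual-Core Processor\n"
    )
    cores = count_processors(sample)
    print(cores, load_per_core(2.00, cores))
```

With the dual-core sample, a load average of 2.00 normalizes to 1.0 per core, which is the "100% utilization" mark described above.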

 

 

9) What is a high load on an Arista switch?

 

“It depends.” A lot of different things may cause load to be high on a switch, each of which affects performance differently. A switch might have a load of 1.50 and still be pretty responsive, while another switch might have a load of 0.5 and take forever to log in to.

What really matters when troubleshooting a switch with a high load average is why the load is high.

 

 

10) What number should I be looking at for the load average: 1, 5, or 15 minutes?

 

Always look at the 5-minute and 15-minute load average numbers. If your switch spikes above 1.0 on the one-minute average, this may only reflect transient activity such as normal convergence, and is not a concern in itself. However, when the 15-minute load average goes above 2.0 and remains that high, you should investigate immediately.

The following is a real example of a switch whose load average went above 1.0 during the last minute while the average over the 15-minute period has been increasing. In this case, the root cause of the spike should be reviewed immediately:

 top - 02:33:24 up 1 day,  5:11,  5 users,  load average: 1.50, 0.97, 0.99
 Tasks: 270 total,   3 running, 261 sleeping,   0 stopped,   6 zombie
 Cpu(s): 42.8%us,  6.8%sy,  0.0%ni, 50.0%id,  0.0%wa,  0.0%hi,  0.5%si,  0.0%st
 Mem:   4096408k total,  2953152k used,  1143256k free,   118028k buffers
 Swap:        0k total,        0k used,        0k free,  1085984k cached

 

 

11) What are the symptoms of a switch with high CPU?

 

Typically, a switch with high CPU or load may be very slow and unresponsive to management access or to control-plane protocols (Layer 2 and Layer 3 protocols). This may indicate that something is behaving abnormally. However, as all Arista switches have CPU hardware rate limiting, with various protocols prioritized to the management plane (no DoS or starvation possible), users should always be able to reach the switch via SSH or Telnet.

 

 

12) How to determine root cause of high load average?

 

High load averages fall into three categories:

  1. CPU-bound load
  2. Out of memory issues
  3. I/O-bound load

The following sections explain these categories and how to use tools like top to isolate the root cause.

 

 

13) What could be causing a CPU-bound load?

 

CPU-bound load is load caused when you have too many CPU-intensive processes running at once. Because each process needs CPU resources, they all must wait their turn. To check whether load is CPU-bound, check the CPU line in the top output:

 Cpu(s): 53.2%us,  9.0%sy,  0.0%ni, 35.6%id,  0.0%wa,  1.3%hi,  1.0%si,  0.0%st

The fields in the above output can be interpreted as follows:

   us: user CPU time. More often than not, when you have CPU-bound load, it's due
   to a process run by a user on the system, such as spanning-tree or MLAG etc. If this
   percentage is high, a single user process such as those may be the likely cause of
   the load.
   sy: system CPU time. The system CPU time is the percentage of the CPU tied up by
   kernel and other system processes. CPU-bound load should manifest either as a high
   percentage of user or high system CPU time.
   id: CPU idle time. This is the percentage of the time that the CPU spends idle. The
   higher the number here the better! In fact, if you see really high CPU idle time,
   it's a good indication that any high load is not CPU-bound.
   wa: I/O wait. The I/O wait value tells the percentage of time the CPU is spending
   waiting on I/O (typically disk I/O). If you have high load and this value is high, 
   it's likely the load is not CPU-bound but is due to either RAM issues or high disk
   I/O. 
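
The field descriptions above can be turned into a rough first check. The following Python sketch parses a Cpu(s) line and suggests where to look first. The function names and both thresholds (50% for user plus system time, 20% for I/O wait) are illustrative assumptions, not Arista guidance.

```python
import re
from typing import Dict

# Accepts "53.2%us" as shown above, and also "53.2 us" (another common layout).
CPU_FIELD_RE = re.compile(r"([\d.]+)\s*%?\s*(us|sy|ni|id|wa|hi|si|st)\b")


def parse_cpu_line(line: str) -> Dict[str, float]:
    """Return the Cpu(s) percentages keyed by field name (us, sy, id, wa, ...)."""
    return {name: float(value) for value, name in CPU_FIELD_RE.findall(line)}


def likely_bottleneck(fields: Dict[str, float],
                      busy_threshold: float = 50.0,
                      iowait_threshold: float = 20.0) -> str:
    """Suggest where to look first; both thresholds are hypothetical."""
    if fields.get("wa", 0.0) >= iowait_threshold:
        return "memory or I/O"
    if fields.get("us", 0.0) + fields.get("sy", 0.0) >= busy_threshold:
        return "CPU"
    return "not obviously CPU-bound"


if __name__ == "__main__":
    sample = (" Cpu(s): 53.2%us,  9.0%sy,  0.0%ni, 35.6%id,"
              "  0.0%wa,  1.3%hi,  1.0%si,  0.0%st")
    fields = parse_cpu_line(sample)
    print(fields["us"], fields["sy"], fields["wa"], likely_bottleneck(fields))
```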

If you do see a high percentage for user or system values, there is a good chance the load is CPU-bound. To track down the root cause, skip down a few lines to where top displays a list of current processes running on the system. By default, top will sort these based on the percentage of CPU used with the processes using the most on top.

 PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 1423 root      20   0  671m 118m  33m R 45.6  5.9   2839:25 Mdio
 1461 root      20   0  156m  44m  17m R 45.1  2.3 133:41.55 PhyAeluros
 1399 root      20   0  167m  79m  40m R 11.0  4.0 564:41.03 Sysdb
 1460 root      20   0 87756  59m  39m S  5.0  3.0  35:10.43 ribd
 1439 root      20   0  148m  40m  15m S  2.7  2.0 149:42.00 FruSnmp

In the above example, the %CPU column shows how much CPU each individual process is consuming: the Mdio and PhyAeluros agents are each consuming ~45% of the CPU time. This is not normal behavior and Arista TAC should be contacted immediately.

Generally speaking, either a single process will try to consume most of the CPU (e.g. 99%), or a number of smaller processes will contend for CPU time. In either case, the top command makes it relatively simple to see the processes that are causing the problem.
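
If top output has been captured to a file or pasted from a session, the busiest processes can be picked out with a short script. This Python sketch assumes the default 12-column layout shown above (PID through COMMAND); the function name and the 20% cut-off are hypothetical.

```python
from typing import List, Tuple


def busiest_processes(top_rows: str, min_cpu: float = 20.0) -> List[Tuple[str, float]]:
    """Return (command, %CPU) pairs at or above min_cpu, busiest first."""
    busiest = []
    for row in top_rows.splitlines():
        fields = row.split(None, 11)  # COMMAND is the 12th column
        if len(fields) < 12 or not fields[0].isdigit():
            continue  # skip the header and blank or truncated lines
        command, cpu = fields[11].strip(), float(fields[8])
        if cpu >= min_cpu:
            busiest.append((command, cpu))
    return sorted(busiest, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    sample = """\
 PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 1423 root      20   0  671m 118m  33m R 45.6  5.9   2839:25 Mdio
 1461 root      20   0  156m  44m  17m R 45.1  2.3 133:41.55 PhyAeluros
 1399 root      20   0  167m  79m  40m R 11.0  4.0 564:41.03 Sysdb
"""
    for command, cpu in busiest_processes(sample):
        print(f"{command}: {cpu}% CPU")
```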

 

 

14) What is causing out of memory issues?

 

Although very rare, running out of RAM can be the second cause of high CPU or load average. When Linux starts to run out of RAM, it starts to swap to storage, i.e. flash memory. Flash is slower than RAM, so each process can slow down dramatically.

To diagnose low memory issues, review the last two lines of the header information displayed by top:

 Mem:   2043420k total,  1296716k used,   746704k free,   107812k buffers
 Swap:        0k total,        0k used,        0k free,   768644k cached

These lines indicate the total amount of RAM and swap, along with how much is used and free. However, this can be misleading. To investigate the root cause of the issue, view the processes that are using the most memory:

PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
1469 root      20   0  460m  79m  33m S 12.0  4.0  53:54.75 FocalPoint
1483 root      20   0  167m  41m  15m S  2.3  2.1  15:04.18 PhyTn
1491 root      20   0  169m  42m  13m S  2.3  2.1  13:57.29 Mdio

Once the top command is running, type M (uppercase) and the processes with the highest percentage of RAM utilization will be displayed first. In the above example, FocalPoint is using 12% of the CPU and 4% of RAM.
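
One likely reason the header figures can mislead is that, in this older top layout, the "used" value also counts memory the kernel holds as buffers and page cache, which it can release when processes need it. The following Python sketch subtracts both from "used" to estimate what processes actually hold. The function names are hypothetical, and the parsing assumes the "<number>k <label>" layout shown in the header above.

```python
import re
from typing import Dict


def parse_kilobyte_fields(line: str) -> Dict[str, int]:
    """Parse '<number>k <label>' pairs from a Mem: or Swap: header line."""
    return {name: int(value) for value, name in re.findall(r"(\d+)k (\w+)", line)}


def used_excluding_cache_kb(mem_line: str, swap_line: str) -> int:
    """Memory in use, in kB, after subtracting buffers and page cache."""
    mem = parse_kilobyte_fields(mem_line)
    swap = parse_kilobyte_fields(swap_line)  # 'cached' is printed on the Swap line
    return mem["used"] - mem["buffers"] - swap["cached"]


if __name__ == "__main__":
    mem_line = " Mem:   2043420k total,  1296716k used,   746704k free,   107812k buffers"
    swap_line = " Swap:        0k total,        0k used,        0k free,   768644k cached"
    used_kb = used_excluding_cache_kb(mem_line, swap_line)
    total_kb = parse_kilobyte_fields(mem_line)["total"]
    print(f"{used_kb}k of {total_kb}k held by processes")
```

For the sample header this gives 420260k, well below the 1296716k reported as used.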

 

 

15) Can a single process bring down the switch?

 

EOS is Linux-based, and any Linux process can be preempted by the kernel: the Linux scheduler makes sure that no single process can hog the CPU completely. There might be some slowness if the CPU is overloaded, but preemption ensures that other processes still get CPU time.

 
