This article explains what CPU load means on an Arista switch, how to recognize a high load, and how to troubleshoot high CPU utilization.
We will cover different topics related to High CPU utilization:
- How do I identify if my switch has high CPU usage?
- What is considered normal CPU % utilization?
- What is an example of high CPU utilization?
- How to identify average load on an Arista switch?
- What is the difference between CPU load average and CPU utilization?
- How to interpret the load numbers?
- Can the load average be greater than 1.0 and what does that mean?
- Does having a multi-core switch make a difference on load average?
- What is a high load on an Arista switch?
- What number should I be looking at for the load average: 1, 5, or 15 minutes?
- What are the symptoms of a switch with high CPU?
- How to determine root cause of high load average?
- What could be causing a CPU-bound load?
- What is causing out of memory issues?
- Can a single process bring down the switch?
1) How to view the CPU usage
To view the CPU usage, use the top command from either CLI or Bash:
- CLI: show process top
- Bash: top
The following shows the CPU utilization on an Arista Switch:
    switch# show process top
    ***Header***
    top - 15:52:02 up 6:55, 1 user, load average: 0.08, 0.02, 0.01
    Tasks: 132 total, 1 running, 131 sleeping, 0 stopped, 0 zombie
    Cpu(s): 13.5%us, 0.5%sy, 0.0%ni, 86.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
    Mem: 2043420k total, 1296716k used, 746704k free, 107812k buffers
    Swap: 0k total, 0k used, 0k free, 768644k cached
    ***Processes***
      PID USER PR NI  VIRT  RES  SHR S %CPU %MEM     TIME+ COMMAND
     1469 root 20  0  460m  79m  33m S 12.0  4.0  53:54.75 FocalPoint
     1483 root 20  0  167m  41m  15m S  2.3  2.1  15:04.18 PhyTn
     1491 root 20  0  169m  42m  13m S  2.3  2.1  13:57.29 Mdio
     1494 root 20  0  173m  43m  16m S  2.0  2.2   3:44.96 FruSnmp
     1444 root 20  0  183m  78m  42m S  1.7  4.0   8:06.73 Sysdb
     1442 root 20  0  161m  26m 2120 S  0.7  1.3   1:37.10 ProcMgr-worker
     1474 root 20  0  163m  36m  11m S  0.7  1.8   1:49.16 PhyEthtool
     1445 root 20  0  182m  65m  33m S  0.3  3.3   0:58.03 Fru
     1461 root 20  0  186m  55m  27m S  0.3  2.8   0:50.46 Snmp
     1468 root 20  0  183m  52m  25m S  0.3  2.6   0:29.40 Mlag
     1470 root 20  0  175m  46m  19m S  0.3  2.3   0:58.93 Lag+LacpAgent
     1475 root 20  0  164m  38m  12m S  0.3  1.9   1:04.16 Adt7462Agent
     1479 root 20  0  172m  40m  15m S  0.3  2.0   1:22.02 MlagTunnel
     1482 root 20  0  164m  37m  12m S  0.3  1.9   1:18.18 Smbus
     1503 root 20  0  171m  39m  17m S  0.3  2.0   0:52.25 Sflow
     1505 root 20  0  163m  35m  10m S  0.3  1.8   0:31.05 Thermostat
        1 root 20  0  2064  916  688 S  0.0  0.0   0:00.43 init
To break out of the output, press CTRL + C.
In Bash mode, to see further options for the top command, refer to the top(1) manual page.
2) What is considered normal CPU % utilization?
The following table provides an overview of different ranges of CPU utilization, and their respective level of normality:
| CPU % range | Comments |
|---|---|
| 0% to 75% | Normal CPU % utilization |
| 75% to 90% | Review average load interval – ensure it is <20% during the 15 minutes period |
| 90%+ | Review high process and/or open TAC case immediately |
Temporarily high CPU utilization can be considered normal during topology changes, reconvergence, or other intensive but short activities.
If high CPU usage persists (for example, for more than a few minutes), there could be a problem.
Temporary bursts over 80% that last only a few seconds are normal.
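The table above can be expressed as a small lookup. The following Python sketch is illustrative only: the function names and the returned labels are invented for this example, and it assumes that the boundary values 75% and 90% fall into the higher band, which the table does not specify.

```python
def busy_percent(idle_percent):
    """Overall CPU utilization is whatever is not idle (the 'id' field in top)."""
    return round(100.0 - idle_percent, 1)


def classify_cpu_utilization(percent):
    """Map a CPU % utilization value to the guidance in the table above."""
    if not 0.0 <= percent <= 100.0:
        raise ValueError("CPU utilization must be between 0 and 100")
    if percent < 75.0:
        return "normal"
    if percent < 90.0:  # assumption: exactly 75% is treated as the 75-90 band
        return "review load average"
    return "review top processes or open a TAC case"


if __name__ == "__main__":
    # 86.0%id is taken from the show process top sample in section 1.
    utilization = busy_percent(86.0)
    print(utilization, classify_cpu_utilization(utilization))
```

Because a burst of a few seconds is normal, a check like this is more useful when applied to several samples taken over a few minutes rather than to a single reading.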
3) What is an example of high CPU utilization?
    [switch~]$ top
    top - 21:48:08 up 5 days, 6:57, 2 users, load average: 1.23, 0.62, 0.63
    Tasks: 214 total, 7 running, 207 sleeping, 0 stopped, 0 zombie
    Cpu(s): 53.2%us, 9.0%sy, 0.0%ni, 35.6%id, 0.0%wa, 1.3%hi, 1.0%si, 0.0%st
    Mem: 4103584k total, 2451264k used, 1652320k free, 115760k buffers
    Swap: 0k total, 0k used, 0k free, 818920k cached
      PID USER PR NI  VIRT  RES  SHR S %CPU %MEM     TIME+ COMMAND
     1811 root 20  0  277m 160m  54m S 13.1  4.0 704:47.48 Sysdb
     2640 root 20  0  177m  46m  17m S 10.2  1.2 843:10.34 PhyAeluros
     2608 root 20  0  177m  46m  17m S  8.7  1.2 133:07.10 PhyAeluros
     1947 root 20  0  164m  39m  16m R  7.3  1.0 490:45.58 Lldp
    (Output truncated....)
4) How to identify average load on an Arista switch?
To view the load on an Arista switch, use either the top command or the uptime command in Bash:
    [switch~]$ uptime
     21:48:08 up 5 days, 6:57, 2 users, load average: 0.30, 0.65, 0.80
In this output, the switch load average is 0.30, 0.65, 0.80 (or 30%, 65% and 80% of one CPU). These numbers represent the average system load during the last 1, 5 and 15 minutes, respectively. Technically speaking, the load average represents the average number of processes that were running or waiting for CPU time during the last 1, 5 or 15 minutes (on Linux, processes blocked in uninterruptible I/O are counted as well).
For instance, a load of 0 means the system is completely idle. A load of 1 means that, on average, one process was using or waiting for a CPU. If you have a load of 1 and then spawn another process that would normally tie up a CPU, the switch load should go to 2. The load average gives you a good idea of how consistently busy the switch has been over the past 1, 5 and 15 minutes.
NOTE: load average is not normalized according to the number of CPUs on the switch. Generally speaking, a consistent load of 1 means one CPU on the system is tied up. In simplified terms, a single-CPU system with a load of 1 is roughly as busy as a two-CPU system with a load of 2.
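To make the note about normalization concrete, the following Python sketch extracts the three load averages from an uptime or top header line and divides them by a CPU count. It is a minimal illustration, not an Arista tool: the function names are hypothetical, and the regular expression assumes the "load average: a, b, c" layout shown in the outputs in this article.

```python
import re

# Matches the "load average: a, b, c" tail printed by both uptime and top.
LOAD_RE = re.compile(r"load average:\s*([\d.]+),\s*([\d.]+),\s*([\d.]+)")


def parse_load_average(line):
    """Return the (1-min, 5-min, 15-min) load averages found in a header line."""
    match = LOAD_RE.search(line)
    if match is None:
        raise ValueError("no load average found in: %r" % line)
    return tuple(float(value) for value in match.groups())


def per_cpu_load(load, cpu_count):
    """Divide each load average by the CPU count so that 1.0 means 'fully busy'."""
    if cpu_count < 1:
        raise ValueError("cpu_count must be at least 1")
    return tuple(round(value / cpu_count, 2) for value in load)


if __name__ == "__main__":
    line = "21:48:08 up 5 days, 6:57, 2 users, load average: 0.30, 0.65, 0.80"
    load = parse_load_average(line)
    print(load)                   # (0.3, 0.65, 0.8)
    print(per_cpu_load(load, 2))  # the same load spread over two CPUs
```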
5) What is the difference between CPU load average and CPU utilization?
The CPU load average measures the trend in CPU demand over a period of time.
The CPU percentage value is an instantaneous snapshot taken at a single point in time.
The load average includes all demand for the CPU (processes running as well as waiting for the CPU), whereas the CPU percentage takes into account only how much of the CPU was active at the time of measurement.
The load average therefore gives a more complete picture of the health of the CPU.
6) How to interpret the load numbers?
To illustrate the interpretation, we are going to use an example with load average: 0.30, 0.65, 0.80, as per the below output:
top - 17:39:57 up 2:18, 1 user, load average: 0.30, 0.65, 0.80
For this example, we assume that the above output is from a single-CPU system.
An administrator who logs into the switch and observes the above load averages would probably conclude that the switch had a relatively high load over the last 15 minutes (0.80 = 80%), that the load was already easing over the last 5 minutes (0.65 = 65%), and that it has dropped significantly during the last minute (0.30 = 30%). From these values, it is reasonable to assume that the cause of the load has subsided.
If we reverse the order of the values in the previous example to 0.80, 0.65, 0.30, one would conclude that the high load had likely started in the last 5 minutes and was getting worse.
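The reasoning in the two examples above boils down to comparing the short-term average with the long-term one. The Python sketch below shows one simple way to do that; the function name and the 0.05 tolerance are arbitrary choices made for illustration, and the 5-minute value is deliberately ignored to keep the check short.

```python
def load_trend(load, tolerance=0.05):
    """Compare the 1-minute and 15-minute averages of a (1, 5, 15) tuple.

    Returns "rising", "falling" or "steady". The 5-minute value is ignored
    in this simplified check.
    """
    one_min, _five_min, fifteen_min = load
    if one_min > fifteen_min + tolerance:
        return "rising"
    if one_min < fifteen_min - tolerance:
        return "falling"
    return "steady"


if __name__ == "__main__":
    print(load_trend((0.30, 0.65, 0.80)))  # falling: the load has subsided
    print(load_trend((0.80, 0.65, 0.30)))  # rising: the load is getting worse
```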
7) Can the load average be greater than 1.0 and what does that mean?
Yes, it can. The load average numbers for 1, 5 and 15 minutes can range from 0 to 8.0+. As a general rule of thumb, the load average should be below 1.00. Temporary spikes are fine as long as the 15-minute average stays below 1.00. In more detail, the general rule of thumb for the long-term (5 and 15 minute) averages is:
- > 0.70 - You need to start investigating the issues before things get worse.
- > 1.00 - If the load average stays above 1.00, find the problem and fix it now. Otherwise, you're going to get woken up in the middle of the night, and it's not going to be fun.
- > 2.00 - If your load average is above 2.00, there could be serious issues, and the switch is either hanging or slowing way down.
On a single-CPU system, a load average of 1.00 means that the CPU is busy all the time but no process is waiting for the CPU (no processing congestion). This is the ideal scenario, in which the CPU is utilized fully and optimally.
A load average value of 0.30 indicates that the CPU was busy only 30% of the time during that interval and remained idle for the other 70%.
A load average value higher than 1.00 indicates that the processor is overloaded. For example, a value of 1.8 means that the processor is busy all the time and that, in addition, processes are waiting for CPU time. With a value of 1.8, the CPU would need 80% more capacity to satisfy all requests for CPU cycles and bring the value down to 1.00. Even more capacity would be needed to bring the value below 1.00.
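The thresholds and the 1.8 example above can be worked through in a few lines of Python. This is a sketch under stated assumptions: the function names and severity labels are hypothetical, the thresholds are the rule-of-thumb values from this section, and the input is assumed to be a per-CPU load average (for a single-CPU switch, the value shown by top).

```python
def load_severity(long_term_load):
    """Apply the rule-of-thumb thresholds to a 5- or 15-minute load average."""
    if long_term_load > 2.00:
        return "serious"
    if long_term_load > 1.00:
        return "fix now"
    if long_term_load > 0.70:
        return "investigate"
    return "ok"


def extra_capacity_needed(load, cpu_count=1):
    """Fraction of additional CPU capacity needed to bring per-CPU load to 1.00."""
    return round(max(0.0, load / cpu_count - 1.0), 2)


if __name__ == "__main__":
    print(load_severity(0.80))          # investigate
    print(extra_capacity_needed(1.8))   # 0.8, i.e. 80% more capacity
```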
8) Does having a multi-core switch make a difference on load average?
The simple answer is yes: with more cores, you have more capacity to process more threads. The general rule of thumb is:
On a multicore system, your load should not exceed the number of cores available.
Therefore, on multi-core switches, the load is relative to the number of processor cores available. For example, on a dual-core CPU, the “100% utilization” mark is 2.00. To view the processor information on an Arista switch, use the cat /proc/cpuinfo command from Bash.
The following is the cpuinfo information from an Arista 7050-S64 switch:
    [switch ~]$ cat /proc/cpuinfo
    processor : 0
    vendor_id : AuthenticAMD
    cpu family : 16
    model : 6
    model name : AMD Turion(tm) II Neo N41H Dual-Core Processor
    stepping : 3
    cpu MHz : 1500.109
    cache size : 1024 KB
    physical id : 0
    siblings : 2
    core id : 0
    cpu cores : 2
    apicid : 0
    initial apicid: 0
    fpu : yes
    fpu_exception : yes
    cpuid level : 5
    wp : yes
    flags : fpu v
    bogomips : 3000.21
    TLB size : 1024 4K pages
    clflush size : 64
    cache_alignment: 64
    address sizes : 48 bits physical, 48 bits virtual
    power management: ts ttp tm stc 100mhzsteps hwpstate

    processor : 1
    vendor_id : AuthenticAMD
    cpu family : 16
    model : 6
    model name : AMD Turion(tm) II Neo N41H Dual-Core Processor
    stepping : 3
    cpu MHz : 1500.109
    cache size : 1024 KB
    physical id : 0
    siblings : 2
    core id : 1
    cpu cores : 2
    apicid : 1
    initial apicid: 1
    fpu : yes
    fpu_exception : yes
    cpuid level : 5
    wp : yes
    flags : fpu vme
    bogomips : 3000.22
    TLB size : 1024 4K pages
    clflush size : 64
    cache_alignment: 64
    address sizes : 48 bits physical, 48 bits virtual
    power management: ts ttp tm stc 100mhzsteps hwpstate
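The rule that the load should not exceed the number of cores can be checked by counting the "processor" entries in the cpuinfo output. The Python sketch below works on text in the format shown above; the function names are hypothetical and the sample string is an abbreviated copy of that output, so treat it as an illustration rather than a supported tool.

```python
def count_processors(cpuinfo_text):
    """Count 'processor : N' entries, one per logical CPU, in /proc/cpuinfo text."""
    count = 0
    for line in cpuinfo_text.splitlines():
        key = line.split(":", 1)[0].strip()
        if key == "processor":
            count += 1
    return count


def load_exceeds_cores(load_average, cores):
    """True when the load average is above the rule-of-thumb limit for this CPU."""
    return load_average > cores


# Abbreviated from the 7050 output above; only the fields needed here are kept.
SAMPLE_CPUINFO = """\
processor : 0
model name : AMD Turion(tm) II Neo N41H Dual-Core Processor
cpu cores : 2

processor : 1
model name : AMD Turion(tm) II Neo N41H Dual-Core Processor
cpu cores : 2
"""

if __name__ == "__main__":
    cores = count_processors(SAMPLE_CPUINFO)
    print(cores)                            # 2
    print(load_exceeds_cores(1.23, cores))  # False: 1.23 is below 2 cores
```

On the switch itself, the same function could be fed the contents of /proc/cpuinfo read from Bash.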
9) What is a high load on an Arista switch?
“It depends.” A lot of different things may cause load to be high on a switch, each of which affects performance differently. A switch might have a load of 1.50 and still be pretty responsive, while another switch might have a load of 0.5 and take forever to log in to.
What really matters when troubleshooting a switch with a high load average is why the load is high.
10) What number should I be looking at for the load average: 1, 5, or 15 minutes?
Always look at the 5-minute and 15-minute load averages. If your switch spikes above 1.0 on the one-minute average, it might only reflect transient activity such as normal convergence, and is not a concern in itself. However, when the 15-minute load average goes above 2.0 and remains that high, you should investigate immediately.
The following is a real example of a switch whose 1-minute load average has gone above 1.0 while the 5-minute and 15-minute averages are already close to 1.0. In this case, the root cause of the spike should be reviewed immediately:
    top - 02:33:24 up 1 day, 5:11, 5 users, load average: 1.50, 0.97, 0.99
    Tasks: 270 total, 3 running, 261 sleeping, 0 stopped, 6 zombie
    Cpu(s): 42.8%us, 6.8%sy, 0.0%ni, 50.0%id, 0.0%wa, 0.0%hi, 0.5%si, 0.0%st
    Mem: 4096408k total, 2953152k used, 1143256k free, 118028k buffers
    Swap: 0k total, 0k used, 0k free, 1085984k cached
11) What are the symptoms of a switch with high CPU?
Typically, a switch with high CPU or load may be very slow and unresponsive to management access or to control-plane protocols (Layer 2 and Layer 3 protocols). This may indicate that something is behaving abnormally. However, because all Arista switches apply hardware rate limiting to CPU-bound traffic, with the various protocols prioritized toward the management plane (so that no DoS or starvation is possible), users should always be able to reach the switch via SSH or Telnet.
12) How to determine root cause of high load average?
High load averages fall into three categories:
- CPU-bound load
- Out of memory issues
- I/O-bound load
The following will explain these categories and how to use tools like top to isolate the root cause.
13) What could be causing a CPU-bound load?
CPU-bound load is load caused when you have too many CPU-intensive processes running at once. Because each process needs CPU resources, they all must wait their turn. To check whether load is CPU-bound, check the CPU line in the top output:
Cpu(s): 53.2%us, 9.0%sy, 0.0%ni, 35.6%id, 0.0%wa, 1.3%hi, 1.0%si, 0.0%st
The fields in the above output are interpreted as follows:
- us: user CPU time. More often than not, CPU-bound load is due to a user-space process on the system, such as the spanning-tree or MLAG agent. If this percentage is high, a single user-space process of this kind is the likely cause of the load.
- sy: system CPU time. This is the percentage of the CPU tied up by the kernel and other system processes. CPU-bound load should manifest either as a high percentage of user CPU time or as a high percentage of system CPU time.
- id: CPU idle time. This is the percentage of time that the CPU spends idle. The higher the number here, the better. In fact, a really high CPU idle time is a good indication that any high load is not CPU-bound.
- wa: I/O wait. This is the percentage of time the CPU spends waiting on I/O (typically disk I/O). If you have high load and this value is high, the load is likely not CPU-bound but due to either RAM issues or high disk I/O.
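The field-by-field reading above can be sketched in Python. This example is illustrative only: the function names and the 50% and 10% thresholds are assumptions chosen for the demonstration, not Arista guidance, and the parser assumes the "53.2%us" field layout shown in this article (newer versions of top print the fields differently).

```python
import re


def parse_cpu_line(line):
    """Return {'us': 53.2, 'sy': 9.0, ...} from a top 'Cpu(s):' header line."""
    return {name: float(value) for value, name in re.findall(r"([\d.]+)%(\w+)", line)}


def likely_bottleneck(cpu, busy_threshold=50.0, iowait_threshold=10.0):
    """Rough classification of where the load comes from (thresholds are assumed)."""
    if cpu["wa"] >= iowait_threshold:
        return "I/O or memory"
    if cpu["us"] + cpu["sy"] >= busy_threshold:
        return "CPU-bound"
    return "not CPU-bound"


if __name__ == "__main__":
    line = "Cpu(s): 53.2%us, 9.0%sy, 0.0%ni, 35.6%id, 0.0%wa, 1.3%hi, 1.0%si, 0.0%st"
    cpu = parse_cpu_line(line)
    print(cpu["us"], cpu["sy"], cpu["id"], cpu["wa"])
    print(likely_bottleneck(cpu))  # CPU-bound
```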
If you do see a high percentage for user or system values, there is a good chance the load is CPU-bound. To track down the root cause, skip down a few lines to where top displays a list of current processes running on the system. By default, top will sort these based on the percentage of CPU used with the processes using the most on top.
      PID USER PR NI  VIRT  RES  SHR S %CPU %MEM     TIME+ COMMAND
     1423 root 20  0  671m 118m  33m R 45.6  5.9   2839:25 Mdio
     1461 root 20  0  156m  44m  17m R 45.1  2.3 133:41.55 PhyAeluros
     1399 root 20  0  167m  79m  40m R 11.0  4.0 564:41.03 Sysdb
     1460 root 20  0 87756  59m  39m S  5.0  3.0  35:10.43 ribd
     1439 root 20  0  148m  40m  15m S  2.7  2.0 149:42.00 FruSnmp
In the above example, the %CPU column shows how much CPU each individual process is consuming. The Mdio and PhyAeluros agents are each consuming ~45% of the CPU time. This is not normal behavior and Arista TAC should be contacted immediately.
Generally speaking, either a single process will try to consume most of the CPU e.g. 99%, or a number of smaller processes will contest for CPU time. In either case, using the top command, it’s relatively simple to see the processes that are causing the problem.
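Picking the heavy consumers out of the process list can also be scripted. The Python sketch below parses rows in the column layout shown above (PID, USER, PR, NI, VIRT, RES, SHR, S, %CPU, %MEM, TIME+, COMMAND) and returns the processes above a CPU threshold. The function names and the 40% threshold are hypothetical and chosen only to match this example.

```python
def parse_process_rows(text):
    """Parse top process rows into dicts; header and other lines are skipped."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # A process row has at least 12 columns and starts with a numeric PID.
        if len(fields) < 12 or not fields[0].isdigit():
            continue
        rows.append({
            "pid": int(fields[0]),
            "cpu": float(fields[8]),
            "mem": float(fields[9]),
            "command": " ".join(fields[11:]),
        })
    return rows


def heavy_cpu_processes(rows, threshold=40.0):
    """Return rows at or above the CPU threshold, highest %CPU first."""
    hot = [row for row in rows if row["cpu"] >= threshold]
    return sorted(hot, key=lambda row: row["cpu"], reverse=True)


SAMPLE_ROWS = """\
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1423 root 20 0 671m 118m 33m R 45.6 5.9 2839:25 Mdio
1461 root 20 0 156m 44m 17m R 45.1 2.3 133:41.55 PhyAeluros
1399 root 20 0 167m 79m 40m R 11.0 4.0 564:41.03 Sysdb
1460 root 20 0 87756 59m 39m S 5.0 3.0 35:10.43 ribd
1439 root 20 0 148m 40m 15m S 2.7 2.0 149:42.00 FruSnmp
"""

if __name__ == "__main__":
    for row in heavy_cpu_processes(parse_process_rows(SAMPLE_ROWS)):
        print(row["pid"], row["command"], row["cpu"])
```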
14) What is causing out of memory issues?
Although very rare, running out of RAM can be the second cause of high CPU or load average. When Linux starts to run out of RAM, it starts swapping to storage, i.e. flash memory. Flash is much slower than RAM, so each process can slow down dramatically.
To diagnose low memory issues, review the last two lines of the header information displayed in the top output:
    Mem: 2043420k total, 1296716k used, 746704k free, 107812k buffers
    Swap: 0k total, 0k used, 0k free, 768644k cached
These lines indicate the total amount of RAM and swap along with how much is used and free. However, this can be misleading, because the used figure includes buffers and cached memory that Linux can reclaim when processes need it. To investigate the root cause of the issue, view the processes that are using the most memory:
      PID USER PR NI  VIRT  RES  SHR S %CPU %MEM     TIME+ COMMAND
     1469 root 20  0  460m  79m  33m S 12.0  4.0  53:54.75 FocalPoint
     1483 root 20  0  167m  41m  15m S  2.3  2.1  15:04.18 PhyTn
     1491 root 20  0  169m  42m  13m S  2.3  2.1  13:57.29 Mdio
Once top is running, type M (Shift + m) and the processes will be sorted with the highest percentage of RAM utilization first. In the above example, FocalPoint is using 12% of CPU and 4% of RAM.
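The Mem and Swap header lines are easier to judge once buffers and cached memory are set aside, since Linux can reclaim both when processes need RAM. The Python sketch below computes the raw and the adjusted "used" percentages from the two header lines shown earlier in this section. The function names are hypothetical, and the parser assumes the older top layout used in this article, in which the cached figure appears on the Swap line.

```python
import re


def parse_kb_fields(line):
    """Return {'total': 2043420, 'used': ...} from a top 'Mem:' or 'Swap:' line."""
    return {name: int(value) for value, name in re.findall(r"(\d+)k (\w+)", line)}


def memory_pressure(mem_line, swap_line):
    """Return (raw used %, used % excluding buffers and cache) of total RAM."""
    mem = parse_kb_fields(mem_line)
    swap = parse_kb_fields(swap_line)
    reclaimable = mem["buffers"] + swap["cached"]
    raw = round(100.0 * mem["used"] / mem["total"], 1)
    effective = round(100.0 * (mem["used"] - reclaimable) / mem["total"], 1)
    return raw, effective


MEM_LINE = "Mem: 2043420k total, 1296716k used, 746704k free, 107812k buffers"
SWAP_LINE = "Swap: 0k total, 0k used, 0k free, 768644k cached"

if __name__ == "__main__":
    raw, effective = memory_pressure(MEM_LINE, SWAP_LINE)
    print(raw, effective)  # 63.5 20.6
```

In this sample, roughly 63% of RAM appears used, but only about 21% is held by processes once buffers and cache are excluded.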
15) Can a single process bring down the switch?
EOS is Linux-based, and the Linux scheduler can preempt any process, which ensures that no single process can hog the CPU completely. There might be some slowness if the CPU is overloaded, but preemption prevents any one process from monopolizing the CPU.