Monitoring some agent’s memory utilisation

 

This article builds on https://eos.arista.com/introduction-to-managing-eos-devices-memory-utilisation/, authored by Colin MacGiollaEain, and narrows the focus to the memory utilisation of a specific agent and how to remediate excessive usage.

 

1) Introduction

Monitoring the memory usage of specific EOS processes may be useful to identify which features consume control-plane resources, as a first step towards deciding whether the behaviour is normal or not. In abnormal circumstances the overall system may run low on memory, in which case the culprit agent may be restarted by the process scheduler, and some other agent may suffer a restart too (collateral damage).

Examples of expected but undesirable conditions:

– vEOS-lab (the virtual image for lab use) was assigned too little RAM on its virtual host; for example, recent versions require at least 2GB. If it is allocated only 1.5GB or 1GB, you would very quickly see some agents restarting because of memory starvation.

– In BGP peering, the routing agent may be loaded with an extremely high number of routes – say 10 million as an extreme, purely illustrative example (note: this is NOT supported today). A single process, in this instance the routing process, may then consume by itself an extremely large portion of the overall memory. The system may restart that agent if starvation occurs, and some other agents may suffer restarts as well. This scenario is not desirable, but trying to reach unsupported scale is expected to lead to such behaviour.

 

There are also some scenarios involving software faults such as memory leaks, where the configuration and the scale are supported but a code anomaly causes excessive memory utilisation, leading the faulty agent (and potentially some other agents) to be restarted as the system automatically tries to recover. Note that restarting the agent is far less harmful than completely running out of RAM: running out of memory would be disastrous not only for the data plane and control plane but also for the management plane. The agent restart is a last-resort option that allows the agent to start again from a clean state, the other healthy processes to keep running normally, and the operators to access and manage the device normally.

 

2) Check memory with EOS commands

To monitor the memory utilisation on an Arista switch you may use the following methods. In these examples we look at the Strata agent (the hardware driver for the Strata family: 7010, 7050X, 7060X, 7250X, 7260, 7300X, 7320X series).

 

arista(config)#show agent memory | i Strata
Agent Name             Last Memory(Kb) Max Memory(Kb) 
---------------------- --------------- --------------
Strata-T0              973008          973008
Strata-T1              972984          972988
Strata-T2              972984          972984
Strata-T3              972984          972984
Strata-TT1             957852          957852
Strata-TT0             957844          957844
StrataCounters         812396          812396
StrataL3               807036          807036
StrataL2               791168          791168
StrataVlanTopo         788020          788024
StrataLag              768280          768280
StrataLanz             764060          764060
StrataCentral          758864          758864
StrataMirror           757328          757328
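
 

To track these values over time directly on the switch, one option is the EOS command scheduler, which periodically saves the output of a command to flash. The line below is a hedged sketch: the schedule name, interval and log-file count are arbitrary examples, and captured outputs typically land under /mnt/flash/schedule/&lt;schedule-name&gt;/ – adjust to your environment and EOS version:

arista(config)#schedule agent-memory interval 60 max-log-files 100 command show agent memory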

 

You can also look at all the processes running on your device, sorted by memory utilisation, with the command “show processes top memory once”:

 
arista#show processes top memory once
top - 15:20:30 up  4:56,  1 user,  load average: 0.80, 0.80, 0.75
Tasks: 309 total,   1 running, 308 sleeping,   0 stopped,   0 zombie
%Cpu(s): 13.4 us,  1.6 sy,  0.1 ni, 84.4 id,  0.0 wa,  0.4 hi,  0.1 si,  0.0 st
KiB Mem:   8008948 total,  5164332 used,  2844616 free,   250112 buffers
KiB Swap:        0 total,        0 used,        0 free,  2519376 cached


  PID USER      PR  NI  VIRT  RES  SHR S  %CPU %MEM    TIME+  COMMAND
 2523 root      20   0  892m 344m 141m S   0.0  4.4   5:39.95 Sysdb
 2644 admin     20   0  877m 265m 113m S   0.0  3.4   2:08.70 Cli
 2874 root      20   0  873m 262m 112m S   0.0  3.4   3:01.61 Cli
 3279 root      20   0  950m 257m 114m S   0.0  3.3  15:30.79 Strata
 3246 root      20   0  950m 257m 114m S   8.9  3.3  15:29.88 Strata
 3308 root      20   0  950m 256m 113m S  13.3  3.3  15:26.08 Strata
 3292 root      20   0  950m 254m 111m S   0.0  3.2  15:30.29 Strata
 [...]

 

 

3) Monitor with telemetry

The Arista telemetry application (via CloudVision Portal), or any other tool of your choice consuming the Arista telemetry streams, provides an intuitive view of the situation, both at present and over time. You can easily zoom the graphical representation out and spot long-term trends.
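
If you would rather consume the same data with an open-source collector than with CloudVision, the sketch below uses the gnmic client over gNMI. It assumes that the gNMI API is enabled on the switch, that your EOS version exposes the OpenConfig process-monitoring paths, and placeholder address, credentials and sample interval; treat it as a starting point rather than a definitive recipe:

arista(config)#management api gnmi
arista(config-mgmt-api-gnmi)#transport grpc default

$ gnmic -a 10.0.0.1:6030 -u admin -p <password> --insecure \
    subscribe --path "/system/processes/process/state/memory-usage" \
    --mode stream --stream-mode sample --sample-interval 30s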

Below is an example of an actual slight memory increase over time in the telemetry application:

 

 

4) Monitor with SNMP

The memory status can also be found using SNMP, under the HOST-RESOURCES-MIB:

arista#show snmp mib walk 1.3.6.1.2.1.25

 

Memory utilisation can be monitored using the following OIDs, which provide the description, the total amount of memory and the amount used; these objects are common to almost all Arista switches.

HOST-RESOURCES-MIB::hrStorageDescr[1] = STRING: RAM
HOST-RESOURCES-MIB::hrStorageSize[1] = INTEGER: 4037448
HOST-RESOURCES-MIB::hrStorageUsed[1] = INTEGER: 1543660
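
From a management station you can poll these objects and derive a utilisation percentage. The sketch below assumes the net-snmp command-line tools, SNMPv2c with a placeholder community string, a switch reachable as “arista”, and that SNMP access has already been configured on the device (e.g. with “snmp-server community”); hrStorageSize and hrStorageUsed are both counted in hrStorageAllocationUnits, so their ratio can be used directly:

#!/usr/bin/bash
# Hypothetical poller: host and community are placeholders for your environment.
HOST=arista
COMMUNITY=public
size=$(snmpget -v2c -c $COMMUNITY -Oqv $HOST HOST-RESOURCES-MIB::hrStorageSize.1)
used=$(snmpget -v2c -c $COMMUNITY -Oqv $HOST HOST-RESOURCES-MIB::hrStorageUsed.1)
echo "RAM utilisation: $(( 100 * used / size ))%"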

 

 

5) Check with Bash commands

As you may know, you can run Bash shell commands from the EOS CLI with the command “bash <bash_command>”. Some commands require “sudo” (provided your user is allowed to use it).

From EOS you can therefore run the command below to find the PID(s) of an agent and summarise the related pmap output:

bash for PID in `sudo ps -A | grep 'Strata$' | awk '{print $1}'`; do echo ProcessID: $PID; sudo pmap $PID | grep total; done

 

If you expect to use this command on a regular basis, you can simplify its use by configuring an alias. For example:

!
alias showmem bash for PID in `sudo ps -A | grep 'Strata$' | awk '{print $1}'`; do echo ProcessID: $PID; sudo pmap $PID | grep total; done
!

 

Output example for the ‘Strata’ agent(s) when executing the alias (‘total’ values are in kilobytes):

arista#showmem
ProcessID: 3246
 total   972980K
ProcessID: 3279
 total   972980K
ProcessID: 3292
 total   972980K
ProcessID: 3295
 total   957840K
ProcessID: 3298
 total   957848K
ProcessID: 3308
 total   973004K
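
 

Since EOS aliases accept positional parameters (%1, %2, …), the alias can be generalised to take the agent name as an argument. This is a hedged variant of the alias above; verify that the %1 substitution and the quoting behave as expected on your EOS version:

!
alias agentmem bash for PID in `sudo ps -A | grep '%1$' | awk '{print $1}'`; do echo ProcessID: $PID; sudo pmap $PID | grep total; done
!

You would then run, for example, “agentmem Strata” or “agentmem Sysdb”.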

 

 

If you are interested in the memory consumption of your Layer 3 agent, the script below checks it across all Arista platforms. The possible L3 agent names across the Arista platforms are: SandL3Unicast (E-series and R-series), StrataL3 (X-series), XpL3Unicast (7160 series), BfnL3 (7170 series), FocalPointV2 (7150 series).

 

#!/usr/bin/bash

# Find the PID of whichever L3 agent runs on this platform:
# the first line of the agent file under /var/run/agents is the process ID.
l3AgentPid=""
for file in /var/run/agents/ar.SandL3Unicast /var/run/agents/ar.StrataL3 /var/run/agents/ar.XpL3Unicast /var/run/agents/ar.BfnL3 /var/run/agents/ar.FocalPointV2
do
   if [[ -f $file ]]; then
      l3AgentPid=$(head -n1 $file)
   fi
done


if [[ -z "$l3AgentPid" ]]; then
   echo "Unable to find L3 agent process id."
   exit 1
fi

# Report the virtual memory size (VmSize) of the L3 agent process.
sudo grep VmSize /proc/$l3AgentPid/status

 

 

Example output when the above shell script is saved as memcheck.sh:

[admin@arista]$ bash /mnt/flash/memcheck.sh 
VmSize:   762720 kB

 

If the amount of memory consumed by that Layer 3 agent exceeds 4GB, it will be restarted by the system. If it goes beyond 3GB, you should consider restarting the agent in a planned manner, as detailed later.
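
To automate that check, the sketch below extends the script above with a 3GB warning threshold (the threshold value and exit codes are arbitrary choices); it could be run periodically from the EOS scheduler or an external tool:

#!/usr/bin/bash

# Warn when the L3 agent VmSize exceeds a threshold (3GB here, an arbitrary example).
thresholdKb=$((3 * 1024 * 1024))

# Find the PID of whichever L3 agent runs on this platform (same logic as above).
l3AgentPid=""
for file in /var/run/agents/ar.SandL3Unicast /var/run/agents/ar.StrataL3 /var/run/agents/ar.XpL3Unicast /var/run/agents/ar.BfnL3 /var/run/agents/ar.FocalPointV2
do
   if [[ -f $file ]]; then
      l3AgentPid=$(head -n1 $file)
   fi
done

if [[ -z "$l3AgentPid" ]]; then
   echo "Unable to find L3 agent process id."
   exit 1
fi

# Extract the VmSize value (in kB) from the process status file and compare.
vmSizeKb=$(sudo awk '/^VmSize/ {print $2}' /proc/$l3AgentPid/status)

if [[ "$vmSizeKb" -gt "$thresholdKb" ]]; then
   echo "WARNING: L3 agent (PID $l3AgentPid) VmSize ${vmSizeKb} kB exceeds threshold"
   exit 2
fi

echo "L3 agent (PID $l3AgentPid) VmSize ${vmSizeKb} kB is within threshold"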

 

The overall system memory usage is summarised by the command below. Although it does not give details for each process, it is a good instantaneous indicator of the overall system health. For example, the output below looks healthy:

arista#bash free -mh

             total       used       free     shared    buffers     cached
Mem:          7.5G       4.3G       3.2G         0B       265M       2.8G
-/+ buffers/cache:       1.2G       6.2G

 

Note:

The buffers and cached memory are not really “used”; you should consider the line “-/+ buffers/cache” rather than the “Mem” line.
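
To compute the “really used” percentage directly, you can read /proc/meminfo from the Bash shell. A minimal sketch (the field names are those of standard Linux kernels):

[admin@arista]$ awk '/^MemTotal/ {t=$2} /^MemFree/ {f=$2} /^Buffers/ {b=$2} /^Cached/ {c=$2} END {printf "Really used: %.1f%%\n", 100*(t-f-b-c)/t}' /proc/meminfo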

 

6) Remediation (last resort only)

If you find yourself lacking resources in an abnormal situation and want to control the agent restart manually (e.g. during a change window) rather than letting the system self-recover, you may restart an agent in the following way.

 

arista#show processes top memory once
top - 17:30:25 up  1:08,  1 user,  load average: 0.42, 0.28, 0.26
Tasks: 278 total,   1 running, 277 sleeping,   0 stopped,   0 zombie
%Cpu(s):  6.8 us,  1.5 sy,  0.4 ni, 90.7 id,  0.2 wa,  0.2 hi,  0.1 si,  0.0 st
KiB Mem:   3818208 total,  3735908 used,    82300 free,   239160 buffers
KiB Swap:        0 total,        0 used,        0 free,  2336068 cached

  PID USER      PR  NI  VIRT  RES  SHR S  %CPU %MEM    TIME+  COMMAND
 [...]
 2610 root      20   0  713m 263m 176m S   0.0  7.1   4:59.54 Strata
    ^
    The current PID is 2610

 

Once you have gathered the PID (Process ID) you will be able to restart the agent. You effectively just need to kill that process; the process scheduler will automatically restart it.
 
arista#bash sudo kill 2610

arista#show processes top memory once
  PID USER      PR  NI  VIRT  RES  SHR S  %CPU %MEM    TIME+  COMMAND
 [...]
 3184 root      20   0  705m 260m 175m S  80.4  7.0   0:06.73 Strata
^
New PID for the new process that was automatically started.

 

Note that in the above output the CPU might be temporarily high while the agent restarts.

 

 

Credits

Colin MacGiollaEain for the other article on memory usage

Edmund Roche-Kelly for the telemetry capture

 
