Troubleshooting filesystem full issues

 
 

Objective

This document describes scenarios that cause filesystems to fill up and suggests ways to free space in the affected directories.

Introduction

After you log in to the switch, EOS may display a warning message such as the following:

Warning: the following filesystems have less than 10% free space left:

tmpfs (on /var/core) 0% (0 Available)

tmpfs (on /var/log) 0% (0 Available)

Please remove configuration such as tracing and clean up the space.

The above message indicates that the /var/log and /var/core directories have reached their maximum utilization.

If the space is not freed, the switch can drop to a standalone shell, terminate existing SSH sessions, and refuse new ones.

Although this may not affect data plane forwarding on the switch, it is important to determine the root cause and free up space as soon as possible.

The next section describes in detail scenarios that cause the above warning message to get displayed.

Scenarios

Continuous agent restarts on the system

If an agent is crashing repeatedly, it keeps generating core files, which are saved under the /var/core directory. The agent's log files under /var/log/agents also grow because each crash signature is recorded there, so both /var/log and /var/core can eventually become full.

How to confirm?

1. Check the output of df -h. One or both of the directories will be 100% utilized.

------------- bash df -h --------------
Filesystem      Size  Used Avail Use% Mounted on
none            569M   67M  503M  12% /
none            569M   67M  503M  12% /.overlay
tmpfs           1.9G     0  1.9G   0% /var/run/netns
tmpfs           380M  380M     0 100% /var/core
tmpfs           380M  380M     0 100% /var/log
tmpfs           948M   78M  871M   9% /var/shmem
/dev/sda1       3.7G  2.7G  1.1G  72% /mnt/flash

2. Check 'show agent logs crash' and confirm whether the reported crash time is recent.

------------- show agent logs crash -------------
===> /var/log/agents/Mlag-13981 Wed Mar 18 00:16:40 2020 <===
===== Output from /usr/bin/Mlag [] (PID=13981) started Mar 18 00:16:38.284520 ===

3. Check whether the /var/core directory contains core files named after an agent.

------------- bash ls -ltr /var/core -------------
total 388256
-rw-rw-rw- 1 root root 11108953 Mar 18 00:16 core.2344.1584483862.Mlag.gz
-rw-rw-rw- 1 root root 10856262 Mar 18 00:16 core.23101.1584483867.Mlag.gz
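
The df check in step 1 can also be scripted. The following is a minimal sketch, not an EOS feature: `full_filesystems` is a hypothetical helper name, the 90% default mirrors the "less than 10% free" wording of the warning, and it assumes `awk` is available in the switch's bash shell and that mount points contain no spaces.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `df -P` style output on stdin and print the
# mount points whose usage is at or above a threshold (default 90%).
full_filesystems() {
    local threshold="${1:-90}"
    awk -v t="$threshold" 'NR > 1 {
        use = $5
        sub(/%/, "", use)            # strip the percent sign from the usage column
        if (use + 0 >= t) print $6, $5
    }'
}

# df -P (POSIX format) keeps every filesystem on a single line.
df -P | full_filesystems 90
```

On the switch in the example above, this would be expected to list /var/core and /var/log.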

What to collect for Arista TAC to determine the cause of the crash?

#bash sudo tar -czvf - /var/core/ > /tmp/core-$HOSTNAME-$(date +%m_%d.%H%M).tar.gz [Files are saved in /tmp because /mnt/flash may be full]
#show tech | gzip > /tmp/$HOSTNAME-showTech-$(date +%m_%d.%H%M).gz
#show agent logs | gzip > /tmp/$HOSTNAME-showAgentLogs-$(date +%m_%d.%H%M).gz
#show agent qtrace | gzip > /tmp/$HOSTNAME-showAgentQtrace-$(date +%m_%d.%H%M).gz
#show logging system | gzip > /tmp/$HOSTNAME-showLogging-$(date +%m_%d.%H%M).gz
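
Because the core files are deleted in the next step, it may be worth confirming that the archive is readable first. The sketch below is one way to do that; `archive_ok` is a hypothetical helper name, and it assumes GNU tar, which detects compression automatically when listing an archive file.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed only if the archive is non-empty and
# tar can list its contents without errors.
archive_ok() {
    local archive="$1"
    [ -s "$archive" ] && tar -tf "$archive" > /dev/null 2>&1
}

# Check every core archive collected under /tmp.
for f in /tmp/core-*.tar.gz; do
    [ -e "$f" ] || continue          # no matching archives: nothing to check
    if archive_ok "$f"; then
        echo "OK   $f"
    else
        echo "BAD  $f"
    fi
done
```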

Temporary mitigation
Once the above logs have been collected, delete the core files from the /var/core directory using the following command:

#bash
$cd /var/core
$sudo rm -rf *

Depending on how often the agent crashes, this deletion may need to be repeated to keep the utilization of /var/core below 100%. Once the agent has stabilized, also delete the older agent log files from the /var/log/agents directory to free up space.

Change directory to /var/log/agents:

[admin@switch ~]$ cd /var/log/agents/

The /var/log/agents directory contains log files for all active agents as well as stale files left behind by the restarting agent. The example below is for an Mlag agent restart:


[admin@switch agents]$ ls -ltr | grep -i Mlag
-rw-rw-rw- 1 root root 732K Apr 20 09:05 Mlag-13981
-rw-rw-rw- 1 root root 732K Apr 20 13:30 Mlag-13983
-rw-rw-rw- 1 root root 727K Apr 20 15:40 Mlag-13985
-rw-rw-rw- 1 root root 725K Apr 21 15:42 Mlag-13987
-rw-rw-rw- 1 root root 724K Apr 21 09:01 Mlag-13990   
-rw-rw-rw- 1 root root 724K Apr 21 15:30 Mlag-14092

The command below deletes the stale files for a given agent and keeps only the two most recent:

[admin@switch agents]$ ls -1at <AgentName>-* | tail -n+3 | xargs -I {} sudo rm {}
For example, the command below removes the stale files for the Mlag agent and keeps the two most recent:
[admin@switch agents]$ ls -1at Mlag-* | tail -n+3 | xargs -I {} sudo rm {}

Note: Ensure that the agent file is correctly specified to avoid deleting unrelated active agent files.
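
To reduce the risk described in the note, the same pipeline can first be run without the delete step to preview what would be removed. This is a minimal sketch: `stale_agent_logs` is a hypothetical helper name, and it assumes agent log file names contain no whitespace.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the log files that would be deleted for one
# agent, keeping the newest $keep files by modification time.
stale_agent_logs() {
    local dir="$1" agent="$2" keep="${3:-2}"
    # ls -1t sorts newest first; tail -n +N starts printing at line N,
    # so the first $keep (newest) files are skipped.
    ls -1t "$dir"/"$agent"-* 2>/dev/null | tail -n +"$((keep + 1))"
}

# Dry run: list the stale Mlag files without deleting anything.
stale_agent_logs /var/log/agents Mlag 2
```

If the list looks right, the output can be piped to `xargs -I {} sudo rm {}` as in the command above.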

A single agent file has grown too large

How to confirm?
1. Check the output of df -h

#bash df -h
Filesystem      Size  Used Avail Use% Mounted on
none            1.2G   52M  1.1G   5% /
none            1.2G   52M  1.1G   5% /.overlay
tmpfs           757M    0   757M   0% /var/core
tmpfs           757M  757M     0 100% /var/log
tmpfs           1.9G   17M  1.9G   1% /var/shmem
/dev/mmcblk0p1  3.7G  2.0G  1.7G  55% /mnt/flash

2. Check whether a particular agent file is consuming most of the space under the /var/log directory.

The following command displays the size of each item under /var/log, sorted from largest to smallest:

#bash sudo du -sch /var/log/* |  sort -k1 -hr 
757M total 
653M    /var/log/agents 
90M     /var/log/qt      
9.3M    /var/log/eos
4.2M    /var/log/messages

The following command displays the agent log file sizes under the /var/log/agents directory, largest first:

#bash sudo ls -alSRh /var/log/agents
/var/log/agents:
total 653M
-rw-rw-rw- 1 root root 650M Oct 31 11:59 Pimsm-15746
-rw-rw-rw- 1 root root 393K Oct 22 00:10 SandAcl-2613
-rw-rw-rw- 1 root root 316K Nov 11 22:39 Ebra-2228
-rw-rw-rw- 1 root root 280K Nov 12 15:45 Sysdb-1598


3. Check if there is tracing enabled for that agent

#show running-config section trace

Enabling tracing causes the agent to write verbose output to its log file, which can grow considerably over time. It is recommended to enable tracing only after consulting an Arista support engineer. If the trace is no longer required, remove the trace configuration line to prevent the filesystem from filling up again.
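
The size checks in step 2 can be combined into a single helper that reports the largest files under a directory. This is a sketch under stated assumptions: `largest_files` is a hypothetical name, sizes are reported by `du -k` in KiB of disk usage, and errors from unreadable paths are discarded (on the switch, prefixing the call with sudo may be needed to see every file).

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the $count largest regular files under $dir,
# one per line, as "<size in KiB> <path>", largest first.
largest_files() {
    local dir="$1" count="${2:-5}"
    find "$dir" -type f -exec du -k {} + 2>/dev/null | sort -rn | head -n "$count"
}

# Show the five largest agent log files.
largest_files /var/log/agents 5
```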

How to recover?
1. Copy the overgrown agent log file to your local machine, then truncate it to size 0 using the command below. Truncating in place, rather than deleting the file or restarting the agent, lets the agent process keep running with the same PID and avoids any potential impact.

#bash
$cd /var/log/agents
$cat /dev/null > Pimsm-15746

2. Confirm that the file size has been reduced:

#bash sudo ls -ltr /var/log/agents/Pimsm-15746  
-rw-rw-rw- 1 root root 0 Apr 10 19:44 /var/log/agents/Pimsm-15746
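
The reason truncation is safe for a running writer can be seen in a small experiment on a scratch file. The sketch below uses a temporary file as a stand-in for an agent log and this shell as a stand-in for the agent. It assumes the writer has the file open in append mode; whether a given EOS agent does so is an assumption, not something this document confirms.

```shell
#!/usr/bin/env bash
log=$(mktemp)                          # stand-in for an agent log file
exec 9>>"$log"                         # hold the file open for appending, like a running agent
echo "verbose trace output" >&9
inode_before=$(ls -i "$log" | awk '{print $1}')

: > "$log"                             # truncate in place; same effect as cat /dev/null > file

inode_after=$(ls -i "$log" | awk '{print $1}')
size_after=$(wc -c < "$log")
echo "after truncate" >&9              # the already-open descriptor keeps working
size_final=$(wc -c < "$log")

if [ "$inode_before" = "$inode_after" ]; then
    echo "same inode before and after truncation"
fi
echo "size after truncate: $size_after bytes; after next write: $size_final bytes"

exec 9>&-                              # close the descriptor and clean up
rm -f "$log"
```

The file keeps its inode, so the writer's descriptor stays valid, and new output starts again from an empty file.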

/mnt/flash becomes full due to a large output in tech-support

We may encounter the below system log or error message when saving the running-config. 

Switch# write memory
% Error copying system:/running-config to flash:/startup-config (No space left on device)

#show logging system 
SuperServer: %SYS-4-CLI_SCHEDULER_FILESYSTEM_FULL: Execution of scheduled CLI execution job 'tech-support' was aborted due to target filesystem being full.

This indicates that the flash filesystem on the switch is full.

The running-config is saved under /mnt/flash (as startup-config), while scheduled tech-support files are saved under the /mnt/flash/schedule/tech-support directory.

How to confirm?
1. Check the output of df -h

#bash df -h
tmpfs           3.2G     0  3.2G   0% /var/core
tmpfs           3.2G  132M  3.1G   5% /var/log
tmpfs           7.9G  290M  7.6G   4% /var/shmem
tmpfs            24G     0   24G   0% /var/shmem/stashes
/dev/mmcblk0p1  3.3G  3.3G     0 100% /mnt/flash
/dev/sda1       118G  5.1G  107G   5% /mnt/drive

Check whether agent restarts are occurring, as described in the previous scenarios. Agent crash signatures are saved in tech-support files and can therefore increase the file size.

2. Compare the size of the latest tech-support file with that of the previous ones to check whether it has increased recently. Typically, the file size varies with the configuration applied and the features enabled on the switch.

The command below displays the sizes of the historic tech-support files saved under this directory:

#bash ls -ltrh /mnt/flash/schedule/tech-support
total 1.1G
-rwxrwx--- 1 root eosadmin 22M Dec 17 22:41 switch001_tech-support_2019-12-17.2236.log.gz
-rwxrwx--- 1 root eosadmin 22M Dec 17 23:41 switch001_tech-support_2019-12-17.2336.log.gz
-rwxrwx--- 1 root eosadmin 22M Dec 18 00:41 switch001_tech-support_2019-12-18.0036.log.gz
-rwxrwx--- 1 root eosadmin 22M Dec 18 01:41 switch001_tech-support_2019-12-18.0136.log.gz

If there are no agent restarts, determine whether the switch is running at high scale. In general, when the switch holds a large number of routes, such as an Internet-scale routing table, the show commands for these routes can consume considerable space in the tech-support files.

This increases the size of every historic tech-support file and eventually fills up the /mnt/flash directory.

How to recover?
1. For switches running at high scale, the switch can be configured to exclude these commands from future historic tech-support files:

management tech-support
   policy show tech-support
      exclude command <command>

For example:

management tech-support
   policy show tech-support
      exclude command show ip route vrf all detail
      exclude command show kernel ip route vrf all
      exclude command show platform fap ip route
      exclude command show ip bgp vrf all

2. For both the agent-restart and high-scale scenarios, a workaround is to copy the files to a local machine for reference and then delete the existing tech-support files from this directory:

#bash 
$cd /mnt/flash/schedule/tech-support
$sudo rm -rf *

What to collect for TAC in case the above steps do not help?

#bash ls -ltrh /mnt/flash/schedule/tech-support
#bash df -h
#show tech | gzip > /tmp/$HOSTNAME-showTech-$(date +%m_%d.%H%M).gz
#Files stored in /mnt/flash/schedule/tech-support/

Deleted file contributing to filesystem getting full

At times the system may not report any agent crash or any single oversized file, yet the directory still shows as 100% utilized.

In such cases, a file may have been deleted while a running process still has it open. The file no longer appears in directory listings, but its space is not released until the process closes it or exits. This is very common when a user or process deletes a file that is actively being written to.

Below are the additional checks we can perform to determine if such a file exists: 

How to confirm?
Check whether a deleted file is still held open on the system:

#bash
$sudo lsof /var/log | grep -i deleted

Eg:  
#bash sudo lsof /var/log | grep -i deleted
TerminAtt 6945   root    1w   REG   0,25     1905  35072  /var/log/agents/TerminAttr-6945 (deleted)

The above output indicates that the file TerminAttr-6945 was deleted from /var/log/agents, but the process that has it open (PID 6945) is still running.
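
The behavior can be reproduced safely on a scratch file, which may help when explaining why df still reports the space as used. This is an illustrative sketch only: it uses a temporary file rather than a real agent log, this shell plays the role of the agent, and it relies on the Linux /proc filesystem.

```shell
#!/usr/bin/env bash
f=$(mktemp)                            # stand-in for a log file on the full filesystem
exec 8>>"$f"                           # this shell holds the file open, like an agent would
echo "still writing" >&8
rm -f "$f"                             # the name is removed, but the data stays allocated

# On Linux, the open descriptor now points at a path marked "(deleted)",
# which is the same marker that lsof reports.
target=$(readlink "/proc/$$/fd/8")
echo "$target"

exec 8>&-                              # closing the last open descriptor releases the space
```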

How to recover?
You may kill the running process using the below command.

$sudo kill -9 <PID>
Eg: $sudo kill -9 6945

The file should no longer appear in the output of the command below:

#bash sudo lsof /var/log | grep -i deleted

For any further issues or scenarios not covered above, please contact support@arista.com for further assistance.
