Posted on June 2, 2020 5:05 pm
 |  Asked by eui jin Jeong
RESOLVED

The following log appeared on our switch.

What does this log mean, and does it affect operation?

I am wondering whether this is a transient log.

I'll wait for your answers.

Thank you.


MAY 23 06:13:06 PB-UP2 SandFap: %SAND-3-INTERRUPT_OCCURRED: Parity/ECC Interrrupt ARAD_INT_IHP_PARITYERRINT[428] occurred in block 0 of Arad6.0.

Posted by Roberto Salazar
Answered on June 2, 2020 6:23 pm

That message means that a parity error was detected and corrected:

INTERRUPT_OCCURRED: Interrupt %s on %s. %s

Severity: Error

Explanation: Parity/Ecc errors have been detected in hardware. Most of these errors will be corrected by software.

Recommended Action: If the problem persists, contact your support representative. Otherwise, no action is required.

 

If the parity error is happening frequently even though each occurrence is corrected, the bit error in the Arad is still present. The usual recommendation is to power cycle the line card or, on a fixed system, the switch. If a power cycle does not make the parity errors go away, the chip may be marginal and will need replacement.
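
To judge whether the error is "happening frequently", it can help to count occurrences per chip and interrupt from a saved syslog. The following is only a minimal sketch: the regular expression is modelled on the single log line quoted in this thread, and the function names and the threshold are assumptions, not an Arista-defined limit or tool.

```python
import re
from collections import Counter

# Pattern modelled only on the one log line quoted in this thread;
# other EOS releases or interrupt types may format the message differently.
INTERRUPT_RE = re.compile(
    r"%SAND-3-INTERRUPT_OCCURRED: .*?"
    r"(?P<name>[A-Z0-9_]+)\[(?P<index>\d+)\] occurred in "
    r"block (?P<block>\d+) of (?P<chip>Arad\d+\.\d+)"
)


def count_interrupts(lines):
    """Count interrupt log lines per (chip, interrupt name, block)."""
    counts = Counter()
    for line in lines:
        match = INTERRUPT_RE.search(line)
        if match:
            key = (match["chip"], match["name"], int(match["block"]))
            counts[key] += 1
    return counts


def recurring(counts, threshold=3):
    """Return entries seen at least `threshold` times (threshold is arbitrary)."""
    return {key: n for key, n in counts.items() if n >= threshold}


if __name__ == "__main__":
    sample = (
        "MAY 23 06:13:06 PB-UP2 SandFap: %SAND-3-INTERRUPT_OCCURRED: "
        "Parity/ECC Interrrupt ARAD_INT_IHP_PARITYERRINT[428] "
        "occurred in block 0 of Arad6.0."
    )
    for key, n in count_interrupts([sample]).items():
        print(key, n)
```

In practice the input would be the lines of a saved log file, and a sensible threshold and time window are best agreed with support.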

 

Posted by Shreyas Ruwala
Answered on June 2, 2020 10:58 pm

Hi

Here is a brief explanation about this error:

ARAD_INT_IHP_PARITYERRINT: When a bit flips in the LEM (memory), the SandFap agent handles it by verifying that the address is within the LEM table range and then clearing the corresponding key, payload and age. If the entry was shuffled by the bit flip, SandFap might end up clearing a valid key, which is subsequently reprogrammed or relearned.
In short, the SandFap agent clears the corrupt entry in hardware and logs the event in Syslog for information.
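
The handling described above can be pictured with a toy model. This is purely illustrative: `ToyLem`, `Entry`, and every method name are hypothetical, and nothing here reflects the real SandFap code or the actual LEM layout. It only shows the sequence of range check, clearing key/payload/age, and later relearning.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Entry:
    key: str
    payload: str
    age: int


class ToyLem:
    """Toy stand-in for a hardware table (hypothetical, not the real LEM)."""

    def __init__(self, size: int):
        self.size = size
        self.slots: Dict[int, Entry] = {}

    def learn(self, address: int, key: str, payload: str) -> None:
        # Program (or re-program) an entry, as normal learning would.
        self.slots[address] = Entry(key, payload, age=0)

    def handle_parity_error(self, address: int) -> Optional[Entry]:
        # Step 1: ignore addresses outside the table range.
        if not 0 <= address < self.size:
            return None
        # Step 2: clear key, payload and age together by dropping the entry.
        return self.slots.pop(address, None)


if __name__ == "__main__":
    table = ToyLem(size=1024)
    table.learn(17, key="00:1c:73:aa:bb:cc", payload="Ethernet1")
    cleared = table.handle_parity_error(17)   # the corrupt entry is removed
    print("cleared:", cleared)
    table.learn(17, key="00:1c:73:aa:bb:cc", payload="Ethernet1")  # relearned
    print("relearned:", table.slots[17])
```

The point of the sketch is that clearing an entry is safe because normal learning restores it afterwards.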

Please note that these ARAD_INT_IHP_PARITYERRINT logs are only informational. This log indicates that there was a parity error or SEU (single event upset) in the IHP memory block on chip Arad6, which appears to have been auto-corrected by the software. These logs should not be service-impacting.

As long as the switch continues to correct the memory errors and you experience no other symptoms, these corrected errors should not cause any problems.

As of now, we recommend you monitor the device for a few days and let us know if such errors occur again.

If we see multiple errors recurring on the same chip or memory location, the next recommendation would be to restart that chip. This should be done in a maintenance window, because a reset of the Arad6/0 chip will cause the associated ports to flap and a momentary traffic loss.

An example of the command to reset the chip follows. The Arad chip name is taken from the log message:

#reset platform fap arad6/0 full

Long story short, this is non-impacting, and my current recommendation is to monitor.

If it resurfaces, follow the next steps already shared above.

Sincerely,

Shreyas Ruwala

TAC-ECC

 

Posted by Akshita
Answered on June 3, 2020 7:00 am

Hi Jeong,

Thanks for writing to the portal.

To give an overview, this log is generated when the device has experienced a Single Event Upset (SEU). An SEU occurs when a bit flips in memory. This event is not unusual; it happens infrequently and unpredictably across all memory components on all electronic devices. Arista devices are capable of correcting most of these errors automatically.

The log message indicates that a Single Event Upset has occurred in one of the memories of the ASIC. As mentioned, some of these SEUs are automatically corrected by the software, whereas some are not.
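
As a generic illustration of how a single bit flip is detected, here is a small even-parity sketch. It is an assumption-laden teaching example, not the protection scheme used inside the Arad ASIC (which may use ECC or other mechanisms), and the function names are made up for this post.

```python
def parity_bit(word: int) -> int:
    """Even-parity bit: 1 if the word has an odd number of set bits."""
    return bin(word).count("1") % 2


def check(word: int, stored_parity: int) -> bool:
    """True if the word still matches the parity stored alongside it."""
    return parity_bit(word) == stored_parity


def flip_bit(word: int, position: int) -> int:
    """Simulate an upset by inverting one bit."""
    return word ^ (1 << position)


if __name__ == "__main__":
    original = 0b10110010
    stored = parity_bit(original)

    upset = flip_bit(original, 3)
    print("single flip detected:", not check(upset, stored))

    double = flip_bit(upset, 0)
    print("double flip detected:", not check(double, stored))
```

A single parity bit catches any odd number of flipped bits but cannot say which bit changed, which is why corrective action is typically to rewrite or clear the affected entry; an even number of flips in the same word passes the check unnoticed.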

Several factors determine what further action is needed when a device logs this message, for example:

      1.) The exact EOS version and platform of the device.

      2.) Whether the log has been generated multiple times for the same memory location.

      3.) Whether any other suspicious log messages or activity have been logged on the device.

To investigate the log further, please write to us at support@arista.com so that we can analyze the device's logs completely and give you a detailed explanation of why it happened.

Thanks and Regards

Akshita
