Posted on July 21, 2021 4:39 pm
 |  Asked by Ismail Kalolwala
 |  94 views

Hi All,

 

Recently, I initiated a software upgrade through CloudVision Portal. Below is the staging procedure for a cabinet in which we have two switches configured as MLAG peers.

 

Staging Procedure :

 

  1. Pre Configuration Snapshot
  2. Checking MLAG Health
  3. Entering BGP Maintenance
  4. Upgrade the EOS
  5. Exiting BGP Maintenance
  6. Post Configuration Snapshot

The first stage, Pre Configuration Snapshot, is successful.

For the second stage to succeed, I had to change the reload delay under MLAG from 60 to 300. With a value of 60, the stage fails.

Stage 3 is also successful: the switch enters maintenance mode.

Stage 4 is also successful: the switch is loaded with the required version of EOS, reboots, and comes back up.

The real issue is with Stage 5 (exiting BGP maintenance), where we get the logs below:

“Checking BGP convergence of all VRFs every 30s, attempt 1, error: BGP doesn’t converge on VRFs: map[default:BGP convergence state is: Timeout reached. Pending Peers is not 0: 5]”

After this log,

Exit BGP Maint
Timeout in 11m0s…

Exit BGP Maint
Checking BGP convergence of all VRFs every 30s, attempt 2, error: BGP doesn’t converge on VRFs: map[default:BGP convergence state is: Timeout reached. Pending Peers is not 0: 5]

Exit BGP Maint
Timeout in 10m0s…

Exit BGP Maint
Checking BGP convergence of all VRFs every 30s, attempt 3, error: BGP doesn’t converge on VRFs: map[default:BGP convergence state is: Timeout reached. Pending Peers is not 0: 5]

Exit BGP Maint
Timeout in 8m0s…

Exit BGP Maint
Timeout in 0s…

Exit BGP Maint
Action exitbgpmaintmode on device DC1-B7LEAF-SW01 completed with error: Pre-healthcheck failed: timeout: BGP doesn’t converge on VRFs: map[default:BGP convergence state is: Timeout reached. Pending Peers is not 0: 5]

Exit BGP Maint
Action exitbgpmaintmode completed with error Error executing stage ZHUBUjsAj : Pre-healthcheck failed: timeout: BGP doesn’t converge on VRFs: map[default:BGP convergence state is: Timeout reached. Pending Peers is not 0: 5]

 

Our Switch Model is as follows :

 

Arista DCS-7160-32CQ-F

Current Software version is 4.25.4M

 

Any help/info sharing would be appreciated.

 

 

 

Posted by Keerthi Bharathi
Answered on July 30, 2021 3:51 am

Hello Ismail, 

When exiting maintenance mode, several health checks are performed to prevent an early exit, which could lead to traffic loss.

One of the checks the Exit MM action performs is to wait for BGP to converge in all VRFs. This is checked using the show bgp convergence vrf all command.

BGP convergence can be in one of the following states:

  1. Converged
  2. Timed out

By default, the configured convergence timeout is 5 minutes (300 seconds) and the slow-peer timeout is 1.5 minutes (90 seconds). BGP convergence times out 300 seconds after the first BGP peering comes up. At that point the number of pending peers must be 0 for the Exit MM action to continue to the next health-check steps.

For more details regarding this, please check the following TOI: 

https://eos.arista.com/toi/cvp-2019-1-0/bgp-maintenance-mode-and-mlag-issu-change-control-actions/#Combined_BGP_and_MLAG_CC
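To make the wait described above more concrete, here is a minimal Python sketch of a "poll until converged or give up" loop. It is not CVP's actual implementation. The 30-second interval comes from the log in the question ("every 30s"), the 660-second overall timeout is only a placeholder based on the "Timeout in 11m0s" line, and the function name and the pending_peers callback are hypothetical.

```python
import time
from typing import Callable


def wait_for_convergence(
    pending_peers: Callable[[], int],
    overall_timeout_s: float = 660.0,  # placeholder, based on "Timeout in 11m0s"
    poll_interval_s: float = 30.0,     # "every 30s" in the log
    now: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until no BGP peers are pending, or the overall timeout expires.

    pending_peers is a hypothetical callback returning the current number
    of pending peers (for example, taken from the convergence status).
    Returns True if the check passed, False on timeout.
    """
    deadline = now() + overall_timeout_s
    while True:
        if pending_peers() == 0:
            return True               # health check passes
        remaining = deadline - now()
        if remaining <= 0:
            return False              # corresponds to "Timeout in 0s"
        # Never sleep past the deadline.
        sleep(min(poll_interval_s, remaining))
```

The clock and sleep functions are passed in as parameters so the loop can be exercised with a fake clock instead of really waiting 11 minutes.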

 

Now, coming back to the issue you faced: I noticed that before entering maintenance mode, you were prompted to change the reload-delay to 300 seconds (5 minutes).

Could you please confirm whether you also changed the reload-delay for non-MLAG interfaces to 300 seconds? Also, do you have a BGP neighborship between the MLAG peers?

What I believe happened is:

  1. The BGP neighborship between the MLAG peers comes up first, and the convergence timeout starts.
  2. The reload-delay for non-MLAG interfaces is set to 300 seconds, so these interfaces stay in the err-disabled state for 300 seconds.
  3. At the end of 5 minutes (300 seconds), the Exit MM action sees that BGP convergence has timed out and the pending peers count is not 0, because peering could not be established while the interfaces were err-disabled.
  4. It then throws the error you noticed.
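The sequence above can be sketched as a rough timeline in Python. This is a simplified model, not EOS behaviour: it treats both timers as starting at roughly the same moment (as the steps above implicitly do), and the peer names, the assumption of five uplink peers, and the 10-second session setup time are all made up for illustration.

```python
from typing import Dict, List


def pending_peers_at_timeout(
    peer_uses_mlag_link: Dict[str, bool],
    reload_delay_non_mlag_s: int,
    convergence_timeout_s: int,
    session_setup_s: int = 10,  # assumed time to establish BGP once the port is up
) -> List[str]:
    """Return peers that are still not established when convergence times out."""
    pending = []
    for name, via_mlag_link in peer_uses_mlag_link.items():
        # In this model the MLAG peer session is reachable immediately;
        # non-MLAG ports stay err-disabled for the reload-delay.
        port_up_at = 0 if via_mlag_link else reload_delay_non_mlag_s
        established_at = port_up_at + session_setup_s
        if established_at > convergence_timeout_s:
            pending.append(name)
    return pending


# Hypothetical peers: one session to the MLAG peer and five uplink sessions.
peers = {"mlag-peer": True}
peers.update({f"uplink-{i}": False for i in range(1, 6)})

stuck = pending_peers_at_timeout(peers, reload_delay_non_mlag_s=300,
                                 convergence_timeout_s=300)
print(len(stuck), "pending:", stuck)
```

With both values at 300 seconds, every uplink session in this model is still pending when the timer expires, which lines up with the "Pending Peers is not 0: 5" message if the switch has five such peers.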

To correct this, as mentioned in the TOI (Known Limitations and Resolutions section), please change the BGP convergence timeout to a value greater than the configured reload-delay value.
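As a quick sanity check before starting a change control, the rule above can be written as a small Python helper. This is only a sketch: the function name is hypothetical, the values have to be read from your own configuration, and comparing against both the MLAG and non-MLAG reload-delay is a conservative assumption on my part rather than something the TOI spells out.

```python
from typing import List


def check_convergence_timeout(
    convergence_timeout_s: int,
    reload_delay_mlag_s: int,
    reload_delay_non_mlag_s: int,
) -> List[str]:
    """Return a list of problems; an empty list means the timers look consistent."""
    problems = []
    delays = (("mlag", reload_delay_mlag_s), ("non-mlag", reload_delay_non_mlag_s))
    for label, delay in delays:
        # The convergence timeout must be strictly greater than the reload-delay.
        if convergence_timeout_s <= delay:
            problems.append(
                f"convergence timeout {convergence_timeout_s}s is not greater "
                f"than reload-delay {label} ({delay}s)"
            )
    return problems


# Values from this thread: default 300s timeout, reload-delay raised to 300s.
for problem in check_convergence_timeout(300, 300, 300):
    print(problem)
```

An empty result only means the timers satisfy the "greater than" rule; it does not guarantee that all peers will establish in time.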

 

Please let us know if you have any further queries. 
