CVP HA Deployment Guide

 
 

Introduction

This document describes high availability deployment scenarios and best practices for Arista CloudVision Portal (CVP). The guide is intended for network architects and engineers who are planning, designing, or implementing an on-premises deployment of CVP.

Scope

In Scope

  • CVP version 2020.2.x/2020.3.x/2021.1.x
  • On-premises deployment
  • CVP virtual appliance
  • CVP physical appliance
  • Single-site
  • Multi-site
  • Disaster recovery

Out of Scope

  • Cloud-based deployment
  • CloudVision as a Service (CVaaS)
  • CVP virtual appliance in the public cloud
  • Detailed implementation steps (refer to the CVP user guide on arista.com)
  • CloudVision Exchange (CVX)

Deployment Models

Arista offers the on-premises CVP platform in both virtual and physical form factors.

CloudVision Virtual Appliance

The CVP virtual appliance is a packaged OVA file that consists of the base OS, database, and web application. Please refer to the release notes on arista.com for the latest system requirements and device scale information.

CloudVision Physical Appliance

The CloudVision Appliance (CVA) is an Arista-branded physical server with the CVP software pre-installed.

The common use cases for the physical appliance include:

  • Customer preference for a physical server over a virtual machine deployment
  • Customer preference for an Arista-branded device to indicate a ‘network device’ entity within their data centre operation
  • Customer preference for a consistent platform across all deployments
  • CVP virtual appliance not supported by the customer’s hypervisor platform (Hyper-V, etc.)

Further details can be found in the CVP data sheet.

High Availability

Clustering

The CVP virtual and physical appliances can be deployed stand-alone (single-node) or as a cluster of three (multi-node). A single-node deployment is recommended for testing purposes as it provides a simpler setup and requires fewer resources. For production deployments, a cluster of three (multi-node) is recommended for performance and N+1 redundancy.

The following compares single-node and multi-node deployments.

Scale and Performance

  • Single-node: limited to 250 devices and 10K interfaces
  • Multi-node: 6x higher scale compared to single-node, with load sharing across nodes

Redundancy

  • Single-node: single point of failure
  • Multi-node: N+1 redundancy; if a node goes down, Kubernetes reschedules the lost pods on the other two nodes; uninterrupted provisioning and telemetry for single-node failures; simplified Return Merchandise Authorization (RMA) for node failures

Corruption Management

  • Single-node: no recovery for lost data; manual intervention required to resolve data corruption
  • Multi-node: data has three replicas; automatic recovery from data corruption

A multi-node cluster is formed of exactly three nodes with identical system resources (three virtual appliances or three physical appliances). The three-node requirement comes from the underlying components, including ZooKeeper, Hadoop/HDFS and HBase, which rely on a quorum for failover negotiation and split-brain avoidance; with three nodes, a majority of two remains available after a single node failure.

Note. A multi-node cluster can survive the failure of a single node while providing uninterrupted provisioning and telemetry.

To protect against hardware failures impacting cluster operation, Arista recommends the following:

  • Deploy the three virtual appliances across three different hosts
  • Connect each host/physical appliance to two independent power feeds
  • Multi-home each host/physical appliance to two management switches
  • Use dedicated NICs for cluster communication
  • Implement a load balancer VIP for frontend UI access

The following diagram shows a three-node cluster deployment.

CVP Multi-Node Cluster

Bandwidth and Latency Requirements

All three cluster nodes should reside within the same data centre location. Geographically separating cluster nodes is not recommended, as insufficient bandwidth or excessive latency between nodes can degrade performance.

CVP can manage switches at remote sites, provided the bandwidth and latency requirements below are met.

Cluster Sync (CVP ←→ CVP)

  • Bandwidth: 1 Gbps uncontended
  • Latency: 5 ms or less

Device Management (CVP ←→ Devices)

  • Bandwidth: generally 200-400 Kb/s per device with gzip compression enabled
  • Latency: 500 ms or less
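
As a rough sizing example based on the figures above, a remote site with 100 managed devices would require in the region of 20-40 Mb/s of WAN bandwidth for streaming telemetry (100 devices x 200-400 Kb/s), in addition to any provisioning and image transfer traffic.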

Installation Requirements

The following are the CVP multi-node installation requirements. Document these items for each node (Node 1, Node 2, Node 3) prior to running the installation scripts:

  • Role (Node 1: Primary, Node 2: Secondary, Node 3: Tertiary)
  • Hostname (FQDN)
  • IP Address (eth0)
  • Netmask (eth0)
  • IP Address (eth1)
  • Netmask (eth1)
  • Default Route
  • DNS Server
  • NTP Server
  • Telemetry Ingest Key
  • Cluster Interface Name
  • Device Interface Name
  • Host IP Address (Physical Appliance)
  • Host Netmask (Physical Appliance)
  • iDRAC IP Address (Physical Appliance)
  • iDRAC Netmask (Physical Appliance)

CVP Multi-Node Cluster Example

Single-Site Deployment

For single-site deployments, Arista recommends co-locating the CVP cluster with the switches under management. This ensures that congestion or outages on wide-area links do not impact CVP’s ability to manage and receive telemetry data from the switches. The CVP nodes should be connected to a dedicated management network as per the recommendations in the high availability section. It is also recommended to implement a backup strategy that exports the system backup to a remote location at regular intervals, allowing for disaster recovery situations in which a cluster rebuild is required.

CVP Single-Site Deployment

Multi-Site Deployment

This section describes two multi-site deployment models, namely centralised cluster and dual cluster.

Centralised Cluster

In the centralised cluster model, a single cluster, usually located at a hub site, manages switches at a number of remote sites.

Network architects considering a centralised cluster model must verify:

  • Whether a centralised cluster can support the required scale (total number of devices and interfaces)
  • Whether the WAN/DCI can support the required bandwidth and latency (see the previous section)
  • Whether the recovery time objective (RTO) and recovery point objective (RPO) are met should the centralised cluster become unavailable (due to a cluster failure or a power/WAN outage at the hub site); see the backup and restore section for details

As in the single-site deployment model, it is recommended to implement a backup strategy to export the system backup to a remote location at regular intervals.

CVP Multi-Site Deployment with Centralised Cluster

Dual Cluster

The dual cluster model supports geo-redundancy by providing a secondary cluster at a geographically separate location. This deployment model provides application and service continuity should the primary cluster become unavailable. A dual cluster can be achieved in different ways, depending on RTO/RPO requirements.

Cold Standby

The following describes the cold standby model: if the primary cluster goes down, a standby cluster is powered on and its configuration is restored from a backup.

Notes:

  • The standby cluster can retain the same IP addresses or use different ones, depending on the type of network
  • If the standby cluster uses different IP addresses, you must re-register the switches to repoint the streaming telemetry agent to the new cluster
  • The recovery timeline depends on failure detection, backup frequency and restore time
  • Telemetry data is not included in the backup and restore process
  • The primary and standby clusters must run identical software versions

The following diagram shows the cold standby model.

Dual Cluster with Cold Standby

Warm Standby

The following describes the warm standby model: switches stream telemetry to both clusters, which provides uninterrupted monitoring should the primary cluster become unavailable. Switches are also registered with both clusters for provisioning, with one cluster acting as primary while the other is standby.

Notes:

  • Telemetry services are available in both clusters (active-active)
  • Provisioning services are available in the primary cluster, with failover to the secondary cluster if required (active-standby)
  • Both clusters must support the required scale for all sites (total number of devices and interfaces)
  • Clusters can run different software versions, allowing for upgrades and testing
  • A synchronization strategy must be defined for the provisioning dataset
  • Provides a loosely coupled HA architecture, as each cluster maintains an independent database; only text-based configlets are synchronised

The following diagram shows the warm standby model.

Dual Cluster with Warm Standby

Backup and Restore

Default Backup Schedule

Arista provides a script at /cvpi/tools/backup.py which runs daily at 2:00 am by default to back up the provisioning dataset, retaining the last five backups in /data/cvpbackup/.

Default backup schedule:

[root@cvp1 cvpbackup]# crontab -l
0 2 * * * /cvpi/tools/backup.py --limit 5
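
To confirm which backups have been retained, list the contents of the backup directory:

[root@cvp1 cvpbackup]# ls -lt /data/cvpbackup/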

It is good practice to export backups to a remote location at regular intervals so that an adequate set of backup files is available to restore the provisioning dataset. See the automatic backup export section for details.

Note. There is no backup or restore of the telemetry dataset.

Manual Backup

Use the cvpi backup command to save a copy of the provisioning dataset.

[root@cvp1 cvpbackup]# cvpi backup cvp

The cvpi backup command creates two backup files in the /data/cvpbackup directory: cvp.<timestamp>.tgz (the provisioning dataset) and eosimages.<timestamp>.tgz (EOS images). The eosimages.tgz file is generated only when it differs from the copy already present, and it is an optional parameter for restore if the CVP system already contains the same EOS images.

To check the progress of the backup, read the latest backup_cvp.*.log file in /cvpi/logs/cvpbackup.
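
For example, the following one-liner follows the most recent backup log as it is written (a minimal sketch, assuming the log naming convention described above):

[root@cvp1 cvpbackup]# tail -f $(ls -t /cvpi/logs/cvpbackup/backup_cvp.*.log | head -1)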

Note. For a multi-node cluster, you can run this command only on the primary node.

Manual Restore

Use the cvpi restore command to restore the provisioning dataset.

[root@cvp1 cvpbackup]# cvpi restore cvp cvp.timestamp.tgz eosimages.timestamp.tgz

The cvpi restore command stops the CVP application and disrupts service for the duration of the restore. If a backup taken on a different CVP system is restored onto a new CVP system, you may also need to re-onboard the EOS devices or restart the TerminAttr daemons on them after the restore.

To check the progress of the restore, read the latest restore_cvp.*.log file in /cvpi/logs/cvpbackup.

Note. For a multi-node cluster, you can run this command only on the primary node.

Automatic Backup Export

A cronjob can be used to automate the backup export process. The following example transfers the latest backup files to a remote server daily at 4:00 am using SCP.

[root@cvp1 cvpbackup]# crontab -e
0 2 * * * /cvpi/tools/backup.py --limit 5 ← Default backup schedule
0 4 * * * scp $(ls -t /data/cvpbackup/cvp.20* | head -1) root@backup1:/cvp/
0 4 * * * scp $(ls -t /data/cvpbackup/cvp.eos* | head -1) root@backup1:/cvp/
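
If you prefer a single cron entry with basic error reporting, the two scp lines can be wrapped in a small script along the following lines. This is a minimal sketch rather than an Arista-provided tool: the script name is hypothetical, and the paths and backup1 destination follow the cron example above.

#!/bin/bash
# export_cvp_backup.sh (hypothetical name): copy the most recent CVP backup
# files to the remote backup server. Intended to be called from cron in place
# of the two scp entries shown above.

DEST="root@backup1:/cvp/"

# Most recent provisioning backup (cvp.<timestamp>.tgz)
latest_cvp=$(ls -t /data/cvpbackup/cvp.20* 2>/dev/null | head -1)

# Most recent EOS images backup (only regenerated when the images change)
latest_eos=$(ls -t /data/cvpbackup/cvp.eos* 2>/dev/null | head -1)

if [ -n "$latest_cvp" ]; then
    scp "$latest_cvp" "$DEST" || echo "Export of $latest_cvp failed" >&2
fi

if [ -n "$latest_eos" ]; then
    scp "$latest_eos" "$DEST" || echo "Export of $latest_eos failed" >&2
fi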

Because the scp transfers run non-interactively from cron, set up SSH key-based authentication from the CVP node to the remote server:

1. If you do not have a public key, generate one with the ssh-keygen command:

[root@cvp1 cvpbackup]# ssh-keygen -t rsa -b 4096

2. Copy the public key of the CVP node:

[root@cvp1 cvpbackup]# cat ~/.ssh/id_rsa.pub
ssh-rsa …
root@cvp1.local

3. Add the public key of the CVP node to the remote server:

[root@backup1 ~]# vi ~/.ssh/authorized_keys
ssh-rsa …
root@cvp1.local

4. Add the remote server to the known hosts file:

[root@cvp1 cvpbackup]# ssh root@backup1
The authenticity of host 'backup1' can't be established.
ECDSA key fingerprint is …
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'backup1' (ECDSA) to the list of known hosts.
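
Alternatively, if the ssh-copy-id utility is available on the CVP node (an assumption; it ships with most OpenSSH client packages), steps 2 to 4 can be replaced with a single command that copies the public key to the remote server and confirms its host key:

[root@cvp1 cvpbackup]# ssh-copy-id root@backup1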

Note. For a multi-node cluster, enable the cronjob on the primary node.

Streaming Telemetry to Multiple CVP Clusters

TerminAttr v1.7.1 introduced support for streaming to multiple CVP clusters. The -cvopt option is used to configure TerminAttr for multi-cluster streaming:

  • -cvopt={cluster-name}.addr={ip-address}: single IP address or comma-separated list of CVP cluster IP addresses
  • -cvopt={cluster-name}.auth=key,{key-phrase}: authentication key phrase (shared key authentication)
  • -cvopt={cluster-name}.auth=token,{token-path}: enrolment token file (certificate-based authentication)
  • -cvopt={cluster-name}.vrf={vrf}: VRF used to connect to CVP

Shared Key Authentication Example

The following example configures TerminAttr to stream to primary and secondary clusters with shared key authentication:

daemon TerminAttr
   exec /usr/bin/TerminAttr -cvopt=primary.addr=1.1.1.1:9910,1.1.1.2:9910,1.1.1.3:9910 -cvopt=primary.auth=key,Arista123 -cvopt=primary.vrf=MGMT -cvopt=secondary.addr=2.2.2.1:9910,2.2.2.2:9910,2.2.2.3:9910 -cvopt=secondary.auth=key,Arista123 -cvopt=secondary.vrf=MGMT -cvcompression=gzip -smashexcludes=ale,flexCounter,hardware,kni,pulse,strata -ingestexclude=/Sysdb/cell/1/agent,/Sysdb/cell/2/agent -taillogs
   no shutdown

Certificate-based Authentication Example

Generate enrolment tokens on primary and secondary clusters:

[root@cvp1 ~]# curl -d '{"reenrollDevices":["*"]}' -k https://127.0.0.1:9911/cert/createtoken
{"token":"(cvp1-token)"}
[root@cvp2 ~]# curl -d '{"reenrollDevices":["*"]}' -k https://127.0.0.1:9911/cert/createtoken
{"token":"(cvp2-token)"}

Import the enrolment tokens on the switch (copy only the token value, not the entire JSON dictionary):

switch#copy terminal: file:/tmp/primary-token
enter input line by line; when done enter one or more control-d
(cvp1-token)
Copy completed successfully.
switch#copy terminal: file:/tmp/secondary-token
enter input line by line; when done enter one or more control-d
(cvp2-token)
Copy completed successfully.

Enable TerminAttr to stream to primary and secondary clusters with certificate-based authentication:

daemon TerminAttr
   exec /usr/bin/TerminAttr -cvopt=primary.addr=1.1.1.1:9910,1.1.1.2:9910,1.1.1.3:9910 -cvopt=primary.auth=token,/tmp/primary-token -cvopt=primary.vrf=MGMT -cvopt=secondary.addr=2.2.2.1:9910,2.2.2.2:9910,2.2.2.3:9910 -cvopt=secondary.auth=token,/tmp/secondary-token -cvopt=secondary.vrf=MGMT -cvcompression=gzip -smashexcludes=ale,flexCounter,hardware,kni,pulse,strata -ingestexclude=/Sysdb/cell/1/agent,/Sysdb/cell/2/agent -taillogs
   no shutdown


Provisioning from Multiple Clusters

To register a device with multiple CVP clusters for provisioning:

1. Apply a multi-cluster streaming configuration (see previous section)
2. Confirm the device appears in the inventory page on both clusters (‘Streaming Only’ status)
3. Select ‘Provision Device’ from the Device Overview page on both clusters

Provision a ‘Streaming Only’ device

4. Create identical container and configlet structures on both clusters
5. Move the device to the target container and apply configlet(s) on both clusters
6. Save the changes and execute the task using the change control page

Note. There is no automatic synchronization of the provisioning dataset between clusters; a configuration change made on one cluster will therefore cause an out-of-compliance status on the other. It is recommended to designate one cluster as primary and implement a synchronization strategy for the standby.

Synchronize the Provisioning Dataset

Arista provides a script at /cvpi/tools/cvptool.py which can back up and restore provisioning data (configlets, containers, etc.).

The following example shows two cronjobs enabled on cvp2:

  • Back up configlets from cvp1 at 5:00 am and save them to cvp1.tar on the local file system
  • Restore configlets from cvp1.tar at 6:00 am

[root@cvp2 ~]# crontab -e
0 2 * * * /cvpi/tools/backup.py --limit 5 ← Default backup schedule
0 5 * * * /cvpi/tools/cvptool.py --user cvpadmin --password Arista123 --host cvp1 --action backup --objects configlets --tarFile cvp1.tar
0 6 * * * /cvpi/tools/cvptool.py --user cvpadmin --password Arista123 --host 127.0.0.1 --action restore --objects configlets --tarFile cvp1.tar

These cronjobs ensure that cvp2 resynchronizes configlets from cvp1 every 24 hours, resolving any out-of-compliance issues. They also ensure that cvp2 can provide provisioning services should cvp1 be offline for an extended period of time.

Note. For a multi-node cluster, enable the cronjobs on the primary node.

Note. The restore will create tasks on the receiving cluster. These can be executed or cancelled (the running-config and the design configuration should be resynchronized after the restore).
