Network simulation environments have traditionally been limited to a single compute node, which made the labbing of a full-scale production network an exercise in compromise and trade-offs. At the same time, compute resources are cheap and abundant, and modern application designs make use of them by adopting meshed scale-out architectures that treat multiple hosts as a single pool of resources. In this post, we’ll see how (with just a few clicks*) we can build a replica of a real production network, orchestrated by Kubernetes based on information extracted from Arista’s CloudVision Portal (CVP).
* Assuming all the prerequisites are met
Kubernetes as a network simulation platform
In order to spin up multiple virtual network devices on different hosts, we need a system that can do (CPU/RAM) resource tracking and reservation, mapping of each network device to a particular host (a.k.a. scheduling) and lifecycle management, i.e. health monitoring, restarting of failed devices etc. Multiple systems can meet these criteria, so the choice of Kubernetes (K8s) is somewhat arbitrary in this case. One benefit K8s brings is its pervasiveness in modern public and private clouds, as it’s becoming the “cloud-native operating system”.
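To make the idea of resource tracking and scheduling concrete, here is a toy first-fit placement in Python. It is only an illustration of the concept: the real Kubernetes scheduler filters and scores nodes in a far more elaborate way, and the `Host` class, host names and device sizes below are made up for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Host:
    name: str
    cpu: int      # free vCPUs
    ram_gb: int   # free RAM in GB
    devices: list = field(default_factory=list)


def schedule(devices, hosts):
    """Place each (name, cpu, ram_gb) device on the first host with room."""
    placement = {}
    for name, cpu, ram_gb in devices:
        for host in hosts:
            if host.cpu >= cpu and host.ram_gb >= ram_gb:
                # Reserve the resources so later devices see reduced capacity.
                host.cpu -= cpu
                host.ram_gb -= ram_gb
                host.devices.append(name)
                placement[name] = host.name
                break
        else:
            # No host could fit the device.
            raise RuntimeError(f"no host has capacity for {name}")
    return placement


# Hypothetical hosts and devices, purely for illustration.
hosts = [Host("node-1", cpu=2, ram_gb=2), Host("node-2", cpu=4, ram_gb=4)]
devices = [("leaf-1", 1, 1), ("leaf-2", 1, 1), ("spine-1", 2, 2)]
placement = schedule(devices, hosts)
print(placement)
```

The two leaves fill node-1, so the spine ends up on node-2. Lifecycle management (health checks, restarts) is the part this sketch leaves out entirely.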
However, the K8s cluster networking model is somewhat restrictive and doesn’t provide an easy way to run network simulations on top of it. Multiple solutions exist that try to address some of these limitations, mostly targeted at running NFV-based applications on top of K8s. In our case we’ll be using meshnet-cni, a CNI plugin custom-built to create arbitrary virtual network topologies in K8s. It works by creating point-to-point links between pods and can co-exist with any standard CNI plugin (e.g. flannel). It is used in conjunction with k8s-topo, an orchestration tool that uses K8s APIs to create and destroy network topologies based on a single YAML file describing all interconnected devices.
cEOS vs c(vEOS)
Choosing the workload to use for the simulations is an iterative process. cEOS-lab was released in 2018 and provides a way to run Arista’s EOS in a container-native format. Due to its smaller resource footprint, it makes sense to consider it as the first (default) choice. However, cEOS-lab has a number of limitations (e.g. lack of dot1Q support) that may prevent us from using it to simulate all features used in a production environment (e.g. MLAG). In that case we can use a containerised vEOS, which runs vEOS inside a KVM hypervisor wrapped in a docker image format. This option provides a closer approximation to our production environment but has a bigger footprint and may present a challenge with inconsistent interface naming (we can’t have Ethernet3/1/1 inside vEOS like we do inside cEOS).
CVP as topology and configuration source of truth
Finally, we need to find a way to capture the current state of a network topology to be used in our simulations. By default, all network devices include LLDP information as a part of the state telemetry that’s being streamed to CVP. CVP uses this information to build the telemetry data that can be viewed in a GUI and is also accessible via the recently added telemetry API. This provides us with a single point of contact for all topology information and includes both Arista and 3rd-party devices discovered through LLDP globally. We only need to make two API calls to extract all node and edge information and build a full topological graph of our network. Running device configurations can be extracted through the same telemetry API to be used as startup configs for all virtual network devices. All of these API calls and the data post-processing are incorporated into a single Python script, cvp-netsim, which takes the CVP IP address and credentials as input and produces a topology file that can be consumed directly by k8s-topo.
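As a rough illustration of the post-processing step, the sketch below turns per-device LLDP neighbour records into a list of unique point-to-point links and a list of nodes. The record layout (`device`, `port`, `neighbor`, `neighbor_port`) is a hypothetical simplification, not the schema returned by the CVP telemetry API, and this is not the code used by cvp-netsim.

```python
def build_links(lldp_records):
    """Collapse per-device LLDP neighbour records into unique links."""
    links = set()
    for rec in lldp_records:
        a = (rec["device"], rec["port"])
        b = (rec["neighbor"], rec["neighbor_port"])
        # A link is reported once from each end; sorting the two endpoints
        # makes both reports collapse into the same set entry.
        links.add(tuple(sorted((a, b))))
    return sorted(links)


# Hypothetical records: the first link is seen from both ends.
records = [
    {"device": "spine-1", "port": "Ethernet1",
     "neighbor": "leaf-1", "neighbor_port": "Ethernet49"},
    {"device": "leaf-1", "port": "Ethernet49",
     "neighbor": "spine-1", "neighbor_port": "Ethernet1"},
    {"device": "spine-1", "port": "Ethernet2",
     "neighbor": "leaf-2", "neighbor_port": "Ethernet49"},
]

links = build_links(records)
nodes = sorted({device for link in links for device, _ in link})
print(nodes)
for link in links:
    print(link)
```

Note that a neighbour which does not report LLDP itself (for example a 3rd-party device seen only from the Arista side) still shows up as a node, because one end of the link is enough to record it.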
Now it’s time to put all of these tools together to see how we can build a virtual replica of our production network. At a high level this process involves running two scripts, cvp-netsim and k8s-topo: the former pulls the required information from CVP and the latter provisions all the necessary K8s resources.
Before we can run those scripts, we need to make sure we’ve satisfied some of the obvious (and not so) requirements:
- We need to have a working network with CVP consuming streaming telemetry from all Arista devices.
- We need a K8s cluster with enough resources to accommodate the virtual network of our size. The rough guidelines would be 1vCPU + 1GB of RAM per cEOS or 2vCPUs + 2GB of RAM per vEOS.
- This K8s cluster needs to have a meshnet-cni plugin installed on all nodes and k8s-topo hosted in one of the pods.
- We need to have docker images of our virtual devices uploaded to a local docker registry accessible from our K8s cluster. See instructions for cEOS and vEOS.
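The sizing guideline in the list above is easy to turn into a quick back-of-the-envelope check. The sketch below uses the per-device figures quoted there; the device counts and the node size (32 vCPUs, 64 GB) are made-up inputs, and the calculation ignores whatever the K8s system pods, etcd and the registry consume.

```python
import math

# Per-device guidelines from the list above: (vCPUs, GB of RAM).
FOOTPRINT = {"ceos": (1, 1), "veos": (2, 2)}


def cluster_requirements(counts):
    """Total (vCPUs, GB of RAM) for a dict like {"veos": 60}."""
    cpu = sum(FOOTPRINT[kind][0] * n for kind, n in counts.items())
    ram = sum(FOOTPRINT[kind][1] * n for kind, n in counts.items())
    return cpu, ram


def nodes_needed(counts, node_cpu, node_ram_gb):
    """Minimum number of identical nodes, limited by the scarcer resource."""
    cpu, ram = cluster_requirements(counts)
    return max(math.ceil(cpu / node_cpu), math.ceil(ram / node_ram_gb))


# Hypothetical example: 60 vEOS devices on 32 vCPU / 64 GB nodes.
print(cluster_requirements({"veos": 60}))
print(nodes_needed({"veos": 60}, node_cpu=32, node_ram_gb=64))
```

In this example CPU is the limiting resource, so the node count follows from the vCPU total rather than from RAM.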
Once all of the above is done, the process looks really easy. First get a local copy of cvp-netsim.
~> git clone -b cvp-k8s https://github.com/aristanetworks/eoscentral.git cvp-netsim && cd cvp-netsim
Next, pull all the necessary data from CVP and convert it into a form that can be easily consumed by vEOS. The conversion means that the script renumbers connected interfaces sequentially, starting from Ethernet1, and updates the running configuration accordingly. This needs to be done only for vEOS, since cEOS can connect out-of-sequence interfaces with arbitrary names.
~/cvp-netsim> ./ingest.py --veos localhost:9443 cvpadmin cvppassword
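The renumbering step can be pictured roughly as follows. This is a minimal sketch of the idea, not the logic inside ingest.py: the function names are invented here, and it does not handle unconnected interfaces whose existing names might collide with the new ones.

```python
import re


def natural_key(name):
    """Sort key that puts Ethernet2 before Ethernet10 and handles 3/1/1."""
    return [int(n) for n in re.findall(r"\d+", name)]


def renumber(interfaces):
    """Map connected interfaces to Ethernet1, Ethernet2, ... in order."""
    ordered = sorted(set(interfaces), key=natural_key)
    return {old: f"Ethernet{i}" for i, old in enumerate(ordered, start=1)}


def rewrite_config(config, mapping):
    """Replace whole interface names in a config in a single pass."""
    # Longest names first, plus a negative lookahead, so that Ethernet3
    # never matches the start of Ethernet3/1/1 or Ethernet30. A single
    # pass also avoids renaming an interface twice.
    names = sorted(mapping, key=len, reverse=True)
    pattern = re.compile(
        "(?:" + "|".join(re.escape(n) for n in names) + r")(?![\d/])"
    )
    return pattern.sub(lambda m: mapping[m.group(0)], config)


# Hypothetical connected interfaces of one device.
mapping = renumber(["Ethernet49/1", "Ethernet3/1/1", "Ethernet50/1"])
config = "interface Ethernet3/1/1\ninterface Ethernet49/1\n"
print(mapping)
print(rewrite_config(config, mapping))
```

Doing the substitution in one pass matters: renaming Ethernet49/1 to Ethernet2 and then processing an original Ethernet2 in a second pass would rename the same interface twice.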
The last command produces a cvp_topology.tar.gz archive that contains the topology and configuration files which can now be copied to the host with K8s cluster admin credentials (e.g. k8s-lab-node-1 in this example).
~/cvp-netsim> scp cvp_topology.tar.gz k8s-lab-node-1:/home/core
From here we can copy the archive into the k8s-topo pod.
$ kubectl cp /home/core/cvp_topology.tar.gz k8s-topo:/
And extract the required information into a local directory.
# rm -rf ./lab && mkdir -p ./lab
# tar zxvf /cvp_topology.tar.gz -C lab
We need to tell k8s-topo which docker image to use by creating a custom hostname-to-image mapping.
# grep -A 1 custom_image lab/cvp_topology.yml
custom_image:
  acme: "10.1.1.1:5000/veos:latest"
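One way to think about this mapping is as a lookup from a string found in the hostname to an image. The sketch below assumes substring matching with the longest key winning; that matching rule and the `pick_image` helper are assumptions made for illustration, and k8s-topo itself may resolve images differently.

```python
def pick_image(hostname, custom_image, default=None):
    """Return the image for the longest custom_image key found in hostname."""
    matches = [key for key in custom_image if key in hostname]
    if not matches:
        return default
    return custom_image[max(matches, key=len)]


# Mapping taken from the topology file above.
custom_image = {"acme": "10.1.1.1:5000/veos:latest"}

print(pick_image("acme-dc1s1001-a", custom_image))
print(pick_image("other-device", custom_image, default="no custom image"))
```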
Finally, we can create our topology inside the K8s cluster.
# ./bin/k8s-topo --create lab/cvp_topology.yml
INFO:__main__:All data has been uploaded to etcd
INFO:__main__:All pods have been created successfully
A few minutes later we can check that all of the devices have been deployed.
# kubectl get pods
NAME                                      READY   STATUS    RESTARTS   AGE
etcd0                                     1/1     Running   0          2d12h
etcd1                                     1/1     Running   0          2d12h
etcd2                                     1/1     Running   0          2d12h
internal-docker-registry-7999859b-fm22v   1/1     Running   0          2d12h
k8s-topo                                  1/1     Running   0          2d12h
acme-dc1b2001-a                           1/1     Running   0          3m27s
acme-dc1b2001-b                           1/1     Running   0          3m27s
acme-dc1l3001-a                           1/1     Running   0          3m29s
acme-dc1l3001-b                           1/1     Running   0          3m28s
acme-dc1l3002-a                           1/1     Running   0          3m31s
acme-dc1l3002-b                           1/1     Running   0          3m27s
acme-dc1l3003-a                           1/1     Running   0          3m31s
acme-dc1l3003-b                           1/1     Running   0          3m27s
acme-dc1l3004-a                           1/1     Running   0          3m30s
acme-dc1l3004-b                           1/1     Running   0          3m29s
acme-dc1l3005-a                           1/1     Running   0          3m31s
acme-dc1l3005-b                           1/1     Running   0          3m30s
acme-dc1l3006-a                           1/1     Running   0          3m30s
acme-dc1l3006-b                           1/1     Running   0          3m31s
acme-dc1l3007-a                           1/1     Running   0          3m28s
acme-dc1l3007-b                           1/1     Running   0          3m31s
acme-dc1l3008-a                           1/1     Running   0          3m31s
acme-dc1l3008-b                           1/1     Running   0          3m30s
acme-dc1l3009-a                           1/1     Running   0          3m29s
acme-dc1l3009-b                           1/1     Running   0          3m30s
acme-dc1l3010-a                           1/1     Running   0          3m31s
acme-dc1l3010-b                           1/1     Running   0          3m30s
acme-dc1l3011-a                           1/1     Running   0          3m27s
acme-dc1l3011-b                           1/1     Running   0          3m29s
acme-dc1l3012-a                           1/1     Running   0          3m31s
acme-dc1l3012-b                           1/1     Running   0          3m31s
acme-dc1l3013-a                           1/1     Running   0          3m30s
acme-dc1l3013-b                           1/1     Running   0          3m31s
acme-dc1l3014-a                           1/1     Running   0          3m30s
acme-dc1l3014-b                           1/1     Running   0          3m28s
acme-dc1l3015-a                           1/1     Running   0          3m31s
acme-dc1l3015-b                           1/1     Running   0          3m28s
acme-dc1l3016-a                           1/1     Running   0          3m30s
acme-dc1l3016-b                           1/1     Running   0          3m29s
acme-dc1l3017-a                           1/1     Running   0          3m31s
acme-dc1l3017-b                           1/1     Running   0          3m29s
acme-dc1l3018-a                           1/1     Running   0          3m30s
acme-dc1l3018-b                           1/1     Running   0          3m29s
acme-dc1l3019-a                           1/1     Running   0          3m31s
acme-dc1l3019-b                           1/1     Running   0          3m31s
acme-dc1l3020-a                           1/1     Running   0          3m30s
acme-dc1l3020-b                           1/1     Running   0          3m28s
acme-dc1l3021-a                           1/1     Running   0          3m27s
acme-dc1l3021-b                           1/1     Running   0          3m30s
acme-dc1l3022-a                           1/1     Running   0          3m31s
acme-dc1l3022-b                           1/1     Running   0          3m27s
acme-dc1l3023-a                           1/1     Running   0          3m31s
acme-dc1l3023-b                           1/1     Running   0          3m28s
acme-dc1l3024-a                           1/1     Running   0          3m31s
acme-dc1l3024-b                           1/1     Running   0          3m31s
acme-dc1l3025-a                           1/1     Running   0          3m28s
acme-dc1l3025-b                           1/1     Running   0          3m27s
acme-dc1l3026-a                           1/1     Running   0          3m27s
acme-dc1l3026-b                           1/1     Running   0          3m31s
acme-dc1l3027-a                           1/1     Running   0          3m31s
acme-dc1l3028-a                           1/1     Running   0          3m31s
acme-dc1s1001-a                           1/1     Running   0          3m27s
acme-dc1s1001-b                           1/1     Running   0          3m31s
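With this many pods, scanning the list by eye gets tedious. A small helper along the following lines could flag anything that is not fully up. It is a sketch that parses the plain-text output of kubectl get pods; the function name and the sample output are made up for illustration.

```python
def not_ready(kubectl_output, prefix=""):
    """Return names of pods that are not Running with all containers ready."""
    failing = []
    lines = kubectl_output.strip().splitlines()[1:]  # skip the header row
    for line in lines:
        # Only the first three columns (NAME, READY, STATUS) are needed.
        name, ready, status = line.split()[:3]
        if not name.startswith(prefix):
            continue
        ready_now, wanted = ready.split("/")
        if status != "Running" or ready_now != wanted:
            failing.append(name)
    return failing


# Hypothetical output with one pod still pending.
sample = """
NAME              READY   STATUS    RESTARTS   AGE
k8s-topo          1/1     Running   0          2d12h
acme-dc1s1001-a   1/1     Running   0          3m27s
acme-dc1s1001-b   0/1     Pending   0          3m31s
"""

print(not_ready(sample, prefix="acme-"))
```

To run it against a live cluster, the text can be captured with `subprocess.run(["kubectl", "get", "pods"], capture_output=True, text=True).stdout` and passed in place of `sample`.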
And connect to any one of them from inside the k8s-topo pod:
# kubectl exec -it acme-dc1s1001-a bash
sh-4.2# telnet localhost 23
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
acme-dc1s1001-a login:
Now that we have tools to build production network replicas of practically any size, limited only by the resources of the cluster, testing network changes and building network CI pipelines is becoming easier than ever. The network engineering community should take this opportunity to challenge legacy change management procedures and make them more automated and reliable, all the while learning new technologies and bridging gaps between IT infrastructure and networking teams by using the same tools and methodologies.