- Previously on Network CI/CD…
- CI/CD environment overview
- Stage 1 – Building device configurations
- Stage 2 – Network analysis with Batfish
- Stage 3 – Testing and verification with Robot Framework
- Stage 4 – Dry-running the change and generating diffs
- Stage 5 – Pushing to production and generating state diffs
Previously on Network CI/CD…
We kicked off this series with an overview of cEOS-lab and the different container orchestration tools that can be used for network simulations. In the second post we saw how to automate network verification and testing with Arista’s Robot Framework library. In this final post, we’ll put it all together to demonstrate a simple data centre network CI pipeline that runs through a sequence of stages to build and test every new configuration change.
Let’s take a typical data centre leaf-spine network as an example and assume that Leaf-1/2 and Spine-1/2 are already built and fully functional. Our goal is to introduce a new Leaf-3 switch, automate the generation of configs for ALL network devices and verify that the proposed changes not only establish end-to-end reachability between hosts, but also meet the multipath and resiliency requirements of a typical Clos network.
We’ll do that by utilising a number of freely available open-source and Arista software packages and tools that will be strung together to form a 5-stage CI/CD pipeline:
- Ansible will be used in the first stage to generate production and lab device configurations from a simple data model.
- Batfish will be used in the second stage to analyse the generated configurations and look for anomalies in the resulting control and data planes.
- Arista’s Robot Framework library will be used next to run a series of control plane and data plane tests against a lab network built out of cEOS-lab Docker images.
- The fourth stage will again use Ansible to dry-run the proposed changes against the production network and generate configuration diffs.
- The final stage will use Ansible to push the proposed changes into the production network, and will collect and compare the contents of the IPv4 FIB and ARP tables before and after the change.
Another assumption is that Leaf-3 has already received its initial config from a ZTP server, enough to make it remotely manageable by our CI system. In fact, ZTP can be part of a separate, initial service provisioning pipeline that includes the generation of initial configs, a DHCP server update, cabling verification using LLDP and, finally, user notification through IM/email once the switch is ready for deployment. The details of such a pipeline are outside the scope of this article.
CI/CD environment overview
At a high level, our CI environment consists of the following main components:
- Gitlab server – performs the functions of a git repository, a CI/CD server and a static website hosting engine (Gitlab Pages), used to render test reports and other artifacts.
- Gitlab Runner – a worker node for the CI/CD pipeline, which receives jobs from the Gitlab server, runs them and sends back the results.
- Private Docker registry – stores cEOS and other Docker images required by our pipeline.
- Batfish server – a separate containerised process that receives network configurations from a batfish client and calculates control and data plane properties.
- Executor – a Docker container with Ansible, Robot Framework and other required dependencies pre-installed. This container is stored in the private Docker registry and is used by our Gitlab runner to execute individual jobs.
The mechanism for triggering a pipeline run is very simple: every time we push a change to the Gitlab server, it looks for the .gitlab-ci.yml file describing all stages of the pipeline and schedules them to run on our Gitlab runner. Let’s have a closer look at each of these stages.
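As a rough illustration, a .gitlab-ci.yml for a pipeline like this could look as follows. All job names, image tags and playbook names below are assumptions, not taken from the demo repository:

```yaml
# Illustrative sketch only -- names and paths are hypothetical
image: registry.local/executor:latest   # the pre-built "executor" container

stages:
  - build
  - analyse
  - test
  - dry-run
  - deploy

build_configs:
  stage: build
  script: ansible-playbook build.yml
  artifacts:
    paths: [configs/]        # generated configs become artifacts for later stages

deploy:
  stage: deploy
  script: ansible-playbook deploy.yml
  when: manual               # the final stage requires an explicit user trigger
```

The `when: manual` keyword is what implements the pause before stage 5, so that a reviewer can inspect the collected artifacts first.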
Stage 1 – Building device configurations
In this stage Gitlab will invoke Ansible to build configuration files for all network devices in our topology. The input to this stage is a data model that describes all the required properties of our leaf-spine network, including its topology, routing and access port configuration. Here’s an abridged example of this data model, showing only information about Leaf-1:
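The real data model lives in the demo repository; purely as an illustration, an abridged entry for Leaf-1 could look something like this (all names, numbers and values below are assumptions):

```yaml
# Hypothetical data model fragment -- values are illustrative only
Leaf-1:
  asn: 65001
  loopback: 10.0.0.1
  uplinks:
    - { local: Ethernet1, peer: Spine-1 }
    - { local: Ethernet2, peer: Spine-2 }
  access_ports:
    - { interface: Ethernet3, vlan: 10, description: Host-1 }
```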
The goal of this data model is to capture all the required information, without duplication, for both the production and the lab (cEOS-lab) environments. In its current form, this information cannot be used to populate configuration templates, since it doesn’t contain the specific IP addresses or BGP sessions that need to be configured. To generate them, we’ll use a custom Ansible module called my_ipam, which assigns all the required IP addresses and builds a full table of BGP sessions (assuming each point-to-point link runs BGP). In real life this module could simply call an external database or IPAM and retrieve the required information from it.
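Inside the playbook, the module invocation could be sketched like this. Only the module name my_ipam comes from the demo; the variable and return value names are assumptions:

```yaml
# Hypothetical task -- argument and return names are illustrative
- name: Allocate link/loopback addresses and compute the full BGP session table
  my_ipam:
    model: "{{ fabric_model }}"   # assumed var holding the data model
  register: ipam                  # ipam.addresses / ipam.bgp_sessions assumed outputs
```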
The final task of this stage is to generate full device configurations from Jinja templates using the data model built in the previous step.
The full device configuration is built out of multiple individual templates. Below is an example of an inter-switch links template:
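The demo’s actual template is in the Github repository; a simplified sketch of such a template, assuming hypothetical variable names fed by the my_ipam step, could look like this:

```jinja
{# Hypothetical template -- iterates over the uplinks computed earlier #}
{% for link in uplinks %}
interface {{ link.local }}
   description to {{ link.peer }}
   no switchport
   ip address {{ link.ip }}/31
{% endfor %}
```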
As mentioned previously, this step will generate two sets of config – one for production and one for the lab environment. Once those configs are generated, we will store them as job artifacts – a set of files that are not part of our git repository, but that will be made available to all subsequent stages of the pipeline.
Stage 2 – Network analysis with Batfish
Batfish is a static network configuration analysis tool that can find many potential configuration issues without the need to interact with any physical or virtual network devices. It does that by analysing network configuration files, converting them into a vendor-independent model and calculating the resulting control and data plane states. In addition, it can simulate failures and calculate how routing and traffic flows will change as a result. The client-side component of Batfish can ask “questions” about the expected state of the network and gets the results back as Pandas dataframes. This makes it really easy to verify assertions like the following:
- There are no undefined or unused data structures (e.g. route-maps or prefix-lists) in our configuration files
- All leaf switches will successfully establish BGP peerings with each one of the spine switches
- Traceroute between the new Leaf-3 and Leaf-1/2 switches will take multiple paths (equal to the number of spines) and will traverse at most two hops
- In case of a complete outage of Spine-2, traceroutes from Leaf-3 to Leaf-1/2 will still succeed
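With the pybatfish client, the first two of these checks might be sketched as follows. This requires a running Batfish service, and the server address, network name and snapshot path below are assumptions:

```python
# Sketch only -- needs a reachable Batfish server; names are illustrative
from pybatfish.client.session import Session

bf = Session(host="localhost")       # hypothetical Batfish server address
bf.set_network("dc-fabric")
bf.init_snapshot("artifacts/prod-configs", name="candidate", overwrite=True)

# 1. No undefined references (e.g. a misspelled prefix-list in a route-map)
undefined = bf.q.undefinedReferences().answer().frame()
assert undefined.empty, undefined

# 2. Every BGP session in the fabric reaches the Established state
bgp = bf.q.bgpSessionStatus().answer().frame()
assert (bgp["Established_Status"] == "ESTABLISHED").all(), bgp
```

Because the answers are plain dataframes, each check reduces to a one-line Pandas expression.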
Using Batfish this early in the pipeline saves a lot of time and resources by catching a large number of errors before we get to the simulation-based tests in later stages. For example, Batfish would catch the error in our routing template where a prefix-list name is misspelled in the route-map that redistributes connected subnets into BGP. In our demo it also highlighted the fact that traceroutes between leaf loopbacks would fail, since those loopbacks would not get advertised by BGP.
More details about fixing errors found by Batfish can be found in the demo walkthrough.
Stage 3 – Testing and verification with Robot Framework
The next stage starts with our CI pipeline building a lab network, pre-populated with the lab configs generated in stage 1. To build it, we use docker-topo, a container topology orchestration tool that reads a topology definition file and builds a lab network out of cEOS-lab Docker containers. Once the lab is built, we use Arista’s network validation tool, which was described in the previous post, to validate the desired control plane and data plane properties. Specifically, we verify that:
- BGP peerings between Leaf-3 and Spine-1/2 are in the Established state
- Loopbacks of Leaf-1/2 are learned by Leaf-3 via BGP
- Leaf-3 can ping both loopbacks of Leaf-1/2
- Host-3 can ping Host-1 and Host-2
End hosts are simulated using lightweight Alpine Linux containers, so verifying end-to-end connectivity requires Robot to connect to locally running Docker containers, which is achieved using the Run Process keyword:
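A sketch of such a test case, assuming the hosts run as containers named Host-1/Host-3 (the container names and target IP address are assumptions, not the demo’s values):

```robot
*** Settings ***
Library    Process

*** Test Cases ***
Host-3 Can Ping Host-1
    # Run ping inside the Host-3 container via docker exec
    ${result} =    Run Process    docker    exec    Host-3    ping    -c    2    192.168.10.1
    Should Be Equal As Integers    ${result.rc}    0
```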
At the end of these tests, Robot Framework generates a report, which we store as an artifact of this stage of the pipeline.
Stage 4 – Dry-running the change and generating diffs
Once all the tests have completed, we can dry-run our change against the production network. To do that we run Ansible with the “check” and “diff” flags:
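The corresponding Gitlab job boils down to a single ansible-playbook invocation with those two flags (the playbook, inventory and file names below are assumptions):

```yaml
# Hypothetical job definition -- names are illustrative
dry_run:
  script:
    # --check prevents any changes; --diff prints the config differences
    - ansible-playbook -i inventory/prod deploy.yml --check --diff | tee diffs.txt
  artifacts:
    paths: [diffs.txt]     # keep the diffs for the change reviewer
```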
At this stage no changes are applied to any of the production devices; Ansible makes use of Arista’s “session-config diff” feature to get the differences between the proposed and running configs, like in this example for Spine-2:
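A session-config diff for adding a new BGP neighbour might look roughly like the following. The addresses and AS numbers here are made up for illustration:

```diff
--- system:/running-config
+++ session:/ci-session-config
@@ -42,4 +42,6 @@
 router bgp 65100
+   neighbor 10.0.254.5 remote-as 65103
+   neighbor 10.0.254.5 maximum-routes 12000
```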
All diffs get saved in a file and stored as this job’s artifacts. The automatic run of the pipeline stops here; the next stage requires manual intervention from a user.
Stage 5 – Pushing to production and generating state diffs
Before the new configs get pushed to production, a change reviewer gets a chance to examine all of the artifacts collected previously. By this stage the artifacts would include:
- Full configurations for all devices in the network
- Any errors found by Batfish
- Test report produced by Robot
- Configuration diffs collected by Ansible
If the change reviewer is satisfied with these artifacts, they can trigger the last job, which uses Ansible to replace the current configurations with the new ones.
To provide a better understanding of the impact the change has had on the network, we collect the contents of the IPv4 FIB and ARP tables and compare them before and after the change. This is done using Ansible’s json_query filter to extract the interesting values from the JSON responses and the difference filter to compare the pre- and post-change values. For example, here’s how Leaf-3’s IPv4 routing table changes as a result of the push:
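The extraction and comparison can be sketched with two tasks like these. The registered variable names are assumptions, and the json_query filter requires the jmespath Python library on the executor:

```yaml
# Hypothetical tasks -- assumes "show ip route | json" output registered as pre/post
- name: Extract the list of FIB prefixes from the post-change snapshot
  set_fact:
    post_routes: "{{ post.stdout[0] | json_query('vrfs.default.routes | keys(@)') }}"

- name: Compute prefixes present after the change but not before
  set_fact:
    added_routes: "{{ post_routes | difference(pre_routes) }}"
```

The same pair of tasks, with pre/post swapped, yields the prefixes that were removed by the change.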
A detailed walkthrough of all of the above 5 stages, with additional explanations and code, is available on Github.
There is no single “best practice” for CI pipelines, and the approach described in this post is just an example. The best CI pipeline is the one that suits your particular situation, and it can evolve over time as the network and its environment change. You may do everything with just Robot Framework, but if you think that Batfish or any other testing engine would help, don’t be afraid to add it to the mix. Ultimately, any CI pipeline is only as good as the test coverage it provides, so rigid testing discipline is a must for any network to function reliably. This means that everyone who operates the network should be able to write meaningful tests, using the testing engine of your choice, for all issues and bugs encountered in production.