In a previous post, I demonstrated the ability of CVP Telemetry to do automatic analytics and event correlation. While each release of CVP includes additional analytics, event correlation, and visualization capabilities, there are still situations for which there is not yet automatic correlation.
In this post, we are going to investigate a network problem and identify the root cause using telemetry data. The lab topology is a Layer 3 spine-leaf topology with 2 spines and 6 leafs configured as 3 MLAG pairs.
As stated in the previous post, Cloud Tracer is a very useful tool for tracking the operational status of the network and the resources on it. It can test and track the reachability of any URL, which could be another networking device or an endpoint/resource. The normal operational view from Cloud Tracer in the CVP Telemetry viewer, for the virtual lab set up for this post, is shown below.
The picture shows that we are tracking the status of three Ubuntu devices from 6 different leaf switches. We could also check other performance indicators such as HTTP response time, jitter, or latency, but for this example that is not necessary.
After the failure condition is introduced to the virtual lab topology, the same data output now changes to show that something is clearly wrong and negatively affecting ubuntu2.
The tracked resource ubuntu2 is clearly not responding to Cloud Tracer from any of the leaf switches, and we have been reassured externally that the server itself is running and operational.
We will start the investigation by learning a bit more about this resource. Mousing over any one of the 100% failure values on the screen will bring up the optional hyperlink ‘Compare Metrics’.
Selecting the ‘Compare Metrics’ link brings up a more detailed telemetry screen that displays additional information about the resource on a timeline. Importantly for this example, that information includes the IP address.
Expanding the timeline a bit and eliminating some of the unnecessary information from the display gives a view that also shows the time of the failure.
It is important to take note of the accuracy of the telemetry within CVP. All events are timestamped and streamed in real time from the switches, with sub-millisecond resolution in the database. The Leaf1 switch reported the outage at 7:48:05, but Cloud Tracer on the switches is configured to send tests every 5 seconds, so the actual failure happened at some point in the 5 seconds leading up to that timestamp.
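The bound on the failure time follows directly from the probe interval. Here is a minimal sketch of that arithmetic in Python, using the 7:48:05 report from Leaf1 and the 5-second test interval. The function names are my own for illustration and are not part of CVP.

```python
from datetime import datetime, timedelta

# Cloud Tracer test interval configured on the lab switches
PROBE_INTERVAL = timedelta(seconds=5)


def parse_time(text):
    """Parse a time-of-day string; the date part is irrelevant here."""
    return datetime.strptime(text, "%H:%M:%S")


def failure_window(first_failed_probe, interval=PROBE_INTERVAL):
    """Return (earliest, latest) moments at which the failure could have begun.

    The previous probe succeeded, so the failure started after it and no
    later than the first probe that failed.
    """
    return first_failed_probe - interval, first_failed_probe


reported = parse_time("07:48:05")  # time Leaf1 reported the outage
earliest, latest = failure_window(reported)
print(earliest.time(), "to", latest.time())  # 07:48:00 to 07:48:05
```

Any event found later in the investigation that falls inside this window is a strong candidate for the trigger.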
The tracked resource is clearly shown as IP address 10.1.3.11. Within CVP, there are many ways we can start to investigate. For this exercise, I will first check the status of the routing table in one of the spine switches. This will confirm if the route still exists, if it has changed in any way, and importantly where the route points to as a potential next step in the investigation. To do this, we are going to go to the Devices view and look at the IPv4 Routing Table on Spine1.
The right hand pane of this view shows historical routing changes and would allow us to jump to the time of one of those events to investigate whether a routing change was the cause, but there were no changes on the day of the event (September 21st). Scrolling down through the current routing table, as I’ve already done in this screen shot, we can see that the route for the subnet we are investigating, 10.1.3.0/24, is learned through eBGP from two different neighbors (172.16.1.13 on Ethernet3 and 172.16.1.15 on Ethernet4).
By changing the left hand pane to show the LLDP neighbors, we can clearly see below that Ethernet3 is connected to Leaf3 and Ethernet4 is connected to Leaf4.
We now know that the subnet the failing resource is located on has been stable from a routing perspective and that the subnet is located on Leaf switches 3 and 4. Leaf3 and Leaf4 are an MLAG pair in the Layer 3 leaf-spine topology the lab network is based on.
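The two lookups we just did by hand in the UI, a longest-prefix match in the routing table followed by an LLDP lookup on the egress interfaces, can be sketched in Python as follows. This only illustrates the reasoning and is not a CVP API. The 10.1.3.0/24 entry and the LLDP neighbors come from Spine1 above, while the default route is a hypothetical filler entry added so the prefix match has something to choose between.

```python
import ipaddress

# (prefix, [(next_hop, egress_interface), ...])
ROUTES = [
    ("0.0.0.0/0", [("192.0.2.1", "Management1")]),  # hypothetical filler route
    ("10.1.3.0/24", [("172.16.1.13", "Ethernet3"),
                     ("172.16.1.15", "Ethernet4")]),  # from Spine1
]

# Egress interface -> LLDP neighbor, as seen on Spine1
LLDP_NEIGHBORS = {"Ethernet3": "Leaf3", "Ethernet4": "Leaf4"}


def longest_prefix_match(address, routes):
    """Return (network, next_hops) for the most specific matching route."""
    addr = ipaddress.ip_address(address)
    best = None
    for prefix, next_hops in routes:
        network = ipaddress.ip_network(prefix)
        if addr in network and (best is None or network.prefixlen > best[0].prefixlen):
            best = (network, next_hops)
    return best


def downstream_devices(address, routes, lldp):
    """Name the devices behind each next hop of the best route."""
    match = longest_prefix_match(address, routes)
    if match is None:
        return []
    _, next_hops = match
    return [lldp.get(interface, "unknown") for _, interface in next_hops]


print(downstream_devices("10.1.3.11", ROUTES, LLDP_NEIGHBORS))  # ['Leaf3', 'Leaf4']
```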
We can now change the device we are looking at to Leaf3 and we will start by looking at the ARP table.
Clearly shown in the ARP table is the address of the resource we are interested in (10.1.3.11), its MAC address (50:01:00:0b:00:01), and the interface the ARP entry was learned on (Vlan105). If the failure had happened far enough in the past that the current ARP table no longer contained this information, we could simply grab the time slider at the bottom of the screen and move back to a time nearer the failure. Since CVP keeps a very long history of the telemetry, we are able to retrieve state from the past that is no longer current.
The ARP table shows the layer 3 interface on which the ARP was received, which is a VLAN in this instance. To continue the investigation, we can simply click on MAC Address Table in the left hand pane.
This screen shot shows us some very interesting information. We can see that the MAC address we are interested in (50:01:00:0b:00:01) is on Ethernet4, which is where we would look if we wanted to check error counters, but we already have some clues on this screen. It clearly shows that the MAC address has moved once and that the move happened 2 seconds before the resource failure time we noted previously. Comparing this screen shot with the previous one shows that the ARP entry was learned on VLAN 105 while the MAC address is currently on VLAN 100. Clicking on the hyperlink ‘Compare data snapshots’ neatly ties all this information together.
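The comparison between the two tables is mechanical enough to express in a few lines. The following is a minimal Python sketch of that cross-check, using the values from the Leaf3 screen shots. The record types and function names are hypothetical and are not how CVP stores or exposes this data.

```python
from dataclasses import dataclass


@dataclass
class ArpEntry:
    ip: str
    mac: str
    interface: str  # layer 3 interface, e.g. "Vlan105"


@dataclass
class MacEntry:
    mac: str
    vlan: int
    interface: str  # layer 2 port, e.g. "Ethernet4"
    moves: int


def svi_vlan(interface):
    """Return the VLAN ID for an SVI name like 'Vlan105', else None."""
    suffix = interface[4:]
    if interface.lower().startswith("vlan") and suffix.isdigit():
        return int(suffix)
    return None


def vlan_mismatches(arp_table, mac_table):
    """List (ip, arp_vlan, mac_vlan, port) where the two tables disagree."""
    by_mac = {entry.mac.lower(): entry for entry in mac_table}
    mismatches = []
    for arp in arp_table:
        mac = by_mac.get(arp.mac.lower())
        vlan = svi_vlan(arp.interface)
        if mac is not None and vlan is not None and mac.vlan != vlan:
            mismatches.append((arp.ip, vlan, mac.vlan, mac.interface))
    return mismatches


# Values read from the Leaf3 ARP and MAC address tables
arp_table = [ArpEntry("10.1.3.11", "50:01:00:0b:00:01", "Vlan105")]
mac_table = [MacEntry("50:01:00:0b:00:01", 100, "Ethernet4", 1)]

print(vlan_mismatches(arp_table, mac_table))
# [('10.1.3.11', 105, 100, 'Ethernet4')]
```

A host whose ARP entry lives on one VLAN while its MAC address is learned on another is exactly the inconsistency the ‘Compare data snapshots’ view surfaces.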
There is now confirmation that the tracked resource has been moved to an incorrect VLAN. Switches don’t arbitrarily move devices to different VLANs, so the investigation is not yet complete. While we are still looking at device-level telemetry for Leaf3, we can also check the system log messages.
Scrolling through the logs, we can see that a configuration change was made from CVP (192.168.1.254) by user ‘server_admin’.
If we look at the CVP tasks screen, we can see that task 447 was run by user ‘server_admin’ just before the failure was reported.
Selecting that task and checking the configuration change that was made, we can see that the VLAN was incorrectly changed via Ansible.
The incorrect configuration change was most likely a user error, as Ansible itself did exactly what it was told to do. To restore the resource, the same user can use Ansible to restore the configuration, and this same information will likely help them understand why the resource they wanted to provision in VLAN 100 is not yet operational. A simple run of the playbook with the correct values will restore the system.
Checking the CVP Telemetry Cloud Tracer view again, we can verify if the environment is restored.
In this example, we did some basic troubleshooting on an ongoing connectivity issue with a specific resource. It is worth noting that CVP keeps a history of these events, so we could perform this same investigative process even if the issue had resolved itself and had happened yesterday or last week.