- Working of MetaWatch
- Troubleshooting MetaWatch

Working of MetaWatch
MetaWatch is an application (app) that runs on Arista 7130 devices to perform highly accurate timestamping and aggregation across a large number of ports in a single device. The aggregated output is suited to feeding an analytics application server or capture backend. The layer 1 functionality in Arista 7130 devices enables dynamic configuration of low-latency bi-directional data paths with 5 ns deterministic latency. Arista 7130 devices allow this data to be "tapped" with no impact on the pass-through latency and passed to the MetaWatch app for high-resolution timestamping, aggregation and buffering.
In the above diagram, Server A and Server B send and receive packets to/from the switch connected to the 7130 device. The packets are tapped in the layer 1 pass-through path and sent to the MetaWatch app, which adds the timestamp trailers and eventually sends the traffic out of the aggregation app port of the MetaWatch core. This aggregation app port is connected to the et32 panel port through the crosspoint, and the traffic finally reaches the analytics/capture system.
Timestamps are generally referenced to the start of the Ethernet frame. For MetaWatch, Arista defines this point in time as the moment when the middle of the first bit of the incoming Ethernet frame, immediately following the start frame delimiter, leaves the receiving 10GbE SFP+ transceiver.
Troubleshooting MetaWatch

If you see issues with the accuracy of MetaWatch-applied timestamps, consider the following.
MetaWatch timestamps are based on a time source such as NTP, PTP, or PPS, whichever is used in your network.
When the time source is set to 'system', MetaWatch follows the system clock: if the system is synced by NTP, the NTP feeds are used as the time source; if the system time is synced using PTP, the PTP feeds are used to discipline the timestamp counters accordingly. When using the 'system' time source, running both NTP and PTP at the same time is not advisable; only one of them should be used.
When NTP is used for synchronization, checking NTP health is the key to troubleshooting timestamp accuracy issues. Below is an example of a good NTP sync:
```
(config)#show ntp status
synchronised to NTP server (10.210.8.1) at stratum 3
   time correct to within 11 ms
   polling server every 32 s

(config)#show ntp association
     remote           refid      st t  when poll reach   delay   offset  jitter
===============================================================================
*10.210.8.1      10.217.1.230     2 u    30   32   377   0.944    0.126   0.027
+10.210.8.2      10.217.1.230     2 u    11   32   377   0.966   -0.155   0.014
```
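The association table above can also be sanity-checked programmatically. The following is a minimal sketch (not an Arista tool) that parses `show ntp association`-style output and checks the selected peer; the offset threshold is an illustrative assumption, not an Arista recommendation.

```python
# Minimal sketch: check the selected ("*") peer in "show ntp association" output.
# The thresholds below are illustrative assumptions.
OFFSET_LIMIT_MS = 10.0   # assumed acceptable absolute offset
REACH_OK = "377"         # octal 377 = all of the last 8 polls answered

def check_ntp(output: str) -> str:
    for line in output.splitlines():
        if line.startswith("*"):  # "*" marks the peer NTP has selected for sync
            fields = line[1:].split()
            remote, reach = fields[0], fields[6]
            offset_ms = float(fields[8])
            if reach != REACH_OK:
                return f"WARN: peer {remote} reach {reach} != 377"
            if abs(offset_ms) > OFFSET_LIMIT_MS:
                return f"WARN: peer {remote} offset {offset_ms} ms too large"
            return f"OK: synced to {remote}, offset {offset_ms} ms"
    return "WARN: no selected (*) peer found"

sample = "*10.210.8.1      10.217.1.230     2 u   30   32  377    0.944    0.126   0.027"
print(check_ntp(sample))   # OK: synced to 10.210.8.1, offset 0.126 ms
```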
If NTP looks healthy, the next step is to check whether the MetaWatch daemon is applying the time sync correctly based on its samples. Below is an example of a good sync:
```
(config)#show metawatch status
  Sync Error (ns)    Samples        Min        Max    Average   Std Deviation
-----------------    -------  ---------  ---------  ---------  --------------
Now                        1       -2.5
Last 1 Minute             60       -6.5        6.5     0.6333          2.8905
Last 1 Hour             3600      -90.5       36.5     0.4981         10.8284
```
Below is an example of a bad sync. If you see large values in the min/max/average/deviation columns, or a reduced number of samples for a minute (here only 6 instead of the expected ~60), verify that the configured time source is actually synced as expected. For example, checking the accuracy of the NTP, PTP, or PPS sync, whichever applies to your configuration, will help in understanding the issue.
```
(config)#show metawatch status
  Sync Error (ns)    Samples            Min           Max      Average   Std Deviation
-----------------    -------  -------------  ------------  -----------  --------------
Now                        1           -1.5
Last 1 Minute              6           -2.5           3.5       0.1667          2.0656
Last 1 Hour             3203   -167340044.5    71519699.5    2824.2540    5679850.7261
```
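The good and bad outputs above can be distinguished mechanically by looking at the per-minute sample count and the standard deviation. The following is a minimal sketch (not an Arista tool); the thresholds are illustrative assumptions derived from the examples in this article:

```python
def classify_sync(samples_per_min: int, std_dev_ns: float,
                  expected_samples: int = 60,    # assumes roughly 1 sample/second
                  max_std_ns: float = 100.0) -> str:
    """Flag a bad MetaWatch time sync from 'show metawatch status' figures.

    Thresholds are illustrative assumptions, not Arista specifications.
    """
    problems = []
    if samples_per_min < expected_samples * 0.9:
        problems.append(f"only {samples_per_min} samples in the last minute")
    if std_dev_ns > max_std_ns:
        problems.append(f"std deviation {std_dev_ns} ns too large")
    return ("BAD sync: " + "; ".join(problems)) if problems else "sync looks healthy"

# Good example above: 60 samples/minute, 2.8905 ns std deviation
print(classify_sync(60, 2.8905))   # sync looks healthy
# Bad example above: only 6 samples in the last minute
print(classify_sync(6, 2.0656))
```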
If PTP is being used to sync the clock, check the following sync details to confirm its correctness. Below is an example of a good PTP sync with its PTP master. Also ensure that the PTP sync is not flapping frequently by checking the syslogs.
```
7130-L-Series(config)#show ptp status
PTP: Running
Configuration:
  Clock Identity: 7c534a.fffe.0abe02
  Domain: 0 (as default)
  Interface: ma2
  Transport: ipv4 (as default)
Current clock status:
  Master present: true
  Current state: SLAVE
  Steps removed from master: 3
  Offset from master: 11.0 ns
  Offset of master: 11 ns
  Mean path delay: 826.0 ns
DEFAULT_DATA_SET:
  twoStepFlag: 1
  slaveOnly: 1
  numberPorts: 1
  priority1: 255
  clockClass: 255
  clockAccuracy: 0xfe
  offsetScaledLogVariance: 0xffff
  priority2: 128
  clockIdentity: 7c534a.fffe.0abe02
```
When using 'timesource pps' mode, MetaWatch synchronizes to the top of the second indicated by the rising edge of the PPS signal and uses the system clock to determine which second that rising edge indicated. The system clock of the device needs to be synchronized to within +/- 250 ms; this can be achieved using either PTP or NTP.
Check the correctness of the PPS signal:
```
7130-L-Series(config-app-metawatch)#show timesource
Configured Timesource: pps

7130-L-Series(config)#show sync source
PPS Source: front-panel
```
In the "show sync status" output, check the measure count. It should be close to 125,000,000 for a correct one-second PPS interval. Make sure the values match the good sync example shown below:
```
7130-L-Series(config)#show sync status
PPS Source: front-panel
PPS Signal: Up (Measure count 124999936.0)
PPS has been stable since last status
Configured cable delay: 0ns

7130-L-Series#debug show pps 10
Looking for PPS pulses, times are approximate (to 0.1s)
0: 0.0000 Pulse...
1: 0.9418 Pulse...
2: 1.9432 Pulse...
3: 2.9404 Pulse...
4: 3.9394 Pulse...
```
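The measure count is the number of reference-clock cycles counted between PPS rising edges, so with an (assumed) 125 MHz reference it should land close to 125,000,000 when the interval is one second. A minimal sketch of the check; the 100 ppm acceptance window is an illustrative assumption:

```python
NOMINAL = 125_000_000   # assumed 125 MHz reference clock cycles per second
TOLERANCE_PPM = 100     # illustrative acceptance window, not an Arista spec

def pps_ok(measure_count: float) -> bool:
    """True when the measured PPS interval is within TOLERANCE_PPM of nominal."""
    error_ppm = abs(measure_count - NOMINAL) / NOMINAL * 1e6
    return error_ppm <= TOLERANCE_PPM

print(pps_ok(124_999_936.0))   # value from the output above -> True
print(pps_ok(120_000_000.0))   # grossly short interval -> False
```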
- If you execute 'show metawatch counters' or 'show metawatch counters debug' and see the receive drop counters incrementing, it might be due to buffer drops. In that case, confirm whether any form of rate limiting or flow control is configured under the MetaWatch app configuration mode.
We can also check the current SDRAM usage in MetaWatch using the following example. MetaWatch provides deep buffers with each core (8 GB per core in this example) to absorb bursts and avoid drops due to oversubscription:
```
(config-app-metawatch)#show sdram status
Description                Core  Status
-------------------------  ----  ---------------
Buffer Used                0     0.000%
Buffer Size                0     8.000GB
Single ECC Error Count     0     0
Multiple ECC Error Count   0     0
Buffer Used                1     0.000%
Buffer Size                1     8.000GB
Single ECC Error Count     1     0
Multiple ECC Error Count   1     0
Buffer Used                2     0.000%
Buffer Size                2     8.000GB
Single ECC Error Count     2     0
Multiple ECC Error Count   2     0
Buffer Used                3     0.000%
Buffer Size                3     8.000GB
Single ECC Error Count     3     0
Multiple ECC Error Count   3     0
```
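Sustained oversubscription eventually shows up as "Buffer Used" climbing toward 100% before drops begin. The following is a minimal sketch (not an Arista tool) that flags cores above a warning threshold; the 80% threshold is an illustrative assumption, and the parsing assumes the column layout shown above:

```python
WARN_PCT = 80.0   # assumed warning threshold, not an Arista specification

def busy_cores(output: str) -> list:
    """Return (core, used_pct) pairs for cores whose buffer usage exceeds WARN_PCT."""
    hot = []
    for line in output.splitlines():
        fields = line.split()
        # expects rows like: "Buffer Used                1     92.300%"
        if fields[:2] == ["Buffer", "Used"]:
            core, used = int(fields[2]), float(fields[3].rstrip("%"))
            if used > WARN_PCT:
                hot.append((core, used))
    return hot

sample = "Buffer Used  0  0.000%\nBuffer Used  1  92.300%"
print(busy_cores(sample))   # [(1, 92.3)]
```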
- If you see buffer overflows and rate limiting is configured, consider tuning the rate limiting to avoid the drops in MetaWatch.
- If you see buffer overflows and flow control is configured, check whether we are receiving pause frames from the analytics server. In the example below, if line_mac_rx_pause_frames increments, the server may be sending pause frames towards the MetaWatch core. This can cause MetaWatch to use more buffer, since it stops sending frames for the pause quanta specified in the pause frames.
```
(config-app-metawatch)#show interface et32 counters verbose | grep -i pause
Collecting all statistics for port et32
    "host_mac_rx_pause_frames" : 0,
    "host_mac_tx_inserted_pause_quanta" : 0,
    "host_mac_tx_pause_detected_frames" : 0,
    "host_mac_tx_pause_err_frames" : 0,
    "host_mac_tx_pause_frames" : 0,
    "host_mac_tx_total_pause_quanta" : 0,
    "line_mac_rx_pause_frames" : 0,
    "line_mac_tx_inserted_pause_quanta" : 0,
    "line_mac_tx_pause_detected_frames" : 0,
    "line_mac_tx_pause_err_frames" : 0,
    "line_mac_tx_pause_frames" : 0,
    "line_mac_tx_total_pause_quanta" : 0,
```
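Whether line_mac_rx_pause_frames is actually incrementing is easiest to confirm by diffing two snapshots of the output taken a few seconds apart. A minimal sketch; the parsing assumes the `"name" : value,` layout shown above:

```python
import re

def parse_counters(output: str) -> dict:
    """Extract '"name" : value,' pairs from the counters output above."""
    return {m.group(1): int(m.group(2))
            for m in re.finditer(r'"(\w+)"\s*:\s*(\d+)', output)}

def incrementing(before: str, after: str) -> dict:
    """Counters that grew between two snapshots, with their deltas."""
    b, a = parse_counters(before), parse_counters(after)
    return {name: a[name] - b.get(name, 0)
            for name in a if a[name] > b.get(name, 0)}

# Hypothetical snapshots: pause frames arrived between the two captures.
snap1 = '"line_mac_rx_pause_frames" : 0,'
snap2 = '"line_mac_rx_pause_frames" : 1742,'
print(incrementing(snap1, snap2))   # {'line_mac_rx_pause_frames': 1742}
```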
If this is the case, check the analytics server to confirm whether it can handle the data rate sent from MetaWatch: pause frames are sent when the server cannot keep up with the receive rate and is seeing drops on its RX buffer.
Scenario: some packets are missing sporadically, and we suspect that MetaWatch is dropping those frames.
We can always capture the packets being sent out of the output aggregation port of a MetaWatch core. For example, say ap57 is the output aggregation port of the core where we suspect packets are being dropped. With the configuration example below, we can feed the traffic sent out of ap57 into the ma5 port of the 7130 device (note: ma5 is used here because it is a 10G port), and then use a regular tcpdump to check whether those packets are still missing.
```
interface ma5
   no shutdown
   no ip address
   source ap57
```
To capture the packets, we can leverage the bash tcpdump utility as in the following example:
```
bash tcpdump -nevvvi ma5
```
To store the pcap to a file:
```
bash tcpdump -nevvvi ma5 -w /mnt/flash/metawatch.pcap
```
Note: packet captures themselves can also be lossy. If some packets are missing from the capture, that does not necessarily mean MetaWatch is not sending those frames out of the app port. For a lossless capture of all packets, use the tap feature of the 7130 to send the traffic to a capture card (external server), where all the traffic can be captured without loss.
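Once you have a capture, the MetaWatch timestamp trailer can be decoded from each frame. The sketch below assumes the commonly documented Metamako-style 12-byte trailer (seconds, nanoseconds, flags, device ID, port, appended before the FCS); verify the exact layout against your MetaWatch documentation before relying on it, and note that the sample frame here is entirely fabricated for illustration.

```python
import struct

def decode_trailer(frame: bytes) -> dict:
    """Decode an assumed 12-byte MetaWatch/Metamako timestamp trailer.

    Assumed layout (verify against your MetaWatch documentation): the last
    12 bytes before the FCS are seconds (4 bytes, big-endian), nanoseconds
    (4 bytes, big-endian), flags (1 byte), device ID (2 bytes), port (1 byte).
    `frame` here is the frame with the FCS already stripped.
    """
    secs, nanos, flags, dev, port = struct.unpack("!IIBHB", frame[-12:])
    return {"seconds": secs, "nanoseconds": nanos,
            "flags": flags, "device": dev, "port": port}

# Hypothetical frame: 60 bytes of payload plus a fabricated trailer.
trailer = struct.pack("!IIBHB", 1700000000, 123456789, 0x01, 42, 7)
frame = b"\x00" * 60 + trailer
print(decode_trailer(frame))
```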