Troubleshooting Metawatch

 
 

How Metawatch Works

Metawatch is an application (app) that runs on Arista 7130 devices to perform highly accurate timestamping and aggregation across a large number of ports in a single device. The aggregated output is suited for feeding an analytics application server or backend. The layer 1 functionality of Arista 7130 devices allows low-latency bi-directional data paths to be configured dynamically with 5ns deterministic latency. This data can be “tapped” with no impact on the pass-through latency and passed to the Metawatch app for high-resolution timestamping, aggregation and buffering.

 

 

In the above diagram, Server A and Server B send and receive packets to and from the switch connected to the 7130 device. The packets are tapped in the L1 passthrough path and sent to the Metawatch app, which adds the timestamp trailers and sends them out of the aggregation app port of the Metawatch core. This aggregation app port is connected through the crosspoint to the et32 panel port, from where the traffic reaches the analytics / capture system.

Timestamps are generally referenced to the start of the Ethernet frame. For Metawatch, Arista defines this point in time as the moment when the middle of the first bit of the incoming Ethernet frame, immediately following the start frame delimiter, leaves the receiving 10GbE SFP+ transceiver.

 

Troubleshooting Metawatch 

 

Issues related to the time source and its sync

 

If you see issues with the accuracy of the timestamps on Metawatch timestamped packets, consider the following.

 

Metawatch timestamps are based on a timesource such as NTP, PTP or PPS, whichever is used in your network.

 

Below is an example of setting the timesource to system. If the system time is synced by NTP, the NTP feed is used as the timesource; if the system time is synced by PTP, the PTP feed is used to keep the timestamp counters correct. When the timesource is system, running both NTP and PTP is not advisable; use only one of them.

 

 7130-L-Series(config-app-metawatch)#timesource system  

 

When NTP is used for the sync, checking that NTP itself is healthy is the key step in troubleshooting. Below is an example of a good NTP sync.

 

(config)#show ntp status

synchronised to NTP server (10.210.8.1) at stratum 3 
   time correct to within 11 ms
   polling server every 32 s


(config)#show ntp association

     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================

*10.210.8.1      10.217.1.230     2 u   30   32  377    0.944    0.126   0.027
+10.210.8.2      10.217.1.230     2 u   11   32  377    0.966   -0.155   0.014
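The check described above can be automated. The following is a minimal Python sketch, assuming the association output has been saved as text and has the ten-column layout shown in the example. The function names and the 10 ms offset limit are illustrative assumptions, not part of any Arista tooling. A reach value of 377 (octal) means the last eight polls were all answered.

```python
def parse_ntp_associations(text):
    """Return one dict per peer row of 'show ntp association' style output."""
    peers = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 10:
            continue  # blank lines and the ===== separator
        try:
            reach = int(fields[6], 8)  # reach is printed in octal
            delay, offset, jitter = (float(f) for f in fields[7:10])
        except ValueError:
            continue  # the column header row
        first = fields[0]
        # A leading tally character ('*' selected peer, '+' candidate)
        # is glued to the address.
        tally = first[0] if first[0] in "*+-#" else ""
        peers.append({
            "tally": tally,
            "remote": first[len(tally):],
            "reach": reach,
            "delay_ms": delay,
            "offset_ms": offset,
            "jitter_ms": jitter,
        })
    return peers


def ntp_looks_healthy(peers, max_offset_ms=10.0):
    """True if a selected peer ('*') is fully reachable and within the
    offset limit. The 10 ms default is an assumed threshold."""
    return any(
        p["tally"] == "*"
        and p["reach"] == 0o377
        and abs(p["offset_ms"]) <= max_offset_ms
        for p in peers
    )


SAMPLE = """\
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
*10.210.8.1      10.217.1.230     2 u   30   32  377    0.944    0.126   0.027
+10.210.8.2      10.217.1.230     2 u   11   32  377    0.966   -0.155   0.014
"""

peers = parse_ntp_associations(SAMPLE)
print(peers[0]["remote"], ntp_looks_healthy(peers))
```

A peer that is selected but shows a reach below 377 has missed recent polls, which is worth investigating before looking at Metawatch itself.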

 

If NTP looks good, check whether the Metawatch daemon is updating its time sync with the expected number of samples. Below is an example of a good sync.

 

(config)#show metawatch status

Sync Error (ns)   Samples       Min       Max   Average Std Deviation
----------------- ------- --------- --------- --------- -------------
Now                     1                          -2.5              
Last 1 Minute          60      -6.5       6.5    0.6333        2.8905
Last 1 Hour          3600     -90.5      36.5    0.4981       10.8284

 

Below is an example of a bad sync. If you see large values in the Min / Max / Average / Std Deviation columns, or a reduced number of samples (the good example shows 60 samples for the last minute, the example below only 6), verify that the configured timesource is synced as expected. For example, check the accuracy of the NTP, PTP or PPS sync, whichever applies to the configuration.

 

(config)#show metawatch status

Sync Error (ns)   Samples       Min       Max   Average Std Deviation
----------------- ------- --------- --------- --------- -------------
Now                     1                          -1.5
Last 1 Minute           6      -2.5       3.5    0.1667        2.0656
Last 1 Hour          3203 -167340044.5 71519699.5 2824.2540  5679850.7261
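The comparison between the good and bad outputs can be sketched in Python as follows. This is only an illustration: the expected sample counts (60 per minute, 3600 per hour) are inferred from the good example, while the 1000 ns limit and the 90% sample fraction are assumed thresholds that should be tuned to the timesource in use.

```python
import re

ROW = re.compile(r"^(Last 1 Minute|Last 1 Hour)\s+(.+)$")
# Sample counts seen in the good example; treated here as the expected values.
EXPECTED_SAMPLES = {"Last 1 Minute": 60, "Last 1 Hour": 3600}


def parse_sync_status(text):
    """Parse the windowed rows of the sync error table (the 'Now' row is skipped)."""
    rows = {}
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if not match:
            continue
        values = [float(v) for v in match.group(2).split()]
        if len(values) != 5:
            continue
        samples, low, high, average, std = values
        rows[match.group(1)] = {
            "samples": int(samples), "min": low, "max": high,
            "average": average, "std": std,
        }
    return rows


def suspect_windows(rows, limit_ns=1000.0, min_fraction=0.9):
    """Return the windows with too few samples or an excessive sync error.
    Both thresholds are assumptions, not documented limits."""
    flagged = []
    for name, row in rows.items():
        too_few = row["samples"] < min_fraction * EXPECTED_SAMPLES[name]
        too_large = max(abs(row["min"]), abs(row["max"])) > limit_ns
        if too_few or too_large:
            flagged.append(name)
    return flagged


GOOD = """\
Now                     1                          -2.5
Last 1 Minute          60      -6.5       6.5    0.6333        2.8905
Last 1 Hour          3600     -90.5      36.5    0.4981       10.8284
"""
BAD = """\
Now                     1                          -1.5
Last 1 Minute           6      -2.5       3.5    0.1667        2.0656
Last 1 Hour          3203 -167340044.5 71519699.5 2824.2540  5679850.7261
"""

print(suspect_windows(parse_sync_status(GOOD)))
print(suspect_windows(parse_sync_status(BAD)))
```

In the bad example the last-minute window is flagged for its sample count alone, even though its Min and Max look small, which is why both conditions are checked.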

 

If PTP is used to sync the clock, check the following details to confirm that the sync is correct. Below is an example of a good PTP sync with its PTP master. Also check the syslogs to make sure the PTP sync is not flapping frequently.

 

7130-L-Series(config)#show ptp status

PTP: Running

Configuration:

    Clock Identity: 7c534a.fffe.0abe02
    Domain: 0 (as default)
    Interface: ma2
    Transport: ipv4 (as default)

Current clock status:
    Master present: true
    Current state: SLAVE
    Steps removed from master: 3
    Offset from master: 11.0 ns
    Offset of master: 11 ns
    Mean path delay: 826.0 ns

DEFAULT_DATA_SET:
    twoStepFlag: 1
    slaveOnly: 1
    numberPorts: 1
    priority1: 255
    clockClass: 255
    clockAccuracy: 0xfe
    offsetScaledLogVariance: 0xffff
    priority2: 128
    clockIdentity: 7c534a.fffe.0abe02
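The fields that matter most in this output are the clock state, the master presence and the offset from master. Below is a minimal Python sketch that extracts them from saved output; the helper names and the 1000 ns offset limit are assumptions chosen for illustration only.

```python
def parse_ptp_status(text):
    """Collect 'key: value' lines from 'show ptp status' style output."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.strip().partition(":")
        if sep and value.strip():  # skip section titles such as 'Configuration:'
            fields[key.strip()] = value.strip()
    return fields


def ptp_looks_locked(fields, max_offset_ns=1000.0):
    """True if the clock is a slave with a master present and a small offset.
    The offset limit is an assumed threshold."""
    try:
        offset_ns = float(fields["Offset from master"].split()[0])
    except (KeyError, IndexError, ValueError):
        return False
    return (
        fields.get("Current state") == "SLAVE"
        and fields.get("Master present") == "true"
        and abs(offset_ns) <= max_offset_ns
    )


SAMPLE = """\
PTP: Running

Current clock status:
    Master present: true
    Current state: SLAVE
    Steps removed from master: 3
    Offset from master: 11.0 ns
    Mean path delay: 826.0 ns
"""

fields = parse_ptp_status(SAMPLE)
print(fields["Current state"], ptp_looks_locked(fields))
```

Running the same check at intervals and comparing the results is one way to spot a sync that flaps between states, in addition to reviewing the syslogs.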

 

When using ‘timesource pps‘ mode, Metawatch synchronizes to the top of the second indicated by the rising edge of the PPS signal and uses the system clock to determine which second that rising edge indicates. For this to work, the system clock of the device needs to be synchronized to within +/- 250ms, which can be achieved using either PTP or NTP.

 

Checking the correctness of the PPS

 

7130-L-Series(config-app-metawatch)#show timesource

Configured Timesource: pps


7130-L-Series(config)#show sync source

PPS Source: front-panel


In the “show sync status” output, check the measure count; it should be around 125,000,000. Make sure the values match the good sync example shown below.

 

7130-L-Series(config)#show sync status 

PPS Source: front-panel
PPS Signal: Up
(Measure count 124999936.0)
PPS has been stable since last status
Configured cable delay: 0ns

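To judge how close a measure count is to the nominal 125,000,000, the deviation can be expressed in parts per million. The Python sketch below does this for the value in the example. The source only states that the count should be "around" 125,000,000, so the 100 ppm tolerance used here is a hypothetical threshold.

```python
import re

NOMINAL_COUNT = 125_000_000  # expected measure count per the text above


def measure_count(text):
    """Extract the measure count from 'show sync status' style output."""
    match = re.search(r"Measure count\s+([\d.]+)", text)
    return float(match.group(1)) if match else None


def deviation_ppm(count):
    """Deviation of the measure count from nominal, in parts per million."""
    return (count - NOMINAL_COUNT) / NOMINAL_COUNT * 1e6


def count_is_plausible(count, tolerance_ppm=100.0):
    """The tolerance is an assumption; adjust it to the environment."""
    return abs(deviation_ppm(count)) <= tolerance_ppm


SAMPLE = "PPS Signal: Up\n(Measure count 124999936.0)\n"

count = measure_count(SAMPLE)
print(count, round(deviation_ppm(count), 3), count_is_plausible(count))
```

For the example value the count is 64 below nominal, which is roughly half a part per million.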



7130-L-Series#debug show pps 10

Looking for PPS pulses, times are approximate (to 0.1s)
0: 0.0000 Pulse...
1: 0.9418 Pulse...
2: 1.9432 Pulse...
3: 2.9404 Pulse...
4: 3.9394 Pulse...

 

Issues related to packet drops, which can cause gaps in multicast or other flows being timestamped

 

  • If the receive drop counters increment in the output of show metawatch counters or show metawatch counters debug, the cause might be buffer drops. In that case, confirm whether any form of rate limiting or flow control is configured under the Metawatch app configuration mode.

 

The current SDRAM usage in Metawatch can be checked as in the following example. Metawatch has deep buffers available for each core to absorb the traffic and avoid drops due to oversubscription.

 

(config-app-metawatch)#show sdram status

Description                Core          Status
-------------------------- ---- ---------------
Buffer Used                   0          0.000%
Buffer Size                   0         8.000GB
Single ECC Error Count        0               0
Multiple ECC Error Count      0               0

Buffer Used                   1          0.000%
Buffer Size                   1         8.000GB
Single ECC Error Count        1               0
Multiple ECC Error Count      1               0

Buffer Used                   2          0.000%
Buffer Size                   2         8.000GB
Single ECC Error Count        2               0
Multiple ECC Error Count      2               0

Buffer Used                   3          0.000%
Buffer Size                   3         8.000GB
Single ECC Error Count        3               0
Multiple ECC Error Count      3               0
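When the buffer usage has to be watched over time, the per-core "Buffer Used" percentages can be pulled out of saved output. The Python sketch below assumes the row layout shown above; the 80% threshold and the 91.250% figure in the sample are made-up values for illustration.

```python
import re

# Matches rows such as: "Buffer Used                   0          0.000%"
USED = re.compile(r"^[ \t]*Buffer Used\s+(\d+)\s+([\d.]+)%", re.MULTILINE)


def buffer_usage(text):
    """Map core number to buffer usage in percent."""
    return {int(core): float(pct) for core, pct in USED.findall(text)}


def busy_cores(usage, threshold_pct=80.0):
    """Cores at or above the threshold (an assumed value, not a documented limit)."""
    return sorted(core for core, pct in usage.items() if pct >= threshold_pct)


# Hypothetical sample: core 1 is shown nearly full purely for illustration.
SAMPLE = """\
Buffer Used                   0          0.000%
Buffer Size                   0         8.000GB
Buffer Used                   1         91.250%
Buffer Size                   1         8.000GB
"""

usage = buffer_usage(SAMPLE)
print(usage, busy_cores(usage))
```

A core whose usage keeps climbing between samples points to sustained oversubscription or back pressure on its output rather than a short burst.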


  • If you see buffer overflows and rate limiting is configured, consider adjusting the rate limit to avoid drops in Metawatch.
  • If you see buffer overflows and flow control is configured, check whether pause frames are being received from the analytics server. In the example below, if line_mac_rx_pause_frames increments, the server is likely sending pause frames towards the Metawatch core. This makes Metawatch use more buffer, because it stops sending frames for the pause quanta specified in the pause frames.

 

(config-app-metawatch)#show interface et32 counters verbose | grep -i pause

Collecting all statistics for port et32
      "host_mac_rx_pause_frames" : 0,
      "host_mac_tx_inserted_pause_quanta" : 0,
      "host_mac_tx_pause_detected_frames" : 0,
      "host_mac_tx_pause_err_frames" : 0,
      "host_mac_tx_pause_frames" : 0,
      "host_mac_tx_total_pause_quanta" : 0,
      "line_mac_rx_pause_frames" : 0,
      "line_mac_tx_inserted_pause_quanta" : 0,
      "line_mac_tx_pause_detected_frames" : 0,
      "line_mac_tx_pause_err_frames" : 0,
      "line_mac_tx_pause_frames" : 0,
      "line_mac_tx_total_pause_quanta" : 0,

 

If this is the case, check whether the analytics server can handle the data rate sent from Metawatch, as pause frames are sent when the server cannot keep up with the receive rate and is seeing drops in its rx buffer.
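Since the counters are cumulative, what matters is whether they increment between two readings. A minimal Python sketch of that comparison is shown below, assuming two saved copies of the counter output in the format above. The helper names and the value 42 in the second snapshot are hypothetical.

```python
import re

# Matches lines such as:  "line_mac_rx_pause_frames" : 0,
COUNTER = re.compile(r'"(\w+)"\s*:\s*(\d+)')


def parse_counters(text):
    """Map counter name to value for every quoted counter in the text."""
    return {name: int(value) for name, value in COUNTER.findall(text)}


def incremented(before, after, keyword="pause"):
    """Return the increase of each matching counter that grew between snapshots."""
    return {
        name: after[name] - before.get(name, 0)
        for name in after
        if keyword in name and after[name] > before.get(name, 0)
    }


FIRST = '''
      "line_mac_rx_pause_frames" : 0,
      "line_mac_tx_pause_frames" : 0,
'''
# Hypothetical second reading taken some time later.
SECOND = '''
      "line_mac_rx_pause_frames" : 42,
      "line_mac_tx_pause_frames" : 0,
'''

print(incremented(parse_counters(FIRST), parse_counters(SECOND)))
```

An increase in line_mac_rx_pause_frames between readings supports the pause-frame explanation; no change suggests looking at rate limiting or oversubscription instead.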

 

Some packets are missing sporadically and it is unclear whether Metawatch is dropping those frames

 

The packets sent out of the output aggregation port of the Metawatch core can always be captured. For example, say ap57 is the output aggregation port of a core where packets are suspected of being dropped. With the config example below, the traffic sent out of ap57 is fed to the ma5 port of the 7130 device (note: ma5 is used here because it is a 10G port), and regular tcpdump can then be used to capture the packets and check whether they are still missing.

 

interface ma5
    no shutdown
    no ip address
    source ap57

 

To do the packet capture, use the tcpdump utility from bash, as in this example:

bash tcpdump -nevvvi ma5 

Or, to store the capture in a pcap file:

bash tcpdump -nevvvi ma5 -w /mnt/flash/metawatch.pcap

 

Note: Packet captures taken this way can also be lossy. If some packets are missing from the capture, that does not necessarily mean that Metawatch is not sending those frames out of the app port. For a lossless capture, use the tap feature of the 7130 to send the traffic to a capture card in an external server.

 

 

 
