Understanding Deduplication in Tap Aggregation (NPB)

 

1) What is deduplication?

Deduplication in the context of packet broker networks (Tap Aggregation) is the ability to detect duplicates of a packet, forwarding only the first copy and dropping subsequent copies of the same packet.

 

2) Hardware impacts the Deduplication performance

Deduplication, like many features, requires certain hardware characteristics to be supported by the silicon (network processor), which is the foundation of hardware packet processing and forwarding in networking/Ethernet equipment. The silicon allows matching packets, manipulating them, and making forwarding decisions in hardware.

 

2.1) Processing performance

The Arista switches are based on high-performance network processors of different families (Alta on 7150S, Arad on 7280SE/7500E, etc). Each network processor is a single high-performance chip that can parse, manipulate, forward, timestamp, and filter:

  • at Layer2/3/4 in hardware
  • at line rate for any production traffic 

On Arista devices, all the Tap Aggregation features operate in hardware, without bottleneck. However, understanding the internal architecture, and in particular how internal components are interconnected, is important to assess whether bandwidth limitations exist in your other equipment.

Particularly important in that internal architecture are the extra processors (ASIC or FPGA) that some vendors use for specific features that common Ethernet network processors cannot provide, for example encryption/decryption or deduplication. To achieve any significant rate, those functions cannot be processed by the CPU alone (in software), so dedicated processors are typically used. Like any highly specialised silicon, these have finite performance and a price/performance ratio far from ideal.

 

Why would specialised silicon have worse performance than merchant silicon? Like any silicon, an ASIC (application-specific integrated circuit) or FPGA has a limited number of transistors/gates, limited cache, and other constraints on space, power, cooling, and design generation (14nm vs 22nm, etc). The high flexibility of FPGAs and specialised silicon designs comes at the cost of density and performance.

For example, a typical Deduplication FPGA would have about 80 Gbps capacity.

In comparison, a modern network processor (merchant silicon) can forward 6400 Gbps on a single chip.

The specialised processors might therefore represent a severe performance limitation compared to the amount of data transiting networks. Only a small portion of traffic could be processed: most of your traffic could not be deduplicated, or deduplication would have to be limited to a part of your interesting traffic (after initial filtering).

This section described how understanding the internal architecture is important to assess the performance of specialised features such as Deduplication.

 

2.2) Hardware tables 

Beyond the throughput element, silicon has specific functional areas, TCAM or other hardware tables, that are responsible for storing matching, alteration, and forwarding rules. The bigger the TCAM, the more rules the silicon can store for hardware processing.

The silicon die is limited in space, for design/manufacturing/power/cooling reasons. This is an electronic limitation, and industrial research makes advancements every year. In particular, the main merchant silicon vendors such as Broadcom, Intel, etc. progress at a much faster pace than any individual vendor's custom-built ASICs, following the same economies of scale as in CPU/GPU manufacturing.

Despite that progress, die space remains limited, and the most modern silicon can hold approximately a few thousand entries in TCAM inside a single chip.

 

2.3) Why does hardware table size matter?

 

Deduplication involves storing an initial frame, and matching every single subsequent packet against that initial entry. A link at 10Gbps carries in the order of 15 million packets per second each way at minimum frame size.

Hardware can therefore store only a few milliseconds' worth of packets (in the order of tens of thousands of entries). Once the hardware tables are full, no further hardware lookup can be performed.

At the scale of a Tap Aggregator with Terabit/s performance, the storage/lookup duration would be in the order of 10-100 microseconds.

There is no hardware capable of retaining entries for long periods at the traffic volumes seen in modern environments.
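
The orders of magnitude above can be checked with a short back-of-envelope calculation. The following Python sketch is illustrative only: the table size of 50,000 entries is an assumed figure standing in for "tens of thousands", and the model assumes one table entry per packet at minimum frame size.

```python
# Back-of-envelope estimate of how long a fixed-size hardware table can
# remember packets before it fills up. All inputs are illustrative assumptions.

ETHERNET_OVERHEAD_BYTES = 20  # preamble (7) + start-of-frame (1) + inter-frame gap (12)


def packets_per_second(link_bps: float, frame_bytes: int = 64) -> float:
    """Maximum frame rate of a link, in one direction, for a given frame size."""
    bits_on_wire = (frame_bytes + ETHERNET_OVERHEAD_BYTES) * 8
    return link_bps / bits_on_wire


def retention_seconds(table_entries: int, link_bps: float, frame_bytes: int = 64) -> float:
    """Time until a table holding one entry per packet is full."""
    return table_entries / packets_per_second(link_bps, frame_bytes)


if __name__ == "__main__":
    assumed_entries = 50_000  # assumption: "tens of thousands" of entries
    print(f"10 Gbps, 64-byte frames: {packets_per_second(10e9) / 1e6:.2f} Mpps")
    for label, bps in (("10 Gbps", 10e9), ("1 Tbps", 1e12)):
        window = retention_seconds(assumed_entries, bps)
        print(f"{label}: {assumed_entries} entries fill in {window * 1e6:.1f} microseconds")
```

With these assumptions the table fills in a few milliseconds at 10 Gbps and in tens of microseconds at 1 Tbps. Larger average frame sizes lengthen the window proportionally.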

 

2.4) Conclusion on the impact of hardware on deduplication

Hardware silicon can provide high-performance forwarding. Additional specialised processors can provide advanced features such as encryption/decryption or deduplication, at the cost of seriously limiting performance (due to inevitable internal architecture bottlenecks). This might be adequate for specific use cases, with careful deployment considerations.


3) You might need duplicates

The main purpose of packet broker networks is to deliver full visibility of the traffic you have selected to monitor.

Many customers strive to know more about where the traffic is coming from, and going to, or being able to trace the path of packets in large complex topologies.

If only partial monitoring (part of the packets) were available, then the full visibility and traceability could not be achieved.

Operators can benefit from receiving the same packet multiple times. It allows them to see exactly how that packet flowed through the network, for example:

  1. Packet enters at the WAN edge
  2. Firewall
  3. DMZ
  4. Front-end server
  5. etc

 

Such end-to-end visibility shows not only the traffic path but also gives instant warning about network or application troubles, by providing the elapsed transit time for each section of the path, for every single packet:

  1. Packet enters at the WAN edge time t0
  2. Firewall time t1 (t0+1µs)
  3. DMZ time t2 (t1+1µs)
  4. Front-end server time t3 (t2 + 15ms)
  5. etc
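
As a rough illustration of how a tool could use those copies, the Python sketch below computes the transit time between consecutive capture points for one packet and reports the slowest segment. The capture point names and timestamp values are made up for the example, and the function names are hypothetical.

```python
from typing import List, Tuple

# (capture point, timestamp in microseconds) for copies of the same packet.
Observation = Tuple[str, float]


def hop_latencies(observations: List[Observation]) -> List[Tuple[str, str, float]]:
    """Return (from, to, elapsed microseconds) for each consecutive pair of capture points."""
    ordered = sorted(observations, key=lambda obs: obs[1])
    return [(a[0], b[0], b[1] - a[1]) for a, b in zip(ordered, ordered[1:])]


def slowest_segment(observations: List[Observation]) -> Tuple[str, str, float]:
    """Return the segment of the path with the largest transit time."""
    return max(hop_latencies(observations), key=lambda hop: hop[2])


if __name__ == "__main__":
    # Illustrative values matching the list above: 1 us, 1 us, then 15 ms.
    copies = [("WAN edge", 0.0), ("Firewall", 1.0), ("DMZ", 2.0), ("Server", 15002.0)]
    for src, dst, elapsed in hop_latencies(copies):
        print(f"{src} -> {dst}: {elapsed:.0f} us")
    print("Slowest segment:", slowest_segment(copies))
```

This calculation is only possible when every copy is kept. If a deduplicating device had dropped all but the first copy, there would be a single observation and no segment to measure.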

 

In the following example with Corvil, the source of traffic (e.g. firewall1, router1, router2, etc) is critical to provide operators with immediately consumable information. Various pieces of information are correlated and presented to increase visibility. This critical level of information relies on seeing all the packets, especially the duplicate ones, at different points in the network.

The source origin identification feature applies a unique 802.1Q header to frames from each capture point. This makes every frame unique rather than a duplicate, and easy to filter/differentiate even with simple tools. Precise timestamping improves the transit timing measurement. These features remove the drawbacks of duplicates; there is no longer “too much information”. Deduplication would, on the contrary, remove useful information from the operator.
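
To show why a per-capture-point 802.1Q tag makes otherwise identical frames distinguishable, here is a minimal Python sketch that reads the VLAN ID from a raw Ethernet frame and maps it to a capture point. The VLAN IDs (101, 102) and the capture point names are assumptions chosen for the example, not values defined by the feature.

```python
import struct
from typing import Dict, Optional

TPID_8021Q = 0x8100  # EtherType value announcing an 802.1Q tag

# Hypothetical mapping of origin VLAN IDs to capture points.
ORIGINS: Dict[int, str] = {101: "firewall1", 102: "router1"}


def vlan_id(frame: bytes) -> Optional[int]:
    """Return the 802.1Q VLAN ID of a raw Ethernet frame, or None if untagged."""
    if len(frame) < 16:
        return None
    tpid, tci = struct.unpack("!HH", frame[12:16])  # bytes after dst and src MAC
    if tpid != TPID_8021Q:
        return None
    return tci & 0x0FFF  # low 12 bits of the tag control field


def origin_of(frame: bytes) -> str:
    """Name the capture point a frame came from, based on its origin tag."""
    return ORIGINS.get(vlan_id(frame), "unknown")


def tag_frame(dst: bytes, src: bytes, vid: int, rest: bytes) -> bytes:
    """Build a tagged frame for the demo: MACs, 802.1Q tag, then the original remainder."""
    return dst + src + struct.pack("!HH", TPID_8021Q, vid) + rest


if __name__ == "__main__":
    dst, src = bytes(6), bytes.fromhex("020000000001")
    rest = struct.pack("!H", 0x0800) + b"same payload"
    for vid in (101, 102):
        print(vid, origin_of(tag_frame(dst, src, vid, rest)))
```

The two frames carry the same payload but resolve to different capture points, so a simple filter on the tag is enough to separate them.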

 

Deduplication might weaken the purpose of packet broker networks, preventing end-to-end visibility. The illustration below shows CorvilNet from Corvil benefitting from packets and timing throughout the topology to correlate data and provide intelligent reporting:

 

4) SPAN/Mirroring:

This section investigates the statement that SPAN/mirroring might create duplicates, and provides some implementation guidance.

Duplicate frames coming from a single network device’s SPAN/mirroring session are the consequence of duplicated points of capture. The simple rule to avoid receiving multiple copies is to not capture those multiple copies.

Most SPAN/mirroring deployments naturally avoid duplicates by:

  • monitoring only the uplinks
  • if you also want to monitor access ports, then monitoring only one direction (not both RX and TX)

 

Would deduplication save the effort of a considered implementation? Yes, but with compromised visibility:

Consider that implementing blanket capture (RX+TX, creating duplicates) and relying on deduplication effectively means the network packet broker decides which packet you will see, hence preventing you from knowing what the exact source is. The frame could come from one port or another. You would not know, and cannot regain that visibility.

 

Conversely, by being explicit and selective in your configuration, specifying points of capture rather than blanket capture of every port in all directions, you avoid duplicates at the source. You also know exactly where the frames are coming from: you chose.

 

Details:

Monitoring RX+TX is often the default when selecting a port as mirror/monitor source, and this is suitable for monitoring both directions of conversations passing through a port: egress and ingress (typically only on uplinks).

However, if you monitor RX+TX on every single port of a switch, to monitor local east-west traffic for example, then a transiting frame would be seen twice:

  1. Once entering the switch
  2. The second time exiting the switch.

If you do not want to see frames twice, then you can choose to capture only RX, that is, packets on ingress.

With the “RX-only” setting, one might think that visibility of egress is lost. However, if both RX and TX frames are wanted for increased visibility, then why want deduplication? Deduplication would drop one of the two frames, either RX or TX.

You would therefore have to choose whether you want to see packets both at ingress and egress, or only one of the two.

  • Deduplication drops a packet, you cannot know which one for sure
  • Best practice (for predictability) would be to explicitly specify what to capture, allowing you to guarantee the origin of what you see
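
The effect of the capture direction on the number of copies can be modelled in a few lines. The Python sketch below is a simplified model of a single switch, assuming one mirrored copy per configured direction; the port names and the function name are hypothetical.

```python
from typing import Dict, Set

# Port name -> mirrored directions ("rx" for ingress, "tx" for egress).
MirrorConfig = Dict[str, Set[str]]


def mirrored_copies(config: MirrorConfig, ingress: str, egress: str) -> int:
    """Count the copies of one transiting frame sent to the monitor destination."""
    copies = 0
    if "rx" in config.get(ingress, set()):
        copies += 1  # captured when entering the switch
    if "tx" in config.get(egress, set()):
        copies += 1  # captured again when exiting the switch
    return copies


if __name__ == "__main__":
    blanket = {"Eth1": {"rx", "tx"}, "Eth10": {"rx", "tx"}}
    rx_only = {"Eth1": {"rx"}, "Eth10": {"rx"}}
    uplink_only = {"Eth1": {"rx", "tx"}}
    for name, cfg in (("blanket", blanket), ("rx-only", rx_only), ("uplink-only", uplink_only)):
        print(name, mirrored_copies(cfg, "Eth10", "Eth1"))
```

For a frame going from Eth10 to Eth1, blanket RX+TX yields two copies, while RX-only on both ports or RX+TX on the uplink alone yields exactly one, and in both of those cases the port that produced the copy is known.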

 

The following sections investigate the configuration options of SPAN/Mirroring in more detail.

 

5) How to configure SPAN/Mirroring to avoid duplicates? 

Firstly, consider the ports and traffic directions that matter to you. Secondly, consider the duration of the configuration: permanent or temporary.

  • Temporary monitoring : can be implemented dynamically (or manually, more or less rapidly) by applying configuration on ports of interest, often for troubleshooting or brief investigation
  • Permanent monitoring : provides regulatory information, traffic inspection, security monitoring, IDS/IPS,  transactional compliance, etc.

 

5.1) Selecting the ports and direction yourself rather than getting unknown origin

Important ports (such as uplinks, WAN links) through which most or all traffic passes are good candidates for being monitored both ingress and egress (RX+TX).

More common and generic ports might still be important to monitor, in particular when traffic is east-west, going from one port to another without crossing a small set of particular links. For example, you might want to monitor traffic going between servers.

 

Most advanced tools can handle duplicate packets, and want all those packets in order to correlate events and provide you with more advanced, digested, useful information. However, operators using Wireshark or tcpdump might see an interest in avoiding those duplicates. This can be achieved by differentiating the source (Origin ID).

 

If your tool does not support duplicates or basic filtering on 802.1Q (Origin ID), and you want to avoid duplicates, then the simplest remaining option is to not create duplicates at the source. This means not configuring switches to capture both RX+TX on all ports. Choose which packet you want among the duplicates, and tune your configuration to capture only where you really want it to happen.

 

Recommendation: apply RX+TX only on one port through which your interesting traffic flows: either the uplink(s) or a third-party connection.

 

Configuration examples to avoid duplicates:

  • Eth1 (uplink): RX+TX
Arista(config)#monitor session DANZ-NoDup source eth1

or (but not together):

  • Eth10 (attached server): RX+TX
Arista(config)#monitor session DANZ-NoDup source eth10

or:

  • RX-only on both Eth1 and Eth10
Arista(config)#monitor session DANZ-NoDup source eth1 rx
Arista(config)#monitor session DANZ-NoDup source eth10 rx

 

Avoid:

  • RX+TX on both Eth1 (uplink) + Eth10 (attached server)

 

5.2) Filtering mirrored traffic

If your interesting traffic crosses several interfaces in a meshed manner (e.g. server-to-server, not server-to-uplink), or you don’t know which ones, then you might have no other choice than applying a relatively generic switch-wide configuration on most ports. However, you can ensure unique capture by filtering which traffic to mirror (per-port granularity). Note: This feature is available on 7150S only at time of writing.

Example:

In this example we know that 10.0.0.1 is behind Eth1, but we don’t know which port 192.168.0.99 is behind.

  • Eth1 : RX+TX,     capture only “TCP<any port> – 10.0.0.1 -> 192.168.0.99” with ACL
  • Eth2 to 24 : RX+TX,     capture only “TCP80 – 192.168.0.99 -> 10.0.0.1” with ACL

 

In the above example we always capture traffic nearest to its source: 10.x.x.x is behind Eth1, while 192.x.x.x is behind the other ports.

This would achieve the following:

  1. Capture only traffic you are really interested in, easing interpretation/correlation and saving link bandwidth and storage space
  2. Naturally capture no duplicates while using both Rx+Tx mirroring.

 

If you are initially blind about the location of the resources you want to monitor, and need to build a first rough picture of the traffic pattern, then you can apply a generic ACL (with both source and destination) on all ports across your estate.

Filtering limits the amount of traffic to only what you specify as interesting. For example: any ICMP between 10.0.0.1 <==> 192.168.0.99. This would be implemented by an ACL of 2 entries, one per direction.

The benefits are:

  • You are not flooded with too much information. Getting information immediately is particularly important when running against the clock, during troubleshooting scenarios
  • Without filtering you might simply not be able to capture anything network-wide because of the volume of traffic.

 

6) How to save bandwidth and storage space?

You can benefit from the extra visibility offered by packets captured at each hop, treating pseudo-duplicate packets as an actual wealth of extra information.

However, if you have concerns about how duplicates impact the storage space consumption, then consider the following:

  • Storage arrays take care of that concern for you: they natively perform Deduplication at the block level
  • Truncation (slicing) at the Tap Aggregation layer and filtering provide much more effective bandwidth and storage savings than deduplication.

The following sections provide more details on those technologies.

 

6.1) Storage deduplication

Most storage solutions have advanced deduplication functions. Storage arrays perform deduplication at the block level, and can find patterns of duplication at a much finer granularity, across an enormous sample size: in terms of petabytes, rather than just the few milliseconds of capture that a packet broker can compare. It is also possible for storage to find duplicate patterns in completely unrelated data, such as packet captures and text files.

Deduplicating at the storage itself is therefore much more effective than at any NPB/Tap Aggregator, in terms of deduplication efficiency, amount of space saved, and cost effectiveness.

Deduplicating at the packet broker network appliance would save less storage space than the storage itself can, and would sacrifice visibility for only a small saving in space or bandwidth.

What about bandwidth savings? Since storage is the recipient of packet records, the following section covers further optimization techniques that save a large amount of both storage space and link bandwidth, more efficiently than NPB deduplication:

 

6.2) Keeping full visibility of all packets while saving space with slicing

Most traffic analyzers offer you the option to store a portion of the original frame, usually the useful headers or initial part of the datagram where application-specific headers usually reside. Typically this is a matter of tuning the analyzer tools, and can greatly reduce the amount of storage required.

Deduplicating can reduce the amount of data by half (if we take an example of 2 copies of each frame), at the cost of reducing visibility by the same order of magnitude (it prevents seeing packets along the traffic path).

Truncation can reduce the amount of data to 1/100 of its original size. For example, a jumbo frame of 9000 Bytes can be reduced to just 90 Bytes.

  • With deduplication : 2000MB -> 1000MB (Loss of packets and visibility)
  • With truncation: 2000MB -> 20MB (No Loss: all packets are preserved along the traffic path, offering visibility of headers and initial interesting portion of the datagram)
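
The figures above follow from simple arithmetic, sketched below in Python. The inputs (2000 MB of capture, 2 copies per frame, 9000-byte frames sliced to 90 bytes) are the example values from this section; the function names are hypothetical, and the model assumes every frame has the same size.

```python
def deduplicated_size_mb(total_mb: float, copies_per_frame: int) -> float:
    """Size after keeping only one copy of each frame."""
    return total_mb / copies_per_frame


def truncated_size_mb(total_mb: float, frame_bytes: int, slice_bytes: int) -> float:
    """Size after slicing every frame down to at most slice_bytes."""
    kept_bytes = min(slice_bytes, frame_bytes)
    return total_mb * kept_bytes / frame_bytes


if __name__ == "__main__":
    total = 2000.0
    dedup = deduplicated_size_mb(total, copies_per_frame=2)
    sliced = truncated_size_mb(total, frame_bytes=9000, slice_bytes=90)
    print(f"Deduplication: {total:.0f} MB -> {dedup:.0f} MB")
    print(f"Truncation:    {total:.0f} MB -> {sliced:.0f} MB")
    print(f"Truncation stores {dedup / sliced:.0f} times less data than deduplication")
```

Real traffic mixes frame sizes, so the actual saving from truncation depends on the average frame size and on the slice length chosen.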

 

Truncation is a much more effective method of saving storage and link bandwidth than deduplication. It provides enormous storage space savings, even if the storage solution does not support deduplication.

Truncation can take place on the analyser, on the packet broker network (Tap Aggregator), or directly on an in-line switch doing SPAN/mirroring (with the Arista 7150S and DANZ features). Applying truncation on a SPAN/mirroring session provides large bandwidth savings directly at the source of the traffic.

 

6.3) Filtering

Filtering traffic at the monitoring/SPAN source can also reduce the amount of bandwidth on the packet broker network and storage requirements on the analyzers and data stores.

The Arista 7150S supports DANZ features to filter Layer2/Layer3/Layer4 with access lists (ACLs):

  • Per monitor (mirroring/SPAN) session
  • Per individual port in a monitor session
    • In the same monitor session, source Eth2 can have a different ACL than Eth3
  • ACLs are applied only to the mirrored traffic, not the original production traffic
  • All traffic is forwarded, mirrored, filtered in hardware, at line rate, with no added latency (350-380 ns)

 

7) How to deduplicate on the capture tool (instead of on the Tap Aggregator) 

Most tools can benefit from duplicate packets to provide you more information along the traffic path.

 

7.1) Software Analyzer

Some advanced tools can also present your packets without duplicates (deduplication in software).

For example, in Wireshark, for a real-time view, the “Follow TCP Stream” function in recent versions follows the TCP specification correctly: retransmitted packets with the same TCP sequence number show up only once.

 

In Wireshark, you can also enter the following in the filter field (for TCP only):

not tcp.analysis.duplicate_ack and not tcp.analysis.retransmission

 

Wireshark also allows post-processing of pcap files, with the -d option (“d” for deduplication) of its command line tool editcap:

editcap -d input.pcap dedup.pcap

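The resulting pcap has duplicates removed within editcap's comparison window; by default each packet is compared against the few packets immediately preceding it, and the window can be adjusted (see the manual page below).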
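
To make the idea of a comparison window concrete, here is a minimal Python sketch of window-based deduplication over raw frames. It is written in the spirit of editcap -d (comparing frame length and MD5 hash against the preceding few frames) but it is not editcap's implementation; the function name and the default window of 5 are assumptions for the example.

```python
import hashlib
from collections import deque
from typing import Deque, Iterable, Iterator, Tuple

Fingerprint = Tuple[int, bytes]  # (frame length, MD5 digest)


def deduplicate(frames: Iterable[bytes], window: int = 5) -> Iterator[bytes]:
    """Yield frames, skipping any that match one of the previous window-1 frames."""
    if window < 1:
        raise ValueError("window must be at least 1")
    recent: Deque[Fingerprint] = deque(maxlen=window - 1)
    for frame in frames:
        fingerprint = (len(frame), hashlib.md5(frame).digest())
        is_duplicate = fingerprint in recent
        recent.append(fingerprint)  # every frame, kept or not, ages the window
        if not is_duplicate:
            yield frame


if __name__ == "__main__":
    capture = [b"a", b"b", b"a", b"c", b"d", b"e", b"f", b"a"]
    print(list(deduplicate(capture)))
```

In this run the second copy of the first frame is dropped because it falls inside the window, while the last copy is kept because the original has already aged out. The same limit applies to any deduplicator with finite memory, whether in software or in hardware tables.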

More details here: http://www.wireshark.org/docs/man-pages/editcap.html

 

7.2) NIC

Tools can also deduplicate in hardware, on their NIC. Some independent NIC vendors provide effective deduplication at the NIC (for example: http://www.napatech.com/products); these are the very same hardware elements used by some Tap Aggregation vendors inside their products.

 

8) Conclusion

Deduplication can be seen as a workaround to improve the operator’s visibility on simple tools, but Wireshark provides deduplication in “follow TCP stream” mode.

However, simple considerations when implementing SPAN/mirroring can avoid this issue at the root cause, and keep you in control of which packets to capture.

Deduplication drops useful information, preventing correlation of frames coming from different capture points.

Filtering and truncating at the source of capture are much more effective than deduplication for saving bandwidth and storage, by a factor of 50 in the example above, while not discarding any frame and providing complete hop-by-hop visibility of the traffic path.