LANZ – Tuning packet buffer monitoring thresholds – Gain the visibility best suited to you


This article introduces LANZ briefly, and then concentrates on explaining how you may want to tune the threshold.

Threshold tuning allows you to have the right level of visibility for your environment.



1) LANZ Introduction

LANZ on the Arista 7150S and other platforms provides trigger-based microburst visibility. This guarantees that congestion events are captured, even the shortest ones, unlike hit-and-miss polling mechanisms. For other platform families whose hardware does not support trigger-based detection, the polling-based LANZ-lite alternative is available; it is still very useful, but simply not as accurate. Refer to the manual for the differences between LANZ implementations.


LANZ generated outputs

Both trigger and poll-based LANZ implementations rely on a customizable threshold to generate congestion event information. The information generated, depending on the platform, can be:

  • Syslog / SNMP traps
  • CLI report of every event
  • Mirroring to local switch’s CPU/Disk/PCAP/tcpdump or to destination ports
  • Streaming in Google Protocol Buffers format to be digested by free clients or Corvil/Packet2Disk/etc
  • Export with Arista’s Telemetry app to Splunk/VMware LogInsight
  • Add your own mechanism if you wish (easy EAPI extraction, CSV export, SDK, etc)
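As an illustration of the eAPI route mentioned in the last bullet, here is a minimal sketch that only builds the JSON-RPC request body for a LANZ show command, without sending it. eAPI takes `runCmds` requests on the switch's `/command-api` endpoint; the helper name `build_eapi_request` and the request id are made up for this example, and whether the command returns structured JSON or only text may depend on the platform and EOS release, which is why the format is left as a parameter.

```python
import json


def build_eapi_request(commands, output_format="text", request_id="lanz-1"):
    """Build a JSON-RPC body for Arista eAPI (to be POSTed to /command-api).

    request_id is an arbitrary, hypothetical identifier echoed back by the switch.
    """
    return {
        "jsonrpc": "2.0",
        "method": "runCmds",
        "params": {
            "version": 1,
            "cmds": list(commands),
            "format": output_format,  # "text" or "json"
        },
        "id": request_id,
    }


payload = build_eapi_request(["show queue-monitor length"])
print(json.dumps(payload, indent=2))
```

The resulting body can then be posted with any HTTP client, and the reply parsed or written out as CSV by your own tooling.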



2) LANZ Thresholds


In the quest for accuracy and proactivity, packet buffer visibility provides a clear view of what happens inside the switch while packets are in transit.


2.1) Microburst

While L2/L3 forwarding is line rate on all Arista products, any network experiences some form of congestion, even at the nanosecond or microsecond level. Microbursts of traffic consume buffer, even when traffic does not average high utilization on longer time scales, such as the second. A second is a very long time at these speeds: at 10Gbps, up to about 19 million minimum-size packets can pass in one second (about 14.9 million once the per-frame preamble and inter-frame gap are counted).
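The number of packets that fit in one second depends on the frame size and on whether the per-frame wire overhead is counted. A minimal sketch of that arithmetic, assuming 64-byte minimum Ethernet frames and 20 bytes of overhead per frame (8 bytes of preamble/SFD plus a 12-byte inter-frame gap); the function name is only illustrative:

```python
def max_packets_per_second(link_bps, frame_bytes, overhead_bytes=20):
    """Upper bound on frames per second for a given link speed.

    overhead_bytes models the preamble/SFD (8 bytes) and the minimum
    inter-frame gap (12 bytes) that accompany every frame on the wire.
    """
    bits_per_frame = (frame_bytes + overhead_bytes) * 8
    return link_bps / bits_per_frame


TEN_GBPS = 10e9
with_overhead = max_packets_per_second(TEN_GBPS, 64)
frame_bytes_only = max_packets_per_second(TEN_GBPS, 64, overhead_bytes=0)

print(f"{with_overhead / 1e6:.2f} Mpps counting wire overhead")
print(f"{frame_bytes_only / 1e6:.2f} Mpps counting frame bytes only")
```

Either way, a one-second average hides many millions of individual queueing decisions.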

Microbursts can be caused by speed differences (transitioning from 10G to 1G, for example), or by incast (fan-in) of traffic from multiple sources to a few destinations. They follow from a simple physical principle: only one packet can be sent out of an interface at a time, so other packets must wait in the queue to be serialized.
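To give a feel for the orders of magnitude, the sketch below estimates how many bytes pile up in an egress queue during a single burst. It is a simplified fluid model (constant rates, no flow control, no drops), and the burst durations and port counts are invented for illustration only:

```python
def queue_growth_bytes(ingress_bps, egress_bps, burst_seconds):
    """Bytes accumulated in the egress queue during one burst.

    ingress_bps is a list with the sending rate of each source that
    targets the same egress port; egress_bps is that port's line rate.
    """
    excess_bps = sum(ingress_bps) - egress_bps
    if excess_bps <= 0:
        return 0.0  # the egress port keeps up, nothing queues
    return excess_bps * burst_seconds / 8


# Speed difference: one 10G sender bursting towards a 1G port for 100 microseconds.
speed_mismatch = queue_growth_bytes([10e9], 1e9, 100e-6)

# Incast: four 10G senders bursting towards one 10G port for 50 microseconds.
incast = queue_growth_bytes([10e9] * 4, 10e9, 50e-6)

print(f"speed mismatch: {speed_mismatch:,.0f} bytes queued")
print(f"incast:         {incast:,.0f} bytes queued")
```

Bursts this short are invisible in per-second counters, yet they already occupy a measurable share of the packet buffer.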


2.2) When microburst exceeds a threshold

In the presence of congestion, most switches simply apply a tail-drop threshold: packets arriving at the tail of a full queue are dropped, with no visibility into those drops.

Arista’s LANZ tool detects the microburst condition and provides information on the queue occupancy and the duration of the congestion.


2.3) Microburst visibility – benefits

The benefits are:

  • Knowing what causes latency or drops in the network
    • No more unknowns or guesswork
  • Buffer utilization historical trends
    • Assess the evolution of an over-utilization risk per switch, per port, per queue.
  • Proactive stance on buffer consumption
    • Learn that networks are not necessarily healthy even though no drops are experienced
    • Correct the problem before it gets worse rather than waiting for catastrophic drops to occur
    • Proactively resolve hot spots of buffering, rather than firefight drops


2.4) Differentiating thresholds in relation to time lapse

Threshold selection is therefore critical: it should suit your environment and reflect closely what you consider healthy or not. Consider the following buffer utilization levels, and their frequency of occurrence:

  • Normal utilization – mostly harmless
    • temporary
    • long term trend
  • Significant utilization – to resolve by considering network design improvements
    • temporary
    • long term trend
  • High utilization – risk of service impact
    • temporary
    • long term trend
  • Over-utilization – packet losses
    • temporary
    • long term trend


These criteria, crossed with the time scale (temporary event versus long term trend), offer a gradual differentiation opportunity:

Normal utilization (% = ?) – mostly harmless

  • Temporary event
    • No action required. Small amounts of buffering are normal.
  • Long term trend
    • No action required. Small amounts of buffering are normal.

Significant utilization (% = ??) – resolve this by considering network design improvements

  • Temporary event
    • No action required. Microbursts up to a medium level are normal.
    • Get visibility at this level for monitoring purposes, but this is usually not a risk.
  • Long term trend
    • Consider finding the cause of medium (not low) buffer utilization. Look at the traffic patterns (incast, speed differences) stressing the buffer on particular interfaces.
    • While this is not a risk in itself, if burstiness spikes higher than usual you may enter an area of risk and drops.

High utilization (% = ???) – risk of service impact

  • Temporary event
    • Even without losing packets, reaching a high buffer utilization raises the risk of a worse impact if burstiness spikes to a higher level.
    • Monitor the frequency of such high buffer utilization. Rare or occasional events might be normal. You should correlate with the long term trend. For example, a high level of microbursts on a low-utilization long term trend could be very acceptable. However, with a more frequent occurrence of high microbursts, combined with a long term trend at medium levels, the risk would be seen as higher.
  • Long term trend
    • A long term trend of high buffer utilization is not healthy as it:
      – increases latency
      – carries a high risk of over-utilization in periods of peaks
    • This situation should be avoided by immediately considering:
      – Redistribution of the resources receiving the high level of microbursts
      – In case of incast, redesigning how the remote hosts access the resources (e.g. load-balancing to other resources)
      – In case of speed difference, minimizing the choke point (high buffer stress) by reducing the difference between the low-speed and the high-speed sides
    • For example, a switch might suffer high buffer stress going from 10Gbps to 1Gbps in a high performance environment. To minimise this, you may:
      – Add physical links to a port-channel: going from 1Gbps to 2Gbps on the slow side would ease the buffer stress by serializing out twice as fast.
      – Avoid oversized uplinks (e.g. 40G, 100G) if some hosts are connected at 1G speed on the same device. Otherwise, exclude low-speed hosts (e.g. 100M, 1G) from devices with 40G/100G uplinks.

Over-utilization (% = ????) – packet losses

  • Temporary event
    • In exceptional circumstances, a high utilization peak might cause buffer starvation and therefore packet losses, even if the long term trend (seconds, minutes, …) was very low. LANZ gives you visibility of such events so you can correlate service impact with network events.
    • On the 7150S series you may mirror over-threshold traffic for analysis, which helps you identify the exact source of traffic anomalies.
  • Long term trend
    • You may look at both the traffic utilisation and the buffer utilisation to explain your symptoms. There are two different scenarios of high utilisation:
    • 1) High bandwidth utilization (e.g. 70-80%) with medium buffer consumption (often reaching 50%)
      – Permanent high utilization, as seen on some WAN/peering links (70-80% throughput utilization), obviously leaves almost no margin for burstiness, and is often a cause of packet loss. The only radical solution to permanent high utilization is to increase throughput.
      – Shallow buffers and highly meshed traffic flows might exacerbate the impact of such high utilization, so deep buffers could marginally improve the situation, as long as it is kept in mind that the purpose of deep buffers is to absorb microbursts, not to create bandwidth.
    • 2) Medium bandwidth utilization (e.g. 50%) with high buffer consumption (often reaching 70-80%)
      – This use case reflects less traffic throughput but more bursty applications, where deep packet buffers would be particularly suitable. The available link throughput should remain monitored to provide burst margin.
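The port-channel suggestion for speed differences comes down to serialization arithmetic. A minimal sketch, assuming a queued burst of 112,500 bytes (an invented figure) and assuming the traffic spreads evenly across both port-channel members, which per-flow hashing does not guarantee in practice:

```python
def drain_time_seconds(queued_bytes, egress_bps):
    """Time needed to serialize a queued burst out of the egress side."""
    return queued_bytes * 8 / egress_bps


QUEUED_BYTES = 112_500  # hypothetical burst waiting on the slow side

single_link = drain_time_seconds(QUEUED_BYTES, 1e9)   # one 1G link
port_channel = drain_time_seconds(QUEUED_BYTES, 2e9)  # two 1G links, ideal spread

print(f"1 x 1G: {single_link * 1e6:.0f} microseconds to drain")
print(f"2 x 1G: {port_channel * 1e6:.0f} microseconds to drain")
```

Halving the drain time halves how long the buffer stays occupied, which is what eases the buffer stress.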


2.5) Know your network and applications

All these different use cases should be adapted to your environment.

You might have noticed that the utilization levels were given as undefined percentages (% = ??). This is because the “normal”, “significant”, and “high” levels of buffer utilization vary depending on your application, network design, and latency/risk sensitivity.
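One way to make these levels concrete for your own environment is to write them down as explicit boundaries. The sketch below is only an illustration: the function name and the 20% / 60% boundaries are placeholders invented for the example, not recommended values.

```python
def classify_utilization(buffer_pct, significant_from, high_from, drops_seen=False):
    """Map a buffer utilization percentage to one of the four levels.

    significant_from and high_from are site-specific boundaries that you
    choose; over-utilization is taken to mean that drops were observed
    or that the buffer is fully consumed.
    """
    if drops_seen or buffer_pct >= 100:
        return "over-utilization"
    if buffer_pct >= high_from:
        return "high"
    if buffer_pct >= significant_from:
        return "significant"
    return "normal"


# Placeholder boundaries, to be replaced with values that fit your network.
for pct in (5, 35, 80, 100):
    level = classify_utilization(pct, significant_from=20, high_from=60)
    print(f"{pct:3d}% -> {level}")
```

A latency-sensitive environment would likely pick lower boundaries than a bulk-transfer one.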



3) Finding the right LANZ buffer threshold for you


Fully knowing one’s network or applications can be utopian. Even then, that knowledge might not translate into knowing the impact on the switch packet buffer, as flow characteristics are not static.

This section aims to help you choose the right LANZ threshold.


3.1) How much information?

A first step to consider is the amount of information, granularity, and precision you desire.

The buffer threshold can be set as low as 2 (note: the unit varies per platform), which means that with highly bursty traffic you might receive an extremely high number of microburst notifications. This overload of information might not be desired.

At the other extreme, a high threshold will never tell you about the congestion events occurring below it: you might miss a lot of near-threshold events you would have wished to know about.

The default LANZ threshold represents approximately 25% of a port's buffer consumption, and is deemed a good compromise between having too much information and not enough. With the default, most events deserve careful attention.

However, this default choice might give you little feedback on how your network behaves in common conditions. You might want to benefit from LANZ not just for exceptional bursts, but also for more frequent, lower-consuming events, and therefore gain better trend visibility.

Conclusion on information quantity

The number of LANZ notifications to aim for should reflect your preference between:

  • having too many false positives (too much benign information)
  • having too many false negatives (missed malign information)
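The trade-off can be pictured by replaying one set of queue-depth peaks against several candidate thresholds. In this sketch both the peak depths and the thresholds are made-up numbers in arbitrary units, chosen only to show how quickly the notification count grows as the threshold drops:

```python
def reported_events(peak_depths, threshold):
    """Peaks that exceed the threshold and would therefore be reported."""
    return [depth for depth in peak_depths if depth > threshold]


# Hypothetical queue-depth peaks observed on one port (arbitrary units).
peaks = [3, 8, 15, 40, 60, 120, 300, 550]

for threshold in (512, 100, 20, 2):
    seen = reported_events(peaks, threshold)
    missed = len(peaks) - len(seen)
    print(f"threshold {threshold:3d}: {len(seen)} reported, {missed} below threshold")
```

A high threshold reports only the largest peak and stays silent about the rest; a low one reports everything, benign or not.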


3.2) Empirical approach: starting with the default

3.2.1) Starting with the default

The safest approach is to start with the default threshold value, which is known to be quite high, and monitor whether the threshold is exceeded on any port or queue.

You could leave it as is for a short time if you are in a hurry, or if you guess or know that your burstiness should be low. For a production environment you might want to run LANZ monitoring cycles of 24 hours or 7 days, depending on when traffic bursts occur in your network.

LANZ buffer monitoring has no impact on hardware performance, but the reporting of events (syslog, streaming, tcpdump, etc) can consume resources, even though CoPP would always protect the system. Report overload is simply not elegant resource management.

Best practice is to configure LANZ only on the interfaces that need it, and to set the threshold adequately to avoid too much reporting pollution.

At the default LANZ level, you might only see temporary high-level buffer utilization.


3.2.2) Lower to 1/5th or 1/10th of the default

Beyond the initial default threshold monitoring, lower the threshold to a fraction of the default threshold:

  • 100 segments of 480B on the 7150S (instead of 512 for the high threshold): about 1/5th
  • 500KB on the 7500E/7280SE (instead of about 50MB): this is closer to 1/100th than 1/10th. The Arad platforms have deep buffers, so reaching the low-burst level of notification requires a larger cut in the threshold.

At this level you might see more occasional peaks of high buffer utilisation, in particular when combining stresses such as incast and speed difference.

In a low-burst environment you might still not see any microburst events.


3.2.3) Lower to a further 1/5th

Then lower by a further 1/5th of the previous value:

  • 20 segments of 480B  on the 7150S
  • 100KB on the 7500E/7280SE

At this level, you are likely to see LANZ records when there is fair stress on the network. Many production environments would show LANZ records if they have symptoms of speed difference or incast, at a fair level of performance.


3.2.4) Lowest level: 2

For lab environments, temporary testing, or environments with low buffer stress, you may choose to set the LANZ queue threshold to the absolute minimum: 2 (unit varies)

  • 2 segments of 480B on the 7150S
  • 2KB on the 7500E/7280SE

At this level, you would see every congestion event, down to a single packet queueing (while another is serialized), provided it is bigger than the minimum threshold.
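For the 7150S, the threshold ladder used in this section can be translated from segments into bytes, which also shows why a single queued packet can cross the minimum threshold. The sketch assumes that a packet occupies a whole number of 480B segments (rounded up); that rounding is an assumption made for the example, not a statement from the platform documentation.

```python
import math

SEGMENT_BYTES = 480  # 7150S segment size quoted in this article


def segments_to_bytes(segments):
    """Convert a threshold expressed in segments into bytes."""
    return segments * SEGMENT_BYTES


def segments_for_packet(packet_bytes):
    """Segments occupied by one packet, assuming whole-segment allocation."""
    return math.ceil(packet_bytes / SEGMENT_BYTES)


def exceeds_threshold(packet_bytes, threshold_segments=2):
    """True if a single queued packet is bigger than the threshold."""
    return segments_for_packet(packet_bytes) > threshold_segments


ladder = {"default": 512, "1/5th": 100, "further 1/5th": 20, "minimum": 2}
for name, segments in ladder.items():
    print(f"{name:>13}: {segments:3d} segments = {segments_to_bytes(segments):,} bytes")

for size in (64, 1500):
    print(f"{size}-byte packet alone exceeds the minimum: {exceeds_threshold(size)}")
```

Under that assumption, one full-size 1500-byte packet waiting in the queue is enough to generate a record at threshold 2, while one small packet is not.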

The use cases for setting such a low value in a production environment are:

  • Troubleshooting: verify the current level of burstiness with a short LANZ snapshot at the low-threshold level
  • Very high monitoring accuracy: get real-time updates on even the smallest microbursts, and verify the latency caused by the buffering of a single packet. Note: this is not suitable for environments with very frequent bursts.



3.3) Empirical approach: starting with the minimum

Another approach, which might make some cringe merely for the noisy information it can temporarily generate, is to lower the threshold to the minimum (value=2) on the ports that you know are the most likely to suffer congestion. Note: hardware forwarding would not be impacted by an overload of LANZ records (CoPP throttling).

You would immediately see LANZ records, and be able to quantify the size and frequency of your microbursts.

If you see several events per second, you might want to raise the threshold promptly.

If LANZ records are frequent, on the order of several per minute, but not an overload (not several per second), then it might fit your target of manageability.

If the amount of records being generated is too high to see all the local historical data you wish, then increase the threshold accordingly.
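These rules of thumb can be sketched as a small decision helper. The numeric cut-offs are assumptions: "several" is read here as two or more, and the outcome for quieter ports, which the rules above do not cover, is simply labelled "quiet".

```python
def suggest_adjustment(event_count, window_seconds,
                       overload_per_second=2.0, manageable_per_minute=2.0):
    """Turn an observed LANZ record rate into a rough suggestion.

    The default cut-offs are illustrative readings of "several per
    second" and "several per minute", not vendor guidance.
    """
    per_second = event_count / window_seconds
    if per_second >= overload_per_second:
        return "raise threshold"
    if per_second * 60 >= manageable_per_minute:
        return "manageable"
    return "quiet"


print(suggest_adjustment(600, 60))  # 10 records per second
print(suggest_adjustment(5, 60))    # 5 records per minute
print(suggest_adjustment(1, 600))   # 1 record in 10 minutes
```

The same check can be rerun after each threshold change until the record rate settles in the range you find manageable.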

The aim is to reach the right balance between manageability and useful content.



4) Conclusion


In simpler words, you should look at the queue depth over time, and adapt the threshold to the level you judge the most useful, giving you the best balance between granularity/detail and manageability (avoiding overload for the eye).

Changing the thresholds has no impact on hardware forwarding performance, and even badly configured or extreme settings would not impact the control plane, as it is protected by CoPP.



