- 1) Introduction
- 2) Measuring
- 3) Application behaviour and causes
- 4) Flow-Control / PFC
- 5) Buffer management and Queues usage
Congestion might not be obvious: it can be discovered reactively in disastrous situations, or proactively by collecting statistics from equipment and investigating symptoms demonstrated by applications and systems.
Deep buffers on switches are a blanket, effortless solution to the problem, but they might not be materially possible or justifiable everywhere on a network.
This document discusses design considerations in case of congestion.
The first step (which might seem obvious) to understanding potential issues is to translate symptoms such as "slow", "unresponsive", or "poor performance" into measurable, baselined metrics collected on the nodes surrounding the network (servers, VMs, storage, OS, applications, appliances, etc).
Here are some health metrics examples:
- IP packet drops rate
- Operating System TCP segment loss and retransmit rates
- Application execution speed (e.g. task completion time)
- Application latency: 99th percentile round-trip delay
- Goodput rate: throughput as seen by the application / storage (not just the NIC/network throughput)
- Server disk read and write speeds
- CPU/RAM utilization
- Traffic flow pattern across the network, detailed per flow pattern or application
- For example, sFlow historical trends per application, per talker: client<–>server, server<–>server, server<–>storage, storage<–>storage
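Several of these metrics can be scraped from the hosts themselves. As a minimal sketch, the snippet below pulls the TCP sent/retransmit counters out of `netstat -s` text; the regexes match the counter wording shown in the server outputs later in this document, and the sample string here is a hypothetical excerpt — adjust the patterns for your OS's exact wording.

```python
# Sketch: extract baseline TCP health counters from `netstat -s` output.
# The counter names ("segments send out" / "segments retransmited")
# follow the Linux net-tools wording shown elsewhere in this document.
import re

def tcp_counters(netstat_text):
    """Return (segments_sent, segments_retransmitted) from netstat -s text."""
    sent = int(re.search(r"(\d+) segments send out", netstat_text).group(1))
    retrans = int(re.search(r"(\d+) segments retransmited", netstat_text).group(1))
    return sent, retrans

# Hypothetical excerpt of `netstat -s` output:
sample = "917882534 segments send out\n3706544 segments retransmited"
print(tcp_counters(sample))   # (917882534, 3706544)
```

Collected periodically (e.g. per minute) and graphed, these two counters alone already give a useful baseline and deviation signal.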
It is useful to know (by testing and documenting) these indicators in two types of situations:
- Baseline (Normal conditions)
- Under stress (Max performance)
Measuring your baseline
Your baseline (metrics measured in normal, healthy conditions) is important information. Without it, once in an abnormal situation, you would not be able to identify which metric levels are expected and which are flagging issues and pointing towards a root cause.
Measuring your maximum stress capacity
If you were to stress your application to its maximum performance capacity, do you know where the bottleneck would be in the end-to-end path? Candidates include application/code optimization, the Operating System (TCP tuning, multi-threading), NIC/drivers (offload, DMA), CPU, PCI, RAM, disk write speed, network speed, network buffers, network queue configuration, and QoS/prioritization.
If you assume everything is line-rate until proven otherwise, you might be unpleasantly surprised to have to figure out the actual bottleneck at an inconvenient time (during an outage or a serious performance issue). Having such performance information makes the infrastructure predictable and your actions proactive: you know what will happen and you know what to do in these situations.
Conclusion on metrics
Once you know what to expect both in normal conditions and under maximum stress, you are able to see deviations. Knowing is the first step to understanding what could cause issues in abnormal times.
3) Application behaviour and causes
It is not always the network. The application's end-to-end performance can suffer from issues such as:
- sub-optimal load-balancing (not distributing well) creating concentrated stress
- Is the OS optimized for high performance? (compute, storage, NIC drivers, MTU, CPU offload, disk management)
- Disk performance: can disks support the full load of packets they receive? How do you know?
The outputs below show an example of unhealthy metrics on servers, with heavy TCP retransmits:
admin@server1:~$ netstat -s | grep -E 'send|retransmited'
    917882534 segments send out
    3706544 segments retransmited

admin@server2:~$ netstat -s | grep -E 'send|retransmited'
    3331949939 segments send out
    3952631 segments retransmited
In the above output, even though only about 0.4% of all sent TCP segments have been retransmitted, the impact on the application should not be minimized: it can be enough to disrupt performance.
- Such an average does not tell whether drops happen in a distributed manner (e.g. one drop every 250 packets) or in large chunks (e.g. 40 drops followed by 10k cleanly acknowledged segments).
- Recording/graphing rate changes over time is much more important than a single absolute number.
- In either case, the drops are enough to lower average throughput through TCP window-size fallback and slow step-up.
- Additionally, the OS might take a while to detect a timeout (RTO), worsened by the OS increasing the RTO over time (this varies per OS, and is often a matter of seconds or much more)
- TCP window sizes might synchronize network-wide (global synchronization), so the impact hits all flows at the same time, not just one flow or node.
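The arithmetic behind these points can be sketched quickly. The snippet below turns two successive counter samples into a retransmit percentage and a per-second rate — the second sample's values are hypothetical, added only to illustrate why graphing the rate of change matters more than the cumulative number:

```python
# Sketch: turn two successive netstat-style counter samples into
# a retransmit percentage and a per-second retransmit rate.

def retransmit_stats(sent_t0, retrans_t0, sent_t1, retrans_t1, interval_s):
    """Compare two cumulative-counter samples taken interval_s seconds apart."""
    sent_delta = sent_t1 - sent_t0
    retrans_delta = retrans_t1 - retrans_t0
    pct = 100.0 * retrans_delta / sent_delta if sent_delta else 0.0
    return pct, retrans_delta / interval_s

# First sample: the server1 counters shown above.
# Second sample: hypothetical values 60 seconds later.
pct, rate = retransmit_stats(917882534, 3706544, 917982534, 3707144, 60)
print(f"{pct:.2f}% retransmitted, {rate:.1f} retransmits/s")   # 0.60% retransmitted, 10.0 retransmits/s
```

A steady 10 retransmits/s and a single 600-retransmit burst give the same cumulative counter but very different application impact, which is why the interval-based view is the one worth graphing.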
In conclusion, TCP retransmits can be very bad for the overall performance.
Application / Storage Latency
Some applications or storage might be able to report their perceived latency: RTOs, communication delays, read/write speeds across the network. The figures below consider a DC environment (no WAN/wide-area propagation delay).
Without congestion, the expected transit delay introduced by a fast modern switch is around a few hundred nanoseconds to a few microseconds: roughly 400ns to 4µs, depending on the network silicon and architecture.
During congestion, every additional queued frame adds waiting time. As an example, each 9000-byte jumbo frame (72,000 bits) represents an additional 7.2µs of delay if the port serializes at 10Gbps. A full queue in a switch with a common dynamic shallow buffer could pile up a few hundred microseconds of latency.
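The serialization math above generalizes to any frame size, link speed, or backlog depth. A minimal sketch (the 500KB backlog figure is a hypothetical example of a nearly full shallow queue):

```python
# Sketch: serialization delay per frame, and worst-case queueing delay
# for a backlog of queued bytes ahead of a frame on the same port.

def serialization_delay_us(data_bytes, link_bps):
    """Time (in microseconds) to serialize data_bytes onto a link_bps port."""
    return data_bytes * 8 / link_bps * 1e6

# The 9000-byte jumbo frame example from the text, on a 10Gbps port:
per_frame = serialization_delay_us(9000, 10e9)        # 7.2 us

# A hypothetical 500KB of frames already queued on that port:
backlog = serialization_delay_us(500_000, 10e9)       # 400 us
print(f"{per_frame:.1f} us per frame, {backlog:.0f} us for a 500KB backlog")
```

This confirms the order of magnitude claimed above: even a full shallow queue adds hundreds of microseconds, not the milliseconds or seconds users complain about.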
These amounts contrast with application-related latency, which often reaches milliseconds or seconds of user-experienced delay: at least 1000x higher. That level of latency is not caused by switch buffering delays. Instead, TCP goodput collapse or CPU/disk overload can cause low performance at the application level.
The way OSes / applications / storage manage congestion will impact diagnosis: they need to detect drops before retransmitting. Applications may choose protocols other than TCP to implement their own timers/health checks, because relying purely on TCP can be extremely slow: fast retransmission only triggers after duplicate ACKs arrive, the minimum Retransmit TimeOut (RTO) is commonly around 200ms (e.g. on Linux), and some OSes' default RTO behaviour might cause tens of seconds of delay.
Once a packet is detected as lost (depending on the application / OS) then it needs to be retransmitted. Applications using TCP rely on the operating system to achieve this detection and retransmission. Some other protocols require the application itself to detect packet loss and retransmit where appropriate.
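To see how timeouts reach "tens of seconds", consider exponential RTO backoff. The sketch below assumes a Linux-like 200ms minimum RTO that doubles on each successive timeout; real stacks cap the RTO and the retry count, so these numbers are illustrative rather than a model of any specific OS:

```python
# Sketch: cumulative time lost to exponential RTO backoff, assuming a
# 200ms minimum RTO that doubles after each unanswered retransmit
# (illustrative; real stacks cap both the RTO and the retry count).

def time_lost_to_timeouts(min_rto_s=0.2, retries=6):
    rto, total = min_rto_s, 0.0
    for _ in range(retries):
        total += rto   # wait a full RTO before each retransmit attempt
        rto *= 2       # exponential backoff
    return total

print(f"{time_lost_to_timeouts():.1f}s lost after 6 successive timeouts")   # 12.6s
```

Six consecutive unanswered retransmits already stall a flow for over 12 seconds — which is why applications with strict latency needs often layer their own, faster loss detection on top.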
TCP slow start (where the TCP window size is gradually increased after the initial handshake) initially reduces bandwidth, but it should ramp up quickly.
Monitoring OS/application packet drops, loss detection, timeout parameters, and retransmit rates helps you understand better how impacted the TCP flows are (goodput vs badput).
4) Flow-Control / PFC
Flow-Control, and by extension DCBx/PFC, is a standard way for some Ethernet nodes to signal a desire for throughput to be controlled between each other.
If implemented, it must be very carefully designed.
Flow-Control consists of sending pause frames; a node receiving an Ethernet pause frame may then pause sending traffic to the requesting peer. The pause gives the originator of the pause frame time to process traffic: for example to write data to disk sectors, process in CPU, or transmit elsewhere. All these in-system activities (with potential resource bottlenecks) can be sources of congestion in an Ethernet node (server, switch, storage, …), and can be triggers for sending pause frames.
Hosts may send pause frames to the switch when they don’t write to disk quickly enough.
Switches (if configured to do so) can send pause when experiencing congestion on an egress port.
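How long does a pause actually last? In 802.3x/PFC, the pause_time field is expressed in quanta of 512 bit-times, so the real-time duration depends on link speed. A sketch of the conversion:

```python
# Sketch: convert an 802.3x/PFC pause_time value into real time.
# The pause_time field (0-65535) is in "quanta" of 512 bit-times,
# so the same value pauses a fast link for less wall-clock time.

def pause_duration_us(quanta, link_bps):
    return quanta * 512 / link_bps * 1e6

# Maximum pause (quanta = 65535) on a 10Gbps link:
print(f"{pause_duration_us(65535, 10e9):.0f} us")   # 3355 us, about 3.4 ms
```

A single maximum pause is only a few milliseconds at 10Gbps, but a misbehaving node can refresh pauses continuously, which is how the sustained throttling described below arises.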
It is possible to set Flow-Control in different modes: Rx, Tx, or both Rx+Tx.
A poorly designed Flow-Control implementation can cause problems, such as excessive throttling.
The vicious side of Flow-Control is that some hosts sending pause frames may throttle the switch down so much that it impacts other hosts negatively, reducing their performance (possibly to extremes).
Why can FlowControl / PFC be detrimental?
Example 1: Low-performance host
To understand why this might happen, take the simple example of a single low-performing host in an otherwise high-performance network of storage and servers.
If that host cannot cope with the received traffic load, it may send pause frames. The switch would pause traffic, buffering all the incoming frames destined to that server. If that server sends too many pause frames, buffering could become extreme on that port and the switch could suffer congestion, triggering a chain reaction: the switch would send pause frames to other, healthy nodes in order to buy time to resolve the congestion.
Although this mechanism aims at being lossless, a single congestion hot spot can impact all the other nodes.
Example 2: Fan-in creating permanent congestion to a single host
A node might be congested not because of low performance, but simply because the designed traffic flow creates fan-in (incast). Such a many-to-one traffic profile might occur on only one or a few ports, but that would be enough for the switch to send pause frames to all the hosts.
Although this resolves the congestion towards the impacted receiver, it also pauses traffic to other receivers that might not have suffered congestion. These would then experience lowered throughput, because the senders would all have paused to satisfy the congested node.
Example of incast creating congestion on egress:
Egress port to Storage A suffers congestion
The impacted traffic class to all hosts will be paused.
Given the above scenario, there could be repercussions (pauses) on collateral flows, even though those are not involved in the congestion:
Even if the egress ports to StorageB and Host3 are not congested, the senders will be paused, reducing throughput to StorageB and Host3.
In conclusion, pause frames may:
- Prevent congestion and drops
- Reduce throughput from some nodes and to uncongested nodes
- Allow poorly performing nodes or excessive incast to collapse/freeze throughput for all nodes
With Flow-Control, the decision to prevent congestion with pause frames, at the risk of excessive throughput freezes, must be a well-informed design choice based on a solid understanding of node performance and traffic patterns.
As a first recommendation, do not implement FlowControl if you do not have such understanding.
If you have such understanding, then you could carefully design your network and hosts for lossless traffic and increased application performance.
As a second recommendation, DCBx and PFC (per-queue Priority Flow-Control) are strongly preferred over plain Flow-Control (they are different protocols), to distinguish classes of traffic. For example:
- low-priority data-plane application traffic
- high priority applications
- control-plane protocols such as LLDP/LACP
Even if all data-plane application traffic uses the same protocol and has the same application priority, you may consider classifying traffic into different queues, not for prioritization but merely for more granular queueing and pausing: if a switch suffers congestion in one queue, then only hosts sending in that traffic class would receive pauses. Other hosts sending traffic in uncongested queues would not be paused.
Consider the following about PFC / FlowControl:
- Do not use Flow-Control between switches (as a rule of thumb), to avoid network-wide pauses. A switch-wide implementation can have enough impact in terms of throttling (sometimes too much); spare other switches from it.
- Pause frames apply to a queue across all of the ports on a switch. Some vendors' older implementations might additionally apply pauses to all queues.
- A pause means throughput gets temporarily frozen. If it prevents drops and TCP collapse, it might improve overall performance. But if applied without careful consideration, some hosts might be paused unnecessarily often (for congestion elsewhere) just because they are on the same switch and the same queue. That would prevent them from reaching their full potential, even though overall performance might still be greatly improved.
- Consider differentiating traffic (queues/QoS) to protect different traffic types, but also, for a single traffic type, simply to spread the load across classes and partition the Flow-Control impact.
- Mode none: By default, Arista switches have all ports set to mode none: the switch may receive pause frames, but they would not be acted upon. No pause frames would be sent.
- Flow-Control is disabled by default on the switches as a protection against undesired mass-throttling, especially to avoid less capable hosts reducing the egress throughput of capable hosts; Flow-Control has to be explicitly enabled on a per-port basis.
- Mode Rx: The switch will accept incoming pause frames and act upon them: it will pause traffic sent to that node when requested. The switch does not send pause frames out of that port.
- Use case:
- A server/storage node is overloaded and needs to receive less traffic. Ignoring (and not resolving) the server overload (via Flow-Control or improved server specs) would lead to potentially high in-server latency.
- Rx-only (no Tx) may be used on its own to relieve server overload without pausing what the server sends. This could be used to avoid restricting that server's throughput during pauses. Although that might seem desirable, it could actually create congestion if that server is a heavy talker: while the rest of the nodes/switch are pausing, that particular node would keep sending traffic that cannot be forwarded during the pauses.
- Rx-only is useful for servers/storage, but can have a negative impact on switches
- Mode Tx: The switch will initiate pause frames, sending them to peer nodes (typically servers). Pause frames received on that port will be detected but ignored.
- Do not implement to other switches.
- Use case:
- It may be used on its own on a port (Tx-only, no Rx), if you do not want to allow the server to request pauses.
- This implementation assumes the switch does receive some pause frames from other sources
- This might be beneficial for the switch, as it doesn't pause / buffer packets for that port, but might be negative for servers that cannot ask for pauses.
- Tx-only might be a niche usage
- Mode Rx+Tx: The switch both receives (and acts upon) pause frames, and sends/relays pause frames.
- This is the most commonly suggested implementation for ports facing servers/storage.
- Do not implement to other switches.
- Use case: have switch-wide generic FlowControl on all storage/servers ports to help prevent hosts overloading or being overloaded. It cannot prevent microbursts, hence buffers would still be used.
- Excessive pauses from few hosts may negatively impact other hosts
- Many servers/storages have Flow-Control enabled by default
Choices: QoS, FC, PFC
QoS and PFC / Flow-Control are configuration options, offering methods of protecting your most important traffic.
If all traffic is identical on some ports (e.g. iSCSI everywhere), then there might be no need to prioritize or to implement Priority Flow-Control or QoS for prioritization purposes. However, they could be implemented for other reasons, such as distributing load across multiple queues.
– QoS would involve either trusting the server/storage QoS settings (DSCP/CoS), or matching traffic against ACLs (e.g. source/Dest IP, L4 port, etc)
– PFC (Priority Flow-Control; not interoperable with plain Flow-Control, and preferred over it) allows what Flow-Control does, but per priority class (e.g. iSCSI). It distinguishes drop and no-drop queues.
5) Buffer management and Queues usage
Queue usage might be imbalanced, but buffers are by default shared equally across all queues. Usually, buffer allocation reserves a private amount dedicated to each queue and each port. A larger amount is shared across all ports and queues as a dynamic buffer pool. Queues can get allocations from that pool when needed (under congestion), if there is buffer available in the shared pool. Allocations are made on a first-come, first-served basis. However, queues normally have a threshold assigned, preventing them from borrowing too much, to prevent starvation.
Private reserved allocation and the dynamic allocation threshold may both be tuned (lowered or increased) to accommodate your needs.
You may want to tune these to avoid wasting unused resources, or to allocate more resources to congested queues.
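The dynamic threshold typically acts as a multiplier ("alpha") on the remaining free shared buffer: a queue may keep borrowing until its occupancy exceeds alpha times the free pool. A minimal sketch of that limit, assuming a Broadcom-style dynamic-alpha scheme and a hypothetical pool size:

```python
# Sketch: how a dynamic threshold ("alpha") limits a queue's borrowing
# from the shared pool, assuming a Broadcom-style dynamic-alpha scheme:
# a queue may grow until it exceeds alpha * (free shared buffer).

def dynamic_limit_bytes(alpha, free_shared_bytes):
    """Instantaneous borrowing limit for one queue."""
    return alpha * free_shared_bytes

free = 1_000_000  # hypothetical free shared pool, in bytes
for alpha in (1/8, 1/4, 2):
    print(f"alpha={alpha}: queue may occupy up to {dynamic_limit_bytes(alpha, free):,.0f} bytes")
```

Note the limit is recomputed as the pool drains: as a queue grows, the free pool shrinks, so its effective limit shrinks too — which is what keeps one greedy queue from starving the others.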
Here is an example of sub-optimal defaults, if the infrastructure and hosts are left at their defaults (no buffer tuning, no QoS on the network or hosts):
– Unicast Queue 1 might receive all the production traffic (the default goes to queue 1) and be heavily used, possibly suffering congestion.
– To a lesser extent, Unicast Queue 8 (critical network control protocols: STP, LACP, …)
– Least used queue: Multicast queue1 (usually clustering protocols, redundancy, sync, etc)
Here is an extract from an ethernet port on an Arista switch, showing the queue utilization:
Port    TxQ     Counter/pkts    Counter/bytes    Drop/pkts    Drop/bytes
------- ----    ------------    -------------    ---------    ----------
Et5     UC1        104121022     352266676891         7113      63942453
Et5     UC8            23276          3516944            0             0
Et5     MC1            19421          2643685            0             0
Since the other queues are not used (though you could use them, with QoS), their private buffer reservations are currently wasted.
You therefore have two optimization options:
1) use the other queues to differentiate traffic (QoS), benefiting from more queues (more buffer) supporting the same traffic (traffic then being distributed across queues)
2) re-purpose unused buffer from unused queues to your congested queues (on congested ports)
Option 2) brings performance advantages by allocating increased buffer.
But you might still use option 1) to benefit from a QoS policy, for example to prioritize VM traffic over backup traffic, or simply to load-balance across queues.
You may use the two options together. When testing, apply only one change at a time so the impact stays measurable.
We are going to explore these two options in more detail.
QoS to classify traffic and use more queues in the system
If you are unsure what to prioritize or classify, consider whether all your traffic really is the same in terms of traffic profile:
– resilience to drops (TCP vs UDP; if TCP: does it detect losses quickly or slowly?)
– elephant vs mice flows
– transactional in 2 directions, or 1-way stream.
For example, you might have to differentiate:
– VM storage
– Backup traffic
– customer transactions
– IP voice
QoS classification – Configuration example
The below shows a way of classifying some flows into a different tx-queue by applying policy-maps in conjunction with ACLs (to filter the relevant flows).
If you are interested in going that path, here is the configuration:
a) Classify traffic based on source IP (you may elect different match criteria)
ip access-list qos-acl
   10 permit ip host < > < >    =====> Filtering the flows (via ACL) which you want to classify in a different tx-queue
b) Configure a class-map to match the ACL
class-map type qos match-any qos-class-map
   match ip access-group qos-acl
c) Define a policy map and map the identified flow to the required traffic class
policy-map type qos qos-policy-map
   class qos-class-map
      set traffic-class 3    =====> Changing to Traffic Class 3 (aka TC3, unused before)
d) Apply the policy map to the ingress port
interface Ethernet X    =====> Ingress port
   service-policy type qos input qos-policy-map
sh int ethernet X counters queue detail
Port    TxQ    Counter/pkts    Counter/bytes    Drop/pkts    Drop/bytes
------- ----   ------------    -------------    ---------    ----------
Et1     UC0               0                0            0             0
Et1     UC1              30             4140            0             0    =====> Traffic arriving on UC1
sh int ethernet 4 counters queue detail

Port    TxQ    Counter/pkts    Counter/bytes    Drop/pkts    Drop/bytes
------- ----   ------------    -------------    ---------    ----------
Et4     UC0               0                0            0             0
Et4     UC1               0                0            0             0
Et4     UC2               0                0            0             0
Et4     UC3              30             4140            0             0    =====> Traffic egressing on UC3
How to implement Buffer management:
The 2nd option for optimizing buffer management is to tune the allocation.
The method involves setting buffer memory custom profiles and applying them to the interface.
The buffer profile allocates more or less private buffer to specific queues, and also sets a different weight (threshold) for taking from the shared dynamic buffer.
The congested ports/queues should be allocated:
– More private reserved buffer
– Higher threshold for the dynamic shared buffer
a) Check the usage and settings
Check the current status of your queue utilization
! complete output
show interfaces counters queue detail
!
! output showing only the queues actively used
show interfaces counters queue detail | nz
!
b) Check the current reserved buffer and dynamic allocation (threshold). The example below is from a 7050 series; check the relevant equivalent for your platform.
!
show platform trident mmu queue status
!
c) Configure profiles
Create a new custom profile to change the buffer allocation to a higher level.
If you do not have QoS for the moment, all of your traffic would currently be hitting unicast queue 1, so this might be the only one you need to set.
platform trident mmu queue profile CUSTOM-HIGHER
   egress unicast queue 1 reserved 3328    <===== just an example (e.g. could be 2x or 4x the original value; it is 2x in this example)
   egress unicast queue 1 threshold 2
You could choose to reduce the buffer on some unused or almost-silent ports, releasing even more buffer for where it is needed.
platform trident mmu queue profile CUSTOM-LOWER
   egress unicast queue 1 reserved 832    <===== just an example (e.g. could be 1/2 the original value)
   egress unicast queue 1 threshold 1/8
The minimum reserved amount has a range of:
<0-851968> Amount of Memory that should be reserved (in bytes)
The threshold can be:
1       A threshold value of 1
1/128   A threshold value of 1/128
1/16    A threshold value of 1/16
1/2     A threshold value of 1/2
1/32    A threshold value of 1/32
1/4     A threshold value of 1/4
1/64    A threshold value of 1/64
1/8     A threshold value of 1/8
2       A threshold value of 2
4       A threshold value of 4
8       A threshold value of 8
The profiles are then applied to the interface(s)
! On the congested port
!
interface Ethernet1-5, etc-etc
   platform trident mmu queue interface-profile CUSTOM-HIGHER
!
! On the light-traffic port, never congested
!
interface Ethernetx
   platform trident mmu queue interface-profile CUSTOM-LOWER
!
You don't need to implement too many changes at a time. You may experiment solely with the "CUSTOM-HIGHER" profile initially, and add the -LOWER one later. The standard procedure for experimentation is to apply only one change at a time, so you can tell what has an effect.
You can now assign the profile to the interface:
!
platform trident mmu queue profile CUSTOM-HIGHER
   egress unicast queue 1 reserved 3328
   egress unicast queue 1 threshold 2
!
interface Ethernet1
   platform trident mmu queue interface-profile CUSTOM-HIGHER
!
For some other profiles (like a CUSTOM-LOWER), you might want to apply them to the whole switch with a single command (rather than configuring all the interfaces):
! This applies to the whole switch
!
platform trident mmu queue profile CUSTOM-LOWER apply
!
Prior to making the change to Ethernet1, we could see that the reserved buffer for unicast queue 1 was 1664 bytes and its threshold was 1/4.
After making the change, the reserved buffer for unicast queue 1 is 3328 bytes and the threshold is 2.
You will want to make this change to all of your congested interfaces.
>>>>>>>>> This is a trial-and-error method (set, monitor, tune) <<<<<<<<<
You could increase the reserved buffer step by step until you reach the most satisfactory results: for example, doubling at each step (1664 –> 3328 –> 6656 …).
Or lower it if you went a step too far, to keep the buffer highly shared and dynamic.
To see where you are dropping packets after applying the profiles, use the command:
!
show interface counters queue detail | nz
!
You can clear the counters via the “clear counters” command.
After applying the changes, monitor to see the effect. If you are still seeing a large number of drops then double the reserved buffers and thresholds again.
Extract before Change
SW1#show platform trident mmu queue status
--------------------------------------------------
                 MMU Queue Status
--------------------------------------------------
Ethernet1
MMU Queues                 Bytes Used   Minimum Reserved Bytes   Dynamic Threshold
---------------            ----------   ----------------------   -----------------
Ingress Packet Queues      0            0                        8
Ingress Priority-Group 0   0            0                        8
Ingress Priority-Group 1   0            0                        8
Ingress Priority-Group 2   0            0                        8
Ingress Priority-Group 3   0            0                        8
Ingress Priority-Group 4   0            0                        8
Ingress Priority-Group 5   0            0                        8
Ingress Priority-Group 6   0            3120                     8
Ingress Priority-Group 7   0            0                        8
Egress Unicast Queue 0     0            1664                     1/4
Egress Unicast Queue 1     0            1664                     1/4   <===== Original values
Egress Unicast Queue 2     0            1664                     1/4
Egress Unicast Queue 3     0            1664                     1/4
Egress Unicast Queue 4     0            1664                     1/4
Egress Unicast Queue 5     0            1664                     1/4
Egress Unicast Queue 6     0            1664                     1/4
Egress Unicast Queue 7     0            1664                     1/4
Egress Multicast Queue 0   0            1664                     1/4
Egress Multicast Queue 1   0            1664                     1/4
[...]
After the change
SW1#show platform trident mmu queue status
--------------------------------------------------
                 MMU Queue Status
--------------------------------------------------
Ethernet1
MMU Queues                 Bytes Used   Minimum Reserved Bytes   Dynamic Threshold
---------------            ----------   ----------------------   -----------------
Ingress Packet Queues      0            0                        8
Ingress Priority-Group 0   0            0                        8
Ingress Priority-Group 1   0            0                        8
Ingress Priority-Group 2   0            0                        8
Ingress Priority-Group 3   0            0                        8
Ingress Priority-Group 4   0            0                        8
Ingress Priority-Group 5   0            0                        8
Ingress Priority-Group 6   0            3120                     8
Ingress Priority-Group 7   0            0                        8
Egress Unicast Queue 0     0            1664                     1/4
Egress Unicast Queue 1     0            3328                     2     <===== New values
[...]
Relieving buffer faster
A way to minimize stress on buffers is to let them empty faster. This means providing higher bandwidth out of the congested ports.
For example, where you have 2x10G, you could increase to 3x10G or 4x10G. This effectively reduces the oversubscription towards the congested nodes.
The goal is to empty buffers faster, providing some relief but not necessarily a resolution by itself, especially if traffic is very bursty.
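The relief can be quantified as buffer drain time. A minimal sketch, using a hypothetical 2MB backlog on the congested path:

```python
# Sketch: how adding egress bandwidth shortens the time to drain
# a queued backlog. The 2MB backlog figure is hypothetical.

def drain_time_us(backlog_bytes, egress_links, link_bps):
    """Time (microseconds) to drain backlog_bytes over egress_links parallel links."""
    return backlog_bytes * 8 / (egress_links * link_bps) * 1e6

backlog = 2_000_000  # 2MB queued toward the congested node
for links in (2, 3, 4):
    print(f"{links}x10G: {drain_time_us(backlog, links, 10e9):.0f} us to drain")
```

Going from 2x10G to 4x10G halves the drain time, but if bursts arrive faster than even the widened egress can serialize, buffering (and the tuning above) is still needed.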
This article has provided specific guidelines for troubleshooting, baselining, measuring, and designing QoS / Flow-Control / buffer management, including configuration steps for you to implement them.
- QoS: protect traffic and gain more buffer by using additional queues with their default buffer settings
- PFC or Flow-Control: send Ethernet pause frames to prevent overload (per queue with PFC, which is preferred)
- Buffer tuning: with no QoS in place there is currently only one queue in use (default unicast queue 1), giving the option to allocate more buffer to the congested queue
- More throughput: increasing bandwidth to the congested nodes could help relieve buffers, if it is possible to add throughput (a medium-term option with existing switches, by adding NICs/ports)
QoS, Flow-Control, buffer tuning, and increased bandwidth may all be considered to work together, but for the purpose of this troubleshooting it is beneficial to implement only one at a time, to see the effects of each.
Software configuration might be the easiest place to start (QoS, Flow-Control, buffers), skipping the bandwidth increase if it is not possible.