High CPU due to PIM processes is not always a bug and may be caused by either a misconfiguration or a routing issue. For the purposes of this document we will focus primarily on network misconfiguration.
Network-wide choppy video, music-on-hold streams, or loudspeaker issues are commonly caused by multicast problems in the network. If the issue is network-wide and not isolated to one area of the network, the first place to look is high CPU on the First Hop Routers (FHR) and/or the Rendezvous Point (RP). When CPU is high due to PIM processes, there are two usual triggers. The first is routing inconsistencies, in which case you would see other network issues at the same time. The second is a misconfiguration of the network.
In the majority of multicast cases you want to start from a single receiver and work towards the RP and the source, but for network-wide problems you want to start from the source and work towards the RP. As a first step, confirm the IP addresses of the source, the RP, and a test receiver. Defining these addresses narrows the scope of your troubleshooting.
The multicast registration process allows multicast streams to be advertised to the root of the multicast tree. This is explained in the following article.
The main responsibilities of the RP and FHR are summarized below. Understanding these steps explains why a broken registration process causes high CPU and helps identify which step has failed and needs to be fixed.
- Source sends traffic to the FHR.
- FHR encapsulates the multicast traffic in a unicast Register packet to the RP which dynamically forms a GRE tunnel known as a PIM tunnel.
- The RP receives the Register packet, decapsulates it, and forms (S,G) state with the FHR.
- The RP now generates the (S,G) join towards the FHR.
- FHR now receives the (S,G) join, adds the Outgoing Interface List (OIL) towards the RP and starts sending native multicast traffic towards the RP.
- A Register-Stop is sent from the RP to the FHR once the native multicast traffic arrives, tearing down the PIM tunnel.
- Once the FHR receives the Register-Stop, it sends Null-Registers to the RP to maintain state with the RP.
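The sequence above can be sketched as a small state machine. This is a toy model for illustration only, not Arista's implementation: the class `FirstHopRouter`, the function `simulate`, the `pim_path_complete` flag, and the interface name `to-RP` are all hypothetical, and Null-Registers are not modelled. It only shows where packets are counted (CPU versus hardware) depending on whether the (S,G) join and Register-Stop reach the FHR.

```python
from dataclasses import dataclass, field


@dataclass
class FirstHopRouter:
    """Toy model of FHR register state for a single (S,G)."""
    state: str = "REGISTERING"   # PIM tunnel up, traffic handled by the CPU
    cpu_forwarded: int = 0       # packets Register-encapsulated in software
    hw_forwarded: int = 0        # packets forwarded natively out the OIL
    oil: list = field(default_factory=list)

    def on_sg_join(self, interface: str) -> None:
        # (S,G) join from the RP: add the RP-facing interface to the OIL.
        if interface not in self.oil:
            self.oil.append(interface)

    def on_register_stop(self) -> None:
        # Register-Stop from the RP: tear down the PIM tunnel.
        self.state = "REGISTER_STOPPED"

    def on_source_packet(self) -> None:
        if self.state == "REGISTERING":
            self.cpu_forwarded += 1   # encapsulated towards the RP by the CPU
        if self.oil:
            self.hw_forwarded += 1    # native copy sent out the OIL


def simulate(pim_path_complete: bool, packets: int = 1000) -> FirstHopRouter:
    """Send one initial packet, then `packets` more after the RP has reacted."""
    fhr = FirstHopRouter()
    fhr.on_source_packet()            # first packet triggers the Register
    if pim_path_complete:
        # Only with a PIM path do the join and Register-Stop reach the FHR.
        fhr.on_sg_join("to-RP")
        fhr.on_register_stop()
    for _ in range(packets):
        fhr.on_source_packet()
    return fhr


if __name__ == "__main__":
    for ok in (True, False):
        fhr = simulate(pim_path_complete=ok)
        print(ok, fhr.state, fhr.cpu_forwarded, fhr.hw_forwarded)
```

In this model, a complete PIM path leaves only the first packet on the CPU, while a broken path keeps every packet on the CPU for as long as the source transmits.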
When looking at a tcpdump of the PIM registration, you can see the source and destination IPs of the outer (tunnel) header, which are the FHR and the RP. Further down in the packet you can see the actual source IP and the actual multicast group.
12:18:57.561616 28:99:3a:26:d4:4f > 98:5d:82:c1:83:ff, ethertype IPv4 (0x0800), length 88: (tos 0x0, ttl 255, id 54936, offset 0, flags [DF], proto PIM (103), length 74)
    192.168.15.1 > 220.127.116.11: PIMv2, length 54
    Register, cksum 0xdeff (correct), Flags [ none ]
    (tos 0x0, ttl 63, id 0, offset 0, flags [none], proto UDP (17), length 46)
    18.104.22.168.50001 > 22.214.171.124.50001: UDP, length 18
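As a rough illustration of reading the capture above, the following sketch pulls the outer and inner addresses out of a verbose tcpdump Register line. The function name `decode_register` and the regular expressions are assumptions written against this one sample (an inner UDP payload); other tcpdump versions or inner protocols may format the line differently.

```python
import re

# The capture from this article, joined into one string.
CAPTURE = (
    "12:18:57.561616 28:99:3a:26:d4:4f > 98:5d:82:c1:83:ff, ethertype IPv4 (0x0800), length 88: "
    "(tos 0x0, ttl 255, id 54936, offset 0, flags [DF], proto PIM (103), length 74) "
    "192.168.15.1 > 220.127.116.11: PIMv2, length 54 Register, cksum 0xdeff (correct), Flags [ none ] "
    "(tos 0x0, ttl 63, id 0, offset 0, flags [none], proto UDP (17), length 46) "
    "18.104.22.168.50001 > 22.214.171.124.50001: UDP, length 18"
)

IPV4 = r"\d+(?:\.\d+){3}"
# Outer header: FHR > RP, followed by the PIMv2 marker.
OUTER = re.compile(rf"({IPV4})\s+>\s+({IPV4}):\s+PIMv2")
# Inner header: source.port > group.port, followed by the UDP marker.
INNER = re.compile(rf"({IPV4})\.\d+\s+>\s+({IPV4})\.\d+:\s+UDP")


def decode_register(text):
    """Return the FHR, RP, source and group addresses, or None if not a Register."""
    outer = OUTER.search(text)
    inner = INNER.search(text)
    if "Register" not in text or not outer or not inner:
        return None
    return {
        "fhr": outer.group(1),     # outer source: the First Hop Router
        "rp": outer.group(2),      # outer destination: the Rendezvous Point
        "source": inner.group(1),  # actual multicast source
        "group": inner.group(2),   # actual multicast group
    }


if __name__ == "__main__":
    print(decode_register(CAPTURE))
```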
It is essential to understand that the registration process leverages software forwarding (CPU cycles) via a dynamic GRE tunnel. Forwarding packets this way requires additional encapsulation by the FHR and decapsulation by the RP. Traffic is not forwarded fully in hardware until the FHR receives the Register-Stop from the RP. Because the Register-Stop is a PIM packet, you must ensure that there is a PIM path between the RP and the FHR. Without a PIM path, the flow stays stuck in the Register state and is processed by the CPU rather than in hardware.
For more information regarding how to troubleshoot packets at the CPU, please refer to the following article:
As you can see in the diagram above, PIM is only configured on three out of four interfaces between the FHR and the RP. The “PIM-Transit” router does not have PIM configured on its uplink port to the RP. Without a complete PIM path from the RP to the FHR, the RP cannot complete the RPF (reverse path forwarding) check, cut over to the native path that forwards in hardware, or tell the FHR to tear down the PIM tunnel by sending a Register-Stop.
You can confirm that state has been formed between the FHR and the RP (that is, the FHR has notified the RP that there is a source, 126.96.36.199, for group 188.8.131.52), as seen below. The output also shows the incoming interface on which traffic from the source is received. An outgoing interface would be listed if there were receivers downstream. In this example, there are no interested receivers and thus no outgoing interface.
RP#show ip mroute 184.108.40.206
***snip***
220.127.116.11
  18.104.22.168, 0:29:51, flags: SLP
    Incoming interface: Vlan100
The most common reason for high CPU on the FHR is a flow that is “stuck in register”. The most common cause of this is the absence of PIM sparse-mode on one or more interfaces along the unicast path between the FHR and the RP. The second most common cause is a routing inconsistency. For example, asymmetric routing can make the path from the FHR to the RP a PIM path while the path from the RP to the FHR is not, so the Register-Stop packet is lost in the network.
But what happens when the Register-Stop, which is a PIM packet, does not make it from the RP to the FHR? In that scenario the PIM tunnel remains up, all traffic for that group is still forwarded via the CPU, and CPU utilization climbs on both devices. What is the solution?
1. Confirm that the CPU is high on both the FHR and RP:
------------- show processes top once -------------
top - 15:39:59 up 244 days,  7:12,  0 users,  load average: 1.44, 1.42, 1.44
Tasks: 353 total,   1 running, 352 sleeping,   0 stopped,   0 zombie
%Cpu(s):  6.6 us,  1.2 sy,  0.0 ni, 91.4 id,  0.0 wa,  0.4 hi,  0.4 si,  0.0 st
KiB Mem:   8171400 total,  7931960 used,   239440 free,   355096 buffers
KiB Swap:        0 total,        0 used,        0 free,  2661712 cached

  PID USER      PR  NI  VIRT  RES  SHR S  %CPU %MEM     TIME+ COMMAND
 3672 root      20   0  547m 131m 106m S  60.4  1.6 134:39.68 PimReg

2. Check COPP counters for drops towards the CPU:

#show cpu counter queue | nz
--------------------------------------------------------------------------------
                                  Linecard0/0
--------------------------------------------------------------------------------
Queue                 Counter/pkts*            Drops/pkts
---------------    -------------------    -------------------
Sflow                       2958474981                      0
Other                         54720792                 333599
TTL1                            179880                   6728
L3 Slow Path                   1738536                   3107
ARP                         1751402615                      0
Glean                       1242257317                   2071
Multicast Miss                 2229869                  99390
IGMP                         150295880                      0
Multicast LL                 390181238                     55
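When the counter table is long, it can help to list only the queues that are dropping, largest first. The sketch below is one way to do that from saved text output; the function name `queues_with_drops` and the row pattern are assumptions based on the three-column layout shown above, and the sample rows are copied from that output.

```python
import re

# Rows copied from the output above (queue name, packets, drops).
SAMPLE = """\
Queue Counter/pkts* Drops/pkts
Sflow 2958474981 0
Other 54720792 333599
TTL1 179880 6728
L3 Slow Path 1738536 3107
ARP 1751402615 0
Glean 1242257317 2071
Multicast Miss 2229869 99390
IGMP 150295880 0
Multicast LL 390181238 55
"""

# A data row is a queue name (which may contain spaces) followed by two integers.
ROW = re.compile(r"^\s*(?P<queue>\S.*?)\s+(?P<pkts>\d+)\s+(?P<drops>\d+)\s*$")


def queues_with_drops(text):
    """Return (queue, packets, drops) tuples with drops > 0, largest drops first."""
    rows = []
    for line in text.splitlines():
        match = ROW.match(line)
        if match and int(match["drops"]) > 0:
            rows.append((match["queue"], int(match["pkts"]), int(match["drops"])))
    return sorted(rows, key=lambda row: row[2], reverse=True)


if __name__ == "__main__":
    for queue, pkts, drops in queues_with_drops(SAMPLE):
        print(f"{queue:<16}{pkts:>12}{drops:>10}")
```

Header and separator lines do not match the row pattern, so they are skipped without special handling.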
3. Confirm on the FHR that the group is hashing to the proper RP:
show ip pim rp-hash 22.214.171.124
RP 126.96.36.199      <--- confirmed RP
  PIM v2 Hash Values:
    RP: 188.8.131.52
      Uptime: 14d16h, Expires: never, Priority: 0, HashMaskLen: 30, HashMaskValue: 561587137, Override: False
  Hash Algorithm: Default
4. Trace the route path back from RP to the FHR and ensure that each L3 interface has PIM-Sparse configured.
interface Ethernet5
   no switchport
   ip address 192.168.X.X/31
   pim ipv4 sparse-mode
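Checking every hop by eye is error-prone, so the check in step 4 can be approximated with a small script run against saved running-config text from each device on the path. This is a minimal sketch under stated assumptions: the function name `interfaces_missing_pim` is hypothetical, it treats any interface with an `ip address` line as Layer 3, and it looks for the exact `pim ipv4 sparse-mode` line shown above. `Ethernet6` and `Ethernet7` in the sample are invented for illustration.

```python
SAMPLE_CONFIG = """\
interface Ethernet5
   no switchport
   ip address 192.168.X.X/31
   pim ipv4 sparse-mode
!
interface Ethernet6
   no switchport
   ip address 192.168.X.X/31
!
interface Ethernet7
   switchport access vlan 100
"""


def interfaces_missing_pim(config_text):
    """Return names of L3 interfaces whose stanza lacks 'pim ipv4 sparse-mode'."""
    stanzas = {}
    current = None
    for raw in config_text.splitlines():
        if raw.startswith("interface "):
            # Start of a new interface stanza.
            current = raw.split(None, 1)[1].strip()
            stanzas[current] = []
        elif raw[:1].isspace() and current is not None:
            # Indented line: belongs to the current stanza.
            stanzas[current].append(raw.strip())
        else:
            # Separator ('!'), blank line, or unrelated top-level command.
            current = None
    return [
        name
        for name, body in stanzas.items()
        if any(line.startswith("ip address ") for line in body)
        and "pim ipv4 sparse-mode" not in body
    ]


if __name__ == "__main__":
    print(interfaces_missing_pim(SAMPLE_CONFIG))
```

An interface reported here is only a candidate: it matters for this issue if it sits on the unicast path between the RP and the FHR.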
If the issue is still seen, collect the outputs below and contact Arista TAC support by sending an email to firstname.lastname@example.org
CLI commands:
show ip mroute (group ip)
show ip pim rp-hash (group ip)
SW# show tech-support all | gzip > /mnt/flash/show-tech-$HOSTNAME-$(date +%m_%d.%H%M).log.gz