- Troubleshooting BFD commands
- Troubleshooting steps:
- 1) BFD goes down because the underlying physical interface went down
- 2) BFD packets are blocked by an ACL
- 3) Check if BFD packets are sent and received properly
- 4) The cases above assume the problem is live. What if the problem is intermittent and the BFD session goes down at unpredictable times?
- 5) BFD packets are sent properly but the session still goes down
- 6) QoS settings
BFD quick introduction:
What is BFD?
Bidirectional Forwarding Detection (BFD) is a low-overhead protocol designed to provide rapid detection of failures in the path between adjacent forwarding engines, over any media and at any protocol layer. The base protocol is defined in RFC 5880.
It does not operate independently, but only as an adjunct to routing protocols.
BFD is a simple Hello protocol: a pair of systems transmit BFD packets periodically over a path between them, and if a system stops receiving BFD packets for long enough, the bidirectional path is assumed to have failed.
BFD functions in asynchronous or demand mode, and also offers an echo function. EOS supports asynchronous mode and the echo function. Most customers use asynchronous mode.
Asynchronous Mode: In asynchronous mode, BFD control packets are exchanged by neighboring systems at regular intervals. If a specified number of sequential packets are not received, BFD declares the session down.
Echo Function: When the echo function is in use, echo packets are looped back through the hardware forwarding path of the neighbor system without involving its CPU. Failure is detected by an interruption in the stream of echoed packets. The minimum reception rate for BFD control packets from the neighbor is also relaxed automatically when the echo function is operational, because liveness detection is supplied by the echo packets.
What ports are used by BFD?
BFD control messages are transmitted to UDP port 3784.
BFD echo messages use UDP port 3785.
The following BFD intervals are configured globally by default:
bfd interval 300 min_rx 300 multiplier 3 default
bfd multihop interval 300 min_rx 300 multiplier 3
Tip: run "show run all sec bfd" or "show run all | grep bfd" to check the default config.
What do the values mean:
interval/tx-interval – specifies the rate, in milliseconds, at which BFD control packets are sent to BFD peers. The valid range is 50 to 60000 milliseconds.
min_rx/rx-interval – specifies the rate, in milliseconds, at which BFD control packets are expected to be received from BFD peers. The valid range is 50 to 60000 milliseconds.
multiplier – specifies the number of consecutive BFD control packets that must be missed from a BFD peer before BFD declares that peer unavailable and informs the Layer 3 protocol of the failure. The valid range is 3 to 50.
You can change the default config. For example:
bfd interval 500 min_rx 500 multiplier 4 default
You can configure BFD on an interface:
These commands set the transmit and receive intervals to 200 milliseconds and the multiplier to 2 for all BFD sessions passing through Ethernet interface 3/20.
switch(config)#interface ethernet 3/20
switch(config-if-Et3/20)#bfd interval 200 min_rx 200 multiplier 2
The default values for these parameters are:
- transmission rate 300 milliseconds
- minimum receive rate 300 milliseconds
- multiplier 3
To configure different values for these parameters on an interface, use the bfd interval command.
Question: What if we have different timers on each side?
The Detection Time (the period of time without receiving BFD packets after which the session is determined to have failed) is not carried explicitly in the protocol. Rather, it is calculated independently in each direction by the receiving system based on the negotiated transmit interval and the detection multiplier. Note that there may be different Detection Times in each direction.
The Detection Time calculated in the local system is equal to the value of Detect Multiplier received from the remote system, multiplied by the agreed transmit interval of the remote system (the greater of bfd.RequiredMinRxInterval and the last received Desired Min TX Interval). The Detect Multiplier value is (roughly speaking, due to jitter) the number of packets that have to be missed in a row to declare the session to be down.
So, for example, let's consider cp143 <----> co710.
cp143 has "bfd interval 200 min_rx 200 multiplier 5", whereas co710 has the default "bfd interval 300 min_rx 300 multiplier 3".
In this case, cp143's Required Min RX Interval is 200 ms, and the Desired Min TX Interval received from the other side is 300 ms. The greater of the two, 300 ms, is chosen and multiplied by the received Detect Multiplier (3) to get the Detection Time: 300 x 3 = 900 ms. You can confirm this value with "show bfd peers detail":
cp143[16:11:06]#show bfd peers detail | grep -i "detect time"
Detect Time: 900
Similarly, on co710:
Its Required Min RX Interval is 300 ms, and the Desired Min TX Interval received from cp143 is 200 ms. The greater value, 300 ms, is chosen and multiplied by the received Detect Multiplier (5) to get a Detection Time of 300 x 5 = 1500 ms.
co710…11:46:50(config)#show bfd peers detail | grep -i "detect time"
Detect Time: 1500
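The negotiation above can be sketched in a few lines. This is an illustrative calculation of the RFC 5880 rule (not EOS code); the function name and values mirror the cp143/co710 example.

```python
# Sketch of RFC 5880 Detection Time calculation (illustrative, not EOS code).
def detection_time_ms(local_required_min_rx, remote_desired_min_tx, remote_multiplier):
    """Detection Time as computed by the *local* system: the remote
    Detect Multiplier times the agreed remote transmit interval (the
    greater of our Required Min RX Interval and the remote's last
    received Desired Min TX Interval)."""
    return remote_multiplier * max(local_required_min_rx, remote_desired_min_tx)

# cp143: interval 200 min_rx 200 multiplier 5
# co710: interval 300 min_rx 300 multiplier 3 (defaults)
print(detection_time_ms(200, 300, 3))   # on cp143 -> 900
print(detection_time_ms(300, 200, 5))   # on co710 -> 1500
```

Note that each side computes its own Detection Time, which is why the two directions can differ (900 ms vs 1500 ms here).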
For BFD to function as a failure detection mechanism, it must be enabled for each participating protocol.
Configuring BFD for protocols:
These commands configure VLAN interface 200 to use BFD for PIM-SM connection failure detection regardless of the global PIM BFD configuration.
switch(config)#interface vlan 200
switch(config-if-VL200)#pim ipv4 bfd
These commands enable BFD failure detection for BGP connections with the neighbor at 10.13.64.1.
switch(config)#router bgp 300
switch(config-router-bgp)#neighbor 10.13.64.1 bfd
These commands enable BFD in OSPF instance 100 for all OSPF neighbors on BFD-enabled interfaces, except those connected to interfaces on which OSPF BFD has been explicitly disabled.
switch(config)#router ospf 100
switch(config-router-ospf)#bfd default
These commands enable OSPF BFD on Ethernet interface 3/21.
switch(config)#interface ethernet 3/21
switch(config-if-Et3/21)#ip ospf neighbor bfd
Troubleshooting BFD commands:
Below are some helpful show commands for troubleshooting BFD:
1) show bfd peers:
This command gives general information about the BFD neighbors/peers: the IP address of the peer, the interface used for communication, the state of the session and the time since it came up, etc.
2) show bfd peers detail
This is used to get in-depth details of the BFD peer:
co642...20:48:59(config)#show bfd peers detail
VRF name: default
-----------------
Peer Addr 10.0.0.2, Intf Ethernet49/1, Type normal, State Up
VRF default, LAddr 10.0.0.1, LD/RD 4172577639/297798526
Session state is Up and not using echo function
Last Up Dec 06 20:28:10 2020
Last Down NA
Last Diag: No Diagnostic
Authentication mode: None
Shared-secret profile: None
TxInt: 300, RxInt: 300, Multiplier: 3
Received RxInt: 300, Received Multiplier: 3
Rx Count: 5633, Rx Interval (ms) min/max/avg: 223/328/270 last: 261 ms ago
Tx Count: 5625, Tx Interval (ms) min/max/avg: 223/328/271 last: 261 ms ago
Detect Time: 900
Sched Delay: 1*TxInt: 4734, 2*TxInt: 895, 3*TxInt: 0, GT 3*TxInt: 0
Registered protocols: bgp
Uptime: 25:25.73
Last packet: Version: 1             - Diagnostic: 0
             State bit: Up          - Demand bit: 0
             Poll bit: 0            - Final bit: 0
             Multiplier: 3          - Length: 24
             My Discr.: 297798526   - Your Discr.: 4172577639
             Min tx interval: 300   - Min rx interval: 300
             Min Echo interval: 300
Peer Addr :: Remote IP
Peer Intf :: local egress interface
LAddr :: Local IP
LD/RD :: Local Discriminator/Remote Discriminator
Holddown :: rx-interval*multiplier for the egress interface
(mult) :: multiplier value for the egress interface
Last Up :: Timestamp (in 100ths of a second since system up) from the last time the peer session came up
Last Down :: Timestamp (in 100ths of a second since system up) from the last time the peer session went down
Diag :: Last diag value – the diagnostic code specifying the local system's reason for the last change to the Down state
Rx Count: 5633, Rx Interval (ms) min/max/avg: 223/328/270 last: 261 ms ago
Tx Count: 5625, Tx Interval (ms) min/max/avg: 223/328/271 last: 261 ms ago
Detect Time: 900
Sched Delay: 1*TxInt: 4734, 2*TxInt: 895, 3*TxInt: 0, GT 3*TxInt: 0
From the Rx and Tx counters you can check the minimum, maximum and average intervals of sent and received BFD packets, as well as the time elapsed since the last BFD packet was sent and received.
The Sched Delay counters also tell us in which interval this switch scheduled each BFD packet for transmission.
For example, in the output above the switch has sent BFD packets within the first and second intervals. Usually intervals are 300 ms with a multiplier of 3, which means that if packets are sent later than 300 x 3 = 900 ms there can be a BFD flap.
So it is always preferable to see packets being sent within/before the 3rd interval.
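To make the Sched Delay buckets concrete, here is a hedged sketch of how transmit gaps could be grouped into the 1*TxInt/2*TxInt/3*TxInt/GT 3*TxInt counters. The timestamps and the exact bucketing rule are illustrative assumptions, not the agent's actual implementation.

```python
import math

# Hedged sketch: bucket gaps between consecutive BFD transmits the way
# the "Sched Delay" counters group them (assumed bucketing rule).
def sched_delay_buckets(tx_times_ms, tx_int_ms):
    buckets = {"1*TxInt": 0, "2*TxInt": 0, "3*TxInt": 0, "GT 3*TxInt": 0}
    for prev, cur in zip(tx_times_ms, tx_times_ms[1:]):
        n = math.ceil((cur - prev) / tx_int_ms)  # how many TxInt intervals the gap spans
        if n <= 1:
            buckets["1*TxInt"] += 1
        elif n == 2:
            buckets["2*TxInt"] += 1
        elif n == 3:
            buckets["3*TxInt"] += 1
        else:
            buckets["GT 3*TxInt"] += 1
    return buckets

# With TxInt = 300 ms: gaps of 260 and 280 ms land in the first bucket,
# a 550 ms gap in the second, and a 1000 ms gap exceeds 3*TxInt --
# with multiplier 3, that last packet is late enough to cause a flap.
print(sched_delay_buckets([0, 260, 540, 1090, 2090], 300))
```

Anything counted in the 3*TxInt or GT 3*TxInt buckets is the kind of scheduling delay that can expire the peer's detect time.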
3) show bfd peers history
This output gives us the history of all recent bfd events
We will see some examples below.
Logs to gather:
– show tech-support | no-more
– show logging system | no-more
– show agent qtrace | no-more
– show agent logs | no-more
– cpu graph during time of the problem
Consider the below lab setup:
co642 (et49/1) <----> (et49/1) co643
Configs:
On co642:
interface Ethernet49/1
   no switchport
   ip address 10.0.0.1/24
!
router bgp 100
   neighbor 10.0.0.2 remote-as 100
   neighbor 10.0.0.2 bfd
   neighbor 10.0.0.2 maximum-routes 12000
On co643:
interface Ethernet49/1
   no switchport
   ip address 10.0.0.2/24
!
router bgp 100
   neighbor 10.0.0.1 remote-as 100
   neighbor 10.0.0.1 bfd
   neighbor 10.0.0.1 maximum-routes 12000
1) BFD goes down because of underlying physical interface going down:
Usually we will see that the protocol (let's take BGP as the example here) bounced. Find out the interface used for establishing the BGP session and check if it went down. Take a look at the output of "show logging" and look for the following:
Dec 6 20:28:10 co642 Bfd: %BFD-5-STATE_CHANGE: peer (vrf:default, ip:10.0.0.2, intf:Ethernet49/1, srcIp:0.0.0.0, type:normal) changed state from Down to Up diag None
Dec 6 20:57:26 co642 Rib: %BGP-3-NOTIFICATION: sent to neighbor 10.0.0.2 (AS 100) 6/6 (Cease/other configuration change <Hard Reset>) 0 bytes
Dec 6 20:57:26 co642 Rib: %BGP-BFD-STATE-CHANGE: peer 10.0.0.2 (AS 100) Up to Down
Dec 6 20:57:26 co642 Rib: %BGP-5-ADJCHANGE: peer 10.0.0.2 (AS 100) old state Established event Stop new state Idle
Dec 6 20:57:26 co642 Ebra: %LINEPROTO-5-UPDOWN: Line protocol on Interface Ethernet49/1, changed state to down
Dec 6 20:57:26 co642 Bfd: %BFD-5-STATE_CHANGE: peer (vrf:default, ip:10.0.0.2, intf:Ethernet49/1, srcIp:0.0.0.0, type:normal) changed state from Up to Down diag None
Here you can see that BFD went down (BFD-5-STATE_CHANGE), which brought down the BGP session (BGP-BFD-STATE-CHANGE), because BFD fallover was configured for BGP (neighbor a.b.c.d bfd). Et49/1 is the interface used for the BFD and BGP sessions, and it is the interface whose line protocol went down.
We need to follow the same steps for any protocol configured with BFD:
– find out the interface for which bfd went down from the “BFD-5-STATE_CHANGE” log message
– check if the protocol uses that interface for peering
– check if the interface went down/bounced
In such cases, check why the physical interface went down (Et49/1 in this case): take a look at "show interfaces phy detail" and "show interfaces mac detail", and do layer 1 troubleshooting steps as required.
Note: In some cases, we have seen that the BFD peering has an asymmetric path. Check the routing and the interfaces involved before troubleshooting.
In some cases the interface is up, but BFD still goes down. In such scenarios we need to check the control-plane state. The examples below demonstrate the troubleshooting steps.
2) If bfd packets are blocked by any ACLs
For example, in the read-only default-control-plane-acl, the following statement allows BFD packets:
30 permit udp any any eq bfd ttl eq 255 [match 6480 packets, 0:07:31 ago]
Check if there are any ACLs on the interfaces/protocols which prevent transmission or receipt of BFD packets.
3) Check if bfd packets are sent and received properly
To test this scenario, let's put an outbound ACL on co643 which blocks the BFD packets from being sent:
co643...08:43:36#show run int et 49/1
interface Ethernet49/1
   no switchport
   ip address 10.0.0.2/24
   bfd interval 300 min-rx 300 multiplier 3
   ip access-group test_bfd out
co643...08:43:41#show ip access-lists test_bfd
IP Access List test_bfd
        10 deny udp any any eq bfd
        20 permit ip any any
So co642 will not receive any BFD packets for 900 ms and will declare the session down.
co642...21:25:56(config)#show bfd peers hist
Peer Vrf default, Addr 10.0.0.2, Intf Ethernet49/1, State Down
Processed BFD session down event at Dec 06 21:25:20 2020
Last Up Dec 06 21:25:04 2020, Last Down NA
Last Diag: Detect Time Exp
Authentication mode: None
Shared-secret profile: None
TxInt: 1000, RxInt: 300, Detect Time: 900
Rx Count: 57, Tx Count: 59
SchedDelay: 1*TxInt: 51, 2*TxInt: 7, 3*TxInt: 0, GT 3*TxInt: 0
As you can see, the diagnostic is "Detect Time Exp".
This means that the switch did not receive any BFD keepalives for the duration of the detect time (900 ms here),
so it brings down the session. The peer is notified of this, and hence the diag on the peer is "Nbr Signaled Down":
co643...13:33:19#show bfd peers hist
Peer Vrf default, Addr 10.0.0.1, Intf Ethernet49/1, State Down
Processed Peer BFD down event at Dec 06 13:25:20 2020
Last Up Dec 06 13:25:04 2020, Last Down NA
Last Diag: Nbr Signaled Down
Authentication mode: None
Shared-secret profile: None
TxInt: 1000, RxInt: 1000, Detect Time: 3000
Rx Count: 0, Tx Count: 59
SchedDelay: 1*TxInt: 52, 2*TxInt: 6, 3*TxInt: 0, GT 3*TxInt: 0
So now we know that co642 is not receiving BFD keepalives from its peer. How do we prove this?
Since BFD keepalives are control-plane packets, a regular tcpdump helps in checking whether packets are sent and received properly. BFD control messages are transmitted to UDP port 3784.
So you can use the below command:
bash tcpdump -nvei <interface_number> udp port 3784
Saving the packets to a pcap and analyzing them in Wireshark is even better:
bash tcpdump -nvei <interface_number> udp port 3784 -w /mnt/flash/bfd.pcap
Sometimes you will see that we are receiving packets from the peer, but the session still went down.
In Wireshark you can check the time delta between packets. If the delta is more than the detect time, the session will go down.
Find out the latest packet from the peer before the session was brought down, right-click on it and set the packet as the time reference.
Now click on the packet at which the session was brought down (the control packet with "detection time expired" in this case) and expand the frame section.
You can see the delta since the reference frame is around 927 ms, which is more than the 900 ms detect time. Hence the BFD session was brought down.
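The same check can be done outside Wireshark once you have the packet arrival timestamps (e.g. exported from the capture). This is a hedged sketch with made-up timestamps that reproduce the 927 ms gap from the example above.

```python
# Hedged sketch: given BFD packet arrival times (ms), flag inter-packet
# gaps that exceed the Detect Time -- these are the gaps that flap the session.
def gaps_over_detect_time(rx_times_ms, detect_time_ms):
    return [
        (prev, cur, cur - prev)                      # (start, end, gap)
        for prev, cur in zip(rx_times_ms, rx_times_ms[1:])
        if cur - prev > detect_time_ms
    ]

# Illustrative timestamps: steady ~270 ms arrivals, then 927 ms of
# silence against a 900 ms detect time, as in the example above.
times = [0, 270, 545, 820, 1747]
print(gaps_over_detect_time(times, 900))   # -> [(820, 1747, 927)]
```

An empty result means no receive gap exceeded the detect time, so the flap cause lies elsewhere (e.g. local packet processing rather than the wire).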
4) What if the problem is intermittent and the BFD session goes down at unpredictable times?
Use the same methods as above, but set up a circular tcpdump:
tcpdump session test1 interface et1 file flash:/bfd.pcap filecount 10 max-file-size 100 filter port 3784
This creates a circular capture on the device of 10 files of 100 MB each. You can delete unnecessary files from the flash to make more space. After a BFD flap, take a look at the pcap files in flash for troubleshooting.
Note: always check the flash state (bash df -h) and confirm space is available before saving anything to flash.
5) BFD packets are sent properly but the session still goes down.
In such cases we need to check whether BFD packets are processed properly by the switch.
We have seen high-CPU scenarios where the CPU was not able to process the BFD packets in a timely manner.
– Check output of “show processes top” and see if any processes are consuming a lot of cpu.
– Check if any agents are crashing.
– If available, take a look at the cpu graph and check if bfd drops correlate to time of high cpu events.
– Check if there are any cpu drops:
switch#show cpu counters queue
Jericho0.0:
  CoPP Class          Queue     Pkts      Octets    DropPkts  DropOctets
  ----------------------------------------------------------------------
  Aggregate
    CoppSystemBfd     Et1       45        5535      20        2460
switch#show cpu counters queue
  Queue      Counters/pkts    Drops/pkts
  BFD        902550           0
On FM6000 platforms, the command is "show platform fm6000 counters cpu".
Check if there are any drops for rx/tx frames with priority_10 – frames trapped to the CPU (BPDUs, ICMP, FFU triggers, etc.).
BFD packets go into the Self_IP_TC6_7 CPU queue (priority_10), assuming the DSCP value is 48 or above. Any other traffic sent to the router IP address with a high DSCP value will also hit this queue (e.g. MLAG, OSPF unicast, etc.).
We should also check if there are any other CPU drops (CoPP drops), not just BFD CoPP drops. It is possible that a large amount of traffic is hitting the CPU and consuming resources, so other CPU-bound traffic (like BFD) is not processed in time.
Sometimes you might not see explicit drops, but high traffic to the CPU can still delay the processing of BFD packets. We have seen cases where high CPU-bound traffic caused BFD packets to go unprocessed, leading to flaps. For example:
– switches were receiving a continuous multicast flow for SSDP, causing Fastdrops on multiple interfaces. This traffic had a TTL value of 1, so it was destined for the CPU.
– CoPP drops like CoppSystemDefault, CoppSystemIpMcast and CoppSystemL2Bcast were seen
– BFD went down because the CPU was at 100% for some time and BFD packets were not processed
– Check if BFD has aggressive timers.
Aggressive timers lead to many BFD control packets being sent and received by the CPU, and it is possible that the CPU cannot process all these packets in time.
If there is a lot of traffic on the CPU, we can try to relax the BFD timers using the bfd interval command described above.
Devs suggested relaxing the timers to 300x3 (the default), and this is the recommended setting.
6) QOS settings:
The traffic class for BFD packets originated on the DUT differs based on platform and configuration:
Strata – traffic class 6
7280R/7500R – BFD over SVI – traffic class 6
7280R/7500R – BFD over Ethernet port or port channel – traffic class 7
By default, traffic class 6 maps to egress queue 6 and traffic class 7 maps to egress queue 7. This can be modified by QoS configuration.
There are 8 such queues (tx-queue 0 to tx-queue 7).
Queue 0 is the best-effort/lowest-priority queue and queue 7 is the highest. Since BFD sits in tx-queue 6 (or 7), it has a pretty high priority.
Strict-priority scheduling is the default setting: any traffic in higher-priority queues is serviced first, regardless of whether traffic is present in lower-priority queues.
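The strict-priority behavior can be illustrated with a small sketch: the highest non-empty tx-queue is always serviced first, so BFD in tx-queue 6 only ever waits behind queue 7. The queue contents below are illustrative, not real switch traffic.

```python
# Hedged sketch of strict-priority (SP) dequeue across tx-queues 0-7.
def strict_priority_dequeue(queues):
    """queues: dict of tx-queue number -> list of pending packets.
    Returns the order in which packets are serviced."""
    order = []
    while any(queues.values()):
        q = max(n for n, pkts in queues.items() if pkts)  # highest non-empty queue
        order.append((q, queues[q].pop(0)))
    return order

# BFD in tx-queue 6 is serviced before all the bulk traffic in queue 0.
queues = {7: ["netctrl"], 6: ["bfd1", "bfd2"], 0: ["bulk1", "bulk2"]}
print(strict_priority_dequeue(queues))
# -> [(7, 'netctrl'), (6, 'bfd1'), (6, 'bfd2'), (0, 'bulk1'), (0, 'bulk2')]
```

This is why non-default schemes (round-robin, or bandwidth carved out for other queues, as in the examples below from the original output) can starve tx-queue 6 and delay BFD.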
switch(config)#show qos int et 49/1
Ethernet49/1:
   Trust Mode: DSCP
   Default COS: 0
   Default DSCP: 0
   Port shaping rate: disabled
   Tx      Bandwidth    Shape Rate     Priority
   Queue   (percent)    (units)
   -----------------------------------------------------
   7       - / -        - / - ( - )    SP / SP
   6       - / -        - / - ( - )    SP / SP
   5       - / -        - / - ( - )    SP / SP
   4       - / -        - / - ( - )    SP / SP
   3       - / -        - / - ( - )    SP / SP
   2       - / -        - / - ( - )    SP / SP
   1       - / -        - / - ( - )    SP / SP
   0       - / -        - / - ( - )    SP / SP
   Note: Values are displayed as Operational/Configured
   Legend:
   RR -> Round Robin
   SP -> Strict Priority
   -  -> Not Applicable / Not Configured
   %  -> Percentage of line rate
Check if there are any qos settings which differ from this. For example, round-robin is used instead of strict priority queueing:
co643…10:59:22(config-if-Et49/1-txq-0-7)#show qos int et 49/1
   Tx      Bandwidth    Shape Rate     Priority
   Queue   (percent)    (units)
   -----------------------------------------------------
   7       16 / -       - / - ( - )    RR / RR
   6       12 / -       - / - ( - )    RR / RR
   5       12 / -       - / - ( - )    RR / RR
   4       12 / -       - / - ( - )    RR / RR
   3       12 / -       - / - ( - )    RR / RR
   2       12 / -       - / - ( - )    RR / RR
   1       12 / -       - / - ( - )    RR / RR
   0       12 / -       - / - ( - )    RR / RR
Or the full bandwidth has been assigned to other queues:
co643...11:04:29(config-if-Et49/1-txq-0-7)#tx-queue 2
co643...11:05:19(config-if-Et49/1-txq-2)#bandwidth percent 80
co643...11:05:27(config-if-Et49/1)#tx-queue 1
co643...11:05:31(config-if-Et49/1-txq-1)#bandwidth percent 20
co643...11:05:35(config-if-Et49/1-txq-1)#show qos int et 49/1
   Tx      Bandwidth    Shape Rate     Priority
   Queue   (percent)    (units)
   -----------------------------------------------------
   7       - / -        - / - ( - )    RR / RR
   6       - / -        - / - ( - )    RR / RR
   5       - / -        - / - ( - )    RR / RR
   4       - / -        - / - ( - )    RR / RR
   3       - / -        - / - ( - )    RR / RR
   2       80 / 80      - / - ( - )    RR / RR
   1       20 / 20      - / - ( - )    RR / RR
   0       - / -        - / - ( - )    RR / RR
If there are any QoS settings other than the default, make sure that tx-queue 6 is still being serviced.