Hi, I need help configuring BFD. I have an OSPF-based network and I’m looking to improve the convergence time after a link failure. Until now I have relied on the OSPF hello and dead timers but, after the upgrade to 4.5, I’m testing BFD.
I have configured BFD under /routing bfd interface and enabled BFD support on the OSPF interface; now I can see the BFD neighbor up in the corresponding menu.
However, if I disable a radio card between the two OSPF routers, I can immediately see the BFD state change, but OSPF still remains in Full state until the OSPF router-dead-time expires. Have I forgotten a configuration step?
Also, randomly and without any link failure (radio or Ethernet), I can see in the log BFD changing from up to down with reason “packet read timeout”.
I hope it is just a configuration error!
How are the links actually performing, specifically, is there any packet loss? Packet loss could cause BFD packets to be missed, causing BFD to report the neighbour down. No packet loss may occur if BFD is to operate properly!
Secondly, you should check your CPU load. High load may cause BFD to be delayed and neighbours will be declared down.
Already checked: CPU is at 0-5% on both routers (one is a 493AH, the other a 450G). As for packet loss, I have tried pushing and pulling data over the link while watching with Wireshark, and I cannot see any loss.
The radio signal is good, -69db with a CCQ from 90 to 100%.
But I see now that I have nstreme configured with “best fit”; could this be the problem? (Though I have the same problem on an Ethernet link.)
No, unfortunately, I’ve only done rudimentary testing on it on the MikroTik platform, so my advice was more based on my understanding of the protocol than my MikroTik experience.
Maybe you could paste your ospf and BFD configuration?
/routing ospf interface> :put [get 2 use-bfd ]
true
/routing bfd interface> pr
Flags: X - disabled, I - inactive
 #   INTERFACE       INTERVAL  MIN-RX  MULTIPLIER
 0   all             0.2sec    0.2sec  5
 1   2 - Mogliano    0.5sec    0.5sec  6
/routing bfd neighbor> pr
Flags: U - up
 #   STATE  ADDRESS        INTERFACE       PROTOCOLS  MULTIHOP
 0 U up     94.198.72.21   2 - Mogliano    ospf       no
10:39:02 route,ospf,info OSPFv2 neighbor 94.198.72.42: state change from Full to Down
10:39:05 bfd,debug BFD neighbor 94.198.72.21 on 2 - Mogliano changed state to DOWN
10:39:05 bfd,debug reason: packet read timeout
side 2: “Mogliano”
/routing ospf interface> :put [get 0 use-bfd ]
true
/routing bfd interface> pr
Flags: X - disabled, I - inactive
 #   INTERFACE       INTERVAL  MIN-RX  MULTIPLIER
 0   all             0.2sec    0.2sec  5
 1   3 - Cesen       0.5sec    0.5sec  6
/routing bfd neighbor> pr
Flags: U - up
 #   STATE  ADDRESS        INTERFACE       PROTOCOLS  MULTIHOP
 0 U up     94.198.72.22   3 - Cesen       ospf       no
15:23:45 route,ospf,info OSPFv2 neighbor 94.230.75.49: state change from Full to Down
15:23:54 bfd,debug removed BFD neighbor 94.198.72.22 on 3 - Cesen
15:23:54 bfd,debug BFD neighbor 94.198.72.22 on 3 - Cesen changed state to DOWN
15:23:54 bfd,debug reason: administrative change
The logs you have posted so far do not show any problems with the operation of the protocol.
On one end of the link an OSPF neighbor is lost (does it time out?) and the BFD neighbor is removed because of that (“reason: administrative change”).
On the other end of the link, the corresponding OSPF and BFD neighbors remain up until one of them times out. That happens to be the BFD neighbor (“reason: packet read timeout”), because the other end is no longer sending BFD messages.
Further investigation is required to see why the OSPF neighbors go down.
We experimented with this between x86-based routers and had OSPF and/or BGP sessions bouncing up and down constantly. Turn off BFD, and everything stabilizes. This was over short Fast Ethernet links with 0 packet loss.
So, we’ve concluded it’s just not ready yet. Alternatively, perhaps 0.2 seconds is just too fast for the current implementation, or the BFD process doesn’t have enough priority. We’ve noticed the CPU hits 100% during BGP table exchange (at least when you have filters), and we were exchanging tables with approximately 90,000 entries. I didn’t experiment with different timer values.
With the default settings, if I understand correctly, this means that either the sending router or the receiving router (or some combination) had to fail to send or detect BFD packets for a full second. (And if the BGP peer setup is hogging the CPU, the link capacity, or both, perhaps BFD just couldn’t get packets in edgewise, so to speak.)
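To make the arithmetic behind that “full second” explicit, here is a minimal sketch. It assumes the detection (hold) time is simply the interval multiplied by the multiplier, which matches the hold-time=1sec reported in this thread for a 0.2sec interval and multiplier 5. The function name is made up for illustration and is not a RouterOS API.

```python
def detection_time_ms(interval_ms: int, multiplier: int) -> int:
    """Time without received BFD packets before the neighbor is declared down.

    Assumption: hold time = interval * multiplier (hypothetical helper).
    """
    return interval_ms * multiplier


# Settings that appear in this thread: (interval in ms, multiplier)
settings = {
    "default 'all' entry": (200, 5),
    "per-interface entry": (500, 6),
}

for name, (interval_ms, multiplier) in settings.items():
    print(f"{name}: {detection_time_ms(interval_ms, multiplier)} ms")
```

Under this assumption, raising the multiplier tolerates more consecutive lost packets but lengthens the detection time in proportion.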
Yes, I have the same bouncing problem. In the lab everything works fine; on the real network (with some CPU load, bandwidth usage, ping latency, etc.) it is very unstable. I have also tried changing the default values, but it bounces with every setup!
However, they don’t have any other load on them, and the bgp session has a whopping 1 or 2 routes max, and they aren’t handling any other traffic. Now I need to figure out how to make them busy.
I’ve really considered writing some sort of scripted solution that disables BFD once the cpu load enters the 90% area.
If people’s experience is that BFD is unstable under real production load, I’m not going to turn it on yet.
MikroTik has replied to me with the suggestion to increase the multiplier if the link is heavily loaded. I will try that now but, if I increase the multiplier, the convergence time will increase too…
I added a tcp bandwidth test to my little test setup and left it running all night, which claims to have the CPU on both sides at 100%.
On the receiving side we have:
Flags: U - up
0 U state=up address=192.168.101.51 interface=ether1 protocols=bgp multihop=no
state-changes=1 uptime=4d3h26m50s desired-tx-interval=0.2sec
actual-tx-interval=0.2sec required-min-rx=0.2sec remote-min-rx=0.2sec
multiplier=5 hold-time=1sec packets-rx=2127635 packets-tx=2129848
and on the sending side we have:
Flags: U - up
0 U state=up address=192.168.101.50 interface=ether1 protocols=bgp multihop=no
state-changes=2 uptime=4d3h26m51s desired-tx-interval=0.2sec
actual-tx-interval=0.2sec required-min-rx=0.2sec remote-min-rx=0.2sec
multiplier=5 hold-time=1sec packets-rx=2129856 packets-tx=2127645
I take this to indicate that the side initiating the bandwidth test seems to be sending BFD packets at a slightly larger interval than the other side.
Interestingly, based on the uptime and the 0.2 sec interval, there should only have been 1792060 bfd packets sent by now, instead of the 2129961 actually sent, hmm… resulting in an actual interval of 168ms instead of 200ms.
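The same estimate can be redone from the counters in the two printouts above. This is a rough sketch that assumes the reported uptime covers the whole period in which packets-tx was counted; the helper names are illustrative only. The inputs are the uptime and packets-tx values exactly as pasted, so they differ slightly from the figures quoted in the text, which were presumably read a few minutes later.

```python
import re


def uptime_to_seconds(uptime: str) -> int:
    """Convert a RouterOS-style uptime such as '4d3h26m50s' to seconds."""
    units = {"w": 604800, "d": 86400, "h": 3600, "m": 60, "s": 1}
    return sum(int(n) * units[u] for n, u in re.findall(r"(\d+)([wdhms])", uptime))


def mean_tx_interval_ms(uptime: str, packets_tx: int) -> float:
    """Average spacing between transmitted BFD packets, in milliseconds."""
    return uptime_to_seconds(uptime) / packets_tx * 1000


# Receiving side and sending side, values copied from the printouts
print(round(mean_tx_interval_ms("4d3h26m50s", 2129848)))
print(round(mean_tx_interval_ms("4d3h26m51s", 2127645)))
```

Both sides come out at roughly 168 ms, with the sending side only a fraction of a millisecond slower on average.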
Well, anyway, even with this test, the bfd/bgp session didn’t drop, and the bandwidth test is averaging 76.4 mbits (over a 100 mbit connection). Unfortunately, I don’t have two idle x86-based routers to try this on, and perhaps the bandwidth tester isn’t enough load to cause the problem to show up.
x86 ones with huge bgp tables are where I saw the problem before.
If you do a script which disables it, it would have to do it on both sides. Upping the interval and/or multiplier is probably easier.
MichelePietravalle: also make sure you don’t drop or de-prioritize BFD packets in your firewall. You may want to write special rules to handle these packets separately. I think the ToS precedence class “Internetwork Control” (two highest ToS bits set) is assigned to them, as well as to other routing protocol packets; this can be used to distinguish them from other traffic.
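As a small worked example of the “two highest ToS bits set” remark, here is a sketch assuming BFD packets carry IP precedence 6 (Internetwork Control), as the post suggests but does not confirm for RouterOS. The helper names are invented for illustration.

```python
INTERNETWORK_CONTROL = 6  # IP precedence value, binary 110


def tos_byte_from_precedence(precedence: int) -> int:
    """Precedence occupies the top three bits of the 8-bit ToS field."""
    if not 0 <= precedence <= 7:
        raise ValueError("precedence must be in 0..7")
    return precedence << 5


def dscp_from_tos(tos: int) -> int:
    """DSCP is the top six bits of the same byte."""
    return tos >> 2


def has_two_highest_bits_set(tos: int) -> bool:
    """True when the two most significant bits of the ToS byte are both 1."""
    return (tos & 0xC0) == 0xC0


tos = tos_byte_from_precedence(INTERNETWORK_CONTROL)
print(hex(tos), dscp_from_tos(tos), has_two_highest_bits_set(tos))
```

Note that precedence 7 also has the two highest bits set, so a match on those bits alone catches both classes.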
“resulting in an actual interval of 168ms instead of 200ms.”
xxiii: this is because a small random jitter is subtracted from protocol timers, to avoid accidental unneeded synchronisation.
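To illustrate the effect of such jitter, here is a rough simulation. It assumes each transmit interval is reduced by a uniformly random 0 to 25%, which is what RFC 5880 prescribes for BFD control packets. Whether RouterOS uses exactly this distribution is an assumption; the 168ms observed above is a little lower than this model’s average, so the real implementation may subtract slightly more.

```python
import random


def jittered_intervals(nominal_ms: float, count: int,
                       max_jitter: float = 0.25, seed: int = 0):
    """Yield intervals reduced by a uniform random fraction in [0, max_jitter]."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    for _ in range(count):
        yield nominal_ms * (1.0 - rng.uniform(0.0, max_jitter))


samples = list(jittered_intervals(200.0, 100_000))
mean_ms = sum(samples) / len(samples)
print(f"mean interval: {mean_ms:.1f} ms")
```

With these assumptions every interval falls between 150 and 200 ms and the mean sits near 175 ms, so more packets are sent than a plain uptime/0.2 calculation predicts.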