Looking for some advice:
We have two BGP routers, one primary and one backup, connected to two upstream providers that each offer only a default route.
We run OSPF internally and we’ve got a routing filter that sets the ‘distance’ of the backup router’s BGP default route to 200.
OSPF redistributes the default route around the network as expected, including to the backup BGP router. All is
good, and all of our traffic goes via the primary router.
When the primary router BGP connection fails, OSPF stops distributing that default route, and the backup router’s
BGP default route is marked active and is then advertised via OSPF - all good so far.
On the restoration of the primary BGP session, everything switches back - apart from the backup router. Its BGP
default route seems to remain active, despite neighbours offering it a lower-distance OSPF route.
Clearly I’m missing something fundamental, so what is stopping the backup router from learning the OSPF default route
and displacing the BGP one again?
No, you’re doing it right. Mikrotik’s OSPF implementation behaves incorrectly here. They’ve stated this is going to get fixed in RouterOS (ROS) version 7, where they’re re-implementing the routing engine.
The background is that a router which is originating the default route into OSPF should not install a default GW that it learns via OSPF. This is to prevent routing loops. So whenever the OSPF engine is actively advertising the default prefix, it ignores any default prefix that it hears from other OSPF routers. If the router is set to originate-always, then this is correct behavior. However, if the OSPF-learned default GW would have a lower administrative distance (AD) than whatever default route OSPF is currently using as its justification to advertise default-information, it should stop injecting default information and start using the OSPF-learned default GW instead.
It doesn’t do this, though. That’s how you’re ending up with a router making use of a default GW of higher distance than the OSPF default GW.
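To make the difference concrete, here is a rough Python model of the route selection, not anything that runs on the router. The distance of 200 is from your filter; the OSPF distance of 110, the gateway names and all function names are my assumptions for illustration only.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Route:
    source: str    # "bgp" or "ospf"
    gateway: str   # hypothetical gateway label
    distance: int  # administrative distance; lower wins


def expected_default(local_bgp: Optional[Route],
                     ospf_learned: Optional[Route]) -> Optional[Route]:
    """Expected behavior: both defaults compete and the lowest distance wins."""
    candidates = [r for r in (local_bgp, ospf_learned) if r is not None]
    return min(candidates, key=lambda r: r.distance, default=None)


def observed_default(local_bgp: Optional[Route],
                     ospf_learned: Optional[Route],
                     originating: bool) -> Optional[Route]:
    """Behavior described in this thread: while the router is originating
    the default into OSPF, the OSPF-learned default is ignored entirely."""
    if originating:
        ospf_learned = None
    return expected_default(local_bgp, ospf_learned)


# Backup router after the primary BGP session has come back.
backup_bgp = Route("bgp", "upstream-b", 200)         # distance from the filter
via_primary = Route("ospf", "primary-router", 110)   # assumed OSPF distance

print(expected_default(backup_bgp, via_primary).source)        # ospf
print(observed_default(backup_bgp, via_primary, True).source)  # bgp
```

The second print is the wedged state: the backup router keeps its distance-200 BGP default because the competing OSPF default never becomes a candidate.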
Unfortunately, the only workaround is to manually un-wedge it by disabling default-originate, letting the OSPF-learned default GW take over, and then re-enabling default-originate again ready for the next fail-over. I don’t think you can even script this, because I believe the default prefix from OSPF neighbors won’t show up in the routing table at all, not even as an inactive route. If you do see this route, then you could schedule a script to check for its presence and then bounce the local OSPF default-originate.
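If that route did turn out to be visible, the check such a scheduled script would need is roughly the following. This is only the decision logic, sketched in Python with made-up names; how you would read the values from the router is left out, and I am not claiming RouterOS exposes them.

```python
from typing import Optional


def should_bounce_originate(originating: bool,
                            active_distance: int,
                            ospf_default_distance: Optional[int]) -> bool:
    """Return True when default-originate should be disabled and re-enabled.

    ospf_default_distance is None when no OSPF-learned default is visible
    in the routing table, which is the case I suspect you will hit.
    """
    if not originating:
        return False  # nothing is wedged if we are not originating
    if ospf_default_distance is None:
        return False  # no visible OSPF default, so nothing to fall back to
    # Bounce only if the OSPF-learned default would beat the active one.
    return ospf_default_distance < active_distance


print(should_bounce_originate(True, 200, 110))   # True
print(should_bounce_originate(True, 200, None))  # False
```

The second case is the catch: with no visible OSPF default the check can never fire, so the script would sit there doing nothing.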
Thanks for confirming my suspicions - and yes, the ‘other’ default route never makes it into the routing table, so I can’t reliably script the process to switch it back - I can’t differentiate between a route that has genuinely failed and one that has since recovered.
We have alarm thresholds on our backup routes, so we’ll just have to manually monitor things and switch them back by hand.
Cheers
Marty