BGP Upstream Failover

Posted: Fri Mar 06, 2015 4:19 pm
by gabrielpike
I had an issue with an upstream provider that I am peering BGP with. The problem was that they had a router failure beyond the BGP peering point so traffic was leaving my network through this peer but not going anywhere. Is there any way to control failover for BGP when the problem is beyond the peer point? My fix was to manually disable the port on the router until they had the problem fixed. I am looking for a solution to automate this process in the event of an upstream failure.

Re: BGP Upstream Failover

Posted: Fri Mar 06, 2015 8:11 pm
by IPANetEngineer
You can use netwatch to monitor an upstream IP and then tie it into a script that will enable / disable BGP peering based on the status of that upstream. Might be helpful to select an IP to monitor that is in the same provider network as the peering so it doesn't try to monitor via another peering.

Scripting example for enabling / disabling BGP peerings

http://forum.mikrotik.com/viewtopic.php?t=84800
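The decision logic behind a "monitor an upstream IP, then enable or disable the peering" script can be sketched in a few lines. This is only an illustration in Python, not the script from the linked thread and not RouterOS syntax: the class name `PeerGuard` and both thresholds are assumptions made for the example. It requires several consecutive failed probes before disabling the peer and several consecutive good ones before re-enabling it, so a single lost ping does not flap the session. It also assumes the probe is pinned to the monitored provider's link (for instance with a host route), otherwise the probe could succeed through the other peering, or never succeed again once the peer is disabled.

```python
from dataclasses import dataclass


@dataclass
class PeerGuard:
    """Tracks probe results and decides whether a BGP peer should stay enabled."""

    fail_threshold: int = 3   # consecutive failed probes before disabling (assumed value)
    ok_threshold: int = 5     # consecutive good probes before re-enabling (assumed value)
    peer_enabled: bool = True
    _fails: int = 0
    _oks: int = 0

    def observe(self, probe_ok: bool) -> bool:
        """Record one probe result and return the desired peer state."""
        if probe_ok:
            self._oks += 1
            self._fails = 0
            if not self.peer_enabled and self._oks >= self.ok_threshold:
                self.peer_enabled = True
        else:
            self._fails += 1
            self._oks = 0
            if self.peer_enabled and self._fails >= self.fail_threshold:
                self.peer_enabled = False
        return self.peer_enabled


if __name__ == "__main__":
    guard = PeerGuard()
    # One good probe, three failures, then five good probes in a row.
    for result in [True, False, False, False, True, True, True, True, True]:
        print(result, "->", "enabled" if guard.observe(result) else "disabled")
```

On a router, the two state changes would map to whatever command disables and enables the peer; the asymmetric thresholds are a design choice to make recovery slower than failover.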

Re: BGP Upstream Failover

Posted: Fri Mar 06, 2015 8:21 pm
by gabrielpike
When I searched I didn't find that thread.
Thanks :)

Re: BGP Upstream Failover

Posted: Fri Mar 06, 2015 8:38 pm
by IPANetEngineer
No problem... post back if you get it working. Good luck :)

Re: BGP Upstream Failover

Posted: Fri Mar 06, 2015 9:13 pm
by ZeroByte
gabrielpike wrote:
> My fix was to manually disable the port on the router until they had the problem fixed. I am looking for a solution to automate this process in the event of an upstream failure.

It sounds like you're only receiving a default route from your provider. If you have sufficient RAM, then you could ask for partial or full routes. If your BGP peer becomes isolated from the rest of its network, then you should see routes disappearing from that session, leaving other routes to take over automatically.

Your provider's router should not send a default route unconditionally, precisely because of situations like yours. It should have a way to know that it has been isolated from the Internet, but they may not be willing to fix their BGP design "just for one customer."

Having a full BGP table stops this from happening because if their router cannot reach destination x.x.0.0/16 anymore, then the route disappears from BGP and your second-best path takes over.
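To make the difference between a default-only feed and a full table concrete, here is a toy routing table in Python. It is a deliberately simplified sketch: the class `Rib`, the peer names `upstream_a` and `upstream_b`, and the documentation prefixes are all assumptions for illustration, and real BGP best-path selection has many more steps than the "longest prefix, then shortest AS path" rule used here.

```python
import ipaddress
from collections import defaultdict


class Rib:
    """Toy routing table: longest-prefix match, then shortest AS path."""

    def __init__(self):
        # network -> {peer name: AS path length}
        self._paths = defaultdict(dict)

    def announce(self, peer, prefix, as_path_len):
        self._paths[ipaddress.ip_network(prefix)][peer] = as_path_len

    def withdraw(self, peer, prefix):
        self._paths[ipaddress.ip_network(prefix)].pop(peer, None)

    def best_peer(self, destination):
        """Return the peer that traffic to `destination` would use, or None."""
        addr = ipaddress.ip_address(destination)
        candidates = [
            (net, paths) for net, paths in self._paths.items()
            if paths and addr in net
        ]
        if not candidates:
            return None
        # Most specific prefix wins first.
        _, paths = max(candidates, key=lambda c: c[0].prefixlen)
        # Among peers for that prefix, prefer the shortest AS path.
        return min(paths, key=lambda p: (paths[p], p))


if __name__ == "__main__":
    rib = Rib()
    rib.announce("upstream_a", "198.51.100.0/24", 2)
    rib.announce("upstream_b", "198.51.100.0/24", 3)
    print(rib.best_peer("198.51.100.10"))   # upstream_a
    rib.withdraw("upstream_a", "198.51.100.0/24")
    print(rib.best_peer("198.51.100.10"))   # upstream_b

    default_only = Rib()
    default_only.announce("upstream_a", "0.0.0.0/0", 1)
    # With only a default route there is nothing to withdraw per destination,
    # so traffic keeps going to upstream_a.
    print(default_only.best_peer("198.51.100.10"))   # upstream_a
```

The fallback only happens if the upstream actually withdraws the prefix; if it keeps advertising routes it can no longer reach, this table would keep choosing it.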

I just never did like the "ping + script" solution, because BGP is already designed to provide fault tolerance, and you shouldn't have to re-invent the wheel.

Re: BGP Upstream Failover

Posted: Fri Mar 06, 2015 9:40 pm
by gabrielpike
Here is the scenario. I was still receiving full BGP routes from the upstream, but their gateway was broken. I am load balancing across 2 BGP peers. From the customers' perspective, strange things happened, such as being able to reach Google but not YouTube. I have it set up so that if the direct peer drops, traffic fails over, but if the peering session stays up, BGP still shows their routes as active.
[Attachment: BGPScenario.jpg]

Re: BGP Upstream Failover

Posted: Fri Mar 06, 2015 10:46 pm
by IPANetEngineer
ZeroByte wrote:
> Having a full BGP table stops this from happening because if their router cannot reach destination x.x.0.0/16 anymore, then the route disappears from BGP and your second-best path takes over.
>
> I just never did like the "ping + script" solution, because BGP is already designed to provide fault tolerance, and you shouldn't have to re-invent the wheel.

While I agree that BGP does its job pretty well most of the time, there are plenty of examples of routes not being withdrawn when they should be, due to some bug or another in the provider's router. I've seen plenty of cases in Cisco/Juniper gear where the FIB doesn't release the route even though the routing table shows it withdrawn. CEF/ASIC bugs that cause route withdrawal issues also come up every now and then.

It really comes down to the requirements you have for availability - the higher the requirements, the more acceptable additional complexity becomes to meet that goal.

Re: BGP Upstream Failover

Posted: Fri Mar 06, 2015 11:01 pm
by ZeroByte
IPANetEngineer wrote:
> While I agree that BGP does its job pretty well most of the time, there are plenty of examples of routes not being withdrawn when they should be, due to some bug or another in the provider's router. I've seen plenty of cases in Cisco/Juniper gear where the FIB doesn't release the route even though the routing table shows it withdrawn. CEF/ASIC bugs that cause route withdrawal issues also come up every now and then.
>
> It really comes down to the requirements you have for availability - the higher the requirements, the more acceptable additional complexity becomes to meet that goal.

True enough. BGP failures of that nature are pretty rare in my experience, but they do happen.

I've seen times where a link was flapping pretty badly, but was up just enough for iBGP to keep the session alive, so the network still wanted to route across the problem area. BGP is definitely slower to converge than many other protocols.
I've never experienced that type of CEF/ASIC issue. That would suck.
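The flapping-link case is easy to see with a small model of the BGP hold timer. The sketch below is an assumption-laden illustration in Python: the function name `session_survives` and the 180-second hold time are chosen for the example (actual values are negotiated per session and vary by vendor). The session stays up as long as some KEEPALIVE or UPDATE arrives before the hold timer expires, no matter how many were lost in between.

```python
def session_survives(arrival_times, hold_time=180.0, start=0.0, end=None):
    """Return True if no gap between received messages exceeds hold_time.

    arrival_times: seconds at which a KEEPALIVE or UPDATE actually arrived.
    start: time the session was established.
    end: optional end of the observation window.
    """
    last = start
    for t in sorted(arrival_times):
        if t - last > hold_time:
            return False   # hold timer expired before this message arrived
        last = t
    if end is not None and end - last > hold_time:
        return False       # hold timer expired after the last message
    return True


if __name__ == "__main__":
    # Keepalives are sent every 60 s, but the lossy link only delivers a few.
    lossy_but_alive = [150, 320, 480]
    print(session_survives(lossy_but_alive, end=600))   # True

    # A slightly longer outage finally lets the hold timer expire.
    too_lossy = [150, 400]
    print(session_survives(too_lossy, end=600))         # False
```

In the first case most keepalives are lost, yet the session never drops, so routes across the bad link stay active.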