Wed Nov 03, 2010 8:00 pm
We run RouterOS BGP to an external AS with full tables. That router currently has an uptime of 80 days on 4.10. However, we've pretty much isolated it from the rest of our network, and it uses static routes to reach the IGP routers. We would prefer it to be a full participant in the BGP/OSPF setup that we use in the rest of our network, but the aforementioned OSPF problem (the apparent cause) prevents that.
Basically the following issues seem to apply:
SNMP and BGP don't mix (possibly resolved now?)
BFD and non-adjacent BGP neighbors don't mix (status unknown). We also suspect that BFD, when used with OSPF, made the OSPF crashes more likely. If I recall correctly, the issue with BGP is that BFD would take down the peer, but the routes would remain in the routing table. I'm fuzzy on this one though.
OSPF crashes and, to answer your question, apparently takes the rest of routing with it. I'm not sure why this is. All we saw was that /ip route print (and any other routing-related command) would hang, and after 2 or 3 minutes it would tell us to generate a supout and send it to support. We've been told it was OSPF, but we couldn't tell ourselves. We are carefully putting 5.0rc3 in a few places to see if this is really resolved. Sometimes it would appear that OSPF had crashed and recovered (all neighbors had the same uptime, much smaller than the router uptime), but the routing would be inconsistent afterwards. This would usually show up as a traceroute bouncing back and forth between the problem router and its neighbor until TTL expiry. Also, the output of /ip route print and the route actually taken would not match.
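For anyone who wants to watch for the bouncing traceroute automatically, here is a rough sketch in Python. It assumes you already have the responding hop addresses as a list, one entry per TTL, with None for a timeout. The function name and the input format are made up for illustration and nothing here is RouterOS-specific.

```python
def find_bounce(hops):
    """Return the pair of addresses a traceroute ends up alternating between.

    hops: responding address per TTL, in order (None for a timed-out hop).
    Returns (a, b) if the last four responding hops follow the pattern
    a, b, a, b with a != b, otherwise None.
    """
    # Ignore timeouts so a single lost probe does not hide the pattern.
    seen = [h for h in hops if h is not None]
    if len(seen) < 4:
        return None
    a, b, c, d = seen[-4:]
    if a == c and b == d and a != b:
        return (a, b)
    return None
```

Checking only the last four hops is a deliberate simplification: a loop that runs until TTL expiry always shows up at the tail of the trace.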
A lot of those were intertwined (and we were using MPLS as well at one point), and it was very difficult to tell where the fault(s) actually were/are. Unfortunately, the new profile tool lumps all routing together as well, and doesn't show memory. Perhaps there really aren't separate processes for OSPF/BGP/other, which would explain why an OSPF crash kills BGP and routing in general?
There is a possible memory leak with BGP: the more memory your router has, the longer it will run before it crashes or needs a reboot. We started rebooting the router on purpose in the middle of the night every 3 months or so to deal with this (this might be fixed?). The monthly and yearly memory graphs show a steady upward trend. This could be because of normal caching mechanisms or a memory leak, so I can't really tell if this problem is fixed yet, although in 4.10 the graph appears to be growing at a slower rate. I suspect this is the issue the people on the Motorola list are having.
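One rough way to tell the two apart, if you can export the memory-used samples from the graphs, is to compare the growth rate early and late in the window: a cache tends to level off, a leak keeps climbing. The sketch below is only a heuristic under that assumption. The function names and the 0.25 threshold are my own guesses, not anything MikroTik provides, and it wants at least four evenly spaced samples.

```python
def slope(samples):
    """Least-squares slope of evenly spaced samples (units per sample interval)."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    num = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(samples))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den


def looks_like_leak(samples, flat_ratio=0.25):
    """Heuristic: True if growth in the second half has not flattened out.

    flat_ratio is an assumed threshold: the late slope must exceed this
    fraction of the early slope for the trend to count as a leak.
    """
    half = len(samples) // 2
    early = slope(samples[:half])
    late = slope(samples[half:])
    return early > 0 and late > flat_ratio * early
```

A slower but still steady climb, like what we see on 4.10, would still come out as a leak here, which matches my suspicion but proves nothing.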
With IPv6 and link-local addresses, BGP does not properly do recursive lookups.
BGP peer hold timer only works properly when set to the default value.
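If you want to test a non-default hold timer yourself, it helps to know what value should end up in effect. Per RFC 4271, the session uses the smaller of the two hold times offered in the OPEN messages, a hold time of 1 or 2 seconds is not allowed, and one third of the hold time is the suggested keepalive interval. The sketch below shows that standard behavior only. The function names are made up, and I am not claiming this is how RouterOS implements it.

```python
def negotiate_hold_time(local, remote):
    """Hold time (seconds) a BGP session should use, per RFC 4271.

    The session uses the smaller of the two offered values. Zero means
    the hold timer and keepalives are disabled.
    """
    hold = min(local, remote)
    if hold != 0 and hold < 3:
        raise ValueError("hold time must be 0 or at least 3 seconds")
    return hold


def keepalive_interval(hold):
    """Suggested keepalive interval: one third of the hold time (0 = none)."""
    return hold // 3
```

For example, if one side offers 180 and the other 90, both should run with a 90 second hold time and send keepalives roughly every 30 seconds.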
p.s. Yes, I've contacted support on these.