BGP Stability/Lock Ups

We are moving to BGP in the near future and are planning to use Mikrotik to receive full feeds from two providers. I noticed on the Motorola Canopy mailing list that a few people are running Mikrotik 4.11 with full feeds and having lock-ups at least monthly. Worse yet, the router does not reboot but rather freezes. So are there still issues with BGP?

Also, if we receive 100Mbps from one provider and 500Mbps from a second, is it possible with Mikrotik and BGP to try to balance them a bit?

We used 4.11 without full feeds and suffered from router lock-ups as well. 4.10 seems to be the most stable if you aren’t using OSPF.

Lockups could be related to OSPF, which is fixed in v4.13 and the latest v5RC.

As for load balancing, it is the same as with other manufacturers: you can get load balancing by playing with BGP attributes (generally AS-path prepend for your advertised networks, and local pref for received routes).
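To illustrate the idea, here is a minimal Python sketch of how those two attributes steer traffic. It is a deliberately simplified model, not RouterOS code: it only compares local pref and then AS-path length, and ignores the later tie-breakers of real BGP best-path selection. The `Route` class, the function names and the AS numbers (taken from the documentation range) are assumptions made up for the example.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class Route:
    upstream: str              # label for the provider the route came from
    local_pref: int            # higher is preferred (affects outbound traffic)
    as_path: Tuple[int, ...]   # shorter is preferred when local pref ties


def best_route(candidates: Iterable[Route]) -> Route:
    """Simplified selection: highest local pref, then shortest AS path."""
    return min(candidates, key=lambda r: (-r.local_pref, len(r.as_path)))


def prepend(as_path: Tuple[int, ...], own_as: int, times: int) -> Tuple[int, ...]:
    """Return the AS path as advertised after prepending our own AS `times` times."""
    return (own_as,) * times + tuple(as_path)


# Outbound: prefer the 500Mbps provider for received routes via local pref.
via_small = Route("100M", local_pref=100, as_path=(64500, 64510))
via_big = Route("500M", local_pref=200, as_path=(64501, 64502, 64510))
outbound_choice = best_route([via_small, via_big])

# Inbound: a remote network sees our prefix twice; we prepend on the 100Mbps
# link so that the path through it looks longer and is chosen less often.
OWN_AS = 64496
seen_via_small = Route("100M", 100, (64500,) + prepend((), OWN_AS, 3))
seen_via_big = Route("500M", 100, (64501,) + prepend((), OWN_AS, 1))
inbound_choice = best_route([seen_via_small, seen_via_big])
```

Note that prepending only influences what other networks choose; the split you actually get depends on their policies, so expect to adjust it by trial and error.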

Lockups could be related to OSPF, which is fixed in v4.13 and the latest v5RC.

Assuming you were writing the changelog, what would you have written regarding the fix for OSPF? I run OSPF and don’t see any fixes for OSPF in the changelogs, so I thought to myself, I don’t need to upgrade. Now I wonder if I should…

The same fixes as in RC2 were backported to v4.13; it looks like this one was not included in the changelog:
*) ospf - fixed crash when working with external LSA that contain
forwarding address;

How does this affect BGP?

We run RouterOS BGP to external AS with full tables. That router currently has an uptime of 80 days, on 4.10. However, we’ve pretty much isolated it from the rest of our network, and use static routes on it to the IGP routers. We would prefer that it would be a full participant in our bgp/ospf setup that we use in the rest of our network, but the aforementioned OSPF problem (apparent cause) prevents that.

Basically the following issues seem to apply:

SNMP and BGP don’t mix (possibly resolved now?)

BFD and non-adjacent BGP neighbors don’t mix (unknown status, but we also suspect that BFD, when used with OSPF, made the OSPF crashes more likely). If I recall correctly, the issue here is that BFD would take down the BGP peer, but the routes would remain in the routing table. I’m fuzzy on this one though.

OSPF crashes and, to answer your question, apparently takes the rest of routing with it. I’m not sure why this is. We only saw that /ip route print (and any other routing-related command) would hang, and after 2 or 3 minutes it would tell us to generate a supout and send it to support. We’ve been told it was OSPF, but we couldn’t tell. We are carefully putting 5.0rc3 in a few places to see if this is really resolved. Sometimes it would appear that OSPF had crashed and recovered (all neighbors have the same uptime, much smaller than the router uptime), but the routing would be inconsistent afterwards. This would usually show up as a traceroute bouncing back and forth between the problem router and its neighbor until TTL expiry. Also, the output of /ip route print and the route actually taken would not match.

A lot of those were intertwined (and we were using MPLS as well at one point), and it was very difficult to tell where the fault(s) actually were/are. Unfortunately, the new profile tool lumps all routing together as well (and doesn’t show memory). Perhaps there really aren’t separate processes for OSPF/BGP/other, which is why an OSPF crash kills BGP and routing in general?
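For anyone who wants to watch for that symptom automatically, here is a rough Python sketch that spots the "bouncing" pattern in a list of traceroute hops. It assumes you have already collected the hop addresses in order by some other means; the function name and the addresses (from the documentation range) are made up for the example.

```python
from typing import Optional, Sequence, Tuple


def find_ping_pong(hops: Sequence[str]) -> Optional[Tuple[str, str]]:
    """Return the first pair (a, b) seen as a, b, a, b in consecutive hops.

    Such an alternation is what a forwarding loop between two routers
    looks like in a traceroute. Returns None if no alternation is found.
    """
    for i in range(len(hops) - 3):
        a, b = hops[i], hops[i + 1]
        if a != b and hops[i + 2] == a and hops[i + 3] == b:
            return (a, b)
    return None


# Hypothetical trace: the packet reaches 192.0.2.2 and then bounces
# between it and its neighbor 192.0.2.3 until the TTL runs out.
trace = ["192.0.2.1", "192.0.2.2", "192.0.2.3",
         "192.0.2.2", "192.0.2.3", "192.0.2.2"]
loop = find_ping_pong(trace)
```

A hit tells you which two routers to compare, for instance by checking /ip route print against the path actually taken on each of them.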

There is a possible memory leak with BGP: the more memory your router has, the longer it will run before it crashes or needs a reboot. We started rebooting the router on purpose in the middle of the night every 3 months or so to deal with this (this might be fixed?). The monthly and yearly memory graphs show a steady upward trend in usage. This could be because of normal caching mechanisms or a memory leak, so I can’t really tell if this problem is fixed yet, although in 4.10 the graph appears to be growing at a slower rate. I suspect this is the issue the people on the Motorola list are having.
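If you would rather put a number on that trend than eyeball the graphs, a straight-line fit over periodic free-memory readings gives a rough estimate of when memory runs out. The Python sketch below assumes you log (day, free memory in KiB) samples yourself; the function names and the sample figures are invented for illustration and are not measurements from any router in this thread.

```python
from typing import List, Optional, Tuple

Sample = Tuple[float, float]  # (day number, free memory in KiB)


def fit_line(samples: List[Sample]) -> Tuple[float, float]:
    """Ordinary least-squares fit; returns (slope per day, intercept)."""
    n = len(samples)
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in samples)
    if sxx == 0:
        raise ValueError("need samples taken on at least two different days")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in samples)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def days_until_exhausted(samples: List[Sample],
                         floor_kib: float = 0.0) -> Optional[float]:
    """Days after the last sample until the fitted line reaches floor_kib.

    Returns None when free memory is flat or growing.
    """
    slope, intercept = fit_line(samples)
    if slope >= 0:
        return None
    hit_day = (floor_kib - intercept) / slope
    return max(0.0, hit_day - samples[-1][0])


# Invented readings: free memory dropping by 100000 KiB every 30 days.
readings = [(0, 1600000), (30, 1500000), (60, 1400000), (90, 1300000)]
remaining = days_until_exhausted(readings)
```

A downward slope alone does not tell a leak from ordinary caching, so this is only useful for scheduling a reboot window, not for proving the bug exists.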

With IPv6 and link-local addresses, BGP does not properly do recursive lookups.

BGP peer hold timer only works properly when set to the default value.

p.s. Yes, I’ve contacted support on these.

hci: OSPF, BGP and everything else are modules of the route application. When route crashes, the whole of routing goes down.

xxii: we are aware of your reported problems and we are working on new routing which will resolve all those issues.

version: RouterOS 4.9

We are also using BGP with full routes and partial table for metro access. We are currently receiving 2 full tables and 2 metro tables on each of our routers. We have 2 routers with vrrp enabled.

We are currently experiencing the following problems:

  1. BGP advertisements sometimes freeze, which is very bad, as we announce /32s towards our upstream for null routing. Even after the announcement has been removed from the configured networks, it sometimes still gets advertised. Even if we clear the session and resend, it still gets advertised, so the only option is to disable and re-enable the BGP peer sessions.
  2. When the primary router goes down, VRRP fails over with about 1 packet lost, but when the primary comes back online it cuts off all traffic for about 10-15 seconds, which sometimes breaks connections. Load is around 60% when the advertisements start.
  3. BGP over IPIP tunneling with Cisco only works with a TCP MSS of 1400 applied to all SYNs; BGP to Cisco over EoIP works OK.
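For context on the MSS figure in item 3, here is the arithmetic as a small Python sketch. It assumes a 1500-byte path MTU, IPv4 without options, and plain IPIP encapsulation (one extra 20-byte outer IPv4 header); the constant and function names are made up for the example.

```python
IPV4_HEADER = 20    # bytes, no IP options
TCP_HEADER = 20     # bytes, no TCP options
IPIP_OVERHEAD = 20  # bytes, the outer IPv4 header added by the tunnel


def max_tcp_mss(path_mtu: int, tunnel_overhead: int = 0) -> int:
    """Largest TCP payload per segment that avoids fragmentation."""
    return path_mtu - tunnel_overhead - IPV4_HEADER - TCP_HEADER


plain_mss = max_tcp_mss(1500)                 # no tunnel
ipip_mss = max_tcp_mss(1500, IPIP_OVERHEAD)   # through the IPIP tunnel
headroom = ipip_mss - 1400                    # margin left by the 1400 clamp
```

Under those assumptions the ceiling through the tunnel is 1440, so clamping to 1400 leaves 40 bytes of margin. That is consistent with an MTU or fragmentation problem somewhere on the tunnel path, but nothing in this thread confirms the cause.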

I don’t know about snmp & bgp problems, we are running both, on both production routers and they work just fine. Traffic flow is also working ok. We are using only static, egp and igp routing sometimes over bridges, vlans or tunnels but snmp has never been a problem.

We have only experienced 1 router lockup in 3 years; we are using gigabit links and the x86_64 architecture. The first two problems described are really annoying and should be investigated by the development team.

Is Mikrotik based on XORP or what? XORP does not seem to have had any updates released in over a year, as far as I can tell. We are looking to implement BGP in the next few months and I would really, really like to use Mikrotik, but I need to know that it’s stable.

No, it is not based on XORP, it is our own implementation.

I’ve been using Mikrotik BGP (peering with Cisco and Juniper) without issues since routing-test 3.28 came out (there were a lot of issues before).
Now I’m fine with 4.11 too, and 5.0rc1 has been working fine so far…
3 BGP peers with 2 full feeds each.
Nothing to complain about (finally).

Are you getting full feeds? What kind of hardware are you running on?

Yes, full feeds.
Those are common Supermicro boards.

/system resource> pr
                   uptime: 15w2d23h36m27s
                  version: "5.0beta4"
              free-memory: 1601032KiB
             total-memory: 1944092KiB
                      cpu: "Intel(R)"
                cpu-count: 2
            cpu-frequency: 3000MHz
                 cpu-load: 8%
           free-hdd-space: 207288KiB
          total-hdd-space: 242442KiB
  write-sect-since-reboot: 557684
         write-sect-total: 864154
        architecture-name: "x86"
               board-name: "x86"
                 platform: "MikroTik"



/system resource> pr
                   uptime: 37w2d23h41m9s
                  version: "4.5"
              free-memory: 1451572kB
             total-memory: 1946124kB
                      cpu: "Intel(R)"
                cpu-count: 1
            cpu-frequency: 3000MHz
                 cpu-load: 16
           free-hdd-space: 92196kB
          total-hdd-space: 122703kB
  write-sect-since-reboot: 22968
         write-sect-total: 52781786
        architecture-name: "x86"
               board-name: "x86"



 /system resource> pr
                   uptime: 34w2d5h13m2s
                  version: "4.6"
              free-memory: 749820kB
             total-memory: 1027284kB
                      cpu: "Intel(R)"
                cpu-count: 1
            cpu-frequency: 3200MHz
                 cpu-load: 31
           free-hdd-space: 450648kB
          total-hdd-space: 484630kB
  write-sect-since-reboot: 6383660
         write-sect-total: 6383660
        architecture-name: "x86"
               board-name: "x86"

We have a full v6 table and a partial v4 table loaded on 4.10, which is stable as a rock. The only times I’ve seen instability on 4.10, it has been down to the x86 hardware we were testing.

It’s been stable since 4.10: full table, two upstreams at 100Mbps each.

The winbox issue described here http://forum.mikrotik.com/t/passing-netbios-traffic/63/1
seems to have been at least reduced as of 4.13.

Otherwise not a single issue. We were never offline, not for a minute.

We’re running rc02, 15 days uptime, no problems. We had a lot of problems caused by the OSPF crash.