If just a single full table with 170k routes can exceed MT’s routing capabilities, there must be a design flaw somewhere.
Exactly. I have emailed support with many supout files and asked them to take a look and see whether it can be optimized. I have a good feeling they will figure out what’s wrong and fix it. I don’t think it’s a bug; I think it’s simply optimization that needs to be done. It is also a routing-test package, so it’s still beta at this point. I have full confidence in these people at Mikrotik!
So, the status of our BGP: we’ve got a router upgraded to 2.9.6 with the routing-test package and connections to both Cogent and Level3 via 100 Mbps pipes. BGP is accepting routes from both providers. We started with full transit routes, and within a few minutes I realized it was impossible, so I had both providers start sending customer routes only. Cogent is sending about 12,500 routes and Level3 about 75,000.
Cogent’s route set is small enough that Mikrotik handles it with no problem: it takes about 30 seconds to sync, and the CPU goes back to idle after sitting at 100% during the sync. The Level3 peer starts up and after about 3 minutes seems to just spin out of control: BGP sessions start terminating, hold timers expire, etc. The CPU is pegged at 100% the entire time. If I run /ip route print count-only during the sync, I see the count crawling up, and once it gets over about 20,000 it seems to die a slow death.
We turned off the Cogent peering to see if the router could handle just the Level3 session, but no luck. The only way I can get the router under control is to filter out the routes as they come in. I believe that in the beginning we actually had the filter accepting them with a routing-mark and that worked, I think because there weren’t overlapping routes making it work hard to decide where to send things.
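As a rough sanity check on those numbers, here is a back-of-envelope sketch in Python. It assumes (and this is only an assumption) that sync time scales linearly with route count at the rate seen on the Cogent session; the function name is made up for illustration. The 3-minute figure is the hold-time=3m from the peer config further down.

```python
# Back-of-envelope only: assumes sync time scales linearly with route count,
# which the behaviour described in the post suggests is optimistic.

def estimated_sync_seconds(routes, ref_routes=12_500, ref_seconds=30):
    """Scale the observed Cogent sync (about 12,500 routes in about 30 s)."""
    return routes * ref_seconds / ref_routes


HOLD_TIME_SECONDS = 3 * 60  # hold-time=3m from the peer config

level3_estimate = estimated_sync_seconds(75_000)       # Level3 customer routes
full_table_estimate = estimated_sync_seconds(170_000)  # full table, per the quote

print(f"Observed Cogent rate: {12_500 / 30:.0f} routes/s")
print(f"Level3 estimate: {level3_estimate:.0f} s (hold time {HOLD_TIME_SECONDS} s)")
print(f"Full-table estimate: {full_table_estimate:.0f} s")
```

Under that assumption the Level3 sync alone would need about 180 seconds, which is right at the 3-minute hold time. If the CPU is pegged for that long, it could plausibly be starving keepalive handling, though that is a guess rather than a diagnosis.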
I think the problem stems from the way RouterOS injects routes into the table. I noticed that when I filter out all incoming routes, the BGP session takes about 5 seconds to sync up and it’s done; I see a huge 5-10 Mbps download while it syncs. Without the filter I see the routes trickle in, because it seems to be applying and processing each route as it comes in.
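To make that hunch concrete, here is a toy model in Python. It is purely illustrative and is not how RouterOS works internally (I have no idea what it does under the hood); the class and method names are made up. It only counts how often an expensive "recompute" step would run if it fired after every received route versus once per batch.

```python
# Toy model only: this is NOT how RouterOS works internally. It just counts
# how often an expensive "recompute" step runs under two update strategies.

class ToyRouteTable:
    def __init__(self):
        self.routes = {}      # prefix -> next hop
        self.active = []      # stand-in for the installed forwarding table
        self.recomputes = 0   # number of full recomputations performed

    def _recompute(self):
        # Stand-in for whatever expensive work follows a table change.
        self.active = sorted(self.routes)
        self.recomputes += 1

    def apply_per_route(self, updates):
        """Recompute after every single received route."""
        for prefix, next_hop in updates:
            self.routes[prefix] = next_hop
            self._recompute()

    def apply_batched(self, updates, batch_size):
        """Recompute once per batch of received routes."""
        pending = 0
        for prefix, next_hop in updates:
            self.routes[prefix] = next_hop
            pending += 1
            if pending == batch_size:
                self._recompute()
                pending = 0
        if pending:
            self._recompute()  # flush the final partial batch


# 2,000 made-up /24 prefixes with a made-up next hop (documentation address).
updates = [(f"10.{i // 256}.{i % 256}.0/24", "192.0.2.1") for i in range(2000)]

per_route = ToyRouteTable()
per_route.apply_per_route(updates)

batched = ToyRouteTable()
batched.apply_batched(updates, batch_size=500)

print(per_route.recomputes, batched.recomputes)
```

Both tables end up with the same contents, but the per-route version does 2,000 recomputations against 4 for the batched one. If something like this is going on, the cost grows with the size of the table on every single update, which would fit the slow death past about 20,000 routes.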
Here is our config:
/ routing filter
add chain=cogent-a prefix=38.XX.XX.0/24 prefix-length=24 action=accept \
comment="" disabled=no
add chain=cogent-a prefix=38.101.160.XXX prefix-length=32 action=accept \
comment="" disabled=no
add chain=cogent-a prefix=0.0.0.0/0 action=discard comment="" disabled=no
add chain=cogent-b prefix=0.0.0.0/0 action=discard comment="" disabled=no
add chain=cogent-in set-nexthop=38.XX.XXX.1 comment="" disabled=no
The above is required for Cogent’s peer A & B setup. The 38.101.x.x address is their loopback IP required to get peer B up and running.
add chain=level3-in action=discard comment="" disabled=no
add chain=level3-out prefix=38.XX.XX.0/24 prefix-length=24 action=accept \
comment="" disabled=no
add chain=level3-out prefix=0.0.0.0/0 action=discard comment="" disabled=no
Notice that we ‘action=discard’ all routes coming in from Level3. If we simply change that to ‘accept’ or anything else, it craps out and the router sits at 100% CPU for hours.
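For anyone not used to these filter chains, here is a small Python sketch of how I read them: rules are checked top to bottom and the first match decides. The matching semantics are my assumption, not something taken from the RouterOS docs: a prefix matches any route inside it, prefix-length must match exactly, and a route that matches no rule is accepted. 192.0.2.0/24 is a documentation prefix standing in for our masked 38.XX.XX.0/24, and all names in the code are made up.

```python
import ipaddress
from dataclasses import dataclass
from typing import Optional


@dataclass
class Rule:
    action: str                          # "accept" or "discard"
    prefix: Optional[str] = None         # None means "any prefix"
    prefix_length: Optional[int] = None  # None means "any length"


def rule_matches(rule, route):
    # ASSUMPTION: a rule prefix matches every route contained in it.
    if rule.prefix is not None:
        if not route.subnet_of(ipaddress.ip_network(rule.prefix)):
            return False
    # ASSUMPTION: prefix-length, when given, must match exactly.
    if rule.prefix_length is not None and route.prefixlen != rule.prefix_length:
        return False
    return True


def evaluate(chain, route_str, default="accept"):
    """Return the action of the first matching rule (first match wins)."""
    route = ipaddress.ip_network(route_str)
    for rule in chain:
        if rule_matches(rule, route):
            return rule.action
    return default  # ASSUMPTION: unmatched routes are accepted


# Mirrors the shape of level3-out and level3-in from the config.
level3_out = [
    Rule("accept", prefix="192.0.2.0/24", prefix_length=24),
    Rule("discard", prefix="0.0.0.0/0"),
]
level3_in = [Rule("discard")]

print(evaluate(level3_out, "192.0.2.0/24"))   # our own prefix is announced
print(evaluate(level3_out, "10.1.0.0/16"))    # anything else is held back
print(evaluate(level3_in, "198.51.100.0/24")) # everything inbound is dropped
```

With the inbound chain reduced to a single discard rule, nothing Level3 sends ever reaches the routing table, which is why the session syncs in seconds.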
/ routing bgp instance
set default as=19557 router-id=38.XX.XX.2 redistribute-static=no \
redistribute-connected=yes redistribute-rip=no redistribute-ospf=no \
redistribute-other-bgp=no name="default" out-filter="" disabled=no
add as=19557 router-id=XX.XX.XX.18 redistribute-static=no \
redistribute-connected=yes redistribute-rip=no redistribute-ospf=no \
redistribute-other-bgp=no name="level3" out-filter="" disabled=no
We set up 2 instances because there might have been problems using the same router-id with both peers. If set to 0.0.0.0 it’s supposed to figure it out, but I was being cautious and forced the right IP for each peer. I didn’t want to create a peering session using the other provider’s pipe … and I think it might have only done that because we use ECMP on the way back out.
/ routing bgp peer
add remote-address=38.XX.XX.1 remote-as=174 multihop=no in-filter="" \
out-filter=cogent-a keepalive-time=0s hold-time=3m ttl=60 disabled=no
add remote-address=38.101.160.XXX remote-as=174 multihop=yes \
in-filter=cogent-in out-filter=cogent-b keepalive-time=0s hold-time=3m \
ttl=6 disabled=no
add remote-address=63.210.XX.XX remote-as=3356 multihop=no \
in-filter=level3-in out-filter=level3-out keepalive-time=0s hold-time=3m \
instance=level3 ttl=60 disabled=no
Cogent has 2 peering sessions, one to send routes and one to accept routes. Level3 does it all within a single peering session.
Anyhow, I think Mikrotik will (hopefully) improve their routing table injection process and all will be good. We now plan on splitting our border routers up into 2, 1 for each connection … I’m hoping that we don’t have to use Crisco or other, as we really have faith in RouterOS to do the job well. I have been looking into NAPI with the Intel cards, hoping it’s just part of the OS already and is being used. I won’t really know until we get a dev environment set up and can actually test pps ratings.
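For reference when we do get to pps testing, here is a quick Python sketch of the theoretical ceiling on a 100 Mbps Ethernet link. It uses the standard per-frame overhead of 20 bytes on the wire (preamble plus inter-frame gap); the function name is made up, and what the box can actually forward is exactly what still needs measuring.

```python
ETHERNET_OVERHEAD_BYTES = 20  # 8 bytes preamble/SFD + 12 bytes inter-frame gap


def max_pps(link_bps, frame_bytes):
    """Theoretical packets per second at line rate for a given frame size."""
    bits_on_wire = (frame_bytes + ETHERNET_OVERHEAD_BYTES) * 8
    return link_bps / bits_on_wire


LINK_BPS = 100_000_000  # one 100 Mbps pipe

small_frame_pps = max_pps(LINK_BPS, 64)    # minimum-size Ethernet frames
large_frame_pps = max_pps(LINK_BPS, 1518)  # maximum-size untagged frames

print(f"64-byte frames:   {small_frame_pps:,.0f} pps")
print(f"1518-byte frames: {large_frame_pps:,.0f} pps")
```

So the worst case per pipe is a little under 150,000 pps with minimum-size frames, and real traffic will sit somewhere well below that.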
Thx
Sam