Router forgets how to route after 3-4 days of uptime

Hello,

I have the following configuration:

2xMikrotik switches (ROS7; CRS310 + CRS309)
2xMikrotik routers (ROS7; RB5009 + Chateau)

Switches are purely switching, i.e. no L3+ is configured except the management interface.
Routers are routing with BGP and BFD, with a dedicated link between them to exchange routes.

The first router (Chateau) has about 38k routes in its only VRF; the second one (RB5009) has about 40 (just 40, not 40k).
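For context, the peering described above would look something like this in ROS7 (a sketch with invented addresses and AS numbers; exact parameter names can differ between 7.x releases, so verify against your version):

```routeros
# Hypothetical iBGP session between the two routers over their dedicated link,
# with BFD enabled for fast failure detection (addresses/ASN are placeholders)
/routing bgp connection add name=peer-chateau as=65000 local.role=ibgp \
    remote.address=10.255.0.1 remote.as=65000 use-bfd=yes
/routing bfd configuration add interfaces=all disabled=no
```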

And roughly once every 4 days the RB5009 gets stuck: /ip route print hangs with no output, the route list in WinBox is empty, while memory looks fine (about 700 MB free), disk space too, and the CPU is only 3-4% busy.
Something similar happens to the Chateau, but only once every 3-4 weeks (yes, the one carrying 38k routes).
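Since the visible symptom is /ip route print hanging, an external watchdog along these lines (a sketch; the probe command and the 10-second timeout are my assumptions) could at least timestamp when a router enters the bad state:

```shell
#!/bin/sh
# check_routes CMD... : run a probe command under a 10-second timeout and
# report whether the routing stack answered. Against a real router the probe
# would be something like: check_routes ssh rb5009 '/ip route print count-only'
check_routes() {
    if timeout 10 "$@" >/dev/null 2>&1; then
        echo "routes OK"
    else
        echo "route print hung or failed"
    fi
}

# demo with stand-in commands instead of a live router
check_routes true
check_routes false
```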

At the same time the switches keep working without any issues, so I suppose it’s something related to the dynamic routing configuration, since the two routers’ configurations are otherwise almost identical.

The only non-default option set on the routers is the CPU frequency: it is pinned to their maximum.
But the issue already occurred on 7.16.x in auto frequency mode and is still there on 7.17.2 (on the RB5009), so it doesn’t look related.

Any ideas where to dig next in that case?

Thanks.

PS. Since these routers are in production and need to stay working, it doesn’t seem like a great idea to enable routing debug for 4 days; that’s actually why I still have no logs. Let’s call it a last resort :slight_smile:

I had absolutely identical crazy behavior with routes about a year ago and only netinstall helped to solve the problem.

Wondering how this could help; do you mean reinstalling ROS from scratch?

The theory (not only in this case) is that when doing a normal upgrade, something in the configuration (invisible, or at least very difficult to find) can remain “sticky” and cause the problem.

Starting from fresh (and not restoring a backup, but rather re-creating the configuration from export .rsc) seems to be able to fix some of these issues.

Yes

Oh, and I forgot to add that in my case, after a few weeks of identical issues with routes, I started to have similar problems with files (/file print stopped working), problems with creating a supout (it froze at 2%), and even problems with rebooting (the router froze while rebooting; it was only possible to reboot it by replugging the power). So if you have such issues with routes, there is a chance that everything can become much worse…

netinstall + reset, and don’t restore your backup… use an .rsc export and recreate the config from scratch.
I had similar behavior on a CCR device too and did exactly that… and yes, with .rsc not everything is exported, e.g. device-mode, certificates, users…
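The netinstall + recreate cycle above might look roughly like this (file names are placeholders; the items to re-add by hand are the ones from the previous paragraph):

```routeros
# Before the wipe: full verbose export, then copy the file off the device
/export terse verbose file=full-config

# After netinstall, on the clean system: import the export
/import file-name=full-config.rsc

# Then re-create by hand what .rsc does not carry:
#   users        -> /user add name=... group=full
#   certificates -> /certificate import file-name=...
#   device-mode  -> /system/device-mode/print (and update if needed)
```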

The issue comes up if you use the backup config from one router on another.
If you want to make a copy of a router config, I strongly suggest you don’t use
the backup but export the config :slight_smile:

The backup copy includes the MAC addresses of the original device, and you then
have two routers with the same MAC addresses on the network, and hilarity follows
with routing.

If you do use the backup config to spawn a duplicate router, you must then go to
each interface and click “Reset MAC Address” on the WinBox tab. It’s down at the bottom
right, between “Blink” and “Reset Counters”. Sorry, I don’t know how to do it from the CLI.

I found all of that out the hard way, and had a network that flat-out refused to route even
stupidly simple stuff after a couple of days; you would look at it and wonder why it
didn’t work.

I won’t agree. In my case I wasn’t using any backups; the router had been working for years. And after some RouterOS update I started to have these problems with routes, and then even with basic functions like reboot. It’s not about the config; it’s more likely bad sectors in the places where the system files are located.

In my case I am also not migrating any configuration from one router to another. I have configuration backups but have never actually restored them; I just sometimes run /export terse verbose remotely over SSH and save the output to a file.

Just like this:

for host in "$@" ; do
    mkdir -p "${host}"
    ssh "${host}" "/export terse verbose" > "${host}/config.rsc"
done

So I suppose this can’t be causing any issues.

However, my case has become more frequent (over the last couple of days at least), so I enabled debug logging for everything. I’m still investigating what is happening, but for now the only thing that looks strange to me is the “Unsupported capability received” messages the router logs for the GoBGP peers (I currently only have MikroTik and GoBGP speakers; the MikroTik sessions don’t produce such messages, while the GoBGP ones do), with various codes:

# One sequence which looks like it happened during the same handshake
Starter {openOk: false} Unsupported capability received, code: 6
Starter {openOk: false} Unsupported capability received, code: 69
Starter {openOk: false} Unsupported capability received, code: 73
Starter {openOk: false} Unsupported capability received, code: 71
Starter {openOk: false} Unsupported capability received, code: 70

# Another one
Starter {openOk: false} Unsupported capability received, code: 73
Starter {openOk: false} Unsupported capability received, code: 5

Am I right in guessing that these codes are actually BGP capability codes: https://www.iana.org/assignments/capability-codes/capability-codes.xhtml ?
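Assuming they are IANA capability codes, the values in the log lines above would decode roughly like this (names copied from the IANA registry; treat the mapping as my reading, not anything MikroTik confirms):

```shell
#!/bin/sh
# Map the numeric codes from the log lines above to their names in the
# IANA "Capability Codes" registry.
cap_name() {
    case "$1" in
        5)  echo "Extended Next Hop Encoding" ;;
        6)  echo "BGP Extended Message" ;;
        69) echo "ADD-PATH" ;;
        70) echo "Enhanced Route Refresh" ;;
        71) echo "Long-Lived Graceful Restart" ;;
        73) echo "FQDN Capability" ;;
        *)  echo "unassigned/other" ;;
    esac
}

for code in 6 69 73 71 70 5; do
    printf '%s: %s\n' "$code" "$(cap_name "$code")"
done
```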

I also see pretty frequent route recalculations while no new routes are actually being received. Since the first portion of the “Unsupported capability” messages comes from my own software built on GoBGP as a library, I have a chance to perform some experiments and I actually know what it does. For now it looks like MikroTik does a recalc on every route pushed via BGP, without any check against the current routes and their prefixes, but I will dig deeper.

The second portion of unsupported capabilities (with codes 73 and 5 only) comes from Cilium, so it doesn’t look like we both did something wrong in the same way, at least :smiley:

My main theory for now is that improper BGP message handling causes recalcs which make the routing mechanism get stuck at some point. Anyway, no proof for now, just a theory. Will keep you posted, folks.

And thank you for your thoughts, they are really valuable.

PS. About netinstall: I understand it could help and it’s the fast way, but I prefer to know what goes wrong deep inside, so I’ll keep that option for when I become too lazy or too old for this stuff. Btw, one option to isolate this behaviour is to switch to GARP instead of BGP for all non-MikroTik peers. That looks possible in my case, but it will require some renumbering in some services; still, it doesn’t look insane, just a couple of DNS and IP address changes. This would exclude all the GoBGP stuff from the MikroTiks and leave only MikroTik peers plus static routing for the rest of my network setup.

UPD: Still investigating, gradually reducing the number of BGP connections.

On ROS 7.18.1 the issue is still present.

UPD: I had a week with heavier network load on the router impacted by the issue (RB5009) compared to all previous days, and during this time the issue didn’t appear. However, once the traffic was reduced to almost zero, it came back.

Here are some thoughts:

  1. it could (but need not) be a sign of some kind of race condition within RouterOS if the traffic caused CPU load (which it should do; this is why I’ve set a static CPU frequency on all of the MikroTik devices), so +1 to that theory
  2. today I’ve upgraded to 7.18.2
  3. keep switching services from BGP to GARP; since one-by-one takes too long, it’s probably time to switch them all (except the MikroTik-to-MikroTik route exchange)

UPD: some services were moved to VRRP, some I left on BGP, and I fixed the software causing the recalcs (it’s my own software on top of GoBGP).
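For reference, moving a service VIP from a BGP announcement to VRRP looks roughly like this in ROS7 (interface name, VRID and address are invented; check the parameters against your release):

```routeros
# Hypothetical VRRP instance carrying a formerly BGP-announced service address
/interface vrrp add name=vrrp-svc interface=ether2 vrid=10 priority=254
/ip address add address=192.0.2.10/24 interface=vrrp-svc
```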

This actually means two things:

  1. 7.19 includes some changes in BGP behaviour, so I’m interested in them, but I prefer to wait for it to become stable
  2. I’m now on 7.18.2, so because of pt. 1 I will probably stay on it for a couple of weeks to make this test relevant

The idea behind this:
Frequent route recalculations are the only thing that looks strange to me in the router logs, and at the same time I sometimes notice “no route to host” on BGP-announced /32s.
Meanwhile, I have another stand with Cisco routers where exactly the same software has worked great for half a year already, which makes me think there’s a bug in the recalc behaviour and/or a deadlock somewhere around it.

So, testing this. A couple of weeks should be enough.