Hi All,
Previously I found viewtopic.php?t=78327. It made zero sense to me, but we decided that the resolution suggested there, disabling MD5 authentication, was worth a shot. After double-checking that our mitigations for "directly attached attackers" were in place and operational, we disabled OSPF MD5 authentication. This did not resolve the issue in any way, so we suspect authentication was never the real issue to begin with.
The same page suggests that the issue is packet loss. None of the routers involved report any loss of adjacency. The specific network segment consists of 2 x (48x10G + 4x40G) switches, paired and running vPC (LACP) back to the routers where possible, or 10G active + (2 x 1G LACP) standby where not (ie, only 1 x 10G port available). We also don't see any loss with echo request/response frames: I've just flooded just under 10m such requests at 1ms intervals to a router that, by happy accident, dropped out of OSPF during the same period, and got 0% loss.
We get between 20 and 25 of these drop-outs per day, per affected router. Detailed symptoms:
* Other routers on the same L2 segment drop the routes originated by the MT in question for (sometimes extended, eg, half an hour+) periods of time.
* Routes originated elsewhere and advertised to the MT are not affected (ie, the MT retains its routes to the rest of the network).
The routers involved on this segment (interface addresses and DR selection priorities):
172.31.255.1 - FRR 8.2.2 - priority 200 - dropped below at 08:15:33
172.31.255.2 - FRR 8.2.2 - priority 200 (current DR) - dropped below at 08:15:42
172.31.255.3 - RouterOS 6.48.4 (to be upgraded in the next 48h) - priority 1
172.31.255.5 - RouterOS 6.49.6 (current BDR, upgraded this morning) - priority 1
What I do spot in the logs leading up to the deletion of routes is quite a bit of "Skipping flooding: from DR or BDR". That makes sense, since we only want to flood updates back into the network if the router is the DR or the BDR. But I suspect it also means the router isn't refreshing its routes to the DR and BDR regularly enough. I'm not even sure this is a consideration, but my guess is that if the MT doesn't let the DR and BDR know from time to time that its routes are still valid, they will get dropped every so often.
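To make that guess concrete, here is a rough sketch of the ageing rule as I understand it. The two constants are the architectural values from RFC 2328 (LSRefreshTime and MaxAge); the function name and the sample refresh times are made up for illustration, and this is not a claim about how FRR or RouterOS implement it.

```python
# Architectural constants from RFC 2328, Appendix B.
LS_REFRESH_TIME = 1800  # seconds: originator re-floods its own LSA this often
MAX_AGE = 3600          # seconds: an LSA reaching this age is flushed


def lsa_present(refresh_times, now):
    """Return True if a peer would still hold the LSA at time `now`.

    `refresh_times` are hypothetical times (in seconds) at which a fresh
    copy of the LSA actually reached the peer.
    """
    received = [t for t in refresh_times if t <= now]
    if not received:
        return False
    age = now - max(received)
    return age < MAX_AGE


# Healthy originator: a refresh arrives every LS_REFRESH_TIME seconds.
healthy = [0, LS_REFRESH_TIME, 2 * LS_REFRESH_TIME, 3 * LS_REFRESH_TIME]
# Hypothetical faulty case: nothing after the initial copy reaches the peer.
faulty = [0]

for t in (1000, 3599, 3600, 5000):
    print(t, lsa_present(healthy, t), lsa_present(faulty, t))
```

Under that assumption the routes would disappear about an hour after the last refresh that made it through. Whether that is what is happening here is exactly what I can't confirm.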
What bugs me is that since the upgrade this morning the rebooted router has been stable, while in the same time (just over 4 hours) we've had 4 outages from the node that was not rebooted (logs above for the first of these). This can, from what I can tell, have one of two causes:
1. A reboot fixes whatever the underlying problem is for a while and it will return; or
2. If a router is a DR or BDR it floods whenever LSAs are made, resulting in its own routes also effectively being refreshed on peers.
I'm not familiar enough with the OSPF protocol to be able to confirm or reject either hypothesis.
Resulting problems:
* iBGP should normally connect to loopback addresses - if OSPF fails, loopbacks fail, iBGP fails, network as a whole fails. We've had to update iBGP to use interface addresses, which turns link-layer failures into routing failures, but these are far less frequent than the OSPF failures currently (a handful of failures per year compared to near hourly OSPF drop-outs).
* Sub-optimal internal routing (eg, traffic will follow a BGP announced /21 to a route reflector instead of the more specific /28 from OSPF to a different router). This just adds extra latency, not a trainsmash, as the next hop (which is not an MT) will redirect.
* Non-functional routing for directly attached (connected routes) originated from Mikrotik routers. (similar problem to loopbacks, fortunately these destination networks are generally intended for routing EGP so it's 99% of the time not a blocking issue since mostly only the router itself needs to be able to access these destinations).
The first of these is a network killer. Yes, we can work around by routing to interface addresses, but this to a large degree negates the point of having redundancy in your network.
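For anyone not following the sub-optimal routing point above, a minimal sketch of longest-prefix matching shows the effect. The prefixes and next-hop labels are hypothetical stand-ins, not our real addressing, and `best_route` is just an illustrative helper.

```python
import ipaddress


def best_route(table, destination):
    """Longest-prefix match: return the most specific route covering destination."""
    dest = ipaddress.ip_address(destination)
    candidates = [(net, via) for net, via in table if dest in net]
    if not candidates:
        return None
    return max(candidates, key=lambda entry: entry[0].prefixlen)


# Hypothetical prefixes and next hops, for illustration only.
bgp_21 = (ipaddress.ip_network("10.20.0.0/21"), "route reflector (BGP)")
ospf_28 = (ipaddress.ip_network("10.20.3.16/28"), "owning router (OSPF)")

with_ospf = [bgp_21, ospf_28]   # normal case: both routes installed
without_ospf = [bgp_21]         # the /28 has dropped out of OSPF

print(best_route(with_ospf, "10.20.3.20"))     # picks the /28
print(best_route(without_ospf, "10.20.3.20"))  # falls back to the /21
```

While the /28 is missing, traffic takes the detour via whoever announces the /21, which is where the extra latency comes from.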
Happy to create a pcap of all traffic on the FRR side (much lower performance impact), and I can provide the Mikrotik raw OSPF logs for the above directly to Mikrotik Support (cannot post these publicly, but happy to discuss and test).