Feature Request: OSPF Cost Changes Without Adjacency Loss

Say you have a ring of 5 routers, A-B-C-D-E-A, and the OSPF costs dictate that traffic from D to A travels East around the ring (i.e., through E). It comes time to do maintenance on the link between D and E, but the link is traffic-bearing. No big deal, right? Just change the costs so that traffic starts travelling West (i.e., through C and B).
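To make the path selection concrete, here's a toy SPF calculation (plain Python, not anything RouterOS actually runs; link costs are illustrative) showing how raising the D-E cost swings D→A traffic from East to West:

```python
import heapq

def shortest_path(edges, src, dst):
    """Dijkstra over an undirected graph given as {(u, v): cost}."""
    adj = {}
    for (u, v), c in edges.items():
        adj.setdefault(u, []).append((v, c))
        adj.setdefault(v, []).append((u, c))
    dist = {src: 0}
    prev = {}
    pq = [(0, src)]
    while pq:
        d, u = heapq.heappop(pq)
        if u == dst:
            break
        if d > dist.get(u, float('inf')):
            continue
        for v, c in adj[u]:
            nd = d + c
            if nd < dist.get(v, float('inf')):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(pq, (nd, v))
    # Walk predecessors back from dst to recover the path
    path = [dst]
    while path[-1] != src:
        path.append(prev[path[-1]])
    return list(reversed(path))

# Ring A-B-C-D-E-A with an equal cost of 10 on every link
edges = {('A','B'): 10, ('B','C'): 10, ('C','D'): 10,
         ('D','E'): 10, ('E','A'): 10}
print(shortest_path(edges, 'D', 'A'))   # ['D', 'E', 'A']  (East)

# Raise the D-E cost before maintenance: traffic swings West
edges[('D','E')] = 100
print(shortest_path(edges, 'D', 'A'))   # ['D', 'C', 'B', 'A']  (West)
```

The whole point of the feature request is that this recalculation is all that *should* happen on a cost change; no adjacency needs to move.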

Except that when you change the cost on D’s interface to E, it tears down the adjacency, causing all of D’s routes to be withdrawn for the duration of the hello interval on the adjacency between D and C (when D reannounces its routes to C). Traffic bound for customer subnets is suddenly rejected with ICMP Unreachable, and you’ve got yourself a 5-10 second outage. (Wait, isn’t this why we built the ring - so we wouldn’t have outages?)

In Ciscoland, one would simply increase the cost on E’s interface to D, and D’s interface to E, and the next update would have new costs that would cause the FIBs to be updated.
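For reference, on IOS that change is a per-interface command (interface name and cost value here are illustrative, not from the real network):

```
! On D, the interface facing E
interface GigabitEthernet0/1
 ip ospf cost 1000
! Repeat on E's interface facing D. The routers flood an LS Update
! carrying the new cost; SPF reruns and FIBs update, but the
! adjacency stays Full the whole time.
```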

Either the OSPF daemon under the hood is incapable of being reconfigured on the fly, or the calls that are being made do not take advantage of its ability to do so. The ideal fix is to resolve this and make it possible to change OSPF costs without tearing down adjacencies.

Failing that, if one could influence the cost of OSPF announcements with filters (i.e., if announcement interface = X, increment the cost of those announcements by Y), that would be an acceptable workaround. This is a little hacky, as it means the filter is meddling with the OSPF database, but I can live with a kludge much more easily than I can live with ICMP Unreachables to customer subnets.

We work around this today with a script that generates static routes; basically turning off our IGP so that we can turn a knob. I don’t have to tell you how much this approach sucks. :slight_smile:
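A minimal sketch of that kind of kludge (prefixes, gateways, and the comment tag are made up; the idea is that a static route with distance=5 beats OSPF's default distance of 110):

```
# Before maintenance: pin the customer prefix onto the West-side
# next hop with a low-distance static, so the OSPF flap is invisible
/ip route add dst-address=10.4.0.0/24 gateway=10.0.34.3 distance=5 comment="maint-pin"
# ... one such route per prefix that must not flap ...

# After the adjacency has re-formed and OSPF has reconverged:
/ip route remove [find comment="maint-pin"]
```

Generating and cleaning up one static per prefix is exactly the sort of thing nobody should have to script around their IGP.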

Thoughts? Can this be fixed in ROS 6.28/29, or ROS7?

Thanks!

I agree that changing an interface’s cost should not break an adjacency, but even breaking an adjacency on a working link shouldn’t cause what you’re seeing.

In Ciscoland, I always just remove OSPF from an interface before doing maintenance on it.
Zero packets are lost while the network re-converges (the link is still up, so it will pass the odd packet or two sitting in the buffer while LSAs go out announcing the detour). After about 30 seconds or so, I take the link down for real and do the work. Something sounds wrong if you’re getting Destination Unreachable messages.
(granted, I didn’t use a ring topology)

Changing the cost, dropping the link, raising the link, then restoring the cost = 4 reconvergence events.
Removing OSPF from the interface, then restoring it when done = 2 reconvergence events.

Appreciate the feedback. This behavior is reproducible in a lab environment with a trivial configuration. :slight_smile:

Just to be clear, router D has 3 interfaces. 2 face other routers, one faces customers. Recosting the active interface causes packets bound for the customer interface to go ICMP unreachable (because that route has been temporarily removed from the FIBs of the surrounding routers).

This wouldn’t be the first “not like Cisco” thing I’ve found in Mikrotik OSPF - I’ll be honest.

The thing that’s NOT like Cisco that I like most about Mikrotik, though, is that you don’t have to sell your soul just to own a measly branch office router, and then lease it out again every year for support contracts JUST TO HAVE ACCESS TO FIRMWARE UPDATES!!!

(Just so people don’t think I’m a big ol’ Cisco worshiper)

I think I will lab this up just for my own edification - in your experience, how long does convergence take?
How many routes are in your table?
(so I make a fair comparison)
Do you use static default route or OSPF default-information?

~1000 routes, including default (‘just another route’). But you should see it with fewer; actual convergence is extremely fast, and the delay is largely the product of the hello timer as far as I can tell.

Here’s my test topo:
5-tiks-ospf ring.png
Here’s my node 4 config (cost=50 was just the last cost I set on the link to node 5):

[admin@Mikrotik-4] /routing ospf interface> /export compact
# apr/01/2015 19:24:10 by RouterOS 6.27
#
/interface bridge
add name=lo1
/ip neighbor discovery
set ether1 discover=no
/routing ospf instance
set [ find default=yes ] router-id=4.4.4.4
/ip address
add address=10.1.1.4/24 interface=ether1 network=10.1.1.0
add address=4.4.4.4/32 interface=lo1 network=4.4.4.4
add address=10.0.34.4/24 interface=ether2 network=10.0.34.0
add address=10.0.45.4/24 interface=ether5 network=10.0.45.0
add address=10.4.0.1/24 interface=ether3 network=10.4.0.0
/ip route vrf
add interfaces=ether1 routing-mark=mgmt
/routing ospf interface
add cost=50 interface=ether5 network-type=point-to-point
/routing ospf network
add area=backbone
/system identity
set name=Mikrotik-4

Would you call this a fair representation?
(not shown = a single 10.x.0.1/24 on each router as a “customer” interface)
(Also, for some reason GNS3 internally swapped ether3 and ether5 on tik 4, so on node 4 ether5 faces node 5; the drawing label is incorrect.)

When I change the cost of the interface on router D, router E shows:

echo: route,ospf,info OSPFv2 neighbor 4.4.4.4: state change from Full to Init
(and sadly, there’s no info-level logging event when the state changes Init → Full)

Debug level shows all of the LSA flooding and route changing, etc. that you would expect, and indeed, the uptime on the adjacency does reset. (Tested simpler topo in Cisco - adjacency doesn’t reset, just sends a LS-Update)

During the change, I ran a constant ping from 1.1.1.1 to 4.4.4.4 - Even running this in GNS3, I only dropped one ping (due to TTL exceeded @ node 5) during the re-converge, but it didn’t take a neighbor timeout. This behavior was the same for broadcast and point-to-point network types. Granted, I didn’t stuff the routing table full of routes, which might take slightly longer if more LSAs have to be sent/updated. You did say that convergence is fast once the other side times out…

I must still have something different, because the route never gets torn down completely (which would cause the unreachable). It just switches direction around the ring, and if a node that hasn’t yet received the propagating topology-change LSAs happens to forward the old way, the packet bounces back and forth between the two routers that disagree.
(which is why the single ping that failed came back as TTL expired)

Yea, that’s a fair representation. You might only see a single TTL exceeded due to the small size of the route database; I get a few more in the lab but I’m pinging at 50-100ms intervals. In production, it takes a while longer due to the sheer number of routers.

In either case, those TTL Exceeded messages are sufficient to cause TCP connections to close, calls to drop, etc.

Bottom line: adjacencies simply shouldn’t drop when you change the cost metric on an interface. You go from sending a simple LS Update to having to reload and synchronize the entire link-state database, and during the synchronization, packets are lost to the resulting routing loops.

I agree 100% with your statement.
Bouncing an adjacency is just a lazy action for the OSPF process to take.
It should not do this.

A sizeable routing table of 1000+ prefixes means a topology change is not as instantaneous as in the lab; I completely understand.

Obviously you know your network and I don’t, but if this sort of thing happens during maintenance, then it certainly happens during a real failure. Perhaps a topology other than a ring could keep link failures from being so disruptive. Your comment leads me to believe that there are more than 5 routers in your ring.

For what it’s worth, in my experience rings are for SONET, Brocade RRP, and other protocols that are designed as rings.
Higher-layer protocols that operate hop by hop tend to be sloppy at handling topology changes.
(When I started at a telephone company, the CCIE had built a ring of switches around town running Spanning Tree. A fiber cut or a bounced interface of any kind would cause 90 seconds of downtime while Spanning Tree - regular Spanning Tree, not Rapid Spanning Tree - dealt with the change. I came to call these events “netquakes” because they were common enough to deserve a name.)

Compare these two topologies with 12 access routers:
(Of course, if physical or budget constraints don’t allow a design like the ‘fancy’ one, then it’s simply not possible.)
Drawing16.png
Drawing16a.png
Again - I agree that cost change should not bounce the adjacency - I’m just not a big fan of rings…

Hi all,
Does anyone know if this has been fixed in one of the latest versions? Running 6.39.1 and it’s still an issue; planning the upgrades soon. Thanks

Problem still exists; moreover, it restarts the BFD session too. This is a pretty lazy way of handling a simple cost change… MikroTik, please add a feature request to the development cycle. I hope the answer doesn’t come down to the RouterOS 7 unicorn…