Routing loop that isn't

airbanduk · January 15, 2018, 10:20pm

I’ve started to see some strange behaviour on one of my networks that I can’t get my head around. With OSPF fully converged, I have a stable, loop-free network with full reachability, configured in a resilient ring topology. When a link goes down, strange things happen: some seemingly random loopback interfaces become unreachable, and a traceroute to them shows the hops doubling back on themselves, or heading into stub areas that do not contain the destination address. All routing tables report the correct next hop, and the MPLS labels are correct and set to forward out of the correct interfaces. An example traceroute looks something like:

10.0.0.1
10.0.1.1
10.0.2.1
10.0.3.1
10.0.2.1
10.0.1.1
10.0.0.1
10.1.0.1
10.1.1.1
10.1.2.1

or, if it heads towards a stub area (each area’s subnet has the area ID x16 in the third octet):

traceroute 10.0.16.1

10.0.0.1
10.0.1.1
10.0.32.1
10.0.33.1
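
To make the two symptoms above concrete, here is a rough Python sketch that checks pasted traceroute output offline. It is only an illustration, not something running on the routers. The function names are made up, and it assumes that each area owns a block of 16 consecutive values in the third octet (so area 1 is 16 to 31, area 2 is 32 to 47) and that area 0 is the backbone.

```python
import ipaddress


def area_of(addr):
    """Area ID implied by the third octet (ASSUMPTION: blocks of 16 per area)."""
    third_octet = ipaddress.IPv4Address(addr).packed[2]
    return third_octet // 16


def revisited_hops(hops):
    """Return hops that appear more than once, in the order they first repeat."""
    seen = set()
    repeats = []
    for hop in hops:
        if hop in seen and hop not in repeats:
            repeats.append(hop)
        seen.add(hop)
    return repeats


def stray_hops(hops, target, backbone=0):
    """Return hops outside both the backbone and the target's area.

    ASSUMPTION: a sane path only crosses area 0 and the destination area.
    """
    allowed = {backbone, area_of(target)}
    return [hop for hop in hops if area_of(hop) not in allowed]


# The two traces from this post.
first_trace = [
    "10.0.0.1", "10.0.1.1", "10.0.2.1", "10.0.3.1", "10.0.2.1",
    "10.0.1.1", "10.0.0.1", "10.1.0.1", "10.1.1.1", "10.1.2.1",
]
second_trace = ["10.0.0.1", "10.0.1.1", "10.0.32.1", "10.0.33.1"]

print(revisited_hops(first_trace))
print(stray_hops(second_trace, "10.0.16.1"))
```

For the first trace this flags 10.0.2.1, 10.0.1.1 and 10.0.0.1 as revisited. For the second, the target 10.0.16.1 maps to area 1, while 10.0.32.1 and 10.0.33.1 both map to area 2, so they are flagged as being in the wrong stub.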

I’ve never seen anything like this before. The only way to stabilise it is to reboot routers until the problem clears. I’ve tried upgrading the routers that seem most affected to 6.41, but it’s still happening. The network had been stable for over a year, and this only started about 4 weeks ago. I plan to roll 6.41 out to all of them, but there are about 50, so it needs careful planning and time.

Any ideas? All welcomed as I’m stumped.