Hi fellow Network wizards,
I'm having serious “fun” after upgrading the whole fleet of routers all over Europe from 7.19.6 to 7.20.8.
I tested thoroughly in the lab and on other networks without finding any issues, but apparently only because I didn't look closely enough.
I’m not sure whether this is a version-specific change in behavior, a bug… or whether I’m just too tired to see the mistake(s) in my analysis.
Here's my problem:
I have highly redundant setups like the one I'm describing here.
- two routers
- two firewalls
- switchstacks
- routers connected to each other (bonding_cluster) => ospf cost 1, preferred
- routers connected to the local firewall (bonding_fw) => ospf cost 10, the firewall loopback should take that one
- routers connected to the remote firewall (other room) via vlan42 on trunk bonding_downlink => ospf cost 100, use that one only as a last resort
- the firewalls don't support BFD, otherwise I would use it on those connections, too.
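For orientation, the cost layout in the list above could be expressed roughly as the interface templates below. This is an illustrative sketch, not an export from the routers: the area and interface names are taken from the output in this post, while the one-template-per-interface layout is an assumption (the real config may match by network instead).

```
# Illustrative sketch only, not the real export.
# Assumption: one template per interface; names taken from the print output in this post.
/routing/ospf/interface-template
# router-to-router link: cheapest path, BFD enabled
add area=backbonev4 interfaces=bonding_cluster cost=1 use-bfd=yes
# link to the local firewall: no BFD, since the firewalls don't support it
add area=backbonev4 interfaces=bonding_fw cost=10 use-bfd=no
# link to the remote firewall via vlan42: last resort
add area=backbonev4 interfaces=vlan42 cost=100 use-bfd=no
```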
OSPF is only used to carry the loopbacks (and transfer nets); BGP carries the real routing information but is out of scope for this problem.
As you can see above, the two routers have the loopback IPs 10.255.56.1 and 10.255.56.2 and here's the result of the config.
r1)
[r1] > /routing/ospf/interface print detail
Flags: D - dynamic
[...]
1 D address=10.255.1.177%bonding_cluster area=backbonev4 state=bdr network-type=broadcast dr=10.255.1.178 cost=1 priority=128
use-bfd=yes retransmit-interval=5s transmit-delay=1s hello-interval=10s dead-interval=40s
2 D address=10.255.1.181%bonding_fw area=backbonev4 state=dr network-type=broadcast bdr=10.255.1.182 cost=10 priority=128
use-bfd=no retransmit-interval=5s transmit-delay=1s hello-interval=10s dead-interval=40s
3 D address=10.255.1.189%vlan42 area=backbonev4 state=bdr network-type=broadcast dr=10.255.1.190 cost=100 priority=128
use-bfd=no retransmit-interval=5s transmit-delay=1s hello-interval=10s dead-interval=40s
[r1] > /routing/ospf/neighbor/print
Flags: V - virtual; D - dynamic
0 D instance=ospfv4 area=backbonev4 address=10.255.1.178 priority=128 router-id=10.255.56.2 dr=10.255.1.178 bdr=10.255.1.177
state="Full" state-changes=6 ls-retransmits=1 adjacency=2d21h17m30s timeout=36s
1 D instance=ospfv4 area=backbonev4 address=10.255.1.190 priority=1 router-id=10.255.56.5 dr=10.255.1.190 bdr=10.255.1.189
state="Full" state-changes=6 adjacency=2d21h17m35s timeout=40s
2 D instance=ospfv4 area=backbonev4 address=10.255.1.182 priority=1 router-id=10.255.56.5 dr=10.255.1.181 bdr=10.255.1.182
state="Full" state-changes=6 adjacency=2d21h17m23s timeout=35s
[...]
[r1] > /routing/bfd/session/print
Flags: U - up, I - inactive
0 U multihop=no vrf=main remote-address=10.255.1.197%40-gw02 local-address=10.255.1.198%40-gw02 state=up state-changes=2
uptime=1d8h52m28s desired-tx-interval=10s actual-tx-interval=10s required-min-rx=10s remote-min-rx=200ms
remote-min-tx=200ms multiplier=5 hold-time=50s packets-rx=13459 packets-tx=13448
1 U multihop=no vrf=main remote-address=10.255.1.178%bonding_cluster local-address=10.255.1.177%bonding_cluster state=up
state-changes=2 uptime=2d21h18m9s desired-tx-interval=200ms actual-tx-interval=200ms required-min-rx=200ms
remote-min-rx=200ms remote-min-tx=200ms multiplier=5 hold-time=1s packets-rx=1448966 packets-tx=1449207
r2)
[r2] > /routing/ospf/interface print detail
Flags: D - dynamic
[...]
1 D address=10.255.1.178%bonding_cluster area=backbonev4 state=dr network-type=broadcast bdr=10.255.1.177 cost=1 priority=128 use-bfd=yes retransmit-interval=5s transmit-delay=1s hello-interval=10s dead-interval=40s
2 D address=10.255.1.185%bonding_fw area=backbonev4 state=dr network-type=broadcast cost=10 priority=128 use-bfd=no retransmit-interval=5s transmit-delay=1s hello-interval=10s dead-interval=40s
3 D address=10.255.1.193%vlan42 area=backbonev4 state=dr network-type=broadcast bdr=10.255.1.194 cost=100 priority=128 use-bfd=no retransmit-interval=5s transmit-delay=1s hello-interval=10s dead-interval=40s
[r2] > /routing/ospf/neighbor/print
Flags: V - virtual; D - dynamic
0 D instance=ospfv4 area=backbonev4 address=10.255.1.194 priority=1 router-id=10.255.56.5 dr=10.255.1.193 bdr=10.255.1.194 state="Full" state-changes=15 ls-retransmits=1 adjacency=3d21h29m49s timeout=39s
[...]
2 D instance=ospfv4 area=backbonev4 address=10.255.1.177 priority=128 router-id=10.255.56.1 dr=10.255.1.178 bdr=10.255.1.177 state="Full" state-changes=6 ls-retransmits=1 adjacency=2d21h15m11s timeout=35s
[r2] > /routing/bfd/session/print
Flags: U - up, I - inactive
[...]
1 U multihop=no vrf=main remote-address=10.255.1.177%bonding_cluster local-address=10.255.1.178%bonding_cluster state=up state-changes=1 uptime=2d21h15m49s desired-tx-interval=200ms actual-tx-interval=200ms required-min-rx=200ms remote-min-rx=200ms
remote-min-tx=200ms multiplier=5 hold-time=1s packets-rx=1448380 packets-tx=1448151
As you can see, all sessions are up, BFD is working, OSPF is up, and LSAs are being exchanged.
So much for the config. Now to my problem and the unexpected behavior.
Because of the direct connection and the lowest cost, I would expect r1 and r2 to choose bonding_cluster to reach each other. Instead:
r1)
[r1] > /routing/route/print where ospf and dst-address~"10.255.56.*/32"
Flags: A - ACTIVE; o - OSPF
Columns: DST-ADDRESS, GATEWAY, AFI, ROUTING-TABLE, DISTANCE, SCOPE, TARGET-SCOPE, IMMEDIATE-GW
DST-ADDRESS GATEWAY AFI ROUTING-TABLE DISTANCE SCOPE TARGET-SCOPE IMMEDIATE-GW
Ao 10.255.56.2/32 10.255.1.182%bonding_fw ip main 110 20 10 10.255.1.182%bonding_fw
Ao 10.255.56.5/32 10.255.1.182%bonding_fw ip main 110 20 10 10.255.1.182%bonding_fw
r2)
[r2] > /routing/route/print where ospf and dst-address~"10.255.56.*/32"
Flags: A - ACTIVE; o - OSPF
Columns: DST-ADDRESS, GATEWAY, AFI, ROUTING-TABLE, DISTANCE, SCOPE, TARGET-SCOPE, IMMEDIATE-GW
DST-ADDRESS GATEWAY AFI ROUTING-TABLE DISTANCE SCOPE TARGET-SCOPE IMMEDIATE-GW
Ao 10.255.56.1/32 10.255.1.194%vlan42 ip main 110 20 10 10.255.1.194%vlan42
Ao 10.255.56.5/32 10.255.1.194%vlan42 ip main 110 20 10 10.255.1.194%vlan42
==> r1 chooses the bonding_fw connection (cost 10) instead of the cheaper bonding_cluster link (cost 1); the vlan42 connection (cost 100) is the only more expensive option.
==> r2 chooses the most expensive connection, vlan42 (cost 100), over the direct and cheapest bonding_cluster link (cost 1); bonding_fw is not an option there because fw2 is inactive.
On one of the sites with the same setup, disabling BFD in the OSPF config for bonding_cluster helped, so I thought that was the cause. But when I tried it on another site, it didn't change the routing decision. So… I am even more confused.
Has anyone else seen something like this on 7.20.x? Any obvious errors in my config?
ANY help will be appreciated.
Thanks a lot,
Best regards,
Irrwitzer
p.s.: SUP-212079
Update 2026-03-04:
I think I found the problem: LSA types, and the strategy I use to announce the loopback IPs. Instead of announcing them via redistribute connected (external LSA), I need to announce them as internal routes.
/routing/ospf/interface-template/add place-before=0 area=backbonev4 interfaces=lo
/routing/ospf/instance/unset value-name=redistribute 0
This seems to work: the loopback IP is then announced as an internal stub.
It takes a long time for the old LSAs to expire, though, as I can't find a way to flush the OSPF process without disabling and re-enabling it.
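To confirm that the change has taken effect, a check along the following lines on r1 may help. It is only a sketch: the address is r2's loopback from this post, no output is shown because the exact fields can differ between RouterOS versions, and the thing to look for is whether the route to the loopback now leaves via bonding_cluster and whether old external LSAs are still listed.

```
# Show the full route entry for r2's loopback, including gateway and OSPF details.
/routing/route/print detail where dst-address=10.255.56.2/32
# List the LSA database to see whether stale external LSAs for the loopbacks are still present.
/routing/ospf/lsa/print
```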
So it’s evidently not a version-specific change but a problem with my config, as I feared. Why this didn’t come up earlier… who knows.
How do you guys announce your loopback IPs? Like this? I used to do it the redistribute-connected way with Cisco and Juniper, which is why I implemented it that way here. Adding the loopback as an “interface” seems counterintuitive to me. So if there's a cleaner way to do it, please let me know.
Thanks,
Irrwitzer
