I set up a test network consisting of four RB750GL boxes. All switching and bridging was removed, so no ethernet port was connected internally to any other ethernet port (by bridge or switch). Then I made a ring from the four boxes, A, B, C, and D, connecting port 5 on each box to port 1 on the next, as follows:
...[D]e5<=>e1[A]e5<=>e1[B]e5<=>e1[C]e5<=>e1[D]e5<=>e1[A]...
Then I set up IP networks on each link:
10.20.30.0/30 between A eth5 (.1) and B eth1 (.2)
10.20.30.4/30 between B eth5 (.5) and C eth1 (.6)
10.20.30.8/30 between C eth5 (.9) and D eth1 (.10)
10.20.30.12/30 between D eth5 (.13) and A eth1 (.14)
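For anyone who wants to rule out an addressing mistake before looking at OSPF, here is a minimal Python sketch that checks the plan above: every endpoint must be a usable host in its /30, and no two link subnets may overlap. The LINKS table and the check_links helper are names I made up for illustration, and I wrote the ports as ether1/ether5 to match the exported config further down.

```python
import ipaddress

# Link plan copied from the list above:
# (subnet, (box, port, address), (box, port, address))
LINKS = [
    ("10.20.30.0/30",  ("A", "ether5", "10.20.30.1"),  ("B", "ether1", "10.20.30.2")),
    ("10.20.30.4/30",  ("B", "ether5", "10.20.30.5"),  ("C", "ether1", "10.20.30.6")),
    ("10.20.30.8/30",  ("C", "ether5", "10.20.30.9"),  ("D", "ether1", "10.20.30.10")),
    ("10.20.30.12/30", ("D", "ether5", "10.20.30.13"), ("A", "ether1", "10.20.30.14")),
]


def check_links(links):
    """Return a list of human-readable problems; an empty list means the plan is consistent."""
    problems = []
    for i, (cidr, *ends) in enumerate(links):
        net = ipaddress.ip_network(cidr)
        usable = set(net.hosts())  # a /30 has exactly two usable host addresses
        for box, port, addr in ends:
            if ipaddress.ip_address(addr) not in usable:
                problems.append(f"{box} {port} {addr} is not a usable host in {cidr}")
        # Compare against every later subnet so each pair is checked once.
        for other_cidr, *_ in links[i + 1:]:
            if net.overlaps(ipaddress.ip_network(other_cidr)):
                problems.append(f"{cidr} overlaps {other_cidr}")
    return problems


if __name__ == "__main__":
    for line in check_links(LINKS) or ["addressing plan is consistent"]:
        print(line)
```

This only validates the numbers as written in this post. It says nothing about what is actually configured on the boxes.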
Additionally, a non-connected loopback bridge was created on each box. By non-connected I mean that the “/interface bridge port” section had ZERO entries, so the bridge did not actually bridge anything. A /32 was assigned to each device’s loopback bridge:
10.20.30.64/32 to A
10.20.30.65/32 to B
10.20.30.66/32 to C
10.20.30.67/32 to D
Next, OSPF was set up on each device as follows. The config below is from box D. For each other box, change the router-id to that box’s loopback address, and change the two /30 networks listed in the “/routing ospf network” section to the two /30 networks that box uses to connect to its two neighbors:
/routing ospf instance
set default disabled=no distribute-default=never in-filter=ospf-in metric-bgp=auto metric-connected=20 \
metric-default=1 metric-other-ospf=auto metric-rip=20 metric-static=20 name=default out-filter=ospf-out \
redistribute-bgp=no redistribute-connected=as-type-1 redistribute-other-ospf=no redistribute-rip=no \
redistribute-static=no router-id=10.20.30.67
/routing ospf area
set backbone area-id=0.0.0.0 disabled=no instance=default name=backbone type=default
/routing ospf interface
add authentication=md5 authentication-key=fooBARbaz authentication-key-id=1 cost=10 dead-interval=8s disabled=no \
hello-interval=2s instance-id=0 interface=ether1 network-type=point-to-point passive=no priority=1 \
retransmit-interval=5s transmit-delay=1s use-bfd=no
add authentication=md5 authentication-key=fooBARbaz authentication-key-id=1 cost=10 dead-interval=8s disabled=no \
hello-interval=2s instance-id=0 interface=ether5 network-type=point-to-point passive=no priority=1 \
retransmit-interval=5s transmit-delay=1s use-bfd=no
/routing ospf network
add area=backbone disabled=no network=10.20.30.12/30
add area=backbone disabled=no network=10.20.30.8/30
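The per-box substitutions described above (router-id and the two /30 networks) can be tabulated with a short Python sketch. This is only an illustration of the addressing in this post, not something that talks to the routers. RING, LOOPBACKS, ETHER5_NET, and box_params are hypothetical names of my own.

```python
# Ring order: each box's ether5 connects to the next box's ether1.
RING = ["A", "B", "C", "D"]

# Loopback /32 per box; used as the OSPF router-id.
LOOPBACKS = {
    "A": "10.20.30.64",
    "B": "10.20.30.65",
    "C": "10.20.30.66",
    "D": "10.20.30.67",
}

# The /30 on the link that leaves each box through ether5.
ETHER5_NET = {
    "A": "10.20.30.0/30",
    "B": "10.20.30.4/30",
    "C": "10.20.30.8/30",
    "D": "10.20.30.12/30",
}


def box_params(box):
    """Return the values that differ per box in the OSPF config."""
    previous = RING[RING.index(box) - 1]  # index -1 wraps around, so A's predecessor is D
    return {
        "router_id": LOOPBACKS[box],
        "ether1_net": ETHER5_NET[previous],  # link shared with the previous box
        "ether5_net": ETHER5_NET[box],       # link shared with the next box
    }


if __name__ == "__main__":
    for name in RING:
        print(name, box_params(name))
```

For box D this gives router-id 10.20.30.67 and the networks 10.20.30.8/30 and 10.20.30.12/30, which matches the export shown above.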
Everything worked exactly as expected with one exception: boxes D and A would NOT form an OSPF neighbor relationship. No matter what I tried (removing IP addresses and adding them again, removing OSPF configuration items and adding them back, rebooting devices, changing to alternate ethernet ports), NOTHING worked. Invariably, box A would show state “Init” for the A-to-D link in the “/routing ospf neighbor” section, but D would show nothing, no entries at all. (Init on A’s side means A was receiving D’s hellos, but those hellos did not list A as a neighbor, which is consistent with D showing no entry for A.)
I made a snapshot of the configuration and then manually reset box D to factory settings. Then using the exported configuration, I configured box D again EXACTLY HOW IT WAS CONFIGURED THE FIRST TIME. Let me emphasize that. It was an EXACT DUPLICATION of the original NON-WORKING configuration.
But it worked. Resetting to factory worked. Rebooting had not worked.
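To back up the “exact duplication” claim, the export taken before the reset can be compared with one taken after reconfiguring. Below is a minimal Python sketch of that comparison. It assumes both exports are available as text, and export_diff is a hypothetical helper name. Lines starting with “#” are skipped because the export header comment carries a timestamp that differs between runs.

```python
import difflib


def export_diff(before_text, after_text):
    """Return unified-diff lines between two RouterOS exports, ignoring comments and blank lines."""

    def normalize(text):
        kept = []
        for line in text.splitlines():
            line = line.rstrip()
            if not line or line.startswith("#"):
                continue  # drop blank lines and the timestamped header comment
            kept.append(line)
        return kept

    return list(
        difflib.unified_diff(
            normalize(before_text),
            normalize(after_text),
            fromfile="before-reset",
            tofile="after-reset",
            lineterm="",
        )
    )


if __name__ == "__main__":
    # Tiny inline samples; in practice the two strings would be read from the export files.
    before = "# export taken before reset\n/ip address\nadd address=10.20.30.13/30 interface=ether5\n"
    after = "# export taken after reset\n/ip address\nadd address=10.20.30.13/30 interface=ether5\n"
    print(export_diff(before, after) or "no visible differences")
```

An empty diff only shows that the visible configuration is identical. It cannot reveal state that “print” and “export” do not show, which is exactly the question here.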
So…
WHY???
What hidden configuration item is there that the RouterOS command line cannot see (using “print” and “export” commands) that somehow changed?
What OSPF state exists hidden from administrative control that BREAKS THINGS even when the configuration is 100% correct and 100% accurate and SHOULD WORK???
This is a huge problem when a configuration is built and tested in the lab but fails once it is deployed to remote devices in the field. The non-working devices then require a truck roll and an on-premises visit to reset them manually, after which the tested configuration starts working. Sorry, but this is unacceptable.
Can I reproduce this? So far, no. What caused this bizarreness anyway? I have no idea. If it happens again, I will certainly generate a support file BEFORE resetting to factory.
Sadly, this isn’t the first time RouterOS has behaved in a quirky manner for me. I just wish I’d documented the oddness in times past and created support files. RouterOS has a reputation among my coworkers for hard-to-reproduce, seemingly random misbehavior, and they keep telling me to give up and go Cisco. (But I like these Routerboard boxes… I just want them to be more consistently reliable.)
Any ideas? Any suggestions beyond capturing support files and reporting bugs should I run into this in the future?
Puzzled and frustrated,
Aaron out.