Gateway check for /32 Ethernet point-to-point links: how to?

FIPTech · December 31, 2021, 4:39pm

/32 point-to-point links greatly reduce IP address consumption on direct links between routers, especially when the same IP address is used on all interfaces of a router (a concept similar to unnumbered IP addressing). Another advantage is simpler configurations and routing tables.

So I started testing this with Router OS 7.2 rc1, using a set of four routers connected through direct /32 point-to-point links. All interfaces of a given router have the same /32 IP address, and each link uses the /32 address of the remote endpoint interface as its network address. This setup is known to work on Mikrotik. It is even more efficient in address usage than /31 links.

Then I found connection failures between routers when just one interface was disabled or one cable was disconnected. That should not happen, because I used redundant links: the routing protocol (OSPF) should restore full connectivity within a few seconds of a single link failure.

I watched the OSPF route propagation after disabling an interface or removing a cable, and found that it worked flawlessly on all routers (using OSPF in ptp mode).

I investigated a bit more and found that the problem did not come from the routing protocol (OSPF) but from the /32 point-to-point interfaces, which were not brought down when the remote side was disconnected or disabled.
Not only did the interface stay up, but more importantly the dynamic connected route to the remote ptp side remained active.

This is the source of the problem: connected routes take priority over OSPF routes. If the connected route of a ptp link stays active when the cable is disconnected or the remote interface is down, the router never uses the alternative route proposed by OSPF, because the latter has lower priority than the connected route. So the router keeps trying to send traffic through the broken link. The result is lost connectivity to the router at the other end of the broken ptp link.
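To make the priority argument concrete, here is a minimal Python sketch of how a router picks between two routes to the same prefix. It is only a model, not RouterOS code: the `Route` class and `best_route` function are hypothetical names, the addresses are the ones from this lab, and the distances (0 for connected, 110 for OSPF) are the usual RouterOS defaults.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Route:
    prefix: str          # destination prefix
    via: str             # outgoing interface or next hop
    protocol: str        # "connected" or "ospf"
    distance: int        # lower distance is preferred
    active: bool = True  # False once the router withdraws the route


def best_route(candidates):
    """Among usable routes to the same prefix, the lowest distance wins."""
    usable = [r for r in candidates if r.active]
    return min(usable, key=lambda r: r.distance) if usable else None


connected = Route("10.10.50.4/32", "ether3", "connected", 0)
ospf = Route("10.10.50.4/32", "10.10.50.2 on ether2", "ospf", 110)

# Interface still "running" although the link is broken:
# the stale connected route wins and traffic goes into the broken link.
stale = best_route([connected, ospf])

# Interface correctly brought down: the connected route is withdrawn
# and the OSPF route takes over.
healed = best_route([replace(connected, active=False), ospf])

print(stale.protocol, stale.via)    # connected ether3
print(healed.protocol, healed.via)  # ospf 10.10.50.2 on ether2
```

The only thing that changes the outcome is whether the connected route is withdrawn, which is exactly what does not happen when the interface stays up.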

As a side note, adding a static /32 route for the remote ptp address with gateway check detection (ARP or ping) does not work, because the connected route has the same mask. The static route is not more specific, so it cannot override the connected route.

A solution (tested and working) is to create a bonding interface, put the Ethernet interface inside it, assign the /32 address to the bond, and use active-backup mode with ARP link monitoring, with the IP address of the remote ptp interface as the ARP target. This way the bonding interface goes down when the link is broken, the dynamic connected route disappears, and the OSPF routes can take over.

Perhaps another workaround (not tested) would be to use a routing filter to modify the metrics of the connected routes (I’m not sure that would work).

But these workarounds are not clean. I feel that something should be done in Router OS 7 to remove the dynamic connected route of a point-to-point link when the link is down. Perhaps using BFD?

This seems mandatory for a working setup that uses /32 point-to-point links between routers with unnumbered-style addressing.

smyers119 · December 31, 2021, 4:52pm

Can you post the route tables in both the failed and normal states? Your test configs would help as well.

FIPTech · December 31, 2021, 5:56pm

Here is the setup. All routers are Router OS 7.2 rc1 CHR instances running as virtual machines in EVE-NG.
The administration virtual switch is a Mikrotik router, where a bridge with horizon settings isolates forwarding between ports, so that no routing announcements can be exchanged over the admin interfaces. There are two interfaces to the Net Cloud, just to test that RSTP works flawlessly in Router OS 7.2.
EVE _ Topology.png
And here are the R1 routes if I disable the ether3 interface on R4.

You can see that the DAC route 10.10.50.4/32 on ether3 is catching the traffic that should normally take the OSPF route 10.10.50.4/32 via gateway 10.10.50.2 on ether2.

I feel that BFD should catch the problem and disable the ether3 interface on R1.
Routes R1.png

smyers119 · December 31, 2021, 6:27pm

I am going to research this, as I am pretty sure that shouldn’t happen and may even be against the RFC. That route should become invalid as soon as the interface goes up/down. You should not need BFD.

(Tested and confirmed that this is how it behaves in Cisco Packet Tracer.)

smyers119 · December 31, 2021, 6:29pm

Confirmed this is against the RFC (RFC 1812):

5.3.12.3 When an Interface Fails or is Disabled

If an interface fails or is disabled a router MUST remove and stop
advertising all routes in its forwarding database that make use of
that interface. It MUST disable all static routes that make use of
that interface. If other routes to the same destination and TOS are
learned or remembered by the router, the router MUST choose the best
alternate, and add it to its forwarding database. The router SHOULD
send ICMP destination unreachable or ICMP redirect messages, as
appropriate, in reply to all packets that it is unable to forward due
to the interface being unavailable.

3.3.6 Interface Testing

A router MUST have a mechanism to allow routing software to determine
whether a physical interface is available to send packets or not; on
multiplexed interfaces where permanent virtual circuits are opened
for limited sets of neighbors, the router must also be able to
determine whether the virtual circuits are viable. A router SHOULD
have a mechanism to allow routing software to judge the quality of a
physical interface. A router MUST have a mechanism for informing the
routing software when a physical interface becomes available or
unavailable to send packets because of administrative action. A
router MUST have a mechanism for informing the routing software when
it detects a Link level interface has become available or
unavailable, for any reason.

DISCUSSION
It is crucial that routers have workable mechanisms for
determining that their network connections are functioning
properly. Failure to detect link loss, or failure to take the
proper actions when a problem is detected, can lead to black
holes.

FIPTech · December 31, 2021, 9:50pm

I think the problem comes from the emulation layer in EVE-NG: I suppose that the layer 1 physical Ethernet protocols are not emulated (for example, port speed negotiation).

This means that when I disable the ether3 interface on R4, the ether3 interface on R1 does not see that the Ethernet link is broken and stays up, and R1 keeps the DAC route.

OSPF is working correctly; it is just that this DAC route should not be there when the ptp link is broken. OSPF cannot do anything to remove this route. It is not its responsibility.

So the virtualization layer is responsible for the problem, I think. I don’t think the problem would reproduce on physical routers.
Emulation should take care of level 1 connectivity; if not, it’s not a true emulation. On the other hand, it would be nice if something could be added at level 2 in Router OS 7 (BFD, I suppose) to avoid such situations on Ethernet interfaces in virtual environments, where it seems that level 1 connectivity checks cannot be done.
And/or be able to add a gateway check, ARP or IP, for /32 DAC routes.

smyers119 · December 31, 2021, 10:32pm

I’ll attempt to reproduce this in GNS3. Do you get the same results with 6.49? 7+ doesn’t work in GNS3 (I believe it’s a VirtualBox problem).

FIPTech · December 31, 2021, 11:03pm

I think ROS 7 works in GNS3. Did you try Router OS in a QEMU VM? What are you using for the GNS3 virtual instance? Try VMware Player instead of VirtualBox; you should get better speed for nested virtualization.
Managing node images in EVE-NG is complex: I have only managed to import a single ROS 7.2 CHR image into it. I’m going to try in GNS3 with both versions.
Also, GNS3 has some native tools to temporarily break links or add jitter, latency, or packet loss.

On the subject of link state detection, I think it is important, for critical point-to-point links, that level 2 connectivity be checked on the link in both traffic directions. Sometimes a link can be up at level 1 (physical) and not working at level 2 (MAC). I’ve seen that occur a few times with damaged switch ports or damaged copper cables. On optical links, it’s important to have a bidirectional hello test so that the interfaces at both ends of the link go down if a single fiber direction is broken. If the test is not bidirectional, especially on fiber links, this can turn into a network disaster if the interface at one end of the ptp link stays up.
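A rough Python sketch of what such a bidirectional hello test could look like. This is a simplified illustration, not BFD and not anything RouterOS implements: the `HelloEndpoint` class, the `exchange` helper, the 1-second interval and the multiplier of 3 are all assumptions made for the example. Each hello carries an "I hear you" flag, and an end considers the link up only if it hears the peer and the peer reports hearing it.

```python
class HelloEndpoint:
    """One end of a point-to-point link running a bidirectional hello check."""

    def __init__(self, interval=1.0, multiplier=3):
        self.detect_time = interval * multiplier
        self.last_heard = None      # time of the last hello received
        self.peer_hears_us = False  # flag carried in the peer's last hello

    def receive(self, now, peer_hears_us):
        self.last_heard = now
        self.peer_hears_us = peer_hears_us

    def hears_peer(self, now):
        return (self.last_heard is not None
                and now - self.last_heard <= self.detect_time)

    def link_up(self, now):
        # Up only if BOTH directions are confirmed.
        return self.hears_peer(now) and self.peer_hears_us


def exchange(now, a, b, a_to_b=True, b_to_a=True):
    """Both ends send one hello; each direction can be broken separately."""
    a_flag, b_flag = a.hears_peer(now), b.hears_peer(now)
    if a_to_b:
        b.receive(now, a_flag)
    if b_to_a:
        a.receive(now, b_flag)


a, b = HelloEndpoint(), HelloEndpoint()
for t in range(0, 3):                 # healthy link
    exchange(t, a, b)
up_before = (a.link_up(2), b.link_up(2))

for t in range(3, 7):                 # the fiber from A to B is cut
    exchange(t, a, b, a_to_b=False)
up_after = (a.link_up(6), b.link_up(6))

print(up_before)        # (True, True)
print(up_after)         # (False, False)
print(a.hears_peer(6))  # True: a one-way check would have kept A up
```

In this model A still receives hellos from B after the cut, so a one-way check would leave A's interface up. The "I hear you" flag is what brings both ends down.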

smyers119 · January 1, 2022, 12:41am

I was able to test this in GNS3 with 7.2rc1 and can confirm the same results. I researched it on the GNS3 forums, and apparently it is a known limitation: even though you disable an interface on one router, the other router still sees the link as up/up.

FIPTech · January 1, 2022, 1:37am

I tested on GNS3 with Router OS 6.49.2. Same results. When I disable the ether3 interface on R4, R1 does not see the broken link: it keeps its ether3 interface running and keeps the ether3 DAC route, and connectivity between R1 and R4 is lost because OSPF routes have lower priority than a DAC route. The DAC route 10.10.50.4/32 should be removed when the link is down.

The good point with GNS3 is that we can suspend a link. In that case, the interfaces at both ends stop running, so there is no connectivity loss. That is better for testing routing protocols.

Network simulators should be able to emulate level 1 link status protocols, and Router OS should be able to detect level 2 or level 3 connectivity loss on a point-to-point link with /32 addresses.

For these tests I was using the default virtio network interfaces in the Router OS VM. Perhaps I should try an emulated Intel NIC to see if there is a difference.

I think the worse of the two is that Router OS is not able to detect a level 2 or level 3 connectivity loss on a point-to-point link. This is a function that should be added for reliability. As I said before, sometimes the level 1 link status is OK but level 2 traffic does not work or is strongly degraded (a bad cable, for example; I’ve seen that a few times with outdoor or underground copper cables).

smyers119 · January 1, 2022, 2:23am

GNS3 is not able to simulate a layer 2 link loss. It’s not a RouterOS issue.

FIPTech · January 1, 2022, 10:43am

Not layer 2, but layer 1 link loss. If you check at L2 in GNS3, for example with ARP monitoring on a bonding interface, the broken link is detected.

So yes, virtual environments cannot simulate an L1 link loss, but even with physical routers, using /32 point-to-point links is dangerous because there is no simple way to check connectivity to the remote endpoint at level 2. This can lead to connectivity loss, because connected routes and their interface stay alive if L1 connectivity is good but L2 is broken.

Again, L2 connectivity loss while L1 is working is not so uncommon. I’ve seen it many times when a cable has moisture in it, when connectors are damaged, or when a physical port is damaged.

Moreover, on optical links a single direction can be down. Without an L2 check using hello packets in both directions, an interface can stay alive at one end while the link is broken. BFD inside the routing protocol is not enough with /32, because it does not control the running status of the interface. For ptp links, BFD must run on the Ethernet interface itself. It is the same concept as loop detection for Ethernet: it could be a simple option on each Ethernet interface that we could enable for point-to-point links.
Using BFD on the routing protocol alone does not help with the problem of the DAC route staying alive. It just helps reduce the reconvergence time. Try it, it’s available in Router OS 6: you will see that it does not solve the problem of the interface status not changing.

With /30 links there is a fairly simple way to check L2/L3 connectivity to the remote end (a static /32 route with check-gateway set to ARP or ICMP), but this is not possible with /32 links because the DAC route always has priority.
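The difference between the two cases comes down to longest-prefix matching. A small Python sketch, assuming a simplified lookup (longest prefix first, then lowest distance): `lookup` is a hypothetical helper, 192.0.2.0/30 is a made-up documentation subnet standing in for a /30 link, and distance 1 is the usual default for static routes.

```python
import ipaddress


def lookup(dest, routes):
    """Pick a route for dest: longest prefix first, then lowest distance.

    routes is a list of (prefix, label, distance) tuples.
    """
    addr = ipaddress.ip_address(dest)
    matches = [r for r in routes if addr in ipaddress.ip_network(r[0])]
    if not matches:
        return None
    return max(matches,
               key=lambda r: (ipaddress.ip_network(r[0]).prefixlen, -r[2]))


# /30 link: the static /32 is more specific than the connected /30,
# so the checked static route is the one that gets used.
slash30 = [
    ("192.0.2.0/30", "connected", 0),
    ("192.0.2.2/32", "static with gateway check", 1),
]

# /32 link: both routes have the same prefix length,
# so the connected route wins on distance and the static is never used.
slash32 = [
    ("10.10.50.4/32", "connected", 0),
    ("10.10.50.4/32", "static with gateway check", 1),
]

print(lookup("192.0.2.2", slash30)[1])   # static with gateway check
print(lookup("10.10.50.4", slash32)[1])  # connected
```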

So this is something that should be implemented, I think. A simple BFD hello protocol would be enough, implemented directly at L2 on the Ethernet interfaces and available for point-to-point links.

Using a bonding interface to do that is possible but overkill.

FIPTech · January 3, 2022, 4:53pm

Looking in more detail at what happens when a point-to-point link between two routers is broken, when using OSPF and /32 addressing with the same IP on all interfaces of a router, here is what I saw in Router OS 6.49.2:

If the interface of the broken link goes down, there is no problem: connectivity between the two routers is preserved by rerouting. Traffic going through this link is rerouted too.
If the interface stays alive, connectivity between the routers at each end of the broken link is lost, because the DAC route stays active. But traffic that was going through the link is still rerouted.

So the problem is less severe than I initially thought, because the traffic always seems to be rerouted.

To overcome the loss of connectivity between two routers when a link is broken but the interface still shows as “running”, a solution would be to add a loopback interface on each router, with an IP address different from the one used for the unnumbered addressing. Because it’s a different IP, it won’t be caught by the ghost DAC route. This costs only one extra IP per router, if connectivity between routers is mandatory.

But the true solution would be BFD link state detection for point-to-point links. It would be good to have this for static route gateways too.

As a side note, “/32 addressing” is not an exact description of these point-to-point links. Two addresses are really used: the host address, and the network address, which becomes the remote address and is used to create the route. It is the same as a /31, but with totally free addressing.