Bonding active backup behaviour change - 6.47.x or earlier.

Hi,

We have a problem: something that used to work fine no longer does. So either something has broken, or what we were doing was never officially supported.

This relates to using bonding in active-backup mode to provide a resilient path over 5 GHz for 60 GHz links of up to 1.7 km. The 60 GHz gear is MikroTik and the 5 GHz gear is Ubiquiti.

One scenario we have: a core mast with an omni 5 GHz antenna, visible to three remote sites/relays. Each relay also has a 60 GHz PtP link back to this central site, plus a 5 GHz client.

All of the devices at the central site go into a Netonix switch and back to a MikroTik CRS router on the same network.

At the remote sites the 60 GHz and 5 GHz client devices each have their own port on a PowerBox Pro. Their respective access aerials are routed via this router, but the WAN side is on the same network.

Now, using active-backup in this scenario is not a simple case of adding ports to bonding interfaces, because at the master site ALL the devices go via a switch to a bridge on the CRS.

So what we did, and what worked fine until recently, was to set up VLANs off the main port, one pair per bond interface, with each respective site on a separate VLAN ID.
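For reference, the central-site side of that setup looks roughly like this (the trunk port, interface names and VLAN IDs here are illustrative, not our actual config):

```
# On the central CRS, hypothetical trunk port ether1 towards the Netonix switch.
# One pair of VLANs per remote site: one rides the 60 GHz path, one the 5 GHz path.
/interface vlan
add interface=ether1 name=vlan-site1-60g vlan-id=101
add interface=ether1 name=vlan-site1-5g vlan-id=201
# Active-backup bond over the two VLAN interfaces, preferring the 60 GHz path
/interface bonding
add mode=active-backup name=bond-site1 primary=vlan-site1-60g \
    slaves=vlan-site1-60g,vlan-site1-5g
```

The remote Powerbox does the same with its two physical ports, and each site repeats the pattern with its own VLAN IDs.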

This worked: if the 60 GHz link dropped, the 5 GHz took over for the related site via the omni (we use a single omni as site space, spectrum and power budget are all limited).

However, this no longer works. It seems the ARP behaviour has changed: we have noticed we can see devices in the neighbour list that are not on the port it claims. This “leakage” means that when a 60 GHz port drops, the bonding interface still gets some ARP responses, causing it to flap and letting the odd ping packet through the 5 GHz side that appears to be via the 60 GHz side, rendering it useless.
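For anyone wanting to reproduce this, the bond's monitoring knobs are the obvious place to look. A sketch of forcing ARP monitoring against a host that should only answer over the correct path (the bond name and target address are just examples, not our config):

```
# Hypothetical: monitor the slaves by ARPing a far-end host every 100ms,
# instead of relying on carrier/MII state of the VLAN interfaces
/interface bonding
set bond-site1 link-monitoring=arp arp-interval=100ms \
    arp-ip-targets=192.168.1.254
```

The flapping we see suggests those ARP probes are being answered over the wrong path.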

Interestingly, if you swap the primary interface to the 5 GHz side it works fine via 5 GHz, but it then will not fall back to the 60 GHz side, with the same symptoms, if you force the 5 GHz link to drop.

We have had to resort to using EoIP, which resolves the problem but has a huge CPU and therefore power overhead, not to mention it lowers the throughput ceiling somewhat.
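An EoIP variant of the same idea looks something like this (names, addresses and tunnel IDs below are made up for the example; each physical path gets its own tunnel, and the bond runs over the tunnels):

```
# Hypothetical addressing: each physical path carries its own /24 for tunnel endpoints
/interface eoip
add name=eoip-site1-60g local-address=192.168.11.1 remote-address=192.168.11.2 tunnel-id=11
add name=eoip-site1-5g local-address=192.168.22.1 remote-address=192.168.22.2 tunnel-id=12
/interface bonding
add mode=active-backup name=bond-site1 primary=eoip-site1-60g \
    slaves=eoip-site1-60g,eoip-site1-5g
```

Each tunnel is pinned to one physical path by its endpoint addresses, which is why the leakage stops, but every bonded frame gets encapsulated in software, hence the CPU and throughput cost.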

Is there a way of restoring the VLAN method? Or is there a low-CPU method of getting near-instant fallback in this scenario?

We have tried all manner of tricks to get this working again, with no luck.

Thanks

Bill

Interestingly, an example exists on the wiki that does basically what we used to do successfully:

Solution
Bonding interfaces are not supposed to be connected using indirect links, but it is still possible to create a workaround. The idea behind this workaround is to find a way to bypass packets being sent out using the bonding interface. There are multiple ways to force a packet not to be sent out using the bonding interface, but essentially the solution is to create new interfaces on top of physical interfaces and add these newly created interfaces to a bond instead of the physical interfaces. One way to achieve this is to create EoIP tunnels on each physical interface, but that creates a huge overhead and will reduce overall throughput. You should instead create a VLAN interface on top of each physical interface; this creates a much smaller overhead and will not noticeably impact overall performance. Here is an example of how R1 and R2 should be reconfigured:

/interface vlan
add interface=ether1 name=VLAN_ether1 vlan-id=999
add interface=ether2 name=VLAN_ether2 vlan-id=999
/interface bonding
add mode=balance-xor name=bond1 slaves=VLAN_ether1,VLAN_ether2 transmit-hash-policy=layer-2-and-3
/ip address
add address=192.168.1.X/24 interface=bond1
add address=192.168.11.X/24 interface=ether1
add address=192.168.22.X/24 interface=ether2
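For our active-backup use (rather than the wiki's balance-xor), the bonding line would presumably become something like this (the choice of primary is illustrative):

```
/interface bonding
add mode=active-backup name=bond1 primary=VLAN_ether1 \
    slaves=VLAN_ether1,VLAN_ether2
```

Note that active-backup does not use transmit-hash-policy; the /interface vlan and /ip address lines from the wiki example stay the same.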

Bill