Server load balancing with bonding

Hi, I’m (ab)using a bonding interface on a CRS309 to load balance incoming traffic onto three Linux servers and it’s working great! However, ARP link monitoring is giving me trouble.

My configuration (with RouterOS 6.47.1) is:

/interface bonding
add arp-ip-targets=10.0.0.50 link-monitoring=arp mode=balance-xor name=bonding1 slaves=sfp-sfpplus5,sfp-sfpplus6,sfp-sfpplus7 transmit-hash-policy=layer-3-and-4

/interface bridge port
add bridge=bridge interface=sfp-sfpplus1
add bridge=bridge interface=sfp-sfpplus2
add bridge=bridge interface=sfp-sfpplus3
add bridge=bridge interface=sfp-sfpplus4
add bridge=bridge interface=sfp-sfpplus8
add bridge=bridge interface=bonding1
add bridge=bridge interface=ether1

Both the bonding and the bridging are done in hardware with this setup.

The three servers are directly connected to the bonded ports and each is configured with the same IP and MAC addresses on their bonded interfaces. They receive incoming packets on the bonded interface but the outgoing packets go out via a different port. The servers don’t know about the bonding.
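
To illustrate how balance-xor with a layer-3-and-4 hash policy spreads flows across the three slaves, here is a rough Python sketch. It assumes the formula documented for the layer3+4 policy in older Linux kernels' bonding driver; the CRS309 computes its hash in the switch chip, so the real mapping may well differ. The client addresses and ports are made up, and 10.0.0.50 is assumed to be the shared server IP.

```python
import ipaddress


def layer3_4_slave(src_ip, dst_ip, src_port, dst_port, slave_count):
    """Pick a slave index from IPs and ports (older Linux layer3+4 formula).

    This is an illustration only; a hardware-offloaded bond may hash differently.
    """
    # XOR the two IPv4 addresses and keep the low 16 bits.
    ip_xor = int(ipaddress.IPv4Address(src_ip)) ^ int(ipaddress.IPv4Address(dst_ip))
    # Mix in the ports, then reduce to a slave index.
    return ((src_port ^ dst_port) ^ (ip_xor & 0xFFFF)) % slave_count


if __name__ == "__main__":
    # Hypothetical clients talking to the shared server IP on port 443.
    for client in ("10.0.0.101", "10.0.0.102", "10.0.0.103"):
        for sport in (40000, 40001, 40002):
            idx = layer3_4_slave(client, "10.0.0.50", sport, 443, 3)
            print(client, sport, "-> slave", idx)
```

The point is only that every packet of a given flow hashes to the same slave, so each TCP connection stays on one server as long as the set of active slaves does not change.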

This setup is rock solid when using MII link monitoring (to the point that I ran it with a production workload for two weeks) but extremely unreliable with ARP link monitoring, which I’d prefer to use. The issue is significant packet loss (~5-25%, as measured by 1-second interval pings) in seemingly random patterns.
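
To check whether the loss really is random or comes in bursts, the ping sequence numbers can be summarised. The following is a minimal Python sketch; the function name and the way the sequence numbers are collected are assumptions for illustration, not part of my actual setup.

```python
def loss_stats(sent, received_seqs):
    """Return (loss percentage, list of (first, last) lost-sequence bursts).

    sent          -- number of echo requests sent, numbered 1..sent
    received_seqs -- iterable of icmp_seq values that got a reply
    """
    if sent <= 0:
        raise ValueError("sent must be positive")
    received = set(received_seqs)
    lost = [seq for seq in range(1, sent + 1) if seq not in received]

    # Group consecutive lost sequence numbers into bursts.
    bursts = []
    for seq in lost:
        if bursts and seq == bursts[-1][1] + 1:
            bursts[-1][1] = seq
        else:
            bursts.append([seq, seq])

    loss_pct = 100.0 * len(lost) / sent
    return loss_pct, [(first, last) for first, last in bursts]


if __name__ == "__main__":
    # Example: 10 pings sent, replies missing for 3, 6 and 7.
    print(loss_stats(10, [1, 2, 4, 5, 8, 9, 10]))
```

Long bursts would suggest a slave being taken out of the bond for a while, whereas isolated single losses would point more towards individual packets being hashed to a port that is briefly considered down.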

Looking at tcpdump on the servers, they receive the link monitoring ARP requests just fine and reply to them on the same interface, with the appropriate MAC address. Everything looks like it should.

At first I thought the default 100 ms interval was just too quick, but increasing it doesn’t help. In fact, it hurts: if I set the interval to 5 seconds, the server becomes unreachable even though the interface stays up and the link monitoring ARP packets are still flowing.

Any ideas on how to debug this further? I’d especially like to get some details on the status of the slave interfaces on the bonding1 interface, but can’t see any status display or logging for that information.