Problem with LACP in MLAGs

Hi,

I have a problem with LACP in an MLAG setup.

Test setup: two CRS518-16XS-2XQ running RouterOS 7.10, and a Linux client using 802.3ad dynamic link aggregation.
Both MikroTiks have a regular bridge and MLAG configured with sfp28-2 as the peer port, which looks fine:

/interface bridge add frame-types=admit-only-vlan-tagged name=local vlan-filtering=yes
/interface bridge port add bridge=local interface=sfp28-2 pvid=299
/interface bridge mlag set bridge=local peer-port=sfp28-2

/interface/bridge/mlag/monitor once
       status: connected
    system-id: 48:A9:8A:11:1B:15
  active-role: primary      # other device is secondary

I then add a distributed bond on both devices:

/interface bonding add mlag-id=10 mode=802.3ad name=bond14-mlag slaves=sfp28-14 transmit-hash-policy=layer-2-and-3
/interface bridge port add bridge=local interface=bond14-mlag

/interface/bonding/monitor bond14-mlag once 
                  mode: 802.3ad
          active-ports: sfp28-14
        inactive-ports: 
        lacp-system-id: XX:XX:XX:XX:XX:XX
  lacp-system-priority: 65535
lacp-partner-system-id: YY:YY:YY:YY:YY:YY

/interface/bonding/monitor-slaves  bond14-mlag once yes 
Flags: A - active; P - partner 
 AP port=sfp28-14 key=15 flags="A-GSCD--" partner-sys-id=XX:XX:XX:XX:XX:XX partner-sys-priority=65535 partner-key=15 partner-flags="A-GSCD--"

Everything looks fine so far. I didn’t show the VLAN configuration, but the bond works.
Things go wrong when I disable one interface on my Linux client to simulate a connection failure:

/interface/bonding/monitor bond14-mlag once            
                  mode: 802.3ad
          active-ports: sfp28-14
        inactive-ports: 
        lacp-system-id: 48:A9:8A:11:1B:16
  lacp-system-priority: 65535

/interface/bonding/monitor-slaves  bond14-mlag once=yes 
Flags: A - active; P - partner 
 A  port=sfp28-14 key=15 flags="A-GS--F-"

Connection tests fail now. Some remarks:

  • Shouldn’t sfp28-14 be listed as ‘inactive’ now?
  • I assume my tests fail because traffic is still being sent to the (now nonfunctional) port; the sniffer indicates that.
  • Port status is still up, MII did not fail.
  • The same test works as intended if I use a local bond on one CRS.
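
To make the two flags strings from monitor-slaves easier to compare, here is a small Python sketch that decodes them. It assumes the eight positions follow the standard LACP actor-state bit order (activity, timeout, aggregation, synchronization, collecting, distributing, defaulted, expired); that mapping and the helper name decode_lacp_flags are my own assumptions, not something taken from the switch output.

```python
# Assumed mapping: flag positions follow the LACP actor-state bit order.
LACP_FLAGS = [
    ("A", "activity"),
    ("T", "timeout"),
    ("G", "aggregation"),
    ("S", "synchronization"),
    ("C", "collecting"),
    ("D", "distributing"),
    ("F", "defaulted"),
    ("E", "expired"),
]


def decode_lacp_flags(flags: str) -> list[str]:
    """Return the names of the flags set in a string such as 'A-GSCD--'."""
    if len(flags) != len(LACP_FLAGS):
        raise ValueError(f"expected {len(LACP_FLAGS)} characters, got {len(flags)}")
    names = []
    for ch, (letter, name) in zip(flags, LACP_FLAGS):
        if ch == letter:
            names.append(name)  # flag is set at this position
        elif ch != "-":
            raise ValueError(f"unexpected character {ch!r} where {letter!r} or '-' was expected")
    return names


if __name__ == "__main__":
    # The two strings are copied from the monitor-slaves output in this post.
    for label, flags in [("before failure", "A-GSCD--"), ("after failure", "A-GS--F-")]:
        print(f"{label}: {', '.join(decode_lacp_flags(flags))}")
```

If that mapping holds, the port loses "collecting" and "distributing" after the failure and gains "defaulted", which would mean it no longer receives LACPDUs from the partner even though it is still listed as active.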

Questions:

  • Is my approach correct, or am I missing something?
  • How can I debug the problem?

Have you tried to reboot both switches? I’ve had a similar issue where rebooting fixed it.