Bonding disconnects every 1 min

Hello,

In my homelab configuration, my RB5009 connects to a Cisco SG250 switch through an LACP bond.

/interface bonding
add mode=802.3ad name=Bond_SG250 slaves="ether2(Bond),ether3(Bond)"

Previously, the SG250 worked very well when I configured link aggregation with OpenWrt and an ASUS AX86U.
But now the SG250 repeatedly drops one port, gi7 (connected to “ether3(Bond)”), from the LAG with the RB5009.
Here is the log from the SG250:

04-Feb-2024 03:35:38 %TRUNK-W-PORTREMOVED: Port gi7 removed from Po1
04-Feb-2024 03:35:38 %TRUNK-I-PORTADDED: Port gi7 added to Po1
04-Feb-2024 03:36:38 %TRUNK-W-PORTREMOVED: Port gi7 removed from Po1
04-Feb-2024 03:36:38 %TRUNK-I-PORTADDED: Port gi7 added to Po1
04-Feb-2024 03:37:38 %TRUNK-W-PORTREMOVED: Port gi7 removed from Po1
04-Feb-2024 03:37:38 %TRUNK-I-PORTADDED: Port gi7 added to Po1
04-Feb-2024 03:38:38 %TRUNK-W-PORTREMOVED: Port gi7 removed from Po1
04-Feb-2024 03:38:38 %TRUNK-I-PORTADDED: Port gi7 added to Po1
04-Feb-2024 03:39:38 %TRUNK-W-PORTREMOVED: Port gi7 removed from Po1
04-Feb-2024 03:39:38 %TRUNK-I-PORTADDED: Port gi7 added to Po1
04-Feb-2024 03:40:38 %TRUNK-W-PORTREMOVED: Port gi7 removed from Po1
04-Feb-2024 03:40:38 %TRUNK-I-PORTADDED: Port gi7 added to Po1
04-Feb-2024 03:41:38 %TRUNK-W-PORTREMOVED: Port gi7 removed from Po1
04-Feb-2024 03:41:38 %TRUNK-I-PORTADDED: Port gi7 added to Po1

On the RB5009, no link-down events appear in the log.

BTW, the disconnection is regular (every 1 min), so it may be related to the LACP frames (1 min interval). If I disable LACP and set up a manual (static) aggregation, it works fine.

Any ideas? Thanks.

Hi there!

As far as I know and unless you changed the defaults, the LACPDUs are sent every 30s, so that could be something else.

However!

What LACP mode did you set on the Cisco side? Did you enforce the same load-balancing algo on both ends?

Still on the Cisco side, can you look at the interface counters and see whether there are any errors?
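
For reference, one way to check the LACPDU rate mentioned above on the RouterOS side. This is only a sketch: it assumes the bond name Bond_SG250 from the first post, and the second command is a hypothetical experiment (not something done in this thread) to see whether the flap interval follows the LACP rate.

```
# Show the bond's settings, including lacp-rate (30secs is the default)
/interface bonding print detail where name=Bond_SG250
# Hypothetical test: request fast LACPDUs, then watch whether the flap interval changes
/interface bonding set Bond_SG250 lacp-rate=1sec
```

If the interval between flaps does not change with the rate, the LACP timer itself is less likely to be the cause.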

Thank you for the reply.
I can confirm that the load-balancing algorithm is Layer 2 on both ends and that there are no physical link-down events.
On the Cisco side, I changed the LACP port priority, but gi7 still fails while the other trunk member (gi6) works well.
On the RB5009 side, I used

/interface bonding monitor-slaves Bond_SG250

to monitor the aggregation, and noticed the momentary disconnection, like this:

AP port=ether2(Bond) key=9 flags="A-GSCD--" partner-sys-id=B8:11:4B:**:**:** partner-sys-priority=65535 partner-key=1000 partner-flags="A-GSCD--" 
P port=ether3(Bond) key=9 flags="A-GSCD--" partner-sys-id=B8:11:4B:**:**:** partner-sys-priority=65535 partner-key=1000 partner-flags="A-GS----"

Then I repeated the configuration process, and the disconnection interval became 30 sec.

If I understand you correctly: if you pick two ports that don’t include gi7 on the Cisco it works fine?

Sorry, but my problem is that whichever two ports I pick, one of them disconnects every 30 sec while the other works well. Replacing gi7 with another port makes no difference.
I also tried to set up bonding with a Netgear switch and saw a similar problem.
However, if I disable LACP and use a static LAG on both sides, it works fine, without drops or disconnections.
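
For reference, the static LAG configuration that works is not shown in the thread. A minimal sketch of what it might look like on the RB5009, assuming balance-xor mode with the same layer-2 hash and the bond name from the first post (the switch side would need a matching static, non-LACP LAG):

```
# Assumption: static aggregation via balance-xor; the poster's actual mode is not stated
/interface bonding set Bond_SG250 mode=balance-xor transmit-hash-policy=layer-2
```

With a static LAG no LACPDUs are exchanged, which is consistent with the flapping disappearing in this mode.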

No worries.

Can you send the output of the following commands?

/interface/bonding/print
/interface/bridge/port print
/interface/bridge/print detail



[xhyh@RB5009] > /interface/bonding/print
Flags: X - disabled; R - running 
 0  R name="Bond_SG250" mtu=1500 mac-address=78:9A:18:9F:0D:2F arp=enabled arp-timeout=auto slaves=ether2(Bond),ether3(Bond) mode=802.3ad 
      primary=none link-monitoring=mii mii-interval=100ms lacp-rate=30secs transmit-hash-policy=layer-2 min-links=0
[xhyh@RB5009] > /interface/bridge/port print
Flags: I - INACTIVE; H - HW-OFFLOAD
Columns: INTERFACE, BRIDGE, HW, PVID, PRIORITY, HORIZON
#    INTERFACE       BRIDGE       HW   PVID  PRIORITY  HORIZON
0  H ether1(2500M)   Bridge_Main  yes     1  0x80      none   
1  H Bond_SG250      Bridge_Main  yes     1  0x80      none   
2  H sfp1(10G)       Bridge_Main  yes     1  0x80      none   
3 IH ether4(Backup)  Bridge_Main  yes     1  0x80      none   
[xhyh@RB5009] > /interface/bridge/print detail
Flags: X - disabled, R - running 
 0 R name="Bridge_Main" mtu=auto actual-mtu=1500 l2mtu=1514 arp=enabled arp-timeout=auto mac-address=78:9A:18:9F:0D:2E protocol-mode=none 
     fast-forward=yes igmp-snooping=no auto-mac=yes ageing-time=5m vlan-filtering=no dhcp-snooping=no port-cost-mode=long

Thank you!

For the bridge, could you change the “protocol-mode” to “rstp” and see if it changes something?
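
A minimal sketch of how that change could be applied, assuming the bridge is the Bridge_Main shown in the output above:

```
# Switch the bridge from protocol-mode=none to RSTP
/interface bridge set Bridge_Main protocol-mode=rstp
# Confirm that protocol-mode now reads rstp
/interface bridge print detail
```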

That worked! Thank you for saving me from this problem!
But I can’t figure out why LACP needs RSTP to work. (I understand that without the LAG it would be a loop.)

This is a bit of a feature that is becoming a bug: “protocol-mode=none” not only disables spanning tree but also causes all L2 multicast frames to be forwarded to all ports. As a result, the RB5009 was forwarding the LACPDUs from one Ethernet port to the other, so the Cisco switch saw its own LACPDUs, concluded there was a loop, and dropped the link from the port channel. The last parts are mostly conjecture; I will set up a lab later today to confirm.

I have a ticket open to make “protocol-mode=none” mean simply “no spanning tree” and to add “protocol-mode=transparent” with the same behavior as today’s none.

That really matches my test today. Although I enabled RSTP on both sides, no topology change happens. On the Cisco, all ports forward as before, and no RSTP frames are received from the MikroTik. But it really fixes LACP.