Network instability on Proxmox HA infrastructure with MikroTik switches

Hello,

I’m experiencing strange behavior on my Proxmox infrastructure, which consists of 3 hosts and 4 MikroTik switches: 2 dedicated to storage and 2 to VM networking. Each pair of switches is interconnected via a bond.

In normal operation, everything runs smoothly. However, during high-availability (HA) tests that involve rebooting one or more hosts, a server can lose its network connection altogether. Even after several reboots, connectivity does not return unless I manually disable and then re-enable a bond port on the switch.

This problem has occurred several times, so I’m wondering whether it is a configuration error on my part or a limitation of the current infrastructure; either way, it seems unreliable in an HA context.

I’d be grateful for your advice or recommendations to stabilize this configuration.

Hi there,

Can you tell us a bit more? Type of switches? Type of bonding (static, LACP active/passive, …)? Spanning-tree configuration (priorities and so forth)? If you have a diagram of your installation, that would help as well.

Could you put together a network diagram similar to this one, so that others can help you triage the issue?

In case it helps, here is an example bond configuration:

add comment="VM NETWORK PROX10" mlag-id=1 mode=802.3ad mtu=9000 name=mbond1 slaves=sfp-sfpplus1 transmit-hash-policy=layer-2-and-3

Proxmox side:
root@prox11:~# cat /proc/net/bonding/bond0
Ethernet Channel Bonding Driver: v6.8.12-10-pve

Bonding Mode: IEEE 802.3ad Dynamic link aggregation
Transmit Hash Policy: layer2+3 (2)
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 0
Down Delay (ms): 0
Peer Notification Delay (ms): 0

802.3ad info
LACP active: on
LACP rate: slow
Min links: 0
Aggregator selection policy (ad_select): stable
System priority: 65535
System MAC address: ec:e7:a7:10:05:f0
Active Aggregator Info:
Aggregator ID: 1
Number of ports: 2
Actor Key: 15
Partner Key: 15
Partner Mac Address: 78:9a:18:39:52:b9

Slave Interface: ens1f0np0
MII Status: up
Speed: 10000 Mbps
Duplex: full
Link Failure Count: 1
Permanent HW addr: ec:e7:a7:10:05:f0
Slave queue ID: 0
Aggregator ID: 1
Actor Churn State: none
Partner Churn State: none
Actor Churned Count: 0
Partner Churned Count: 0
details actor lacp pdu:
system priority: 65535
system mac address: ec:e7:a7:10:05:f0
port key: 15
port priority: 255
port number: 1
port state: 61
details partner lacp pdu:
system priority: 65535
system mac address: 78:9a:18:39:52:b9
oper key: 15
port priority: 255
port number: 1
port state: 61

Slave Interface: ens6f0np0
MII Status: up
Speed: 10000 Mbps
Duplex: full
Link Failure Count: 3
Permanent HW addr: ec:e7:a7:08:37:f8
Slave queue ID: 0
Aggregator ID: 1
Actor Churn State: none
Partner Churn State: none
Actor Churned Count: 1
Partner Churned Count: 1
details actor lacp pdu:
system priority: 65535
system mac address: ec:e7:a7:10:05:f0
port key: 15
port priority: 255
port number: 2
port state: 61
details partner lacp pdu:
system priority: 65535
system mac address: 78:9a:18:39:52:b9
oper key: 15
port priority: 255
port number: 1
port state: 61
root@prox11:~#
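For completeness, the /etc/network/interfaces stanza that typically produces output like the above looks roughly as follows. This is a sketch only: the bridge name vmbr0 and the VLAN-aware setting are assumptions, while the slave names and bond options are taken from the output above.

```
# /etc/network/interfaces (Proxmox/ifupdown2) -- illustrative sketch
auto bond0
iface bond0 inet manual
    bond-slaves ens1f0np0 ens6f0np0    # the two 10G ports shown above
    bond-mode 802.3ad                  # LACP, as in the output
    bond-miimon 100                    # matches "MII Polling Interval (ms): 100"
    bond-xmit-hash-policy layer2+3     # matches "Transmit Hash Policy: layer2+3"
    mtu 9000

auto vmbr0
iface vmbr0 inet manual
    bridge-ports bond0
    bridge-stp off
    bridge-fd 0
    bridge-vlan-aware yes              # assumption: VLANs are used per later posts
    mtu 9000
```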

Are those cas… switches MT switches? And is your Proxmox connected to cas#2 only, as in LACP?

What version is the Proxmox engine?

I’m sorry, I’ve forgotten: was it a Debian or a Red Hat distro?

Can you post your Proxmox dmesg or syslog for the bonding problem? How do you connect your VMs to the network, via vNIC or something else? Have you set your bonding MAC statically, i.e. not a dynamically assigned MAC?

Do you have any VLANs on those switches and on the server bonding interfaces?

@babine, on your diagram it looks like each Proxmox host has a LAG to 2 different switches. Is that how it is connected?

Each host has 4 × 10 Gb ports.

Each host has two 2-port aggregates: one for NFS storage access, the other for VM networks.

All aggregates are LACP, hash layer2+3.
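As a side note on what “hash 2+3” implies: the transmit hash pins each flow (by MAC and IP addresses) to one slave, so a single flow always uses one 10G link and never exceeds it. A deliberately simplified Python illustration of that behaviour (not the kernel’s actual hash function):

```python
# Simplified illustration of layer2+3-style slave selection in a 2-port bond.
# NOT the kernel's exact hash; it only demonstrates the per-flow pinning.
def pick_slave(src_mac: str, dst_mac: str, src_ip: str, dst_ip: str,
               n_slaves: int = 2) -> int:
    # Mix L2 and L3 addresses into one integer, then map onto a slave index.
    h = hash((src_mac, dst_mac, src_ip, dst_ip))
    return h % n_slaves

# The same flow always lands on the same slave: no packet reordering,
# but also no per-flow aggregation of bandwidth across links.
flow = ("ec:e7:a7:10:05:f0", "78:9a:18:39:52:b9", "10.0.0.10", "10.0.0.20")
assert pick_slave(*flow) == pick_slave(*flow)
```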

On the MikroTik side I have 4 switches, arranged as 2 MLAG pairs: one MLAG for NFS storage and another for VM networks.

Everything works perfectly otherwise; I’ve only noticed these problems when rebooting.

I’ve never had this kind of problem with Cisco Nexus and VMware.

Yes, I have VLANs on my infra.

Here is the Proxmox version:
Kernel Version Linux 6.8.12-10-pve (2025-04-18T07:39Z)
Boot Mode EFI
Manager Version pve-manager/8.4.1/2a5fa54a8503f96d

root@prox10:~# dmesg | grep -i bond
[ 10.778157] bond1: (slave ens1f1np1): Enslaving as a backup interface with a down link
[ 10.890885] bond1: (slave ens6f1np1): Enslaving as a backup interface with a down link
[ 11.373741] vmbr1: port 1(bond1) entered blocking state
[ 11.373754] vmbr1: port 1(bond1) entered disabled state
[ 11.373779] bond1: entered allmulticast mode
[ 21.236272] bond1: Warning: No 802.3ad response from the link partner for any adapters in the bond
[ 21.238856] bond1: (slave ens6f1np1): link status definitely up, 10000 Mbps full duplex
[ 21.238869] bond1: active interface up!
[ 21.245611] vmbr1: port 1(bond1) entered blocking state
[ 21.245625] vmbr1: port 1(bond1) entered forwarding state
[ 21.297375] 8021q: adding VLAN 0 to HW filter on device bond1
[ 21.644208] bond0: (slave ens1f0np0): Enslaving as a backup interface with an up link
[ 21.775549] bond0: (slave ens6f0np0): Enslaving as a backup interface with an up link
[ 22.264161] vmbr0: port 1(bond0) entered blocking state
[ 22.264174] vmbr0: port 1(bond0) entered disabled state
[ 22.264201] bond0: entered allmulticast mode
[ 31.905852] bond1: (slave ens1f1np1): link status definitely up, 10000 Mbps full duplex
[ 31.911416] 8021q: adding VLAN 0 to HW filter on device bond0
[ 31.919429] vmbr0: port 1(bond0) entered blocking state
[ 31.919460] vmbr0: port 1(bond0) entered forwarding state
[ 474.275521] bond1: entered promiscuous mode
[1370497.771946] bond1: left promiscuous mode

In the diagram, prox10 has a bond ens1f0np0/ens6f0np0 going to swsto-01 and swsto-02. Can you confirm this is the case?

Because that might not work the way you think it does.

[ 11.373741] vmbr1: port 1(bond1) entered blocking state
[ 11.373754] vmbr1: port 1(bond1) entered disabled state
[ 11.373779] bond1: entered allmulticast mode
[ 21.236272] bond1: Warning: No 802.3ad response from the link partner for any adapters in the bond

I haven’t thoroughly read your first post about those interface priorities etc., but the lines above show STP acting on the VM bridge. It looks like the problem is inside your Proxmox host, not an MT problem, and I don’t think other switches would behave differently either.

Please post the output of brctl show.

And what does the MT syslog say about the LACP interface?

Yes, that’s right.

MT syslog:

[admin@cas-swcamp-01.infra.casino.local] > /log print where message~"lacp"
2025-05-05 10:47:33 system,info device changed by winbox-3.41/mac-msg(winbox):admin@C4:C6:E6:B3:79:8D (/interface set bonding3 comment="VM NETWORK PROX10" disabled=no mtu=9000 name=bonding1; /interface bonding set bonding3 arp=enabled arp-timeout=auto comment="VM NETWORK PROX10" disabled=no down-delay=0ms lacp-rate=30secs link-monitoring=mii mii-interval=100ms min-links=0 mlag-id=20 mode=802.3ad mtu=9000 name=bonding1 primary=none slaves=sfp-sfpplus1 transmit-hash-policy=layer-2-and-3 up-delay=0ms; /queue interface set bonding3)
2025-05-05 10:47:59 system,info device changed by winbox-3.41/mac-msg(winbox):admin@C4:C6:E6:B3:79:8D (/interface set bonding1 comment="VM NETWORK PROX10" disabled=no mtu=9000 name=bonding1; /interface bonding set bonding1 arp=enabled arp-timeout=auto comment="VM NETWORK PROX10" disabled=no down-delay=0ms lacp-rate=30secs link-monitoring=mii mii-interval=100ms min-links=0 mlag-id=1 mode=802.3ad mtu=9000 name=bonding1 primary=none slaves=sfp-sfpplus1 transmit-hash-policy=layer-2-and-3 up-delay=0ms; /queue interface set bonding1)
2025-05-05 10:48:36 system,info device changed by winbox-3.41/mac-msg(winbox):admin@C4:C6:E6:B3:79:8D (/interface set bonding1 comment="VM NETWORK PROX10" disabled=no mtu=9000 name=bonding1; /interface bonding set bonding1 arp=enabled arp-timeout=auto comment="VM NETWORK PROX10" disabled=no down-delay=0ms lacp-rate=30secs link-monitoring=mii mii-interval=100ms min-links=0 mlag-id=1 mode=802.3ad mtu=9000 name=bonding1 primary=none slaves=sfp-sfpplus1 transmit-hash-policy=layer-2-and-3 up-delay=0ms; /queue interface set bonding1)
2025-05-05 10:50:43 system,info device changed by winbox-3.41/mac-msg(winbox):admin@C4:C6:E6:B3:79:8D (/interface set bonding2 comment="VM NETWORK PROX11" disabled=no mtu=9000 name=bonding2; /interface bonding set bonding2 arp=enabled arp-timeout=auto comment="VM NETWORK PROX11" disabled=no down-delay=0ms lacp-rate=30secs link-monitoring=mii mii-interval=100ms min-links=0 mlag-id=2 mode=802.3ad mtu=9000 name=bonding2 primary=none slaves=sfp-sfpplus2 transmit-hash-policy=layer-2-and-3 up-delay=0ms; /queue interface set bonding2)
2025-05-06 11:20:43 system,info device changed by winbox-3.41/mac-msg(winbox):admin@C4:C6:E6:B3:79:8D (/interface set PEERBOND disabled=no mtu=9000 name=PEERBOND; /interface bonding set PEERBOND arp=enabled arp-timeout=auto disabled=no down-delay=0ms lacp-rate=30secs link-monitoring=mii mii-interval=100ms min-links=0 mode=802.3ad mtu=9000 name=PEERBOND primary=none slaves=qsfpplus1-1,qsfpplus2-1 transmit-hash-policy=layer-2-and-3 up-delay=0ms; /queue interface set PEERBOND)
2025-05-06 12:16:44 system,info device added by winbox-3.41/tcp-msg(winbox):admin@172.17.15.126 (*2A = /interface bonding add arp=enabled arp-timeout=auto disabled=no down-delay=0ms lacp-rate=30secs link-monitoring=mii mii-interval=100ms min-links=0 mlag-id=21 mode=802.3ad mtu=1500 name=bonding21 primary=none slaves=sfp-sfpplus21 transmit-hash-policy=encap-2-and-3 up-delay=0ms)
2025-05-20 11:54:56 system,info device added by winbox-3.41/tcp-msg(winbox):admin@192.168.23.7 (*2C = /interface bonding add arp=enabled arp-timeout=auto disabled=no down-delay=0ms lacp-rate=30secs link-monitoring=mii mii-interval=100ms min-links=0 mlag-id=3 mode=802.3ad mtu=9000 name=bonding3 primary=none slaves=sfp-sfpplus3 transmit-hash-policy=layer-2-and-3 up-delay=0ms)
2025-05-20 11:55:21 system,info device changed by winbox-3.41/tcp-msg(winbox):admin@192.168.23.7 (/interface set bonding3 comment="VM NETWORK PROX12" disabled=no mtu=9000 name=bonding3; /interface bonding set bonding3 arp=enabled arp-timeout=auto comment="VM NETWORK PROX12" disabled=no down-delay=0ms lacp-rate=30secs link-monitoring=mii mii-interval=100ms min-links=0 mlag-id=3 mode=802.3ad mtu=9000 name=bonding3 primary=none slaves=sfp-sfpplus3 transmit-hash-policy=layer-2-and-3 up-delay=0ms; /queue interface set bonding3)
2025-05-20 14:55:38 system,info device added by winbox-3.41/tcp-msg(winbox):admin@192.168.23.7 (*2D = /interface bonding add arp=enabled arp-timeout=auto disabled=no down-delay=0ms lacp-rate=30secs link-monitoring=mii mii-interval=100ms min-links=0 mlag-id=3 mode=balance-rr mtu=9000 name=bonding3 primary=none slaves=sfp-sfpplus3 transmit-hash-policy=layer-2-and-3 up-delay=0ms)
2025-05-20 14:57:21 system,info device changed by winbox-3.41/tcp-msg(winbox):admin@192.168.23.7 (/interface set bonding3 disabled=no mtu=9000 name=bonding3; /interface bonding set bonding3 arp=enabled arp-timeout=auto disabled=no down-delay=0ms lacp-rate=30secs link-monitoring=mii mii-interval=100ms min-links=0 mlag-id=3 mode=802.3ad mtu=9000 name=bonding3 primary=none slaves=sfp-sfpplus3 transmit-hash-policy=layer-2-and-3 up-delay=0ms; /queue interface set bonding3)
2025-05-20 14:57:43 system,info device changed by winbox-3.41/tcp-msg(winbox):admin@192.168.23.7 (/interface set bonding3 comment="VM NETWORK PROX 12" disabled=no mtu=9000 name=bonding3; /interface bonding set bonding3 arp=enabled arp-timeout=auto comment="VM NETWORK PROX 12" disabled=no down-delay=0ms lacp-rate=30secs link-monitoring=mii mii-interval=100ms min-links=0 mlag-id=3 mode=802.3ad mtu=9000 name=bonding3 primary=none slaves=sfp-sfpplus3 transmit-hash-policy=layer-2-and-3 up-delay=0ms; /queue interface set bonding3)
2025-05-28 13:19:18 system,info device changed by winbox-3.41/tcp-msg(winbox):admin@192.168.23.7 (/interface set mbond21 disabled=no mtu=1500 name=mbond21; /interface bonding set mbond21 arp=enabled arp-timeout=auto disabled=no down-delay=0ms lacp-rate=30secs link-monitoring=mii mii-interval=100ms min-links=0 mlag-id=21 mode=802.3ad mtu=1500 name=mbond21 primary=none slaves=sfp-sfpplus21 transmit-hash-policy=encap-2-and-3 up-delay=0ms; /queue interface set mbond21)
2025-05-28 13:19:24 system,info device changed by winbox-3.41/tcp-msg(winbox):admin@192.168.23.7 (/interface set mbond3 comment="VM NETWORK PROX 12" disabled=no mtu=9000 name=mbond3; /interface bonding set mbond3 arp=enabled arp-timeout=auto comment="VM NETWORK PROX 12" disabled=no down-delay=0ms lacp-rate=30secs link-monitoring=mii mii-interval=100ms min-links=0 mlag-id=3 mode=802.3ad mtu=9000 name=mbond3 primary=none slaves=sfp-sfpplus3 transmit-hash-policy=layer-2-and-3 up-delay=0ms; /queue interface set mbond3)
2025-05-28 13:19:30 system,info device changed by winbox-3.41/tcp-msg(winbox):admin@192.168.23.7 (/interface set mbond2 comment="VM NETWORK PROX11" disabled=no mtu=9000 name=mbond2; /interface bonding set mbond2 arp=enabled arp-timeout=auto comment="VM NETWORK PROX11" disabled=no down-delay=0ms lacp-rate=30secs link-monitoring=mii mii-interval=100ms min-links=0 mlag-id=2 mode=802.3ad mtu=9000 name=mbond2 primary=none slaves=sfp-sfpplus2 transmit-hash-policy=layer-2-and-3 up-delay=0ms; /queue interface set mbond2)
2025-05-28 13:19:35 system,info device changed by winbox-3.41/tcp-msg(winbox):admin@192.168.23.7 (/interface set mbond1 comment="VM NETWORK PROX10" disabled=no mtu=9000 name=mbond1; /interface bonding set mbond1 arp=enabled arp-timeout=auto comment="VM NETWORK PROX10" disabled=no down-delay=0ms lacp-rate=30secs link-monitoring=mii mii-interval=100ms min-links=0 mlag-id=1 mode=802.3ad mtu=9000 name=mbond1 primary=none slaves=sfp-sfpplus1 transmit-hash-policy=layer-2-and-3 up-delay=0ms; /queue interface set mbond1)
2025-06-13 11:19:35 system,info device changed by winbox-3.41/tcp-msg(winbox):admin@192.168.23.7 (/interface set mbond3 comment="VM NETWORK PROX 12" disabled=no mtu=9000 name=mbond3; /interface bonding set mbond3 arp=enabled arp-timeout=auto comment="VM NETWORK PROX 12" disabled=no down-delay=0ms lacp-rate=30secs link-monitoring=mii mii-interval=100ms min-links=0 mlag-id=3 mode=balance-xor mtu=9000 name=mbond3 primary=none slaves=sfp-sfpplus3 transmit-hash-policy=layer-2-and-3 up-delay=0ms; /queue interface set mbond3)
2025-06-13 11:21:09 system,info device changed by winbox-3.41/tcp-msg(winbox):admin@192.168.23.7 (/interface set mbond3 comment="VM NETWORK PROX 12" disabled=no mtu=9000 name=mbond3; /interface bonding set mbond3 arp=enabled arp-timeout=auto comment="VM NETWORK PROX 12" disabled=no down-delay=0ms lacp-rate=30secs link-monitoring=mii mii-interval=100ms min-links=0 mlag-id=3 mode=802.3ad mtu=9000 name=mbond3 primary=none slaves=sfp-sfpplus3 transmit-hash-policy=layer-2-and-3 up-delay=0ms; /queue interface set mbond3)

Unless the 2 switches are stacked, this is not going to work: all the links in a bond must go to the same switch.

Can you configure that and try again?

They are stacked. I have created an MLAG.
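Since MLAG is central here, a minimal sketch of the RouterOS pieces involved, reconstructed from the syslog above. The bridge name bridge1 is an assumption, and the exact location of the peer-port setting varies by RouterOS version, so verify against the MikroTik MLAG guide before using this:

```
# Inter-switch peer link: a regular LACP bond (name/slaves taken from the syslog)
/interface bonding add name=PEERBOND mode=802.3ad slaves=qsfpplus1-1,qsfpplus2-1

# Declare the peer port on the bridge (assumed bridge name; on recent RouterOS
# releases this setting may live under the dedicated bridge MLAG settings)
/interface bridge set bridge1 peer-port=PEERBOND

# Host-facing bond: must carry the SAME mlag-id on both switches of the pair
/interface bonding add name=mbond1 mode=802.3ad slaves=sfp-sfpplus1 mlag-id=1 \
    transmit-hash-policy=layer-2-and-3 mtu=9000
```

If the mlag-id values or the peer link differ between the two switches, the host-side bond can end up negotiating LACP with only one switch, which would match the symptom of a dead bond after a reboot.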

Taking the time for a proper response.

The LACP rate is 30 secs; can you set it lower? The documentation recommends 1 sec.
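For reference, a sketch of how that change could be applied on both ends (the selector targets all 802.3ad bonds; verify on your setup before pasting):

```
# MikroTik side: send/expect LACPDUs every second instead of every 30 s
/interface bonding set [find mode=802.3ad] lacp-rate=1sec

# Proxmox side (/etc/network/interfaces), in each bond stanza:
#     bond-lacp-rate 1
# then reload the configuration, e.g. with: ifreload -a
```

A faster LACP rate means a dead partner is detected after roughly 3 s (three missed LACPDUs) instead of roughly 90 s, which matters when switches or hosts reboot.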

STP:
Are the priority and costs the same on both switches in each pair?
Can you set the ports as edge ports, if there is no risk of a loop?
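On the RouterOS side, those two checks could look like this (mbond1 is an example bond name from this thread; run the same print on both switches of a pair and compare):

```
# Compare STP priority/cost settings between the two switches of a pair
/interface bridge print detail
/interface bridge port print detail

# Mark host-facing ports as edge if no loop is possible through them
/interface bridge port set [find interface=mbond1] edge=yes
```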

STP: I compared the settings on my two bridges and found a discrepancy. One of my bridges had port-cost-mode set to short instead of long; this has been corrected.
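For the record, the fix described above corresponds to (bridge1 is an assumed name):

```
/interface bridge set bridge1 port-cost-mode=long
```

Mixing short and long path-cost modes between the switches of a pair can make the two bridges compute different STP topologies, so aligning them is worthwhile.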

As requested, I changed the LACP rate to 1s on all my bonds.

I’m going to run more tests with these changes to try to reproduce the problem.

Hi there! Did it solve the issue?

Hey,
I wasn’t able to reproduce the issue again, so I guess we can say yes!
:vulcan_salute: