I’m experiencing strange behavior on my Proxmox infrastructure, which consists of 3 hosts and 4 MikroTik switches: 2 dedicated to storage and 2 for VM networking. Each pair of switches is linked via a bond.
In normal operation everything runs smoothly. However, during high-availability (HA) tests that involve rebooting one or more hosts, a server can lose its network connection altogether. Even after several reboots, connectivity does not return unless I manually disable and then re-enable a bond member port on the switch.
This problem has occurred several times, so I’m wondering whether it is a configuration error on my part or a limitation of the current infrastructure; either way, it seems unreliable in an HA context.
I’d be grateful for your advice or recommendations to stabilize this configuration.
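For reference, the manual workaround I apply on the switch side is essentially this (RouterOS CLI; the port name is illustrative, taken from the bond config below):

/interface ethernet disable sfp-sfpplus1
/interface ethernet enable sfp-sfpplus1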
Can you tell us a bit more? Type of switches? Type of bonding (static, LACP active/passive, …)? Spanning-tree configuration (priorities and so forth)? If you have a diagram of your installation, that would help as well.
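For example, something like this would collect most of it (assuming RouterOS on the switches and a Linux bond named bond0 on the Proxmox side; adjust names as needed):

# on the MikroTik switches
/interface bonding print detail
/interface bridge print
/interface bridge port print

# on the Proxmox hosts
cat /proc/net/bonding/bond0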
An example of a bond configuration:
add comment="VM NETWORK PROX10" mlag-id=1 mode=802.3ad mtu=9000 name=mbond1 slaves=sfp-sfpplus1 transmit-hash-policy=layer-2-and-3
Bonding Mode: IEEE 802.3ad Dynamic link aggregation
Transmit Hash Policy: layer2+3 (2)
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 0
Down Delay (ms): 0
Peer Notification Delay (ms): 0
802.3ad info
LACP active: on
LACP rate: slow
Min links: 0
Aggregator selection policy (ad_select): stable
System priority: 65535
System MAC address: ec:e7:a7:10:05:f0
Active Aggregator Info:
Aggregator ID: 1
Number of ports: 2
Actor Key: 15
Partner Key: 15
Partner Mac Address: 78:9a:18:39:52:b9
Slave Interface: ens1f0np0
MII Status: up
Speed: 10000 Mbps
Duplex: full
Link Failure Count: 1
Permanent HW addr: ec:e7:a7:10:05:f0
Slave queue ID: 0
Aggregator ID: 1
Actor Churn State: none
Partner Churn State: none
Actor Churned Count: 0
Partner Churned Count: 0
details actor lacp pdu:
system priority: 65535
system mac address: ec:e7:a7:10:05:f0
port key: 15
port priority: 255
port number: 1
port state: 61
details partner lacp pdu:
system priority: 65535
system mac address: 78:9a:18:39:52:b9
oper key: 15
port priority: 255
port number: 1
port state: 61
Slave Interface: ens6f0np0
MII Status: up
Speed: 10000 Mbps
Duplex: full
Link Failure Count: 3
Permanent HW addr: ec:e7:a7:08:37:f8
Slave queue ID: 0
Aggregator ID: 1
Actor Churn State: none
Partner Churn State: none
Actor Churned Count: 1
Partner Churned Count: 1
details actor lacp pdu:
system priority: 65535
system mac address: ec:e7:a7:10:05:f0
port key: 15
port priority: 255
port number: 2
port state: 61
details partner lacp pdu:
system priority: 65535
system mac address: 78:9a:18:39:52:b9
oper key: 15
port priority: 255
port number: 1
port state: 61
root@prox11:~#
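For reference, the matching Proxmox side of a bond like this typically looks roughly as follows in /etc/network/interfaces (interface names are the ones from the output above; the address and the remaining values are placeholders, not the actual config):

auto bond0
iface bond0 inet manual
    bond-slaves ens1f0np0 ens6f0np0
    bond-mode 802.3ad
    bond-miimon 100
    bond-xmit-hash-policy layer2+3
    bond-lacp-rate slow
    mtu 9000

auto vmbr0
iface vmbr0 inet static
    address 192.0.2.10/24
    bridge-ports bond0
    bridge-stp off
    bridge-fd 0
    mtu 9000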
Are those cas… switches MT switches? And is your Proxmox connected to cas#2 only, i.e. via LACP?
What version is the Proxmox engine?
Sorry, I’ve forgotten: is it a Debian or a Red Hat based distro?
Can you post your Proxmox dmesg or syslog for the bonding problem? How do you connect your VMs to the network, a vNIC or something else? Have you set your bonding MAC statically, i.e. not a dynamically assigned MAC?
Do you have any VLANs on those switches and on the server bonding interface?
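On the static MAC point, one way to pin it (assuming ifupdown2 as shipped with current Proxmox) is a hwaddress line in the bond stanza; the address shown is the first slave's permanent HW address from the output above:

iface bond0 inet manual
    bond-slaves ens1f0np0 ens6f0np0
    bond-mode 802.3ad
    hwaddress ec:e7:a7:10:05:f0
    # classic ifupdown would want: hwaddress ether ec:e7:a7:10:05:f0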
root@prox10:~# dmesg | grep -i bond
[ 10.778157] bond1: (slave ens1f1np1): Enslaving as a backup interface with a down link
[ 10.890885] bond1: (slave ens6f1np1): Enslaving as a backup interface with a down link
[ 11.373741] vmbr1: port 1(bond1) entered blocking state
[ 11.373754] vmbr1: port 1(bond1) entered disabled state
[ 11.373779] bond1: entered allmulticast mode
[ 21.236272] bond1: Warning: No 802.3ad response from the link partner for any adapters in the bond
[ 21.238856] bond1: (slave ens6f1np1): link status definitely up, 10000 Mbps full duplex
[ 21.238869] bond1: active interface up!
[ 21.245611] vmbr1: port 1(bond1) entered blocking state
[ 21.245625] vmbr1: port 1(bond1) entered forwarding state
[ 21.297375] 8021q: adding VLAN 0 to HW filter on device bond1
[ 21.644208] bond0: (slave ens1f0np0): Enslaving as a backup interface with an up link
[ 21.775549] bond0: (slave ens6f0np0): Enslaving as a backup interface with an up link
[ 22.264161] vmbr0: port 1(bond0) entered blocking state
[ 22.264174] vmbr0: port 1(bond0) entered disabled state
[ 22.264201] bond0: entered allmulticast mode
[ 31.905852] bond1: (slave ens1f1np1): link status definitely up, 10000 Mbps full duplex
[ 31.911416] 8021q: adding VLAN 0 to HW filter on device bond0
[ 31.919429] vmbr0: port 1(bond0) entered blocking state
[ 31.919460] vmbr0: port 1(bond0) entered forwarding state
[ 474.275521] bond1: entered promiscuous mode
[1370497.771946] bond1: left promiscuous mode
[ 11.373741] vmbr1: port 1(bond1) entered blocking state
[ 11.373754] vmbr1: port 1(bond1) entered disabled state
[ 11.373779] bond1: entered allmulticast mode
[ 21.236272] bond1: Warning: No 802.3ad response from the link partner for any adapters in the bond
I didn’t thoroughly read your first post about those interface priorities etc., but the lines above show STP working for the VM bridge. It looks like the problem is inside your Proxmox, not an MT problem, and I don’t think other switches would work either.
Please post the output of brctl show.
And what does the MT syslog say about the LACP interface?
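Something along these lines should show it (bond/bridge names are the ones seen earlier; the log filter is just a rough match):

# on the Proxmox host
brctl show
bridge link show    # iproute2 alternative if bridge-utils is not installed

# on the MikroTik switches
/log print where message~"bond"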
STP: I compared the settings on my two bridges and found a disparity. One of them had the port cost mode set to short instead of long. This has been corrected.
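For anyone hitting the same mismatch, on recent RouterOS 7 this is a per-bridge setting, roughly (bridge name is illustrative):

/interface bridge print
/interface bridge set bridge1 port-cost-mode=long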
I also changed the LACP rate to 1s on all my bonds, as requested.
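The same change, sketched for both sides (bond names are the ones from earlier in the thread):

# RouterOS
/interface bonding set mbond1 lacp-rate=1sec

# Proxmox, in the bond stanza of /etc/network/interfaces, then ifreload -a
bond-lacp-rate 1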
I’m going to run more tests with these changes to try to reproduce the problem.