I am experiencing the same issue. I have two CRS520s running MLAG, which works perfectly fine. I then set up VRRP to act as the default gateway for some Proxmox nodes. VRRP itself communicates perfectly fine.
The problem is that if one of the VRRP interfaces goes down, some traffic no longer routes. I can still SSH into the Proxmox nodes and the nodes can ping out, but I can’t reach the Proxmox web interface, and traceroutes from the nodes take upwards of 10 seconds to complete. If I wait around 15 minutes with one VRRP interface down, everything starts working again out of the blue.
To me this feels like something not updating on the nodes rather than on the MikroTiks, but since I’d seen other people with similar problems I thought I would ask whether this is a known issue.
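One way to test the "something not updating on the nodes" theory is to look at what each Proxmox node has cached for the gateway address before and after the failover. The sketch below is only an illustration: it parses the text output of `ip neigh show` and checks whether the gateway entry points at the standard IPv4 VRRP virtual MAC (`00:00:5e:00:01:<VRID>`). The function names, the gateway IP, and the VRID are hypothetical, and it assumes the VRRP interface advertises the standard virtual MAC.

```python
from typing import Dict, Optional, Tuple


def vrrp_virtual_mac(vrid: int) -> str:
    """Return the standard IPv4 VRRP virtual MAC for a given VRID."""
    if not 1 <= vrid <= 255:
        raise ValueError("VRID must be in the range 1-255")
    return "00:00:5e:00:01:{:02x}".format(vrid)


def parse_ip_neigh(output: str) -> Dict[str, Tuple[Optional[str], str]]:
    """Map each IP in `ip neigh show` output to (mac or None, state)."""
    entries = {}
    for line in output.splitlines():
        tokens = line.split()
        if len(tokens) < 2:
            continue  # skip blank or malformed lines
        mac = None
        if "lladdr" in tokens:
            idx = tokens.index("lladdr")
            if idx + 1 < len(tokens):
                mac = tokens[idx + 1].lower()
        # The neighbour state (REACHABLE, STALE, FAILED, ...) is the last token.
        entries[tokens[0]] = (mac, tokens[-1])
    return entries


def gateway_uses_virtual_mac(output: str, gateway_ip: str, vrid: int) -> Optional[bool]:
    """True/False if the gateway entry has a MAC, None if it is missing or unresolved."""
    entry = parse_ip_neigh(output).get(gateway_ip)
    if entry is None or entry[0] is None:
        return None
    return entry[0] == vrrp_virtual_mac(vrid)
```

On a node, the input could come from `subprocess.run(["ip", "neigh", "show"], capture_output=True, text=True).stdout`. If the gateway entry stays on the virtual MAC and stays REACHABLE through the failover, the node's neighbour cache is probably not the culprit, which would point back at the switches.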
I tested disabling MLAG on a duplicate setup (except with CRS328s instead of CRS520s) and it works fine. Enabling MLAG again causes the same behaviour to reappear.
MLAG and L3HW offload don’t work together, so I wouldn’t be surprised if other routing gotchas show up with MLAG enabled. For now, it might be better to use a pair of separate routers LAG’d to the MLAG stack than to try making the switches do the routing as well.
Hmm… I’m wondering whether you have routers in front of those switches facing the internet, or whether you are literally using just those 2 switches for internet + VRRP + MLAG + server farm?
If it is the latter, then no, your design might not work correctly.
The bitter truth is that a high-availability setup is expensive and still has a single point of failure.
VRRP should run on other devices, which provide the floating gateway for those switches and probably handle inter-VLAN routing as well.
MLAG should mostly be used between network devices, i.e. between switches, not directly to servers, even though they support LACP/MLAG, and not directly to edge routers either; those mostly only need LACP.
How those MLAG-involved devices handle STP is the most crucial thing in a high-availability setup.
You need to analyze how the different STP variants perform, e.g. RSTP versus PVST, to know what you are up against.
I’m sorry, I can’t read the config any further, but my guess is that certain VLANs in your setup stopped working because of the nature of RSTP/STP, and hence VRRP preemption didn’t work as expected.
Of those points, I only agree with #2. The rest is untrue, irrelevant, or just doesn’t make sense in this situation.
The whole purpose of MLAG is to help eliminate single points of failure for connected clients, including servers, routers, and other switches. The servers use LACP (LAG) with one connection going to each switch in case either switch fails.
I agree that OP should have a router (or a pair of routers) sitting between the Internet and the CRS328s, doing the actual routing, and just let the MLAG stack be a switch. Because MLAG relies heavily on Layer 2 technologies to keep track of client traffic, it is likely that the VRRP bits are getting caught in the middle.
True, that’s the idea and the ideal expectation. Unfortunately, without proper design, any topology shaped like a rectangle, triangle, or circle involves a loop.
Sometimes, if we are lucky, we get a layer 3 loop: add filters and it’s done. Worse than that, we rely on STP to solve the problem at layer 2.
LACP, PAgP, MLAG, MLT, SMLT, DMLT: they all rely on how layer 2 works underneath, especially VLAN trunking, if any.
So do you think no link will ever drop within a triangle, circle, or rectangle? That is a single point of failure.
Compare how long VRRP preemption takes against the STP blocking, learning, and forwarding periods. We can’t do magic.
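To put rough numbers on that comparison, here is a minimal sketch using the standard timer formulas: the VRRPv3 Master_Down_Interval (3 × advertisement interval + skew time) and the classic 802.1D worst case (max age + 2 × forward delay). The defaults are the textbook ones (1 s advertisements, 20 s max age, 15 s forward delay, 2 s hello), not values read from the OP's config, and the priority of 100 is only an example. The RSTP figure covers detecting a lost neighbour through missed hellos only; actual RSTP convergence on point-to-point links is usually much faster.

```python
def vrrp_master_down_interval(priority: int, advert_interval: float = 1.0) -> float:
    """Seconds before a backup declares the master dead (VRRPv3 formula)."""
    if not 1 <= priority <= 254:
        raise ValueError("backup priority must be in the range 1-254")
    skew_time = (256 - priority) * advert_interval / 256
    return 3 * advert_interval + skew_time


def stp_worst_case_convergence(max_age: float = 20.0, forward_delay: float = 15.0) -> float:
    """Classic 802.1D: wait out max age, then listening + learning."""
    return max_age + 2 * forward_delay


def rstp_neighbor_loss_detection(hello_time: float = 2.0) -> float:
    """RSTP treats a neighbour as lost after three missed hellos."""
    return 3 * hello_time


if __name__ == "__main__":
    # Example values only; substitute the timers from the real config.
    print("VRRP master down: {:.2f} s".format(vrrp_master_down_interval(100)))
    print("STP worst case:   {:.2f} s".format(stp_worst_case_convergence()))
    print("RSTP hello loss:  {:.2f} s".format(rstp_neighbor_loss_detection()))
```

With defaults like these, VRRP takes over within a few seconds while classic STP can keep a port out of forwarding for far longer, so a gateway can be "up" on VRRP while the layer 2 path to it is still converging. None of these timers comes close to the roughly 15 minutes reported above, though, so they do not explain that delay on their own.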