VRRP + MLAG

Hello guys!

I’m working on a particular scenario and got stuck while reading the documentation:
[Attached image: Screenshot 2023-11-10 at 3.17.50 PM.png]
The applications are running inside a VMware cluster of 4 machines. Each machine has 2x Mellanox ConnectX-4 dual-port NICs. Each NIC is supposed to be connected to a different switch, with an MLAG setup where each port on each NIC is on its own MLAG with one of the switches. The two MLAGs are essentially one for regular application traffic (i.e. VMs, incoming HTTP requests from the clients, etc.) and one dedicated to VMware traffic such as vSAN. That would ensure that if a switch fails, I can still stay up on the second one.

One level up in the picture, I have the two routers, which would be set up to use VRRP. That way we have a single gateway across the networks.

The internet link is 1Gbps full duplex, so the RB5009 seems to be enough to handle the traffic, especially since the metrics we have indicate the link utilization was never more than 100Mbps.

That is all good, however, the documentation says that:

The MLAG is not compatible with L3 hardware offloading. When using MLAG, the L3 hardware offloading must be disabled.

That statement got me freaking out. And then come the questions:

  1. If I can’t enable L3 HW Offloading, does that mean that all traffic on the switches will have to go thru the CPU making the performance horrible?
  2. For the communication between the VMs or any VMware host machines that are part of the MLAGs, if it is traffic for the same network, do those packets get sent all the way up to the router and back to another machine, or are they forwarded inside the switch itself to the destination machine?

My main goals here would be the following:

  1. Internal traffic within the same network (i.e. VM to VM, vSAN, etc) should not go up to the routers, but be routed internally on the Switch at wire speed
  2. Requests coming from the clients on the internet, regardless of which ISP/Router receives them, should be sent down to the VMware cluster to be handled by the applications
  3. When going out to the internet, use the available path on the MLAG and the current VRRP gateway
  4. If a switch fails, the MLAG would cover the lost connection and keep going
  5. If the router fails or the internet link connected to that router is not working, then VRRP should move the virtual gateway IP to the other router and the traffic should flow preferably thru the 10G SFP+ link. If it is not available, use the 1G link.
  6. In all cases, always prefer the 10G SFP+ uplink

Can someone shed some light on this design and the questions?

Thank you!

I really appreciate any input.

Best regards,
Gutemberg

Hello,

regarding MLAG:

  • traffic between two IP addresses in the same subnet does not need to go up to the routers because, strictly speaking, it is not routed. The behavior is the same as if there was a single switch: the Ethernet frames carrying those IP packets are L2-forwarded/bridged/switched between the respective switch ports, possibly also using the link between the peering switches.
  • L3 offloading is a feature that allows the switch chip to L3-forward/route the IP packets between subnets. It requires that a gateway IP address in each subnet is set up on the switch. In that case, even traffic between addresses in different subnets would be routed by the “switch” device itself, acting as a router. And this is what cannot be supported simultaneously with MLAG, but it should not bother you.
  • with VMware, you don’t need to use LAG at all unless you want to use LACP to check the state of the physical link or unless you have a bandwidth issue that can be solved by using LAG (remember that a single connection always uses a single physical path, so LAG is not a panacea). Without LAG, VMware links each virtual NIC to a particular physical interface in an even fashion and keeps sending all traffic from that NIC via that physical interface for as long as the physical interface is up, so the MAC address of that VNIC stays learned on the switches accordingly. If you don’t need to use LAG on VMware side, there is consequently no need for MLAG on Mikrotik side to obtain redundancy, as independent switches connected together by multiple paths in a redundant fashion (using a LAG between the switches or using just xSTP) are enough.
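To make the L2 vs. L3 distinction above concrete, here is a minimal Python sketch of the decision a host makes before sending a packet. It is not RouterOS code, and the addresses and the /24 prefix length are made-up examples: if the destination is inside the sender's own subnet, the frame is addressed directly to the destination and is only switched; otherwise it is handed to the gateway.

```python
import ipaddress


def needs_router(src: str, dst: str, prefix_len: int) -> bool:
    """Return True if traffic from src to dst must go via a gateway (L3).

    Return False if both addresses share a subnet, in which case the
    frames are simply switched (L2) and never reach the router.
    """
    src_net = ipaddress.ip_interface(f"{src}/{prefix_len}").network
    return ipaddress.ip_address(dst) not in src_net


# Hypothetical addresses, for illustration only.
print(needs_router("10.0.10.11", "10.0.10.12", 24))  # same subnet
print(needs_router("10.0.10.11", "10.0.20.12", 24))  # different subnets
```

Only the second kind of traffic would ever benefit from L3 hardware offloading on the switch.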

Regarding routing: a lot of things depend on whether you get a different public IP address from each ISP and you have to use NAT on the routers or whether you’ve got your own public IP pool and an AS number. If the public IP addresses are linked to the physical uplinks, the task of routing the server responses back using the correct uplink either has to be fulfilled by the servers themselves or you have to use the connection tracking feature of the routers’ firewalls which basically boils down to assigning both the WAN addresses to the same router at the same time and using the other one as a transparent bridge for the respective uplink. In any case, i.e. even if you have got your own AS, unless the choice of WAN is done by the servers, the possible throughput of one of the uplinks will be throttled down to the possible throughput of the link between the routers.

Hello @sindy, thanks for the reply.

traffic between two IP addresses in the same subnet does not need to go up to the routers because, strictly speaking, it is not routed. The behavior is the same as if there was a single switch: the Ethernet frames carrying those IP packets are L2-forwarded/bridged/switched between the respective switch ports, possibly also using the link between the peering switches.

Ok. So for example, the ports that are used with the vSAN traffic for storage purposes, as long as they are part of the same VLAN on both switches, they wouldn’t be routed. Got it.

L3 offloading is a feature that allows the switch chip to L3-forward/route the IP packets between subnets. It requires that a gateway IP address in each subnet is set up on the switch. In that case, even traffic between addresses in different subnets would be routed by the “switch” device itself, acting as a router. And this is what cannot be supported simultaneously with MLAG, but it should not bother you.

The servers have 3 NICs total.

  • 4-port 1Gb Ethernet LOM which is used as the VMWare management ports.
  • Mellanox ConnectX-4 dual 10Gb which are used for the application/VM traffic
  • Mellanox ConnectX-4 dual 25Gb which are used for the vSAN traffic

Those NICs have no need to talk to each other on different VLANs, since they are dedicated to specific purpose, so no inter-VLAN routing. If I understood what you are saying correctly, there is no need for L3 HW routing/offloading in this case as the traffic will always be within the same VLAN.

with VMware, you don’t need to use LAG at all unless you want to use LACP to check the state of the physical link or unless you have a bandwidth issue that can be solved by using LAG (remember that a single connection always uses a single physical path, so LAG is not a panacea). Without LAG, VMware links each virtual NIC to a particular physical interface in an even fashion and keeps sending all traffic from that NIC via that physical interface for as long as the physical interface is up, so the MAC address of that VNIC stays learned on the switches accordingly. If you don’t need to use LAG on VMware side, there is consequently no need for MLAG on Mikrotik side to obtain redundancy, as independent switches connected together by multiple paths in a redundant fashion (using a LAG between the switches or using just xSTP) are enough.

I’m new to VMWare. On Windows Server/Hyper-V the “new” way of doing this is with NIC Team using Switch Independent mode (SET). That way we don’t need LACP/LAG/mLAG at all. What you are saying is that VMWare is essentially doing the same in this case, right? If that is true, then ok, the LAG isn’t necessary. However, I still have to have a link between the two switches, correct?
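A rough Python sketch of the switch-independent idea described in the quote above: each virtual NIC is pinned to one physical uplink and only moves when that uplink goes down. This is an illustration only, not VMware's or Hyper-V's actual algorithm; the names `vm1`, `vmnic0` and so on are placeholders.

```python
def pin_vnics(vnics, uplinks, link_up):
    """Pin each vNIC to one physical uplink, spreading them evenly.

    A vNIC stays on its preferred uplink for as long as that uplink is up,
    and moves to a surviving uplink only when the preferred one is down.
    """
    active = [u for u in uplinks if link_up[u]]
    mapping = {}
    for i, vnic in enumerate(vnics):
        preferred = uplinks[i % len(uplinks)]
        if link_up[preferred]:
            mapping[vnic] = preferred
        elif active:
            mapping[vnic] = active[i % len(active)]
        # If every uplink is down, the vNIC gets no entry at all.
    return mapping


vnics = ["vm1", "vm2", "vm3", "vm4"]
uplinks = ["vmnic0", "vmnic1"]  # one uplink per physical switch

print(pin_vnics(vnics, uplinks, {"vmnic0": True, "vmnic1": True}))
print(pin_vnics(vnics, uplinks, {"vmnic0": True, "vmnic1": False}))
```

Because each vNIC's MAC address shows up on exactly one switch port at a time, the switches need no LAG/MLAG awareness for this to work.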

Regarding routing: a lot of things depend on whether you get a different public IP address from each ISP and you have to use NAT on the routers or whether you’ve got your own public IP pool and an AS number. If the public IP addresses are linked to the physical uplinks, the task of routing the server responses back using the correct uplink either has to be fulfilled by the servers themselves or you have to use the connection tracking feature of the routers’ firewalls which basically boils down to assigning both the WAN addresses to the same router at the same time and using the other one as a transparent bridge for the respective uplink. In any case, i.e. even if you have got your own AS, unless the choice of WAN is done by the servers, the possible throughput of one of the uplinks will be throttled down to the possible throughput of the link between the routers.

AFAICT, the co-location/hosting provider will give me a /29 block. I could assign different IPs to each of the routers.

The thing is, in reality, we have no inbound traffic to the application thru those routers. The way we receive traffic is by using CloudFlare tunnels. Essentially there will be a set of VMs/containers within VMware running cloudflared. Those processes establish a tunnel with CloudFlare from inside out. When traffic hits CloudFlare PoPs, it is routed inside this tunnel to the process, which has a “route table” telling it where to deliver that request. This makes the public IPs only a means to have access to the internet so the tunnels can be established. Ofc we would have a VPN set up on the routers so we can access the infrastructure for management purposes, but besides that, there is no direct inbound traffic thru the routers.

Once the tunnel is established, it will stay there for as long as there is internet connectivity. If the link is down (or the path to that gateway is dead somewhere), the tunnel will try to reconnect, which I assume is where VRRP would help in this case, because the gateway floating IP would be moved to the other (backup) router and everything should move forward.

So with that in mind, and updating the “diagram” here is what I have:
[Attached image: Screenshot 2023-11-11 at 1.18.59 PM.png]
With that scenario, I have the following:

  1. If the connection to any of the switches is broken, the server would use the other one, and if this happens while both switches are alive, the standard LAG bond between the switches would keep the traffic flowing;
  2. If a switch is dead, the servers should use the other one;
  3. If one router is dead, VRRP would swap the floating IP to the other router, making internet access work again;

The only thing this doesn’t cover is the case where the uplink to the ISP on the current “master” router fails: that wouldn’t trigger the VRRP failover. Is there a way, even thru scripting, to trigger the VRRP failover? That way I could somehow check the internet connection on the router itself and, if it fails, move the floating IP to the other available router.

Thanks for the insights, I really appreciate the input.

Just a terminological remark, the abbreviation NIC is typically used for a single Ethernet interface even if it is physically located on a multi-port plug-in card.


What I am saying is what I am saying. I can’t agree or disagree with your comparison to Hyper-V because I never ran Hyper-V on a server so I never dived into its multi-link capabilities :slight_smile:


Yes in your case, not necessarily a direct one in other cases. If the servers do not need to talk to each other on L2, an L2 path between the server NIC and the router NIC may be sufficient. So if both routers have links to both switches, and the servers do not need to talk to each other, there is no need for a direct link between the switches.


Are ISP1 and ISP2 just physical interfaces accessing the same L2 segment (i.e. switch redundancy at data center side) or do ISP1 and ISP2 represent different interconnection subnets that are both capable of accepting route advertisements from you to tell which address from your /29 range is reachable through which of these two L3 subnets?


OK, so for redundancy and maybe load distribution to work (I don’t know how CloudFlare handles the traffic), the cloudflared on each server has to establish the VPN tunnels to multiple CloudFlare POPs. The task of responding via the correct tunnel is the job of cloudflared, but the task of spreading the tunnels from a given server to the individual CloudFlare POPs across the available WAN paths still exists. I gather that the addresses of the VPN servers are given as FQDNs, so the IP numbers may drift over time. On Mikrotik, I would use one address list per such FQDN to keep the DNS translation up to date and control the choice of WAN, but that implies that all the traffic would run through the same router. So for your use case, it seems much more useful to me to move this task to the server as well, and give the server two gateway IP addresses to choose from; these gateway addresses would be VRRP addresses, and while both routers are up, each gateway address would be up on a different router. If one router stopped working, the gateway address preferring it would migrate to the remaining one.


Correct.


The answer to this depends on what ISP1 and ISP2 actually mean:

  • if these are ports connecting the routers to the very same L2 network, you can use VRRP also on WAN side (just make sure you don’t use the same VRRP group IDs as in the data center, as that would cause fireworks). In such a case, you would again use two VRRP interfaces with different IP addresses from the /29, and when both routers are running, each of the two addresses would be up on a different one. The VRRP interfaces have scripts on-master and on-backup that are triggered when the role of the VRRP interface changes to master or backup respectively. These scripts may then adjust the priority of the VRRP interfaces attached to the LAN interfaces.
  • if there are distinct WAN subnets, you’ll need a scheduled script to adjust the LAN side VRRP priorities depending on WAN gateway availability.
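Why adjusting the priority moves the gateway address can be sketched in a few lines of Python. This is a simplified model of the VRRP election (highest priority wins, ties broken by the higher primary IP address, preemption assumed to be enabled), not RouterOS code; the router names, priorities and addresses are hypothetical.

```python
import ipaddress
from dataclasses import dataclass


@dataclass
class VrrpRouter:
    name: str
    priority: int     # higher priority wins the election
    primary_ip: str   # tie-breaker: the higher address wins


def elect_master(routers):
    """Return the router that would own the virtual IP among live routers."""
    return max(
        routers,
        key=lambda r: (r.priority, ipaddress.ip_address(r.primary_ip)),
    )


r1 = VrrpRouter("router1", priority=100, primary_ip="192.168.88.2")
r2 = VrrpRouter("router2", priority=90, primary_ip="192.168.88.3")
print(elect_master([r1, r2]).name)  # router1 holds the gateway address

# A script detects that router1 lost its WAN and lowers its LAN priority.
r1.priority = 50
print(elect_master([r1, r2]).name)  # the gateway address moves to router2
```

So the scripts never move the address themselves; they only change the inputs of the election and let VRRP do the rest.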

Just a terminological remark, the abbreviation NIC is typically used for a single Ethernet interface even if it is physically located on a multi-port plug-in card.

Sorry, pardon the language :smiley:

What I am saying is what I am saying. I can’t agree or disagree with your comparison to Hyper-V because I never ran Hyper-V on a server so I never dived into its multi-link capabilities

I’ve just read more of the docs on VMware networking and yeah, there is NIC teaming with switch independent configuration, just like SET on Windows Server. So yeah, what you said is perfectly correct :smiley:

Yes in your case, not necessarily a direct one in other cases. If the servers do not need to talk to each other on L2, an L2 path between the server NIC and the router NIC may be sufficient. So if both routers have links to both switches, and the servers do not need to talk to each other, there is no need for a direct link between the switches.

The second diagram shows only one patch from each router to one switch, not links to both. The servers may talk to each other, but since all of them will have connections to both switches, that traffic would stay on the switches, so we are good even if they want to talk to each other.

Are ISP1 and ISP2 just physical interfaces accessing the same L2 segment (i.e. switch redundancy at data center side) or do ISP1 and ISP2 represent different interconnection subnets that are both capable of accepting route advertisements from you to tell which address from your /29 range is reachable through which of these two L3 subnets?

They are just different paths to whatever “switch(s)” they have behind the scenes. All they tell me is that I have a /29 range to assign to my router(s) and a gateway IP.

OK, so for redundancy and maybe load distribution to work (I don’t know how CloudFlare handles the traffic), the cloudflared on each server has to establish the VPN tunnels to multiple CloudFlare POPs. The task of responding via the correct tunnel is the job of cloudflared, but the task of spreading the tunnels from a given server to the individual CloudFlare POPs across the available WAN paths still exists. I gather that the addresses of the VPN servers are given as FQDNs, so the IP numbers may drift over time. On Mikrotik, I would use one address list per such FQDN to keep the DNS translation up to date and control the choice of WAN, but that implies that all the traffic would run through the same router. So for your use case, it seems much more useful to me to move this task to the server as well, and give the server two gateway IP addresses to choose from; these gateway addresses would be VRRP addresses, and while both routers are up, each gateway address would be up on a different router. If one router stopped working, the gateway address preferring it would migrate to the remaining one.

The way cloudflare works is that I set them as my DNS server and create a public-facing DNS entry like “app.contoso.com”. Whenever someone makes an HTTP request, cloudflare will resolve that DNS name to the IP address of the PoP closest to the user's location. Once that request enters the local PoP, it is then routed thru cloudflare's global infrastructure until it reaches the PoP closest to our datacenter.

We would have multiple tunnels, probably 1 per physical VMware server, connected to that PoP (or any other we decide on for HA reasons). Cloudflare will then load balance the incoming requests among those tunnels, each of which is pre-configured with the service on my internal network that the packet/request should be sent to. My service just processes the HTTP request and replies to it. So there is no need for the application to figure out which tunnel the request came from, since it is replying to the same HTTP request. If this is a persistent connection like WebSockets for example, it will stick with that server until it is closed, and we handle the processing of “which web socket should I send this to?” at the application level.

That way, we don’t need to expose our service to inbound traffic at all, as cloudflare is dealing with WAF, DDoS, etc. for us. All we receive is clean traffic over a connection that was originated from our system to cloudflare. So in that sense, all the router has to do (naively speaking) is allow outbound connectivity to Cloudflare.

If the network uplink gateway IP is behind the VRRP, then whatever Router has that IP at the moment will be the one forwarding out the packets thru the tunnel. If the router becomes unavailable, it is fine, cloudflared will reconnect and we should be good.

if these are ports connecting the routers to the very same L2 network, you can use VRRP also on WAN side (just make sure you don’t use the same VRRP group IDs as in the data center, as that would cause fireworks). In such a case, you would again use two VRRP interfaces with different IP addresses from the /29, and when both routers are running, each of the two addresses would be up on a different one. The VRRP interfaces have scripts on-master and on-backup that are triggered when the role of the VRRP interface changes to master or backup respectively. These scripts may then adjust the priority of the VRRP interfaces attached to the LAN interfaces.

Ahhh I see. So we set the VRRP like this (hypothetical IPs assuming the /29):

  • Gateway: x.x.x.5
  • Router 1: x.x.x.1
  • Router 2: x.x.x.2
  • VRRP: x.x.x.3
  • Default route on both would use x.x.x.3 with gateway x.x.x.5

Then set the on-master/on-backup scripts to configure the LAN VRRP, which will update the floating gateway IP for the internal network. I wasn’t aware it was a good idea to do VRRP using the public IPs (as long as the group IDs don’t collide, as you said). Fantastic!
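As a quick sanity check of the address plan above, here is a small Python sketch. The 203.0.113.0/29 block is only a documentation-range stand-in for the real /29, and the role names are made up; the point is just that a /29 leaves six usable host addresses and that the four roles must all be distinct addresses inside it.

```python
import ipaddress

# Stand-in block from the documentation range; the real /29 will differ.
block = ipaddress.ip_network("203.0.113.0/29")
hosts = list(block.hosts())  # usable addresses .1 through .6

plan = {
    "router1": hosts[0],      # x.x.x.1
    "router2": hosts[1],      # x.x.x.2
    "vrrp": hosts[2],         # x.x.x.3, the floating address
    "isp_gateway": hosts[4],  # x.x.x.5, provided by the hosting provider
}

# Every role needs its own address, and all must sit inside the same /29.
assert len(set(plan.values())) == len(plan)
assert all(addr in block for addr in plan.values())

for role, addr in plan.items():
    print(f"{role:12} {addr}")
```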

Thank you for all the input! The servers are already with me, will make tests and see what I can get out :smiley:

Well, my suggestion to use VRRP on WAN side was based on the wrong understanding I’ve built, that the cloudflared establishes a pair of tunnels to each POP for redundancy and each of them should use another WAN (ISP1 and ISP2) of your setup. So the essence of the idea was to have two VRRP interfaces on each side (LAN/WAN), that would normally let the traffic use both routers evenly.

In fact the “back to back VRRP” approach cannot detect, let alone handle, many fault scenarios autonomously and needs help of an external script. Even if we adhere to the rule of redundant designs, which says that only the first issue to occur has to be resolved autonomously, the first issue can occur both on the WAN side or on the LAN side, so the on-master and on-backup scripts cannot just blindly adjust the priority of the other VRRP interface - they should trigger some other script that would do some checks and set the priorities as a result of some more complex evaluation.

Given that there is apparently a single tunnel per server per POP, I would use VRRP only on the LAN side of the routers, but I would spend some effort on monitoring the health of at least the WAN connections. The case when the router just dies completely is simple to handle; the case when the router is all right, its physical connections are all right as well, but the end-to-end paths are not, needs more effort to detect and deal with.

One way is a periodically scheduled script that checks the availability of multiple destinations on the internet and adjusts the priority of the VRRP interface if it detects that the internet connection is lost, assuming that the issue is local and the twin router is not affected. I usually also add a check of reachability of devices on the LAN for the case that the connection between the router and the switch is OK but the connection between the switch and all the LAN devices is broken, but it should not be necessary in your case.
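The decision part of such a check can be sketched in Python as below. This only models the logic and is not a RouterOS script; the probe targets, the priority values 100 and 50, and the `min_ok` threshold are all assumptions chosen for illustration. Probing several destinations means that one unreachable host alone does not cause a failover.

```python
def vrrp_priority(probe_results, healthy=100, degraded=50, min_ok=1):
    """Pick the LAN-side VRRP priority from WAN probe results.

    probe_results maps a probe target to True (reachable) or False.
    The router keeps the healthy priority while at least min_ok targets
    answer, and drops to the degraded one otherwise, so the twin router
    (still at its normal priority) takes over the gateway address.
    """
    ok = sum(1 for reachable in probe_results.values() if reachable)
    return healthy if ok >= min_ok else degraded


# Hypothetical probe targets; real ones would be stable public hosts.
print(vrrp_priority({"1.1.1.1": True, "8.8.8.8": False}))   # one target down
print(vrrp_priority({"1.1.1.1": False, "8.8.8.8": False}))  # WAN looks dead
```

The degraded value has to be lower than the priority configured on the twin router, otherwise nothing moves.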

In any case, redundant systems need to be tightly monitored as they handle the first issue so silently that it usually remains unnoticed if no monitoring is in place. So any state change of the VRRP interface on both routers has to be reported, and any issue on the WAN as well. I don’t know what CloudFlare offers in this regard and how big an organisation you are, i.e. whether you have got a 24/7 network monitoring center; what I use for small projects are notifications using Telegram. Mutual monitoring is a must in such case - a device that has lost internet access cannot send a notification, so some other device must tell you it has lost contact with the former one.
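The mutual monitoring idea can be sketched like this in Python: every device periodically records when it last heard from each peer, and reports the peers that have gone silent. The device names, timestamps and the 60-second timeout are assumptions, and the actual notification (Telegram or anything else) is left out.

```python
def silent_peers(last_seen, now, timeout=60.0):
    """Return the peers not heard from within the timeout, sorted by name.

    last_seen maps a peer name to the time (in seconds) of its last
    heartbeat. A device that lost its internet access cannot report that
    itself, so each watcher reports on the others.
    """
    return sorted(
        name for name, ts in last_seen.items() if now - ts > timeout
    )


# Seen from a hypothetical external watcher at t=120s.
heartbeats = {"router1": 100.0, "router2": 30.0}
print(silent_peers(heartbeats, now=120.0))
```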

AFAIK MLAG was never working in a predictable and reliable way. A typically overcomplicated MT nonsense-feature.

If you need to bound hardware-devices as a “big logical device”, MT is definitely not your vendor. MT is good in routing, but switching - OH GOD NO (unless you want a severe depression)!!

Regarding VRRP, it’s a generic RFC protocol, not implemented in ROS the way other vendors do their HA. In ROS, there is basic functionality, but to achieve what other vendors have with a click, you would need heavy scripting in ROS. As well: OH GOD NO (unless you want a severe depression)!!

I’d suggest:
Instead of the CRS317, use hardware from another vendor that supports production-ready stacking (no MLAG nonsense).
For the RB5009, use a hot/cold-backup solution and terminate ISP1 and ISP2 on the “hot” router. In case of a failure, you have to swap cables manually (or use a switch and disable the ISP ports on the “cold” router).

@Guscht thanks for the reply.

I had used 100G Dell switches in another situation and even they use MLAG so I don’t see a problem. Nonetheless, as you see from the thread, @sindy already proposed a better and simpler solution, which is what we are going to follow.

It is true that MT doesn’t implement the latest datacenter features you would find in more expensive switches, like RDMA (RoCE and iWARP), etc. (I’ve already asked them why not), but again, for the price they charge, there is no competition if you can work around or accept the limitations.

Using another vendor for this solution is not an option, at least not now and not for a while.

For the RB5009, use a hot/cold-backup solution and terminate ISP1 and ISP2 on the “hot” router. In case of a failure, you have to swap cables manually (or use a switch and disable the ISP ports on the “cold” router).

I don’t want to drive to the datacenter or pay for white-glove service every time I need to update the router, run a regular failover test, or deal with a real router problem. Yes, we will have to have scripts, but that is just fine.

Thanks