It’s the first time I have ever seen an F status on a VRRP interface on a Mikrotik; I was unable to replicate it here using any settings I could think of.
So please disable the VRRP interface that shows this status, enable VRRP logging using /system logging add topics=vrrp, run /log print follow-only where topics~"vrrp", re-enable the VRRP interface, and wait until it shows the F status again. Then stop the /log print … and post its output. Hopefully there will be something useful in it.
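For copy-paste convenience, the whole sequence, assuming the interface is called vrrp1 (adjust the name to yours):

/interface vrrp disable [find name=vrrp1]
/system logging add topics=vrrp
/log print follow-only where topics~"vrrp"
# in a second terminal, while the print above is still running:
/interface vrrp enable [find name=vrrp1]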
Just curious if you happened to check the ESXi logs to find the root cause? Anyway, feel free to get back here if you find anything interesting for future reference.
I did not. That is just a minor part of what I am trying to figure out. I’m pretty sure it has to do with the fact that I use 2 physical adapters for that simple switch, and that is causing the issue. I ran into a similar issue with a Cisco ASA a while back.
Still bashing my head against a wall trying to figure out how to use a single WAN IP address for the two routers I have created.
The easiest approach is to use private addresses in the “WAN VRRP” and then NAT out the real public IP. Basically, the VRRP IP address does not have to be in the same subnet as the two members.
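A rough sketch of that NAT variant (the names and private addresses here are invented; x.x.60.100 stands for the real public IP, and the ISP must of course deliver traffic for it to this segment):

# private addresses on the WAN segment; the VIP is private too
# (the other router gets 192.168.99.3/24 and a lower priority)
/ip address add address=192.168.99.2/24 interface=ether1
/interface vrrp add name=vrrp-wan interface=ether1 vrid=1 priority=200
/ip address add address=192.168.99.1/32 interface=vrrp-wan
# translate outgoing traffic to the real public address
/ip firewall nat add chain=srcnat out-interface=ether1 action=src-nat to-addresses=x.x.60.100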
If NAT is not desirable, I’m less sure what to suggest. I suppose you could use a private address as the WAN IP on each, make the VRRP address the public IP with the real prefix (not /32), and route via the VRRP interface. I’m less confident about this path myself, though: at least in older RouterOS, using the VRRP interface as a routing interface itself was slow, and there may be other artifacts in routing without NAT.
At least in RouterOS 6, the address attached to the VRRP interface could indeed be totally unrelated to the address attached to the underlying “physical” one. So you could e.g. use private or APIPA addresses (169.254.x.y) to let the two devices talk VRRP to each other, and a public address with a shorter prefix than /32 as the virtual one; dynamic routing protocols could then advertise the router where the VRRP was currently in master mode as the gateway to that public subnet.
So also here, it should be possible to use just a single public address as the virtual one and let the devices negotiate which of them will use it via private/APIPA addresses.
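As a sketch of the routed variant (all addresses, interface names, and the x.x.60.96/29 subnet are invented for illustration):

# both routers talk VRRP over APIPA addresses on the shared segment
# (the other router gets 169.254.10.2/24 and a lower priority)
/ip address add address=169.254.10.1/24 interface=ether1
/interface vrrp add name=vrrp-wan interface=ether1 vrid=1 priority=200
# the public subnet lives on the VRRP interface; its connected route
# only exists while this router is master, so OSPF & co. can
# redistribute it from whichever router currently holds mastership
/ip address add address=x.x.60.97/29 interface=vrrp-wan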
However, what I find much more problematic about VRRP than synchronizing multiple VRRP interfaces on the same device with one another is its behavior when the underlying physical interface stays up at L1 but the L2 path to the rest of the network is broken. If that happens, a VRRP interface always becomes a master, as it stops receiving VRRP advertisements from the other VRRP routers; and if it then also raises its priority on the other VRRP interface, it starts stealing the traffic and effectively blackholing it.
So once the WAN side VRRP interface becomes a master, it is very important to check that some canary destinations in the internet are indeed reachable through it before raising the LAN side VRRP interface priority to make it become a master.
On the “preferred master” router, the VRRP priority on both interfaces should be higher than on the “preferred backup” one, and it should not be adjusted. On the “preferred backup”, the WAN side VRRP becoming a master should trigger a check that the internet is reachable through it, and only if that check succeeds should the priority of the LAN side VRRP be raised so that it beats the “preferred master” one. The same behavior is necessary for the event that the LAN side VRRP becomes a master, but you have to add some additional checks to prevent a self-locking effect: each VRRP interface on the “preferred backup” side may only raise the priority of the other one if its own priority has not been raised yet.
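A minimal sketch of that logic on the “preferred backup” router, using the on-master/on-backup hooks of /interface vrrp. All names and numbers are assumptions: vrrp-wan and vrrp-lan as interface names, 100 as the base priority, 200 as the raised one (beating the “preferred master”), and 1.1.1.1 as the canary:

# when the WAN side becomes master: only promote the LAN side if our own
# priority is still at its base value (the self-lock guard) and all
# three pings to the canary are answered
/interface vrrp set vrrp-wan on-master=":if ([/interface vrrp get vrrp-wan priority]=100) do={:if ([/ping 1.1.1.1 count=3]=3) do={/interface vrrp set vrrp-lan priority=200}}"
# when the real master comes back: drop the LAN priority again
/interface vrrp set vrrp-wan on-backup="/interface vrrp set vrrp-lan priority=100"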
I’m not sure the VRRP address has to be a /32 when it’s NOT in the same IP subnet?
Yeah… I had a good handle on how this all worked, and on the LAN side it’s all the same. But with v7, the effects of the new routing engine on VRRP are just not well described and subtly different. I don’t use VRRP on WAN now, but I recall that pref-src= on routes didn’t work the same when testing sync connections a while back - and I could never really figure out a scheme to even use it on the WAN side.
So on the WAN interface we would use a private IP address on each router.
VRRP2
RTR 1 = 192.168.1.40
RTR 2 = 192.168.1.50
VRRP Address = x.x.60.100 (WAN address for my internet connection)
In this setup the VRRP packets travel between the .40 and .50 IPs to check keepalive. If the primary router goes offline it triggers the failover and the backup router takes over all routing, NAT, and forwarding duties, but until there is a failover the backup is in a shutdown/blocking state? I just want to make sure I don’t accidentally create a loop or issue even within my test environment right now, as I have some other stuff I am working on there as well.
As a secondary question: would the setup I am trying to implement, using a second CHR as a backup for the WAN connection, be better accomplished using a script of some kind? I am completely unfamiliar with using scripts on Mikrotik, but it is the only other thing I could think of as a way to have a failover solution.
I appreciate the help everyone is giving me as well. Thank you.
To be precise: each of the two routers periodically sends a multicast VRRP “advertisement” from its individual 192.168.1.x address. The one that receives such advertisements whose priority field bears a higher value than its own goes to backup mode, which makes the VRRP interface appear to be down to its routing stack, so it does not use the x.x.60.100 address in this state. The router that doesn’t receive these advertisements at all, or receives them with a lower value in the priority field than its own configured one, runs in master mode so its routing stack can see the VRRP interface as up.
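In RouterOS terms, that scenario would look roughly like this; ether1 as the WAN-facing interface and vrid 2 are assumptions on my part:

# RTR 1
/ip address add address=192.168.1.40/24 interface=ether1
/interface vrrp add name=vrrp-wan interface=ether1 vrid=2 priority=200
/ip address add address=x.x.60.100/32 interface=vrrp-wan
# RTR 2: the same, except 192.168.1.50/24 and a lower priority, e.g. 100

While a router is in backup mode, its vrrp-wan is down and x.x.60.100 is inactive on it; only the master answers ARP for that address, so no loop arises from having it configured on both.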
Some scripting is needed in any case if you want to let the master/backup state of the LAN side track the one of the WAN side and vice versa. But the bigger challenge here is that, given the way it works, the VRRP switches to master mode also if it completely loses contact with the rest of the network via the underlying “physical” interface while that interface itself stays up. So if the WAN side VRRP on the “normally backup” machine becomes a master in this way and propagates that change to the LAN side, it will break things rather than saving them. So you need to “somehow” check that you can actually reach something in the internet via the WAN before you propagate the change to the LAN side. The least complicated way to accomplish this is to use another public address for this test, assigned to the “physical” interface as an additional one (or, to be completely on the safe side, to another VRRP interface attached to the same physical one, to prevent the machine from using that address as the source of the VRRP advertisements it sends). But it should be enough to do this check on the “normally backup” machine, so you only spend one more public address on this, not two.
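Building on the earlier priority sketch, the check would then pin its source to that extra address (x.x.60.101 is invented here, as is the 1.1.1.1 canary):

# hypothetical extra public address, used only for the reachability test
/ip address add address=x.x.60.101/24 interface=ether1
# inside the failover script: require all three replies before trusting the WAN
:if ([/ping 1.1.1.1 count=3 src-address=x.x.60.101]=3) do={
    /interface vrrp set vrrp-lan priority=200
}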
Indeed, if you run a machine in a 3rd party data center, this is exactly what you expect; however, the OP calls a bunch of servers on his own premises a “datacenter”, and it is his responsibility to provide redundancy. So an automatic migration of VMs from a failed host to another one may not be an option.
I host the CHRs myself. In the final setup I will have one CHR at one location and the other CHR at my other site. The two sites are interconnected and can share connectivity, and yes, I can easily spin up the CHR at either site in the event of a failure, but that can take some time to do depending on the type of failure. If I can get something going via VRRP or some kind of script that is able to fail over automatically, that is what I am looking for. This is for a video service my company offers and downtime is very much not a good thing.
Correct. Each of my datacenters, as I use the term, is a Cisco UCS deployment with 4 hosts, one deployment at each of two locations. Each location can be used as a backup for the other in the event of a failure of one site. The four hosts at each site are in a cluster managed by vCenter. In the event of a host failure at a site, vCenter should automatically migrate the VMs on that host to one of the other three. But in the event of a failure of the entire UCS and having to fail over to the other site, I have to manually kick off a failover for my protected VMs. Hope that clarifies some of my physical setup.
I have read some of your previous posts; you probably have the wrong solution design, the wrong application for your streaming service requirements.
this…
So on the WAN interface we would use a private IP address on each router.
VRRP2
RTR 1 = 192.168.1.40
RTR 2 = 192.168.1.50
VRRP Address = x.x.60.100 (WAN address for my internet connection)
It probably won’t work in the real world if those separate datacenters use different subnets, because the floating virtual IP for VRRP needs to be in the same subnet for the heartbeat.
If you were thinking about using that VRRP virtual IP for your streaming service access, then nope: that won’t work if those 2 datacenters are in different subnets.
What you need is a layer 7 load balancer in each datacenter, so you need at least 2 public IPs (they don’t need to be in the same subnet). Have a read on HAProxy. It can be combined with VRRP as well, but you don’t need that part; it will directly make your streaming service highly available on both sides. Point those HAProxy instances at your streaming servers (private IPs; the responses will be routed back through the load balancer).