Mikrotik Fault Tolerance Solution

Everyone,

We developed a fault-tolerant solution for our main MikroTik CCR1036 router. This router is used as the main shared firewall for our entire hosted customer base. Right now, we have 25 subnets and 47 tunnels to our various clients. When this firewall goes down, we need a way to minimize downtime.

I looked at several different methods to accomplish this goal. The main issue with redundant firewalls like a SonicWall HA pair is that the two units in the pair often share a single config, so if there is a problem with that config, both firewalls inherit it. The second issue is cost: hardware that matches the CCR1036 in performance while providing redundancy would be very expensive.

Our solution has two MikroTiks configured so that if FW1 goes down or is rebooted, FW2 immediately takes over all routing and tunnels. We can also trigger a failover from FW1 to FW2 on command, which lets us update RouterOS or the firmware. We do all of this with less than 30 seconds of total downtime.

How it works is this:

The CCR1036 has 12 Ethernet ports. In our scenario, ether1 and ether2 are bonded together to carry internet traffic. Ether3 through ether8 are bonded together to carry VLAN traffic for our subnets.

Port ether9 is directly connected between the two firewalls. It carries a netwatch probe that monitors whether the other firewall is responding.
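As a rough sketch, the heartbeat can be a netwatch entry along these lines. The peer IP and script names here are placeholders, not the actual config:

```
# Hypothetical sketch: watch the peer firewall over the ether9 cross-connect.
# 10.255.255.2 and the script names are illustrative assumptions.
/tool netwatch add host=10.255.255.2 interval=5s timeout=2s \
    up-script=peer-up down-script=peer-down comment="FW peer heartbeat"
```

The down-script is where the standby unit would begin promoting itself to production.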

Port ether10 is a backup internet port. We use it so that the standby firewall, i.e. the one not currently in production, still has a path to the internet.

Port ether11 is the failover port. It is directly connected between the two firewalls. If ether11 is turned off, the firewalls are configured to run with FW2 in production and FW1 in standby mode.
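One way to sketch the on-command failover (RouterOS v6-style syntax; the scheduler name and script name are assumptions) is a scheduler on FW2 that polls the ether11 link state:

```
# Hypothetical sketch: poll the ether11 failover cross-connect; if its link
# goes down, FW2 promotes itself to production. "go-production" would be a
# script that enables FW2's internet and VLAN bonds.
/system scheduler add name=check-failover interval=5s on-event="
  :if ([/interface ethernet get ether11 running] = false) do={
    /system script run go-production
  }"
```

Turning ether11 off on FW1 (or FW1 dying entirely) drops the link and triggers the same path.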

Port ether12 is used for limp mode. In limp mode, the firewall reboots with all its ports turned completely off. This ensures that a rebooting firewall does not affect production. In limp mode, there is an IP address used just for this purpose. Once the firewall reboots, a script sets it up to go into standby mode.
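A minimal sketch of what limp mode might look like, assuming the port layout from this post (the management address is a placeholder):

```
# Hypothetical limp-mode sketch: every port except ether12 stays disabled,
# so the rebooting unit cannot disturb production traffic.
/interface ethernet disable [find where name!="ether12"]
/ip address add address=192.168.99.2/24 interface=ether12 comment="limp-mode mgmt"

# Standby script, run after the reboot completes: bring back only the
# heartbeat link and the backup uplink, leaving production bonds down.
/interface ethernet enable ether9
/interface ethernet enable ether10
```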

The failover works by turning ports off and on as needed, depending on the intended role. This ensures that between the two firewalls I always have one in production, and I can address critical issues without being at my colo facility. Since implementing this, our longest period of downtime was 30 seconds, and that problem was fixed before our team was able to answer the phone call.

If you are interested, I will post the code so you can see how I did it.

I’ve built a setup for a colocation/hosting company with two CCR1036s, but I haven’t done anything like what you’ve done.
I think you went overboard with your configuration. You could have just used VRRP on both sides (upstream/LAN) and achieved a failover time of only 3 seconds.

In my case, upstream failover is handled by BGP, and with BFD the failover is near-instantaneous (though we ended up not using BFD since it wasn’t stable enough).
On an unexpected outage, the failover can be very quick even without BFD, depending on the type of outage (i.e. whether the interface stays up or goes down) and on the BGP configuration on the upstreams, BGP timers, etc.
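To make that concrete, the timer side of it might look like this (RouterOS v6-style syntax; the peer name and values are assumptions, not my actual config):

```
# Hypothetical sketch: aggressive keepalive/hold timers so a dead upstream
# session is torn down quickly even without BFD. "upstream1" is a placeholder.
/routing bgp peer set upstream1 keepalive-time=3s hold-time=9s
```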

On the local side (clients’ VLANs, etc.) we use VRRP, which fails over within 3 seconds of an unexpected outage of the master router.
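A bare-bones sketch of that VRRP side (interface names, VRID, and addresses are all illustrative):

```
# Hypothetical sketch, master router: VRRP instance on the LAN interface.
/interface vrrp add name=vrrp-lan interface=ether2 vrid=10 priority=200
/ip address add address=10.10.0.1/24 interface=vrrp-lan comment="clients' gateway"

# The backup router gets the same vrid on its LAN interface with a lower
# priority (e.g. 100). Clients point at 10.10.0.1, which follows the master.
```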

Also, since VRRP uses the same virtual MAC address on both routers, no ARP updates are needed on any switch or client after the failover. Your method (unless you have forcibly set the same MAC address on the standby router) will cause unnecessary extra downtime for some clients if for some reason they don’t update their ARP tables quickly.

During a scheduled failover there is essentially no downtime. VRRP and BGP can failover gracefully without a single packet lost!

And you only need a minimum of 3 ports (1 uplink, 1 LAN, 1 iBGP link between the two routers) on each CCR, leaving many ports for future expansion (i.e. more uplinks, bonding, etc.).

The issue with MikroTik and HA in general is the lack of any way to sync two routers together.
When I saw your thread I was hoping you would post some way of syncing both routers automatically.
Especially on a stateful firewall with connection tracking, the lack of state sync can cause many issues during a failover.


The catch to all this is how to sync both routers together so you don’t have to do double work every time you need to change something.
I haven’t found anything so far that fits my needs. It has been requested a few times as a native MikroTik feature.

I had considered VRRP. The problem is that our colo datacenter (Peak 10) clears their MAC address tables every 5 hours, instead of anything more reasonable. So when the IP addresses came up with entirely different MAC addresses, the change was not being recognized fast enough.

What I ended up doing was configuring the internet-facing ports on both firewalls with the same MAC address. That way I didn’t have to wait for their ARP tables to clear for the backup to come up. It was not a problem on my side of the network, since I control that network.
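For anyone wanting to try the same trick, it comes down to one command per unit (the interface and MAC below are placeholders; a locally administered address is used so it cannot collide with a vendor-assigned one):

```
# Hypothetical sketch: pin the same MAC on the internet-facing port of
# BOTH firewalls so upstream ARP/MAC tables never need to change.
/interface ethernet set ether1 mac-address=02:00:5E:00:01:01
```

The obvious caveat: both ports must never be active on the upstream segment at the same time, which the port-toggling failover already guarantees.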

I also need a solution for making changes on both firewalls at the same time. At the moment I handle this purely through admin discipline, which is fine for me, but I would like something a little more robust for my guys.

I haven’t given up on this bit. I saw someone who was sending commands to multiple MikroTiks using SSH calls in bash. I am testing that to see if I can adapt that method into something close to what I need.
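As a starting point, a small bash wrapper along these lines can fan one RouterOS command out to both units. The hostnames and the admin user are illustrative assumptions, and the DRY_RUN switch (on by default here) only prints what would be sent:

```shell
#!/usr/bin/env bash
# Sketch: push one RouterOS command to several routers over SSH.
# Hostnames and the "admin" user are placeholders, not a real deployment.
ROUTERS=("fw1.example.net" "fw2.example.net")
DRY_RUN="${DRY_RUN:-1}"             # default to preview mode

push_cmd() {                        # push_cmd "<routeros command>"
  local cmd="$1" r
  for r in "${ROUTERS[@]}"; do
    if [ "$DRY_RUN" = "1" ]; then
      echo "ssh admin@$r $cmd"      # preview only, nothing is sent
    else
      ssh -o ConnectTimeout=5 "admin@$r" "$cmd"
    fi
  done
}

push_cmd "/system identity print"
```

Running it with DRY_RUN=0 would actually execute the command on each router in turn; key-based SSH auth would need to be set up on the MikroTiks first.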

Also, I did get VRRP working in the lab. I liked it, but for various reasons having to do with the way Peak 10 provides connectivity to us, we couldn’t get it to work there.

The nice thing about MikroTik is that you can generally script workarounds to achieve your goals. It’s a shame that your provider cannot adequately support a VRRP-based solution. I personally use VRRP and have found it to work remarkably well. It’s always nice to reboot an edge router and not even drop a VoIP call while traffic switches between routers.