Everyone,
We developed a fault-tolerant solution for our main MikroTik CCR 1036 router. This router serves as the main shared firewall for our entire hosted customer base. Right now, we have 25 subnets and 47 tunnels to our various clients, so we needed a way to minimize downtime when this firewall goes down.
I looked at several different methods to accomplish this goal. The main issue with redundant firewalls like a SonicWall HA pair is that the two units often share a config, so a problem in the shared config affects both units at once. The second issue is cost: hardware that matches the CCR 1036 in performance and also provides redundancy would be very expensive.
Our solution uses two MikroTiks configured so that if FW1 goes down or is rebooted, FW2 immediately takes over all routing and tunnels. We can also trigger a failover from FW1 to FW2 on command, which lets us update RouterOS or the firmware. All of this happens with less than 30 seconds of total downtime.
Here is how it works:
The 1036 has 12 Ethernet ports. In our scenario, ether1 and ether2 are bonded together to carry internet traffic. Ports ether3 through ether8 are bonded together to carry VLAN traffic for our subnets.
Port ether9 is directly connected between the two firewalls. It has a Netwatch entry which monitors whether the other firewall is responding.
Port ether10 is a backup internet port. We use it so that the standby firewall, meaning the one which is not in production, has a path to the internet.
Port ether11 is the failover port. It is directly connected. If port ether11 is turned off, then the firewalls are configured to run with FW2 in production and FW1 in standby mode.
Port ether12 is used for limp mode. In limp mode, the firewall reboots with all its ports turned completely off, which ensures that a rebooting firewall does not affect production. A dedicated IP address is used just for limp mode. Once the firewall reboots, a script sets it up to go into standby mode.
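To make the ether9 peer check more concrete, here is a minimal Python sketch of the idea, not the actual RouterOS configuration. It assumes the monitor behaves like a Netwatch entry with an up script and a down script: it probes the peer on a schedule and runs an action only when the state changes. The class name PeerMonitor and the probe, on_down, and on_up parameters are hypothetical names chosen for illustration.

```python
from typing import Callable


class PeerMonitor:
    """Track peer reachability and fire a callback only on a state change."""

    def __init__(self,
                 probe: Callable[[], bool],
                 on_down: Callable[[], None],
                 on_up: Callable[[], None]) -> None:
        self.probe = probe        # returns True if the peer answered (assumed: a ping over ether9)
        self.on_down = on_down    # assumed equivalent of a Netwatch down script
        self.on_up = on_up        # assumed equivalent of a Netwatch up script
        self.peer_up = True       # assume the peer is reachable until a probe fails

    def poll(self) -> bool:
        """Run one probe and fire the matching callback if the state flipped."""
        reachable = self.probe()
        if reachable != self.peer_up:
            self.peer_up = reachable
            if reachable:
                self.on_up()
            else:
                self.on_down()
        return reachable
```

Calling poll() once per check interval means repeated failed probes trigger the takeover action once, not on every failed probe.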
The failover works by turning ports off and on as needed, depending on each firewall's intended role. This ensures that between the two firewalls, I always have one in production, and I can address critical issues without being in my colo facility. Since implementing this, our longest period of downtime has been 30 seconds, and that problem was fixed before our team was able to answer the phone call.
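The role logic described above can be sketched in Python as a small decision function plus a port map. This is an illustration under stated assumptions, not the scripts running on the routers. The port groups come from the description, but the exact set of ports enabled in each role is my assumption, as is the rule that a firewall with no responding peer goes to production. The names Role, enabled_ports, and desired_role are hypothetical.

```python
from enum import Enum
from typing import Set


class Role(Enum):
    PRODUCTION = "production"
    STANDBY = "standby"
    LIMP = "limp"


# Port groups as described in the post.
WAN_BOND = {"ether1", "ether2"}                   # bonded internet uplink
VLAN_BOND = {f"ether{n}" for n in range(3, 9)}    # ether3 through ether8
PEER_LINKS = {"ether9", "ether11"}                # peer check and failover links
BACKUP_WAN = {"ether10"}                          # internet path for the standby unit
LIMP_PORT = {"ether12"}                           # limp-mode access


def enabled_ports(role: Role) -> Set[str]:
    """Ports left enabled for a role (assumed mapping, for illustration only)."""
    if role is Role.LIMP:
        # Assumption: only the limp port stays reachable after a reboot.
        return set(LIMP_PORT)
    if role is Role.PRODUCTION:
        return WAN_BOND | VLAN_BOND | PEER_LINKS | LIMP_PORT
    # Standby: production bonds stay off, backup internet port is on.
    return BACKUP_WAN | PEER_LINKS | LIMP_PORT


def desired_role(name: str, failover_link_up: bool, peer_alive: bool) -> Role:
    """Pick the role for firewall `name` ("FW1" or "FW2")."""
    if not peer_alive:
        # Assumption: with no responding peer, this unit must carry production.
        return Role.PRODUCTION
    # With ether11 up, FW1 is in production; with ether11 off, FW2 takes over.
    fw1_in_production = failover_link_up
    if name == "FW1":
        return Role.PRODUCTION if fw1_in_production else Role.STANDBY
    return Role.STANDBY if fw1_in_production else Role.PRODUCTION
```

For example, desired_role("FW2", failover_link_up=False, peer_alive=True) returns Role.PRODUCTION, which matches the commanded failover case where ether11 is turned off.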
If you are interested, I will post the code so you can see how I did it.