Hardware Redundancy / Clustering / Standby Router

It would be great if RouterOS had a feature to synchronise its configuration from a partner and enter service when the primary router is unavailable. Something along the lines of transferring a backup or export every X minutes and switching to this configuration when the primary is not reachable.
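The "export every X minutes" half of that idea could be approximated today with a scheduler on the primary. This is only a hedged sketch, assuming the standby runs an FTP service; the address, credentials, and file names are placeholders, not a tested recipe:

```
# On the primary: export the configuration every 5 minutes (interval
# is arbitrary) and push it to the standby over FTP.
/system scheduler add name=sync-config interval=5m on-event="\
    /export file=primary-export;\
    /tool fetch upload=yes mode=ftp address=192.168.88.2 \
        user=backup password=secret \
        src-path=primary-export.rsc dst-path=primary-export.rsc"
```

A binary `/system backup save` could be pushed the same way; the trade-offs between export and backup are discussed further down the thread.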

For example:
All connectivity via the switching layer, and thus presented to the CCR routers as VLANs.
CCR1 connected to CCR2 via a crossover cable on ether8 to transfer configs and perform heartbeats.
CCR2 would essentially be in standby mode. Upon activation, the current running config would be written to disk as 'standby' and the backup config brought into service. When the heartbeat is restored, the current config would be written back as the backup and the standby configuration reloaded.
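The activation step described above might look roughly like this on the standby; the host address and backup names are purely illustrative, and this is a sketch rather than a working recipe:

```
# On CCR2: watch the primary over the ether8 link; on loss, preserve
# the running config, then activate the transferred backup.
# Caveat: /system backup load reboots the router, so a flapping
# heartbeat would cause reboot loops - real deployments need damping.
/tool netwatch add host=10.255.255.1 interval=10s \
    down-script="/system backup save name=standby; /system backup load name=primary-backup" \
    up-script="/system backup load name=standby"
```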

I actually just finished implementing my own version of this using pairs of CCR1009s. Interested in testing it? I have it detecting changes and pushing and restoring a system backup to a secondary. It uses a VRRP interface (directly connected between the pair on ether1) to heartbeat and decide what to do. It detects changes on the active member via the system history. Everything else is a standard configuration; it shuts down the other ether interfaces on the backup. It is a little hacky but seems to work pretty well.
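For anyone curious how change detection via the system history might work, here is a hedged illustration (not the author's actual code, which is linked later in the thread). It assumes a scheduler re-runs this periodically and that a global variable survives between runs:

```
# Illustrative change detection on the active member: compare the
# number of entries in /system history against the value seen on the
# previous run, and re-sync only when something changed.
:global lastHistCount
:local cur [:len [/system history find]]
:if ($cur != $lastHistCount) do={
    :set lastHistCount $cur
    /system backup save name=ha-sync
    # push ha-sync.backup to the standby over the dedicated link
    # (transfer details elided here)
}
```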

I am still polishing it but would be interested in someone else testing it and working on it with me.

Absolutely, I would love to review it.

I initially thought it wouldn't be possible, as exports generally only contain non-default settings and RouterOS, to the best of my knowledge, has no mechanism to reset portions of its configuration (e.g. /int reset).

My next thought was to transfer backups to a web server which could then hopefully inject the necessary netwatch, ether8 config and specific scripts…
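The pull side of that web-server idea is at least straightforward; something along these lines, with a hypothetical URL, could fetch the latest backup to the standby:

```
# Hypothetical standby-side pull of the most recent backup from a
# web server; URL and file name are placeholders.
/tool fetch url="http://config-host.example/primary.backup" \
    mode=http dst-path=primary.backup
```

Injecting the netwatch, ether8 config, and scripts into that backup before serving it is the hard part, since binary backups are not meant to be edited.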

Hi nathan1, I would like to test it too.

I will test it on two CHR labs.

Configuration sync is only one piece of the puzzle.

State sync is just as crucial for failover chassis redundancy to work. If the backup box doesn’t have state information for the firewall, for instance, then all open connections through the firewall would be blocked if the chassis fails over, and all stateful NAT entries are going to fail. If the “stack” is running PPPoE services, then the standby is going to need to know the state of all sessions, session timers, bandwidth counters, etc. If IPSec SAs exist, then those are going to need to be synchronized. If routing protocols are running, then those states will need to be replicated in real time also.

Basically, the soul of box A needs to be able to jump into box B at a moment’s notice, discarding the old body. :wink:

ZeroByte:
I do agree that a continuous real-time sync done internally by Mikrotik is the best way to go about it, but this feature has been requested of Mikrotik for quite some time without being added. Short of that, having a hot (cold, in some sense) standby that is ready to take over when the master drops, even without active state replication, is very useful in my mind.

As it stands today, with Mikrotik devices, we all have singleton routers running. If they die, are we better off having no service to the end user, or coming back up with the latest (hopefully) backup and starting to allow new flows? For my use case, I'd rather be up and running again automatically.

My solution isn’t perfect but I’m hoping that it might be improved/tested by others. Until Mikrotik implements the realtime sync, maybe they can easily add some features that make my implementation more effective. So far, it is working very well for me. YMMV.

I’m still putting some things together, I’m hoping I will have something to offer soon.

As promised:
https://github.com/svlsResearch/ha-mikrotik

If you are bold enough to test this, please heed my warnings. Have a proper test setup with out-of-band access.
If you do have a proper setup and want to give me some feedback from your tests, I'm happy to offer some guidance. The code is extremely alpha, but I do have 2 production pairs actively running it and successfully failing over.

It might be a little hard to get your head around the code without seeing it working. When I have time, I will produce a video of it working, from bootstrap, to configuration, to failing over back and forth.