Failover setup options for WAN

Mikrotik just released a video on how to do a simple failover setup for WAN, and people in the comments are asking how to deal with established connections (like TCP sessions or VoIP), since those will break.

Here is a concept I recently created to solve this issue:

The implementation might not be perfect, but it is doable by an end user over a public network. The main issue is that sessions break when the IP address changes due to failover. Sessions are established between fixed endpoints, so if the source or destination IP changes, communication is (usually) no longer possible. So we need a stable and reliable node that makes those connections over the public network; that will be one of C1’s roles.

To get our data to C1 we need a tunnel, a secure and transparent one that can handle failover. L1 and L2 connect to C1 over Wireguard (WG) to create that secure tunnel, and a VXLAN layer on top provides the transparency. In this concept the whole network is one broadcast domain (local network), and the CHR instance in my example is truly used as a Cloud Hosted Router: it is the only routing point in this topology. It also acts as DHCP server, DNS server, etc., as an actual router would. L1 and L2 just pass network frames to C1. It may sound scary, but I don’t see another way to keep those sessions unchanged and alive.

Briefly about roles:

C1 - In my case a CHR instance hosted in the cloud. Acts as VPN concentrator and router; local traffic exits to the public network from here. It can also be any Mikrotik device on premises. The requirements are a static IP so it can accept WG connections, enough CPU power to process the traffic, and obviously a more stable connection than L1 and L2 have, otherwise it is of no use. Bonding and failover decisions happen here.
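To make C1's role more concrete, here is a minimal sketch of what its side could look like. This is not my exact config: interface names, keys, the 10.255.0.0/24 tunnel subnet, the 192.168.88.0/24 LAN, the VNIs, and the choice of active-backup bonding with ARP monitoring are all assumptions for illustration, and it assumes RouterOS v7 accepts the VXLAN interfaces as bonding slaves. Check it against your RouterOS version before using it.

```routeros
# On C1 (sketch; every name, key, address and VNI below is a placeholder)
/interface wireguard add name=wg-links listen-port=13231
/interface wireguard peers add interface=wg-links public-key="<L1-public-key>" allowed-address=10.255.0.2/32
/interface wireguard peers add interface=wg-links public-key="<L2-public-key>" allowed-address=10.255.0.3/32
/ip address add address=10.255.0.1/24 interface=wg-links

# One VXLAN per link, each pointing at the WG address of L1 or L2
/interface vxlan add name=vxlan-l1 vni=101 port=8472
/interface vxlan vteps add interface=vxlan-l1 remote-ip=10.255.0.2
/interface vxlan add name=vxlan-l2 vni=102 port=8472
/interface vxlan vteps add interface=vxlan-l2 remote-ip=10.255.0.3

# Bond the two VXLANs. ARP monitoring is assumed here because a tunnel
# interface will likely not go down by itself when the path behind it dies.
# 192.168.88.2 stands for SW1's address on the LAN.
/interface bonding add name=bond-lan slaves=vxlan-l1,vxlan-l2 mode=active-backup \
    primary=vxlan-l1 link-monitoring=arp arp-interval=100ms arp-ip-targets=192.168.88.2
/ip address add address=192.168.88.1/24 interface=bond-lan
```

DHCP, DNS, NAT and firewall on C1 are left out. VXLAN inside WG also lowers the usable MTU, so expect to tune MTU/MSS.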

L1 and L2 - Mikrotik devices connected to separate ISPs. These can be LTE/5G modems (WG can usually cut through CGNAT, but not always), DSL lines, etc. Their role is to establish the WG tunnel and pass local network frames over VXLAN.
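A rough sketch of the L1 side, assuming RouterOS v7. The interface names, the C1 public address 203.0.113.10, the 10.255.0.0/24 tunnel subnet, the VNI, and ether2 as the port towards SW1 are placeholders I made up for the example. L2 would be the same with its own key, address, and VNI.

```routeros
# On L1 (sketch; every name, key, address and the VNI are placeholders)
/interface wireguard add name=wg-c1 listen-port=13231
/interface wireguard peers add interface=wg-c1 public-key="<C1-public-key>" \
    endpoint-address=203.0.113.10 endpoint-port=13231 \
    allowed-address=10.255.0.1/32 persistent-keepalive=25s
/ip address add address=10.255.0.2/24 interface=wg-c1

# VXLAN runs on top of the WG addresses, so frames travel inside the tunnel
/interface vxlan add name=vxlan-c1 vni=101 port=8472
/interface vxlan vteps add interface=vxlan-c1 remote-ip=10.255.0.1

# Bridge the VXLAN with the port towards SW1; L1 does no routing for the LAN
/interface bridge add name=br-pass
/interface bridge port add bridge=br-pass interface=vxlan-c1
/interface bridge port add bridge=br-pass interface=ether2
```

The persistent keepalive is there because L1 is the side that may sit behind NAT or CGNAT and has to keep the mapping towards C1 open.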

SW1 - The other end of the bonded links and the connection to your physical network. Can be any Mikrotik device as long as it has the capacity required to handle the traffic.

What can others recommend? Sure, there are networking protocols and “enterprise” solutions that your ISPs can offer, but this time I’m talking about a solution that you can create and use on your own. This implementation can serve as a foundation for various configurations, and it scales to more failover links or to linking locations. It is an SD-WAN solution, as it’s called nowadays.

In @druvis’s “starter” scheme using “recursive routes”, those connections will break, and client applications will have to reconnect/retry themselves, so users will notice. The good news is that internet access comes back relatively quickly (vs. no failover). But at the end of the day, clients will notice on something like VoIP, and web browsers may need a “refresh” to recover after a failover is triggered.
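For reference, the recursive-route scheme looks roughly like this in RouterOS v7. This is a sketch and not the exact config from the video: the ISP gateways 192.0.2.1 and 198.51.100.1 and the probe addresses 1.1.1.1 and 8.8.8.8 are placeholders.

```routeros
# Host routes to the probe addresses, one pinned to each ISP gateway
/ip route add dst-address=1.1.1.1/32 gateway=192.0.2.1 scope=10
/ip route add dst-address=8.8.8.8/32 gateway=198.51.100.1 scope=10

# Default routes resolve recursively through the probe addresses.
# check-gateway=ping takes a route out of use when its probe stops answering.
/ip route add dst-address=0.0.0.0/0 gateway=1.1.1.1 distance=1 target-scope=11 check-gateway=ping
/ip route add dst-address=0.0.0.0/0 gateway=8.8.8.8 distance=2 target-scope=11 check-gateway=ping
```

When the distance=1 route goes inactive, new connections leave via the second ISP and get NATed to a different public IP, which is exactly why the existing sessions break.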

I think the idea in the video is to keep it simple. The “bet” is that failover should be rare, and that most protocols will recover within a minute or less. No approach on the end-user side alone can offer “seamless failover”*, since the public IP address is going to change after failover, regardless of the method you use in RouterOS.

*The approach to “seamless failover” is to use a server on the other side of the multiple WANs, with a VPN tunnel over each WAN to that remote server, so that from the remote server/router all traffic has the same public IP towards the internet. But this is a WAY more involved and manual setup.

Now, you can speed up recovery after failover, for example by clearing /ip/firewall/connection, but then it starts to get into scripting, which @druvis wisely avoids in the first video on the topic.
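The clearing itself is a one-liner. As a sketch, this is the bluntest form, something a failover script would run once the route change has happened:

```routeros
# Drops every entry in connection tracking, so clients re-establish
# their sessions over whatever path is active now.
/ip firewall connection remove [find]
```

Note this wipes all tracked connections, not only the ones that used the failed WAN, so a real script would probably want a narrower `find`.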

There are three approaches possible on the end-user side:

  • recursive routes - in the recent video
  • netwatch / scripting - discussed in many forum posts, since it does get more complex with routing tables and more config
  • PCC - which already has a video, but like netwatch it also requires setting up routing tables:

https://www.youtube.com/watch?v=nlb7XAv57tw
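To give a feel for the netwatch / scripting option, here is a minimal sketch. The gateways, the probe address 1.1.1.1, and the route comments are placeholders, and a real setup would likely need more (routing tables, NAT per WAN, several probes to avoid false alarms).

```routeros
# Pin the probe address to WAN1 so the check really tests that link
/ip route add dst-address=1.1.1.1/32 gateway=192.0.2.1
/ip route add dst-address=0.0.0.0/0 gateway=192.0.2.1 distance=1 comment="wan1-default"
/ip route add dst-address=0.0.0.0/0 gateway=198.51.100.1 distance=2 comment="wan2-default"

# When the probe stops answering, disable the WAN1 default route and flush
# connection tracking; re-enable the route when the probe comes back.
/tool netwatch add host=1.1.1.1 interval=10s timeout=1s \
    down-script="/ip route disable [find comment=\"wan1-default\"]; /ip firewall connection remove [find]" \
    up-script="/ip route enable [find comment=\"wan1-default\"]"
```

Compared to recursive routes, the scripts are where you can add extras like clearing connections or sending a notification, at the cost of more config to maintain.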