most effective failover?

Which scenarios are you using for dual WAN failover? Just routes, iBGP, or something else?
Any kind of example will be helpful :slight_smile:

I am trying to find a replacement for this kind of failover:

/ip route
add check-gateway=ping distance=1 gateway=8.8.8.8
add check-gateway=ping distance=2 gateway=8.8.4.4
add distance=2 dst-address=8.8.4.4/32 gateway=192.168.2.1 scope=10
add distance=1 dst-address=8.8.8.8/32 gateway=192.168.1.1 scope=10

It depends on what kind of WAN you have - unless you've got public addresses also in the LAN and your own AS number, the only thing you can do is speed up the failure detection using the method described, by means of scripts that ping the canary addresses more frequently than the hardcoded 10-second interval.
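For illustration only, a minimal sketch of such a script (not from any particular setup - the canary 8.8.8.8, the route selection by gateway and the interval are assumptions to adapt); save it as a /system script and run it from /system scheduler every few seconds:

# hypothetical faster canary check for WAN1; 8.8.8.8 is assumed to be WAN1's canary
# and the default route is assumed to be identified by gateway=8.8.8.8
:local received [/ping 8.8.8.8 count=3]
:if ($received = 0) do={
    /ip route disable [find dst-address="0.0.0.0/0" gateway="8.8.8.8" disabled=no]
    :log warning "WAN1 canary lost, default route via 8.8.8.8 disabled"
} else={
    /ip route enable [find dst-address="0.0.0.0/0" gateway="8.8.8.8" disabled=yes]
}

The /32 canary route itself stays in place, so the ping keeps testing the actual WAN path even while the default route is disabled.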

If you trust the availability of your favourite data center much more than that of your ISPs' uplinks, you can run a CHR in that data center, use OSPF and BFD over one tunnel per WAN to connect to that CHR, and do the NAT there.
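Roughly, on the branch router that could look like the sketch below (RouterOS v6 syntax shown, the v7 OSPF/BFD configuration differs; the CHR address 203.0.113.1, the tunnel subnet 10.255.0.0/30 and the interface names are placeholders):

# one GRE tunnel per WAN towards the CHR; bind each tunnel to its WAN via local-address
/interface gre add name=gre-wan1 local-address=<WAN1-address> remote-address=203.0.113.1
/ip address add address=10.255.0.1/30 interface=gre-wan1
/routing ospf network add network=10.255.0.0/30 area=backbone
/routing ospf interface add interface=gre-wan1 network-type=point-to-point use-bfd=yes
# repeat with gre-wan2 bound to WAN2; on the CHR side, originate a default route
# into OSPF and do the src-nat towards the internet there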

What is wrong with it?
What has changed in your requirements?

I mostly have a public static IP on one WAN interface and a NAT-ed IP on the other.
The problem is that the gateway can still be pingable while there is an ISP problem further upstream, so traffic doesn't get beyond the gateway - and then this kind of failover doesn't work. That's why I'm trying to find something better.


Maybe add some netwatch?

/tool netwatch
add host=1.1.1.1 interval=30s \
    down-script="/ip route disable [find dst-address=0.0.0.0/0 gateway=8.8.8.8]\r\
    \n:log error \"ISP_1 is down\"\r\
    \n/ip firewall connection remove [find]\r\
    \n" \
    up-script="/ip route enable [find dst-address=0.0.0.0/0 gateway=8.8.8.8]\r\
    \n:log error \"ISP_1 is up\"\r\
    \n"

So the typical SOHO scenario: no own AS number and two ISPs, i.e. the public addresses from which you connect to servers on the internet differ between the uplinks, and thus a failover to the secondary uplink means that all existing sessions break down.


That’s surprising - this setup (recursive routing, where the “canary” (path-transparency check) IP addresses are routed via the actual gateways and everything else is routed via the canary IPs) deals with exactly the issue you describe, i.e. the actual gateway stays up but the network behind it loses its connection to the rest of the internet. If that happens, the check-gateway ping stops getting responses from the canary IP (the virtual gateway) and the route thus becomes inactive.

So if this “doesn’t work” for you, something else must be broken (I can e.g. imagine that ping keeps getting through an uplink while other traffic doesn’t), or “doesn’t work” must mean something other than how I understand it.


Ah, the /ip firewall connection remove [find] in your netwatch script maybe gives a hint about what “doesn’t work” means? Whereas TCP sessions eventually time out once the remote server stops responding, UDP sessions (IPsec, L2TP, SIP, …) that get refreshed from the LAN side more often than once in 3 minutes stay stuck with the same reply-dst-address. If this is indeed the issue you need to address, concentrate on that - use a scheduled script to remove these connections whenever the route through their respective WAN becomes inactive.
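A minimal sketch of such a scheduled script, assuming the primary WAN's default route is the 0.0.0.0/0 via 8.8.8.8 one from your export and that 192.168.1.2 stands for the router's own address on that WAN (a placeholder):

# run from /system scheduler every ~10 s; flushes connections still NAT-ed to WAN1
# once its default route goes inactive
:if ([:len [/ip route find where dst-address="0.0.0.0/0" && gateway="8.8.8.8" && active]] = 0) do={
    /ip firewall connection remove [find where reply-dst-address~"192.168.1.2"]
    :log warning "WAN1 route inactive, removed its tracked connections"
}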

There is also a follow-up question - what to do when the WAN becomes available again. The answer to this one depends on the usage strategy of the WANs. If the strategy is load distribution, nothing needs to be done - connections that migrated to WAN B due to failure of WAN A may be left running via WAN B even after WAN A recovers. If the strategy is pure backup because WAN B is more expensive and/or offers less bandwidth than WAN A, the script has to remove connections from WAN B once WAN A recovers, but maybe after some guard time rather than immediately.

As compared to netwatch, a scheduled script gives you more flexibility in what it tracks. So e.g. it can inspect the current state of all the WANs at each run, compare it with the state detected during the previous one, and execute actions best matching the particular state change detected (there may be more than two WANs and more than one usage strategy).
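A skeleton of the pure-backup variant with a guard time, just to illustrate the idea (the route identified by gateway=8.8.8.8, the backup WAN address 192.168.2.254 and the ten-run guard time are all assumptions, not anything from this thread):

# run from /system scheduler, e.g. every 30 s
:global wan1UpRuns
:if ([:typeof $wan1UpRuns] = "nothing") do={ :set wan1UpRuns 0 }
:if ([:len [/ip route find where dst-address="0.0.0.0/0" && gateway="8.8.8.8" && active]] > 0) do={
    :set wan1UpRuns ($wan1UpRuns + 1)
} else={
    :set wan1UpRuns 0
}
# after ten consecutive "up" runs (the guard time), push connections off the backup WAN once
:if ($wan1UpRuns = 10) do={
    /ip firewall connection remove [find where reply-dst-address~"192.168.2.254"]
    :log info "WAN1 stable again, removed connections NAT-ed to the backup WAN"
}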

There is no L2TP or any other tunneling on those routers, just plain LAN and WAN usage.
WAN B is only for backup - it's usually a copper ISP - and WAN A is fiber.

So my direct problem is: when I ping that site from The Dude, about 60 packets are lost before internet access comes back, even though everything otherwise seems OK.
And if I test it myself and just unplug the cable from the ISP modem, my routing table shows that the gateway 192.168.0.1 (for example) is still alive and keeps that route active even though there is no internet.
With these routes I can't handle that.

That was just an example of long-lived UDP connections that need to be treated specially to properly migrate to the backup WAN. A continuous ping is yet another example of the same - if you run a continuous ping from the LAN side, it keeps refreshing the existing tracked connection, so the packets keep getting translated to the IP address of the dead WAN until you either stop pinging or remove that connection, so that the next echo request can create a new one even though it has the same ping ID.
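If you want to see (and get rid of) that stuck ping connection manually, something like this should show it in the connection tracking table (just an illustration, not a complete solution - it drops all tracked ICMP connections):

/ip firewall connection print where protocol=icmp
/ip firewall connection remove [find where protocol=icmp]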


Of course the route dst-address=8.8.8.8 gateway=192.168.0.1 remains active - the Ethernet interface stays up if you disconnect the cable on the far side of the modem. But the route dst-address=0.0.0.0/0 gateway=8.8.8.8 must go down if the scope and target-scope values on all the routes involved are set up properly, i.e. if no other route to 8.8.8.8 that the 0.0.0.0/0 via 8.8.8.8 one could resolve through is available in the system.

I use this in multiple places and there is no issue - within 10 seconds at the latest the loss of internet access via the WAN gets noticed and the 0.0.0.0/0 via 8.8.8.8 route gets deactivated. So I’d suggest you run /route print interval=1s where gateway=8.8.8.8 and watch the status flags while you repeat the experiment of disconnecting the DSL cable. You should see the A (active) flag disappear. If it does, the mechanism itself works and the issue you encounter is the surviving tracked connections. If it doesn’t, something is wrong with the scope and target-scope values and 8.8.8.8 remains reachable via some other route.

Okay, your failover is a bit confused, in the sense that there is no need to check the failover (secondary) WAN through a public DNS address.
The reasoning: if the primary is down and the secondary has no access either, you have no internet regardless.
However, perhaps there is some logic in wanting to know?

Please try these settings (NO recursive check on the failover WAN), checking two DNS sites (Cloudflare and then Quad9) for the main primary WAN.
Pay attention to all of the scope entries!
…

/ip route
add check-gateway=ping distance=2 dst-address=0.0.0.0/0 gateway=1.0.0.1 scope=10 target-scope=12
add distance=2 dst-address=1.0.0.1/32 gateway=192.168.1.1 scope=10 target-scope=11
add check-gateway=ping distance=3 dst-address=0.0.0.0/0 gateway=9.9.9.9 scope=10 target-scope=12
add distance=3 dst-address=9.9.9.9/32 gateway=192.168.1.1 scope=10 target-scope=11
+++++++++++++++++++
add comment=SecondaryISP distance=5 dst-address=0.0.0.0/0 gateway=192.168.2.1 scope=10 target-scope=30

…


If perchance you want to check both routes recursively, then try the following (the primary WAN checks Cloudflare and the secondary WAN Quad9):
…

/ip route
add check-gateway=ping distance=2 dst-address=0.0.0.0/0 gateway=1.0.0.1 scope=10 target-scope=12
add distance=2 dst-address=1.0.0.1/32 gateway=192.168.1.1 scope=10 target-scope=11
add distance=5 dst-address=0.0.0.0/0 gateway=9.9.9.9 scope=10 target-scope=12
add distance=5 dst-address=9.9.9.9/32 gateway=192.168.2.1 scope=10 target-scope=11

Thanks, I will try it. It's a bit hard to find a moment because I have a dozen routers, but I'll keep watching and try to find the problem.

Can I just solve this problem by removing all connections when WAN1 goes down/up?

I will try this, but I'm not sure - are those DNS servers safe?

Hmm, only billions of people use them - yes, they are safe. You are not sending any data to them, simply checking whether they are available.
Just ensure you use different DNS servers than these ones under /ip dns - I don't remember why, though.

Because if you use x.x.x.x as the “canary” address (to monitor the transparency of a particular WAN), it is only reachable through that WAN. So strictly speaking you can use the canary addresses also as DNS servers, just not as the only ones.

Apologies for adding a question that's not too relevant. I tried to set up a failover but never understood how to do it properly using @anav's example. I have a PPPoE and a DHCP-client connection, or a static IP and a DHCP-client connection - would it work this simple way?

/interface pppoe-client add add-default-route=yes default-route-distance=1 interface=ether1...
/ip dhcp-client add add-default-route=yes default-route-distance=2 interface=ether2...

I would change the PPPoE connection and the WAN2 connection such that add-default-route=no.

Then you can add the routes needed manually as per the examples.
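For example, a sketch of the same non-recursive-backup scheme on top of a PPPoE primary and a DHCP-client secondary could look like this (pppoe-out1, ether2, the canary 1.0.0.1 and the backup gateway 192.168.2.1 are placeholders; if the DHCP-provided gateway changes, you would have to maintain that last route from a dhcp-client script instead of hardcoding it):

/interface pppoe-client set pppoe-out1 add-default-route=no
/ip dhcp-client set [find interface=ether2] add-default-route=no
/ip route
add check-gateway=ping distance=2 dst-address=0.0.0.0/0 gateway=1.0.0.1 scope=10 target-scope=12
add distance=2 dst-address=1.0.0.1/32 gateway=pppoe-out1 scope=10 target-scope=11
add comment=SecondaryISP distance=5 dst-address=0.0.0.0/0 gateway=192.168.2.1 scope=10 target-scope=30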

If you have to piggyback on a topic, choose a long-sleeping one rather than turning a live one into a multi-threaded mess. Or, even better, just create your own and wait for the moderators to approve it; it rarely takes longer than a day.

To your question - @anav’s advice will work in a limited number of cases. If you want a better one, do what I’ve suggested above.

I will change it. I'll also try a solution for faster/better manual WAN switching: if I have problems with that ISP (some pings go up to 200+ ms) and I want to switch manually, I have to disable the rules in /ip route, and that's the only way. Can we maybe make that easier somehow, for the support team? The problem is that if you don't do it fast enough, it gets stuck.

So if I use a DNS address for checking in /ip route, I shouldn't use that DNS server in /ip dns?



You can, but while the WAN whose transparency is monitored using 8.8.4.4 has no connection to the internet, DNS requests to 8.8.4.4 fail, so the DNS clients switch to the other DNS server. Most of them do not use round robin but keep using the same server as long as it is reachable; once it stops responding, they move to the next one in the list and keep using that one even if the previous one becomes reachable again. So if the primary link fails now and then whereas the backup is stable, your DNS traffic ends up staying on the backup link.
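As a concrete illustration of keeping the two sets separate (a hypothetical choice, assuming 8.8.8.8 and 8.8.4.4 stay as the route-checking canaries from the original export), the resolver would then use different servers:

/ip dns set servers=1.1.1.1,9.9.9.9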

Interesting - in any case, there are lots of decent DNS servers out there.
My advice is to stick with good DNS servers for your /ip dns service, like Cloudflare/Quad9,
and use Google for the route checking…

So the better idea is to use, for example, 1.0.0.1 and 9.9.9.9 for checking in /ip route, and 8.8.4.4 and 8.8.8.8 for /ip dns?

From your post

/ip route
add check-gateway=ping distance=2 dst-address=0.0.0.0/0 gateway=1.0.0.1 scope=10 target-scope=12
add distance=2 dst-address=1.0.0.1/32 gateway=192.168.1.1 scope=10 target-scope=11
add check-gateway=ping distance=3 dst-address=0.0.0.0/0 gateway=9.9.9.9 scope=10 target-scope=12
add distance=3 dst-address=9.9.9.9/32 gateway=192.168.1.1 scope=10 target-scope=11
+++++++++++++++++++
add comment=SecondaryISP distance=5 dst-address=0.0.0.0/0 gateway=192.168.2.1 scope=10 target-scope=30



or recursive



/ip route
add check-gateway=ping distance=2 dst-address=0.0.0.0/0 gateway=1.0.0.1 scope=10 target-scope=12
add distance=2 dst-address=1.0.0.1/32 gateway=192.168.1.1 scope=10 target-scope=11
add distance=5 dst-address=0.0.0.0/0 gateway=9.9.9.9 scope=10 target-scope=12
add distance=5 dst-address=9.9.9.9/32 gateway=192.168.2.1 scope=10 target-scope=11

Which is better to use for our setup? Should we add some connection-tracking removal or something? And is there any way to switch routes manually that's faster than disabling them in /ip route?