Dual WAN Failover script - feedback pls

After hours spent trying to get netwatch to work for me, I ended up writing my own failover script. I'd be happy to hear people's suggestions on what I have come up with.

https://gist.github.com/ilium007/5cbe63ce9a148746a7842c1dc55bb967

The script is configurable for the ping check destination address, the number of ping requests to send and the required percentage of successful responses; responses must also come in below a configurable latency value to count as successful. The script handles debounce (if required) on the down and up state changes.
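
For anyone who doesn't want to open the gist first, the per-ping latency filtering described above can be sketched roughly like this. The variable names and thresholds here are illustrative, not the gist's actual code; it assumes RouterOS v7's as-value output for /ping:

```
# Illustrative sketch only - see the gist for the real script
:local checkHost "8.8.8.8"
:local pingCount 15
:local maxLatency 110ms
:local requiredPct 80

:local ok 0
:foreach r in=[/ping $checkHost count=$pingCount as-value] do={
    :local rtt ($r->"time")
    # Timed-out probes have no "time" value; discard those and any
    # reply slower than the latency threshold
    :if (($rtt != nothing) && ($rtt <= $maxLatency)) do={
        :set ok ($ok + 1)
    }
}
:if (($ok * 100 / $pingCount) >= $requiredPct) do={
    :log info "Failover Check: $ok/$pingCount successful pings"
} else={
    :log warning "Failover Check: $ok/$pingCount successful pings"
}
```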

The failover script is scheduled to run every 30 seconds.
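
The 30-second schedule is just a standard scheduler entry, something like this (the name failover-check is an example; use whatever the script was imported as):

```
/system scheduler
add name=failover-check interval=30s start-time=startup \
    on-event="/system script run failover-check"
```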

NOTE: the route distance changes are commented out in the script to avoid causing any loss of connectivity until fully tested.

A successful run in the logs:

2025-04-26 19:43:21 script,info Failover Check: Ping seq=0 latency=133 ms - ALLOW
 2025-04-26 19:43:21 script,info Failover Check: Ping seq=1 latency=133 ms - ALLOW
 2025-04-26 19:43:21 script,info Failover Check: Ping seq=2 latency=133 ms - ALLOW
 2025-04-26 19:43:21 script,info Failover Check: Ping seq=3 latency=133 ms - ALLOW
 2025-04-26 19:43:21 script,info Failover Check: Ping seq=4 latency=133 ms - ALLOW
 2025-04-26 19:43:21 script,info Failover Check: Ping seq=5 latency=133 ms - ALLOW
 2025-04-26 19:43:21 script,info Failover Check: Ping seq=6 latency=133 ms - ALLOW
 2025-04-26 19:43:21 script,info Failover Check: Ping seq=7 latency=133 ms - ALLOW
 2025-04-26 19:43:21 script,info Failover Check: Ping seq=8 latency=133 ms - ALLOW
 2025-04-26 19:43:21 script,info Failover Check: Ping seq=9 latency=133 ms - ALLOW
 2025-04-26 19:43:21 script,info Failover Check: Ping seq=10 latency=133 ms - ALLOW
 2025-04-26 19:43:21 script,info Failover Check: Ping seq=11 latency=133 ms - ALLOW
 2025-04-26 19:43:21 script,info Failover Check: Ping seq=12 latency=133 ms - ALLOW
 2025-04-26 19:43:21 script,info Failover Check: Ping seq=13 latency=133 ms - ALLOW
 2025-04-26 19:43:21 script,info Failover Check: Ping seq=14 latency=133 ms - ALLOW
 2025-04-26 19:43:21 script,info Failover Check: 15/15 successful pings (100%)

Forced latency in ICMP response causes WAN failover to secondary_route after 2 failed check cycles:

 2025-04-26 19:49:21 script,warning Failover Check: Ping seq=0 latency=114 ms EXCEEDS threshold (110 ms) - DISCARD
 2025-04-26 19:49:21 script,warning Failover Check: Ping seq=1 latency=111 ms EXCEEDS threshold (110 ms) - DISCARD
 2025-04-26 19:49:21 script,info Failover Check: Ping seq=2 latency=106 ms - ALLOW
 2025-04-26 19:49:21 script,warning Failover Check: Ping seq=3 latency=125 ms EXCEEDS threshold (110 ms) - DISCARD
 2025-04-26 19:49:21 script,warning Failover Check: Ping seq=4 latency=127 ms EXCEEDS threshold (110 ms) - DISCARD
 2025-04-26 19:49:21 script,info Failover Check: Ping seq=5 latency=69 ms - ALLOW
 2025-04-26 19:49:21 script,info Failover Check: Ping seq=6 latency=86 ms - ALLOW
 2025-04-26 19:49:21 script,warning Failover Check: Ping seq=7 latency=116 ms EXCEEDS threshold (110 ms) - DISCARD
 2025-04-26 19:49:21 script,warning Failover Check: Ping seq=8 latency=125 ms EXCEEDS threshold (110 ms) - DISCARD
 2025-04-26 19:49:21 script,warning Failover Check: Ping seq=9 latency=119 ms EXCEEDS threshold (110 ms) - DISCARD
 2025-04-26 19:49:21 script,info Failover Check: Ping seq=10 latency=76 ms - ALLOW
 2025-04-26 19:49:21 script,info Failover Check: Ping seq=11 latency=71 ms - ALLOW
 2025-04-26 19:49:21 script,warning Failover Check: Ping seq=12 latency=113 ms EXCEEDS threshold (110 ms) - DISCARD
 2025-04-26 19:49:21 script,warning Failover Check: Ping seq=13 latency=131 ms EXCEEDS threshold (110 ms) - DISCARD
 2025-04-26 19:49:21 script,info Failover Check: Ping seq=14 latency=102 ms - ALLOW
 2025-04-26 19:49:21 script,info Failover Check: 6/15 successful pings (40%)
 2025-04-26 19:49:21 script,warning Failover Check: DOWN debounce in progress - 1/2 failures.
 2025-04-26 19:49:39 script,info Failover Check: Ping seq=0 latency=92 ms - ALLOW
 2025-04-26 19:49:39 script,info Failover Check: Ping seq=1 latency=90 ms - ALLOW
 2025-04-26 19:49:39 script,info Failover Check: Ping seq=2 latency=100 ms - ALLOW
 2025-04-26 19:49:39 script,warning Failover Check: Ping seq=3 latency=118 ms EXCEEDS threshold (110 ms) - DISCARD
 2025-04-26 19:49:39 script,info Failover Check: Ping seq=4 latency=82 ms - ALLOW
 2025-04-26 19:49:39 script,info Failover Check: Ping seq=5 latency=93 ms - ALLOW
 2025-04-26 19:49:39 script,info Failover Check: Ping seq=6 latency=91 ms - ALLOW
 2025-04-26 19:49:39 script,info Failover Check: Ping seq=7 latency=92 ms - ALLOW
 2025-04-26 19:49:39 script,info Failover Check: Ping seq=8 latency=106 ms - ALLOW
 2025-04-26 19:49:39 script,info Failover Check: Ping seq=9 latency=88 ms - ALLOW
 2025-04-26 19:49:39 script,info Failover Check: Ping seq=10 latency=101 ms - ALLOW
 2025-04-26 19:49:39 script,info Failover Check: Ping seq=11 latency=104 ms - ALLOW
 2025-04-26 19:49:39 script,warning Failover Check: Ping seq=12 latency=120 ms EXCEEDS threshold (110 ms) - DISCARD
 2025-04-26 19:49:39 script,warning Failover Check: Ping seq=13 latency=122 ms EXCEEDS threshold (110 ms) - DISCARD
 2025-04-26 19:49:39 script,info Failover Check: Ping seq=14 latency=81 ms - ALLOW
 2025-04-26 19:49:39 script,info Failover Check: 12/15 successful pings (80%)
 2025-04-26 19:49:39 script,warning Failover Check: Transitioned DOWN after 2 consecutive failures. Set distance=3.

WAN failover to primary_route after 3 successful check cycles:

2025-04-26 19:51:04 script,info Failover Check: Ping seq=0 latency=133 ms - ALLOW
 2025-04-26 19:51:04 script,info Failover Check: Ping seq=1 latency=133 ms - ALLOW
 2025-04-26 19:51:04 script,info Failover Check: Ping seq=2 latency=133 ms - ALLOW
 2025-04-26 19:51:04 script,info Failover Check: Ping seq=3 latency=133 ms - ALLOW
 2025-04-26 19:51:04 script,info Failover Check: Ping seq=4 latency=133 ms - ALLOW
 2025-04-26 19:51:04 script,info Failover Check: Ping seq=5 latency=133 ms - ALLOW
 2025-04-26 19:51:04 script,info Failover Check: Ping seq=6 latency=133 ms - ALLOW
 2025-04-26 19:51:04 script,info Failover Check: Ping seq=7 latency=133 ms - ALLOW
 2025-04-26 19:51:04 script,info Failover Check: Ping seq=8 latency=133 ms - ALLOW
 2025-04-26 19:51:04 script,info Failover Check: Ping seq=9 latency=133 ms - ALLOW
 2025-04-26 19:51:04 script,info Failover Check: Ping seq=10 latency=133 ms - ALLOW
 2025-04-26 19:51:04 script,info Failover Check: Ping seq=11 latency=133 ms - ALLOW
 2025-04-26 19:51:04 script,info Failover Check: Ping seq=12 latency=133 ms - ALLOW
 2025-04-26 19:51:04 script,info Failover Check: Ping seq=13 latency=133 ms - ALLOW
 2025-04-26 19:51:04 script,info Failover Check: Ping seq=14 latency=133 ms - ALLOW
 2025-04-26 19:51:04 script,info Failover Check: 15/15 successful pings (100%)
 2025-04-26 19:51:04 script,info Failover Check: UP debounce in progress - 1/3 successes.
 2025-04-26 19:51:10 script,info Failover Check: Ping seq=0 latency=133 ms - ALLOW
 2025-04-26 19:51:10 script,info Failover Check: Ping seq=1 latency=133 ms - ALLOW
 2025-04-26 19:51:10 script,info Failover Check: Ping seq=2 latency=133 ms - ALLOW
 2025-04-26 19:51:10 script,info Failover Check: Ping seq=3 latency=133 ms - ALLOW
 2025-04-26 19:51:10 script,info Failover Check: Ping seq=4 latency=133 ms - ALLOW
 2025-04-26 19:51:10 script,info Failover Check: Ping seq=5 latency=133 ms - ALLOW
 2025-04-26 19:51:10 script,info Failover Check: Ping seq=6 latency=133 ms - ALLOW
 2025-04-26 19:51:10 script,info Failover Check: Ping seq=7 latency=133 ms - ALLOW
 2025-04-26 19:51:10 script,info Failover Check: Ping seq=8 latency=133 ms - ALLOW
 2025-04-26 19:51:10 script,info Failover Check: Ping seq=9 latency=133 ms - ALLOW
 2025-04-26 19:51:10 script,info Failover Check: Ping seq=10 latency=133 ms - ALLOW
 2025-04-26 19:51:10 script,info Failover Check: Ping seq=11 latency=133 ms - ALLOW
 2025-04-26 19:51:10 script,info Failover Check: Ping seq=12 latency=133 ms - ALLOW
 2025-04-26 19:51:10 script,info Failover Check: Ping seq=13 latency=133 ms - ALLOW
 2025-04-26 19:51:10 script,info Failover Check: Ping seq=14 latency=133 ms - ALLOW
 2025-04-26 19:51:10 script,info Failover Check: 15/15 successful pings (100%)
 2025-04-26 19:51:10 script,info Failover Check: UP debounce in progress - 2/3 successes.
 2025-04-26 19:51:15 script,info Failover Check: Ping seq=0 latency=133 ms - ALLOW
 2025-04-26 19:51:15 script,info Failover Check: Ping seq=1 latency=133 ms - ALLOW
 2025-04-26 19:51:15 script,info Failover Check: Ping seq=2 latency=133 ms - ALLOW
 2025-04-26 19:51:15 script,info Failover Check: Ping seq=3 latency=133 ms - ALLOW
 2025-04-26 19:51:15 script,info Failover Check: Ping seq=4 latency=133 ms - ALLOW
 2025-04-26 19:51:15 script,info Failover Check: Ping seq=5 latency=133 ms - ALLOW
 2025-04-26 19:51:15 script,info Failover Check: Ping seq=6 latency=133 ms - ALLOW
 2025-04-26 19:51:15 script,info Failover Check: Ping seq=7 latency=133 ms - ALLOW
 2025-04-26 19:51:15 script,info Failover Check: Ping seq=8 latency=133 ms - ALLOW
 2025-04-26 19:51:15 script,info Failover Check: Ping seq=9 latency=133 ms - ALLOW
 2025-04-26 19:51:15 script,info Failover Check: Ping seq=10 latency=133 ms - ALLOW
 2025-04-26 19:51:15 script,info Failover Check: Ping seq=11 latency=136 ms - ALLOW
 2025-04-26 19:51:15 script,info Failover Check: Ping seq=12 latency=133 ms - ALLOW
 2025-04-26 19:51:15 script,info Failover Check: Ping seq=13 latency=133 ms - ALLOW
 2025-04-26 19:51:15 script,info Failover Check: Ping seq=14 latency=133 ms - ALLOW
 2025-04-26 19:51:15 script,info Failover Check: 15/15 successful pings (100%)
 2025-04-26 19:51:15 script,info Failover Check: Transitioned UP after 3 consecutive successes. Set distance=1.

Let me know your thoughts.

I'll stick with recursive routing; it works and is much easier. Or use netwatch if one doesn't want to wait 10 seconds, etc…

You can’t fine-tune latency, percentage of success or debounce with the recursive method. This is just another take on failover that is tested and works, as do the recursive and netwatch methods, each with their own issues.

I agree on this; recursive would false-failover multiple times a day for me, especially on poor LTE links when they get saturated.

BTW I just use this to ping multiple servers to prevent false drops. Also, you don't need to force IPs through specific gateways; you can just specify the interface in the ping command if applicable in your setup, like ether1, pppoe-out1, etc…

Example from my script, no advanced math needed :stuck_out_tongue:

# List of ping targets (add as many as you want)
:local servers {"8.8.8.8"; "1.1.1.1"; "9.9.9.9"}
# Count of servers that answered at least one ping
:local ServerPings 0
:foreach s in=$servers do={
    :if ([/ping $s count=3 interval=500ms interface=ether1] > 0) do={
        :set ServerPings ($ServerPings + 1)
    }
}
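
From there the up/down decision can key off the counter. A minimal sketch, assuming the primary default route carries a comment such as primary_route (as in the route export later in this thread) and the distance values suit your setup:

```
# Assumes the primary default route is tagged comment="primary_route"
:if ($ServerPings = 0) do={
    # All targets unreachable via ether1 - demote the primary route
    /ip route set [find comment="primary_route"] distance=3
} else={
    /ip route set [find comment="primary_route"] distance=1
}
```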

I tested the interface ping and could not get it working, I made a post about it on these forums.

I use this on multiple routers and it works fine. What is the exact problem on your side?

EDIT:
Checked your posts. Yeah, it works for me: no matter the route distance, pings always go out the right interfaces, and the failover script has worked like that for me for months since I made it.

I’ll test again tonight, but I think the issue was with netwatch trying to get ICMP packets out any way it could.

So I have just tested again. I failed over the primary WAN interface on ether2 by setting its route distance to 3, allowing the secondary WAN interface on ether4 to become the default gateway.

[xxxx@RB5009] > /ip/route/export where dst-address=0.0.0.0/0
/ip route
add comment=primary_route disabled=no distance=3 dst-address=0.0.0.0/0 gateway=124.19.98.145 routing-table=main scope=30 suppress-hw-offload=no target-scope=10
add comment=secondary_route disabled=no distance=2 dst-address=0.0.0.0/0 gateway=10.31.0.2 routing-table=main scope=30 suppress-hw-offload=no target-scope=10

Ping and trace route works via the secondary gateway on ether4 (10.31.0.2):

[xxxx@RB5009] > ping 8.8.8.8
  SEQ HOST                                     SIZE TTL TIME       STATUS
    0 8.8.8.8                                    56 113 25ms538us
    1 8.8.8.8                                    56 113 35ms619us
    2 8.8.8.8                                    56 113 36ms934us
    3 8.8.8.8                                    56 113 30ms41us
    sent=4 received=4 packet-loss=0% min-rtt=25ms538us avg-rtt=32ms33us max-rtt=36ms934us


[xxxx@RB5009] > /tool/traceroute 8.8.8.8
ADDRESS                          LOSS SENT    LAST     AVG    BEST   WORST STD-DEV STATUS
10.31.0.2                          0%    2   0.4ms     0.5     0.4     0.6     0.1
                                 100%    2 timeout
10.4.78.163                        0%    2    29ms    30.5      29      32     1.5
                                 100%    2 timeout
10.5.86.97                         0%    1  24.2ms    24.2    24.2    24.2       0
10.5.86.98                         0%    1  42.1ms    42.1    42.1    42.1       0
10.5.86.105                        0%    1  34.8ms    34.8    34.8    34.8       0
203.50.61.96                       0%    1  42.1ms    42.1    42.1    42.1       0
203.50.11.177                      0%    1  27.7ms    27.7    27.7    27.7       0
58.163.91.194                      0%    1  33.2ms    33.2    33.2    33.2       0
192.178.97.87                      0%    1  30.6ms    30.6    30.6    30.6       0
142.250.234.211                    0%    1  37.9ms    37.9    37.9    37.9       0
8.8.8.8                            0%    1  30.9ms    30.9    30.9    30.9       0

ether2 is still up alongside ether4:

[xxxx@RB5009] > /interface/ethernet/print where name=ether2
Flags: R - RUNNING
Columns: NAME, MTU, MAC-ADDRESS, ARP, SWITCH
#   NAME     MTU  MAC-ADDRESS        ARP      SWITCH
;;; OPTUS INTERNET
1 R ether2  1500  D4:01:C3:24:27:9D  enabled  switch1

[xxxx@RB5009] > /interface/ethernet/print where name=ether4
Flags: R - RUNNING
Columns: NAME, MTU, MAC-ADDRESS, ARP, SWITCH
#   NAME     MTU  MAC-ADDRESS        ARP      SWITCH
;;; RUT950
3 R ether4  1500  D4:01:C3:24:27:9F  enabled  switch1

I am unable to ping or traceroute to 8.8.8.8 via ether2 even though it is still up:

[xxxx@RB5009] > ping 8.8.8.8 interface=ether2
  SEQ HOST                                     SIZE TTL TIME       STATUS
    0 8.8.8.8                                                      timeout
    1 8.8.8.8                                                      timeout
    2 8.8.8.8                                                      timeout
    3 124.19.98.xxx                              84  64 150ms569us host unreachable
    4 8.8.8.8                                                      timeout
    sent=5 received=0 packet-loss=100%


[xxxx@RB5009] > /tool/traceroute 8.8.8.8 interface=ether2
ADDRESS                          LOSS SENT    LAST     AVG    BEST   WORST STD-DEV STATUS
                                 100%    2 timeout
                                 100%    2 timeout
                                 100%    1 timeout
124.19.98.xxx                     0%    1 101.9ms   101.9   101.9   101.9       0 host unreachable from 124.19.98.xxx
                                   0%    0     0ms

I am able to ping the ether2 (primary) gateway and confirm the single hop traceroute:

[xxxx@RB5009] > ping 124.19.98.xxx
  SEQ HOST                                     SIZE TTL TIME       STATUS
    0 124.19.98.xxx                             56  64 6ms847us
    1 124.19.98.xxx                              56  64 5ms587us
    2 124.19.98.xxx                              56  64 6ms939us
    3 124.19.98.xxx                              56  64 5ms666us
    4 124.19.98.xxx                             56  64 5ms586us
    sent=5 received=5 packet-loss=0% min-rtt=5ms586us avg-rtt=6ms125us max-rtt=6ms939us

[xxxx@RB5009] > /tool/traceroute 124.19.98.xxx
ADDRESS                          LOSS SENT    LAST     AVG    BEST   WORST STD-DEV STATUS
124.19.98.xxx                      0%    4   5.7ms     5.6     5.4     5.7     0.1

What if you try just switching them around as a real failover would happen? Use distances 1 and 2 instead of 3; maybe there is something weird happening around it.

That's exactly what I have shown above. ether2 was set to route distance 3 (from 1), thus making ether4, with route distance 2, the default gateway.

It fails exactly the same way in the other direction, with the primary ether2 back up (i.e. set back to route distance 1). I am unable to ping 8.8.8.8 via the secondary (ether4) interface when the ether2 route distance is set back to 1 and the ether4 route distance is 2:

[xxxx@RB5009] > ping 8.8.8.8 interface=ether4
  SEQ HOST                                     SIZE TTL TIME       STATUS
    0 8.8.8.8                                                      timeout
    1 8.8.8.8                                                      timeout
    2 8.8.8.8                                                      timeout
    3 10.31.0.1                                  84  64 81ms234us  host unreachable
    sent=4 received=0 packet-loss=100%
    
[xxxx@RB5009] > /tool/traceroute 8.8.8.8 interface=ether4
ADDRESS                          LOSS SENT    LAST     AVG    BEST   WORST STD-DEV STATUS
                                 100%    3 timeout
                                 100%    3 timeout
                                 100%    2 timeout
10.31.0.1                          0%    2 113.4ms   121.3   113.4   129.1     7.9 host unreachable from 10.31.0.1
                                   0%    0     0ms

Side by side: RB5009 (left) ether4 (10.31.0.1) ------ RUT950 (right) 10.31.0.2

Screenshot 2025-04-27 at 21.37.35.png

ICMP packets never arrive at the RUT950 interface.

Here is an interesting post from MikroTik staff in 2011 - the ping interface parameter is for IPv6 only, not IPv4:

http://forum.mikrotik.com/t/fail-over-ping-from-interface/45194/2

You can use a workaround: adding a specific routing table for the address you want to ping was simple and easy in v6:
http://forum.mikrotik.com/t/in-2-wan-interfaces-setup-how-to-ping-from-the-interface-with-higher-distance/100172/1
Among the improvements in v7 :wink: the routing-table parameter for ping was removed, but you can use a VRF (with no interfaces) in it, see:
http://forum.mikrotik.com/t/how-to-use-ping-with-multiple-routing-marks-in-ros-version-7/175887/3
BUT in some cases you need to do a normal ping (to the gateway or some other address) just before the one via the specified VRF (see the last post in that thread).

My solution on the GitHub page above uses two static routes: one to 8.8.8.8 via the primary WAN gateway and a second blackhole route. Works fine. This troubleshooting was in response to the comment stating I could ping via an interface.
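
For context, that check-route pair looks roughly like this (gateway address taken from the export earlier in the thread; distances here are illustrative, and the syntax is RouterOS v7). The blackhole stops the check traffic leaking out the secondary WAN when the primary gateway route goes inactive:

```
# Pin the check target to the primary WAN gateway...
/ip route
add dst-address=8.8.8.8/32 gateway=124.19.98.145 distance=1 \
    comment="failover check via primary WAN"
# ...and black-hole it if that route becomes inactive
add dst-address=8.8.8.8/32 blackhole distance=254 \
    comment="failover check blackhole"
```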

I can confirm that after setting up the VRF with interfaces=none

/ip vrf
add interfaces=none name=testwan

and a route via the new routing table testwan:

/ip route
add disabled=no distance=1 dst-address=8.8.8.8/32 gateway=10.31.0.2 routing-table=testwan suppress-hw-offload=no

you do have to ping something via the main routing table before the ping succeeds via the new VRF. Seems like a pretty major bug to me:

[xxxx@RB5009] > ping 8.8.8.8 vrf=testwan
  SEQ HOST                                     SIZE TTL TIME       STATUS
    0 8.8.8.8                                                      timeout
    1 8.8.8.8                                                      timeout
    2 8.8.8.8                                                      timeout
    sent=3 received=0 packet-loss=100%
[xxxx@RB5009] > ping 8.8.8.8
  SEQ HOST                                     SIZE TTL TIME       STATUS
    0 8.8.8.8                                    56  58 123ms540us
    1 8.8.8.8                                    56  58 123ms512us
    2 8.8.8.8                                    56  58 123ms500us
    sent=3 received=3 packet-loss=0% min-rtt=123ms500us avg-rtt=123ms517us max-rtt=123ms540us
[xxxx@RB5009] > ping 8.8.8.8 vrf=testwan
  SEQ HOST                                     SIZE TTL TIME       STATUS
    0 8.8.8.8                                    56  46 222ms540us
    1 8.8.8.8                                    56  46 198ms653us
    2 8.8.8.8                                    56  46 136ms761us
    sent=3 received=3 packet-loss=0% min-rtt=136ms761us avg-rtt=185ms984us max-rtt=222ms540us