I am setting up dual WAN failover using netwatch and scripts to manipulate two 0.0.0.0/0 route distances and need to check if my main ISP is back up before bringing up the main ISP default route.
During testing (with Main ISP still up) I can’t ping via ether4 to 8.8.8.8.
I used /ping interface= in my test environment and it worked, but not on the real gateway: it worked for one interface and not for the other. Both interfaces were Ethernet with fixed IPs.
Right now I'm using:
/ip vrf add name=Test1 interfaces=none
/ip route add gateway=<next hop on ether1> routing-table=Test1
With this setup /ping 8.8.8.8 vrf=Test1 works. If it does not work for an interface, I just increase the distance in the routing rule. A mistake here is not a game changer in my case.
I couldn't find more info about how a VRF with interfaces=none works, and I also can't say whether this is the right way.
(RouterOS v7.16.2)
Likely the route via 124.1.2.2 will be AS (Active Static) whilst the one via 10.31.0.2 will be only S (Static) (due to the bigger distance).
A route that is not active is like it doesn’t exist.
You will also have a DAc (Dynamic Active connect) route for 10.31.0.0 on ether4 and one (still DAc) on ether2 for 124.1.2.0; these are automatically created from the IP addresses you assigned to the interfaces.
So what happens should be:
you ask for 8.8.8.8 on ether4
there is a route on ether4, but it is for 10.31.0.0, and clearly 8.8.8.8 is not part of that range
the 0.0.0.0/0 route via ether4 exists but is not Active (bigger distance), so effectively there is no route for 8.8.8.8 using ether4
When you use the other interface:
you ask for 8.8.8.8 on ether2
there is a route on ether2, it is for 124.1.2.0, and clearly 8.8.8.8 is not part of that range
there is an Active route for 0.0.0.0/0 (which contains 8.8.8.8) via ether2
this route is taken and 8.8.8.8 is reached via the gateway 124.1.2.2
/ip/route/print
Flags: D - DYNAMIC; A - ACTIVE; c - CONNECT, s - STATIC
Columns: DST-ADDRESS, GATEWAY, DISTANCE
#      DST-ADDRESS   GATEWAY    DISTANCE
       ;;; primary_route
0  As  0.0.0.0/0     124.1.2.2         1
       ;;; secondary_route
1   s  0.0.0.0/0     10.31.0.2         2
   DAc 10.31.0.0/29  ether4            0
My failover script changes the route distance on ether2 to '3', enabling the default route via ether4. I was hoping I could use ping 8.8.8.8 interface=ether2 to periodically check if the primary WAN was back up, but my testing (I am currently testing the other way around, that is, trying to ping out via ether4) shows that this will not work.
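For context, a minimal sketch of the distance-swap mechanism I'm describing (the comment-based find, the probe host and the 30s interval are placeholders, not my exact script):

/tool netwatch add host=8.8.8.8 interval=30s \
    down-script="/ip route set [find comment=primary_route] distance=3" \
    up-script="/ip route set [find comment=primary_route] distance=1"

The catch is that a probe like this follows whichever default route is currently active, which is why I need a way to test the primary WAN specifically while it is down.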
But nothing prevents you from adding a "narrow" /32 route via ether4 for the chosen address (since 8.8.8.8 is more widely used, IMHO 8.8.4.4 or another DNS server is better for this purpose).
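For example, a sketch of such a narrow route, assuming 10.31.0.2 is the next hop on ether4 (as in the print above) and 8.8.4.4 as the chosen probe address:

/ip route add dst-address=8.8.4.4/32 gateway=10.31.0.2 comment="probe target pinned to ether4"

Pings and netwatch probes to 8.8.4.4 will then always leave via ether4, regardless of which default route is currently active.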
This seems buggy… I went into WinBox and tried to use the GUI ping tool to 8.8.8.8 via the test WAN VRF and it failed. I ticked ARP ping, it failed, deselected it, and it then pinged via the test WAN VRF.
Netwatch will leak out via any WAN to find a connection, so you need to blackhole the netwatch probe's route with a second route in the same table for the same destination, with the distance increased by one.
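Something like this sketch, reusing the 8.8.4.4 probe address and the ether4 next hop from above (addresses and comments are placeholders):

/ip route add dst-address=8.8.4.4/32 gateway=10.31.0.2 distance=1 comment="netwatch probe via ether4"
/ip route add dst-address=8.8.4.4/32 blackhole distance=2 comment="drop probe traffic instead of leaking via the other WAN"

If the route via ether4 becomes unusable, the blackhole route takes over and the probe simply fails instead of sneaking out via the other WAN.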
I was trying to do this without using the 8.8.8.8/32 narrow route, hence trying to ping via the down (primary) interface. At the moment the second VRF / route table method is working.
Edit: I spoke too soon. This morning the VRF method no longer allows a ping to 8.8.8.8 but again traceroute works:
Decided to use the suggested /32 route but using 4.2.2.2 so no DNS is interrupted during failover. Using ICMP probe type in netwatch. This seems to be working.
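Roughly what I ended up with (a sketch; the interval and comments are arbitrary, and the up-script/down-script are the same distance-swap scripts as in the earlier sketch):

/ip route add dst-address=4.2.2.2/32 gateway=124.1.2.2 comment="reach 4.2.2.2 only via the primary WAN"
/tool netwatch add host=4.2.2.2 type=icmp interval=30s comment="primary WAN probe"

Because nothing on my LAN uses 4.2.2.2 for DNS, pinning it to the primary WAN does not interrupt name resolution during failover.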
The only difference I see in the route through 10.31.0.2 is that in the VRF there is no suppress-hw-offload=no. I would like to know whether it makes any difference.
Aside, but what's the current best practice around WAN failover to LTE? When I last did this, we were still on RouterOS v6. Is the method/support different in RouterOS 7? From memory, it was mainly ping tests plus scripting.
I don't think there is unanimous consensus on the matter; basically it is recursive routes vs. netwatch. Each has slightly different features, but if properly implemented they both work just fine in most setups.
Some people believe that one (or the other) can be improved or fine-tuned with sophisticated scripts, and again, as long as they work, they are just fine.
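For completeness, a minimal sketch of the recursive-route variant, using the gateways from the earlier route print (the canary address 8.8.8.8 and the distances are assumptions; it needs no scripting at all):

/ip route add dst-address=8.8.8.8/32 gateway=124.1.2.2 scope=10 comment="host route used to resolve the recursive gateway"
/ip route add dst-address=0.0.0.0/0 gateway=8.8.8.8 distance=1 check-gateway=ping comment="primary, active only while 8.8.8.8 answers via ether2"
/ip route add dst-address=0.0.0.0/0 gateway=10.31.0.2 distance=2 comment="secondary via ether4"

check-gateway=ping keeps pinging 8.8.8.8 through the host route; when it stops answering, the distance-1 default route goes inactive and the distance-2 route takes over.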
Just as a side note, there is also a newish approach (not yet tested/reported on, AFAIK) hinted at in MikroTik's help page for Netwatch ICMP testing: instead of relying on a reply from the canary address itself, it leverages a low TTL expiring at an intermediate hop: https://help.mikrotik.com/docs/spaces/ROS/pages/8323208/Netwatch
accept-icmp-time-exceeded=yes can be used together with a manually set low ttl value to monitor Internet connectivity, without relying on a specific endpoint.
For example, you can monitor a public IP address, but that address can filter your ICMP request, or just become unreachable itself, if the Netwatch probe is using this address to monitor Internet connectivity this would cause a false alarm.
To make sure you can reach the Internet, it’s generally enough to make sure you can reach a device a few routing hops away. Low time to live value will expire in transit to the specified host you want to monitor - each router passing the ICMP packet will subtract “1” from TTL value, upon TTL reaching 0, ICMP “time exceeded” packet will be generated, and sent back to the Netwatch probe. If all other fail thresholds are not broken, this response will be considered a success.
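A minimal sketch of what such a probe might look like (the accept-icmp-time-exceeded and ttl option names come from the help text quoted above; the target address, the hop count of 3 and the interval are assumptions to verify):

/tool netwatch add host=8.8.8.8 type=icmp ttl=3 accept-icmp-time-exceeded=yes interval=30s

With a TTL of 3, the probe packets should expire a few hops into the ISP network, and the returning ICMP time-exceeded messages count as success, so the test no longer depends on 8.8.8.8 itself answering.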
I did read this note the other day and didn't get time to test it further. I have looked at it tonight and it is actually a clever method of determining whether a link is up or down.
I would like to use netwatch, but there is simply not enough documentation around how it decides a connection is back up. In my testing with both simple and ICMP probes, it seems that after one successful ICMP response the link was deemed back up; I need better odds on a link-up transition.
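If I read the ICMP probe's threshold options correctly, something like this should require most of the pings in a test cycle to succeed before the host counts as up (packet-count and thr-loss-count are documented ICMP probe properties, but the exact up/down semantics and the values here are my assumptions):

/tool netwatch add host=4.2.2.2 type=icmp interval=30s packet-count=10 thr-loss-count=3 comment="assumed: up only if no more than 3 of 10 pings are lost"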
I noted that the Netwatch doco mentions the ICMP probe accepts a vrf option, so I thought: great, I can set up an interfaces=none VRF and a static route to 1.1.1.1 via this VRF, so that during a failover (which could last for hours) network devices still have access to 1.1.1.1 DNS. Unfortunately netwatch seems to suffer from the same issue I documented here: http://forum.mikrotik.com/t/dual-wan-failover-script-feedback-pls/183423/1 in that, when pinging via a VRF, the device requires one successful ping via the main VRF before it will send ICMP packets via the specified VRF: