Failover not working

Hello team!!!

I recently ran into the following scenario:

  • Mikrotik router with 2 WAN connections
  • When all the connections work, I can connect with Winbox and ping through either WAN without any mangle rules (no routing marks at the moment)
  • Once, the main ISP failed (the one whose 0.0.0.0/0 route has the lower distance), and I could not connect from outside to either public IP; both stopped responding to ping.
  • The problem was with the main ISP, which was intermittent: it went down and up roughly every 5-10 minutes (5-10 minutes down, 5-10 minutes up, 5-10 minutes down …)
  • While it was down, pings from outside to the secondary WAN also stopped working
  • While the main ISP was down, its gateway still answered pings sent from the router, but not pings from outside.

After this issue I realized that the “check gateway” option was not enabled on either of the 0.0.0.0/0 routes. However, since the default gateway kept responding from inside, I don’t think enabling it would have solved the issue.
I know I can create a script that pings other destinations through both interfaces to get a “better” failover, but what I am wondering is why the secondary WAN connection stopped responding to ping when the main WAN connection failed.

Any idea?
Regards,
Damián

You have to check that for yourself using /tool sniffer quick ip-protocol=icmp while both WANs are running fine, but since you don’t use any routing-mark and base the failover only on two default routes with different distance values, the responses to ping requests arriving at WAN2 are sent via WAN1, with the source IP of WAN2. This is possible if WAN1 has a public IP (so there is no src-nat handling of the responses on their way to the requestor) and if the ISP doesn’t validate the source addresses of packets coming from clients (which is quite common). As you’ve realized yourself, WAN1 itself doesn’t go down, so the route through it stays active.


Correct. That’s why I always recommend to monitor the transparency of the complete path from your WAN all the way to the internet, by checking availability of some public address(es) which is (are) not in your ISP’s network.

There is an excellent article on this from Timo Puistaja. The check-gateway mechanism, which is used there to check the availability of the “virtual gateways”, so to say, checks the gateway availability every 10 seconds. If this interval is sufficient for you, this is the best approach, as you can completely avoid any scripting. If you need to react more swiftly, scripting cannot be avoided; dynamic routing protocols are out of the question with two WAN uplinks from different ISPs.
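
A minimal sketch of the recursive check-gateway approach described above, in RouterOS v6 syntax. The ISP gateway addresses 192.0.2.1 and 198.51.100.1 are placeholders, and the choice of 8.8.8.8 and 208.67.220.220 as reference hosts is only an assumption for illustration:

```
/ip route
# Host routes pin each reference address to one WAN. scope=10 lets the
# default routes below resolve their gateways recursively through them.
add dst-address=8.8.8.8/32 gateway=192.0.2.1 scope=10
add dst-address=208.67.220.220/32 gateway=198.51.100.1 scope=10
# The default routes use the reference addresses as "virtual gateways".
# check-gateway=ping deactivates a route while its gateway does not answer.
add dst-address=0.0.0.0/0 gateway=8.8.8.8 distance=1 check-gateway=ping
add dst-address=0.0.0.0/0 gateway=208.67.220.220 distance=2 check-gateway=ping
```

With this layout the primary default route is withdrawn when 8.8.8.8 stops answering through WAN1, even if the ISP's own gateway still responds.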

Hello!!!

Sindy, you are the best!!!
Great tutorial, I will try to apply the second method on all our Mikrotik routers.
Do you know if this will work on routers with routing marks? I think I need to create the same routes twice, once with the right routing mark and once without marks.

“/tool sniffer quick ip-protocol=icmp” shows a lot of entries, but I cannot identify my ping (my MAC address does not appear there). Anyway, I think you are right: pings to the secondary WAN are being answered through the main WAN with the secondary WAN IP.

Regards,
Damián

Correct, but the good news is that it is enough to create only the topmost routes with routing-marks; the routes to the “virtual gateways”, i.e. those used for recursive next-hop search, are sufficient in table main.
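
As a rough illustration of that point, only the default routes would need a marked copy. The routing-mark name to_WAN1 and the reference addresses are assumptions, and the /32 host routes to 8.8.8.8 and 208.67.220.220 are assumed to already exist in table main with scope=10 and no routing-mark:

```
/ip route
# Marked default routes. Their gateways are still resolved recursively
# through the unmarked host routes in table main.
add dst-address=0.0.0.0/0 gateway=8.8.8.8 routing-mark=to_WAN1 distance=1 check-gateway=ping
add dst-address=0.0.0.0/0 gateway=208.67.220.220 routing-mark=to_WAN1 distance=2 check-gateway=ping
```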


I’m not sure how MAC address is related, but yes, on a busy router there may be a lot of icmp traffic. So add ip-address=the.ip.from.which.you.ping to reduce the amount of traffic shown.
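
For example, assuming the machine you ping from has the public address 203.0.113.10 (a placeholder):

```
/tool sniffer quick ip-protocol=icmp ip-address=203.0.113.10
```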

I think it should be mentioned that if your recursive failover uses a well-known DNS server like 8.8.8.8, and that DNS server is also used by your router, it won’t work…
So make sure the router also uses DNS servers other than the ones used in your recursive failover…

Thank you both,

Sindy, you were right, I recently added the ip-address to the sniffer and tested, I got the following:

INTERFACE                           TIME    NUM DI SRC-MAC           DST-MAC           VLAN  
ether11                           76.146      1 <- 10:1B:54:BB:25:2B 6C:3B:6B:08:75:DE 4     
vlan4-vlan-wan1                   76.146      2 <- 10:1B:54:BB:25:2B 6C:3B:6B:08:75:DE
vlan5-vlan-wan2               76.146      3 -> 6C:3B:6B:08:75:DE 00:A7:42:0D:FF:D1
ether11                           76.146      4 -> 6C:3B:6B:08:75:DE 00:A7:42:0D:FF:D1 5     
ether11                            77.16      5 <- 10:1B:54:BB:25:2B 6C:3B:6B:08:75:DE 4     
vlan4-vlan-wan1                    77.16      6 <- 10:1B:54:BB:25:2B 6C:3B:6B:08:75:DE
vlan5-vlan-wan2                77.16      7 -> 6C:3B:6B:08:75:DE 00:A7:42:0D:FF:D1
ether11                            77.16      8 -> 6C:3B:6B:08:75:DE 00:A7:42:0D:FF:D1 5     
ether11                           78.167      9 <- 10:1B:54:BB:25:2B 6C:3B:6B:08:75:DE 4     
vlan4-vlan-wan1                   78.167     10 <- 10:1B:54:BB:25:2B 6C:3B:6B:08:75:DE
vlan5-vlan-wan2               78.167     11 -> 6C:3B:6B:08:75:DE 00:A7:42:0D:FF:D1
ether11                           78.167     12 -> 6C:3B:6B:08:75:DE 00:A7:42:0D:FF:D1 5     
ether11                           79.168     13 <- 10:1B:54:BB:25:2B 6C:3B:6B:08:75:DE 4     
vlan4-vlan-wan1                   79.168     14 <- 10:1B:54:BB:25:2B 6C:3B:6B:08:75:DE
vlan5-vlan-wan2               79.168     15 -> 6C:3B:6B:08:75:DE 00:A7:42:0D:FF:D1
ether11                           79.168     16 -> 6C:3B:6B:08:75:DE 00:A7:42:0D:FF:D1 5     
ether11                           80.183     17 <- 10:1B:54:BB:25:2B 6C:3B:6B:08:75:DE 4     
vlan4-vlan-wan1                   80.183     18 <- 10:1B:54:BB:25:2B 6C:3B:6B:08:75:DE
vlan5-vlan-wan2               80.183     19 -> 6C:3B:6B:08:75:DE 00:A7:42:0D:FF:D1
ether11                           80.183     20 -> 6C:3B:6B:08:75:DE 00:A7:42:0D:FF:D1 5

Now wan2 is the main one. I don’t know why ether11 shows up (it is the physical port that both VLANs are attached to), but we can see that the ping to wan1 (the secondary IP) is being answered via wan2.
I think the reply uses the wan1 IP as its source, because otherwise my machine wouldn’t recognize the response.

Zacharias, good point I will take care.

Regards,
Damián

Here is also a manual/topic about this:

http://forum.mikrotik.com/t/advanced-routing-failover-without-scripting/136599/1

Sorry Zacharias, do you mean the failover won’t work?
I thought that if I have 8.8.8.8 as DNS in my router, all DNS queries would be sent out through the WAN which has the route to 8.8.8.8.

Regards,
Damián

Reverse - the failover will work, but if you use 8.8.8.8 as the “virtual gateway” to check the transparency of the path via one of the WANs and this path breaks, the other recursive route will become active, but it won’t be usable to reach the 8.8.8.8 because the more selective route to 8.8.8.8/32 will shadow it.

But it will only shadow it within the routing table main - if you set up a default route with some routing-mark via the other WAN, 8.8.8.8 will be reachable via that route. So it is possible to use mangle rules to assign that routing-mark to forwarded or locally originated packets towards 8.8.8.8:53 to keep 8.8.8.8 usable as DNS server simultaneously with using it as the “virtual gateway”.
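
A sketch of what that could look like, under stated assumptions: rm-dns is a hypothetical routing-mark name, 198.51.100.1 is a placeholder for the other WAN's gateway, and src-nat/masquerade on that WAN is assumed to be in place already:

```
/ip route
# Default route in a separate routing table via the other WAN.
add dst-address=0.0.0.0/0 gateway=198.51.100.1 routing-mark=rm-dns
/ip firewall mangle
# DNS queries originated by the router itself.
add chain=output dst-address=8.8.8.8 protocol=udp dst-port=53 action=mark-routing new-routing-mark=rm-dns
# DNS queries forwarded from LAN clients.
add chain=prerouting dst-address=8.8.8.8 protocol=udp dst-port=53 action=mark-routing new-routing-mark=rm-dns
```

Note that this sends DNS queries for 8.8.8.8 through the other WAN unconditionally, so it is not by itself a failover for DNS. A matching pair of rules with protocol=tcp would be needed if DNS over TCP matters.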

@sindy yes, that is what I meant…
However, since I had never tested that DNS would indeed stop working, I gave it a try on GNS3…
So I created a recursive failover with 8.8.8.8 for the first line and 8.8.4.4 for the second one…
When the first line was off and 8.8.8.8 was listed as unreachable in my routing table, my router, which had only one DNS server configured (8.8.8.8) and no other dynamic or static ones, could still resolve every address using that DNS server…
The VPC, with only the local router configured as its DNS, could resolve names as well, and I could see its requests under IP DNS Cache on the CHR…

I flushed the DNS cache every time to be sure…
I even removed the DNS server from the router completely to be sure I was not missing something; when I added it back, everything resolved although it shouldn’t have…

/ip route print detail
 0   S  dst-address=0.0.0.0/0 gateway=8.8.8.8 gateway-status=8.8.8.8 unreachable check-gateway=ping distance=1 scope=30 
        target-scope=10

And the nexthop table…

 ip route nexthop print detail 
 1 address=8.8.8.8 gw-state=unreachable forwarding-nexthop="" interface="" scope=10 check-gateway=icmp gw-check-ok=yes

And the DNS Table:

ip dns print 
                      servers: 8.8.8.8
              dynamic-servers: 
        allow-remote-requests: yes

That’s quite surprising, because the same test in my case (except that I use 8.8.4.4 on the dead WAN and 8.8.8.8 on the alive one), and that I’m testing on a hAP ac², shows that until /ip firewall mangle add action=mark-routing chain=output dst-address=8.8.4.4 dst-port=53 new-routing-mark=rm-test protocol=udp, the DNS doesn’t work…

[me@MyTik] > ip route nexthop print
0 address=8.8.4.4 gw-state=recursive forwarding-nexthop=192.168.25.100 interface="" scope=30 check-gateway=icmp gw-check-ok=no
1 address=8.8.8.8 gw-state=recursive forwarding-nexthop=193.194.40.150 interface="" scope=30 check-gateway=icmp gw-check-ok=yes

No @sindy, it does not work as it should, I double-checked… So if you don’t mind, please take a look in case I am missing something obvious…

 0   S  dst-address=0.0.0.0/0 gateway=10.10.10.1 gateway-status=10.10.10.1 unreachable distance=1 scope=30 target-scope=10 
 1 A S  dst-address=0.0.0.0/0 gateway=10.10.11.1 gateway-status=10.10.11.1 recursive via 192.168.75.2 ether2 distance=2 scope=30 
        target-scope=10 
 2 A S  dst-address=8.8.4.4/32 gateway=192.168.75.2 gateway-status=192.168.75.2 reachable via  ether2 distance=1 scope=10 target-scope=10 
 3 A S  dst-address=8.8.8.8/32 gateway=192.168.1.1 gateway-status=192.168.1.1 reachable via  ether1 distance=1 scope=10 target-scope=10 
 4   S  dst-address=10.10.10.1/32 gateway=8.8.8.8 gateway-status=8.8.8.8 recursive via 192.168.1.1 ether1 check-gateway=ping distance=1 
        scope=10 target-scope=10 
 5 A S  dst-address=10.10.11.1/32 gateway=8.8.4.4 gateway-status=8.8.4.4 recursive via 192.168.75.2 ether2 check-gateway=ping distance=1 
        scope=10 target-scope=10

And the nexthop table:

 0 address=8.8.4.4 gw-state=recursive forwarding-nexthop=192.168.75.2 interface="" scope=10 check-gateway=icmp gw-check-ok=yes 
 1 address=8.8.8.8 gw-state=recursive forwarding-nexthop=192.168.1.1 interface="" scope=10 check-gateway=icmp gw-check-ok=no 
 2 address=10.10.10.1 gw-state=unreachable forwarding-nexthop="" interface="" scope=10 check-gateway=none 
 3 address=10.10.11.1 gw-state=recursive forwarding-nexthop=192.168.75.2 interface="" scope=10 check-gateway=none 
 4 address=192.168.1.1 gw-state=reachable forwarding-nexthop="" interface="" scope=10 check-gateway=none 
 5 address=192.168.75.2 gw-state=reachable forwarding-nexthop="" interface="" scope=10 check-gateway=none

Router DNS:

ip dns print 
servers: 8.8.8.8
dynamic-servers:

The Client has as DNS the Local IP address of the Router

NAME        : PC3[1]
IP/MASK     : 192.168.10.254/24
GATEWAY     : 192.168.10.1
DNS         : 192.168.10.1
DHCP SERVER : 192.168.10.1

The recursive route works just fine: as you can see in the routing table, 8.8.8.8 is unreachable and its gw-check-ok in the nexthop table is “no”. So everything looks and works fine, except that the DNS server is always reachable (ICMP, DNS requests, etc.).

I had a simpler setup initially (as I check just a single reference address via each WAN, the two corresponding default routes were set to use 8.8.8.8 and 8.8.4.4 directly as their respective gateways), but I’ve copied yours (0.0.0.0/0 via 10.10.8.8/32 via 8.8.8.8/32) and I still get the same result: unless I use a routing table other than main for the DNS queries to 8.8.4.4 (which is the one made unreachable in my case), they remain unanswered. So try sniffing without specifying an interface to find out which path they take in your setup.

In my case, the WAN which checks 8.8.4.4 is just a temporary one via my mobile’s “hotspot” function, and the wireless client of that hotspot is another 'Tik whose IP address is the gateway of the route to 8.8.4.4, so the Ethernet is physically up on the 'Tik with dual-WAN setup but the path is broken at the wireless side. I’ve created the whole recursive failover setup at home only for testing purposes. So depending on how you simulate the unreachability of the 8.8.8.8 via its associated WAN, there may be some difference in behaviour?

I did a packet sniff, the request goes out from the other line to reach 8.8.8.8, but i guess that was obvious…
The unreachability is simulated through firewall or broken link on GNS3…

I also did test the simpler setup and the result remains the same… The DNS requests are always served…

Sorry, I had to take care of other issues.
OK, I will test first with only one router (which has marks). Each ISP provides us with 2 DNS servers, so maybe I can use those for the failover check and keep 8.8.8.8 as the DNS server.
I will let you know the results.

Regards,
Damián

It’s not a good idea to ping the DNS servers provided by the ISP as a check that the ISP’s own connections to internet work. They may respond to pings although they’ve lost access to the internet themselves.

Yeah, thanks, I realized this after my post.
In those cases the failover won’t work.

Regards

Hello Sindy,

Some time ago I needed to set up marks on a router with dynamic IPs, and you suggested adding the following script to the dhcp-client:

{
  :local routeId [/ip route find distance=1 dst-address~"0.0.0.0/0" routing-mark=to_WAN1]
  if (($bound=1) and ([ip route get $routeId gateway]!=$"gateway-address")) do={
    /ip route set $routeId gateway=$"gateway-address"
  }
}

Sorry for the basic question. In this case, I suppose I will need to update the routes to 8.8.8.8 and 208.67.220.220 (without marks), because the only route with a mark is the one that has the virtual IP as gateway.
Should the code be something like this?

{
  :local routeId [/ip route find distance=1 dst-address~"208.67.220.220"]
  if (($bound=1) and ([ip route get $routeId gateway]!=$"gateway-address")) do={
    /ip route set $routeId gateway=$"gateway-address"
  }
  :local routeId [/ip route find distance=1 dst-address~"8.8.8.8/32"]
  if (($bound=1) and ([ip route get $routeId gateway]!=$"gateway-address")) do={
    /ip route set $routeId gateway=$"gateway-address"
  }
}

Am I right?

Regards,
Damián

Yes, just replace the :local routeId by :set routeId when handling the second route to make it formally correct. :local creates the variable, and that can be done only once in a given scope (code block), :set assigns a new value to an already existing variable, so it cannot be done before you create the variable. RouterOS currently accepts the second :local without complaints, but it may not in future.
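
Applying that advice to the script from the question gives roughly the following. This is only a sketch: it assumes, as in the question, that both host routes should follow the gateway handed out by this DHCP client, and it relies on the $bound and $"gateway-address" variables that the dhcp-client script environment provides:

```
{
  # :local creates routeId once for this block
  :local routeId [/ip route find distance=1 dst-address~"208.67.220.220"]
  :if (($bound=1) and ([/ip route get $routeId gateway]!=$"gateway-address")) do={
    /ip route set $routeId gateway=$"gateway-address"
  }
  # :set assigns a new value to the already existing variable
  :set routeId [/ip route find distance=1 dst-address~"8.8.8.8/32"]
  :if (($bound=1) and ([/ip route get $routeId gateway]!=$"gateway-address")) do={
    /ip route set $routeId gateway=$"gateway-address"
  }
}
```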

Perfect as always!!
Thanks Sindy!!!