Simple Failover - ISP1 PPPoE & ISP2 LTE

Good day

I know this question has been asked more times than I’ve hit the snooze button on my alarm. I’ve searched a few posts and tried recursive routing but I’m still missing something.

I have a PPPoE connection (Primary ISP) with a Dynamic IP on Ether1. “Add default route” is selected with Distance 1.
LTE Router (192.168.8.1) on Ether 2 (192.168.8.2)

The dynamic route 0.0.0.0/0 sends traffic through the gateway “pppoe-out1”.
I’ve added a static route as follows:

add dst-address=0.0.0.0 gateway=192.168.8.1 distance=2

There are basically 4 ICMP stages:
Reply from 1.1.1.1: bytes=32 time=4ms TTL=58 1
Reply from 1.1.1.1: bytes=32 time=3ms TTL=58 1
Reply from 1.1.1.1: bytes=32 time=3ms TTL=58 1
Reply from 1.1.1.1: bytes=32 time=28ms TTL=57 2
Reply from 1.1.1.1: bytes=32 time=91ms TTL=57 2
Reply from 1.1.1.1: bytes=32 time=30ms TTL=57 2
Request timed out. 3
Request timed out. 3
Request timed out. 3
Request timed out. 3
Request timed out. 3
Reply from 1.1.1.1: bytes=32 time=3ms TTL=58 4
Reply from 1.1.1.1: bytes=32 time=3ms TTL=58 4
Reply from 1.1.1.1: bytes=32 time=3ms TTL=58 4

Stage 1 is when the primary link is working. I then drop the traffic on the primary link, so in stage 2 it goes through the LTE.

When the link is restored (stage 3), the packets are dropped by the firewall because they are not valid. This is most likely because there are active connections on the LTE, but the primary ISP route has distance 1 and the MikroTik prioritizes that route.

When I clear all the firewall connections, I get a response on the primary ISP (stage 4).

Unfortunately I cannot set a static route on the Primary ISP. I’ve tried to add the following IP route:

/ip route
add dst-address=1.1.1.1 gateway=pppoe-out1 distance=1

I then added a Netwatch Monitor:

/tool netwatch
add host=1.1.1.1 interval=00:00:30 timeout=0.5 up-script="ip firewall connection {:foreach r in=[find] do={remove $r}}"

When the primary link goes down I can still reach 1.1.1.1 on the Mikrotik so the Netwatch rule cannot detect that the ISP1 is down.

Any help would be appreciated.
Thanks

These:
add dst-address=0.0.0.0 gateway=192.168.8.1 distance=2
add dst-address=1.1.1.1 gateway=pppoe-out1 distance=1

Should actually be, I believe:
add dst-address=0.0.0.0/0 gateway=192.168.8.1 distance=2
add dst-address=1.1.1.1/32 gateway=pppoe-out1 distance=1
(but maybe the network is implied)

But that shouldn’t be the issue. Please post the output of:
/ip route print
two times, once when the pppoe is up and once when it is down and traffic goes through LTE.

PPPoE Up

 #      DST-ADDRESS        PREF-SRC        GATEWAY      DISTANCE
 0 ADS  0.0.0.0/0                          pppoe-out1          1
 1   S  0.0.0.0/0                          192.168.8.1         2
 2 A S  1.1.1.1/32                         pppoe-out1          1
 3 ADC  10.10.10.1/32      10.10.10.213    sstp-out1           0
 4 ADC  10.102.11.62/32    102.66.23.235   pppoe-out1          0
 5 ADC  192.168.8.0/24     192.168.8.2     ether4              0
 6 ADC  192.168.10.0/24    192.168.10.254  bridge1             0
 7  DC  192.168.11.0/24    192.168.11.1    ether3            255

PPPoE Down

 #      DST-ADDRESS        PREF-SRC        GATEWAY      DISTANCE
 0 A S  0.0.0.0/0                          192.168.8.1         2
 1   S  1.1.1.1/32                         pppoe-out1          1
 2 ADC  10.10.10.1/32      10.10.10.213    sstp-out1           0
 3 ADC  192.168.8.0/24     192.168.8.2     ether4              0
 4 ADC  192.168.10.0/24    192.168.10.254  bridge1             0
 5  DC  192.168.11.0/24    192.168.11.1    ether3            255

When the primary link goes down I can still reach 1.1.1.1 on the Mikrotik so the Netwatch rule cannot detect that the ISP1 is down.

You need 2 separate test hosts (e.g. 1.1.1.1 and 1.1.2.2), one per uplink, each checked over its own route. The default route through an uplink then resolves as a recursive route only if the check on that uplink passes.

Have a look at this: http://forum.mikrotik.com/t/advanced-routing-failover-without-scripting/136599/1

It is applicable for this case, but describes the more complex case of load balancing. The failover requested is just a simplified form of load balancing.

The key ingredient is the (brain-hurting) concept of “recursive routes”. In this case, recursive routes let you set one route via the PPPoE (GW1) and one via the LTE provider (GW2)
(the routes that resolve the recursive route request), and check which one really works, based on whether some host on the Internet is reachable via that route.
Only if a generally available IP address (specifically chosen for just one route) is reachable via GW1, or another IP via GW2, will that resolving route be used.

Whether you get load balancing or failover just depends on what you do with the availability of the 2 routes. If one has priority (e.g. a smaller distance), then only that one will be used, and the other serves only as failover.
If you split traffic between them with routing marks, then load balancing is possible.
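
As a rough sketch of the two-check-host idea described above (not taken from the linked tutorial): each check host is pinned to exactly one uplink, and each default route uses its check host as a recursive gateway. The choice of 8.8.8.8 as the second host and the scope/target-scope values are illustrative. The sketch also assumes that “Add default route” is switched off on the PPPoE client, since otherwise its dynamic distance-1 route would still win. Adjust all of this to your RouterOS version.

```
/ip route
# Pin each check host to exactly one uplink (assumed hosts: 1.1.1.1 and 8.8.8.8)
add dst-address=1.1.1.1/32 gateway=pppoe-out1 scope=10 comment="check host via PPPoE"
add dst-address=8.8.8.8/32 gateway=192.168.8.1 scope=10 comment="check host via LTE"
# Default routes resolved recursively through the check hosts;
# check-gateway=ping deactivates a route when its check host stops answering
add dst-address=0.0.0.0/0 gateway=1.1.1.1 distance=1 check-gateway=ping target-scope=11 comment="primary"
add dst-address=0.0.0.0/0 gateway=8.8.8.8 distance=2 check-gateway=ping target-scope=11 comment="failover"
```

With the smaller distance on the first default route, the LTE route only becomes active while 1.1.1.1 is unreachable through pppoe-out1, which is plain failover.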

When you have PPPoE up:
0 ADS 0.0.0.0/0 pppoe-out1 1 ← this is Dynamic (flag D, created by the client’s “Add default route” option) and Active, so ALL outbound traffic goes through it
1 S 0.0.0.0/0 192.168.8.1 2 ← this is Static and NOT Active (because of its higher distance=2), so it is irrelevant
2 A S 1.1.1.1/32 pppoe-out1 1 ← this is Static and Active and, being (much) narrower, catches the outbound traffic destined to 1.1.1.1, BUT since it has the same gateway as the main route, it is pretty much irrelevant

You are - I believe - simulating the lack of internet by removing the connection between the Mikrotik router and the ISP modem/router.
The point of a recursive route is to check not only that the local modem/router is up, but also that it has internet (this can be simulated by removing the connection from the ISP modem router to the ISP, i.e. detaching the cable coming from the wall).

When you have PPPoE down:
<here imagine the missing route to 0.0.0.0/0 through pppoe-out1> ← since pppoe-out1 is no longer connected, the route is not created at all
0 A S 0.0.0.0/0 192.168.8.1 2 ← this is Static and has become Active, so ALL outbound traffic goes through it (the LTE), INCLUDING the traffic destined to 1.1.1.1
1 S 1.1.1.1/32 pppoe-out1 1 ← this is Static and NOT Active, since the gateway cannot be reached, so it is irrelevant

In this case the 1.1.1.1 netwatch pings go through the wider route going through the LTE gateway.
You should be able to check the path the connection takes by using /tool traceroute (instead of ping) in the two cases.
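
To compare the two cases, something like the following could be run from the router's terminal, once with PPPoE up and once with it down. Forcing the ping out of pppoe-out1 is an additional, assumed check that is not mentioned above.

```
# Show the hops taken towards the test host
/tool traceroute 1.1.1.1
# Ping the test host while forcing the packets out of the PPPoE interface
/ping 1.1.1.1 interface=pppoe-out1 count=3
```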

The tutorial bpwl linked to is way too complex. Before that one, read this simpler one (it will still take some effort to grasp the concepts), point I.:
https://web.archive.org/web/20230524061649/https://forum.mikrotik.com/viewtopic.php?t=182373

Once you have got the idea, go through this one:
http://forum.mikrotik.com/t/most-effective-failover/160877/1

Right now you are not using recursive routing at all.

You are in a peculiar case, as your “main” route is seemingly not a static one you added but a dynamic one created by the client’s “Add default route” option, with distance 1.
You should disable that client setting (the PPPoE client has the same add-default-route option as the DHCP client documented here) and manually add a static route:
https://help.mikrotik.com/docs/display/ROS/DHCP#DHCP-DHCPOptions
add-default-route=no
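
Since the primary link in this thread is a PPPoE client, the equivalent change would presumably be made on that interface. A minimal sketch, assuming the interface is named pppoe-out1 as in the route listings:

```
/interface pppoe-client
# Stop the client from installing its own dynamic 0.0.0.0/0 route
set [find name=pppoe-out1] add-default-route=no
```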

In a static setup, your recursive routing would then be something like:
/ip route
add comment=PrimaryISP check-gateway=ping distance=1 dst-address=0.0.0.0/0 gateway=1.1.1.1 scope=10 target-scope=12
add comment=nexthopcheck distance=1 dst-address=1.1.1.1/32 gateway=pppoe-out1 scope=11 target-scope=11
add comment=SecondaryISP distance=2 dst-address=0.0.0.0/0 gateway=192.168.8.1 scope=10 target-scope=30
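
To see whether the default route really resolves through 1.1.1.1, the routes can be printed in detail. This is only a suggested check, and the exact fields shown differ between RouterOS versions.

```
/ip route
# List the default routes with their flags and gateway details
print detail where dst-address=0.0.0.0/0
```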

An alternative approach that would do in your case (easier to understand and implement) is this one, which uses Netwatch to check a public host (8.8.8.8 in the example; you should replace it with 1.1.1.1):
http://forum.mikrotik.com/t/simpler-failover-for-two-gateways-i-found-working/169108/1
If you choose this one, you can leave the dynamic, client-originated route in place, but set the client’s default route distance option to give it a greater distance.
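
For a PPPoE client, raising the distance of the dynamic route might look like the sketch below. The interface name pppoe-out1 comes from the listings above. The value 5 is an arbitrary assumption, chosen only to be larger than the distance of any static route that should take precedence.

```
/interface pppoe-client
# Keep the dynamic default route, but at a higher distance
set [find name=pppoe-out1] add-default-route=yes default-route-distance=5
```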

Thank you very much. This post helped me to get the config right:
http://forum.mikrotik.com/t/most-effective-failover/160877/1

I left the “Add default route” tick on the PPPoE. My config is now as follows:

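/ip route
# Reach the test host only through the PPPoE link
add distance=1 dst-address=1.1.1.1/32 gateway=pppoe-out1
# When pppoe-out1 is down the route above goes inactive and this one takes over,
# so pings to 1.1.1.1 are discarded instead of leaking out through the LTE
add distance=20 dst-address=1.1.1.1/32 type=blackhole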

My netwatch rule looks like this:

/tool netwatch
add down-script="/ip firewall connection remove [find]\r\
\n" host=1.1.1.1 interval=5s timeout=500ms up-script="/ip firewall connection remove [find] "

Thanks again for all the response.

If your current config works for you, good :) .

You should anyway check this post by rextended on that same thread:
http://forum.mikrotik.com/t/most-effective-failover/160877/1

Point of note: remove only connections that have a longish remaining timeout, to avoid attempting to remove connections that have already timed out and thus no longer exist.
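
A sketch of that idea, assuming a threshold of 60 seconds (an arbitrary value, not taken from the linked post) and that the connection table's timeout property can be compared this way on your RouterOS version:

```
/ip firewall connection
# Remove only entries with more than 60 seconds left, one at a time;
# an entry that has expired in the meantime is no longer matched by the inner find
:foreach c in=[find where timeout>60] do={
    remove [find where .id=$c]
}
```

The same commands could replace the plain "remove [find]" in the Netwatch up-script and down-script.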

The simple setup described here:

https://www.prinmath.com/ham/mikrotik-failover.htm

has worked for me for years. My setup is the same as yours (primary is PPPoE, secondary is LTE). Short & sweet.