Advanced Routing Failover without Scripting

Sorry, I should clarify. Client traffic not working is my primary concern. I just tried pinging from the router for troubleshooting, but it seems like that is mostly working, actually… it just doesn’t seem to be able to switch in the middle of a ping command?

I’ve been testing by disconnecting the uplink of ISP 1. Here are the routing details for those conditions:

 0   S  dst-address=0.0.0.0/0 gateway=8.8.8.8
        gateway-status=8.8.8.8 recursive via 10.1.0.1 ether1 check-gateway=ping
        distance=1 scope=30 target-scope=10 routing-mark=to_ISP1

 1 A S  dst-address=0.0.0.0/0 gateway=8.8.4.4
        gateway-status=8.8.4.4 recursive via 10.2.0.1 ether2 check-gateway=ping
        distance=2 scope=30 target-scope=10 routing-mark=to_ISP1

 2 A S  dst-address=0.0.0.0/0 gateway=8.8.4.4
        gateway-status=8.8.4.4 recursive via 10.2.0.1 ether2 check-gateway=ping
        distance=1 scope=30 target-scope=10 routing-mark=to_ISP2

 3   S  dst-address=0.0.0.0/0 gateway=8.8.8.8
        gateway-status=8.8.8.8 recursive via 10.1.0.1 ether1 check-gateway=ping
        distance=2 scope=30 target-scope=10 routing-mark=to_ISP2

 4 A S  ;;; hack 1
        dst-address=0.0.0.0/0 gateway=10.1.0.1
        gateway-status=10.1.0.1 reachable via  ether1 distance=3 scope=30
        target-scope=10

 5   S  ;;; hack 2
        dst-address=0.0.0.0/0 gateway=10.2.0.1
        gateway-status=10.2.0.1 reachable via  ether2 distance=4 scope=30
        target-scope=10

 6 A S  dst-address=8.8.4.4/32 gateway=10.2.0.1
        gateway-status=10.2.0.1 reachable via  ether2 distance=1 scope=10
        target-scope=10

 7 A S  dst-address=8.8.8.8/32 gateway=10.1.0.1
        gateway-status=10.1.0.1 reachable via  ether1 distance=1 scope=10
        target-scope=10

 8 ADC  dst-address=10.1.0.0/30 pref-src=10.1.0.2 gateway=ether1
        gateway-status=ether1 reachable distance=0 scope=10

 9 ADC  dst-address=10.2.0.0/30 pref-src=10.2.0.2 gateway=ether2
        gateway-status=ether2 reachable distance=0 scope=10

10 ADC  dst-address=172.30.0.0/24 pref-src=172.30.0.1 gateway=bridge
        gateway-status=bridge reachable distance=0 scope=10

The router seems to find its way out, but not the clients. I’m pinging 9.9.9.9. Doing a trace from the client reveals that it still seems to try to go via ISP 1.

Your Firewall Mangle rules only mark router’s traffic (chain=output). For clients, you need to mark in chain=prerouting. You can see an example in the manual: https://wiki.mikrotik.com/wiki/Manual:PCC#Policy_routing
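
As a minimal sketch of that advice (not the exact rules from the linked manual page), a prerouting rule for LAN clients could look like the following. It assumes clients arrive on the `bridge` interface shown in the route list above and that all of them should prefer the `to_ISP1` table; both are assumptions about this particular setup.

```routeros
/ip firewall mangle
# Assumption: LAN clients enter via "bridge" and should use the to_ISP1 routing table.
# dst-address-type=!local leaves traffic addressed to the router itself unmarked.
add chain=prerouting in-interface=bridge dst-address-type=!local \
    action=mark-routing new-routing-mark=to_ISP1 passthrough=no
```

The existing chain=output rules stay in place for the router's own traffic; this rule only adds the equivalent marking for forwarded client traffic.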

thx for doing:

but I still have some trouble understanding one part. It is this:

or when traffic is swapped back to GW1.

In detail, my remaining questions are:

  1. When GW1 comes back up again, new connections will go through GW1 as it has the lower distance, but what happens to the established connections? Are these moved from GW2 to GW1 automatically, or only if GW2 goes down?
  2. When are routes recalculated? Each time an interface comes up?
  3. How often is the check-gateway=ping check carried out?

thx
Stefan


@Chupaka, thx for sharing all this with us :slight_smile: :slight_smile:

  1. All connections in connection tracking (and the others) are broken. I wrote a script to clear all “EX” connections, which is useful for SIP and others.
  2. Yes and no. It is not the only trigger; another is the “ping” check on an external IP.
  3. 10 seconds

I’m really sorry but I don’t get it.
Does it mean that if a connection was once conn-tracked via GW1 and has not been cleared yet, it will be swapped back from GW2 to GW1 because GW1 has the lower distance?


so anytime the routing table is used?


this is clear?
Is this fixed or can it be modified?

  1. To be clearer: ALL IS BROKEN. All connections stored in connection tracking that relate to the inactive gateway are invalid, but the system does not clear them until each connection's own timeout is reached.

  2. No. I simply can’t give a complete list of “when”, but routes are not recalculated on each use.

  3. What does “this is clear?” mean?

  4. I never found a way to change that; it is hardcoded somewhere to 10 seconds.

1a) ok, all cleared at timeout.
1b) Any automatic switch back from GW2 to GW1 for established connections?
2) So some cases that trigger “recalculation” are known, but not all. You can expect it to happen on the related interface when pinging an external IP or when the link goes down/up.
3) sorry, forgot to remove “?”

1b) I’m not sure whether the disrupted connections work again once the faulty gateway comes back…

I would say this is answered in #26 of NAT: Masquerade can leak private IP, why&how? - MikroTik.

I still can’t formulate a reasonable question that would get support to answer in detail, as what happens in the background is still unclear.

For peace of mind,
destroy all connections tied to the gateway that is no longer available
and you will never have regrets.

that would be done how?

Some examples: here, here, and here
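
The linked examples are not reproduced in this thread. As a rough sketch of the idea, assuming client traffic leaving via ISP 1 is masqueraded to the router's address 10.1.0.2 (the pref-src shown in the route list earlier), the tracked connections tied to that uplink could be removed like this:

```routeros
# Assumption: connections that left via ISP 1 were source-NATed to 10.1.0.2,
# so their reply-dst-address starts with that address (optionally followed by ":port").
# The dots in the pattern match any character; tighten it if other addresses could match.
/ip firewall connection remove [find where reply-dst-address~"^10.1.0.2"]
```

Clients then re-establish their sessions, which get tracked against whichever gateway is active at that moment.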

great, thx

Note: a post about the version 7 beta, which will eventually impact this thread

Hi Chupaka, great write-up. Will this work with the same default gateway that uses the interface variable “%”?

Unfortunately, no: routes with interface specified do not participate in recursive route lookup, at least in RouterOS v6

Hi Chupaka,

As @anav points out above, you might want to update your tutorial to increase target-scope to be one more than scope, as that is necessary on RouterOS v7. This is not due to a bug, but due instead to an intentional change in the behavior of the feature.

Thanks, I’ve commented in that topic (http://forum.mikrotik.com/t/recursive-routes/147430/6) and updated the tutorial to use scopes 11 and 12 for resolving routes. Now that looks even more complex :smiley:
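
For illustration, a pair of recursive routes with that adjustment might look like the sketch below. The addresses are taken from the route list earlier in this thread; the exact layout of the updated tutorial is not quoted here, so treat the scope values as an example of the "target-scope one more than scope" rule rather than a copy of it.

```routeros
/ip route
# Host route that resolves the monitored address via ISP 1's next hop.
add dst-address=8.8.8.8/32 gateway=10.1.0.1 scope=11
# Recursive default route; target-scope is one more than the scope above,
# as described for RouterOS v7 in the post above.
add dst-address=0.0.0.0/0 gateway=8.8.8.8 check-gateway=ping distance=1 target-scope=12
```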

Hi,

first: Thanks for all the information. It is a lot of helpful information so far.

However, it is actually a bit much, and therefore I am a bit confused about which parts are really relevant for me and what exactly each part does.

I have a failover WAN already working using recursive routing. However, with my current setup, the gateway IPs I use to check that the WAN (WAN1) is reachable (8.8.8.8 and 8.8.4.4) will not be reachable, once it falls back to the failover WAN (WAN2) which is a bit of an issue.

Do I understand this correctly, and is that why the example in the first post uses routing-marks for the WAN1 and WAN2 gateway checks, in addition to the normal routing table?

If so, I guess I have to mark the traffic accordingly using mangle rules? Or is it somehow automatically marked by the corresponding final gateway - which I doubt? If I use mangle, I guess I need prerouting and output (clients and router)? Also why are the hosts inverted for ISP2, shouldn’t the checks be the same order?
I guess I only need output and even then only if it goes to my hosts “8.8.8.8”/“8.8.4.4” using the corresponding interface to mark it accordingly as I really only need the marks for the gateway checks?

Can someone provide a full working sample, including the necessary mangle rules to get this working, and explain what each section does exactly, so I know how best to modify it for my scenario?

I guess optionally I could check against hosts, that I do not need, e.g. cloudflare DNS servers, if I use Google DNS servers myself but not really that nice I think.

Also, I only want a failover WAN, no load balancing, as the secondary WAN is way slower and will not improve performance but hinder it if WAN1 is up. It is just there in case WAN1 is down, so people can still access the necessary applications that require an internet connection.

And is it just me, or is it relatively complicated to set up this feature? I know it is a ROS limitation, but wouldn’t it be easier to just add hosts to check for a route, and if none are reachable, the route is down? Then all of this could be condensed into two routing entries.

Thanks for any feedback in advance.

For just failover, I don’t see the need to test the second (backup) route with recursive routes, if returning to the first is OK whenever it becomes available again.

Test the first as usual with recursive routes. If the first fails the test, it will be down, and the second is then used (it has a larger distance value than the first).
There is no need to test the second. If it is not working, it was the last fallback anyway. If the first is operational, it will be chosen because of its lower distance.
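
A minimal sketch of that failover-only layout, reusing the gateway addresses from the route list earlier in the thread (they are assumptions for any other setup): only the primary default route is recursive and checked, and the backup is a plain route with a higher distance.

```routeros
/ip route
# The monitored host is pinned to WAN1, so the check only ever tests WAN1.
add dst-address=8.8.8.8/32 gateway=10.1.0.1 scope=10
# Primary default route, resolved recursively through 8.8.8.8 and checked by ping.
# Per the RouterOS v7 note earlier in the thread, target-scope would need to be 11 there.
add dst-address=0.0.0.0/0 gateway=8.8.8.8 check-gateway=ping distance=1 target-scope=10
# Backup default route: no check, used only while the primary is inactive.
add dst-address=0.0.0.0/0 gateway=10.2.0.1 distance=2
```

Because 8.8.8.8 stays pinned to WAN1, it is unreachable from behind the router while WAN1 is down, which is the side effect described in the question above; picking a check host that clients do not otherwise need avoids that.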