Netwatch and VRF

Hi!
I’m trying to use Netwatch to check whether the internet connection is available on my cabled and LTE WANs. The goal is to build a failover system.

What I’m trying to do is route the test hosts used by Netwatch over the right WAN, so that I can be sure each test runs on a specific connection. Reading around, I found that I can use VRF to achieve this… but I’m clearly doing something wrong, and I don’t know what.

At the moment the system does check the WANs, but the end result is that the router and the LAN are always offline anyway.

Here is the configuration I’m using.

The VRF definitions

/ip vrf
add interfaces=WAN_01 name=WAN1_Test
add interfaces=LTE_01 name=WAN2_Test

The hosts that I use to check the WANs

/ip firewall address-list
add address=9.9.9.9 list=WAN2_Check_Host
add address=1.1.1.1 list=WAN1_Check_Host
add address=208.67.220.220 list=WAN1_Check_Host
add address=216.58.205.36 list=WAN2_Check_Host

/ip firewall mangle
add action=mark-routing chain=output dst-address-list=WAN1_Check_Host new-routing-mark=WAN1_Test passthrough=no
add action=mark-routing chain=prerouting new-routing-mark=WAN1_Test passthrough=no src-address-list=WAN1_Check_Host
add action=mark-routing chain=output dst-address-list=WAN2_Check_Host new-routing-mark=WAN2_Test passthrough=no
add action=mark-routing chain=prerouting new-routing-mark=WAN2_Test passthrough=no src-address-list=WAN2_Check_Host

The routing table

/ip route
add comment="Backup connection [WAN2__LTE]" disabled=yes distance=1 \
    dst-address=0.0.0.0/0 gateway=LTE_01 routing-table=main scope=30 \
    suppress-hw-offload=no target-scope=10
add comment="Main connection [WAN1__CableDSL]" disabled=no distance=1 \
    dst-address=0.0.0.0/0 gateway=10.1.0.1%WAN_01 routing-table=main scope=30 \
    suppress-hw-offload=no target-scope=10
add comment="Check route for main connection" disabled=no distance=1 \
    dst-address=0.0.0.0/0 gateway=10.1.0.1%WAN_01 routing-table=WAN1_Test \
    scope=30 suppress-hw-offload=no target-scope=10 vrf-interface=WAN_01
add comment="Check route for backup connection" disabled=no distance=10 \
    dst-address=0.0.0.0/0 gateway=LTE_01 routing-table=WAN2_Test scope=30 \
    suppress-hw-offload=no target-scope=10 vrf-interface=LTE_01

The Netwatch configuration

/tool netwatch
add disabled=no down-script="" host=1.1.1.1@WAN1_Test http-codes="" \
    interval=15s name=WAN1 port=53 startup-delay=30s test-script="" \
    timeout=1s type=tcp-conn up-script=""
add disabled=no dns-server=9.9.9.9 down-script="" host=9.9.9.9@WAN2_Test \
    http-codes="" interval=15s name=WAN2 port=53 startup-delay=30s \
    test-script="" thr-tcp-conn-time=999us timeout=1s type=simple up-script=""
add disabled=no dns-server=9.9.9.9 down-script="" \
    host=216.58.205.36@WAN2_Test http-codes="" interval=15s name=WAN2B \
    port=443 startup-delay=0s test-script="" timeout=1s type=tcp-conn \
    up-script=""
add disabled=no down-script="" host=208.67.220.220@WAN1_Test http-codes="" \
    interval=15s name=WAN1B port=53 startup-delay=30s test-script="" \
    timeout=1s type=tcp-conn up-script=""

I experimented a little and I understood (maybe wrongly) that once an interface is assigned to a VRF it cannot be used for another one, but I don’t know how to solve this, because I think I need the interface both to check the hosts AND to let the router and the LAN go online… I’m actually a bit confused. :sad_but_relieved_face:

As usual, any suggestion is really appreciated!

Denis

When you want to implement failover, VRF is likely not what you want.

Check the possibilities of user-defined route tables and route marking.

You can try this, simpler and simplest:
Simpler Failover for two Gateways I found working
Simpler Failover for two Gateways I found working - #19 by jaclaz

Crazy how this is still the way to go (with modifications, of course - but love to see this thread being dug out from time to time :slight_smile: )

Why crazy? :confused:

It is simple and it works (for some suitable cases); until a better or simpler approach comes out, it remains a very valid reference. (And BTW, there are not that many alternatives that are complete, working and easily reproducible; most similarly aimed threads are either missing bits and pieces here and there, or are tailored to specific configs and hard to follow and/or adapt to different setups.) :smiley:

That's true. Furthermore, failover seems to be a topic many people want to get done with ROS. After years of using the linked approach, I can say that it still does the job nicely every time.

Thanks very much!
I read the linked posts and I understood that VRFs really are the wrong way.

I’ve modified the configuration.

The routing tables:

/routing table
add disabled=no fib name=WAN2_Test
add disabled=no fib name=WAN1_Test

The firewall rules:

/ip firewall mangle
add action=mark-routing chain=output dst-address=1.1.1.1 \
    new-routing-mark=WAN1_Test passthrough=no
add action=mark-routing chain=output dst-address=208.67.220.220 \
    new-routing-mark=WAN1_Test passthrough=no
add action=mark-routing chain=output dst-address=9.9.9.9 \
    new-routing-mark=WAN2_Test passthrough=no
add action=mark-routing chain=output dst-address=216.58.205.36 \
    new-routing-mark=WAN2_Test passthrough=no

The routes:

/ip route
add comment="Backup connection [WAN2__4GLTEWindTre]" disabled=yes distance=1 \
    dst-address=0.0.0.0/0 gateway=LTE_01 routing-table=main scope=30 \
    suppress-hw-offload=no target-scope=10
add comment="Main connection [WAN1__Elsynet]" disabled=yes distance=1 \
    dst-address=0.0.0.0/0 gateway=10.1.0.1%WAN_01 routing-table=main scope=30 \
    suppress-hw-offload=no target-scope=10
add comment="Main connection check route" disabled=no distance=1 \
    dst-address=0.0.0.0/0 gateway=10.1.0.1%WAN_01 routing-table=WAN1_Test \
    scope=30 suppress-hw-offload=no target-scope=10
add comment="Backup connection check route" disabled=no distance=1 \
    dst-address=0.0.0.0/0 gateway=LTE_01 routing-table=WAN2_Test scope=30 \
    suppress-hw-offload=no target-scope=10

…but it doesn’t work. Actually, Netwatch is always down, and pinging the check hosts with the ping tool also fails. What am I doing wrong now?

…I forgot the Netwatch configuration, but it’s basically the same as before:

/tool netwatch
add disabled=no down-script="" host=1.1.1.1 http-codes="" interval=15s \
    name=WAN1 port=53 startup-delay=30s test-script="" timeout=1s \
    type=tcp-conn up-script=""
add disabled=no dns-server=9.9.9.9 down-script="" host=9.9.9.9 http-codes="" \
    interval=15s name=WAN2 port=53 startup-delay=30s test-script="" \
    thr-tcp-conn-time=999us timeout=1s type=simple up-script=""
add disabled=no dns-server=9.9.9.9 down-script="" host=216.58.205.36 \
    http-codes="" interval=15s name=WAN2B port=443 startup-delay=0s \
    test-script="" timeout=1s type=tcp-conn up-script=""
add disabled=no down-script="" host=208.67.220.220 http-codes="" interval=15s \
    name=WAN1B port=53 startup-delay=30s test-script="" timeout=1s \
    type=tcp-conn up-script=""

Try the simplest one (no need for added routing tables, everything stays in "main", no need for mangle rules).
Simpler Failover for two Gateways I found working - #19 by jaclaz

At first glance, what you posted does not follow Filo's original (simpler but not simplest) approach anyway: in that one you need only one extra routing table and only one mangle rule.

Maybe related, maybe not: the given methods use the "normal" icmp type, and "simple" would likely do as well. You seem to be using type=tcp-conn instead, and I have no idea how that works.

I think it would be easier if you started from scratch rather than editing your earlier attempt.

Yes, when you really want only failover (the names CableDSL and LTE sort of suggest that), you do not need multiple route tables, and recursive routes are the best way.

When in reality you want both connections to be available at the same time, maybe for some kind of load balancing or to send some traffic over one connection and some over the other, multiple route tables are the way to go (combined with recursive routes).
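As a rough illustration of the multiple-route-table idea (not anyone's posted config), here is a minimal RouterOS v7 sketch that sends one LAN host out via LTE while everything else follows "main". The table name via_LTE and the host address 192.168.88.10 are made up for the example; LTE_01 is the interface name from the first post. It assumes a srcnat/masquerade rule for the LTE interface already exists.

```
# Hypothetical names: via_LTE (table) and 192.168.88.10 (LAN host) are assumptions.
/routing table
add name=via_LTE fib

/ip firewall mangle
# Mark forwarded traffic from one LAN host, but leave traffic to the router itself alone
add chain=prerouting src-address=192.168.88.10 dst-address-type=!local \
    action=mark-routing new-routing-mark=via_LTE passthrough=no

/ip route
# Marked traffic is looked up in via_LTE, which only knows the LTE default route
add dst-address=0.0.0.0/0 gateway=LTE_01 routing-table=via_LTE
```

The mangle rule only selects which table is consulted; the route inside that table decides where the packet actually goes.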

Netwatch is a no-no for this usage. It has the problems you describe, and is unnecessarily complex in operation.

... in your opinion ...

Seriously, the linked methods work just fine for a few people, so, unless something has changed in recent ROS versions, they cannot be a no-no.

It is true that the ICMP kind of probe may be tricky to tune on some connections, but with some patience it can be done, and one can always use the "simple" one instead.

Coincidentally, anav just posted a variation of the netwatch approach using the "simple" method and a few other changes:

“simple” netwatch will switch over after a single missed ping, not good.

“icmp” netwatch has strange parameters that are very difficult to understand and get right.

Both have the problem that only a single destination can be watched, or you would require two netwatch instances and a complicated script to switch over when both are down.

Recursive routing is resilient against a single missed ping, it can monitor multiple destinations when used in two layers (first layer pings the ISP router, second layer pings a couple of external systems, default route is the 3rd layer). And with beta RouterOS you can even tweak the ping rate as desired.

Finally recursive routing does not require config change to operate, so no chance it is left in an undesired state e.g. when power fails.
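For reference, a minimal single-layer sketch of recursive-route failover, along the lines of the pattern in the MikroTik documentation. It is an illustration only, not a tested config for this setup: 10.1.0.1 is the WAN1 gateway from the first post, while 10.2.0.1 is a made-up stand-in for the LTE next hop.

```
/ip route
# Pin each probe host to one WAN (scope=10 so the default routes below can resolve through it)
add dst-address=1.1.1.1/32 gateway=10.1.0.1 scope=10 comment="Probe via WAN1"
# 10.2.0.1 is a hypothetical LTE gateway address (assumption)
add dst-address=9.9.9.9/32 gateway=10.2.0.1 scope=10 comment="Probe via WAN2"

# Default routes point at the probe hosts and are pinged through the pinned paths
add dst-address=0.0.0.0/0 gateway=1.1.1.1 distance=1 target-scope=11 \
    check-gateway=ping comment="Main via WAN1"
add dst-address=0.0.0.0/0 gateway=9.9.9.9 distance=2 target-scope=11 \
    check-gateway=ping comment="Backup via WAN2"
```

When pings to 1.1.1.1 stop being answered, the distance=1 route becomes inactive and the distance=2 route takes over, with no script and no change to the stored config.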

@TheKoder

All the up-script and down-script entries in your netwatch configuration appear to be empty.
Even if netwatch is triggered, without suitable scripts there won't be any effect.

Yes, they are empty because I check the state of Netwatch with a scheduled script, so that I can bypass the permission limitations of Netwatch itself and set some global variables.

Hi Pelchi,
Interesting points, will think about them.
My immediate response is that checking two canaries might alleviate the repeat phenomenon well enough.
However, then we need a system script, as netwatch can only check one host at a time... and now it is perhaps clearer why Koder has gone in this direction.

There also seems to be a plethora of other parameters now available and, as you said, if avoidable then better, but none of them seems to offer "repeat x times within X seconds to be sure"...

Problem is I have no idea how these icmp probes work. Is it one ping, or a bunch of packets sent within an interval, so that it really means, for example, 3 pings within 200ms?

Hi Pelchi, in summary, you are describing the phenomenon of FALSE POSITIVES:

  • the ISP is flaky but we don't want flapping, besides using srcnat to avoid killing connections.
  • for some reason the DNS canary, or some part along the way, dropped the single ping, but that is not reflective of the actual connectivity.

Although the ICMP probe options get complex, the initial ones seem straightforward:
packet-interval (Default: 50ms) The time between ICMP-request packet send
packet-count (Default: 10) Total count of ICMP packets to send out within a single test

This seems to me to indicate that the router actually sends out 10 pings, each spaced 50ms apart.
So this is not a one-off failure; it is 10 send-and-response failures already spanning roughly 500ms.

Therefore I think that effectively rules out the DNS address being strangely not accessible.
If that's not good enough, would 20 packets at 200ms make you happy?
20 checks covering about 4 whole seconds??
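The 20-packets-at-200ms idea could be written roughly like this. It is only a sketch: the parameter names are the ones quoted from the docs above, the host and interval are taken from the configs earlier in the thread, and the loss/RTT fail thresholds are left at their defaults (not shown).

```
/tool netwatch
# One test run = 20 echo requests sent 200ms apart (about 4s of probing)
add name=WAN1 type=icmp host=1.1.1.1 interval=20s \
    packet-count=20 packet-interval=200ms
```

Whether a single run counts as "down" still depends on the fail thresholds, so it is worth checking those on the router before relying on this.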

Agreed, a flaky ISP is a different matter and would need a script to combat flapping. Perhaps
we could ask MT to add a repeat function: repeat by default every X seconds and take action only after Y repeats.

Other food for thought is this NOTE in the MT docs:

accept-icmp-time-exceeded=yes can be used together with a manually set low ttl value to monitor Internet connectivity, without relying on a specific endpoint.

For example, you can monitor a public IP address, but that address can filter your ICMP request, or just become unreachable itself, if the Netwatch probe is using this address to monitor Internet connectivity this would cause a false alarm.

To make sure you can reach the Internet, it's generally enough to make sure you can reach a device a few routing hops away. Low time to live value will expire in transit to the specified host you want to monitor - each router passing the ICMP packet will subtract "1" from TTL value, upon TTL reaching 0, ICMP "time exceeded" packet will be generated, and sent back to the Netwatch probe. If all other fail thresholds are not broken, this response will be considered a success.

I think this means: pick an IP address that is 3-4 hops away (I would say at least 2 hops past your ISP), set TTL to 2, and if you get a response message stating "time exceeded", the internet is up.
If the internet (aka the ISP) is down, then one would get no response at all; if by fluke you reached the public IP, you would get normal ping results.

So maybe we can dispense with canary DNS?

But how do we choose a public address that is close enough in hops?

For shits and giggles I ran a traceroute to the DNS canary 1.1.1.1. There were 6 hops within my ISP: the first three local (Nova Scotia), the fourth in another province (New Brunswick), the fifth in another province (Ontario), and the last one still the same ISP, but in the US. It was this last ISP hop, the sixth, where the history started showing solid green (which I take to mean it had exited the ISP's internal network and was live on the WWW).

The seventh hop was at an intermediary third party, and the next two were in the Cloudflare network.

In summary:
hop1 - ISP local city (same IP structure) - red line on History for the most part
hop2 - ISP local city (same IP structure) - red line on History for the most part
hop3 - ISP local city (same IP structure) - red line on History for the most part
hop4 - ISP different province 1 (same IP structure) - red line on History for the most part
hop5 - ISP different province 2 (different IP structure 1) - red line on History for the most part
hop6 - ISP different country (different IP structure 1) - green line on History, solid
hop7 - 3rd party, different country (different IP structure 2) - green line on History, solid
hop8 - Cloudflare (141.101.73.216) - green line on History, solid
hop9 - Cloudflare (1.1.1.1) - green line on History, solid

My take on this, if using the feature noted at the top, would be to set TTL to 7.
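A sketch of what that could look like, untested. accept-icmp-time-exceeded is the option named in the docs note quoted above; the ttl parameter name is my assumption based on the docs' "manually set low ttl value" wording, so check it against your RouterOS version. The entry name WAN1-hop is made up.

```
/tool netwatch
# Packets expire at hop 7 on the way to 1.1.1.1; the "time exceeded" reply counts as success
add name=WAN1-hop type=icmp host=1.1.1.1 interval=20s \
    ttl=7 accept-icmp-time-exceeded=yes
```

The hop count is specific to the path from my ISP, so the right TTL has to be found with traceroute on each connection.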

So… this is the final and working version.

The main part is the routing configuration:

/ip route
add comment="Backup [WAN2__4GLTE]" disabled=yes distance=1 \
    dst-address=0.0.0.0/0 gateway=LTE_01 routing-table=main scope=30 \
    suppress-hw-offload=no target-scope=10
add comment="Main [WAN1__CableDSL]" disabled=no distance=2 \
    dst-address=0.0.0.0/0 gateway=10.1.0.1%WAN_01 routing-table=main scope=30 \
    suppress-hw-offload=no target-scope=10
add blackhole comment="Blackhole WAN1 Host check" disabled=no distance=2 \
    dst-address=1.1.1.1/32 gateway="" routing-table=main scope=30 \
    suppress-hw-offload=no target-scope=10
add comment="WAN1 Check host" disabled=no distance=1 dst-address=1.1.1.1/32 \
    gateway=10.1.0.1%WAN_01 routing-table=main scope=30 \
    suppress-hw-offload=no target-scope=10
add comment="WAN1B Check host" disabled=no distance=1 \
    dst-address=208.67.220.220/32 gateway=10.1.0.1%WAN_01 routing-table=main \
    scope=30 suppress-hw-offload=no target-scope=10
add blackhole comment="Blackhole WAN1B Host check" disabled=no distance=2 \
    dst-address=208.67.220.220/32 gateway="" routing-table=main scope=30 \
    suppress-hw-offload=no target-scope=10
add comment="WAN2 Check host" disabled=no distance=1 dst-address=9.9.9.9/32 \
    gateway=LTE_01 routing-table=main scope=30 suppress-hw-offload=no \
    target-scope=10
add blackhole comment="Blackhole WAN2 Host check" disabled=no distance=2 \
    dst-address=9.9.9.9/32 gateway="" routing-table=main scope=30 \
    suppress-hw-offload=no target-scope=10
add comment="WAN2B Check host" disabled=no distance=1 \
    dst-address=216.58.205.36/32 gateway=LTE_01 routing-table=main scope=30 \
    suppress-hw-offload=no target-scope=10
add blackhole comment="Blackhole WAN2B Host check" disabled=no distance=2 \
    dst-address=216.58.205.36/32 gateway="" routing-table=main scope=30 \
    suppress-hw-offload=no target-scope=10

then Netwatch:

/tool netwatch
add disabled=no down-script="" host=1.1.1.1 http-codes="" interval=20s \
    name=WAN1 port=53 startup-delay=30s test-script="" timeout=5s type=icmp \
    up-script=""
add disabled=no down-script="" host=9.9.9.9 http-codes="" interval=20s \
    name=WAN2 port=53 startup-delay=30s test-script="" timeout=5s type=icmp \
    up-script=""
add disabled=no down-script="" host=216.58.205.36 http-codes="" interval=20s \
    name=WAN2B port=443 startup-delay=0s test-script="" timeout=5s \
    type=tcp-conn up-script=""
add disabled=no down-script="" host=208.67.220.220 http-codes="" interval=20s \
    name=WAN1B port=53 startup-delay=30s test-script="" timeout=5s \
    type=tcp-conn up-script=""

and the checking script:

/system scheduler
add name=SystemInit on-event="/system/script/run Init " \
    policy=ftp,reboot,read,write,policy,test,password,sniff,sensitive,romon \
    start-time=startup
add interval=10s name=NetwatchCheck \
    on-event="/system/script/run NetwatchCheck" \
    policy=ftp,reboot,read,write,policy,test,password,sniff,sensitive,romon \
    start-time=startup

/system script
add dont-require-permissions=no name=NetwatchCheck owner=admin \
    policy=ftp,reboot,read,write,policy,test,password,sniff,sensitive,romon \
    source="
:global WANStatus
:global Time ([/system clock get date] . " " . [/system clock get time])

:local WAN1NetwatchStatus
:local WAN2NetwatchStatus
:local WAN1BNetwatchStatus
:local WAN2BNetwatchStatus

:local WAN1InterfaceRunning
:local WAN2InterfaceRunning

:set WAN1NetwatchStatus [/tool netwatch get [find where name=WAN1] status]
:set WAN2NetwatchStatus [/tool netwatch get [find where name=WAN2] status]

:set WAN1BNetwatchStatus [/tool netwatch get [find where name=WAN1B] status]
:set WAN2BNetwatchStatus [/tool netwatch get [find where name=WAN2B] status]

:set WAN2InterfaceRunning [/interface get [find where name=LTE_01] value-name=running]
:set WAN1InterfaceRunning [/interface get [find where name=WAN_01] value-name=running]

:if (($WAN1NetwatchStatus = "up" || $WAN1BNetwatchStatus = "up") && $WAN1InterfaceRunning = true) do={
:set $WAN1NetwatchStatus "UP"
} else={
:set $WAN1NetwatchStatus "DOWN"
}

:if (($WAN2NetwatchStatus = "up" || $WAN2BNetwatchStatus = "up") && $WAN2InterfaceRunning = true) do={
:set $WAN2NetwatchStatus "UP"
} else={
:set $WAN2NetwatchStatus "DOWN"
}

:if (($WAN1NetwatchStatus = "UP") && (($WANStatus->"WAN1") = "DOWN")) do={
# Disable backup connection
[:execute {/system/script/run WAN1_Up}]
}

:if (($WAN1NetwatchStatus = "DOWN") && (($WANStatus->"WAN1") = "UP")) do={
# Enable backup connection
[:execute {/system/script/run WAN1_Down}]
}

:if (($WAN2NetwatchStatus = "UP") && (($WANStatus->"WAN2") = "DOWN")) do={
[:execute {/system/script/run WAN2_Up}]
}

:if (($WAN2NetwatchStatus = "DOWN") && (($WANStatus->"WAN2") = "UP")) do={
[:execute {/system/script/run WAN2_Down}]
}"

And everything works, so thanks very much for helping me solve this problem.

However, I still don’t understand how routing tables and mark-routing work. I see that they are not a good way to manage failover, but I don’t get how they actually work.

Anyway thanks again for the help!

@TheKoder
Good that it works now.

@anav
Some settings in Netwatch ICMP have been changed in recent versions, and the documentation is partly missing and partly plain wrong.
I made, with a lot of assistance from Amm0, a couple of attempts to expand/elaborate on these, but didn't consider accept-icmp-time-exceeded=yes, so thanks for having brought it up and tested it.
Only for the record, the thread where some other settings were discussed:
Netwatch UP threshold

Many thanks for pointing me to that thread. Can I assume that, if the documentation was wrong, you already alerted MT?