sanity check required: dual-wan failover with netwatch + scheduler + routing table

Hi mikrotik community,

I am building a Dual-WAN setup with automatic failover using netwatch and I need a sanity check of my config.

Things I want to achieve:

  • latest RouterOS 7.13 on an L009
  • no recursive routing solution
  • my head always spins when using recursive routing with multiple monitored hosts per ISP uplink :confused:
    • I want to have more control in deciding when a monitored host is over an acceptable threshold (maybe one uplink is rural LTE and needs more relaxed ping/jitter/timeouts)
    • the very much improved netwatch in RouterOS 7 gives me the ability to set different metrics per monitored host
    • each ISP uplink has two netwatchers which monitor two hosts only over that ISP uplink (so I am monitoring four distinct hosts)
    • a scheduled script runs every 5 seconds and collects all failed hosts per ISP from netwatch and disables the default route per ISP
    • an uplink is considered failed when all netwatched hosts for that uplink fail
    • an uplink is considered available when at least one netwatched host for that uplink is available
  • I would prefer to solve this with routing tables and no mangling involved
  • I have no need for load balancing, but I want to steer certain internal traffic out of a specific ISP uplink; if that uplink fails, the traffic should fall back to the other uplink (think traffic from a VoIP VLAN, which goes out over a low-bandwidth but low-latency ISP uplink, while the rest uses an LTE-backend)
  • my ISP uplinks are with DHCP, but I’ve got that covered with a dhcp-client script, which sets the default route (and correct src-nat for that interface)
  • I am using src-nat for each ISP uplink instead of masquerade (because masquerade causes different problems) – the correct src-nat settings get set by my dhcp-client script – and NAT connections table gets flushed in the scheduled script

While this solution does involve some custom scripting, I am more interested in getting the overall routing design right. So far, in my (limited) testing, most parts work. Netwatch works reliably, and the scheduled script disables/enables the uplink's default route depending on which monitored host I block farther upstream. When the default routes do change, the NAT connections are flushed reliably.
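
For anyone who wants to reproduce the test without access to an upstream device, a monitored host can also be made to "fail" on the router itself. This is only a sketch: the comment string is a made-up label, and it assumes nothing earlier in the output chain already accepts the router's ICMP probes.

```
/ip firewall filter
# hypothetical test rule: drop the router's own ICMP probes to one ISP1 monitor host
add chain=output protocol=icmp dst-address=9.9.9.9 action=drop comment="TEST: simulate 9.9.9.9 down"

# remove the test rule again once the failover has been observed
/ip firewall filter remove [ find where comment="TEST: simulate 9.9.9.9 down" ]
```

Blocking only one of the two hosts for an uplink should leave its default route enabled. Blocking both should disable it at the next scheduler run.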

Questions:

  • Is this layout sane, or are there hidden pitfalls to watch out for?
  • I am struggling with adding a routing rule for my Voice-VLAN/address-range, which gets pinned to e.g. ISP2 unless that one fails, then it probably should use main table?

Any insights are greatly appreciated.

RouterOS default setup:

/interface list member
add interface=bridge list=LAN
add interface=ether1 list=WAN
add interface=ether2 list=WAN
/ip address
add address=192.168.88.1/24 comment=defconf interface=bridge network=192.168.88.0

NAT and DHCP-Client:

/ip firewall nat
# 10.33.x.y is part of my lab setup, correct to-addresses are set by dhcp-client, we prefer src-nat instead of masquerade
add action=src-nat chain=srcnat out-interface=ether1 to-addresses=10.33.20.63
add action=src-nat chain=srcnat out-interface=ether2 to-addresses=10.33.30.55

/ip dhcp-client
add add-default-route=no comment="UPLINK Telekom" interface=ether1 script=":if (\$bound=1) do={\r\
    \n\t/ip firewall nat set [ find where out-interface=\$\"interface\" ] to-addresses=\$\"lease-address\"\r\
    \n\t/ip route set [ find where comment=\"ISP1\" and dst-address=\"0.0.0.0/0\" and routing-table=\"main\" ] gateway=\$\"gateway-address\"\r\
    \n}" use-peer-ntp=no
add add-default-route=no comment="UPLINK Vodafone" interface=ether2 script=":if (\$bound=1) do={\r\
    \n\t/ip firewall nat set [ find where out-interface=\$\"interface\" ] to-addresses=\$\"lease-address\"\r\
    \n\t/ip route set [ find where comment=\"ISP2\" and dst-address=\"0.0.0.0/0\" and routing-table=\"main\" ] gateway=\$\"gateway-address\"\r\
    \n}" use-peer-ntp=no

Routing: ← struggling

/routing table
add disabled=no fib name=to_ISP1
add disabled=no fib name=to_ISP2

/routing rule
add action=lookup-only-in-table disabled=no dst-address=9.9.9.9/32 table=to_ISP1
add action=lookup-only-in-table disabled=no dst-address=208.67.222.222/32 table=to_ISP1
add action=lookup-only-in-table disabled=no dst-address=149.112.112.112/32 table=to_ISP2
add action=lookup-only-in-table disabled=no dst-address=208.67.220.220/32 table=to_ISP2

/ip route
add comment=ISP1 disabled=no distance=1 dst-address=0.0.0.0/0 gateway=10.33.20.1 pref-src="" routing-table=main scope=30 suppress-hw-offload=no target-scope=10
add comment=ISP2 disabled=no distance=2 dst-address=0.0.0.0/0 gateway=10.33.30.1 pref-src="" routing-table=main scope=30 suppress-hw-offload=no target-scope=10
add comment=ISP1 disabled=no distance=1 dst-address=9.9.9.9/32 gateway=10.33.20.1 pref-src="" routing-table=to_ISP1 scope=30 suppress-hw-offload=no target-scope=10
add comment=ISP2 disabled=no distance=1 dst-address=149.112.112.112/32 gateway=10.33.30.1 pref-src="" routing-table=to_ISP2 scope=30 suppress-hw-offload=no target-scope=10
add comment=ISP1 disabled=no distance=1 dst-address=208.67.222.222/32 gateway=10.33.20.1 pref-src="" routing-table=to_ISP1 scope=30 suppress-hw-offload=no target-scope=10
add comment=ISP2 disabled=no distance=1 dst-address=208.67.220.220/32 gateway=10.33.30.1 pref-src="" routing-table=to_ISP2 scope=30 suppress-hw-offload=no target-scope=10
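
One possible shape for the voice VLAN rule, as an untested sketch. The subnet 192.168.20.0/24 is a hypothetical voice range, not part of the config above. The idea relies on action=lookup, which falls through to the main table when the named table has no usable route, whereas lookup-only-in-table does not.

```
/ip route
# default route inside to_ISP2; it carries comment=ISP2, so the watcher script
# (which does not filter on routing-table) would disable/enable it together with the main one
add comment=ISP2 distance=1 dst-address=0.0.0.0/0 gateway=10.33.30.1 routing-table=to_ISP2

/routing rule
# ASSUMPTION: 192.168.20.0/24 is the voice VLAN, adjust to the real range
# keep traffic from the voice VLAN to local subnets in main
add action=lookup src-address=192.168.20.0/24 dst-address=192.168.88.0/24 table=main
# everything else from the voice VLAN prefers to_ISP2 and falls back to main when that route is disabled
add action=lookup src-address=192.168.20.0/24 table=to_ISP2
```

Note that the dhcp-client scripts above only update routes with routing-table="main", so with this sketch the gateway of the extra default route in to_ISP2 would also need to be set there.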

Netwatch + Scheduler:

/system scheduler
add interval=5s name=schedule1 on-event=netwatch-down policy=ftp,reboot,read,write,policy,test,password,sniff,sensitive,romon start-date=2023-12-20 start-time=22:43:33

/tool netwatch
add comment=ISP1 disabled=no host=208.67.222.222  http-codes="" interval=3s start-delay=15s startup-delay=15s test-script="" type=icmp
add comment=ISP1 disabled=no host=9.9.9.9         http-codes="" interval=3s start-delay=15s startup-delay=15s test-script="" type=icmp
add comment=ISP2 disabled=no host=208.67.220.220  http-codes="" interval=3s start-delay=15s startup-delay=15s test-script="" type=icmp
add comment=ISP2 disabled=no host=149.112.112.112 http-codes="" interval=3s start-delay=15s startup-delay=15s test-script="" type=icmp
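
To illustrate the per-host thresholds mentioned above, the probes of one uplink could be relaxed like this. This is a sketch with made-up values. Treating ISP2 as the lossier uplink is an assumption for the example only. The property names packet-count and thr-loss-percent are given as I remember them from the RouterOS 7 netwatch ICMP probe; the rtt and jitter thresholds live under similar thr-* names, which are best checked with tab completion on the actual version.

```
/tool netwatch
# stricter probe for the (assumed) stable uplink
set [ find where comment="ISP1" ] interval=3s packet-count=5 thr-loss-percent=40%
# more relaxed probe for the (assumed) lossy uplink
set [ find where comment="ISP2" ] interval=5s packet-count=5 thr-loss-percent=80%
```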

Watcher-Script:

# established connections only get flushed when a default route was actually enabled or disabled
:local watcherFlushConnections false
:local watcherUplinks {"ISP1";"ISP2"}
:foreach isp in=$watcherUplinks do={
	:local watcherFailed [ :len [ /tool netwatch find where comment=$"isp" and status="down" ] ]
	:local watcherAvailable [ :len [ /tool netwatch find where comment=$"isp" ] ]
	:log debug ( $"isp" . ": " . $watcherFailed . "/" . $watcherAvailable )

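	# disable uplink routes only if all netwatchers for that uplink are unreachable
	# (the > 0 guard avoids disabling an uplink that has no netwatch entries at all)
	:if (($watcherAvailable > 0) and ($watcherFailed >= $watcherAvailable)) do={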
		:foreach idx in=[ /ip route find where comment=$"isp" and dst-address="0.0.0.0/0" ] do={
			:local watcherStatus ([ /ip route get $idx disabled ])
			:if ($watcherStatus=false) do={
				:log warning ("disabling route for " . $isp)
				/ip route disable $idx
				:set watcherFlushConnections true
			}
		}
	}
	# enable uplinks routes if any netwatchers for that uplink are reachable
	:if ($watcherFailed < $watcherAvailable) do={
		:foreach idx in=[ /ip route find where comment=$"isp" and dst-address="0.0.0.0/0" ] do={
			:local watcherStatus ([ /ip route get $idx disabled ])
			:if ($watcherStatus=true) do={
				:log warning ("enabling route for " . $isp)
				/ip route enable $idx
				:set watcherFlushConnections true
			}
		}
	}
}
# if default routes have been changed, we need to kill current NAT connections;
# note that [ find ] without a filter removes every tracked connection, not only NATed ones
:if ($watcherFlushConnections=true) do={
	:log warning "flushing NAT connections"
	/ip firewall connection remove [ find ]
}