Problem with failover script.

Hi everyone,

I am using a failover script that pings an outside host, for example Google DNS. The router also has bandwidth-based failover set up.

All is fine except when a power failure happens or a reboot is required. The failover works by increasing the route distances, and here is where the problem lies. After a power outage or reboot the global variables are reset to zero, but the routes keep their increased distances. The script sends pings through the defunct ISP1, finds there is still a problem, and increases the route distances again, say from 3 to 6, and so on.

:local InterfaceISP1 ether1-WAN1
:local InterfaceISP2 ether2-WAN2
:local ISP1 Interface1
:local ISP2 Interface2
:local ISP2s Interface2
:local PingTarget 8.8.8.8
:local FailTreshold 15
:local DistanceIncrease1 2
:local DistanceIncrease2 1
:local DistanceIncrease2s 2
:global PingFailCountISP1
:global PingFailCountISP2
:if ([:typeof $PingFailCountISP1] = "nothing") do={:set PingFailCountISP1 0}
:if ([:typeof $PingFailCountISP2] = "nothing") do={:set PingFailCountISP2 0}
:local PingResult
:set PingResult [ping $PingTarget count=1 interface=$InterfaceISP1]
:put $PingResult
:if ($PingResult = 0) do={
	:if ($PingFailCountISP1 < ($FailTreshold + 2)) do={
		:set PingFailCountISP1 ($PingFailCountISP1 + 1)
		:if (($PingFailCountISP1 = $FailTreshold) && (/ip route find comment=$"ISP1" distance="<3")) do={
			:log warning "ISP1 has a problem en route to $PingTarget - increasing distance of routes."
			:foreach i in=[/ip route find comment=$"ISP1"] do=\
				{/ip route set $i distance=([/ip route get $i distance] + $DistanceIncrease1)}
			:log warning "Route distance increase finished."
			/tool e-mail send to="my@email.com" \
			subject=([/system identity get name] . "_WAN1_Down")\
			password=password user=user\
			body="$[/system identity get name] WAN1 Down Increasing Routes on $[/system clock get date] at $[/system clock get time]"\
			server=[:resolve "smtp.live.com"] start-tls=yes port=587\
			from="my@router.com"
			:log info "Failure Sent to E-mail"
		}	
	}
}
:if ($PingResult = 1) do={
	:if ($PingFailCountISP1 > 0) do={
		:set PingFailCountISP1 ($PingFailCountISP1 - 1)	
		:if ($PingFailCountISP1 = ($FailTreshold -1)) do={
			:log warning "ISP1 can reach $PingTarget again - bringing back original distance of routes."
			:foreach i in=[/ip route find comment=$"ISP1"] do=\
				{/ip route set $i distance=([/ip route get $i distance] - $DistanceIncrease1)}
			:log warning "Route distance decrease finished."
			/tool e-mail send to="my@email.com" \
			subject=([/system identity get name] . "_WAN1_UP")\
			password=password user=user\
			body="$[/system identity get name] WAN1 Up decreasing Routes on $[/system clock get date] at $[/system clock get time]"\
			server=[:resolve "smtp.live.com"] start-tls=yes port=587\
			from="my@router.com"
			:log info "Failure Sent to E-mail"
		}	
	}
}
:delay 5s
:set PingResult [ping $PingTarget count=1 interface=$InterfaceISP2]
:put $PingResult
:if ($PingResult = 0) do={
	:if ($PingFailCountISP2 < ($FailTreshold + 2)) do={
		:set PingFailCountISP2 ($PingFailCountISP2 + 1)	
		:if (($PingFailCountISP2 = $FailTreshold) && (/ip route find comment=$"ISP2" distance="<3")) do={
			:log warning "ISP2 has a problem en route to $PingTarget - increasing distance of routes."
			:foreach i in=[/ip route find comment="ISP2"] do=\
				{/ip route set $i distance=([/ip route get $i distance] + $DistanceIncrease2)}
			:log warning "Route distance increase finished."
			/tool e-mail send to="my@email.com" \
			subject=([/system identity get name] . "_WAN2_Down")\
			password=password user=user\
			body="$[/system identity get name] WAN2 Down Increasing Routes on $[/system clock get date] at $[/system clock get time]"\
			server=[:resolve "smtp.live.com"] start-tls=yes port=587\
			from="my@router.com"
			:log info "Failure Sent to E-mail"
		}	
	}
}
:set PingResult [ping $PingTarget count=1 interface=$InterfaceISP2]
:put $PingResult
:if ($PingResult = 0) do={
	:if ($PingFailCountISP2 < ($FailTreshold + 2)) do={
		:set PingFailCountISP2 ($PingFailCountISP2 + 1)	
		:if (($PingFailCountISP2 = $FailTreshold) && (/ip route find comment=$"ISP2s" distance="<3")) do={
			:log warning "ISP2 has a problem en route to $PingTarget - increasing distance of routes."
			:foreach i in=[/ip route find comment="ISP2s"] do=\
				{/ip route set $i distance=([/ip route get $i distance] + $DistanceIncrease2s)}
			:log warning "Route distance increase finished."
			/tool e-mail send to="my@email.com" \
			subject=([/system identity get name] . "_WAN2_Down")\
			password=password user=user\
			body="$[/system identity get name] WAN2 Down Increasing Routes on $[/system clock get date] at $[/system clock get time]"\
			server=[:resolve "smtp.live.com"] start-tls=yes port=587\
			from="my@router.com"
			:log info "Failure Sent to E-mail"
		}	
	}
}
:if ($PingResult = 1) do={
	:if ($PingFailCountISP2 > 0) do={
		:set PingFailCountISP2 ($PingFailCountISP2 - 1)
		:if ($PingFailCountISP2 = ($FailTreshold - 1)) do={
			:log warning "ISP2 can reach $PingTarget again - bringing back original distance of routes."
			:foreach i in=[/ip route find comment="ISP2"] do=\
				{/ip route set $i distance=([/ip route get $i distance] - $DistanceIncrease2)}
			:log warning "Route distance decrease finished."
			/tool e-mail send to="my@email.com" \
			subject=([/system identity get name] . "_WAN2_Up")\
			password=password user=user\
			body="$[/system identity get name] WAN2 UP Decreasing Routes on $[/system clock get date] at $[/system clock get time]"\
			server=[:resolve "smtp.live.com"] start-tls=yes port=587\
			from="my@router.com"
			:log info "Failure Sent to E-mail"
		}
	}
}

I have added the condition shown in bold below to try to resolve this, but my scripting skills are very limited.

:if (($PingFailCountISP1 = $FailTreshold) [b]&& (/ip route find comment=$"ISP1" distance="<3")[/b])

Is this correct?
Thank you in advance.
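
A note on the added condition, offered as an untested sketch rather than a confirmed fix. As far as I know, a menu command has to be wrapped in square brackets to be evaluated inside an expression, and distance="<3" compares the distance with the literal string "<3" instead of testing for "less than 3". The fragment below assumes RouterOS v6 syntax and reuses the variable names from the script above. It also seeds the counter from the current route state after a reboot, so that routes that are still raised are lowered by the existing recovery branch on the first successful ping. Only the ISP1 part is shown.

```routeros
# Sketch only (untested). Same names and values as in the script above.
:local ISP1 Interface1
:local FailTreshold 15
:local DistanceIncrease1 2
:global PingFailCountISP1

# After a reboot the global is undefined. If the ISP1 routes are still raised
# (distance 3 or more), start the counter at the threshold instead of zero.
:if ([:typeof $PingFailCountISP1] = "nothing") do={
    :if ([:len [/ip route find where comment=$ISP1 and distance>=3]] > 0) do={
        :set PingFailCountISP1 $FailTreshold
    } else={
        :set PingFailCountISP1 0
    }
}

# Guard for the increase: threshold reached AND at least one ISP1 route
# still has its original (low) distance.
:if (($PingFailCountISP1 = $FailTreshold) && ([:len [/ip route find where comment=$ISP1 and distance<3]] > 0)) do={
    :foreach i in=[/ip route find where comment=$ISP1] do={
        /ip route set $i distance=([/ip route get $i distance] + $DistanceIncrease1)
    }
}
```

With the counter seeded at the threshold, a failed ping only pushes it past the threshold, so the distances are not raised a second time.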

I am using Netwatch with static routes. Each checked IP has a static route through its WAN and a second, blackhole route with distance 99. If the respective WAN is not available, the blackhole route becomes active and does not let the ping go out via the other WANs.

The default routes to the WANs have distance 12 for WAN1, 13 for WAN2, 14 for WAN3 and so on…

The Netwatch up script contains a command like this (among others for logging, setting SMTP servers and so on…):

/ip route set distance=12 [find distance=22];

and the reverse for going down.

There are no problems with restarting or with keeping any variables. It does not matter what state the connections are in when Netwatch first runs. If the WAN is down, the blackhole eats the ping and Netwatch raises the distance. When the gateway becomes accessible, the static route for the checked address becomes active and the blackhole inactive. Netwatch then sends the ping through WAN1 to its gateway. Once the address is reachable, Netwatch changes the default route distance for this WAN back from 22 to 12 and it becomes the first default route again. The same happens with the routes to the other WANs.

Of course, each of the distances 12, 13, 14 and 22, 23, 24 is used only once.
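
A minimal sketch of the routes described above, assuming RouterOS v6 syntax (where a blackhole route is written as type=blackhole). The check addresses 192.0.2.1 and 192.0.2.2 and the gateways 10.0.1.1 and 10.0.2.1 are placeholders for illustration, not values from a real setup, and the fragment is untested:

```routeros
/ip route
# Host routes that pin each check address to its own WAN gateway.
add dst-address=192.0.2.1/32 gateway=10.0.1.1 distance=1 comment="check WAN1"
add dst-address=192.0.2.2/32 gateway=10.0.2.1 distance=1 comment="check WAN2"
# Blackhole fallbacks: these become active only when the host route above is
# inactive, so the check ping cannot leave through another WAN.
add dst-address=192.0.2.1/32 type=blackhole distance=99 comment="check WAN1 blackhole"
add dst-address=192.0.2.2/32 type=blackhole distance=99 comment="check WAN2 blackhole"
# Default routes. Netwatch moves WAN1 between 12 and 22, WAN2 between 13 and 23.
add dst-address=0.0.0.0/0 gateway=10.0.1.1 distance=12 comment="default WAN1"
add dst-address=0.0.0.0/0 gateway=10.0.2.1 distance=13 comment="default WAN2"
```

Because every distance value is unique, a command such as /ip route set distance=12 [find distance=22] touches exactly one route.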

Hi jarda, excellent post.

Could you post the route and Netwatch scripts here so I can see how you use them?

Thank you so much.

Thanks for the reply jarda,

I would love to see the scripting for that. But in the meantime, could you confirm whether what I did is correct and, if it's not, kindly give me a hint as to how to correct it? The reason is that I know the script works. It has been tested over long periods of time.

Again thanks for your help.

I hope this will help in the understanding of the failover script above.

/ip route
add check-gateway=ping comment=Interface1 distance=1 gateway=10.3.1.1 \
    routing-mark=WAN1_rout
add check-gateway=ping comment=Interface2s distance=1 gateway=190.x.x.101 \
    routing-mark=WAN2_rout scope=10
add check-gateway=ping comment=Interface1 distance=1 gateway=10.3.1.1
add check-gateway=ping comment=Interface2 distance=2 gateway=190.x.x.101 \
    scope=10

Thanks

Hi jarda,

I think I understand what you suggested.

Basically you set Netwatch to ping some host through the WAN. When there is a problem en route, Netwatch sees the host as down and runs the down script, and when the issue is resolved Netwatch runs the up script.

Netwatch “down” for WAN1

/ip route set distance=12 [find distance=22];

Netwatch “up” for WAN1

/ip route set distance=22 [find distance=12];

Netwatch “down” WAN2

/ip route set distance=13 [find distance=23];

Netwatch “up” WAN2

/ip route set distance=23 [find distance=13];

Is my understanding correct?

Thanks

Exactly. Sorry, I haven't had time to sit at a computer to write more details; I'm mainly using my phone… But you understood well. Does it work for you, or do you need some additional help?

Oh, you wrote it the opposite way round. When up, you need to lower the distance; when down, you have to raise it. Otherwise it is good.

Thanks a million jarda,

/tool netwatch
add comment=WAN1Failover disabled=yes down-script=\
    "/ip route set distance=22 [find distance=12];" host=8.8.8.8 interval=5m \
    up-script="/ip route set distance=12 [find distance=22];"
add comment=WAN2Failover disabled=yes down-script=\
    "/ip route set distance=23 [find distance=13];" host=8.8.8.8 interval=5m \
    up-script="/ip route set distance=13 [find distance=23];"

Thanks

That’s it. But a 5-minute interval looks quite long for failover, don’t you think? I’m using 5 seconds. Also, you cannot use the same IP for checking more than one WAN. It is better to use an IP that is not important to you, so you don’t mind if it is not accessible.

Have you created the static routes mentioned in the beginning?
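
For reference, the two Netwatch entries with a separate check address per WAN and a 5-second interval might look like the sketch below. The hosts 192.0.2.1 and 192.0.2.2 are placeholder addresses (each is assumed to be routed only through its own WAN by a static route), and the fragment is untested:

```routeros
/tool netwatch
add comment=WAN1Failover host=192.0.2.1 interval=5s \
    down-script="/ip route set distance=22 [find distance=12];" \
    up-script="/ip route set distance=12 [find distance=22];"
add comment=WAN2Failover host=192.0.2.2 interval=5s \
    down-script="/ip route set distance=23 [find distance=13];" \
    up-script="/ip route set distance=13 [find distance=23];"
```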

Thank you. jarda, could you post an example of the static routes? I do not understand the part about the blackhole route.

Thanks a lot.

Hi jarda,

Yes, I have my static routes set up. You're right, the 5-minute interval was too long and I have since set it to 30 seconds. I may have found a small issue.

I am using bandwidth-based failover, so if something happened to WAN2 while WAN1 was saturated, the router would start sending data to WAN2 and the client would start to get timeouts.

My question is: can Netwatch be directed to ping out of a specific WAN port so I can monitor each port individually?

mjvneto, here is an example of a static route.

/ip route
      add check-gateway=ping comment=Interface1 distance=1 gateway=10.3.1.1

Below is a screenshot of my route list; “A” is for active and “S” is for static.
static route.PNG

Unfortunately not. I already asked for such functionality to be implemented:
http://forum.mikrotik.com/t/feature-request-netwatch-parameters-extension/75502/1

That was one year ago. It looked like nobody else needed the same…

You are welcome to join my request.

Instead, you have to use routing rules to ensure that.
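
A rough, untested sketch of such routing rules, assuming RouterOS v6 and the routing tables WAN1_rout and WAN2_rout from the route export earlier in the thread. The check addresses 192.0.2.1 and 192.0.2.2 are placeholders, one per WAN:

```routeros
/ip route rule
# Resolve each check address only in the table of the WAN it is meant to test.
add dst-address=192.0.2.1/32 action=lookup-only-in-table table=WAN1_rout
add dst-address=192.0.2.2/32 action=lookup-only-in-table table=WAN2_rout
```

With lookup-only-in-table, a check address that cannot be reached through its own table is not looked up in the main table, so the ping should not leak out through the other WAN.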

Thanks jarda,

It's kind of weird that no one else has requested this. It's such a comprehensible way to do failover, and it has various other uses.