Failover to part1 script

rounin · April 9, 2022, 8:54am

Here is an attempt at a simple failover rescue script.

Partition 0 considered “primary”, partition 1 considered “safe-config”. HW is a CCR-2116.

I want to switch to part1 if the router has booted but the network appears to be down, i.e. if an upgrade borks things so that it boots OK but an interface is lost or corrupted, which happened to me in a 7.2 upgrade.

Any thoughts on this concept of operation? Any best practices for deciding the current partition is healthy? Trying to keep it simple.

{
	:if( [/partitions get part1 active] ) do={
		:log info "Safeboot Watchdog: Already on part1, exiting"
		/quit
	}	

	:log info "Safeboot Watchdog: Waiting 180s"
	:delay 180s

	# Maybe we probe a web page instead of ping
	# :local fetchresult [/tool/fetch url="https://www.google.com" mode=https check-certificate=yes as-value output=user]
	# :if($fetchresult->"status" = "finished") do={

	:for i from=0 to=5 do={
		:log info "Safeboot Watchdog: Pinging 8.8.8.8"

		:if( [/tool/ping address=8.8.8.8 count=10] = 0) do={
			:log error "Safeboot Watchdog: Could not ping 8.8.8.8"
			:delay 30s
		} else={
			:log info "Safeboot Watchdog: Ping ok"
			/quit
		}
	}

	# Maybe we probe another host, if the first one is down
	:for i from=0 to=5 do={
		:log info "Safeboot Watchdog: Pinging 8.8.4.4"

		:if( [/tool/ping address=8.8.4.4 count=10] = 0) do={
			:log error "Safeboot Watchdog: Could not ping 8.8.4.4"
			:delay 30s
		} else={
			:log info "Safeboot Watchdog: Ping ok"
			/quit
		}
	}

	:log error "Safeboot Watchdog: Network seems to be down!"

	:log info "Safeboot Watchdog: Activating part1"
	/partitions {
		activate part1
	}

	:log info "Safeboot Watchdog: Rebooting"
	/system reboot
}
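
For context, the two-partition layout described above (part0 as “primary”, part1 as “safe-config”) could be prepared roughly as follows. This is only a sketch: it assumes the device has enough storage for two partitions and that the default partition names part0 and part1 are in use. Repartitioning reboots the router, so try it on a test unit first.

```
# Split storage into two partitions (the router reboots afterwards)
/partitions repartition partitions=2

# After the reboot, clone the running RouterOS and its config to part1
/partitions copy-to part1

# Later on, refresh only the configuration stored on part1
/partitions save-config-to part1

# Check which partition is active and which one is running
/partitions print
```

The idea is to refresh part1 only while the router is in a known-good state, so the watchdog always has a working configuration to fall back to.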

fragtion · April 9, 2022, 9:33am

I love this idea and will watch this thread closely. I manage several MikroTik devices remotely, most of them vulnerable as a “single point of failure” due to tight budgets. 7.2 bricked a number of the devices I upgraded (at least 1 in every 10), although all of them would still boot fine, only to a corrupted config… I never thought of using partitions as a way to recover from that, but as long as these “partition watchdog scripts” themselves don’t get wiped by the corruption, this could be a real lifesaver (albeit tedious to set up, and maybe prone to false positives).

Of course, the best solution would be a more reliable software/firmware update procedure from MikroTik themselves, but some of us don’t have the time or luxury of waiting indefinitely for things that may never happen. Every update seems to introduce a myriad of new issues, and the upgrade process itself clearly isn’t immune to them.

I’m new to partitions on MikroTik though. Any pointers to a crash course for dummies? I can’t find any video content on YouTube, at least.

rextended · April 9, 2022, 9:39am

On production devices, do not install 7.x; stay on 6.48.6 long-term. Use 7.2 only if 7.x is the only version compatible with the device…

As for partitioning, it does not work on all devices, or at least you need enough storage for it, and partitioning a 16 MB device is a bad idea because it leaves too little working space…

As for the script, I have nothing to add; it is functional.

rounin · April 10, 2022, 8:49am

Huh. Seems like ping command is different in 7.2 vs 6.48.

In 6.48, I get a 0 or a 1 as a result from ping, so the if statement works.

[admin@RouterOS] > :global val [/ping address=8.8.8.8 count=1]     
  SEQ HOST                                     SIZE TTL TIME  STATUS           
    0 8.8.8.8                                    56  59 8ms  
    sent=1 received=1 packet-loss=0% min-rtt=8ms avg-rtt=8ms max-rtt=8ms 

[admin@RouterOS] > :put $val                                  
1

In 7.2, I don’t seem to get any result from ping, so my if statement always takes the else path.

[admin@MT-RB4011] > :global val [/ping address=8.8.8.8 count=1]
Columns: SEQ, HOST, SIZE, TTL, TIME
SEQ  HOST     SIZE  TTL  TIME    
  0  8.8.8.8    56   60  9ms450us

[admin@MT-RB4011] > :put $val

[admin@MT-RB4011] >

It seems like ping behaves differently in v7.2 when its result is stored in a variable? The stats line “sent=1 received=1 packet-loss=0% min-rtt=8ms avg-rtt=8ms max-rtt=8ms” is also missing.

rounin · April 10, 2022, 8:55am

:global val [/tool ping address=8.8.8.8 count=1 as-value]

returns something sensible in v7. I’ll update to that

rounin · April 10, 2022, 9:20am

Somewhat annoyingly the output of ping as-value is not very normalized.

A good ping could return

.id=*0;host=8.8.8.8;seq=0;size=56;time=00:00:00.008848;ttl=60

and a bad ping could return

.id=*0;host=8.8.8.9;seq=0;status=timeout

That is, there is no status field on success.

It would be nice if the status field were always there. For now I treat a missing status as success, by checking [:typeof ($pingres->"status")] = "nothing":

{
	:if ( [/partitions get part1 running] ) do={
		:log info "Safeboot Watchdog: Already on part1, exiting"
		/quit
	}

	:log info "Safeboot Watchdog: Waiting 180s"
	:delay 180s

	# Maybe we probe a web page instead of ping
	# :local fetchresult [/tool/fetch url="https://www.google.com" mode=https check-certificate=yes as-value output=user]
	# :if ($fetchresult->"status" = "finished") do={

	:for i from=0 to=9 do={
		:log info "Safeboot Watchdog: Pinging 8.8.8.8"

		:do {
			:local pingres [/tool/ping address=8.8.8.8 count=1 interval=2 as-value]
			:if ( [:typeof ($pingres->"status")] = "nothing" ) do={
				:log info "Safeboot Watchdog: Ping 8.8.8.8 ok"
				/quit
			} else={
				:log warning "Safeboot Watchdog: Could not ping 8.8.8.8"
				:delay 30s
			}
		} on-error={
			:log warning "Safeboot Watchdog: Ping error 8.8.8.8"
			:delay 30s
		}
	}

	:log error "Safeboot Watchdog: Network seems to be down!"

	:log info "Safeboot Watchdog: Activating part1"
	/partitions activate part1

	:log info "Safeboot Watchdog: Rebooting"
	/system reboot
}
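
The thread does not say how the watchdog gets started. One plausible way to run it once after every boot is a scheduler entry with start-time=startup. The sketch below assumes the script above has been saved under /system script with the hypothetical name safeboot-watchdog; both the name and the choice of the scheduler are assumptions, not something the poster confirmed.

```
# Assumption: the watchdog is stored as /system script "safeboot-watchdog"
# Run it once each time the router starts up
/system scheduler add name=safeboot-watchdog start-time=startup on-event="/system script run safeboot-watchdog"
```

Because the script exits early when part1 is already running, the same scheduler entry can be left in place on both partitions.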

kirstenmw · May 30, 2022, 10:09am

That only happens with count=1 or count=2. With count=3 (or more), the return value is the number of received packets, as in 6.48.
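
If that holds, a check that does not depend on as-value could look at the type of the result before comparing it. This is a minimal sketch, assuming that plain ping with count=3 returns the number of replies as a number and that anything else should be treated as a failure; the variable name received is made up for illustration.

```
{
	# Assumption: without as-value, ping returns the number of replies received
	:local received [/ping address=8.8.8.8 count=3]

	# Only trust the result if it is a number and at least one reply came back
	:if (([:typeof $received] = "num") && ($received > 0)) do={
		:log info "Safeboot Watchdog: Ping ok ($received of 3 replies)"
	} else={
		:log warning "Safeboot Watchdog: No usable ping result"
	}
}
```

Checking the type first means an empty result, as seen with count=1 on 7.2, is counted as a failed probe rather than silently passing.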

armyofonegh · May 30, 2022, 2:31pm

Did the script above work?

rextended · May 30, 2022, 3:54pm

Is that a question?