Suggestion: Completely virtual router based on two physical routers

bbs2web · February 11, 2018, 12:43am

Many thanks, you’ve saved me days! I tested this on virtualised routers first and hit a problem where all interfaces would get disabled, including the VRRP parent, until I commented out the following line in the ha_startup script:
/system routerboard settings set silent-boot=yes

It’s a virtual x86, so it made sense that it failed. I additionally reduced the subnet in the ha_config from /24 to /29. The ha_switchrole script appears to have hardcoded values, which don’t match the settings from ha_config, so I set the HA sync interface and then assume it should ping the slave (169.254.23.2), right?

I see no references to the scripts using telnet or ssh so I additionally stopped it restricting those protocols to the HA addresses:
Edited ha_startup script from:
:foreach service in [:toarray "ftp,telnet,ssh"] do={
to:
:foreach service in [:toarray "ftp"] do={

Excellent work! We typically implement redundancy using OSPF, BGP and/or VRRP, but bridging VPLS tunnels and retrofitting redundancy on complicated routers with a lot of /30 subnets is very easy using the collection of scripts you’ve written!

Mikrotik should really incorporate your work as a heartbeat HA function, instead of wasting time on kid control…

nathan1 · February 11, 2018, 1:02am

Hey bbs2web,

Nice work debugging it for your platform. We can put an on-error around the silent-boot so it works correctly in both cases.

I assume you changed the VRRP address as well when you changed it to a /29? I’d be reluctant to switch to a /29 only because it won’t cover the .10 VRRP address that I have used since the project was created. I guess we can go with a /28 if you feel that you really want to shrink the /24.
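
To make the coverage argument concrete, here is a small sketch using Python's standard ipaddress module. Python is used purely for illustration and is not part of ha-mikrotik; the helper name covers is made up. It checks which prefix lengths on 169.254.23.0 still contain the original .10 VRRP address and the .3 address proposed for the /29.

```python
import ipaddress

def covers(network: str, address: str) -> bool:
    """Return True if `address` falls inside `network`."""
    return ipaddress.ip_address(address) in ipaddress.ip_network(network)

# Compare the prefix lengths discussed in the thread.
for bits in (24, 28, 29):
    net = f"169.254.23.0/{bits}"
    print(net,
          "contains .10:", covers(net, "169.254.23.10"),
          "contains .3:", covers(net, "169.254.23.3"))
```

A /29 spans .0 to .7, so .10 falls outside it, while a /28 spans .0 to .15 and keeps the existing VRRP address valid.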

I adjust the rules for all 3 services to make sure that the other device can always be used to manually access all of the services. It is more of a management/debugging tool when something might go wrong vs. part of ha-mikrotik automation.

Good catch on the switchrole, it is actually a script I very rarely use and wasn’t intended to be committed. It needs to be changed to use $haOtherAddress and $haInterface rather than the fixed IP and interface.

Is it generally working well for you on x86? How long does the slave take to boot back up after an ha_pushbackup?

tetecko:

Give this a go: https://github.com/svlsResearch/ha-mikrotik

bbs2web · February 11, 2018, 6:46pm

Hi Nathan,

Booting an x86 virtual machine takes approximately 40 seconds. I converted a customer’s active backup routers that we were maintaining, with about 70 individual VRRP interfaces, to your HA system. The entire process took about 30 minutes and is elegantly simple.

No longer have to work with /29 subnets everywhere and no longer have to do everything twice.

Yes, I made the first master 169.254.23.1/29, the initial slave 169.254.23.2/29 and the floating VRRP IP 169.254.23.3.

I’m implementing this on two pairs of CCR1036 routers, at a financial institution, during their maintenance window tomorrow morning. They already have a spanning tree mess, with their Cisco stack running RPVST+ and their Hyper-V environment running with switches in MSTP mode. This way they have 10-second failover redundancy for bridged VLANs using VPLS between their primary and DR site.

The client has PCI DSS and ISO compliance tests scheduled in the next 45 days. Confident that everything works!

Really, really excellent work, well done and thank you!

bbs2web · February 18, 2018, 7:40am

Would you please consider accepting the following patch? It does the following:

Changes '] > ' to ']  > ' (an extra space) to stop rancid (configuration revision management) from matching it as the RouterOS prompt.
Changes the netmask from /24 to /29 and moves the VRRP IP from .10 to .3.
Sets the schedulers’ start date to the Unix Epoch (Jan/01/1970).
Sets the schedulers’ intervals and start times to prevent overlapping.
Only changes the FTP service; this prevents SSH becoming unreachable on the master and avoids enabling Telnet.
Replaces hard-coded values with variables.
Disables adding the default route (makes loopback interfaces reachable).
Disables silencing the RouterBOARD boot process by default and handles errors (e.g. on a VM).
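
As a rough check of the scheduler timing in the list above, the sketch below (Python, for illustration only) assumes that a scheduler entry fires at its start time plus whole multiples of its interval, and looks for times of day at which two jobs would start in the same second. The function name and the firing model are assumptions rather than RouterOS code; the start times and intervals are the ones used in the patch.

```python
from datetime import timedelta

DAY = timedelta(hours=24)

def fire_times(start: timedelta, interval: timedelta) -> set:
    """Times of day (offsets from midnight) at which a job fires, assuming it
    runs at start + k * interval. Only exact when the interval divides 24h."""
    runs_per_day = DAY // interval
    return {(start + k * interval) % DAY for k in range(runs_per_day)}

# Values taken from the patched ha_startup script.
export_current = fire_times(timedelta(minutes=5), timedelta(minutes=10))
check_changes = fire_times(timedelta(minutes=10), timedelta(minutes=10))
push_backup = fire_times(timedelta(hours=5), timedelta(hours=24))

print("exportcurrent vs checkchanges:", sorted(export_current & check_changes))
print("checkchanges vs pushbackup:", sorted(check_changes & push_backup))
```

Under this model the two 10-minute jobs never start in the same second, while the daily ha_auto_pushbackup at 05:00:00 lands on the same second as one ha_checkchanges run. Whether that matters in practice is not something this sketch can show.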

--- HA_init.rsc 2018-02-18 08:54:22.000000000 +0200
+++ ../../HA_init.rsc   2018-02-18 09:32:25.000000000 +0200
@@ -1,7 +1,7 @@
 :do {
 /system script
 remove [find name=ha_checkchanges_new]
-add name=ha_checkchanges_new owner=admin policy=ftp,reboot,read,write,policy,test,password,sniff,sensitive source=":if ([:len [/system script job find where script=\"ha_checkchanges\"]] > 1) do={:error \"already running checkchanges\"; } \
+add name=ha_checkchanges_new owner=admin policy=ftp,reboot,read,write,policy,test,password,sniff,sensitive source=":if ([:len [/system script job find where script=\"ha_checkchanges\"]]  > 1) do={:error \"already running checkchanges\"; } \
        \n:global isMaster\
        \n:global isStandbyInSync\
        \n:global haPassword\
@@ -39,11 +39,11 @@
 remove [find name=ha_config_new]
 add name=ha_config_new owner=admin policy=ftp,reboot,read,write,policy,test,password,sniff,sensitive source="/system script run [find name=\"ha_config_base\"]\
        \n:global haNetwork \"169.254.23.0\"\
-       \n:global haNetmask \"255.255.255.0\"\
-       \n:global haNetmaskBits \"24\"\
+       \n:global haNetmask \"255.255.255.248\"\
+       \n:global haNetmaskBits \"29\"\
        \n:global haAddressA \"169.254.23.1\"\
        \n:global haAddressB \"169.254.23.2\"\
-       \n:global haAddressVRRP \"169.254.23.10\""
+       \n:global haAddressVRRP \"169.254.23.3\""
 remove [find name=ha_functions_new]
 add name=ha_functions_new owner=admin policy=ftp,reboot,read,write,policy,test,password,sniff,sensitive source=":global HADebug do={\
        \n   :put \$1\
@@ -103,7 +103,7 @@
        \n   :error \"Are you sure the other device is configured properly? I am unable to ping MAC \$pingMac\"\
        \n}\
        \n\
-       \n:if ([:len [/ip address find where interface=\"\$haInterface\" and comment!=\"HA_AUTO\"]] > 0) do {\
+       \n:if ([:len [/ip address find where interface=\"\$haInterface\" and comment!=\"HA_AUTO\"]]  > 0) do {\
        \n   :error \"Interface \$haInterface has IP addresses. HA should completely own the interface and it cannot be used by anything else. Please correct\"\
        \n}\
        \n\
@@ -155,7 +155,7 @@
        \n:execute \"ha_setidentity\"\
        \n:do { :local k [/system script find name=\"on_master\"]; if ([:len \$k] = 1) do={ /system script run \$k } } on-error={ :put \"on_master failed\" }"
 remove [find name=ha_pushbackup_new]
-add name=ha_pushbackup_new owner=admin policy=ftp,reboot,read,write,policy,test,password,sniff,sensitive source=":if ([:len [/system script job find where script=\"ha_pushbackup\"]] > 1) do={:error \"already running pushbackup\"; } \
+add name=ha_pushbackup_new owner=admin policy=ftp,reboot,read,write,policy,test,password,sniff,sensitive source=":if ([:len [/system script job find where script=\"ha_pushbackup\"]]  > 1) do={:error \"already running pushbackup\"; } \
        \n:global haPassword\
        \n:global isMaster\
        \n:global haAddressOther\
@@ -247,7 +247,7 @@
        \n}\
        \n/log warning \"ha_startup: 0.3\"\
        \n/interface ethernet disable [find]\
-       \n:global haStartupHAVersion \"0.2alpha - ea961767e45b63b81aac87eed37301d8b70bedf7\"\
+       \n:global haStartupHAVersion \"0.2alpha - 858dc62b5a9e215a5e5896137a053d01d16695c6\"\
        \n:global isStandbyInSync false\
        \n:global isMaster false\
        \n:global haPassword\
@@ -268,7 +268,7 @@
        \n/system scheduler remove [find comment=\"HA_AUTO\"]\
        \n\
        \n#Pause on-error just in case we error out before the spin loop - hope 5 seconds is enough.\
-       \n/system scheduler add comment=HA_AUTO name=ha_startup on-event=\":do {:global haInterface; /system script run [find name=ha_startup]; } on-error={ :delay 5; /interface ethernet disable [find default-name!=\\\"\\\$haInterface\\\"]; /log error \\\"ha_startup: FAILED - DISABLED ALL INTERFACES\\\" }\" policy=ftp,reboot,read,write,policy,test,password,sniff,sensitive start-time=startup \
+       \n/system scheduler add comment=HA_AUTO name=ha_startup on-event=\":do {:global haInterface; /system script run [find name=ha_startup]; } on-error={ :delay 5; /interface ethernet disable [find default-name!=\\\"\\\$haInterface\\\"]; /log error \\\"ha_startup: FAILED - DISABLED ALL INTERFACES\\\" }\" policy=ftp,reboot,read,write,policy,test,password,sniff,sensitive start-date=Jan/01/1970 start-time=startup \
        \n\
        \n#/interface ethernet reset-mac-address\
        \n/ip address remove [find interface=\"\$haInterface\"]\
@@ -315,8 +315,8 @@
        \n   }\
        \n}\
        \n\
-       \n/ip route remove [find comment=\"HA_AUTO\"]   \
-       \n/ip route add gateway=\$haAddressOther distance=250 comment=HA_AUTO\
+       \n#/ip route remove [find comment=\"HA_AUTO\"]   \
+       \n#/ip route add gateway=\$haAddressOther distance=250 comment=HA_AUTO\
        \n\
        \n/log warning \"ha_startup: 4\"\
        \n\
@@ -337,10 +337,10 @@
        \n/ip address add address=\$haAddressVRRP netmask=255.255.255.255 interface=HA_VRRP comment=\"HA_AUTO\"\
        \n\
        \n/log warning \"ha_startup: 6\"\
-       \n/system scheduler add comment=HA_AUTO interval=30m name=ha_exportcurrent on-event=\"/export file=\\\"HA_current.rsc\\\"\" policy=ftp,reboot,read,write,policy,test,password,sniff,sensitive start-date=jan/20/2000 start-time=22:37:10\
-       \n/system scheduler add interval=10m name=ha_checkchanges on-event=ha_checkchanges policy=ftp,reboot,read,write,policy,test,password,sniff,sensitive start-date=jan/1/2000 start-time=18:00:30 comment=HA_AUTO\
+       \n/system scheduler add comment=HA_AUTO interval=10m name=ha_exportcurrent on-event=\"/export file=\\\"HA_current.rsc\\\"\" policy=ftp,reboot,read,write,policy,test,password,sniff,sensitive start-date=Jan/01/1970 start-time=00:05:00\
+       \n/system scheduler add interval=10m name=ha_checkchanges on-event=ha_checkchanges policy=ftp,reboot,read,write,policy,test,password,sniff,sensitive start-date=Jan/01/1970 start-time=00:10:00 comment=HA_AUTO\
        \n#Still need this - things like DHCP leases dont cause a system config change, we want to backup periodically.\
-       \n/system scheduler add comment=HA_AUTO interval=24h name=ha_auto_pushbackup on-event=ha_pushbackup policy=ftp,reboot,read,write,policy,test,password,sniff,sensitive start-date=jan/20/2000 start-time=05:00:00\
+       \n/system scheduler add comment=HA_AUTO interval=24h name=ha_auto_pushbackup on-event=ha_pushbackup policy=ftp,reboot,read,write,policy,test,password,sniff,sensitive start-date=Jan/01/1970 start-time=05:00:00\
        \n/log warning \"ha_startup: 7\"\
        \n:if ([:len [/file find name=\"HA_dsa\"]] = 1) do={\
        \n   /ip ssh import-host-key private-key-file=HA_rsa\
@@ -352,9 +352,9 @@
        \n/user add address=\"\$haNetwork/\$haNetmaskBits\" comment=HA_AUTO group=full name=ha password=\"\$haPassword\"\
        \n/log warning \"ha_startup: 8\"\
        \n#So you dont get annoyed with constant beeping\
-       \n/system routerboard settings set silent-boot=yes\
+       \n#:do {/system routerboard settings set silent-boot=yes} on-error={};\
        \n\
-       \n:foreach service in [:toarray \"ftp,telnet,ssh\"] do={\
+       \n:foreach service in [:toarray \"ftp\"] do={\
        \n   :local serviceAddresses \"\"\
        \n   :foreach k in=[/ip service get [find name=\$service] address] do={\
        \n      :if (\$k != \"\$haAddressA/32\" and \$k != \"\$haAddressB/32\" and \$k != \"\$haAddressVRRP/32\") do {\
@@ -365,7 +365,7 @@
        \n   /ip service set [find name=\$service] address=[:toarray \$serviceAddresses]\
        \n}\
        \n\
-       \n:if ([:len [/file find where name=\"HA_run-after-hastartup.rsc\"]] > 0) do {\
+       \n:if ([:len [/file find where name=\"HA_run-after-hastartup.rsc\"]]  > 0) do {\
        \n   /import HA_run-after-hastartup.rsc\
        \n}\
        \n/delay 5\
@@ -388,7 +388,7 @@
        \n   /system script run [find name=\"ha_pushbackup\"]\
        \n   :put \"delaying 60\"\
        \n   /delay 60\
-       \n   :if (\$isMaster && [/ping 169.254.23.3 count=1 interface=ether1 ttl=1] >= 1) do {\
+       \n   :if (\$isMaster && [/ping \$haAddressOther count=1 interface=\$haInterface ttl=1]  >= 1) do {\
        \n      :put \"REBOOTING MYSELF\"\
        \n      :execute \"/system reboot\"\
        \n   } else {\
diff -uNr scripts/ha_checkchanges.script ../../scripts/ha_checkchanges.script
--- scripts/ha_checkchanges.script      2018-02-17 11:58:46.000000000 +0200
+++ ../../scripts/ha_checkchanges.script        2018-02-17 12:35:29.000000000 +0200
@@ -1,4 +1,4 @@
-:if ([:len [/system script job find where script="ha_checkchanges"]] > 1) do={:error "already running checkchanges"; }
+:if ([:len [/system script job find where script="ha_checkchanges"]]  > 1) do={:error "already running checkchanges"; }
 :global isMaster
 :global isStandbyInSync
 :global haPassword
diff -uNr scripts/ha_config.script ../../scripts/ha_config.script
--- scripts/ha_config.script    2018-02-18 08:54:28.000000000 +0200
+++ ../../scripts/ha_config.script      2018-02-18 08:54:06.000000000 +0200
@@ -1,7 +1,7 @@
 /system script run [find name="ha_config_base"]
 :global haNetwork "169.254.23.0"
-:global haNetmask "255.255.255.0"
-:global haNetmaskBits "24"
+:global haNetmask "255.255.255.248"
+:global haNetmaskBits "29"
 :global haAddressA "169.254.23.1"
 :global haAddressB "169.254.23.2"
-:global haAddressVRRP "169.254.23.10"
\ No newline at end of file
+:global haAddressVRRP "169.254.23.3"
\ No newline at end of file
diff -uNr scripts/ha_install.script ../../scripts/ha_install.script
--- scripts/ha_install.script   2018-02-17 12:13:18.000000000 +0200
+++ ../../scripts/ha_install.script     2018-02-17 12:37:49.000000000 +0200
@@ -29,7 +29,7 @@
    :error "Are you sure the other device is configured properly? I am unable to ping MAC $pingMac"
 }

-:if ([:len [/ip address find where interface="$haInterface" and comment!="HA_AUTO"]] > 0) do {
+:if ([:len [/ip address find where interface="$haInterface" and comment!="HA_AUTO"]]  > 0) do {
    :error "Interface $haInterface has IP addresses. HA should completely own the interface and it cannot be used by anything else. Please correct"
 }

diff -uNr scripts/ha_pushbackup.script ../../scripts/ha_pushbackup.script
--- scripts/ha_pushbackup.script        2018-02-17 12:13:47.000000000 +0200
+++ ../../scripts/ha_pushbackup.script  2018-02-17 12:38:25.000000000 +0200
@@ -1,4 +1,4 @@
-:if ([:len [/system script job find where script="ha_pushbackup"]] > 1) do={:error "already running pushbackup"; }
+:if ([:len [/system script job find where script="ha_pushbackup"]]  > 1) do={:error "already running pushbackup"; }
 :global haPassword
 :global isMaster
 :global haAddressOther
diff -uNr scripts/ha_startup.script ../../scripts/ha_startup.script
--- scripts/ha_startup.script   2018-02-17 12:39:39.000000000 +0200
+++ ../../scripts/ha_startup.script     2018-02-18 09:32:33.000000000 +0200
@@ -35,7 +35,7 @@
 /system scheduler remove [find comment="HA_AUTO"]

 #Pause on-error just in case we error out before the spin loop - hope 5 seconds is enough.
-/system scheduler add comment=HA_AUTO name=ha_startup on-event=":do {:global haInterface; /system script run [find name=ha_startup]; } on-error={ :delay 5; /interface ethernet disable [find default-name!=\"\$haInterface\"]; /log error \"ha_startup: FAILED - DISABLED ALL INTERFACES\" }" policy=ftp,reboot,read,write,policy,test,password,sniff,sensitive start-time=startup
+/system scheduler add comment=HA_AUTO name=ha_startup on-event=":do {:global haInterface; /system script run [find name=ha_startup]; } on-error={ :delay 5; /interface ethernet disable [find default-name!=\"\$haInterface\"]; /log error \"ha_startup: FAILED - DISABLED ALL INTERFACES\" }" policy=ftp,reboot,read,write,policy,test,password,sniff,sensitive start-date=Jan/01/1970 start-time=startup

 #/interface ethernet reset-mac-address
 /ip address remove [find interface="$haInterface"]
@@ -82,8 +82,8 @@
    }
 }

-/ip route remove [find comment="HA_AUTO"]
-/ip route add gateway=$haAddressOther distance=250 comment=HA_AUTO
+#/ip route remove [find comment="HA_AUTO"]
+#/ip route add gateway=$haAddressOther distance=250 comment=HA_AUTO

 /log warning "ha_startup: 4"

@@ -104,10 +104,10 @@
 /ip address add address=$haAddressVRRP netmask=255.255.255.255 interface=HA_VRRP comment="HA_AUTO"

 /log warning "ha_startup: 6"
-/system scheduler add comment=HA_AUTO interval=30m name=ha_exportcurrent on-event="/export file=\"HA_current.rsc\"" policy=ftp,reboot,read,write,policy,test,password,sniff,sensitive start-date=jan/20/2000 start-time=22:37:10
-/system scheduler add interval=10m name=ha_checkchanges on-event=ha_checkchanges policy=ftp,reboot,read,write,policy,test,password,sniff,sensitive start-date=jan/1/2000 start-time=18:00:30 comment=HA_AUTO
+/system scheduler add comment=HA_AUTO interval=10m name=ha_exportcurrent on-event="/export file=\"HA_current.rsc\"" policy=ftp,reboot,read,write,policy,test,password,sniff,sensitive start-date=Jan/01/1970 start-time=00:05:00
+/system scheduler add interval=10m name=ha_checkchanges on-event=ha_checkchanges policy=ftp,reboot,read,write,policy,test,password,sniff,sensitive start-date=Jan/01/1970 start-time=00:10:00 comment=HA_AUTO
 #Still need this - things like DHCP leases dont cause a system config change, we want to backup periodically.
-/system scheduler add comment=HA_AUTO interval=24h name=ha_auto_pushbackup on-event=ha_pushbackup policy=ftp,reboot,read,write,policy,test,password,sniff,sensitive start-date=jan/20/2000 start-time=05:00:00
+/system scheduler add comment=HA_AUTO interval=24h name=ha_auto_pushbackup on-event=ha_pushbackup policy=ftp,reboot,read,write,policy,test,password,sniff,sensitive start-date=Jan/01/1970 start-time=05:00:00
 /log warning "ha_startup: 7"
 :if ([:len [/file find name="HA_dsa"]] = 1) do={
    /ip ssh import-host-key private-key-file=HA_rsa
@@ -119,9 +119,9 @@
 /user add address="$haNetwork/$haNetmaskBits" comment=HA_AUTO group=full name=ha password="$haPassword"
 /log warning "ha_startup: 8"
 #So you dont get annoyed with constant beeping
-/system routerboard settings set silent-boot=yes
+#:do {/system routerboard settings set silent-boot=yes} on-error={};

-:foreach service in [:toarray "ftp,telnet,ssh"] do={
+:foreach service in [:toarray "ftp"] do={
    :local serviceAddresses ""
    :foreach k in=[/ip service get [find name=$service] address] do={
       :if ($k != "$haAddressA/32" and $k != "$haAddressB/32" and $k != "$haAddressVRRP/32") do {
@@ -132,7 +132,7 @@
    /ip service set [find name=$service] address=[:toarray $serviceAddresses]
 }

-:if ([:len [/file find where name="HA_run-after-hastartup.rsc"]] > 0) do {
+:if ([:len [/file find where name="HA_run-after-hastartup.rsc"]]  > 0) do {
    /import HA_run-after-hastartup.rsc
 }
 /delay 5
diff -uNr scripts/ha_switchrole.script ../../scripts/ha_switchrole.script
--- scripts/ha_switchrole.script        2018-02-17 12:14:19.000000000 +0200
+++ ../../scripts/ha_switchrole.script  2018-02-18 09:17:57.000000000 +0200
@@ -4,7 +4,7 @@
    /system script run [find name="ha_pushbackup"]
    :put "delaying 60"
    /delay 60
-   :if ($isMaster && [/ping 169.254.23.3 count=1 interface=ether1 ttl=1] >= 1) do {
+   :if ($isMaster && [/ping $haAddressOther count=1 interface=$haInterface ttl=1]  >= 1) do {
       :put "REBOOTING MYSELF"
       :execute "/system reboot"
    } else {

bbs2web · February 18, 2018, 1:10pm

The following patch keeps the original MAC address of the HA heartbeat and configuration synchronisation interface on both routers. It is not necessary on hardware routers with a direct point-to-point network cable, but it is necessary when working with virtual guests or where the HA interfaces connect via a switch:

--- scripts/ha_startup.script   2018-02-17 12:39:39.000000000 +0200
+++ ../../scripts/ha_startup.script     2018-02-18 15:01:54.000000000 +0200
@@ -37,9 +37,9 @@
 #Pause on-error just in case we error out before the spin loop - hope 5 seconds is enough.
 /system scheduler add comment=HA_AUTO name=ha_startup on-event=":do {:global haInterface; /system script run [find name=ha_startup]; } on-error={ :delay 5; /interface ethernet disable [find default-name!=\"\$haInterface\"]; /log error \"ha_startup: FAILED - DISABLED ALL INTERFACES\" }" policy=ftp,reboot,read,write,policy,test,password,sniff,sensitive start-date=Jan/01/1970 start-time=startup

-#/interface ethernet reset-mac-address
+/interface ethernet reset-mac-address [find default-name="$haInterface"]
 /ip address remove [find interface="$haInterface"]
 /ip address remove [find comment="HA_AUTO"]
 /interface vrrp remove [find name="HA_VRRP"]

nathan1 · February 18, 2018, 1:10pm

These all sound good and I will integrate them, but I have two questions:
Nice catch on rancid, it actually impacts me as well. I think we need to fix rancid and give it a stricter prompt for export. Even if we escape ha-mikrotik, it will still break rancid if there is any other script on the devices that uses ] >.
Can you try the below patch to rancid and see how it works for you?

Can you help me understand the “Disables adding default route” and how it interacts with the loopbacks for you?
I actually use the default because I have a MASQUERADE rule in my setup that allows the ha-mikrotik network to get out to the internet. I do this to test RouterOS upgrades: I login to the standby, do a RouterOS upgrade, then do a push from the primary, check if the standby looks right, then switch roles and repeat on the new secondary (old master).

Additionally, I’ve been thinking about giving the secondary a stable known address in addition to the floating ones (ie: .3 is always master, .4 is always secondary). If I do this, it would allow for a NAT setup to allow easier external access to the secondary for monitoring. Additionally, maybe an simple that can be used with the Mikrotik SNMP script GET to monitor the state of the pair. Any thoughts on how you might want to monitor the secondary in general?

--- mtrancid.orig	2018-02-18 07:55:03.199828386 -0500
+++ mtrancid	2018-02-18 07:55:20.856371114 -0500
@@ -235,9 +235,13 @@
 	print STDERR "    In Export: $_" if ($debug);
 	my $buffer = "";
 
+    #Be much stricter on the quit prompt when exporting. If scripts contain ] > then it is incorrectly terminated early.
+    my $prompt_quit = "${prompt}quit\$";
+	print STDERR "    Quit prompt for export: $prompt_quit\n" if ($debug);
+
 	while (<INPUT>) {
 		tr/\015//d;
-		if (/$prompt/) { $found_end=1; $clean_run=1; return 0};
+		if (/$prompt_quit/) { $found_end=1; $clean_run=1; return 0};
 		next if(/^(\s*|\s*$cmd\s*)$/);
 		next if(/^#/);
 		return(1) if /(bad command name )/;
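
The effect of the stricter match in the patch above can be illustrated outside rancid. The sketch below is Python rather than the Perl used by mtrancid, and the prompt pattern is a simplified stand-in (the real $prompt value is not shown in this thread), so it only illustrates the idea.

```python
import re

# Hypothetical, simplified stand-in for mtrancid's $prompt pattern.
PROMPT = r"\] > "
LOOSE = re.compile(PROMPT)             # original behaviour: any prompt-like text
STRICT = re.compile(PROMPT + r"quit$")  # patched behaviour: prompt followed by "quit" at end of line

def ends_export(line: str, pattern: re.Pattern) -> bool:
    """Return True if `line` would be treated as the end of the export."""
    return pattern.search(line) is not None

# A line from an exported script, before and after the extra-space workaround.
script_line = ':if ([:len [/system script job find where script="ha_checkchanges"]] > 1) do={:error "x"; }'
spaced_line = script_line.replace("]] > 1", "]]  > 1")
quit_line = "[admin@router] > quit"  # made-up prompt line for illustration

for name, line in (("script", script_line), ("spaced", spaced_line), ("quit", quit_line)):
    print(name, "loose:", ends_export(line, LOOSE), "strict:", ends_export(line, STRICT))
```

With the loose pattern the unmodified script line ends the export early; the extra space avoids that for this one script, while the strict pattern only stops at the prompt that is followed by quit.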

nathan1 · February 18, 2018, 2:50pm

With regard to changing to a /29…we are going to need a better upgrade procedure. Upgrades (rather undocumented) have always consisted of basically just doing an /import HA_init.rsc, pushing, switch roles, push, done. If we change the default VRRP addressing and then use this method then this will break all existing users that use the /24. The secondary ends up taking over and they never reconcile their differences and end up in a reboot loop.

I agree that the user should be able to select their own network but I think I’d rather do it with the existence of an alternate configuration that overrides the standard configuration.
It can also be done as extra parameters to $HAInstall to make it easier to deploy clusters that are similar.

Would this work for you?

PS: Any interest in taking this to the github project so we can track the features/issues a little cleaner?

bbs2web · February 18, 2018, 2:56pm

I centralise logging and was receiving SMS messages indicating loss of BGP peers. This was due to me originating syslog messages from the loopback IPs, which would then route out:

/system logging action
set 3 remote=54.119.65.26 src-address=54.79.22.1

I prefer having the standby router exclusively accessible via the acting master; PuTTY’s tunneling features really help with this…

I hear your point about having predictable master/slave IPs, but I currently handle standby router monitoring by getting notified if the HA interface on the acting master is down two checks in a row (we run Zabbix and have automated discovery, which notifies us of any interface that is down after having ever been up). This way I simply need to know that the HA interface is operational, and it will not send notifications if it happens to get checked whilst the standby is rebooting.
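
The "down two checks in a row" rule can be sketched as a small debounce. This is an illustrative Python sketch of the logic only, not Zabbix trigger syntax; the function name and the sample histories are made up.

```python
from typing import Sequence

def should_notify(history: Sequence[bool], required_failures: int = 2) -> bool:
    """`history` holds check results, oldest first; True means the HA interface was up.
    Notify only when the most recent `required_failures` checks were all down."""
    if len(history) < required_failures:
        return False
    return not any(history[-required_failures:])

# One missed check, for example while the standby is rebooting.
print(should_notify([True, True, False]))
# Down on two consecutive checks.
print(should_notify([True, False, False]))
```

A single failed check is ignored, so a standby that is merely rebooting does not raise an alert unless it is still down at the next check.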

I understand your more conservative approach to RouterOS updates. I had:

Upgraded acting master, which switches it to standby mode
Connected to new standby router, upgraded firmware to complete the process and rebooted
Validated configuration via mac telnet
Repeated the steps above on the current master

bbs2web · February 18, 2018, 3:01pm

Perfect, I’ll have some time tomorrow to fiddle with rancid and agree that discussing this on GitHub is probably better. Perhaps I should break up the patch into separate ones, where each one handles a specific point?

nathan1 · February 18, 2018, 3:20pm

Let’s pick it up from here on GitHub.

I have integrated your changes into a test branch for us: https://github.com/svlsResearch/ha-mikrotik/commits/bbs2webtest
Issues created for the exclusions: https://github.com/svlsResearch/ha-mikrotik/issues

Excluded for now:

No rancid escape fix here. If you still want to do this escaping, let’s do it with the generate script. The rancid fix appears to be working OK for me though.
Kept the default gateway for now. I understand your use case though, you don’t want your secondary getting out.
Keeps original /24 addressing until we can sort out the ha-mikrotik upgrade path.

ovidiu · March 6, 2018, 6:54am

No problem not using CCRs, they are definitely expensive for many deployments. I just wanted to let you know that you are the first one that I know of to test alternative platforms, so good for all of us. I would like to hear how well it works for you after you run for a while.

The boot delay sounds like a great solution if you just want one to always become primary when they are both booted nearly simultaneously (i.e. after power recovery). This wouldn’t force A to become primary again after A was primary and then rebooted, but that is a feature I could add if you really wanted it. I think this could work based on a pretty simple change that enables VRRP preemption.

It sounds like you have found a pretty workable solution though. Maybe you run it for a while and then see if you generally find it stable and if you still want this feature after a while of running, I will add it. How does that sound?

Two weeks have passed without any problems; the delayed startup ensures that the desired router becomes the active one.
So this script works fine on smaller routers as well.

millenium7 · August 29, 2018, 11:57pm

Has anyone tested and confirmed that this works exactly as expected on 6.42.x?
We’re running this on a couple of routers in a data center and it seems to work fine. However, I’ve noticed two problems and I don’t know if they are an issue with the later firmware or something going on with the script:

I can’t seem to make either of them a preemptive master. I’ve tried adjusting VRRP priorities, but if I reboot A and then B takes over, A will never be master until B reboots. We would rather have A always be the active master if it’s online.
I noticed the VRRP instance flaps a lot. I currently have B totally disconnected because it was flapping every few hours. We’ve tried changing Ethernet cables and the same problem still happens. This is a big problem because these routers run BGP as well as PPPoE connections, resulting in extended downtime during a changeover. That is fine if we have an actual router failure, but not fine during normal day-to-day operation. There doesn’t appear to be a physical interface issue; I’m not sure if it’s VRRP or the script. Can I just increase the VRRP timers to start with? (Will that break anything in the script or the pair?)

I also have another question regarding firmware updates. Is there any special care that must be taken? I.e. do I need to update both routers at the same time, or can I do one, bring it online, reboot the other so the one with the latest firmware becomes active, check everything is working fine and then update the backup?

nathan1 · August 30, 2018, 12:42am

I have not been able to test it on 6.42.x just yet; you may be the first. It is on my todo list. VRRP should not be flapping at all. Are they directly connected or are you going via a switch? Is there anything interesting in the logs? Were you running 6.38.x before going to 6.42.x? Did you have any of this VRRP flapping before or is this new? What does the CPU load look like? My units are not very heavy on CPU load; I wonder if your timers are slipping because of other loads (BGP? high PPPoE count?).

Regarding preempting, this is by design. Since ha-mikrotik is not stateful, it is rather expensive to keep switching masters (ie: VPN users disconnected 2x), so I made it this way intentionally. Others have asked about preemption but nobody seemed bothered by it enough to warrant it being implemented. See my note below on the VRRP interval on why your change may not have stuck.

As far as firmware upgrades go, I have always done it by upgrading the standby, checking that it looks right and then doing the master, sometimes forcing a sync and then doing another reboot before letting the upgraded unit take over. Since ha-mikrotik is not supported by Mikrotik themselves, it is somewhat of a crapshoot, but I have generally had good success. I have many pairs running this code, so for the upgrade test I generally pick a pair where it won’t be catastrophic if something goes wrong.

For changing the VRRP interval, you would want to edit the ha_startup script on the master (look for line after “ha_startup: 5”) and then sync the standby and then reboot the master after the standby reboots. If you get the timers out of sync, I believe they will ignore each other and both become master. You can’t do this via the VRRP interfaces, as they will be removed and rebuilt on every boot.
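
Before and after such a change, it may help to confirm that both units report the same timers. The snippet below is only a sketch using the standard `/interface vrrp` menu; it reads values and changes nothing, and it is not part of ha-mikrotik. Run it on each router and compare the output.

```
# Print the advertisement interval and priority of every VRRP interface (read-only)
:foreach v in=[/interface vrrp find] do={
    :local n [/interface vrrp get $v name]
    :local i [/interface vrrp get $v interval]
    :local p [/interface vrrp get $v priority]
    :put "$n interval=$i priority=$p"
}
```

If the intervals differ between A and B, that would fit the both-become-master behaviour described above.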

I hope this helps.

hamster · October 3, 2018, 2:44am

I’ve just installed this on two x86, version 6.42.9… So far, so good. Thanks for this!

Quick question, if I may: why is it necessary to reboot the standby router once it receives new configuration?

nathan1 · October 3, 2018, 12:07pm

“/system backup load” is used to keep the general configuration in sync, which requires a reboot.
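
For anyone unfamiliar with that mechanism, here is a minimal sketch of a manual backup and restore on a single router. The file name and password are placeholders made up for illustration, and this is not the exact sequence the ha-mikrotik scripts run.

```
# Hypothetical file name and password, for illustration only
/system backup save name=ha_example password=example
# Restoring replaces the whole running configuration;
# RouterOS asks for confirmation and then reboots to apply it
/system backup load name=ha_example password=example
```

Because a restore swaps out the entire configuration at once, the reboot is part of the load itself rather than something added by the script.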

millenium7 · February 21, 2019, 7:22am

So we have had a hardware failure on one of the routers and this script saved us a lot of downtime
However now comes the time to replace with another router. I have an identical model here

There are no instructions on what to do to bring a new standby router back into the mix (preferably without any downtime). Do I simply install the new backup router, connect the 2 via ether8, then run the ha_init script on the existing router once again and go through the same procedure?
Or is there something else I need to do only on the new backup to bring it in?

Will it know to keep the existing primary config, and not override the primary with the backup?
Can this be done with little to no downtime?

Thanks

nathan1 · February 21, 2019, 10:00am

Correct, basically replace it and connect it physically like the old one. The replacement should be running the same RouterOS and reset-configuration per original docs. You will then $HAInstall like you originally did, changing the MAC of B (or A) and then following the on screen instructions for bootstrapping.

This can be done live and with no downtime; the script should not do anything on the master when it discovers it is already master.

Do you have A or B alive right now? Assuming it is A, you can do something like this and follow the instructions:

$HAInstall interface=$haInterface macA=$haMacMe macB="[NEW MAC FOR B]" password=$haPassword

If it is B:

$HAInstall interface=$haInterface macB=$haMacMe macA="[NEW MAC FOR A]" password=$haPassword

This just pulls the global variables (the current config) for redeployment, you could also just populate them all again with constants like you originally did.

Try this just to see how your variables will populate (it only prints):

:put "interface=$haInterface macA=$haMacA macB=$haMacB macMe=$haMacMe password=$haPassword"

millenium7 · February 26, 2019, 3:20am

Awesome, I’ll give it a go next time I’m at the DC, but I’ll take a backup beforehand. Thanks

raffav · March 2, 2019, 3:32pm

Wow, this project still alive…
Good I never had a chance to put it in production..

But very nice


christopherh · March 18, 2019, 8:10am

Hello All,

I’ve followed the instructions from 1 to 8 on the GitHub page, however before $HAInstall gives me the info to bootstrap the second router, it reboots and kicks me out.

How do I bootstrap the second router?

Thanks,
Christopher H.

EDIT: I worked it out - had to re-run the $HAInstall command to generate the commands to bootstrap the second router.