Script to start stalled WireGuard interfaces after reboot

I don’t need it myself, but I often read that WG interfaces do not become active when a domain name is used as the endpoint of the WireGuard connection.

The code explains itself. You can define the number of retries and the waiting time between retries; together they give the total retry time.

On completion, it states the result and also whether domain resolving was available; without it, the endpoints cannot be resolved. You will also get that result if the entered domain is invalid.
It only checks, and restarts if needed, enabled WireGuard interfaces; whether the profile in peers is active is not checked. I also replaced /interface/ with /in/ to keep the lines of code shorter.
I reused and slightly altered a bit of code published here by Anav to fit this script.

I have not tested all outcomes of the script, so if you run into problems, let me know in this thread. Suggestions are welcome too.

{; # BeginOfScript

# scripted by msatter
# function: bring up stalled WireGuard interfaces after restart of the router

:local timesRetried   15;    # how many times WireGuard is tried to be restarted
:local loopDelay      "10s"; # (loopDelay * timesRetried ) = total timeout
:local restarted      true;  # set default to true
:local domainResolved false  # also checking if the endpoint domain-names could be resolved
:local retried        0;     # set to starting value 

:while ( $restarted  && ( $retried < $timesRetried ) )  do={ # loop until all WireGuard interfaces are working or the maximum number of retries is reached
:set $restarted false
:foreach wg in=[/interface/wireguard/find disabled=no] do={ 
 :local peer [/interface/wireguard/get $wg name]
# scripted by Anav looking for domain names. Adapted by msatter. ( /in/ = /interface/ )
  :foreach i in=[/in/wireguard/peers/find interface=$peer endpoint-address~"[a-z]\$"] do={
   :if ([:resolve [/interface/wireguard/peers/get $i value-name=endpoint-address]]) do={ :set $domainResolved true
    :local lastHandshake [/in/wireguard/peers/get $i last-handshake]
    :if (([:tostr $lastHandshake] = "") || ( $lastHandshake > [:totime [/in/wireguard/peers/get $i persistent-keepalive]])) do={
      /in/wireguard/ disable $peer; :delay 1s; /in/wireguard/ enable  $peer;	# restarting the WireGuard connection
      :set $restarted true
    }; # EndIf
   }; # EndIf
  }; # EndForeach
}; # EndForeach
:if (restarted) do={
 :put "Check loop: $retried"
 :set $retried ($retried + 1) 
 :put "Checking loop: $retried"
 :delay $loopDelay; # waiting time till following check
}; #EndIf
}; # EndWhile
:if ( !$domainResolved ) do={:put "One or more domains could not be resolved, all/some domain based endpoints could not be brought up in the set time of ($timesRetried * $loopDelay)"} else={
 :if ( $restarted  && ( $retried >= $timesRetried ) ) do={:put "Not all WireGuard interfaces could be brought up in the set time of ($timesRetried * $loopDelay)"}
 :if ( !$restarted && ( $retried > 0 ) ) do={:put "No WireGuard interfaces are down, after $retried retries"}
 :if ( $retried = 0 ) do={:put "No WireGuard interfaces had to be restarted"}
}; # EndElse
}; #EndOfScript

Nice script. I have something similar.
But I have to say, something seems very wrong if scripts are basically mandatory for a WireGuard tunnel to be reliable and persistent. Resiliency should go without saying, so there’s got to be a better way. If MikroTik doesn’t want to include something like Tailscale, then they should probably do something to improve reliability or develop their own helper service: an optional failover UDP port for each WG interface and its peers maybe, or some kind of backup TCP tunnel that can negotiate working UDP ports / hole punching / whatever is needed to establish a reliable endpoint UDP tunnel, similar to what a solution like Tailscale does. Otherwise this WireGuard is not fit for anything enterprise/mission-critical and should basically be used just as a proof of concept, unless accompanied by intricate scripts and condition handling.

This script addresses only one aspect: domains not being resolved just after startup.

This kind of script is something MT could have done to improve the user experience. But with v7 in development, they have their hands full with other stuff.

Personally I am very happy with WG and only have a Netwatch running to control firewall lines to match the number of concurrent VPN tunnels.

I’m sure they will fix it eventually, because it’s closer to a bug than to anything else.

Something more like hole punching is unlikely, at least until there’s some interoperable standard that all WG clients are able to use.

If not already done, someone ought to send a change request to MT regarding this (although this “someone” seems to be quite a mysterious fellow I’ve always wanted to meet ;-) ).

Not being resolved can have so many causes that no single solution will do.

Every service should be designed so when depending on DN…

…why not save the DNS cache somehow on restart and reload the stored cache; then, as soon as DNS resolving is back, refresh the cache.

There should be a new timeout set on the restored cache so that a broken system won’t keep working.

If you somehow manage to limit the number of recalled entries to services that depend on DNS, you could give those refuge in the static cache, because that survives a reboot/restart.

They should have a special mark so that they get erased as soon as DNS resolving is back. In the meantime they can be used to resolve by any instance for a set time… their original TTL?

Then you will have a central location, and erasing the static ones should start as early as possible and at the latest when the TTL expires.

Then if RouterOS writes each resolve for a service depending on DNS to static DNS, it could even survive an unplanned reboot. To avoid wear of the flash memory, only changed (delta) records are written, and this also avoids touching the static entries used in those services.

Those static entries are not marked as ‘emergency’ static DNS entries, so the routine should ignore them.

Because flash memory is used, the user has to activate this ‘emergency’ routine… maybe even separately for each service. If none of the services depend on DNS, then no delta resolves need to be stored, and so there is no wear on the flash memory.
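RouterOS has no such built-in routine, so the following is only a rough script sketch of the delta-writing part of the idea. The marker comment, the host name and the upstream server are assumptions made up for the example, and it only handles an IPv4 address.

```
# Sketch only: marker, host and upstream server are hypothetical values
:local marker   "emergency-dns"
:local host     "vpn.example.com"
:local upstream 1.1.1.1

:do {
    # ask the upstream server directly, so an existing static entry is not returned
    :local ip  [:resolve domain-name=$host server=$upstream]
    :local old [/ip/dns/static/find name=$host comment=$marker]
    :if ([:len $old] = 0) do={
        # first successful resolve: store it as a marked static entry
        /ip/dns/static/add name=$host address=$ip comment=$marker
    } else={
        # write only when the address changed (delta), to limit flash wear
        :if ([/ip/dns/static/get ($old->0) address] != $ip) do={
            /ip/dns/static/set $old address=$ip
        }
    }
} on-error={
    :put "Resolving $host failed, keeping the stored entry (if any)"
}
```

Only entries carrying the marker comment are touched, so ordinary static entries stay as they are. Note that while the marked entry exists, the router answers for that name from the static entry.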

Update: I have just suggested this to MikroTik, so they can have a look at whether this is a way to handle it.

The script is ready. I am thinking about making a test environment to test it, doing reboots and emulating crashes.

It works very well and is more or less foolproof. The script runs, if wished/needed, after startup and can be stopped by removing its config line stored in a global. It then removes all the entries it made until then and stops itself.

There are a lot of :do on-error in there to reduce the amount of code, and it even allowed a tri-state variable (true/false/either) to be used.

“Revamped” for readability, some “errors” fixed, list on new post

{; # BeginOfScript

# scripted by msatter
# function: bring up stalled WireGuard interfaces after restart of the router

:local timesRetried   15    ; # how many times WireGuard is tried to be restarted
:local loopDelay      10s   ; # (loopDelay * timesRetried ) = total timeout
:local restarted      true  ; # set default to true
:local domainResolved false ; # also checking if the endpoint domain-names could be resolved
:local retried        0     ; # set to starting value 

:while ( $restarted && ($retried < $timesRetried) ) do={
    # loop until all WireGuard interfaces are working or the maximum number of retries is reached
    :set $restarted false
    /interface/wireguard
    :foreach wg in=[find disabled=no] do={ 
        :local peer [get $wg name]
        # scripted by Anav looking for domain names. Adapted by msatter.
        :foreach i in=[peers/find interface=$peer endpoint-address~"[a-z]\$"] do={
            :if ( [:resolve [peers/get $i value-name=endpoint-address]] ) do={
                :set $domainResolved true
                :local lastHandshake [peers/get $i last-handshake]
                :if ( ([:tostr $lastHandshake] = "") || ($lastHandshake > [:totime [peers/get $i persistent-keepalive]]) ) do={
                    disable $peer; :delay 1s; enable $peer ; # restarting the WireGuard connection
                    :set $restarted true
                }; # EndIf
            }; # EndIf
        }; # EndForeach
    }; # EndForeach
    :if ( $restarted ) do={
        :put "Check loop: $retried"
        :set $retried ($retried + 1) 
        :put "Checking loop: $retried"
        :delay $loopDelay ; # waiting time till following check
    }; #EndIf
}; # EndWhile
:if ( !$domainResolved ) do={
    :put "One or more domains could not be resolved, all/some domain based endpoints could not be brought up in the set time of ($timesRetried * $loopDelay)"
} else={
    :if ( $restarted && ($retried >= $timesRetried) ) do={:put "Not all WireGuard interfaces could be brought up in the set time of ($timesRetried * $loopDelay)"}
    :if ( !$restarted && ($retried > 0) ) do={:put "No WireGuard interfaces are down, after $retried retries"}
    :if ( $retried = 0 ) do={:put "No WireGuard interfaces had to be restarted"}
}; # EndElse

}; #EndOfScript

“Errors” found in the original in the OP:

from string
:local loopDelay "10s" […]
to seconds
:local loopDelay 10s […]

from
:local domainResolved false […]
added missing “;”
:local domainResolved false; […]

in scripts it is better to use the full path to prevent future problems…
all occurrences of
/in/wireguard
to
/interface/wireguard

from
:if (restarted) do={
added the missing “$”
:if ($restarted) do={

I have not seen anything else.
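Possibly also worth guarding, although this is not tested here: as far as I know, :resolve raises an error when the lookup fails, which would stop the script instead of leaving domainResolved at false. A minimal sketch of wrapping the call in :do on-error; the host name and the resolvedOk variable are placeholders, not part of the script above.

```
# Sketch only: endpoint and resolvedOk are hypothetical names
:local endpoint   "vpn.example.com"
:local resolvedOk false

:do {
    # a failed lookup jumps to on-error instead of aborting the script
    :local ip [:resolve $endpoint]
    :set resolvedOk true
} on-error={
    :put "Could not resolve $endpoint yet"
}

:if ($resolvedOk) do={ :put "$endpoint resolves, the handshake check can follow" }
```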

Thanks for the share.

Bug off! Quoting Anav here.

Bugoff? No, I use Deet for bugs! :)
Thanks for the amendments rextended, much more readable now, not that I understand any of it LOL.

@msatter:
question / personal comment:
why make this so complex?
The resolving problem is usually on the “client” peer, and just as usually there is only 1 peer on 1 WG interface to be found there?
So why search all possible peers on all possible interfaces?

Just a personal remark from my side… maybe I missed some other reason for using this ?

Which does not take away that it’s still a very nicely crafted script, hats off to you for that!

I have more than one VPN, and I am even mixing WG and IPsec, with multiples of each. If it works with many, then it can certainly handle just one connection.

The script I am working on was already backing up IPv6 entries, but then I discovered that the IPv6 address-list in RouterOS has no creation date/time for each entry, and that mixes up the order of the list. Handling that would need more code that I have not written. So I have to strip the IPv6 code back out of it and rerun the tests.

Keeping the code simple and at the same time robust takes time and a lot of testing.

It uses a config/startup line written to a Global, and as soon as you remove that Global line, the endlessly looping script will stop itself and clean up after itself. Just the absence of that Global line is the trigger; the values contained in that line are only read once at start and not again inside the loop.
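A minimal sketch of that stop mechanism, not the actual script: the global name surviveConfig, its content and the 30s interval are made up for the example.

```
# Sketch only: the global name and its content are hypothetical
:global surviveConfig "interval=30s"

# the value is read once, before entering the loop
:local interval 30s

# keep looping for as long as the global still exists in the environment
:while ([:len [/system/script/environment/find name="surviveConfig"]] > 0) do={
    # ... the periodic work would go here ...
    :delay $interval
}

# the global line is gone: clean up and stop
:put "surviveConfig was removed, cleaning up and stopping"
```

With these names, the loop would be stopped from a terminal with `/system/script/environment/remove [find name="surviveConfig"]`.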

Previously resolved DNS entries will always be available on startup, and also when the router crashes and reboots itself back to a running state.

Update: I wrote a bit of script that puts the needed address-list into an array. That can now also be used for IPv6, overcoming the ordering problem in that address-list.
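How the published script does this is not shown here, so the following is only a guess at the general shape. The list name "survive" is an assumption.

```
# Sketch only: the list name is hypothetical
:local listName "survive"
:local saved    [:toarray ""]

# copy the addresses into an array, in the order in which find returns them
:foreach id in=[/ipv6/firewall/address-list/find list=$listName] do={
    :set saved ($saved, [/ipv6/firewall/address-list/get $id address])
}

:put ("Collected " . [:len $saved] . " entries from list " . $listName)
```

Once the entries are in an array, the script can work with its own ordering instead of relying on the order of the address-list.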

Update 2: that helped, and with only a few extra lines IPv6 is now also saved and restored on reboot/crash. Some extra checks remain before I can publish it, I think in two or three days.

I have published the script I have been working on for some time now. It is still a test version.

Link: http://forum.mikrotik.com/t/survive-script-to-restore-resolved-domains-on-restart-or-crash/157749/1

Hi smatter,
Good approach, but it would be nice if it was in bite-size chunks that even a moron like me could stand a chance of understanding.
ScriptS


Module 1 → Check for IP address (ping check)
Module 2 → Check Netwatch status, maybe…
Module 3 → Toggle script
Module 4 → Check DNS is working
Module 5 → Check NTP is up
Module 6 → Add time delay (where necessary)

I think you get the idea. Also agree with hoelve, make one example which suffices for Wireguard and which is good enough!!
I do like the thorough approach that addresses
a. lag due to dyndns URL not being resolved quickly
b. issues with DNS in general
c. issues with NTP
d. potential outage of endpoint for longer than anticipated (let’s say 2 minutes vice 30 seconds)…
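To give an idea of what such bite-size checks could look like, here is a rough sketch covering modules 1, 4, 5 and 6. The addresses, the host name and the 30s delay are placeholders, and the NTP check assumes the v7 NTP client exposes a status value of "synchronized".

```
# Sketch only: addresses, names and delays are hypothetical

# Module 1: ping check, /ping returns the number of replies received
:local pingOk ([/ping 1.1.1.1 count=3] > 0)

# Module 4: DNS check, a failed lookup is caught by on-error
:local dnsOk false
:do { :local ip [:resolve "vpn.example.com"]; :set dnsOk true } on-error={}

# Module 5: NTP check (assumes the RouterOS v7 NTP client status property)
:local ntpOk ([/system/ntp/client/get status] = "synchronized")

:if ($pingOk && $dnsOk && $ntpOk) do={
    :put "All checks passed"
} else={
    # Module 6: wait a bit before the next attempt
    :put "ping=$pingOk dns=$dnsOk ntp=$ntpOk, waiting before the next attempt"
    :delay 30s
}
```

Each check sets its own flag, so the checks could be split into separate scripts or reordered without touching the others.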