At 7:40am a remote hEX dropped off the WireGuard tunnel. I was still able to ping the router at its public IP address, and the users there still had Internet access.
I tried pinging both the WG and the LAN private addresses and did not get a response. I was also unable to connect with Winbox, and the local hEX showed that WG handshaking had stopped.
I had someone there (100 miles away) reboot it, and it worked for 40 minutes before the same symptoms returned.
Had them reboot again and it stayed working.
Now, 6 hours later, same exact symptoms at a different site (2 miles from the first remote site).
I can ping the public address and the devices at the site still have Internet access, but I have no access to the hEX.
When I regained access to the first hEX, I opened an SSH session to it and saw startup messages saying that the DHCP-Client had lost its IP address on ether1. Ether1 is the WAN port on the hEX and is connected to the cable internet provider’s modem.
Any suggestions on how to continue troubleshooting this?
So when the HOST router has connection issues (host = the server for the initial handshake), what happens is that the client router will attempt to re-connect with the host router.
This also happens when a dynamic WAN IP changes.
So the Mikrotik client attempts to find the endpoint again.
The problem is if the endpoint address is not available right away. The wireguard client stops trying to connect. One has to literally turn off the wireguard interface at the client and turn it back on again.
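For reference, the manual off/on toggle described above can be done from the client's terminal roughly like this. It is only a sketch: the interface name wireguard1 is a placeholder, so substitute whatever your WireGuard interface is actually called.

```routeros
# "wireguard1" is a placeholder name; replace it with your own interface name.
/interface/wireguard/disable [find name="wireguard1"]
# Short pause before bringing the interface back up (2s is an arbitrary choice).
:delay 2s
/interface/wireguard/enable [find name="wireguard1"]
```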
This also happens if the WireGuard router is too slow to resolve the new DYNDNS name (i.e. the WireGuard connection attempt happens first).
I have asked MT to address this internally on any MT device acting as a client. In other words, if the MT device has keep alive set on a peer, then it should attempt to reconnect to the peer not just once but on a time scale (right away, after 1 min, after 5 min, after 30 min, after 1 hour, after 6 hours, etc., and probably stop at 24 hours).
In any case you can find scripts people have written to overcome this known issue (that sadly MT refuses to deal with and thus eventually hits everyone like yourself like a slap in the face!!)
See Para 6 - https://forum.mikrotik.com/viewtopic.php?t=182340
:foreach i in=[/interface/wireguard/peers/find where disabled=no endpoint-address~"[a-z]\$"] do={
    :local LastHandshake [/interface/wireguard/peers/get $i last-handshake]
    :if (([:tostr $LastHandshake] = "") or ($LastHandshake > [:totime "5m"])) do={
        /interface/wireguard/peers/set $i endpoint-address=[/interface/wireguard/peers/get $i endpoint-address]
    }
}
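A script like this only helps if it runs periodically. One way to do that, as a sketch, is a scheduler entry. This assumes the script above has been saved under /system/script with the name wg-refresh-endpoint; both that name and the scheduler name below are made up for illustration, and the 5m interval is just chosen to match the handshake threshold.

```routeros
# Assumes the script above was saved as "wg-refresh-endpoint" (hypothetical name).
# Runs it shortly after boot and then every 5 minutes.
/system/scheduler/add name="wg-refresh-endpoint-every-5m" start-time=startup interval=5m on-event="/system/script/run wg-refresh-endpoint"
```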
I added a line to make a log entry, but $i is just a number, so I’m trying to figure out how to reference it back to something recognizable like the contents of comment.
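One possible way to do that, as a sketch: $i is the peer's internal ID, so the same get call the script already uses for last-handshake can be used to read other properties of that peer, such as comment or interface. The variable names peerComment and peerIface are made up for this example.

```routeros
:foreach i in=[/interface/wireguard/peers/find where disabled=no] do={
    # Look up human-readable properties of the peer behind the internal ID in $i.
    :local peerComment [/interface/wireguard/peers/get $i comment]
    :local peerIface [/interface/wireguard/peers/get $i interface]
    :log info "WG peer: comment=$peerComment interface=$peerIface"
}
```

This only gives something recognizable if each peer actually has a comment set; otherwise the comment part of the log entry will be empty.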
I added 2 lines to create a log entry for when the endpoint address is re-entered because the last handshake was more than 5 minutes ago:
:foreach i in=[/interface/wireguard/peers/find where disabled=no endpoint-address~"[a-z]\$"] do={
:local LastHandshake [/interface/wireguard/peers/get $i last-handshake]
# Added this:
:local endpoint [/interface/wireguard/peers/get $i endpoint-address]
:if (([:tostr $LastHandshake] = "") or ($LastHandshake > [:totime "5m"])) do={
/interface/wireguard/peers/set $i endpoint-address=[/interface/wireguard/peers/get $i endpoint-address]
# Added this:
:log info "WG-iface-restart script found WG peer with last handshake greater than 5 minutes; then reset the endpoint-address to reload dns of endpoint: $endpoint"
}