WireGuard stops handshaking all of a sudden - Changing the port (only) fixes it for weeks

I am having an issue where WireGuard just stops handshaking. Even though I can see packets arriving at the WireGuard “server”, it won’t handshake.
It is weird because it may work fine for weeks and then suddenly stop exchanging data. All I do to “solve” it temporarily is change the port, and it is immediately back for weeks again.
What could I possibly check to prevent it from happening?

It is still doing it. It has happened something like 5 times since my initial post.
It randomly stops communicating and all I have to do is change the port.
To make the switch faster, I have made a NAT rule that receives packets on 23231 and translates them to 13231, so the only thing I have to change is the peer port, from 13231 to 23231 (the rule is shown below).
It works for something like 10 days and then stops. I then change the port back to 13231 and it starts again. After a few days, when it stops again, I change it to 23231 and it works again.
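The rule on the “server” is roughly the following (a sketch from memory; the comment and exact placement in the dstnat chain may differ):

/ip firewall nat
add chain=dstnat protocol=udp dst-port=23231 action=dst-nat to-ports=13231 comment="alternate WireGuard port, translated to the real listen-port 13231"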
Any ideas?
I don’t want to go back to SSTP/L2TP, which I have already done for some installations!

Not enough information.
no config,
no network diagram
no understanding of what is at the two ends of the wireguard connection
etc
etc

The network config is huge.
Let’s say that the WG “server” is in the datacenter with a static IP and the client is behind CGNAT.
Everything works fine up until some point where I start getting “Handshake for peer did not complete after 5 seconds, retrying (try 2)”.
The server is listening on port 13108 and works fine until the error appears, after which it won’t handshake any more.
I can see packets arriving at that port from the correct source IP, but I still get this error. What I then do: I have a dst-nat from UDP port 23108 to destination port 13108.
The moment I change the peer on the client and set port 23108, it starts working fine… up until, days later, it starts doing the same thing on port 23108.
Then all I do is change the port on the client’s peer back to 13108 and immediately it starts handshaking/working fine. Until the customer calls me again to tell me that his VPN is not working :frowning:
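(By “seeing packets arriving” I mean watching the incoming handshake traffic on the server with a quick sniffer, something like the line below; the WAN interface name is just an example.)

/tool sniffer quick interface=ether1 port=13108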
It is not happening with only one customer. I have this in probably 10 different setups. Some of them I have moved back to L2TP because it has been breaking my… a lot.
One of the things I have tried is disabling and re-enabling both the peers and the WG interfaces and clearing all network connections on both ends, but it still didn’t solve it.
Other than that, bloody WG works like a charm!

Now, on one setup where I had the issue, I managed to make it work again by disabling the peer on the client WG and enabling it again, without changing anything else.
It just started working.
Other times I have even tried restarting the client MT and it didn’t help. This time I disabled/enabled the peer and it started handshaking again.

What version of RoS is on the router?
Does it resemble this error…
http://forum.mikrotik.com/t/weird-wireguard-issue/176028/1

7.14.3

Are you completely sure that the network between both WG peers is as transparent as you’d want it to be (i.e. the only thing playing games with packets is the CGNAT on the “client” side)? I can imagine a network operator dropping any long-lasting connections for some reason (1. just because they can; 2. these don’t seem like “legit” connections, so let’s save our users from being extensively hacked; 3. because some government agency tells them to do so; etc.)

There was an ISP in my country which provided (and still does) internet via PPPoE, and it used to drop the PPPoE connection every night just to force the customer to get another IP address. Because a static IP address is either a) a feature of the business internet service or b) provided to residential customers for a surcharge …

I have had my router-to-router WireGuard connection stop working.

The simple fix was to send pings across the link every so often. It hasn’t dropped in months.

??? That is called persistent keep alive ???

In all setups I do have persistent keepalive set to 25" (seconds) on the clients.
I only skip it on the “server” end that has the external IP in the datacenter. Even in that case, the “client” that connects to it has the keepalive set to 25".
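On the client peer that looks roughly like this (a sketch; the interface name, key and endpoint are placeholders):

/interface wireguard peers
add interface=wg-to-dc public-key="<server-public-key>" endpoint-address=198.51.100.10 \
    endpoint-port=13231 allowed-address=0.0.0.0/0 persistent-keepalive=25s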

I have persistent keepalive left at the default.

But here is the netwatch I added months ago.

/tool netwatch
add disabled=no down-script=":log info \"!!!Warning VPN Down!!!\"" host=\
    172.16.33.1 http-codes="" interval=1m packet-count=5 test-script="" \
    thr-loss-count=5 type=icmp up-script=":log info \"VPN to Home Up\""

Gotsprings, is that on a Router (client peer for handshake)??

WireGuard from one hAP ac2 to another hAP ac2 that is behind carrier-grade NAT.

I VPN to the office… The route lets me jump across the VPN to the other branch.

Why would you have the MT router (server for the peer) be monitoring the client peer behind CGNAT???

I have the same issue in multiple locations: every 1-2 days the handshake stops. Disabling/enabling the interface doesn’t help; the only thing that helps is changing to any random port (client side), and it immediately starts working. I can even go back to the very same port as before and it will still work fine.
So I have a script which checks the last handshake and runs /interface/wireguard set 0 listen-port=0, which sets a random port and fixes the tunnel.
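The check is roughly the following (a sketch; the interface name, the single-peer assumption and the 3-minute threshold are just examples):

# run from /system scheduler every few minutes; assumes one peer on the interface
:local wgIf "wg1"
:local peerId [/interface wireguard peers find interface=$wgIf]
:local lastHs [/interface wireguard peers get $peerId last-handshake]
:local stale true
:if ([:typeof $lastHs] = "time") do={
    :if ($lastHs < 00:03:00) do={ :set stale false }
}
:if ($stale) do={
    # listen-port=0 makes the router pick a new random source port (and a fresh NAT binding)
    /interface wireguard set [find name=$wgIf] listen-port=0
    :log warning "$wgIf: stale handshake, listen-port reset to a random port"
}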

And I know this is an ISP issue because I have it ONLY with that specific ISP. We even changed the ISP in some cases, without changing anything on the router, and the issue disappeared.

I also thought at first that it was the ISP, but I checked the incoming traffic to the “server” WireGuard and I could see traffic on that port, and if I disabled the client’s WireGuard it would stop.
Unless the ISP modifies something in the packets while routing them to the destination, which would be really weird.

But indeed I am facing this exact issue where, unless I change the port, it won’t handshake. Except 1 in 10 times (or fewer) when restarting the peer “fixes” it.

Confirmed. I have a MikroTik with 150 peers and every day one (or more than one) of the peers loses connectivity, and the log only says “Handshake for peer did not complete after 5 seconds, retrying (try 2)” endlessly.

I have searched for the cause of the problem without success. I only know that it does not depend on the ISP, and I think it does not depend on the server either, since the other peers remain connected without problems.

It is only solved by restarting the peer device, or by assigning a second port to the same interface on the server and using this new port on the peer; if it loses connectivity again, I return to the primary port.

MK guys, please review this problem…

Just been investigating a similar issue.

Central RB5009 with multiple WG peers; on an AC3 250 km away, the WG connection dropped dead some days ago.
All other peers on that RB5009 remain functional.
Both RB5009 and AC3 are on 7.15.1.

That WG connection is only for management purposes; there have been no signals from users that something is wrong at that site (or I would have known earlier).

When going to that AC3 (there is another tunnel from that device via IPsec to Azure, so from there I can get back in as well), there was no way to revive the WG connection EXCEPT removing the listen port in the interface settings (on a remote peer??) and setting it back as it was (which in itself should be pointless, since it’s an outgoing connection in that setup).
And then traffic starts flowing again …
A restart probably would have done it as well, but it’s a rather busy device, so I can’t do that during business hours.

Did anyone ever create a support ticket on this?
I am going to set up extra monitoring and alerting so I get informed sooner about this issue on that device.

The network at the office is INSIDE another company’s network, so I can’t do anything with it.

The owner’s home has a public IP, so I set up a router with WireGuard there. It’s the "server". The router at the office is a client; it calls the router at home. And the router at HOME is the one sending the pings to "keep the connection alive."

VPN to the house… can reach the office.

If he is home… he has access to the office resources by IP address.