To resolve this issue completely, RouterOS would have to modify the Wireguard behavior to work around third-party issues, causing headaches for unaffected people, because the necessary behavior would not seem logical to them.
If you read the thread above, what happens is that the Wireguard peer keeps refreshing a pinhole in a UDP firewall that was created during some non-standard state of the network, preventing the traffic from reaching the remote peer, or preventing the traffic from the remote peer from reaching the local one. The solution is to create a new pinhole once the network state is correct again; for that, it is necessary either to change one of the ports or to let the incorrectly learned pinhole time out by squelching the traffic for a sufficiently long time.
The newer versions have done a good job here without changing the behavior too radically: you can now set a local peer on a public address as “passive” whenever the remote peer is behind a NAT. Such a local peer is unable to initiate a connection anyway and must wait until the remote one drills a pinhole in its NAT, so there is no point in it repeatedly trying to connect to an endpoint address that does not respond. This setting alone resolves the issues caused by an incorrect pinhole on the side of the “passive” peer - once it stops receiving handshake packets from the remote peer, it falls silent, the incorrect pinhole dies off, and the traffic from the remote peer can get through again. But this only covers part of the possible scenarios.
So if you want to find the best solution for your particular case, you first have to analyze it, to find out at which device (or devices) on the path between your Wireguard peers the incorrect pinhole appears. The best solution is to prevent it from happening at all; if that is not possible for any reason, you have to choose between using a script that squelches the right peer (or even both of them) long enough for the incorrect pinhole to die off, and using a script that changes the port at one of the peers while the other runs in the “passive” mode so that it accommodates the change. In either case, the script needs some time to detect the outage, so whenever the external event that causes the creation of the incorrect pinhole happens, your tunnel will stop transporting data at least for the duration of the detection.
I don’t think the newer versions are the reason it is getting worse; instead, I’d suspect a change in the ISP network to be responsible.
Sindy, first of all, thanks so much for your insightful and thorough replies. Second, at this point there’s no doubt in my mind that you’re completely correct about everything you posted. This is a combo of: a network glitch crashes or skips a handshake; the client notices and tries to reconnect from scratch; the server-side firewall keeps track of UDP “connections” with some sort of timer (it has to, there’s no other way to do it), and since it keeps receiving traffic from the same IP and port, it keeps the “connection” “up”. The Wireguard “server” then doesn’t handshake, et voilà, the whole show ends with everyone booing.
But I have to disagree on the part that this isn’t fixable. Let me explain why.
I am and always have been a member of the SIP, IPsec and L2TP crowds. I run or have run several of these protocols and I’ve never really had this issue before. I never ran bare L2TP, though I did run it on top of IPsec (so as to have an interface to work with). Currently I do IPsec with IKEv2 at some clients’ and I port-forward a SIP server for roaming cell phones at another shop, and while we did see hung connections on occasion, they always cleared within minutes. So I always attributed this to temporary network issues rather than to RouterOS or other implementation bugs. The client I’m experiencing the most issues with (the one with the really bad link) was on IPsec with L2TP on top before. It worked. Ish. A lot better than on Wireguard, at least. So if no other UDP-based protocol hangs as epically as Wireguard, then Wireguard does something different with its traffic flow in RouterOS. Be it in the firewall, in the implementation interface, CPU, libraries, whatever.
Further, a very basic piece of functionality of the firewall facility in RouterOS is mangle. And an awesome one at that. But if RouterOS is capable of tagging all sorts of traffic, it sure as hell is capable of tagging Wireguard’s stateless UDP “connections” in a way that sets them apart from other UDP traffic. In this way they can be (and I suspect already are) treated differently from all other UDP traffic. So the argument “it can’t be fixed because this is how all UDP traffic gets treated, and fixing it means breaking things elsewhere” doesn’t really fly with me. And while I realize that this isn’t a high-priority problem, and they have limited resources and much bigger fish to fry, this has to be fixable. We basically already did all the legwork. It might just come down to something like “set wireguard connection timeout to=30s”, down from 5m.
PS: yeah I am naive and hopeful like that, thanks for nothing. =) No seriously, thanks again Sindy and Anav for your time, support and patience.
PS2: if I find the time (big IF, actually. I know I said I wouldn’t negotiate with terrorists but… everything is a negotiation with these two. 3 and 6. The little one is still in kindergarten, for heaven’s sake, and this is supposed to get worse???), I’ll try to figure out how long it takes RouterOS to clear the dead “connection” from the firewall. This would make scripting a little easier.
I’m afraid that indeed any tracked UDP connection behaves the same if hammered from either side. It is a matter of the application whether it stops hammering for long enough to let the connection time out.
With RouterOS, you can not only figure that out but even affect it, by setting the udp-stream-timeout value under /ip/firewall/connection/tracking (for connections that have already got a response) and/or the udp-timeout value (for connections that haven’t been responded to yet). The “burnt-in” default value for the latter is 30s, but in the few latest RouterOS versions I’ve noticed it was set to 10s, so it may well be that this is how Mikrotik attempts to address this issue - the retry period of the IPsec initiator is slightly more than 10s, and I have just checked that with udp-timeout set to 10s, each new IPsec connection attempt creates a new tracked connection, because the orig-packets counter is always 1.
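For reference, here is how those settings can be inspected and changed from the CLI (the values shown are just examples, not recommendations):

```routeros
# show the current connection tracking settings, including the UDP timeouts
/ip firewall connection tracking print

# timeout for tracked UDP connections that have not seen a reply yet
/ip firewall connection tracking set udp-timeout=10s

# timeout for tracked UDP connections that have already seen a reply
/ip firewall connection tracking set udp-stream-timeout=3m
```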
Wireguard’s behavior is more complicated: it resends the first handshake packet every 5-6 seconds; if there is no payload traffic and persistent-keepalive is set to more than 30s, it takes a 30-second break now and then, but if payload traffic is initiated at the Wireguard initiator’s side, the handshake attempts are continuous.
There is a slightly crazy way to overcome this: use a drop rule in the output chain of /ip firewall/raw that drops (using the nth match condition) two out of three 176-byte packets from the local Wireguard port if their destination address is not on a “responding” address list, which is populated with the addresses of remote peers by a rule in raw->prerouting that matches on the local Wireguard port as destination.
However, the same issue (a “wrong” tracked UDP connection getting updated too frequently) may occur on some other device on the path between the initiator (client) and the responder (server), and you have no power over the udp-timeout setting on such a device. So you would have to use the above crazy way with the nth value adjusted to the udp-timeout of the external firewall.
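A sketch of what such a rule set could look like - the port 13231, the list name “responding”, and the timeouts are assumptions, and the nth values would need tuning to the external firewall’s udp-timeout:

```routeros
# learn which remote peers actually respond: any packet arriving at the local
# Wireguard port adds its source address to the "responding" list
/ip firewall raw add chain=prerouting protocol=udp dst-port=13231 \
    action=add-src-to-address-list address-list=responding \
    address-list-timeout=3m

# let one out of every three 176-byte packets (handshake initiations,
# 148 bytes of payload plus IP and UDP headers) towards non-responding
# peers through...
/ip firewall raw add chain=output protocol=udp src-port=13231 \
    packet-size=176 dst-address-list=!responding nth=3,1 action=accept

# ...and drop the remaining two out of three, so the external pinhole
# stops being refreshed and can time out
/ip firewall raw add chain=output protocol=udp src-port=13231 \
    packet-size=176 dst-address-list=!responding action=drop
```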
Wow. Don’t want to repeat myself but after that reply I feel like carbon copying the first sentence of my last reply.
It’s Sunday and the Mrs. is already pissed that I’m hanging out in keyboard vicinity, so I’ll look into the firewall UDP timeout settings tomorrow. But that we have control over that is just awesome!
Regardless, I now understand what you mean about all UDP traffic. It makes sense that Wireguard hangs due to continuous new handshake attempts when traffic or even watchdog pings are sent on a shorter interval than the udp stream timeout value. SIP connections certainly behave differently when no one’s talking and a failed IKE would make the IPSec connection start over regardless of firewall state.
As far as “solving it” by killing handshakes on the initiator side firewall, it certainly is an idea. Nothing crazy about it. I’ll see if I can run that.
If the mikrotik is the client peer for the handshake, then change the listening port on the interface, as this should clear up the issue.
Don’t laugh, but here is a script that will do just that…
It should be paired with a check that verifies whether an address on the remote server-peer router is reachable (use a gateway IP, for example).
If the remote end is a third-party device, use a DNS address like 9.9.9.9.
Note: replace wireguard1 with the name of your wireguard interface, and the listen-port with whatever you wish to override the default port with…
The script checks the connection every 30 seconds; it will increment the listening port up to 9 times and then fall back to the original port.
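A minimal sketch of such a script, in case it helps - the interface name wireguard1, the base port 13231, and the probe address 10.0.0.1 (an address reachable only through the tunnel) are assumptions:

```routeros
# Sketch only: wireguard1, 13231 and 10.0.0.1 are placeholder assumptions.
# Run this from /system scheduler with interval=30s.
:global wgPortOffset
:if ([:typeof $wgPortOffset] = "nothing") do={ :set wgPortOffset 0 }
:local basePort 13231
# probe an address at the far end of the tunnel (e.g. the remote gateway IP)
:if ([/ping 10.0.0.1 count=3] = 0) do={
    # tunnel looks dead: step to the next offset (0..9), wrapping back
    # to the original port after the 9th increment
    :set wgPortOffset ($wgPortOffset + 1)
    :if ($wgPortOffset > 9) do={ :set wgPortOffset 0 }
    /interface wireguard set [find name="wireguard1"] listen-port=($basePort + $wgPortOffset)
}
```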
++++++++++++++++
I would prefer not to have random port selection, as there is always the chance of duplicating a port already in use somewhere on the router… or something fairly common: 22, 80, 443, etc. But glad to hear this works!!
But the port doesn’t matter: you can have a web server running on 443 with the port open for it, and if you set the wireguard peer to 443 it will work fine; it’s not as if you were changing the open port on the wireguard server side.
This is only true because TCP 443 (where the web server listens) and UDP 443 (where you let any UDP service listen) are actually unrelated. But there are other UDP ports where RouterOS is listening by default (like 53 for DNS, 1701 for L2TP, 500 and 4500 for IPsec) which you cannot simultaneously use for Wireguard - only a single process can listen on a single port on a single address, and many services cannot be configured to listen only on a particular local-address. So the question is whether setting the Wireguard listen-port to 0 makes it automatically bind to a random unused port or just to a random one without checking, and if it happens to be occupied, the binding of Wireguard fails. There is no way to know without a statement from Riga.
But what does actually happen if the “randomizer” comes out with (say) port 53 (that is already in use by DNS)?
This can be tested manually by intentionally setting the Wireguard port to 53.
Either:
Wireguard will take possession of the port
or:
the pre-existing DNS wins
If the latter, then it is maybe a case for a more complex script that, after having assigned the random port by sending the 0, checks that the result is not in a list of ports already in use, and if it is, runs itself again.
Since the number of UDP ports in use should be very small compared to the available ports, a second run should be a rare occasion and a third run nearly impossible.
Concur with Sindy; an admin’s job is not about random results, LOL. I think most people would simply like certainty and KISS, and setting the initial wireguard listening port (we are talking about the client peer for the handshake, so it can be anything) to a fixed number is not going to upset anyone.
What is cool is learning the random function one can access in scripting. Might be useful in some security aspect…
I have several sites interconnected with Wireguard. Each site has an ISP-owned modem/router and behind that my Mikrotik router, which runs the Wireguard service; in other words, each Mikrotik device that runs WG is behind NAT. I had the same issue as you guys for more than a year and always had to restart WG manually each time, while slowly narrowing the solution down to the port change described in the posts above. Very happy that I found this, and I wanted to say thank you to @anav and @ivicask for steering me in the right direction. Finally I have solved the problem and would like to pay the community back for the great ideas with my solution, in case it is helpful to future WG troubleshooters… Here is what I did.
Set up email service on my Mikrotik router. You can find guides how to do that.
Created a new netwatch entry for each WG interface I want to monitor - only those WG connections where this Mikrotik router is the initiator of the connection; in other words, this is only needed on the WG client side:
named it “netwatch-xxxxx” based on where it connects to (xxxxx is identifying the other site where i am connecting to)
host is set for the IP at the other end of the WG tunnel
interval 10s
up script: /tool e-mail send to=myemailaddress subject="WireGuard to XXXXX is up" body="WireGuard to XXXXX is up at $[/system clock get time]."
down script: /tool e-mail send to=myemailaddress subject="WireGuard to XXXXX is down" body="WireGuard to XXXXX is down at $[/system clock get time]."
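For reference, the netwatch entry described in the bullets above can also be created from the CLI; 10.20.0.1 is a placeholder for the address at the far end of the tunnel:

```routeros
# placeholder host address; replace with the remote end of your WG tunnel
/tool netwatch add name="netwatch-xxxxx" host=10.20.0.1 interval=10s
# then fill the up-script and down-script properties of this entry with the
# /tool e-mail commands shown above (e.g. via WinBox, to avoid quote escaping)
```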
Create a system script to reset the listen-port; this one handles 2 WG interfaces, but you can add more if needed. The name of the script is "wg-reset-if-netwatch-down". Replace xxxxx, yyyyy, XXXXX and YYYYY with your choice of identifiers for the sites you connect to, and myemailaddress with your email address.
# declare and initialize the global counters used below and in the test step
:global counterxxxxx
:global counteryyyyy
:if ([:typeof $counterxxxxx] = "nothing") do={ :set counterxxxxx 0 }
:if ([:typeof $counteryyyyy] = "nothing") do={ :set counteryyyyy 0 }
:local statusxxxxx [/tool netwatch get [find name="netwatch-xxxxx"] status]
:local statusyyyyy [/tool netwatch get [find name="netwatch-yyyyy"] status]
:if ($statusxxxxx = "down") do={
    /interface wireguard set [find name="wireguard-to-xxxxx"] listen-port=0
    :set counterxxxxx ($counterxxxxx + 1)
    /tool e-mail send to=myemailaddress subject="WireGuard to XXXXX reset" body="WireGuard to XXXXX listen-port has been reset at $[/system clock get time]."
}
:if ($statusyyyyy = "down") do={
    /interface wireguard set [find name="wireguard-to-yyyyy"] listen-port=0
    :set counteryyyyy ($counteryyyyy + 1)
    /tool e-mail send to=myemailaddress subject="WireGuard to YYYYY reset" body="WireGuard to YYYYY listen-port has been reset at $[/system clock get time]."
}
Create scheduler, to run every 30 seconds
name: "check-netwatch-and-fix-wg"
start: now
interval: 30s
on event: wg-reset-if-netwatch-down
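For reference, a CLI sketch of the scheduler entry described above (the default start time stands in for “now”):

```routeros
# run the reset script every 30 seconds
/system scheduler add name="check-netwatch-and-fix-wg" interval=30s \
    on-event="wg-reset-if-netwatch-down"
```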
Test the scripts by disabling the peers to xxxxx and yyyyy and watching whether the global counter variables start to increase. You can also check whether the port numbers on the WG interfaces are changing. If yes, you are done.
In the end, it checks every 30 seconds whether the outgoing WG connection is working. If not, it randomizes the listen-port of the affected WG interface, so WG can reconnect even if it was stuck before. The global variables store how many times each connection was reset. I receive emails about every up, down, and reset event. Email is not necessary, btw, but I like being notified when something abnormal happens in my network. Hope I can help others with this; sorry for the long post.
I don’t think the randomizer will ever assign port 53. I’m pretty sure the Mikrotik developers were careful enough that this function won’t assign any well-known or already used port to listen-port. Just consider that we are using functionality that is used by default anyway when you create a new WG interface and leave the listen-port field empty.
That is a large assumption, bolka4u, and thus I prefer the use of, let’s say, 10 different offsets from a starting wireguard listening port that recycle. The effect is the same and no assumptions need be made.
I have confirmed with MT that the random port selection (listen-port=0) indeed uses only ports not already in use, and that these are restricted to the higher port numbers, above 40K or so.