Wireguard handshake doesn't constitute UDP stream

I’m having trouble with a Wireguard client behind a RouterOS NAT.

Background:
I have a Wireguard endpoint at a static IP (WG-SERVER), and I’d like to connect to it from a client (WG-CLIENT) behind a CRS328-24P-4S+ which is configured to masquerade behind a dynamic IP address. The WG-CLIENT configuration has “PersistentKeepalive = 150”, which appears to function correctly.

Problem:
When WG-CLIENT comes up and performs a Wireguard handshake because of the PersistentKeepalive setting, I’m sniffing three packets along the path: two bound for WG-SERVER, and one from WG-SERVER back to WG-CLIENT. The handshake works successfully, and I’m able to transfer data between those hosts. However, if no additional data is transferred after the handshake for 10 seconds, the RouterOS connection tracking never bumps the timeout for the NAT entry from the “udp-timeout” value of 10 seconds to the “udp-stream-timeout” value of 180 seconds, and the NAT entry dies before the next keepalive handshake.

If I ping between the hosts within the 10 seconds after the handshake, the NAT entry timeout is reset to 180 seconds as it should. Future handshakes (every 150 seconds) then reset the entry’s timeout to 180 seconds and things work well until the session terminates.

I expected that by sending packets in both directions, the Wireguard handshake should trigger the “udp-stream-timeout” reset, but it does not and I can’t figure out why. I tried adding “PersistentKeepalive” to the WG-SERVER configuration for that peer, and while an additional packet from WG-SERVER to WG-CLIENT is sent during the initial handshake, the NAT entry still times out after 10 seconds.

Relevant config:

# 2023-06-16 15:18:56 by RouterOS 7.10
# software id = DXQ4-4HPK
#
# model = CRS328-24P-4S+
# serial number = QQQQQQQQQQQQ
/interface bridge port
add bridge=bridge ingress-filtering=no interface=sfp-sfpplus1
/ip address
add address=172.16.8.1/16 interface=bridge network=172.16.0.0
/ip dhcp-client
add interface=ether1 use-peer-dns=no use-peer-ntp=no
/ip firewall filter
add action=accept chain=input connection-state=established,related
add action=drop chain=input connection-state=invalid
add action=accept chain=input in-interface=ether1 protocol=icmp
add action=drop chain=input in-interface=ether1
/ip firewall nat
add action=masquerade chain=srcnat out-interface=ether1 src-address=172.16.0.0/16

Connection tracking settings (all defaults):

                   enabled: auto
               active-ipv4: yes
               active-ipv6: yes
      tcp-syn-sent-timeout: 5s
  tcp-syn-received-timeout: 5s
   tcp-established-timeout: 1d
      tcp-fin-wait-timeout: 10s
    tcp-close-wait-timeout: 10s
      tcp-last-ack-timeout: 10s
     tcp-time-wait-timeout: 10s
         tcp-close-timeout: 10s
   tcp-max-retrans-timeout: 5m
       tcp-unacked-timeout: 5m
        loose-tcp-tracking: yes
               udp-timeout: 10s
        udp-stream-timeout: 3m
              icmp-timeout: 10s
           generic-timeout: 10m
               max-entries: 479232
             total-entries: 157

Handshake sniff:

Columns: INTERFACE, TIME, NUM, DIR, SRC-MAC, DST-MAC, SRC-ADDRESS, DST-ADDRESS, PROTOCOL, SIZE, CPU
INTERFACE    TIME  NUM  DIR SRC-MAC            DST-MAC           SRC-ADDRESS          DST-ADDRESS          PROTOCOL SIZE  CPU
sfp-sfpplus1 3.098   1  <-  XX:XX:XX:XX:XX:XX  YY:YY:YY:YY:YY:YY 172.16.8.2:39663     WG-SERVER-IP:51820   ip:udp    190    0
bridge       3.098   2  <-  XX:XX:XX:XX:XX:XX  YY:YY:YY:YY:YY:YY 172.16.8.2:39663     WG-SERVER-IP:51820   ip:udp    190    0
ether1       3.098   3  ->  YY:YY:YY:YY:YY:YY  ZZ:ZZ:ZZ:ZZ:ZZ:ZZ EXT-DYNAMIC-IP:39663 WG-SERVER-IP:51820   ip:udp    190    0
ether1       3.16    4  <-  ZZ:ZZ:ZZ:ZZ:ZZ:ZZ  YY:YY:YY:YY:YY:YY WG-SERVER-IP:51820   EXT-DYNAMIC-IP:39663 ip:udp    134    0
bridge       3.16    5  ->  YY:YY:YY:YY:YY:YY  XX:XX:XX:XX:XX:XX WG-SERVER-IP:51820   172.16.8.2:39663     ip:udp    134    0
sfp-sfpplus1 3.16    6  ->  YY:YY:YY:YY:YY:YY  XX:XX:XX:XX:XX:XX WG-SERVER-IP:51820   172.16.8.2:39663     ip:udp    134    0
sfp-sfpplus1 3.16    7  <-  XX:XX:XX:XX:XX:XX  YY:YY:YY:YY:YY:YY 172.16.8.2:39663     WG-SERVER-IP:51820   ip:udp     74    0
bridge       3.16    8  <-  XX:XX:XX:XX:XX:XX  YY:YY:YY:YY:YY:YY 172.16.8.2:39663     WG-SERVER-IP:51820   ip:udp     74    0
ether1       3.16    9  ->  YY:YY:YY:YY:YY:YY  ZZ:ZZ:ZZ:ZZ:ZZ:ZZ EXT-DYNAMIC-IP:39663 WG-SERVER-IP:51820   ip:udp     74    0

NAT entry:

6  SAC   s  protocol=udp src-address=172.16.8.2:39663 dst-address=WG-SERVER-IP:51820 reply-src-address=WG-SERVER-IP:51820
            reply-dst-address=EXT-DYNAMIC-IP:50167 timeout=8s orig-packets=2 orig-bytes=236 orig-fasttrack-packets=0
            orig-fasttrack-bytes=0 repl-packets=1 repl-bytes=120 repl-fasttrack-packets=0 repl-fasttrack-bytes=0
            orig-rate=0bps repl-rate=0bps

Hi,
I think the issue you’re experiencing is related to the way RouterOS handles UDP connection tracking and NAT timeouts.

By default, RouterOS uses a timeout of 10 seconds for UDP connections, which is quite low. This means that if no traffic is sent through a UDP connection for 10 seconds, the NAT entry will expire, and subsequent traffic will be dropped.
In your case, since Wireguard only sends keepalive packets every 150 seconds, the NAT entry is expiring before the next keepalive packet is sent. This is why pinging between the hosts within the 10 seconds after the handshake resets the NAT entry timeout to 180 seconds.

To solve this issue, you can try increasing the UDP timeout so that it is longer than the 150-second keepalive interval; anything shorter still lets the NAT entry expire between keepalives. You can do this by setting the “udp-timeout” parameter, for example, to 3 minutes to match “udp-stream-timeout”:

/ip firewall connection tracking set udp-timeout=3m

Disabling connection tracking for UDP is not a workable alternative here: masquerade relies on connection tracking, and the “loose-tcp-tracking” parameter only controls whether already-established TCP connections are picked up, so changing it has no effect on UDP timeouts.
Finally, note that a firewall NAT rule cannot act as a keepalive: NAT rules only rewrite packets that already pass through the router and never generate traffic themselves. The keepalive has to come from one of the endpoints, so the equivalent fix is to lower “PersistentKeepalive” on WG-CLIENT to a value below the “udp-timeout” of 10 seconds, which refreshes the NAT entry before it expires.

I faced a similar problem on ROS 7.11.2, where an IPsec IKEv2 exchange does not constitute a UDP stream.
I opened support ticket SUP-128720.

One of the first things I do is raise the NAT timeouts; even the Linux default of 30 seconds for UDP is way too low.

I found this Linux kernel change from 2018, https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/net/netfilter/nf_conntrack_proto_udp.c?id=d535c8a69c1924e70186d80be0a9cecaf475f166, which adds a 2 s grace period before the UDP stream bump.

If I understand the change correctly, a protocol that finishes its handshake in under 2 s and then goes silent never gets its conntrack timeout bumped to the UDP stream value; the entry keeps the shorter timeout instead.
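To make that reading concrete, here is a minimal Python sketch of the timeout logic as I understand it from the commit. It is a simplified model, not kernel or RouterOS code: the function name `simulate` and its parameters are made up for illustration, the timeout values are the RouterOS defaults quoted earlier in the thread, and the packet times come from the handshake sniff.

```python
def simulate(packets, udp_timeout=10.0, stream_timeout=180.0, grace=2.0):
    """Return the timeout (in seconds) applied after each packet.

    packets is a list of (time_s, direction) tuples sorted by time, where
    direction is "orig" (client to server) or "reply" (server to client).
    None in the result marks a reply that found no live entry.
    """
    applied = []
    created = expires = None
    seen_reply = False
    for t, direction in packets:
        if expires is None or t >= expires:
            if direction == "reply":
                applied.append(None)  # unsolicited reply: no entry to match
                continue
            created, seen_reply = t, False  # this packet opens a fresh entry
        # Stream timeout needs an earlier reply AND an entry older than
        # the grace period; otherwise the short timeout is (re)applied.
        if seen_reply and t - created > grace:
            timeout = stream_timeout
        else:
            timeout = udp_timeout
        if direction == "reply":
            seen_reply = True
        expires = t + timeout
        applied.append(timeout)
    return applied


# Handshake as sniffed: everything finishes within about 60 ms.
handshake = [(3.098, "orig"), (3.16, "reply"), (3.16, "orig")]
print(simulate(handshake))                       # [10.0, 10.0, 10.0]
# A ping through the tunnel a few seconds later bumps the entry.
print(simulate(handshake + [(8.0, "orig")]))     # [10.0, 10.0, 10.0, 180.0]
# With no traffic, the keepalive 150 s later finds the entry gone.
print(simulate(handshake + [(153.16, "orig")]))  # [10.0, 10.0, 10.0, 10.0]
```

Under these assumptions the model reproduces what was reported: the handshake alone never reaches the stream timeout, while any packet sent more than 2 s after the entry was created (and before it expires) does.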

I guess ROS7 upgraded to a recent kernel and thus inherited this change.

Possible solutions:

  1. Set a longer UDP timeout. This doesn’t work if the middle boxes also have short timeouts.

  2. Set a more aggressive keepalive or DPD timeout if the protocol supports it. 10s seems to be a good number for both IKEv2 and WireGuard. Con: not good for battery life.
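The trade-off between the two options can be sketched as a quick check, assuming the only thing that matters is whether each keepalive arrives before the shortest short-UDP timeout on the path runs out. The helper `keepalive_is_safe` and its `margin_s` parameter are hypothetical names for illustration, and the 30-second middle-box value is just the Linux default mentioned above.

```python
def keepalive_is_safe(keepalive_s, path_udp_timeouts_s, margin_s=1.0):
    """True if a keepalive every keepalive_s seconds beats the shortest
    UDP timeout among the NAT devices on the path, with a safety margin."""
    shortest = min(path_udp_timeouts_s)
    return keepalive_s + margin_s <= shortest


print(keepalive_is_safe(150, [10]))       # False: the situation in this thread
print(keepalive_is_safe(150, [180]))      # True: option 1, router timeout raised
print(keepalive_is_safe(150, [180, 30]))  # False: a middle box still expires first
print(keepalive_is_safe(5, [10, 30]))     # True: option 2, aggressive keepalive
```

In this model a keepalive equal to the timeout (10 s against 10 s) has no margin and is reported as unsafe; whether it holds in practice depends on timer granularity on both sides.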