Trouble with wireguard and asymmetric routing

To understand the behavior, it is necessary to know a couple of features of RouterOS firewall and Wireguard:

  1. the Wireguard stack itself never changes the local port, so the reason why the “responses” from 172.21.16.80 can be seen on the wire with a source port other than 51822 must be some action=masquerade (or action=src-nat) rule.
  2. unless explicitly configured otherwise, the src-nat and masquerade rules in RouterOS firewall only change the source port if keeping it unchanged would make the resulting socket pair identical to one that represents an already existing connection.
  3. unlike all the other services that use UDP transport (DNS, NTP), the Wireguard stack does not respond from the same socket (address:port) at which it has received the corresponding request; instead, it handles every packet it sends as a standalone one, not related to any existing “connection”. In particular, this means that it first looks for a gateway for that packet, and only then, depending on the out-interface chosen by routing, assigns a source address to it. To make things even more complicated, it even used to ignore the pref-src value if any was assigned to a route - I’m not sure whether that is still the case several months later.

Taking all the above into account, the following is what I suspect happens:

  • the initial handshake packet from 192.168.1.16:49932 arrives at 172.21.16.80:51822 via either of the two interfaces, creating a tracked connection between these two sockets, without any local NAT treatment, in the conntrack module of the firewall
  • the Wireguard stack sends a response packet to 192.168.1.16:49932, but since ECMP chooses a route via a different out-interface than the one to which 172.21.16.80 is attached, the source address of the response is taken from that interface and the source port remains 51822
  • from the point of view of the conntrack module, this response packet does not match any existing connection, hence it is treated as an initial packet of a new connection, the srcnat chain is checked, and a src-nat (masquerade) treatment is applied to the new connection; however, since the reply-src-address would become 172.21.16.80:51822 and thus a duplicate row in the connection matching table would be created, the port gets changed to 38151.
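
The three steps above can be replayed with a toy model of the connection table. This is only a sketch of the suspected mechanism, not the real conntrack algorithm: the function name src_nat and the address 10.0.0.2 (standing in for whatever address sits on the other out-interface) are made up for illustration, and the replacement port 38151 is fed in from the capture rather than computed.

```python
def src_nat(existing, remote, to_addr, src_port, fallback_ports):
    """Return the (address, port) a packet leaves with after src-nat.

    existing is a set of (local_socket, remote_socket) pairs already tracked.
    The source port is kept unless (to_addr, src_port) <-> remote is taken.
    """
    for port in (src_port, *fallback_ports):
        local = (to_addr, port)
        if (local, remote) not in existing:
            existing.add((local, remote))
            return local
    raise RuntimeError("no free port among the candidates")


remote = ("192.168.1.16", 49932)
tracked = set()

# Step 1: the handshake arrives and is tracked without any NAT.
tracked.add((("172.21.16.80", 51822), remote))

# Step 2: the response is routed via the other interface, so it carries
# that interface's address (10.0.0.2 is a hypothetical placeholder).
response_src = ("10.0.0.2", 51822)
is_new_connection = (response_src, remote) not in tracked  # True

# Step 3: src-nat to 172.21.16.80 would duplicate the tracked pair,
# so the port cannot stay 51822 and the next candidate is used.
natted = src_nat(tracked, remote, "172.21.16.80", 51822, fallback_ports=[38151])
print(is_new_connection, natted)
```

With an empty table the same call would return port 51822 unchanged, which is the behavior described in point 2.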

Such a scenario assumes, however, that you do not use action=masquerade rules in the srcnat chain but rather action=src-nat ones specifying a particular address using to-addresses, as an action=masquerade rule would choose the same address as routing does; but in that case, the original packet would already match the existing connection created by the incoming request, so it would not be treated as an initial one and thus it would not be sent through the srcnat chain at all. So either your configuration differs from the one suggested by the manual or something behaves differently than I expect.

So rather than referring to a manual page and stating that you have configured your machines according to it, post complete (anonymized!) exports of the actual configurations of routers A and B for analysis.

In general, the best way to deal with the issue in point 3 above in scenarios where there is external NAT on the path between the Wireguard peers is the one described by @lurker888 in post #72 of the topic “RouterOS blatantly ignores pref-src. Can this really be a bug?”

If there is no external NAT in your setup, as the diagram suggests, you may try to remove the srcnat handling from your existing configuration. The response from router B will then leave with the other address as its source, and the Wireguard stack on router A should adapt to it and continue talking to that address rather than to the 172.21.16.80 used initially, because Wireguard is designed to adapt seamlessly to changes of peer addresses (which is also the reason why it treats each packet individually rather than as a part of a pre-existing UDP “session”). But that’s just an educated guess; I haven’t set up any testbed to check it.
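
To illustrate the roaming behavior I am relying on here, below is a minimal sketch of how router A would track the endpoint of peer B. The class ToyPeer and the address 10.0.0.2 are hypothetical, and the real stack does much more (key exchange, timers); the only point is that a correctly authenticated packet moves the endpoint to wherever it came from.

```python
class ToyPeer:
    """Router A's view of one peer; a simplification, not the real stack."""

    def __init__(self, endpoint):
        # (address, port) that the next outgoing packet will be sent to
        self.endpoint = endpoint

    def on_packet(self, src, authenticated):
        # Only a packet that authenticates correctly may move the endpoint.
        if authenticated:
            self.endpoint = src


peer_b = ToyPeer(("172.21.16.80", 51822))

# The response arrives from the other address of router B
# (10.0.0.2 is a hypothetical placeholder), with the port unchanged.
peer_b.on_packet(("10.0.0.2", 51822), authenticated=True)
print(peer_b.endpoint)
```

Whether the rest of the path lets a response from the other address through to router A is not covered by this sketch.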