Inconsistent WG connection via 2 ISPs

ok, I’m stumped on this.
I’ve had WG working on a specific 5009 for months. The site added a 2nd ISP, and I want to be able to use WG on both connections.
I enabled ECMP on both default routes, and added the following mangles to keep ISP1 and ISP2 connections to their own lane.

/ip firewall mangle
add action=mark-connection chain=input comment="Keep eth1 inbound connections to eth1" connection-mark=no-mark in-interface=ether1 new-connection-mark=ISP1-IN passthrough=yes
add action=mark-routing chain=output connection-mark=ISP1-IN new-routing-mark=ISP1 passthrough=yes
add action=mark-connection chain=input comment="Keep eth2 inbound connections to eth2" connection-mark=no-mark in-interface=ether2 new-connection-mark=ISP2-IN passthrough=yes
add action=mark-routing chain=output connection-mark=ISP2-IN new-routing-mark=ISP2 passthrough=yes

If I try to connect to this WG instance via ISP1 from my wired ISP, the handshake fails. If I try to connect to this WG instance via ISP1 from my LTE data, the handshake works every time.

If I try to connect to this WG instance via ISP2 from my wired ISP, the handshake works. If I try to connect to this WG instance via ISP2 from my LTE data, the handshake works every time.

To make sure I did not have a WG config error or any other kind of user overlap, I created a brand new WG instance on a separate subnet, allowed it out, allowed the WG ports in… and got the exact same symptoms.

I’m struggling to find why this would block me on WG on my hard wired ISP vs say my mobile data.
Any suggestions on what to check?

Since the mangle rules alone do not do all the work, the rest of the config must be analyzed too: a missing interface on the WAN list, an error in the routing tables, etc.
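
As an illustration of the kind of dependency meant here: in RouterOS 7, the mark-routing rules in the first post only take effect if routing tables named ISP1 and ISP2 exist and each holds a default route. A minimal sketch, assuming RouterOS 7 and placeholder gateways (192.0.2.1 and 198.51.100.1 are documentation addresses, not values from this thread):

```
/routing table
add name=ISP1 fib
add name=ISP2 fib
/ip route
# One default route per table, so replies marked ISP1/ISP2 leave via the WAN they arrived on.
# Gateway addresses are placeholders; substitute each ISP's real gateway.
add dst-address=0.0.0.0/0 gateway=192.0.2.1 routing-table=ISP1
add dst-address=0.0.0.0/0 gateway=198.51.100.1 routing-table=ISP2
```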

If you need to have your car checked out, do you just bring only the tank cap to the mechanic?

Read…http://forum.mikrotik.com/t/vpn-port-forward-through-1-interface/176216/2

I can confirm that this is a bug reproducible on multiple installations using ECMP. And the workaround is the dstnat rule as posted by @anav.

Unlike a car, I can’t bring you the physical device. There’s a lot of superfluous config on this device that is outside the purpose of WG, and I’d gladly post the parts that are requested. To sanitize the entire config would take hours.

Both ISP interfaces are on the WAN list. I’m not flawless and I’m more than willing to admit it’s possible I made a config mistake. If you have specific sections you’d like to see, I’ll sanitize those and post them.

Thank you kind sir! Is this an issue with Wireguard itself, or just the way RouterOS implements it?

I have a question about this.

/ip firewall nat
add action=dst-nat chain=dstnat in-interface=ether2 dst-address-type=local dst-port=13231 protocol=udp to-addresses=172.16.0.1

172.16.0.1 looks to be the default gateway for ISP1.

Why would you dst-nat a connection coming in on ISP2 (eth2) to the gateway on ISP1 (eth1)?

The gist of it is that the response to a query is already bleeding out of WAN1 instead of WAN2 for the initial handshake.
Therefore we tell the router that all traffic coming in on ether2 (WAN2) is dst-natted to WAN1.

Thus, when WAN1 incorrectly replies to traffic from that port, the router sends the traffic, with the dst-nat undone, back via WAN2 (ether2).

This breaks my brain. WoW. How did anyone figure this out? Lots of packet captures?

In my case, I have 2 WG instances, one designated for WAN1 and one for WAN2 (different ports). Should I criss-cross them?

Example:

/ip firewall nat
add action=dst-nat chain=dstnat in-interface=ether2 dst-address-type=local dst-port=13232 protocol=udp to-addresses=WAN1Gateway

/ip firewall nat
add action=dst-nat chain=dstnat in-interface=ether1 dst-address-type=local dst-port=13231 protocol=udp to-addresses=WAN2Gateway

Now, here’s a thing we’ve noticed in our testing: reboots cause the ‘active’ WG interface to change. So for my issue, yesterday WAN1 would hardly work at all for WG. However, after a reboot last night, WG works fine on WAN1 but doesn’t work at all on WAN2 today.

No, this only affects the WAN that is second in natural priority (the failover WAN, so to speak).
You still need some mangling going on, so I would have to see the config to comment further.

What parts of the config do you need and I’ll gather those?
My thoughts are…

interface wireguard (I'll need to hide the private keys)
ip route
ip addresses (I'll change the public IP)
ip firewall filter
ip firewall nat
ip firewall mangle

Anything else?

/export file=anynameyouwish ( minus router serial number, any public WANIP info, keys etc.)

Understood. I’ll start working on that.

Here you go. I tried to organize it a bit with some spacing.

https://pastebin.thenetwork.pro/?a116c907f7648138#3jTuSLJQkzsgpyBWWrdu1mZnAgkwJKDbRe6rcnA5FqDk


I have tried this

/ip firewall mangle
add action=mark-connection chain=input comment="Keep eth1 inbound connections to eth1" connection-mark=no-mark in-interface=ether1 new-connection-mark=ISP1-IN passthrough=yes
add action=mark-routing chain=prerouting connection-mark=ISP1-IN new-routing-mark=ISP1 passthrough=no
add action=mark-connection chain=input comment="Keep eth2 inbound connections to eth2" connection-mark=no-mark in-interface=ether2 new-connection-mark=ISP2-IN passthrough=yes
add action=mark-routing chain=prerouting connection-mark=ISP2-IN new-routing-mark=ISP2 passthrough=no

instead of

/ip firewall mangle
add action=mark-connection chain=input comment="Keep eth1 inbound connections to eth1" connection-mark=no-mark in-interface=ether1 new-connection-mark=ISP1-IN passthrough=yes
add action=mark-routing chain=output connection-mark=ISP1-IN new-routing-mark=ISP1 passthrough=no
add action=mark-connection chain=input comment="Keep eth2 inbound connections to eth2" connection-mark=no-mark in-interface=ether2 new-connection-mark=ISP2-IN passthrough=yes
add action=mark-routing chain=output connection-mark=ISP2-IN new-routing-mark=ISP2 passthrough=no

but every time I do, all external connections to the router drop. Can’t ping, can’t SSH, can’t VPN, can’t use WinBox, etc.
Not sure why.

Any thoughts?

Tested the fix… didn’t work for me.

/ip firewall nat
add action=dst-nat chain=dstnat in-interface=ether2 dst-address-type=local dst-port=AllWGPorts protocol=udp to-addresses=ISPGateway1
add action=dst-nat chain=dstnat in-interface=ether1 dst-address-type=local dst-port=AllWGPorts protocol=udp to-addresses=ISPGateway2

Any chance you’d be willing to share the rules that you used? I can’t seem to make it work with my understanding.

Why did you completely disregard the advice provided??

No, this only affects the WAN that is second in natural priority (the failover WAN, so to speak).
You still need some mangling going on, so I would have to see the config to comment further.

To be fair, I didn’t look at your whole config, but I will look tonight.

(1) Why the four or five WireGuard interfaces? I like simple and clean. Unless there is a reason to have four or five, you only need one interface!

You can actually assign multiple IP subnets to a single WireGuard interface.
The only reason you would need multiple INTERFACES is if any router traffic needed to go out to the internet at the remote end.
Then you would have to use 0.0.0.0/0 in that peer’s allowed-address setting, which effectively makes it impossible to have more than one peer on that interface.
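
As a rough illustration of the single-interface approach, here is a minimal sketch. It assumes the interface name wireguard1 from this thread; the subnets 10.10.10.0/24 and 10.10.20.0/24 and the peer keys are made-up placeholders, not values from the posted config:

```
/ip address
# Two subnets on the same WireGuard interface
add address=10.10.10.1/24 interface=wireguard1
add address=10.10.20.1/24 interface=wireguard1
/interface wireguard peers
# Each peer is limited to its own address; the public keys are placeholders
add interface=wireguard1 public-key="<peer-A-public-key>" allowed-address=10.10.10.2/32
add interface=wireguard1 public-key="<peer-B-public-key>" allowed-address=10.10.20.2/32
```

Firewall rules can then treat the two subnets differently even though they share one interface and one listen port.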

(2) What a mess: you have a bridge but don’t use bridge VLAN filtering… as I stated, simplify.
Nothing to be gained by the over complex structure.

(3) For example, why do you have two management subnets? It makes no sense.
/ip pool
add name=dhcp_pool_mavico ranges=10.1.10.50-10.1.10.254
add name=dhcp_pool_administradores ranges=10.1.60.2-10.1.60.254
add name=dhcp_pool_management ranges=10.1.50.10-10.1.50.253
add name=dhcp_pool_alquiler1 ranges=10.1.1.2-10.1.1.254
add name=dhcp_pool_VPN ranges=192.168.89.10-192.168.89.254
add name=dhcp_pool_servidores ranges=10.0.0.2-10.0.0.254
add name=dhcp_pool_alquiler2 ranges=10.1.2.2-10.1.2.254
add name=dhcp_pool_alquiler3 ranges=10.1.3.2-10.1.3.254
add name=dhcp_pool_invitados ranges=10.1.100.2-10.1.100.254
add name=dhcp_pool_wifi_mavico ranges=10.1.80.2-10.1.80.254
add name=dhcp_pool_ILO ranges=10.0.1.2-10.0.1.6
add name=dhcp_pool_can&t ranges=10.1.15.2-10.1.15.18
add name=dhcp_pool_management_network ranges=10.0.2.2-10.0.2.254
add name=dhcp_pool_MINER ranges=10.0.250.2-10.0.250.4
add name=dhcp_pool_productora ranges=10.1.150.25-10.1.150.254

(4) RP filter set to strict is a no-no, especially in a multi-WAN scenario. It should be set to LOOSE.
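
For reference, a sketch of the setting point (4) refers to, assuming the stock RouterOS 7 menu layout:

```
/ip settings
# loose reverse-path filtering accepts a packet if its source is reachable via any interface,
# not only the one it arrived on, which suits multi-WAN setups
set rp-filter=loose
```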

++++++++++++++++++++++++++++++++++++++++++++

This is a client router, and I’m coming in after it’s been deployed and their last network guy left.
I agree the VLANs need to be cleaned up; it’s a mess, along with a bunch of other things.
I’d like this thread to focus on the WG issues with multi-WAN.

After our last discussion, I did find and change the RP filter to loose.

The client has had multiple WG interfaces since before I took them on.
wireguard1 - has both road-warrior and site-to-site tunnels. Meant to come in on WAN1.
wirenetlife - has both road-warrior and site-to-site tunnels. Meant to come in on WAN2.
wiregrelive - has only road-warrior tunnels. Meant to come in on WAN1.
wireguard2-test - was set up as a brand new test for me to see if the problem stemmed from the prior setup. This will be deleted.

While I’m confident I can merge wiregrelive and wireguard1 together, that doesn’t solve the core issues I’ve reported in this thread.
There are bandwidth and latency reasons why certain connections need to come in on WAN1 and not WAN2, and vice versa. Once I have this issue sorted, WG ports will only be allowed in via their respective WAN interfaces.

You asked for the config, I’ve provided it.
I listed the mangles above, and in the config. I’ll list them again in case you missed them.

/ip firewall mangle
add action=mark-connection chain=input comment="Keep eth1 inbound connections to eth1" connection-mark=no-mark in-interface=ether1 new-connection-mark=ISP1-IN passthrough=yes
add action=mark-routing chain=output comment="Keep eth1 connections to eth1 - mark route" connection-mark=ISP1-IN new-routing-mark=ISP1 passthrough=no
add action=mark-connection chain=input comment="Keep eth2 inbound connections to eth2" connection-mark=no-mark in-interface=ether2 new-connection-mark=ISP2-IN passthrough=yes
add action=mark-routing chain=output comment="Keep eth2 connections to eth2 - mark route" connection-mark=ISP2-IN new-routing-mark=ISP2 passthrough=no

I’ve added these DSTnat rules per your recommendation.

/ip firewall nat
add action=dst-nat chain=dstnat comment="WG Fix for Eth2" dst-address-type=local dst-port=23232-23239 in-interface=ether2 protocol=udp to-addresses=WAN1IP
add action=dst-nat chain=dstnat comment="WG Fix for Eth1" dst-address-type=local dst-port=23232-23239 in-interface=ether1 protocol=udp to-addresses=WAN2IP

During my tests from my home ISP, I can handshake with all WG instances via WAN1 and WAN2, but I cannot route any actual traffic (like checking ipinfo.io).
However, if I test using my cell phone with the exact same WG peer profiles, the WAN2 instances work, but only wiregrelive works via ISP1.

It’s completely inconsistent.

My only guess is that there’s some kind of hashing algo related to ECMP that’s applied to each WG instance in RouterOS, and once that hash has been made, it stays resident in memory until the unit is rebooted. If we reboot the device, WG could work fine via WAN1, or only work via WAN2.