Recently, I've been looking at what speeds can be obtained between two RouterOS instances over a Wireguard connection. This started because I had set up a Cloud Hosted Router (CHR) instance to use as my own VPN endpoint. My VDS is on dedicated hardware sitting on a 10 Gbit network link. My local router (5009) is on a 2.5 Gbps internet connection.
I was only getting speeds of maybe 300 to 400 Mbps at best between my 5009 and the CHR. I assumed it was an issue with my provider's network connection and that I was not really sitting on a 10 Gbit link.
I had a lot of back and forth with a couple others here on the forums about this on another thread.
I got the idea in my head to set up my own test locally - set up my own CHR instance on my LAN where I could test LAN speed between the two instances as well as Wireguard speed. The results are pretty shocking.
Hardware:
5009 with a 10 Gbit connection to my LAN on the SFP port
RouterOS running virtualized on a QNAP TS-873A with a 10 Gbit fiber connection
So 10 Gbps between the 5009 and the virtual router instance on the QNAP. The last tests shown below utilized 6 CPU cores and 4 GB of memory on the virtual router. The number of cores really didn't make much difference beyond one core.
Results:
Testing over a "LAN" connection (no Wireguard) showed speeds of roughly 3 Gbps:
The 5009 IP address is 192.168.1.1. The "WAN" interface of the virtual router was 192.168.1.242. I also had another ethernet interface added on the LAN side, but speeds were unchanged. So there's not a speed drop going through the router interface.
A significant difference! 192.168.2.1 is the WG endpoint on the virtual router. No matter what I tried, I can only get 300 to 400 Mbps at best out of a WG connection between two Mikrotik router instances. This really dovetails with what I am seeing on my remote CHR. Very similar performance.
I don't think there's anything in the firewall or router settings that would alter this.
There is a bug in Winbox though. In the images shown, the TX and RX colors are reversed.
Wondering if anyone else has seen similar results. Is there a different VPN protocol that is faster between two Mikrotik instances?
Check the profile tool on the RB5009 while performing your test.
I suspect 1 core will be at 100% and then you're done.
Wireguard is single core, as far as I know.
But it remains the faster option in most cases ( unless you can use HW offloaded IPSEC on a beast device at both ends).
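To see whether a single core is maxing out, the per-core load can be checked from the terminal while the bandwidth test runs (a minimal sketch; the duration value is just an example):

```
# Show which RouterOS processes consume CPU, broken out per core
/tool profile cpu=all duration=10

# Overall per-core usage
/system resource cpu print
```

If one core sits at or near 100% while the others idle, the bottleneck is a single-threaded code path rather than the link itself.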
Here you go. This is the current setup. Please everyone, no comments or questions about the VLANs and why certain things are set up on the 5009 the way they are. It's for a specific purpose....
The CHR config is the one running right now on the VDS but the local one on my QNAP is basically the same thing.
I should state that when using my ProtonVPN connections, I get over a gigabit on those WG connections. So a 5009 is capable of much higher performance; it's just the link between two Mikrotik routers that seems to be the issue.
Here's my take. Wireguard is multi-threaded from the start in the in-kernel implementation, and it has been so on Mikrotik as well since the initial release.
There are two possible problems. The first is that although the bandwidth test was made multi-threaded some time ago, it still runs in user space. For a normal (forwarded) connection, the packets never leave the kernel, so the user-space test should be expected to take a huge performance hit.
The other is the MTU/MSS. The values of 1420 and 1380 are for the case where the outer (underlay) connection is IPv4 with a full 1500 MTU. If VLAN tags have to be added, that pushes the frame size up, and many VPS environments don't play nice with this.
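For reference, the arithmetic behind those 1420/1380 defaults can be made explicit (a quick sketch; the per-header sizes are the standard protocol values, not something measured in this thread):

```python
# Per-packet overhead Wireguard adds on the outer (underlay) connection:
# outer IP header + UDP header + Wireguard data-message header and auth tag.
IPV4_HDR = 20   # bytes
IPV6_HDR = 40
UDP_HDR = 8
WG_HDR = 32     # 16-byte data-message header + 16-byte Poly1305 auth tag
TCP_HDR = 20

LINK_MTU = 1500

# The common default tunnel MTU of 1420 covers the worst case (IPv6 underlay);
# an IPv4-only underlay could in principle go as high as 1440.
mtu_ipv6_underlay = LINK_MTU - IPV6_HDR - UDP_HDR - WG_HDR   # 1420
mtu_ipv4_underlay = LINK_MTU - IPV4_HDR - UDP_HDR - WG_HDR   # 1440

# TCP MSS inside the tunnel: tunnel MTU minus inner IPv4 + TCP headers.
mss_inside = mtu_ipv6_underlay - IPV4_HDR - TCP_HDR          # 1380

print(mtu_ipv6_underlay, mtu_ipv4_underlay, mss_inside)      # 1420 1440 1380
```

A 4-byte VLAN tag on the outer link effectively shrinks the usable MTU by 4, which is why tagged or overlay-networked environments can break these defaults.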
So... try lower mss values. Also, your change-mss rules are only in the forward chain which means they don't affect input/output traffic, e.g. your bandwidth test, at all.
These together will likely explain the difference with the protonvpn result. Your measurements there are likely "through" the router (in the forward path) and I would be very surprised if protonvpn didn't clamp mss aggressively on their end.
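A clamp covering both forwarded and router-originated traffic might look something like this (a sketch only; the interface name wireguard1 and the new-mss value are placeholders to adjust for your setup):

```
/ip firewall mangle
add chain=forward protocol=tcp tcp-flags=syn out-interface=wireguard1 \
    action=change-mss new-mss=1380 passthrough=yes
# Same clamp in the output chain, so traffic the router itself
# originates (e.g. the built-in bandwidth test) is also affected:
add chain=output protocol=tcp tcp-flags=syn out-interface=wireguard1 \
    action=change-mss new-mss=1380 passthrough=yes
```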
I can also confirm that the RB5009 reliably does somewhere above 1 Gbps. Not much more than that, and the exact number will vary with even small changes in configuration.
The mangle rules were added at the suggestion of @CGGXANNX as I had been trying out SurfShark and SurfShark definitely needs those. They make no difference in the speeds between the 5009 and the CHR if they are enabled or not.
I will try reducing the mss values in the mangle rules and report back.
Well, yeah, I see my 5009 doing over a Gb when connected on Proton. But two Mikrotik routers seem to have the issue.
Changing the MTU of the WG connection did nothing. I dropped it to 1000 and things got slower. I raised it to 1500 and speed was back where it was around 1380.
With an MTU around 1350, I see peaks at 600 but nothing sustained.
I see nothing that will double the speed of the connection.
I'm in the same boat: very bad performance with WG site-to-site.
I've tried lots and lots of things, but performance is still bad.
That's when I'm routing through the tunnel with iperf servers on each side.
Funnily enough, I get good performance when using the integrated BW tester on each RB5009.
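For comparison, a through-the-tunnel test of that kind is usually run along these lines (the address is a placeholder for a host behind the far router, not one from this thread):

```
# On a host behind site B:
iperf3 -s

# On a host behind site A, sending through the WG tunnel,
# with several parallel streams to spread the load:
iperf3 -c 192.168.88.10 -P 4 -t 30

# Reverse direction (server sends, client receives):
iperf3 -c 192.168.88.10 -P 4 -t 30 -R
```

Testing host-to-host like this keeps the routers out of the user-space measurement path, unlike the built-in bandwidth test.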
Would be nice to have a second RB5009 to try as opposed to a virtualized router. But still, it should not make that much of a difference.
What I don't understand is that I get over a gigabit on Wireguard over my ProtonVPN connection. So the 5009 is certainly capable of unwrapping the encrypted packets quickly enough. So perhaps the speed issue is due to the fact that it can't wrap the unencrypted packets fast enough?
But then why isn't the CPU usage higher? If there are spare CPU cycles, why is the router not using them to do the encryption...
Hi, if you're in France, I have a spare RB5009 and could lend it to you.
I have 2 sites, each with the same fiber connection: 5 Gbps down and 1 Gbps up.
WG has been configured and running for a few years; it's not new to me, and I'm not setting it up for the first time...
This is my performance with the integrated BW tester running at the same time, so about 350 Mbit/s "raw":
And when using servers and routing through the WG tunnel, it's asymmetric. I've already tried iperf3 directly over the internet with PAT and it performs well. In most cases, RB5009 CPU is below 50-60%.
VLAN tags don't go over Wireguard, ever. But whether the MTU is constricted inside or outside the tunnel ultimately doesn't matter. Virtual machines (virtual network drivers, vswitches, etc.) are notorious for this. There usually is some solution, you just have to find it. Worse: lots of cloud hosts use some sort of overlay networking for private networks, which makes it worse.
Your change-mss rules DO NOT affect the built-in bandwidth test. For them to affect it, you have to place the same rule in the "output" chain as well. Changing the values on rules that have no effect... has no effect.
When you decrease either mtu or the clamp mss value, you have to do it on both sides.
Again, the built-in bandwidth test is not the best. Part of the reason is that it loads the device unnecessarily. While it probably doesn't halve the performance, it does have a dramatic effect. I also suspect it has other secrets regarding TCP tuning, so if you use it at all, use UDP and tune the target bandwidth manually. (This way, by the way, you can also adjust packet sizes easily instead of relying on MTU/MSS.)
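A UDP run along those lines might look like this (a sketch; the address matches the WG endpoint mentioned earlier in the thread, but the rate, packet size and duration are placeholders to tune):

```
# Run from the RB5009 against the bandwidth-test server on the far router.
# Fixing the direction, packet size and target rate avoids the tool's
# TCP behaviour entirely.
/tool bandwidth-test address=192.168.2.1 protocol=udp direction=transmit \
    local-udp-tx-size=1400 local-tx-speed=800M duration=30s
```

Stepping local-tx-speed up until loss appears gives a cleaner ceiling than letting a TCP test negotiate on its own.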
Just an additional idea: recently there was quite a bit of development on veth and container stuff, some of which led to performance problems on bridges to which the veth interface was attached. (As far as I know, this affected traffic that didn't touch the veth interface at all and was only through the same bridge.) Just to rule out things, I'd do a test while the veth interfaces are detached.
Agreed a VM is not the best solution for testing but it pretty much confirms the same thing I am seeing when trying to test data between my 5009 and the remote CHR.
Yep. And I did that. Really made no difference. Change was made to both sides.
Actually, the built-in bandwidth test performs better than things like an Ookla speed test. I know this from my efforts with the CHR.
I can do that but I doubt it will make a difference. The veth interfaces were introduced because I wanted to see what sort of performance I was getting from the CHR outward and take the Wireguard link out of the equation. So I installed a container to utilize the Ookla cli speedtest. I set it up on my 5009 first and then on the CHR. I have no container set up on the local virtual router.
Here's results from Speedtest.
First testing from my 5009 out through the CHR, I got:
Speedtest by Ookla
Server: Kansas Research and Education Network - Wichita, KS (id: 20531)
ISP: Massivegrid
Idle Latency: 70.27 ms (jitter: 6.35ms, low: 62.84ms, high: 72.92ms)
Download: 107.04 Mbps (data used: 140.3 MB)
115.66 ms (jitter: 36.93ms, low: 69.62ms, high: 474.37ms)
Then from the CHR itself I got:
Speedtest by Ookla
Server: Kansas Research and Education Network - Wichita, KS (id: 20531)
ISP: Massivegrid
Idle Latency: 36.16 ms (jitter: 0.15ms, low: 36.07ms, high: 36.47ms)
Download: 3098.95 Mbps (data used: 3.8 GB)
So big difference.
I don't understand why there is such a huge difference between testing using the built in bandwidth test and what I am seeing when using Ookla. But it's a major difference.
I have not tried these tests over the virtualized instance here.
My theory is that RouterOS has an issue with encoding the packets for uplink. Why do I say this? Well, when connected to ProtonVPN and doing Ookla speed tests, I can get speeds over a gigabit. However, my uplink speed is limited to roughly 350 Mbps. This is in line with the maximum speed that I see over a WG link.
When I am connecting remotely to my home from one of my travel routers or my cell phone, I am either limited by the speed of the connection or by my upload speed since the maximum speed I can send from home is 350 Mbps. If I had a faster uplink speed, I would be able to confirm this but I don't. So the weakness of RouterOS is buried due to my uplink.
Downlink does not seem to really be a problem. But I can't tell if the speed I am seeing on downlink is limited by the VPN server or by RouterOS or both. Probably some of both given what I see here.
I am happy to talk about tweaks and adjustments. @lurker888 you have helped me a ton with my setup over the last year and I really appreciate that. Big shout out to you. I would be more than willing to sit and learn more from a Jedi master, so I'm not trying to minimize your suggestions or what you are saying. I just have not found any tweaks so far that result in any sort of noticeable improvement. If I could get from, say, 300 Mbps to 800 Mbps - OK, that's still not great, but it's a step in the right direction. So far nothing.
I don't understand, though, why things are being limited as much as they are, as the CPU loading on both sides of the connection is not near maximum. I know there's a difference between CPU usage percentage and CPU load (i.e. the number of threads in queue), so maybe there are too many threads to process and things are stuck there. But you would then expect that adding CPUs would help. It did not, either on the remote CHR (same performance whether I use one dedicated Xeon core or four) or locally on the VM CHR.