RouterOS Wireguard Performance (& Other Tunneling)

So to take things out of the 7.19rc thread and keep the discussion going, here are some facts that I can back up with evidence.

Hardware: CCR2216

  1. Wireguard on version 7.18.2 results in stable bi-directional bandwidth of roughly 1.5Gb/s for a single-threaded iperf3 test from a client on the local LAN through the router WG tunnel to another client on a different VLAN.
  2. The same setup as in point 1, on version 7.19rc, results in asymmetric bandwidth. Flows from the WG client to the client on the local LAN work normally at roughly 1.5Gb/s, but in the reverse direction the bandwidth varies wildly between 400Mb/s and 800Mb/s, never hitting 1.5Gb/s as it did before.
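
If anyone wants to post comparable numbers for the two points above, here is a rough Python sketch for comparing a forward and a reverse run from iperf3's JSON output (`-J`, with `-R` for the reverse direction). As far as I know, TCP reports carry the receiver-side rate under `end.sum_received.bits_per_second`; the helper names, the sample figures and the 0.8 threshold are my own assumptions for illustration.

```python
import json


def received_gbps(report: dict) -> float:
    """Receiver-side throughput in Gb/s from a parsed `iperf3 -J` TCP report."""
    return report["end"]["sum_received"]["bits_per_second"] / 1e9


def asymmetry(forward: dict, reverse: dict) -> float:
    """Slower direction divided by faster direction (1.0 means symmetric)."""
    f, r = received_gbps(forward), received_gbps(reverse)
    return min(f, r) / max(f, r)


def is_asymmetric(forward: dict, reverse: dict, threshold: float = 0.8) -> bool:
    """Flag the pair when the slower direction falls below `threshold` of the faster."""
    return asymmetry(forward, reverse) < threshold


# Illustrative reports only: trimmed to the one field the helpers read.
forward = json.loads('{"end": {"sum_received": {"bits_per_second": 1.5e9}}}')
reverse = json.loads('{"end": {"sum_received": {"bits_per_second": 0.6e9}}}')

print(f"forward {received_gbps(forward):.2f} Gb/s, reverse {received_gbps(reverse):.2f} Gb/s")
print("asymmetric" if is_asymmetric(forward, reverse) else "symmetric")
```

In practice you would load the two files saved from the real runs with `json.load` instead of the inline samples.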

Other Notes:

  1. Max performance I can extract is roughly 2Gb/s with multiple threads (roughly 3-4); going higher than this doesn’t result in more performance on a single WG interface.
  2. Unique WG interfaces run on separate threads (this is the same on any arch), which allows you to do ECMP for more bandwidth. The highest I was able to get was a stable 3.1Gb/s across 3 WG tunnels; more than that and there isn’t enough CPU.
  3. If you are running 10-15Mb/s of traffic you will likely never see the slowdowns or issues. However, I am currently seeing another symptom of a problem: the Rx Error counter for the WG interface keeps incrementing even at very low bandwidth loads (800Kbps) as well as at high throughput.
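
To put the ECMP numbers from the notes above in perspective, here is a quick back-of-the-envelope calculation in Python. It assumes the three-tunnel figure is 3.1 gigabits per second and uses the roughly 2Gb/s single-interface maximum from note 1 as the baseline; the function names are mine.

```python
def per_tunnel_gbps(total_gbps: float, tunnels: int) -> float:
    """Average throughput carried by each tunnel in an ECMP group."""
    if tunnels <= 0:
        raise ValueError("tunnels must be positive")
    return total_gbps / tunnels


def scaling_efficiency(total_gbps: float, tunnels: int, single_max_gbps: float) -> float:
    """Measured aggregate divided by the ideal (tunnels x single-interface max)."""
    if tunnels <= 0 or single_max_gbps <= 0:
        raise ValueError("tunnels and single_max_gbps must be positive")
    return total_gbps / (tunnels * single_max_gbps)


print(round(per_tunnel_gbps(3.1, 3), 2))          # 1.03
print(round(scaling_efficiency(3.1, 3, 2.0), 2))  # 0.52
```

So each tunnel carries about half of what a single interface manages on its own, which fits the observation that the CPU runs out at three tunnels.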

Feel free to post your own results on different setups. I’m going to try a container with wireguard in it to see if this is a “RouterOS implementation” issue.

I had issues with a WG tunnel between two RB5009s, documented here:

http://forum.mikrotik.com/t/disappointing-wireguard-performance/181394/1

TL;DR: CPU frequency scaling was playing havoc with performance until I set both to run at 1.4GHz.

With regard to hardware offloading: if you are going to have multiple tunnels then it’s probably best to go with IPsec over Wireguard, because just one WG tunnel will hammer the CPU even under optimal conditions.

You wrote “& Other Tunneling” in the subject, but I see only a discussion about Wireguard.
Did you test other tunneling? Like plain IPsec tunnels, plain GRE tunnels, GRE/IPsec tunnels, etc?

I am currently focused on wireguard, but this discussion spawned from the 7.19rc thread where others using other tunnelling methods commented. I’m hoping they will bring their expertise, use-cases and performance results with those tunnel implementations into this thread.

My next step is to push an Alpine-based container with a simple wireguard config onto my CCR2216 and check what the performance is there, to rule out whether this is an issue with the RouterOS implementation of wireguard, which AFAIK should just be the Linux kernel implementation.

I just wish MikroTik would expand its “Test Results” section to include VPNs beyond IPsec, or publish some doc/white-paper on performance.

Part of the problem here is that each hardware platform may yield a different result as to which is “best”, and it also depends on whether your use case is closer to “few-vpn-users + bandwidth-heavy” or “many-vpn-users + light-traffic”. Obviously more practical concerns come in too, i.e. whether you need L2 and/or multicast and/or have existing topologies … all of which MAY drive a VPN decision. Some data on the performance of VPNs [and other things] would help when buying new hardware. But a baseline for L3/IP unicast VPNs would be a good starting place if one did have a choice of VPNs.

The test results really could use some improvement. I mean, it is “a standard test” but it mentions nothing about the test environment.
(e.g. RouterOS version is not specified at all)

TL;DR - WireGuard encryption performance, general info about hardware acceleration and WireGuard throughput testing.

The WireGuard encryption

The ChaCha20 cipher used in WireGuard currently lacks hardware acceleration on most platforms and architectures, which means it runs entirely in software, including on ROS v7. Even if ChaCha20 is efficient, it’s still a CPU-heavy algorithm. When the CPU on either end of a WireGuard tunnel hits its limit, that becomes the bottleneck for throughput. This applies even to high-end devices like the CCR2216, which can only handle a limited number of WireGuard tunnels before the CPU maxes out, unlike IPsec with AES hardware offload, where the same device can handle hundreds of tunnels with virtually no CPU load at all.
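
To give an idea of what "runs entirely in software" means here: ChaCha20 is built from 32-bit additions, XORs and rotations, and a 20-round block performs 80 of the quarter rounds shown below for every 64 bytes of keystream. This is a plain Python sketch of the quarter round as specified in RFC 8439, checked against the test vector from that RFC. It is not the code WireGuard or RouterOS actually runs, which would be an optimized kernel implementation.

```python
MASK32 = 0xFFFFFFFF  # keep all arithmetic within 32 bits


def rotl32(x: int, n: int) -> int:
    """Rotate a 32-bit word left by n bits."""
    return ((x << n) & MASK32) | (x >> (32 - n))


def quarter_round(a: int, b: int, c: int, d: int) -> tuple:
    """One ChaCha quarter round: four add/xor/rotate steps on four words."""
    a = (a + b) & MASK32
    d = rotl32(d ^ a, 16)
    c = (c + d) & MASK32
    b = rotl32(b ^ c, 12)
    a = (a + b) & MASK32
    d = rotl32(d ^ a, 8)
    c = (c + d) & MASK32
    b = rotl32(b ^ c, 7)
    return a, b, c, d


# Test vector from RFC 8439, section 2.1.1
result = quarter_round(0x11111111, 0x01020304, 0x9B8D6F43, 0x01234567)
print([hex(w) for w in result])
# ['0xea2a92f4', '0xcb1cf8ce', '0x4581472e', '0x5881c4bb']
```

Every packet through the tunnel pays for many of these operations on the general-purpose CPU cores, which is why throughput tracks raw CPU power so closely.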

Future hardware acceleration for ChaCha20

Some ARM chip manufacturers, like Ampere and Qualcomm, have implemented dedicated ChaCha20 instructions in certain models as part of the ARMv8.6-A architecture to enable hardware acceleration. To benefit from this feature, you need Linux kernel 6.2 or newer and ARM hardware with FEAT_CHACHA20 support. But since ROS v7 uses Linux kernel 5.6.3, it can’t support FEAT_CHACHA20. This means it can’t take advantage of ChaCha20 acceleration even if the hardware supports it. Important note: both ends of the tunnel must support it in order for the acceleration to be fully utilized.

IPsec AES hardware acceleration

IPsec uses AES encryption, which can be hardware accelerated. Most Intel CPUs and many ARM chips used in networking gear do support AES acceleration, but it’s often missing in cheaper devices.

Important note: both ends of the tunnel must use the AES hardware offload in order for the acceleration to be fully utilized. If AES hardware acceleration isn’t an option, WireGuard is often the better choice, since ChaCha20 in software is lighter on the CPU and typically performs better, especially on low-end devices.

Testing WireGuard throughput

If you want meaningful test results that clearly show a performance change in WireGuard, for example after RouterOS updates, all tests must be performed in exactly the same way. That means the exact same configuration, the exact same test environment, the exact same method, including things like single or multiple streams, CPU usage etc. Otherwise, any difference may not be related to ChaCha20 at all.
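
One way to keep yourself honest about this is to record the test conditions next to every result and refuse to compare runs that differ in anything other than the RouterOS version. Below is a minimal Python sketch of that idea; the field names and the list of controlled conditions are my own assumptions, so adjust them to whatever your test setup actually varies.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class BenchRun:
    routeros_version: str   # the one thing we intend to vary
    hardware: str
    streams: int            # iperf3 parallel streams
    direction: str          # "forward" or "reverse"
    duration_s: int
    throughput_gbps: float  # measured result
    cpu_percent: float      # measured result


# Conditions that must be identical for a comparison to mean anything
# (an assumed list, extend it as needed).
CONTROLLED = ("hardware", "streams", "direction", "duration_s")


def mismatched_conditions(a: BenchRun, b: BenchRun) -> list:
    """Return the controlled fields on which the two runs differ."""
    da, db = asdict(a), asdict(b)
    return [name for name in CONTROLLED if da[name] != db[name]]


def throughput_change(a: BenchRun, b: BenchRun) -> float:
    """Relative change from run a to run b; refuses non-comparable runs."""
    diff = mismatched_conditions(a, b)
    if diff:
        raise ValueError(f"runs differ in: {', '.join(diff)}")
    return (b.throughput_gbps - a.throughput_gbps) / a.throughput_gbps
```

If `mismatched_conditions` returns anything, the difference in throughput cannot be pinned on the update alone.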

(same claim for ChaCha20)

Can you explain why that is the case? I would think that as long as the acceleration instruction or dedicated hardware implements the published standard, it does not matter what the other side is doing.
Especially in the case where there is a “central router” that has many tunnels, and many other routers that each terminate one of these tunnels, it should work well when the central site has the hardware offloading and all of those client routers do not.

Not really IMHO.
You always have to look to these end-to-end connections as point-to-point.

If you try to squeeze a lot more into the pipe than the other end can pull out, you get a bottleneck.
So the throughput will settle on the lowest value that all ends can sustain.

If your central router can blast through 1Gbps but the other end can only digest 100Mbps, the latter is what the central router will also have to use for THAT connection.

And then you need to take into account all concurrent connections, if there are multiple.
So it can (theoretically) be possible your central router can handle 1Gbps with 10 concurrent connections of each 100Mbps.

If two of those connections are communicating with each other via the central router and one is only able to get to 75Mbps, that’s also what the other is going to have to use, regardless of the capabilities of the central router.
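
The reasoning above can be written down as a very simple model in Python. It is a deliberate simplification: it assumes each device has a single fixed crypto capacity in Mbps, and that an oversubscribed central router scales all tunnels down proportionally, which real traffic will not do exactly. The function names are made up for illustration.

```python
def tunnel_rate(end_a_mbps: float, end_b_mbps: float) -> float:
    """A point-to-point tunnel runs at the pace of its slower end."""
    return min(end_a_mbps, end_b_mbps)


def relayed_rate(spoke_a_mbps: float, spoke_b_mbps: float, hub_mbps: float) -> float:
    """Spoke-to-spoke traffic via the central router: slowest of all three."""
    return min(spoke_a_mbps, spoke_b_mbps, hub_mbps)


def hub_rates(hub_capacity_mbps: float, spoke_caps_mbps: list) -> list:
    """Per-tunnel rates when all spokes pull at once.

    If total demand fits within the hub capacity, each spoke gets its own cap;
    otherwise every tunnel is scaled down proportionally (assumed policy).
    """
    demand = sum(spoke_caps_mbps)
    if demand <= hub_capacity_mbps:
        return list(spoke_caps_mbps)
    scale = hub_capacity_mbps / demand
    return [cap * scale for cap in spoke_caps_mbps]


print(tunnel_rate(1000, 100))          # 100
print(relayed_rate(100, 75, 1000))     # 75
print(hub_rates(1000, [100] * 10))     # ten tunnels at 100 each
```

The per-tunnel limit comes from the weaker end, while the central router only becomes the limit once the sum of all tunnels exceeds what it can process.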

When you have one central router that has to push out 100Mbps to each of 1000 client routers, it does not matter that each client router is CPU-limited, as long as your central router can feed the aggregate 100Gbps (for which it has to use acceleration).

Correct when you look at it that way.

@pe1chl: While it’s true that hardware acceleration generally offloads the VPN concentrator itself, the speed of each connection is still limited by its slower endpoint. In other words, to achieve maximum throughput for an IPsec tunnel, both ends still need to utilize AES hardware acceleration.

What if the other end is a 112-core x86 server (just as an example)? “Maximum throughput” is a relative term. Hardware acceleration also has its limits for a given device.

Yeah, or maybe the new IBM z17 or why not the El Capitan? You see my point?

Another potential factor is that internet traffic flows are often asymmetrical (i.e. consumer WAN links typically have more download than upload).

This leads to a related question… is there any difference in CPU usage between upload and download?

IDK with WG, but typically there is a difference in CPU load between encoding/encrypting (upload) and decoding/decrypting (download)… Does anyone know which direction is “faster” (i.e. lower CPU), or if they are the same? Download using less CPU than upload would be my bet, but IDK.

WireGuard’s ChaCha20 doesn’t differentiate between ingress and egress traffic, so performance in either direction just boils down to traffic density and normal stuff like queue depth, tx/rx ring size, etc. Parallelism, on the other hand, is purely an implementation detail, unless you’re running the standard WG Linux kernel module.

I was simply saying that such a blanket statement is incorrect. My point (and the point of some others) is to use what you have available to achieve the necessary throughput. A CCR2004 can do around 2 Gbps per single tunnel in hardware. If the other end is a beefy server, not only does it not need hardware acceleration, it may do a lot better than 2 Gbps without any acceleration, making the HW-accelerated end the bottleneck.

On a router that also does a lot of things other than VPN, I agree, acceleration is a valuable asset and scales much better.

What I was trying to say, if my comment wasn’t clear, is that you’re mixing apples and oranges.

A more useful comparison might be something like USD per WG Mbit or along those lines. Since ChaCha20 throughput depends directly on raw CPU power, more power means more $$$, simple as that.

Have a look at any of the VPN service providers offering WireGuard and you’ll see what I mean.

Last but not least, anyone claiming WireGuard is a drop-in replacement for IPsec in an enterprise environment is way off track.

If the CPU on one side takes a piss on the other side’s hardware accelerated performance, that statement becomes invalid fast. Why would you even write crap like that.

The weaker side will always set the pace no matter what, especially when you’re talking about low-end devices used by hobbyists.

In that case, I wouldn’t even consider IPsec. I’d go with WireGuard instead, like I mentioned in my first post.