Wireguard number of tunnels - scale?

TLDR: What hardware for 400 (largely idle) Wireguard tunnels?

I've been asked to set up a bunch of little MCUs, each connected back to home base via a Wireguard tunnel. Normally there would be very little traffic down each tunnel other than keepalives and ping monitoring - they'd only be accessed via the web interface if things went wrong at the site they were monitoring.
Thing is, they're talking about maybe 400 of them....so my initial idea of a hAP ax2 may not scale!
What do I need to handle that sort of number of WG peers?

Many thanks,
Gareth

I've actually done some setups like the one you propose. Wireguard itself is pretty well behaved, and it doesn't have any problem with a large number of tunnels. Basically only traffic counts. At larger scale, the number of handshakes to perform may come into play.

An ax2 is easily capable of handling this.

For me at least, other factors were of much more concern than raw power. If there is a central device that handles this many clients/branches, then I'd want something that's highly reliable. Even if a hardware device is chosen, the ax2 doesn't necessarily win, given how it's built. An rb5009 might be a better choice, with a better power supply and a backup battery.

Another concern is the overall management (automated if possible) and having appropriate backups in case of failure. In some scenarios a hardware device is clearly preferred, but when one has so many devices, there's often already a solution for virtualizing appliances in place. I really don't know anything about your situation, but using a Linux VM as the concentrator solves quite a few issues.


Cheers Lurker, that's much better news than I was expecting to be honest! :smiley: There are a lot of possibilities: an RB5009 could replace their existing router, or we could go CHR as a VM on their new Windows server, or WG on a Linux VM. Time is short (aren't these things always in a rush?!) so I was thinking simply of what was tested and familiar - still working on the PCB design (remote-operated relays) currently!
Automated bulk deployment definitely going to be more than a nice-to-have!
MCUs can only do a single WG tunnel (power draw, thermals and cost all a consideration) so I was thinking DR failover by pointing DNS to another host.

In that case, an ax2 is completely fine. Although whatever device you decide on, I would insist on a dedicated device to act as the Wireguard concentrator. If for nothing else than for ease of management, updating the software only when you decide, etc.

If failover (connecting to another server when the primary fails) on the device side is an option, it's definitely worth implementing. However, I would suggest failover on the server side, either via failing the IP over, or with DNS.

Managing Linux automatically is much nicer than doing the same on Mikrotik - I mean things like adding the Wireguard peers based on some sort of standard database.
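As a minimal sketch of what that automation might look like on a Linux concentrator (the SQLite table name and columns are made up for illustration; `wg set` is the standard wireguard-tools command):

```python
import sqlite3
import subprocess

def wg_set_args(iface, pubkey, allowed_ip, keepalive=25):
    """Build the argument list for one `wg set` peer entry."""
    return [
        "wg", "set", iface,
        "peer", pubkey,
        "allowed-ips", allowed_ip,
        "persistent-keepalive", str(keepalive),
    ]

def sync_peers(db_path, iface="wg0", dry_run=True):
    """Read peers from a (hypothetical) SQLite `peers` table and apply
    each one with wireguard-tools. With dry_run=True it only returns
    the commands it would have run."""
    conn = sqlite3.connect(db_path)
    cmds = []
    for pubkey, allowed_ip in conn.execute(
        "SELECT pubkey, allowed_ip FROM peers"
    ):
        cmd = wg_set_args(iface, pubkey, allowed_ip)
        cmds.append(cmd)
        if not dry_run:
            subprocess.run(cmd, check=True)  # needs root + wireguard-tools
    conn.close()
    return cmds
```

The same loop could just as easily emit a full config for `wg syncconf` instead, which is atomic per interface.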

Actually, Hyper-V has some really nice high-availability features, so I'd look into running a Linux VM with HA on Hyper-V... Even a non-HA VM has many advantages over pure hardware, so if HA isn't a hard requirement, I'd skip it for the longer term.

EDIT: There's also something else to pay attention to. Wireguard has a not so widely known requirement to include a pseudo-timestamp in handshake requests. This is a number (usually derived from the system time) that has to monotonically increase for each subsequent handshake. This is done to prevent replay attacks. For hardware devices that might be restarted at any moment, this is something that has to be considered at design time.
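One common workaround for RTC-less devices, sketched below (file path and bit layout are my own invention, not from any WireGuard implementation): persist a counter in flash and combine it with uptime, so the value never regresses even when the wall clock resets at boot.

```python
import json
import time

def handshake_counter(state_file):
    """Return a value that increases monotonically across reboots on a
    device with no RTC: a counter persisted to 'flash' (here, a JSON
    file) forms the high bits, seconds since boot the low bits.
    Incrementing the persisted counter on every call is a
    simplification; a real device would bump it once at boot and rely
    on uptime within a boot."""
    try:
        with open(state_file) as fh:
            boots = json.load(fh)["boots"]
    except (OSError, ValueError, KeyError):
        boots = 0  # first boot, or state lost
    boots += 1
    with open(state_file, "w") as fh:
        json.dump({"boots": boots}, fh)
    return (boots << 32) | int(time.monotonic())
```

SNTP (as the MCU firmware requires) sidesteps this entirely by recovering real time before the first handshake.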


Failover on the device side isn't an option as it can only have one peer defined and one endpoint address (I just checked that in case it could round-robin) but setting the endpoint as a DNS address and just re-pointing that if there's going to be extended outage is definitely plausible - just need a 2nd site to point to for DR if that's in budget.

Timestamp: that would explain the WG component's requirement for SNTP then. There's no RTC on the device.

HA: it's essentially a remote reboot system....a get-out-of-jail-free card. Any time a remote reboot fixes the problem, it's a site visit saved. So there's going to be a trade-off between theoretical uptime and cost.....but not my trade-off to make. I'm at least trying to make things cheap enough that failover/DR is plausible though :smiley: MCUs are ÂŁ9 apiece (single-piece price), already with case, button, RGB LED - just flash and drop on PCB pins.

Well, if it's not an online control situation, but more of an opportunistic thing, I wouldn't focus on HA too much, and a hardware device makes a lot more sense. I'd still go for having a device dedicated to this function.

Using DNS for the endpoint is simply good practice to avoid endpoint reconfigurations.
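For illustration, a wg-quick-style peer stanza on the MCU side might look like the sketch below (hostname, key and addresses are placeholders). One caveat: standard WireGuard tooling resolves the endpoint name once when the interface comes up, so repointing DNS only takes effect when the tunnel is re-established; wireguard-tools ships a contrib `reresolve-dns` script for exactly this situation.

```ini
[Peer]
; Endpoint by DNS name, so DR failover is a DNS repoint, not a reconfig
PublicKey = <concentrator-public-key>
Endpoint = wg.example.invalid:51820
AllowedIPs = 10.66.0.1/32
PersistentKeepalive = 25
```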

If you're so inclined, you can have a look at the REST API interface of the router. You can use it to issue CLI commands on the Mikrotik. This allows you to write some simple scripts to keep the list of peers updated.
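As a hedged sketch of that approach (router address, credentials, interface name and keys below are placeholders): RouterOS v7 exposes WireGuard peers under `/rest/interface/wireguard/peers`, and a PUT with a JSON body creates one. Using only the Python standard library:

```python
import base64
import json
import urllib.request

def add_peer_request(router, user, password, interface, pubkey, allowed_address):
    """Build (but don't send) a RouterOS v7 REST request that creates a
    WireGuard peer. Send it with urllib.request.urlopen(req) against a
    real router (typically over HTTPS with a valid certificate)."""
    url = f"https://{router}/rest/interface/wireguard/peers"
    body = json.dumps({
        "interface": interface,
        "public-key": pubkey,
        "allowed-address": allowed_address,
    }).encode()
    req = urllib.request.Request(url, data=body, method="PUT")
    # RouterOS REST uses HTTP basic auth
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    req.add_header("Authorization", f"Basic {token}")
    req.add_header("Content-Type", "application/json")
    return req
```

A cron job that diffs the desired peer list against a GET of the same endpoint gets you most of the way to automated bulk deployment.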

Cheers Lurker. I'll have a look at that next.

Given the size of your project, is an extra ÂŁ80-90 to get an RB5009 over an AX2 an issue? I see the RB5009 is ÂŁ164 at Amazon UK right now; it's a bit beefier, with multiple powering options.

For this sort of application, I think a new hEX / hEX S is ideal. Of course one can never go wrong with an rb5009 :slight_smile:

I can't see there being a problem (not my budget mind!) going for that over the AX2. The question was more that given the large number of peers, do I need an AX2, an RB5009, a CHR, a ROSE? Just for a change, it isn't about throughput numbers it's about number of concurrent connections....and that's not something I've had prior experience of or have any good way of testing.

One thing to note is that MikroTik’s current WG implementation appears to be single-threaded. A friend has been testing throughput capabilities on various high-end routers and found the 2GHz 16-core CCR2116/2216s max out at about 1.2-1.5Gbps of throughput, pegging a single core while the rest twiddle their thumbs. As his comparison, he loaded up VyOS on a Minisforum MS-01 (16ish-core Intel x86 box) and is able to push 2.5Gbps on that setup. We are both going to do more testing to figure out ways to spread the CPU load across multiple cores, since WG is supposed to be multithreaded.

In a WG mesh network that I’ve built, we have redundant endpoints for about 30-40 sites. The site routers include hAP AC2’s, 2011’s, 4011’s, and 5009’s, with redundant tunnels back to a CCR2116 and a x86 CHR VM running on ESXI with a 3.47GHz Xeon CPU. It’s also a low-bandwidth solution (2-way radio, 56Kbps streams at most), so we’re not taxing any of the equipment by any means. We have a dedicated WG interface in RouterOS for each remote endpoint so we can properly run OSPF and route any subnet anywhere we need.

For your concentrator(s), with 400 tunnels, I’d recommend a router with a high CPU clock speed, such as a CHR VM on a PC, or a purpose-built WG concentrator (vanilla Linux distro with some open-source WG management tool). Not so much because you need the throughput, but to ensure that if things are single-threaded, you’re not overwhelming a small router.


I’ve explored this path as well although not with 400 tunnels, but closer to 25 per container instance. I initially tried with RouterOS and I also encountered the single-threaded bottleneck on the CCR2216. My solution was to move WireGuard (WG) termination off the router and into a dedicated container that runs on some other hardware. There I simply map /etc/wireguard (or /etc/wireguard/config, depending on your setup) from the host into the container, where all configuration files reside. The container then acts as the WG terminator.

With this approach, I can reuse the same container across different hardware configurations (some are running on some Zimaboard 2, and some on higher end consumer chips for maximum CPU performance). For visibility and logical separation, I created VLANs on my firewall to represent each “interface,” then mapped the container’s network to use a specific VLAN per instance.

Aside from more performance I also get upstream fixes and improvements as soon as they’re released rather than waiting for them to be backported to the RouterOS kernel. In testing, I’ve achieved ~11 Gbps on a single WireGuard tunnel over LAN using an Intel Core i9-14900 on a Lenovo P3Tiny.


Oh, also, just to clarify: Wireguard, afaik, is single-threaded in terms of encryption/decryption for a single client stream. This is how it has to be, because it does stream-cipher encrypt/decrypt. But the issue with the CCR/RouterOS implementation is that all the other steps, like moving packets in/out of the CPU and routing, end up localized on the same core that is doing the encrypt/decrypt, so you get suffering performance. If those ancillary tasks, which can be multi-threaded, were moved off that core, we would get max performance within the limitations of the hardware.

So... first of all: OP wants to keep 400 tunnels alive and idle, so is not really interested in the maximum throughput that can be achieved. A quick calculation tells us that such a router must support 2-3 handshakes / second and (depending on the keepalive timer) 30 packets of encryption/decryption per second. Even the cheapest Mikrotik can do handshakes in the thousands and packet crypto operations measured in the tens of thousands.
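That back-of-the-envelope calculation, spelled out (assuming WireGuard's default rekey interval of roughly 120 seconds and a 25-second persistent keepalive):

```python
PEERS = 400
REKEY_SECONDS = 120      # WireGuard re-handshakes roughly every 2 minutes
KEEPALIVE_SECONDS = 25   # typical persistent-keepalive setting

handshakes_per_sec = PEERS / REKEY_SECONDS      # spread across all peers
keepalives_per_sec = PEERS / KEEPALIVE_SECONDS  # per direction

print(f"{handshakes_per_sec:.1f} handshakes/s, "
      f"{keepalives_per_sec:.0f} keepalives/s per direction")
```

Roughly 3.3 handshakes and a few dozen tiny crypto operations per second, i.e. nothing for any current router.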

So OP's question is settled.

Mikrotik uses the Linux in-kernel implementation of Wireguard. It's true that a newer version was recently backported, and while it does contain some limited performance improvements, they're nice to have but ultimately not really significant.

As the Linux in-kernel Wireguard module is multithreaded, so is Mikrotik's.

Wireguard uses Chacha20 as its cipher, which is indeed a stream cipher. This has no bearing on parallelism. Each packet is encrypted as a new "stream" with a new nonce; therefore, yes, the encryption of a single packet is nearly impossible to parallelize, but multiple packets can be handled in parallel just fine. BTW it's the same with AES-CBC, where because of the CBC (cipher block chaining), each block operation can only commence once those preceding it are complete and their results are available.

The standard wireguard module works like this:

  1. packets are received and some networking operations are done
  2. in the wireguard module, certain preliminary operations are performed
  3. the packets are distributed to different cores for the cryptographic operations
  4. encryption/decryption happens in parallel
  5. the packets are collected and their sequence is restored
  6. some further networking is needed to emit them
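The fan-out/fan-in of steps 3-5 can be sketched in Python, with a toy XOR transform standing in for the real ChaCha20-Poly1305 (and with the caveat that Python threads won't actually parallelize CPU-bound work; this only illustrates the structure):

```python
from concurrent.futures import ThreadPoolExecutor

KEY = 0x5C  # toy single-byte "key"; real WireGuard uses ChaCha20-Poly1305

def toy_encrypt(packet):
    """Stand-in for per-packet AEAD: each packet is independent,
    which is exactly why packets can go to different cores."""
    seq, payload = packet
    return seq, bytes(b ^ KEY for b in payload)

def encrypt_batch(packets, workers=4):
    """Fan per-packet work out to a pool, then restore the original
    sequence, mirroring steps 3-5 of the kernel module's pipeline."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(toy_encrypt, packets))
    # pool.map already preserves input order; sort defensively by seq
    return [payload for _, payload in sorted(results)]
```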

What leads people to believe that it's single-threaded is this: even Mikrotik's "most powerful" CPUs are in fact objectively weak (you can look them up on cpubenchmark), so the part where the packets are received, distributed, collected and sent, i.e. the single-threaded part, can easily limit throughput. The other reason is that Mikrotiks contain NICs that distribute received packets based on a packet hash, which is calculated from the packet header of the encapsulating wireguard packet.

Yes, Mikrotik's hw devices have very nice integrated NICs, but they absolutely do not come with beefy cpus. But that's basically the whole story.


Yes. The title of the post, however, does not mention that, and so you know people will be coming here looking for all the same things the rest of us did. But agreed, OP’s issue is (mostly) addressed. I still think at least a CCR2004 or RB5009 would be a good option for a concentrator, or a CHR VM, just in case.

Since the topic mentions something I’m looking at doing (X00 WG tunnels to a host, with multi gig needs) and the bottleneck issue came up in a conversation I had just yesterday, I decided to share what had been observed.

I personally had planned to run some tests of my own with CCR2116’s and CHR’s, both on ARM64 and x86, to see for myself what else can be done on the performance front, either with RouterOS or with vanilla Linux installs. The feedback on this thread has certainly helped there.

That was my initial suggestion as well. If you want a vpn concentrator, do what people who do vpn concentrators do. Which is to get a mainstream CPU. Virtualized or non-virtualized. If you want the max cpu/USD, go x86-64, if you want small footprint or passive, go to the mini pcs, if you're all about efficiency, consider a server arm (if you can wrestle one from the greedy little claws of the cloud people.)

While you're at it, you'll need a solution to manage the connections, or maybe have some additional protocol like Tailscale or netbird, so why not go with one of the mainstream distros.

If you want to support this device in terms of networking, you can look at what Mikrotik offers, but they simply do not offer a useful device to use as the vpn concentrator itself.

EDIT: By the way, OP's use case is not really strange or unique. I've seen a lot of environments that require a higher level of security (thus wireguard) but only want some monitoring, telemetry, remote control, KVM, etc. traffic. So lots of devices but minimal traffic turns out to be pretty popular.


@cenedd let us know how it goes!

Will do. Just sorting out the PCB design currently - a few relays driven off GPIO pins essentially. Power goes through a Normally Closed pin and the relay can open the circuit for a power-cycle. Jack plug's easy but the USB-C (with PD) took a bit of reverse engineering and ignoring the docs - link the VBUS connections through the relay, link the grounds and then just straight through with the CC1 and CC2 pins; no resistors required... although the reference design had a couple of small caps between the power and ground, probably for smoothing.
Once I can get some prototypes ready, I can hit the networking side of it for some proof of concept. If all goes well, a bucket load of them to order!

How great it is to read about this situation. I was thinking of asking something similar: I have a comparable scenario, with several clients connecting via WireGuard to an IP PBX solution using softphones - low traffic, voice only (RTP and SIP packets). Each client accesses their own dedicated VLAN. It's been running very well; I currently have 48 simultaneous connections on an RB4011, and CPU usage averages around 1%.

If I understand you correctly, you're operating the NC relay straight from the 5V of a USB C.

In this case, it's standard (i.e. polite) to include a capacitor. My suggestion would be a 22uF tantalum. It's proper to include a pair of resistors (from CC1 and CC2 individually to GND) of 5k1. A diode (or, more properly, a zener diode) to clamp the inductive back-EMF is also required.

Anyway, my point is that if you wish, you can post your schematic, and I can verify if there's something you've missed.

EDIT: Corrected resistor value.