Hi all, I'm looking to do some tunneling, but my testing keeps running into limitations. The target is 2Gbps aggregate, i.e. 1Gbps 'full duplex'.
First, note that I'm using the built-in bandwidth test, so results should be a bit higher when the box is only handling the tunnel. The RB5009 can do 5Gbps on the TCP test and 8.7Gbps on the UDP test, though, so the testing shouldn't steal more than 10-20% of the performance.
First issue: I need to handle fragmentation seamlessly and transparently. The connections run over 1500 MTU commodity services and I need to pass full frames. I also don't need encryption, since the data being carried is already encrypted.
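To make the fragmentation cost concrete, here is a rough Python sketch of how a full-size packet splits once tunnel overhead pushes it past a 1500-byte path MTU. It assumes an IPv4 underlay with no IP options, a 1500-byte inner IP packet, and the basic 4-byte GRE header (no key or checksum); the function names are just for illustration. Other encapsulations can be added to the table with their own overhead.

```python
IPV4_HEADER = 20  # bytes, assuming no IP options

# Per-packet encapsulation overhead in bytes on an IPv4 underlay (assumed values).
OVERHEAD = {
    "ipip": 20,  # outer IPv4 header only
    "gre": 24,   # outer IPv4 header + 4-byte basic GRE header
}

def fragment_sizes(outer_len, mtu=1500):
    """Return the on-wire IPv4 packet sizes after fragmenting outer_len bytes."""
    if outer_len <= mtu:
        return [outer_len]
    payload = outer_len - IPV4_HEADER
    # Every fragment except the last must carry a multiple of 8 payload bytes.
    per_frag = (mtu - IPV4_HEADER) // 8 * 8
    sizes = []
    while payload > 0:
        chunk = min(per_frag, payload)
        sizes.append(chunk + IPV4_HEADER)  # each fragment gets its own IP header
        payload -= chunk
    return sizes

def tunneled_fragments(inner_len, tunnel, mtu=1500):
    """Fragment sizes for an inner packet of inner_len bytes sent through tunnel."""
    return fragment_sizes(inner_len + OVERHEAD[tunnel], mtu)

for tunnel in OVERHEAD:
    print(tunnel, tunneled_fragments(1500, tunnel))
```

Under these assumptions every full-size inner packet becomes one full fragment plus one tiny one, so the box handles twice as many packets on the underlay side as on the tunnel side.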
My test boxes and target platform are RB5009s, though a CCR2xxx could also be used, i.e. one of the modern arm64 devices. For my tests I'm using the SFP+ 10G port, so that's not a limitation.
I can get about 1Gbps aggregate in a bidirectional UDP test out of ZeroTier, about half my target. It is multi-threaded, but uses 100% of 4 CPUs to get there. I tried a CCR2116 and got similar results; it looks like it won't scale past 2 cores per direction. One direction maxes out 2 cores at ~418-506Mbps.
Next up is WireGuard. WireGuard can do about 1.5Gbps aggregate with 1600-byte UDP packets (i.e. all fragmented) and about 1Gbps one way on the same test. The problem is that it doesn't do layer 2, and stacking something on top brings it down to about ZeroTier level. It does appear to be multi-threaded; it pushes up CPU usage on all 4 cores.
L2TP is fast with encryption off, but it won't get out of a single core. UDP encapsulation hits 4 cores but decapsulation only hits one: that's 4x 50% usage on the sending side and 1x 96% on the receiving side. TCP seems to be limited to 1 core on the sending side too, so that means a 700-800Mbps speed limit. I just don't know if there's a reasonable way around this. L2TP can't go into a bonding interface, which would have a CPU cost anyway.
GRE seems to do much of what L2TP does. However, if I send much over 1Gbps through GRE, the tunnel quits and I have to wait 20 seconds or so for it to re-establish. The sending side sits at 4x 50% usage.
IPIP is the same as GRE in that I can't get much more than 1Gbps over it, but it does get a bit closer to 1Gbps one way and 1.6Gbps aggregate (again, this is with fragmentation). IPIP appears to be single-threaded.
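For a sense of what these rates mean in packets, here is a back-of-the-envelope sketch in Python. It assumes 1500-byte inner packets, that each one splits into two outer fragments, and it ignores Ethernet framing overhead; the helper name is made up for illustration.

```python
def packets_per_second(bits_per_second, packet_bytes, fragments_per_packet=1):
    """Approximate packet rate needed to carry a given throughput."""
    return bits_per_second / (packet_bytes * 8) * fragments_per_packet

# 1Gbps one way with 1500-byte inner packets (assumed size).
inner = packets_per_second(1e9, 1500)
# Same traffic on the underlay if every packet becomes two fragments.
outer = packets_per_second(1e9, 1500, fragments_per_packet=2)

print(f"inner: {inner:,.0f} pps, outer fragments: {outer:,.0f} pps")
```

If a tunnel type is pinned to one core, that core has to keep up with the whole outer packet rate for its direction, which may be part of why the single-threaded options top out around 1Gbps.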
Any ideas here? There aren't a lot of options for faster boxes, especially when single threading is part of the deal. Ideally I want to be at layer 2 and use the MikroTik to bridge the tunnel to an Ethernet port, so the IP-only options aren't ideal. WireGuard seems to extract the most from the hardware but needs a layer 2 tunnel over the top.
Thanks.