
IPSec throughput

Posted: Thu Oct 25, 2018 11:07 pm
by merlinthemagic7
Hi,

We need help understanding why our IPsec performance takes such a dramatic drop when deploying hardware-assisted IPsec units.

We are getting ready to deploy ~300 hAP ac² units to remote offices around the country. They all trunk data via IPsec to datacenters around the country, using CCR1036-8G-2S+ units as concentrators, but we are getting nowhere near the throughput we expected. All units run 6.43.4, but the issue is the same with 6.42.5.

The hAP ac² units (single tunnel, AES-256-CBC, SHA256) should max out around 380 Mbit/s. I looked at the configuration used to achieve that figure and know it is routing-only, stripped down to nothing, so we expect a performance hit when adding any feature that requires connection tracking. We are a bit lost, though, because we see the same drop when we strip our config down to IPv6 only and remove the 3 rules making up the default-deny firewall, so that no connection tracking is taking place.

We started rolling out in a major metropolitan area and have been testing sustained throughput using multi-threaded file transfers from multiple hosts on both sides. With no IPsec, just plain routing on both IPv4 and IPv6, we get roughly 360 Mbit/s sustained; switch on IPsec and throughput drops to ~120-140 Mbit/s. Changing to SHA1 and AES-128-CBC yields the same result.

The bottleneck appears to be a single CPU core on the hAP ac²: we see 3 cores more or less idling and one core redlining. Keep in mind we are testing with multiple hosts on each side to eliminate any single-thread limitations.

Our main suspect was the IPv4 MSS mangle rule, but that theory has a flaw: it cannot explain why IPv6 throughput drops just as much.

How would you optimize the config to improve IPSec throughput?

/interface bridge
add fast-forward=no name=Loopback
add fast-forward=no name=SecureLAN
/interface list
add name=public
/ip ipsec peer profile
add dh-group=modp2048 enc-algorithm=aes-256 hash-algorithm=sha256 name=SecIPv4Client
add dh-group=modp2048 enc-algorithm=aes-256 hash-algorithm=sha256 name=SecIPv6Client nat-traversal=no
/ip ipsec proposal
set [ find default=yes ] disabled=yes pfs-group=modp2048
add auth-algorithms=sha256 enc-algorithms=aes-256-cbc name=Transport pfs-group=none
/ipv6 pool
add name=DefaultIPv6 prefix=xxxx:xxxx:xxxx:xxxx::/56 prefix-length=64
/interface bridge port
add bridge=SecureLAN interface=ether2
add bridge=SecureLAN interface=ether3
add bridge=SecureLAN interface=ether4
add bridge=SecureLAN interface=ether5
add bridge=SecureLAN interface=wlan1
add bridge=SecureLAN interface=wlan2
/interface bridge settings
set allow-fast-path=no
/ip neighbor discovery-settings
set discover-interface-list=!public
/ip settings
set allow-fast-path=no
/interface list member
add interface=ether1 list=public
/ip address
add address=192.168.0.1/24 interface=SecureLAN network=192.168.0.0
add address=xxx.xxx.xxx.xxx interface=Loopback network=xxx.xxx.xxx.xxx
/ip firewall address-list
add address=10.0.0.0/8 list=RFC1918
add address=192.168.0.0/16 list=RFC1918
add address=172.16.0.0/12 list=RFC1918
add address=192.168.0.0/24 list=LanSubnets
add address=xxx.xxx.xxx.xxx list=TransportIPv4
/ip firewall mangle
add action=change-mss chain=forward dst-address-list=!RFC1918 new-mss=1382 passthrough=yes protocol=tcp src-address-list=LanSubnets tcp-flags=syn tcp-mss=!0-1382
/ip firewall nat
add action=src-nat chain=srcnat dst-address-list=!RFC1918 src-address-list=LanSubnets to-addresses=xxx.xxx.xxx.xxx
/ip firewall service-port
set ftp disabled=yes
set tftp disabled=yes
set irc disabled=yes
set h323 disabled=yes
set sip disabled=yes
set pptp disabled=yes
set udplite disabled=yes
set dccp disabled=yes
set sctp disabled=yes
/ip ipsec peer
add address=xxx.xxx.xxx.xxx auth-method=rsa-signature certificate=ipsecCert.pem_0 profile=SecIPv4Client
add address=xxxx:xxxx:xxxx::xxxx/128 auth-method=rsa-signature certificate=ipsecCert.pem_0 profile=SecIPv6Client
/ip ipsec policy
set 0 disabled=yes
add comment=IPv4 dst-address=0.0.0.0/0 level=unique proposal=Transport sa-dst-address=xxx.xxx.xxx.xxx sa-src-address=0.0.0.0 src-address=xxx.xxx.xxx.xxx/32 tunnel=yes
add comment=IPv6 level=unique proposal=Transport sa-dst-address=xxxx:xxxx:xxxx::xxxx src-address=xxxx:xxxx:xxxx:xxxx::/56 tunnel=yes
/ip service
set telnet disabled=yes
set ftp disabled=yes
set www disabled=yes
set ssh address=xxx.xxx.xxx.xxx/32
set api disabled=yes
set api-ssl disabled=yes
/ipv6 address
add address=::234 from-pool=DefaultIPv6 interface=SecureLAN
/ipv6 firewall address-list
add address=xxxx:xxxx:xxxx:xxxx::/56 list=LanSubnetsIPv6
/ipv6 firewall filter
add action=accept chain=forward connection-state=established,related
add action=accept chain=forward src-address-list=LanSubnetsIPv6
add action=drop chain=forward
/ipv6 nd
set [ find default=yes ] disabled=yes
add advertise-dns=yes interface=SecureLAN other-configuration=yes
/system clock
set time-zone-name=Europe/Lisbon
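As a sanity check on the MSS suspicion: here is a back-of-the-envelope sketch of where a value like the new-mss=1382 in the mangle rule above can come from. The byte counts are our assumptions for this setup (IPv4 ESP tunnel mode, AES-CBC, 128-bit truncated ICV, no NAT-T), not an authoritative RouterOS formula.

```python
def esp_tunnel_mss(mtu: int = 1500, block: int = 16, icv: int = 16) -> int:
    """Worst-case TCP MSS inside an IPv4 ESP tunnel (AES-CBC).
    Rough sketch under stated assumptions, not an official formula."""
    outer_ip = 20                  # outer IPv4 header added by tunnel mode
    esp_header = 8                 # SPI + sequence number
    iv = block                     # CBC IV, one cipher block
    esp_trailer = (block - 1) + 2  # worst-case padding + pad-len + next-header
    inner_ip, tcp = 20, 20         # headers of the encapsulated TCP segment
    overhead = outer_ip + esp_header + iv + esp_trailer + icv
    return mtu - overhead - inner_ip - tcp

print(esp_tunnel_mss())  # 1383 under these assumptions
```

That lands one byte above the configured 1382, so the rule is at least consistent with worst-case ESP overhead (NAT-T UDP encapsulation would shave another 8 bytes). It also confirms the rule is doing its job, which is why it cannot be the cause of the IPv6 drop.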


Re: IPSec throughput

Posted: Fri Oct 26, 2018 12:45 pm
by nescafe2002
There is a related post from emils (MT support):

viewtopic.php?t=97880&p=688672#p688540

[..] Please check the IPsec tunnel performance test manual page to see how maximum throughput numbers are achieved for each product. Adding or enabling any additional RouterOS feature apart from IPsec policies can reduce the throughput significantly. This includes EoIP, L2TP, queuing, firewall, connection tracking, bridging and so on. [..]

Re: IPSec throughput

Posted: Fri Oct 26, 2018 2:29 pm
by merlinthemagic7
Hi,

I read the tunnel performance page and I agree there will be a hit to performance, but I do not believe that is what we are seeing.
It appears far more likely that the bottleneck is the fact that only a single core is utilized on the device.

Maybe I should have asked: how are connections distributed across CPUs in RouterOS? Are IPsec policies handled per CPU? Each SA? I cannot see why that would be the case, but it would explain what we see.
Everything seems to end up on CPU1, while CPUs 0, 2 and 3 sit at <10% utilization. This is with 1 IPv4 transfer and 1 IPv6 transfer running.
cpuUtil.png
It does not matter whether we push traffic from a bunch of hosts to another set, across both protocol families. It would help if we understood how traffic is distributed; then we could focus our efforts on a config that keeps that single core from redlining.

BTW, each of our SAs shows it is taking advantage of the hardware assist (the H flag):
0 HE spi=0xXXXXXXX src-address=xxx.xxx.xxx.xxx dst-address=xxx.xxx.xxx.xxx state=mature auth-algorithm=sha256 enc-algorithm=aes-cbc enc-key-size=256 authkey="xxxxxxxxxxxxxxxxxxxxx"

Here is an example when running without IPSec, load is nicely distributed:
cpuutil2.png


MM

Re: IPSec throughput

Posted: Sun Oct 28, 2018 9:03 pm
by merlinthemagic7
We did a bunch more tests, and it appears CPUs are assigned based on a hash of the source and destination peer addresses.

For example:
Concentrator: yyy.yyy.yyy.253

Client WAN interface IP:
xxx.xxx.xxx.252 results in CPU3 redlining under IPSec load.
xxx.xxx.xxx.254 results in CPU0 redlining under IPSec load.
xxx.xxx.xxx.253 results in CPU2 redlining under IPSec load.
xxx.xxx.xxx.246 results in CPU1 redlining under IPSec load.

I have not been able to determine the hashing algorithm that locks an established peer to a specific CPU, but it's clear that is what's happening.

It makes perfect sense, since with AES-CBC each block depends on the previous block. But then how can anyone ever reach the full potential of multi-core RouterBOARDs? We tried switching to AES-256-CTR, as it is also supported in hardware and should allow parallelization across CPUs, but there was no change: a single core still redlines while the rest sit idle.
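To illustrate the chaining point with toy code (a hash stands in for the AES block operation; only the data-dependency structure matters, not cryptographic strength): in CBC each ciphertext block feeds the next encryption, so encryption is inherently serial, while each CTR block depends only on a counter and could in principle be computed on any core.

```python
import hashlib

def toy_prf(key: bytes, data: bytes) -> bytes:
    # Stand-in for the AES block operation (NOT real crypto).
    return hashlib.sha256(key + data).digest()[:16]

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def cbc_encrypt(key: bytes, iv: bytes, blocks: list) -> list:
    out, prev = [], iv
    for p in blocks:
        prev = toy_prf(key, xor(p, prev))  # needs the previous ciphertext
        out.append(prev)
    return out

def ctr_block(key: bytes, nonce: bytes, i: int, p: bytes) -> bytes:
    # Block i needs only (nonce, i): independent of all other blocks.
    return xor(p, toy_prf(key, nonce + i.to_bytes(8, "big")))
```

Here ctr_block(i) can be computed without ever touching blocks 0..i-1, which is what would make parallelism possible; whether the driver actually exploits that is a separate question, and our test suggests it does not.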

Does anyone have experience that might help? We are not keen on the idea of creating 4 IPsec tunnels to 4 different endpoints just to bond them back together.

MM

Re: IPSec throughput

Posted: Mon Oct 29, 2018 5:39 am
by Paternot
You saw one core at 100% - but did you find out what was using it? I mean, we know there is something saturating one core - but we don't know what it is. Can you test again and post the usage by core/process?

Re: IPSec throughput

Posted: Mon Oct 29, 2018 2:26 pm
by merlinthemagic7
Here you are. This is the profile for a CPU that is handling an IPsec peer; total IPsec throughput is roughly 140 Mbit/s.
profile1.png
If I then add a second peer, one where the src/dst hash places the encryption process on another core, I get roughly double the throughput at ~280 Mbit/s, and the profile looks like this:
profile2.png

Re: IPSec throughput

Posted: Mon Oct 29, 2018 2:39 pm
by emils
That is just how IPsec is processed by this driver. It is not feasible to make a single IPsec stream/policy multithreaded, as it would introduce latency, packet reordering and other unnecessary issues. You should still be able to achieve the advertised throughput if certain conditions are met, such as connection tracking being disabled.

Re: IPSec throughput

Posted: Mon Oct 29, 2018 3:13 pm
by merlinthemagic7
Hi,

Thank you MT for chiming in :)

Now that that is settled, I have one last question for you:

What metrics are used when determining the CPU affinity? You mention "IPsec stream/policy", but our tests show the only things that seem to matter are the source and destination IPs per peer (this seems to hold true for both IPv4 and IPv6). Once we have the metrics, what is the algorithm for determining CPU affinity? We need it so we can work out which endpoints a specific CPE should bind to so that each peer gets its own CPU. So far we have been doing it by trial and error.

One real world example:
CPE IP: 10.16.97.254

Connecting IPSec peer on:
10.16.97.128 = CPU0 pinned on CPE
10.16.97.131 = CPU1 pinned on CPE
10.16.97.124 = CPU2 pinned on CPE
10.16.97.11 = CPU3 pinned on CPE

If we then change the CPE IP, but keep the peer addresses the same, we get a very different CPU affinity distribution:
CPE IP: 10.16.97.253

Connecting IPSec peer on:
10.16.97.128 = CPU1 pinned on CPE
10.16.97.131 = CPU0 pinned on CPE
10.16.97.124 = CPU3 pinned on CPE
10.16.97.11 = CPU2 pinned on CPE

We would need to write software that changes the CPE provisioning based on the IP address given to the CPE once deployed; the software would then pick the appropriate endpoints for that specific CPE. Unfortunately we need connection tracking for IPv4 NAT.
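A sketch of what that provisioning helper could look like. The flow hash here is a purely hypothetical stand-in (CRC32 of the packed address pair), since the real classifier is undocumented; in practice it would have to be replaced with the actual algorithm, or with empirical per-peer measurements like the tables above.

```python
import ipaddress
import zlib

def toy_flow_cpu(cpe_ip: str, peer_ip: str, ncpu: int = 4) -> int:
    # HYPOTHETICAL hash: the real RouterOS/IPQ4018 classifier is unknown;
    # this only illustrates the shape of the problem.
    key = ipaddress.ip_address(cpe_ip).packed + ipaddress.ip_address(peer_ip).packed
    return zlib.crc32(key) % ncpu

def pick_endpoints(cpe_ip: str, candidates: list, ncpu: int = 4) -> dict:
    """Greedily choose one concentrator endpoint per CPU core for this CPE."""
    chosen = {}
    for peer in candidates:
        cpu = toy_flow_cpu(cpe_ip, peer, ncpu)
        if cpu not in chosen:
            chosen[cpu] = peer   # first candidate landing on a free core wins
        if len(chosen) == ncpu:
            break
    return chosen
```

With a large enough candidate pool this returns one endpoint per core, which is exactly the mapping the provisioning software would push to each CPE.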

Re: IPSec throughput

Posted: Mon Oct 29, 2018 4:07 pm
by emils
I might be wrong here, but I believe the crypto driver tries to use the same CPU core to process each packet/stream that it was assigned to by the ethernet driver. If I am not mistaken, the ethernet classifier for IPQ4018 was changed in the latest beta versions in the testing channel to also take source and destination ports into account. I have not tested it myself, but if you have the setup ready, you might as well test it.

Re: IPSec throughput

Posted: Tue Oct 30, 2018 1:31 pm
by merlinthemagic7
I see there has been some work done on IPv6 forwarding for IPQ4018-based devices, but I have been unable to find documentation that explains how connections are allocated to CPUs for processing.

Where would I look for this information?

MM