MikroTik hAP ac2 fasttrack speed limit?

I have a MikroTik hAP ac2 doing load balancing between two ISPs. The setup is fairly straightforward: load balancing is applied only to some servers/networks (like Steam/Blizzard servers or a specific Speedtest server). All other connections go via the default WAN1, but for those selected servers the router uses the Nth matcher to send every 2nd connection over WAN2. Only connections going over WAN2 get a connection-mark (and later a route-mark), so in theory fasttracking will always work for the majority of connections, which stay on WAN1.

It all worked just as expected when ISP1 was a 300 Mbps download link and ISP2 was 200 Mbps: the aggregate speed achieved for those 'load-balanced' servers like Steam or the specific Speedtest server was 500 Mbps.

The PROBLEM is that after both ISPs recently increased download speeds (500 Mbps on ISP1 and 400 Mbps on ISP2), the aggregate speed is not 900 Mbps... most often it is just 700 Mbps, rarely going up to 850 Mbps.

Initially I had even lower speed (max ~600 Mbps), and I noticed that while average CPU usage was decent during such downloads (<50%), a single CPU core was at almost 100% (with the other 3 cores under 30%). I assumed the issue was the inability to fasttrack toward WAN2: mangle needs to apply the route-mark on a per-packet basis, and fasttrack skips mangle altogether, so I had added "connection-mark=no-mark" to the "/ip firewall filter add action=fasttrack-connection" rule. That meant half of the connections used the fast path while the other half went over the slow path, which probably caused the single-core overload.

Then I decided to check whether MikroTik had improved their fasttrack handling, and removed that "connection-mark=no-mark" condition from the fasttrack rule. Lo and behold - it works!
Aggregate CPU usage dropped to <5%, with the most stressed core not going over 30%. BUT... aggregate download speed only increased from max 600 Mbps to max ~800 Mbps, mostly hovering around ~700 Mbps - still far from the expected 900 Mbps.

Since those speeds are close to wire-speed Ethernet, I cross-checked with iperf3: the same PC I used for the speed tests achieved 945 Mbps download over the same Ethernet link, indicating that the bottleneck is probably the MikroTik.

QUESTIONS:

  1. Is there a known fasttrack throughput limit for the hAP ac2?
  2. Any idea how MikroTik managed to make fasttrack work even for connection-marked/route-marked connections?
  3. Is there a way to set a connection's route once (on the new connection) and avoid applying route-marks on a per-packet basis in the prerouting chain?
  4. If the CPU is under 5% overall (and under 30% on a single core), what else could cause this limit (the inability to reach 900 Mbps)?
  5. Since route-marks are only needed for outgoing packets, is there a better place for that mangle rule than the prerouting chain (which is traversed by incoming packets too)?


    For question #4, I wonder if fasttrack still spends CPU time on each packet while processing mangle rules in the prerouting chain. A quick calculation shows that at 900 Mbps with 1500-byte packets the router has about 13.3 microseconds per packet; if it needs even ~3.5 microseconds extra (roughly 2800 CPU clocks on a single core, or ~700 clocks per core if all 4 cores share the work - the CPU runs at ~800 MHz) to determine whether an incoming packet is to be allowed, throughput drops from 900 Mbps to about 700 Mbps. So unless MikroTik uses some hardware switching and avoids the CPU even for fasttrack, it may be that 900 Mbps is not achievable at all on the hAP ac2.
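The per-packet arithmetic above can be sketched in a few lines of Python. The 1500-byte full-size packets and the ~800 MHz clock are assumptions for the estimate, not measured values from the hAP ac2:

```python
# Back-of-the-envelope check of the per-packet time budget.
# Assumptions (not measured): full 1500-byte packets, ~800 MHz CPU clock.

MTU_BITS = 1500 * 8      # bits per full-size packet
CPU_HZ = 800e6           # assumed hAP ac2 clock rate

def budget(mbps):
    """Seconds available per packet at the given line rate."""
    return MTU_BITS / (mbps * 1e6)

budget_900 = budget(900)             # time per packet at 900 Mbps
overhead = budget(700) - budget_900  # extra per-packet cost explaining the drop

print(f"budget @900 Mbps: {budget_900 * 1e6:.1f} us/packet")
print(f"overhead for a 900->700 Mbps drop: {overhead * 1e6:.1f} us "
      f"(~{overhead * CPU_HZ:.0f} CPU clocks)")
```

So an extra ~3.8 us of per-packet processing (about 3000 clocks at 800 MHz) is already enough to pull 900 Mbps down to 700 Mbps.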


    BTW, here is a simplified MikroTik config for the above load balancing (VPN setup, netwatch for routes, and other parts not directly related to load balancing removed):
# model = RBD52G-5HacD2HnD

/ip address
add address=192.168.1.3/24 comment=defconf interface=ether2-LAN network=192.168.1.0
add address=192.168.2.1/24 comment="WAN ISP 1" interface=ether3-adsl network=192.168.2.0
add address=192.168.0.3/24 comment="WAN ISP 2" interface=ether5-WAN network=192.168.0.0

/interface bridge port
add bridge=bridge comment=defconf interface=ether2-LAN
add bridge=bridge comment=defconf interface=wlan1-2G
add bridge=bridge comment=defconf interface=wlan2-5G

/ip firewall address-list
add address=155.133.224.0/19 comment="Steam servers" list=Aggressive_LB
add address=8.248.141.0/24 comment="Blizzard servers" list=Aggressive_LB
add address=a.b.c.d/24 comment="Some Speedtest server" list=Aggressive_LB

/ip firewall filter
add action=fasttrack-connection chain=forward comment="defconf: fasttrack" connection-state=established,related
# note: "connection-mark=no-mark" was removed from the rule above
add action=accept chain=forward comment="defconf: accept established,related"   connection-state=established,related
add action=drop chain=forward comment="defconf: drop invalid"   connection-state=invalid
add action=drop chain=forward comment="defconf:  drop all from WAN1 not DSTNATed" connection-nat-state=!dstnat connection-state=new in-interface=ether3-adsl
add action=drop chain=forward comment="defconf:  drop all from WAN2 not DSTNATed" connection-nat-state=!dstnat connection-state=new in-interface=ether5-WAN

/ip firewall mangle
add action=mark-routing chain=prerouting comment="Early route marking for already marked connections" \
	connection-mark=MWAN2 in-interface=bridge  new-routing-mark=RWAN2 passthrough=no
add action=mark-connection chain=prerouting comment="LB - Load Balancing using NTH 2,1" \
	connection-mark=no-mark  connection-state=new dst-address-list=Aggressive_LB \
	in-interface=bridge new-connection-mark=MWAN2 nth=2,1  passthrough=yes
add action=mark-routing chain=prerouting comment="Late mark for routing, for first packet in new MWAN2 connection" \
    connection-mark=MWAN2 in-interface=bridge new-routing-mark=RWAN2  passthrough=no
	
/ip firewall nat
add action=masquerade chain=srcnat comment="defconf: masquerade"   out-interface=ether5-WAN
add action=masquerade chain=srcnat log-prefix=NAT- out-interface=ether3-adsl

/ip route
add check-gateway=ping comment=WAN2 distance=10 gateway=192.168.0.1  routing-mark=RWAN2
add check-gateway=ping comment=WAN1 distance=20 gateway=192.168.2.3
add check-gateway=ping comment="Failover WAN2" distance=30 gateway=192.168.0.1

The problem with low-end devices is exactly what you found out: processing takes some tiny amount of time per packet and thus limits the throughput of a single connection even though there is ample CPU power available overall. If you tested with multiple parallel connections, you might find that the cumulative throughput reaches the maximum (in the case of the hAP ac2, that would be wire speed).

In addition: RouterOS processes all packets of a single connection on the same CPU core, probably to avoid potential packet reordering due to variable delays across cores. So when observing load on a multi-core device, the average load does not matter much: if one core gets highly loaded (and high load starts at 50%, not at 99%), the device is already throttling traffic.
When observing CPU load, it's always good to run a CPU profile (/tool profile) to see which process uses the CPU. Sometimes a single process consumes most of the CPU time; in that case there's clear room for improvement. If it's a mix of processes (ethernet, routing, firewall) using up CPU in comparable shares, then it's not so easy to guess which part of the config can be improved.

Yes, I noticed that the work was not evenly distributed across cores, especially when the slow path was used. As I mentioned in my post, one core was at 100% utilization while the others were at 30% or so.

But when I enabled fasttrack for both ISP links, CPU utilization dropped dramatically - as mentioned, to under 5% aggregate for the entire CPU. And while load still was not evenly distributed across cores, the worst core was now under 30% even at peak, and under 20% most of the time, with the other cores much less loaded.

In both cases the processes using most of that loaded core were firewall, then networking, then ethernet. But in the fasttrack case they peaked at 15% for firewall and 10% for the other two, with much lower averages, so the single busy core sat at around 20% utilization during downloads. So no single process, or even single core, was the bottleneck in the fasttrack case.

Regarding MikroTik using the same core for the same connection: that is understandable. But Speedtest uses 6 separate connections in the browser version, or 4 separate connections in the Windows app. I expected those to be split evenly among all cores, yet the router still loads one core more than the others. Another possibility is that it ties a core to a destination address, since even in the 6-connection Speedtest case they all go to the same IP address.
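To illustrate the guess about tying a core to a destination address, here is a hypothetical Python sketch of hash-based flow steering. The hashing scheme, the fields hashed, and the core mapping are assumptions about how such steering could work in general, not RouterOS's actual implementation:

```python
# Illustrative flow-steering sketch: a stable hash of packet header
# fields picks the CPU core, so all packets of one connection stay on
# one core (avoiding reordering). Which fields go into the hash
# determines how well parallel connections spread across cores.
import hashlib

CORES = 4  # the hAP ac2 has 4 cores

def core_for(key):
    """Map a tuple of header fields to a core index via a stable hash."""
    digest = hashlib.md5(repr(key).encode()).digest()
    return digest[0] % CORES

# Six parallel Speedtest-like connections: same server, different source ports.
flows = [("192.168.1.10", sport, "203.0.113.5", 8080)
         for sport in range(50000, 50006)]

# Hashing the full tuple can spread the flows across cores...
per_flow_cores = {core_for(f) for f in flows}
# ...while hashing the destination address alone puts them all on one core.
per_dst_cores = {core_for(f[2]) for f in flows}

print(per_flow_cores)   # may contain several cores
print(per_dst_cores)    # always exactly one core
```

If RouterOS hashed only the destination (or a source/destination pair), that would match the observed behavior of one core carrying all 6 connections to a single Speedtest server.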

Steam downloads (my 2nd test case, and one of the real-life reasons for load balancing) use not only multiple connections (20+) but also multiple download servers (6 servers). And CPU usage during a Steam download is more balanced (each core fluctuates between 10-20%, split mostly between firewall/ethernet/networking). But the Steam download also topped out at around 700 Mbps instead of 900 Mbps.