CCR1009-7G-1C-1S+ single stream TCP performance limit with queues

I finally got a 1 Gbps uplink from my ISP, and after setting up queue trees on my CCR1009-7G-1C-1S+, single TCP streams never seem to get past 600-700 Mbps. Disabling the queues immediately allows full 1 Gbps throughput. The queue is limited at 950M and never drops packets, so the queue itself shouldn't be the limiting factor; I tried different queue types as well as different interface queues with no effect. Multiple TCP streams have no problem pushing 1 Gbps, so something seems to be bottlenecking single-stream TCP performance. CPU load looks balanced according to the profiler.
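
For reference, the shaping setup is roughly along these lines (the interface and packet-mark names here are simplified placeholders, not my exact config):

  /ip firewall mangle
  add chain=forward action=mark-packet new-packet-mark=bulk passthrough=no
  /queue tree
  add name=upload parent=ether1-wan max-limit=950M
  add name=upload-bulk parent=upload packet-mark=bulk queue=default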

Anyone have any ideas?

The speed of a single stream depends on:

  • speed of the channel
  • window size
  • latency of the connection

By introducing the queue, the latency is affected, and that may just be what you are seeing in your case.
To reduce latency, use hardware-only queues and no other buffering.
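
As a rough illustration of the window/latency relation (the window size here is just an example): a single TCP stream can never go faster than window size divided by round-trip time. A 256 KiB window over a 3 ms RTT allows at most 256 * 1024 * 8 / 0.003 ≈ 700 Mbps, and if queueing adds just 1 ms of latency, the same window only supports about 525 Mbps.

Hardware-only queueing is selected per interface, e.g. for ether1:

  /queue interface set ether1 queue=only-hardware-queue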

Window sizes are fine, and the test server is 3 ms away, so latency shouldn't be an issue. I want to use queues for traffic shaping so a single HTTP download doesn't starve more important traffic, so using only the hardware queue is not really an option.

After further testing, even with queues disabled there is still a bottleneck. Removing some of the queue mangle rules improves speed, and turning on fasttrack makes it faster still, so I guess there is some single-stream / single-core limit in the firewall engine that is capping TCP speeds.
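
For reference, the fasttrack test used the standard rule pair, something like:

  /ip firewall filter
  add chain=forward action=fasttrack-connection connection-state=established,related
  add chain=forward action=accept connection-state=established,related

Note that fasttracked packets bypass the rest of the firewall, mangle and the queues, which is presumably why it is faster.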

And that objective has been achieved, so maybe you should just leave it at this?
This situation should be fine for typical usage, so maybe you should stop benchmarking and speedtesting and just make actual normal use of the connection?

There will be plenty of high-bandwidth TCP connections in real-world usage (lots of large file uploads, for example). If they can't use the full connection capacity, that's a bit disappointing.

Looks like you’re using queue trees. How does it do with simple queues?
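
For example, a single catch-all queue like this (the target is a placeholder):

  /queue simple add name=shaper target=192.168.88.0/24 max-limit=950M/950M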

Try changing default queue type to “sfq”
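
If I remember the syntax correctly, the built-in type can be changed in place:

  /queue type set default kind=sfq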

But they can… only not in a single TCP session. When many users are uploading and downloading lots of things, it will work just fine.

The queue type is irrelevant: the throughput limitation still happens even without queuing. Something in the netfilter processing seems to bottleneck a single TCP stream, since removing mangle rules improves speeds, as does enabling fasttrack, despite the CPU showing only 6-7% load.

pe1chl, in my situation there will often be single high-bandwidth TCP streams. Yes, most people can ignore this, since with enough users and flows the capacity will easily be maxed out, but we have a video department that uploads very large raw 4K footage over a single connection. I want them to be able to take full advantage of the 1 Gbps connection in that case.

I can reproduce this on my CCR1009. I tested using iperf between two VLANs. Here are my results.

  1. With firewall rules, mangle rules disabled, queues disabled, fasttrack disabled: 925Mbps.
  2. With firewall rules, mangle rules disabled, queues enabled (but logically disabled since no mangle rules), fasttrack disabled: 800Mbps.
  3. With firewall rules, mangle rules enabled, queues enabled, fasttrack disabled: 575Mbps.
  4. With firewall rules, mangle rules enabled, queues enabled, fasttrack enabled: 925Mbps.

In test 3), there is very little CPU usage: < 20% (~10% in firewall and ~6% in networking). There does appear to be something bottlenecking single streams, but none of the available profiling tools make it evident what is causing it.
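
Each number above is a single TCP stream between hosts on the two VLANs, tested with something like the following (the address is an example):

  iperf -c 10.10.20.2 -P 1 -t 30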

That’s good to hear it is reproducible. I will contact Mikrotik support and hope for an explanation.

Always check the detailed load of the CCR in Tools -> Profile by selecting CPU: all.
When you get 10 or 20% CPU load on a CCR1009, it can mean that one or two cores are fully loaded and the others are almost idle.
The CPU is still the bottleneck in that case, because the task is apparently single-threaded.
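
From the CLI that is:

  /tool profile cpu=all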

As R1CH showed in his image, no single CPU is the bottleneck. The load is nicely distributed across all cores.

Well, that can still happen when the task is single-threaded and limited by CPU.
The instantaneous performance is limited by the single CPU the code is running on, but the scheduler moves the thread between cores a few times per second, so you still see evenly loaded processors in the profiling.

If no single core goes over 60%, how exactly is it limited by CPU?

The load figures you see in profiling are averages, while the CPU limit is instantaneous.

Describe the scenario where an average load of 60% would result in a throughput of 62% of max.

Apparently you have it sitting in front of you…

I was hoping you could explain it. Going through 2 mangle rules and 2 queue checks (without actually queuing) and maxing out the CPU does not seem right.

There are two distinct things that you need to consider separately:

  • how can a single-threaded process on a multi-core system be processor bound even when the system load appears to be low
  • how can such a comparatively simple task take up so much resources on a supposedly powerful router

My connections are not fast enough to get into this area on the CCRs that I manage.
But I can explain the first point in theory: only a single processor can be active for the thread at any one time, and when the thread is regularly scheduled onto a different processor, the average load of each processor can be low even when the thread is CPU-bound.
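
As a worked example (assuming the scheduler spreads the thread perfectly evenly, which it will not do in practice): the CCR1009 has 9 cores, so a thread that is 100% CPU-bound but migrates across all of them averages only about 100/9 ≈ 11% load per core, which is exactly the kind of figure that looks harmless in the profiler.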