FQ_Codel and Mikrotik CCR CPU Utilization

discoengineer · February 10, 2023, 12:12pm

Hi Guys,

I’ve recently upgraded one of my CCR1036-2S+ units to version 7; it is being used as a router for my clients. I’m applying traffic policy on it using simple queues (mostly CIR connections), one per customer, roughly 350-400 simple queues in total. I’ve tested fq_codel on 4 simple queues to counter bufferbloat and it worked as intended, but I’m hesitant to roll it out to all the simple queues because I’m not sure how the CCR’s CPU will behave if fq_codel is selected on every one of them. Can anyone share their experience of using this algorithm for such a large number of simple queues, and the impact it has on the CCR’s CPU? There are no PPPoE sessions on this CCR, just SVIs for each customer, and in the simple queues the target is set to each customer’s IP. Will using this algorithm for this many queues increase CPU utilization or choke any core during peak traffic (max. 2 Gbps in total)?

dtaht · February 10, 2023, 8:17pm

I would just implement it and go measure, and be ready to roll back. Pound it flat with artificial traffic at 4am?

(I am one of the authors of fq_codel and cake, but I do not have enough data either on how well this stuff scales on given bits of mikrotik hardware.) In most cases it is the per-customer shaper that dominates the CPU, by a factor of 9, and the underlying queue type, be it a fifo, sfq, fq_codel or cake, adds only a tiny bit (cake is about 2.5 times as slow as fq_codel but does more, and again, it is the shaping cost that dominates). If you are low on memory, for folk running at less than a gbit, you can use memlimit 8Mbyte rather than the default 32Mbyte.

On x86 gear we have individual implementations of cake scaling to 10Gbit/core. On a 500MHz single-core mips, cake scales to only about 80Mbit. Your mileage will be somewhere between those.

See also.

http://forum.mikrotik.com/t/some-quick-comments-on-configuring-cake/152505/1

discoengineer · February 11, 2023, 10:54am

Thank you for the detailed explanation. You are right! I will implement it and check what impact it has on the CPU. I read your post and my my, what a gem of information it is. I’ll test it out and share results as well. I have 15.2 GB of memory free on the CCR, so I think I won’t need to change memlimit. I’ll share my findings in a few days.
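As a rough sanity check on that memory headroom, here is a minimal Python sketch of the worst-case arithmetic. It assumes every queue could fill up to its full memlimit at the same moment, which is a ceiling and not what the router actually allocates. The `instances_per_queue` parameter is hypothetical: it covers the possibility that a simple queue keeps separate upload and download instances, which the thread does not confirm.

```python
MIB_PER_GIB = 1024

def worst_case_mib(num_queues, memlimit_mib, instances_per_queue=1):
    """Upper bound on queue memory if every instance filled to its memlimit."""
    return num_queues * instances_per_queue * memlimit_mib

# Figures from the thread: up to ~400 simple queues, a 32 MB default memlimit,
# 8 MB suggested for sub-gigabit customers, and 15.2 GB reported free
# (treated here as GiB, which is an approximation).
free_mib = 15.2 * MIB_PER_GIB

for memlimit in (32, 8):
    for instances in (1, 2):  # 2 = assumed separate up/down instances
        need = worst_case_mib(400, memlimit, instances)
        verdict = "fits" if need <= free_mib else "exceeds free memory"
        print(f"memlimit={memlimit} MiB x{instances}: "
              f"{need / MIB_PER_GIB:.2f} GiB worst case ({verdict})")
```

Under these assumptions the default only looks tight if every queue fills in both directions at once, which real traffic is unlikely to do, so measuring memory use after rollout is still the better guide.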

dtaht · February 11, 2023, 12:11pm

I do hope a table of performance metrics appears for more mikrotik gear. Sometimes I just hope they will put the fq_codel or cake algorithms more directly into their PCQ implementation. I am blogging more and more, and hoped that someday that insanely long thread on cake for mikrotik got boiled down into a compact usage guide. I look forward to hearing about your results!

https://blog.cerowrt.org/post/ has a few pieces on flent and such…

dtaht · February 17, 2023, 2:19pm

I am really intensely curious as to how well this went. Did you try it, have your router melt down, and lose your life to a lynch mob of unhappy users? Or… ?

Kevo · February 18, 2023, 1:27am

I recently added cake as an interface queue on one of my CCR routers that tops out at about 700Mbps of traffic. I left the bandwidth limit off as a test, and I haven’t noticed any change and I haven’t had any user complaints. I set it up using the dual srchost and dsthost options. The CPU usage on this CCR barely ever breaks 0% even with cake running. I have only 3 raw firewall rules since it’s just a router, so there’s not much extra taxing the CPU. I’m curious to know if cake will actually do anything without the bandwidth limit. This router is fed by wireless so I’m not sure the bandwidth limit would help all that much anyway since there is likely to be less bandwidth when there are issues and that’s when I’d hope to see cake help keep things running well. So now I’m just waiting to see how it goes the next time we have a bad storm and the link drops modulations.

sirbryan · February 18, 2023, 4:27pm

Generally, queues need to get full before they become effective. If there’s no pressure on the traffic, all the packets get thrown out there as fast as possible (FIFO). The bottleneck becomes the final delivery mechanism, i.e. loop to the house or WiFi to the end-user device.

The only place where I’ve seen QOS/COS useful without the queue being filled was on Cisco 3550’s feeding Motorola (now Cambium) Canopy AP’s. When we deployed VoIP, we were seeing (hearing) poor audio quality. Employing COS rules on the switch to get the DSCP-tagged VoIP packets out first, in addition to Canopy’s QoS rules, allowed the VoIP packets to always get priority, regardless of how lightly loaded the AP/SM were.

Now (showing my ignorance), if Cake/fq-codel do reorder high-priority packets regardless of queue pressure, then I can see leveraging those queues on interfaces without shapers. In my simple WiFi speed testing at home, however, I don’t see any results/benefits until I start maxing out the radio’s capacity and apply a shaper that matches.

dtaht · February 22, 2023, 6:55pm

The word priority does not apply to fq_codel or cake (in besteffort mode). A better word would be “sparsity”; the relevant paper on this is here: https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=8469111 - but in a nutshell it translates to this: flows whose arrival rate is less than the departure rate of all the other flows go out first. This is a natural fit for voip (20ms interval) in particular, but applies to many other forms of req/response traffic. Only flows that build a queue are (slightly) deprioritized, and they are interleaved with all the other flows.
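To make the “sparse flows go out first” idea concrete, here is a minimal Python sketch loosely modelled on the new-flows/old-flows lists described for fq_codel in RFC 8290. It is an illustration only: the CoDel drop logic, flow hashing and packet limits are left out, and the class and flow names (`SparseFirstScheduler`, "bulk", "voip") are made up for the example.

```python
from collections import deque

class Flow:
    def __init__(self, name):
        self.name = name
        self.packets = deque()   # packet sizes in bytes
        self.deficit = 0         # byte credit for this scheduling round

class SparseFirstScheduler:
    """Deficit round robin with a 'new flows' list served before 'old flows'."""

    def __init__(self, quantum=1514):
        self.quantum = quantum
        self.flows = {}
        self.new_flows = deque()
        self.old_flows = deque()

    def enqueue(self, flow_name, size):
        flow = self.flows.setdefault(flow_name, Flow(flow_name))
        flow.packets.append(size)
        # A flow that is on neither list counts as new (sparse) again.
        if flow not in self.new_flows and flow not in self.old_flows:
            flow.deficit = self.quantum
            self.new_flows.append(flow)

    def dequeue(self):
        while self.new_flows or self.old_flows:
            active = self.new_flows if self.new_flows else self.old_flows
            flow = active[0]
            if flow.deficit <= 0:
                # Credit used up: refill and go to the back of the old list.
                flow.deficit += self.quantum
                active.popleft()
                self.old_flows.append(flow)
                continue
            if not flow.packets:
                active.popleft()
                # An emptied new flow takes one pass through the old list
                # so it cannot starve flows that are already waiting there.
                if active is self.new_flows and self.old_flows:
                    self.old_flows.append(flow)
                continue
            size = flow.packets.popleft()
            flow.deficit -= size
            return flow.name, size
        return None

sched = SparseFirstScheduler()
for _ in range(10):
    sched.enqueue("bulk", 1500)       # a flow that builds a queue
for _ in range(3):
    sched.dequeue()                   # bulk has the link to itself
sched.enqueue("voip", 200)            # a sparse packet arrives late
print(sched.dequeue())                # the voip packet is served first
print(sched.dequeue())                # then the bulk flow resumes
```

The voip packet is sent ahead of the seven bulk packets still queued, without any priority marking: it wins only because its flow had nothing queued.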

As for the effectiveness of the algorithms at less than full bandwidth: voip is very sensitive to tiny bursts of latency that are very hard to see in the aggregate - a 60ms delay or loss is audible, and 100ms very much so. If you are not sampling your link at the nyquist rate (at half the interval of what you are trying to observe, e.g. for voip’s 20ms packets that would be every 10ms), and are instead looking at 5 minute averages, everything might seem fine, but the voice quality may still be poor due to microbursts elsewhere in the network. So in general, always apply FQ in some form, AND classification if you can do it, even if your link never appears to saturate over a 5 minute interval - and someday, if you can, find a way to sample via nyquist’s methods!
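A small Python sketch of why a 5 minute average can hide this. All traffic figures are synthetic and chosen only for illustration: a 1 Gbit link at a steady 20% load, with a single 100ms burst arriving at three times line rate, modelled as a simple fluid FIFO queue sampled every 10ms.

```python
def peak_queue_delay_s(arrivals_bits, capacity_bps, interval_ms):
    """Fluid FIFO model: worst queueing delay seen, in seconds."""
    drained_per_interval = capacity_bps * interval_ms // 1000
    backlog_bits = 0
    peak_backlog_bits = 0
    for arrived in arrivals_bits:
        backlog_bits = max(0, backlog_bits + arrived - drained_per_interval)
        peak_backlog_bits = max(peak_backlog_bits, backlog_bits)
    return peak_backlog_bits / capacity_bps

CAPACITY_BPS = 1_000_000_000               # assumed 1 Gbit link
INTERVAL_MS = 10                           # the 10ms sampling from the post
SAMPLES = 5 * 60 * 1000 // INTERVAL_MS     # five minutes of 10ms samples

per_interval = CAPACITY_BPS * INTERVAL_MS // 1000
arrivals = [per_interval // 5] * SAMPLES   # steady 20% load (synthetic)
for i in range(1000, 1010):                # one 100ms burst at 3x line rate
    arrivals[i] = 3 * per_interval

average_utilization = sum(arrivals) / (per_interval * SAMPLES)
peak_delay = peak_queue_delay_s(arrivals, CAPACITY_BPS, INTERVAL_MS)

print(f"5 minute average utilization: {average_utilization:.1%}")
print(f"peak queueing delay: {peak_delay * 1000:.0f} ms")
```

In this toy model the average stays close to 20%, while the burst alone builds about 200ms of queue, well past the 60-100ms range described above as audible.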

“you will see things, you people won’t believe…
sawtooths flying through a gate…”

We are effectively sampling these days using ebpf at 10ms and I hope the technology makes it into more devices such as mikrotiks. Here’s a demo (click run test, then select one of the slower sites)

https://payne.taht.net/

A much better metric for voip would be something like Glitches Per Minute - where someday, if we could get calls and videoconferences down to just glitches per hour, it would be a better world.

dtaht · February 25, 2023, 6:28pm

I am, btw, fairly certain that more than a few > 1gbit ISP uplinks to the internet are behaving this badly:

https://blog.cerowrt.org/post/juniper/

sirbryan · February 27, 2023, 3:43am

I loaded just a few queues on my CCR1036. Initially they were all set as Cake with 600M to 2G for each queue, with a total of 6 or 7 queues.

Today, as the total utilization approached 2Gbps, the CPU load jumped to 100%. This is with all packets being tagged and assigned to one of the half-dozen queues. I changed the main queue to fq-codel and the utilization dropped to 18-20%.

I’ll have to try this on an ARM router (CCR2116) to see if it can handle the Cake work better than the 1036.

sirbryan · February 28, 2023, 7:28pm

Update:

I also put some rules on an RB4011 running 7.4.1, initially with Cake shaping all traffic going out to about six AP’s, each on their own VLAN on the same ethernet port to the AP switch. Each queue was Cake with Internet RTT, Ethernet overhead, Ack-filtering, and diffserv4 (all else defaults). The queues themselves rarely hit 50-70% (only one ever hit 90%). CPU utilization was around 20%.

Today I got a call from a customer who complained that for the last four days he’d been seeing packet loss, but hadn’t made any changes on his end. I ran mtr to trace the path to his radio and found that I was losing packets at the router itself. After a few minutes of tinkering with the backhaul to rule it out, I disabled the queues on the router. I saw an immediate improvement, both to the router and to his radio. He likewise saw improvements in whatever tools he was using to test from his PC.

I changed the queue to fq-codel, and while the CPU utilization was lower, I still saw packet loss to/through the router, just not as much. As of now, I’ve disabled all queueing on the RB4011, turned fasttrack back on, and we’re humming along at 6% load with 200-300Mbps continuous traffic.

sirbryan · February 28, 2023, 7:49pm

Update update:

For both the 1036 and the 4011, I disabled the shaper queues and created basic queues for the interfaces. The CPU load was causing more problems than perceivable benefits on both routers.

Even with basic settings for Cake on the 4011, I still saw packet loss increase. With fq_codel on the interfaces, it’s working as expected.

On the 1036, Cake besteffort on the two SFP+ interfaces pushing 1Gbps has the CPU at 2-3%. With fq_codel on the interfaces, it’s at 0%. No perceivable losses either way.

syadnom · February 28, 2023, 9:45pm

CCR2116 is DRAMATICALLY better than the tile platform. Tile is very slow for general purpose compute. Having a ton of crappy CPUs is ok… The ARMv8 cores in the CCR2116 must be like 10x faster per core for general purpose compute. I can get over 8Gbps one way on a single cake shaper between two CCR2116 with onboard mikrotik speed test.

dtaht · March 1, 2023, 12:25am

Packet loss is not a particularly good metric to use against a cake or fq_codel instance, as it uses packet loss to control congestion if RFC3168 (ECN) is not enabled by the endpoints. In exchange for decreased latency, you get more packet loss. So with both fq_codel and cake you should have seen an increase in packet loss, and an improvement in network latency. Better questions to ask are: Did the network “feel” better? Did videoconferencing and voip work better?
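A heavily simplified Python sketch of that trade-off, using the 5ms target and 100ms interval that are the usual fq_codel defaults. It is not the real CoDel algorithm (the control law that speeds up dropping over time is left out, and `CodelSketch` is a made-up name): it only shows that once queue delay has stayed above target for a full interval, a congestion signal is sent, as an ECN mark when the flow supports RFC3168 and as a drop otherwise.

```python
TARGET_MS = 5       # acceptable standing queue delay
INTERVAL_MS = 100   # how long delay may stay above target before acting

class CodelSketch:
    def __init__(self):
        self.first_above_ms = None   # when delay first exceeded the target

    def on_dequeue(self, now_ms, sojourn_ms, ecn_capable):
        """Return 'forward', 'mark' or 'drop' for the packet being dequeued."""
        if sojourn_ms < TARGET_MS:
            self.first_above_ms = None        # queue is healthy again
            return "forward"
        if self.first_above_ms is None:
            self.first_above_ms = now_ms      # start timing the bad period
            return "forward"
        if now_ms - self.first_above_ms >= INTERVAL_MS:
            # Persistent queue: signal congestion to the sender.
            return "mark" if ecn_capable else "drop"
        return "forward"

codel = CodelSketch()
for now_ms, sojourn_ms in [(0, 2), (10, 20), (60, 20), (110, 20)]:
    print(now_ms, codel.on_dequeue(now_ms, sojourn_ms, ecn_capable=False))
```

Seen this way, loss counted at a fq_codel or cake queue is partly deliberate signalling, which is why latency under load says more about whether the queue is helping.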

Now, if cake is actually mis-behaving due to cpu load or otherwise, and randomly inducing packet loss rather than intelligently dropping flows, then you are right to disable it.

What I try to do in a circumstance like this is to get a packet capture of a few test flows, and look at them via wireshark. It could very well be that your queues and offloads and/or applications are behaving better in this case with the fast path enabled, or there is a bug related to the offload + cake, or that the customer finds the range of latencies they currently experience acceptable.

syadnom · March 1, 2023, 1:08am

Dave, I see dramatically less packet loss with cake. Cake may be dropping packets as part of traffic control, but by managing the congestion other flows drop a lot fewer.

But back on topic, that CCR1036 has too slow single core performance IMO to do large shapers.

To the OP, as long as it’s individual shapers and NO top level shaper, then CCR1036 will spread the shapers out on separate CPUs. If you do a top level shaper, everything in that shaper tree will be trapped on a single core.

chechito · March 1, 2023, 3:40am

I agree about the limited single-core performance of the Tile architecture, but in the test performed by sirbryan he reproduced the situation on an RB4011, which has much better single-core performance (it has an OoO A15 CPU), at a rate that is normal for that router when shaping: 200-300Mbps.

syadnom · March 1, 2023, 4:14am

RB4011 isn’t dramatically faster. It’s a 32-bit ARMv7.

That said, I have rb4011 running up to about 650Mbps aggregate with cake. I can get that reliably. I use them for backup backhaul links so I have fq-codel or cake handling the constriction.

sirbryan · March 1, 2023, 9:52pm

To the customer, the feel was horrible, as he was no longer able to reliably play his games. He spent the previous four days testing things with friends, updating drivers, twiddling with settings on PC, etc. to no avail. Latency was all over the place and just in my ping tests we were losing one or two every 10-15 seconds. MTR reported 6-8% packet loss to the router and 4-5% to the radio.

I’ve reconfigured the queue setup on the 4011 and will report back. After some more digging, I think I “overconfigured” it initially, burdening the CPU with unnecessary tasks.

syadnom · March 1, 2023, 9:55pm

Feel free to post a sanitized config.

Also note, you cannot use fasttrack with shaping. I’ve seen it a few times: someone has the fasttrack firewall rule and shaping together, and it doesn’t behave as desired, only shaping things that have hit other firewall rules and so on.

sirbryan · March 1, 2023, 10:03pm

Here’s what I’ve whittled it down to:

/queue type
add fq-codel-flows=10240 fq-codel-limit=1024 fq-codel-memlimit=320.0MiB fq-codel-quantum=300 kind=fq-codel name=fq-codel
add cake-diffserv=besteffort cake-mpu=84 cake-overhead=38 cake-overhead-scheme=ethernet cake-rtt-scheme=internet kind=cake name=cake-interface
/queue tree
add limit-at=180M max-limit=180M name="LTU 192 " packet-mark=no-mark parent=vlan4001 queue=cake-interface
add limit-at=180M max-limit=180M name="LTU 203" packet-mark=no-mark parent=vlan4002 queue=cake-interface
add limit-at=180M max-limit=180M name="LTU 110" packet-mark=no-mark parent=vlan4004 queue=cake-interface
add limit-at=200M max-limit=200M name="LTU 227" packet-mark=no-mark parent=vlan4003 queue=cake-interface
add limit-at=150M max-limit=150M name=Airmax packet-mark=no-mark parent=vlan4010 queue=cake-interface

There’s only one firewall rule presently, which is to drop any traffic destined for the public IP addresses assigned to the router from non-internal IP’s.

The queues show the same amount of traffic whether or not fasttrack is enabled. Since these queues aren’t tied to “global”, I imagine that fasttrack should work. Either way, it’s not impacting the CPU much at this point as traffic is just under 100Mbps right now.

(To clarify, this particular config is working with zero packet loss, so far. So the problem I was seeing was likely due to mangling everything coming into the router.)