David,
Thank you for your thoughts; info like this really helps people understand where your project is coming from and where it is heading. I am hoping to pick your brain some more, though.
There are plenty of us running Mikrotik all the way from the customer to our core and back out to the internet. I have ROS v7.1 running for 90% of my customers and am working on the last couple hundred. It's been a rough couple of days dealing with v7.1, but I have spent most of the last year working on understanding CAKE and fq_codel and how to go about implementing them.
Right now I am running fq_codel on each of the customer-facing interfaces with Mikrotik's simple queues. Below is the config I am using (for everyone's reference):
(Basically the Mikrotik default config)
/queue/type/add fq-codel-ecn=yes fq-codel-flows=1024 fq-codel-interval=100ms fq-codel-limit=10240 fq-codel-memlimit=32.0MiB fq-codel-quantum=1514 fq-codel-target=5ms kind=fq-codel name=fq_codel
/queue/simple/add bucket-size=0.1/0.1 burst-limit=0/0 burst-threshold=0/0 burst-time=0s/0s disabled=no limit-at=0/0 max-limit=6200k/15400k name="CustomerXXX" packet-marks="" priority=8/8 queue=fq_codel/fq_codel target=CustomerVLANXXX !time
So right now my customers are being bandwidth-limited by fq_codel + HTB, which is where I see the vast majority of people being able to deploy this tech. Building a communication system between the CPE, a customer's router, an ISP's billing system, and so on just seems impractical without a unified standard (which doesn't exist) or a unified system that few people will be willing to enslave themselves to (and which Mikrotik probably won't build anyway). Sure, in theory you could force your customers to run your router, and then we could tie everything together with some scripts and a decent billing system, but many of us are techies and don't like putting that kind of requirement on the people we serve. I also don't have a decent billing system, so I am going to build for what I can handle right now.
Notes for those who are reading: 7.1 seems to have a lower CPU load compared to 6.48.x. fq_codel also puts less strain on the CPU than my previous ethernet-default setting, which was the simple FIFO + HTB that is the Mikrotik default. (Yes, I know it was terrible, but it was the most stable one on my 1009s, I had fewer complaints with it than with some of the other queues I tried in 6.x.x, and I didn't have much knowledge of queues when I originally set this up.) RAM utilization is significantly higher with fq_codel, though: routers with ~100 active customers went from ~380 MB of RAM usage to over 560 MB. There's plenty of room for that in my 1009s, but I figure it's a good FYI for people.
At rates below 1Gbit, you can safely reduce the fq_codel memlimit for packet buffering to 8MB (or less, if you want to do some math). fq_codel's basic overhead is 64 bytes per queue; where your configuration is most likely using up the extra memory is in coping with misbehaved traffic, which will hit the memlimit before being dropped robustly via the "drop_batch" facility. In general fq_codel by itself barely shows up as a blip on the cpu meter compared to a simple fifo, and by having shorter queues in general you can actually keep memory "hotter" in the L1 or L2 cache.
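For your setup, that's a one-liner against the queue type you already defined (assuming it's still named fq_codel, as in your export):

/queue/type/set [find name="fq_codel"] fq-codel-memlimit=8.0MiB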
I keep hoping to hear of a successful application of fq_codel + bql at a 10Gbit line rate on mikrotik.
Cake has about 2.5 times more cpu overhead than fq_codel does. My hope was that if you ended up using fewer tc filter rules with it, it would be comparable in many scenarios to a complicated htb + fq_codel setup, and that (selfishly) people would prefer the relative ease of cake setup and its additional benefits, and throw hardware at the problem until it subsides.
HTB is the source of 95% of the overhead, both in cpu locks and general context-switching overhead, no matter the sub-qdisc. We've been trying to kill off the locks for years, but the major innovations there are now happening mostly in the ebpf world. I don't currently understand what the bucket-size parameter you are using here translates to. Another optimization for your max-limit=6200k/15400k: you can use a smaller quantum of 300 and get less latency for the smaller packets, at a cost of up to 6x more fq-codel-related cpu in some cases. My 2012 "wall" for when to use quantum 300 or not, with 1989-style mips hardware, was about 40Mbits or less. Cake uses its own curve for the quantum.
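For example, a second queue type for the sub-40Mbit plans might look like this (same parameters as your export, only the quantum changed, and the memlimit lowered per the above; the name is arbitrary):

/queue/type/add kind=fq-codel name=fq_codel_q300 fq-codel-ecn=yes fq-codel-flows=1024 fq-codel-interval=100ms fq-codel-limit=10240 fq-codel-memlimit=8.0MiB fq-codel-quantum=300 fq-codel-target=5ms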
Which leaves me with some questions I am hoping you can enlighten me and others on:
You mention running fq_codel with a HTB, do you have any thoughts about the size/settings of the HTB?
The current state of the art is embedded in the sqm-scripts. They have a debug mode where you can put in your parameters for bandwidth and then see the params for htb. Cake will also output its calculations. Having a separate htb config "calculator" would make some sense, but it's relative to the speed of the cpu, the context-switch time, and the other workloads. Right now the ebpf work is easier to reason about. I guess my rule of thumb is basically to start worrying when "top" shows more than 50% of a single core in use.
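For reference, the rough shape of what sqm-scripts ends up installing on linux is something like this (heavily simplified; the real scripts also handle ingress, overhead compensation, and quantum scaling):

tc qdisc add dev eth0 root handle 1: htb default 10
tc class add dev eth0 parent 1: classid 1:10 htb rate 15400kbit ceil 15400kbit
tc qdisc add dev eth0 parent 1:10 fq_codel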
What do you mean by "fq_codel everywhere at line rate"? Do you mean setting the bandwidth that should be possible on each interface of a router/switch?
You should be able to set the line-rate bandwidth (e.g. for ethernet: 10Mbit/100Mbit/1Gbit/2.5Gbit/10Gbit) and get adequate backpressure from the device to be able to apply an aqm without a shaper.
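On linux that's literally just the following, with no rate parameter anywhere; bql and the driver provide the backpressure:

tc qdisc replace dev eth0 root fq_codel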
I've been getting this a lot: people think this is a technology needed only where there's a shaper. That's certainly a major use case, but totally not the intent of algorithms that were designed to intelligently step down from any fast -> slow transition, everywhere, and especially on wireless tech, as we thought would happen in 1992 when RED was standardized. I have high hopes for wider deployment on 2.5Gbit-to-1Gbit interfaces, for example. Nowadays, fq_codel is the default qdisc at ethernet line rate for most of linux, and all of ios and osx. It just sits there, breaking up microbursts, keeping queue lengths short, pretty much invisibly. OpenWrt actually removed pfifo-fast from their codebase entirely; it would be great if mikrotik did that too.

The biggest application for fq_codel is that it's on by default in hundreds of wifi products, more or less silently "doing its job". If it wasn't in there, wifi mesh routers like eero's in particular would have been stillborn. I'm rather proud of the wifi work we did, as it finally solved the p2mp problem wifi has always had, as well as the wifi "performance anomaly". Essentially I had been working that problem for 16 years.
http://the-edge.blogspot.com/2010/10/wh ... based.html
fq_codel and pie work beautifully with ethernet pause frames, as another example, and would massively improve ethernet over powerline.
Although that "line rate" wifi implementation got folded into linux 4.12, I don't know to what extent it has made it into mikrotik's products at this point either! Qualcomm did eventually do a pretty decent out-of-tree firmware implementation, I'm told, but the outputs of the codebase we created are shown over here:
https://forum.openwrt.org/t/aql-and-the ... vely/59002 but right now it's the latest mediatek mt7x chips that are really rocking the world latency-wise with their wifi 6 implementation of our stuff.
anyway, to sum up: fq_codel at the native line rate of the interface is useful all by itself. You can see it in action on nearly any linux box today via tc -s qdisc show. On osx it's netstat -qq -I the_interface_name. On linux wifi, on the supported chipsets, if you see an "aqm" file in sysfs, it's there. I'd really like mikrotik (and the whole world) to kill off their plain packet fifos entirely, as openwrt did, and the internet everywhere would get subtly better for all applications, especially when under load.
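i.e., to check:

tc -s qdisc show dev eth0    # linux
netstat -qq -I en0           # osx; substitute your interface name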
It's setting up the shapers at something other than the native line rate that's so darn finicky.
Is there some other phrase I could use besides line rate? native rate? that is in common use in the world?
Arguably, what I would design for an ISP <-> multi-customer interface would be a "veth"-like interface (with fq-codel or cake on it) and a bql-like multiplexor at the bottom (in hw, not a token bucket); I'd punt the uplink shaping to the customer device via some message, and use a routing table for their IP address ranges.
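A very rough linux sketch of the per-customer piece (the interface names and the 203.0.113.0/28 range are made up for illustration, and the bql-like hw multiplexor has no software equivalent to show here):

ip link add cust0 type veth peer name cust0-core
ip link set cust0 up
ip link set cust0-core up
tc qdisc add dev cust0 root cake bandwidth 15400kbit besteffort
ip route add 203.0.113.0/28 dev cust0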
I suspect that Mikrotik's switches will not be able to handle this: not enough CPU, and there's no hardware offload support at the moment. Rather unfortunate, but if we have a router on each side of a switch that might be transitioning from 10G to 1G, this could be mitigated by running fq_codel there, correct?
Would you recommend CAKE be run with an HTB as well? From my initial testing, having the HTB helps a lot with the CPU load on a router, but now that I have flent working I think CAKE without the HTB performs better.
That said, I have not been able to sit down and get a full understanding of what one is looking for with flent testing, or what "good" is. At the moment my goal has been to get the ping CDF plot from an RTT Fair Realtime Response Under Load test to be as vertical as possible. This weekend I should be able to get some testing done and maybe set up my own server to contribute to the greater effort.
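For anyone following along, the sort of invocation I've been using is roughly this (the hostname is a placeholder for your own netperf/flent server):

flent rrul -l 60 -H flent-server.example.com -p ping_cdf -o ping_cdf.png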
I imagine most of mikrotik's switches are short-buffered and lack advanced features, maybe not even red?
There is a role for transparent middleboxes like Preseem's in networks. It's also pretty easy to roll your own... but (I have no business relationship with Preseem) I adore the great statistics they provide. LibreQoS has a long way to go there, and so far I still don't know what sort of stats can be usefully obtained from mikrotik.
In general, no: run cake without an htb and use its native integral shaper. Two exceptions: if there's a hardware offload for htb, use that; and if you have a complex sharing setup where you are trying to multiplex a ton of users onto a link, htb might be a better choice than what would be my preferred choice of per-customer DRR + cake bandwidth X.
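In linux terms, the no-htb version collapses to a single line per interface; cake's built-in shaper replaces the htb rate/ceil:

tc qdisc replace dev eth0 root cake bandwidth 15400kbit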
PS: aiming for a perfectly flat CDF is futile and makes me nervous about people "engineering to the test". I've ranted already about speedtest, but didn't point out that if you optimize for the first 3 seconds or so of speedtest and ignore the rest, it's still a bit useful. I'd really hate to wake up 10 years from now to find that all the new hardware offloads "optimized for rrul"... by offloading exactly 4 uploads and 4 downloads, and having 3 channels for udp traffic and one for icmp, for example! Feel free to get to the upper range of the CDF with, say, 10ms extra latency at the 98th percentile at 10Mbit on that test, and move on. Also try to get a feel for the behavior of that test at 100Mbit rather than 10 (as the latency will get flatter), but hit THAT with something like 128 flows to watch it go to hell in its own way... and look at the long-term stats for other notable glitches, like some we've encountered on the other thread... the best result we had from the rtt_fair test was really pleasing... rrul + PLT is a great benchmark....
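With flent, a 128-flow download against the same box looks something like this (the test-parameter name follows flent's tcp_ndown test; adjust the host to your own server):

flent tcp_ndown --test-parameter download_streams=128 -l 60 -H flent-server.example.com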
aggh... the prospect of a network world designed for the rrul test series would be a much better one than the speedtest world we are in now, but I can think of all sorts of traffic types (like videoconferencing and videocamera feeds) that are worth having more information about than what those tests currently provide! I sometimes wake up at night terrified that someone will find an exploitable flaw in fq-codel's or cake's behaviors and 4B+ users will descend upon me and my project for not being able to fix it quickly on the infinitesimal budget we have had.