[Release Candidate] wanctl v1.0.0-rc4 — Adaptive CAKE Bandwidth Control for RouterOS

Hi all,

What is wanctl?

wanctl is a host-side controller that connects to RouterOS and adjusts CAKE queue bandwidth limits every ~2 seconds based on measured congestion.

It assumes CAKE is already configured correctly on RouterOS and operates on top of it by dynamically tuning bandwidth limits within defined bounds.

Optional support exists for multi-WAN traffic steering during severe congestion.


Why this exists

Static CAKE bandwidth limits are always a compromise:

  • Too high → queue growth and latency under load

  • Too low → unused capacity when the link is clean

On variable-rate links (especially DOCSIS), available throughput changes throughout the day, and manual tuning can’t keep up.

wanctl attempts to adapt CAKE bandwidth limits in real time using measured signals instead of fixed values.


How it works (concrete)

Every ~2 seconds, wanctl:

  1. Measures RTT to multiple reference hosts (1.1.1.1, 8.8.8.8, 9.9.9.9)

  2. Maintains a slow EWMA baseline RTT (updated only when idle)

  3. Computes RTT delta = loaded RTT − baseline RTT

  4. Evaluates congestion using:

    • RTT delta

    • CAKE drops

    • Queue depth / backlog indicators

  5. Transitions a state machine

  6. Adjusts CAKE queue bandwidth limits accordingly

  7. Enforces state-dependent bandwidth floors
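The measurement side of the loop (steps 1–3) can be sketched as follows. This is illustrative only: constants like `ALPHA` and the idle threshold are assumptions, not wanctl's actual identifiers.

```python
# Minimal sketch of steps 1-3: a slow EWMA baseline that only advances
# while the link is idle, plus the RTT-delta congestion signal.

ALPHA = 0.05  # slow EWMA weight for the baseline (illustrative)


def update_baseline(baseline_ms: float, rtt_ms: float, link_mbps: float,
                    idle_threshold_mbps: float = 50.0) -> float:
    """Advance the baseline EWMA, but only while the link is idle,
    so loaded samples never pollute the reference RTT."""
    if link_mbps < idle_threshold_mbps:
        return (1 - ALPHA) * baseline_ms + ALPHA * rtt_ms
    return baseline_ms


def rtt_delta(loaded_rtt_ms: float, baseline_ms: float) -> float:
    """Congestion signal: how far loaded RTT sits above the idle baseline."""
    return max(0.0, loaded_rtt_ms - baseline_ms)
```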

Communication with RouterOS is via SSH (API support is optional, depending on config).


State Machine

wanctl uses a four-state congestion model:

           delta ≤ 15ms
    ┌─────────────────────────┐
    │                         │
    ▼         15–45ms         │
  GREEN ───────────────► YELLOW
    ▲                         │
    │         45–80ms         ▼
    │     ┌───────────── SOFT_RED
    │     │                   │
    │     │        >80ms      ▼
    └─────┴───────────────── RED
         (recovery requires
          sustained GREEN)
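The delta thresholds in the diagram reduce to a simple mapping. This toy version classifies on RTT delta alone; the real state machine also folds in CAKE drops, queue depth, and the sustained-GREEN recovery requirement.

```python
# Hedged sketch of the delta -> state mapping shown in the diagram above.
# Thresholds match the diagram; drops and backlog gating are omitted.

def classify(delta_ms: float) -> str:
    if delta_ms > 80:
        return "RED"
    if delta_ms > 45:
        return "SOFT_RED"
    if delta_ms > 15:
        return "YELLOW"
    return "GREEN"
```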

State-dependent bandwidth floors

These prevent bandwidth collapse during congestion:

  • GREEN: High floor (e.g. ~550 Mbps)

  • YELLOW: Moderate floor (e.g. ~350 Mbps)

  • SOFT_RED: Aggressive floor (e.g. ~275 Mbps)

  • RED: Emergency floor (e.g. ~200 Mbps)

All thresholds and floors are configurable via YAML.
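A config for the floors and thresholds above might look like this. The key names here are hypothetical, purely to show the shape; wanctl's actual YAML schema may differ.

```yaml
# Illustrative only - key names are NOT wanctl's actual schema.
download:
  floors:
    green: 550M
    yellow: 350M
    soft_red: 275M
    red: 200M
  thresholds:
    yellow_ms: 15
    soft_red_ms: 45
    red_ms: 80
```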


Real-World Test: Second-by-Second Behavior

The following is actual output from a stress test on a 940/38 Mbps Spectrum DOCSIS connection.
Eight parallel netperf streams were used to induce congestion.

Time      State         Delta    Upload BW   RTT      Event
────────────────────────────────────────────────────────────────
00:00:37  GREEN/GREEN    2.2ms   38M         26ms     Idle baseline
00:00:44  GREEN/GREEN   10.8ms   38M         70ms     Load increasing
00:00:52  YELLOW/RED    62.6ms   34M        295ms     Congestion detected
00:00:59  YELLOW/RED    60.8ms   31M         79ms     Backing off upload
00:01:06  SOFT_RED/RED  47.9ms   28M         21ms     Continued reduction
00:01:18  YELLOW/YELLOW 29.3ms   28M         21ms     Recovering
00:01:31  YELLOW/YELLOW 17.9ms   28M         22ms     Almost recovered
00:01:56  GREEN/GREEN    7.1ms   28M         26ms     Fully recovered

Observed behavior:

  • RTT spiked from ~26 ms to ~295 ms under load

  • wanctl reduced upload shaping from 38 Mbps → 28 Mbps (~26%)

  • RTT delta fell back under 10 ms

  • System returned to GREEN without user intervention

  • Bandwidth then recovered gradually while remaining GREEN
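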

The entire event completed in under 90 seconds.


Production Observations (18 Days)

Single RouterOS RB5009, Spectrum 940/38, wanctl enabled.

Monitoring Period:    Dec 11–28, 2025
Control Interval:     ~2 seconds
Total Cycles:         231,208
Log Data:             ~488 MB

Download state distribution

  • GREEN: 89.3%

  • YELLOW: 7.8%

  • SOFT_RED: 0.7%

  • RED: 2.2%

Mean RTT delta across the period was ~4.7 ms, with p95 ~10.2 ms.

This is observational data from one environment, not a claim of universal results.


Multi-WAN Steering (Optional)

For dual-WAN setups, wanctl can optionally enable policy-based steering of latency-sensitive traffic during severe congestion.

  • Steered: VoIP, gaming, DNS, SSH, interactive traffic

  • Not steered: bulk downloads, streaming, background traffic

Steering uses the same congestion signals with hysteresis to prevent flapping.
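The hysteresis mentioned here can be sketched as a consecutive-cycle gate: steering only toggles after several cycles agree, so one bad or good sample cannot flap the policy. Class name and cycle counts are illustrative, not wanctl's real steering code.

```python
# Toy hysteresis gate: engage after N consecutive severe cycles,
# disengage only after M consecutive recovery cycles.

class SteeringGate:
    def __init__(self, enter_after: int = 5, exit_after: int = 15):
        self.enter_after = enter_after  # severe cycles needed to engage
        self.exit_after = exit_after    # clean cycles needed to disengage
        self.active = False
        self._streak = 0

    def update(self, severe: bool) -> bool:
        if severe != self.active:
            # Signal disagrees with current policy: count the streak.
            self._streak += 1
            needed = self.enter_after if severe else self.exit_after
            if self._streak >= needed:
                self.active = severe
                self._streak = 0
        else:
            # Signal agrees with current policy: reset the streak.
            self._streak = 0
        return self.active
```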


What this is / is not

Is:

  • External CAKE bandwidth controller

  • Config-driven and explicit

  • Designed for RouterOS 7.x

  • Intended for power users

Is not:

  • A RouterOS script

  • A replacement for proper queue placement

  • A guarantee of zero bufferbloat

  • Enterprise-grade software


RC status & feedback

This is my first open-source project and a release candidate.

The structure and behavior are stable in my environment, but I’m looking for:

  • Review from RouterOS power users

  • Feedback on thresholds and state logic

  • Edge cases on non-DOCSIS links

  • Bugs I haven’t encountered

Issues and PRs are welcome.

— Kevin

wanctl – FAQ / Design Decisions

Q: Why not just do this in RouterOS scripts?

Short answer: complexity and safety.

RouterOS scripting is excellent for:

  • simple automation

  • event-driven tasks

  • configuration glue

wanctl’s logic includes:

  • multi-signal evaluation (RTT + drops + queue depth)

  • EWMA tracking with persistence

  • hysteresis across multiple states

  • rate-of-change control every ~2 seconds

  • structured logging and state files

Implementing this cleanly in RouterOS script would be:

  • harder to reason about

  • harder to test

  • harder to evolve safely

  • significantly more fragile to RouterOS changes

The project deliberately keeps RouterOS as the data plane and moves decision logic off-device.


Q: Isn’t an external controller fragile? What if it dies?

If wanctl stops running:

  • Nothing breaks

  • CAKE remains configured at the last known safe bandwidth

  • No RouterOS state is left half-applied

wanctl only adjusts numeric limits; it does not modify routing tables, mangle rules, or queue topology.

In other words: fail-stop, not fail-open.


Q: Why use SSH / API instead of RouterOS scripting hooks?

Two reasons:

  1. Clear control boundary
    RouterOS executes commands; wanctl decides when and what to change.

  2. Transport flexibility
    The backend interface is abstracted:

    • SSH works everywhere

    • RouterOS API can be used for lower latency

    • Other platforms can be added later

This avoids tight coupling to RouterOS internals.


Q: Why not rely only on CAKE drops?

Because drops are a late signal.

wanctl primarily tracks RTT delta, which correlates directly with user experience.

Drops are still useful:

  • to confirm hard congestion

  • to distinguish RTT-only vs drop-based congestion

  • to gate transitions into RED

This mirrors how humans diagnose bufferbloat: latency first, loss second.


Q: Why measure RTT to the internet instead of internal hosts?

Bufferbloat matters at the WAN bottleneck.

Measuring RTT to multiple stable external targets:

  • captures queueing delay at the bottleneck

  • avoids false positives from LAN load

  • provides redundancy if one target is slow or unreachable

Baseline RTT is only updated when idle, preventing pollution from transient congestion.


Q: Isn’t adjusting bandwidth every 2 seconds too aggressive?

Not in practice.

Mitigations include:

  • EWMA smoothing

  • hysteresis between states

  • state-dependent floors

  • gradual recovery ramps

Bandwidth does not oscillate wildly.
Most adjustments are small and monotonic during an event.
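One way to see why adjustments stay small and monotonic: each cycle moves the limit by at most a bounded fraction toward the target, clamped to the state floor. The step fraction here is an assumption for illustration.

```python
# Sketch of a rate-limited, floor-clamped adjustment step (illustrative
# constants; not wanctl's actual ramp logic).

def next_limit(current: float, target: float, floor: float,
               max_step_frac: float = 0.05) -> float:
    """Move `current` toward `target` by at most max_step_frac per cycle,
    never dropping below the state-dependent floor."""
    step = max_step_frac * current
    if target > current:
        nxt = min(target, current + step)
    else:
        nxt = max(target, current - step)
    return max(nxt, floor)
```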


Q: Why not just use sqm-autorate / LibreQoS?

Those are great projects — but they solve different problems.

  • sqm-autorate targets OpenWrt

  • LibreQoS targets ISP / large-scale deployments

wanctl exists because:

  • RouterOS has CAKE but no adaptive controller

  • RouterOS scripting is limiting for this use case

  • Some users want a small, explicit, single-router solution

wanctl is intentionally narrow in scope.


Q: What happens if RTT spikes due to upstream routing issues?

wanctl reacts to persistent signals, not single samples.

Short-lived upstream spikes are absorbed by:

  • EWMA smoothing

  • state transition thresholds

  • recovery requirements

If RTT stays elevated long enough, wanctl will reduce bandwidth — which is still a reasonable response when queues are not draining.


Q: Does this work on fiber / DSL / symmetric links?

Yes, but config matters.

The controller is generic; behavior depends on:

  • thresholds

  • floors

  • recovery rates

Example configs are included for:

  • DOCSIS

  • fiber

  • DSL/VDSL

This is not plug-and-play — it’s a power-user tool.


Q: Is this safe to run on production networks?

It’s been safe on my production network.

But:

  • it’s a release candidate

  • it’s not enterprise software

  • it assumes you understand CAKE and RouterOS

wanctl is conservative by design and easy to disable.


Q: Why release this instead of keeping it private?

Because RouterOS users keep reinventing partial versions of this logic, and sharing real-world data benefits everyone.

This project stands on a lot of community knowledge — especially from the bufferbloat and CAKE ecosystem.


Interesting. Do you have a link also?

Apologies — forgot to include the repository link in the original post.

Dave Täht (1962-2023) - In Memoriam

Ehm, that's completely wrong. If Claude is not even getting this right, I am afraid I am not brave enough to test this app. I would need a dedicated machine just for that wanctl thing anyways - which I do not have spare.

https://en.wikipedia.org/wiki/Dave_Täht

Remember that changing the RouterOS configuration means writing to the NAND/NOR Flash drive.

Changing the configuration every 2 seconds is therefore equivalent to writing to the Flash drive every 2 seconds.

Therefore, the long-term effects must be considered.

While the v6/v7 "user-manager" and The Dude can be stored on an external drive,
the RouterOS configuration is only written to the internal Flash drive, with all that this entails.

And this. Too many flash writes.

I’m not one to shame someone for trying to bring value to a project, but this very strongly smells of A.I. assistance in the worst possible ways.

On top of that, as has been mentioned previously, if this code is writing to flash every two seconds, then in a one year timespan (365 days) it will write to flash more than ~15.7 million times. Considering even industrial flash cells are not rated for this, anyone running this code could find their router out of commission in a matter of months if not sooner.

You’re absolutely right — that was an error on my part.
Dave Täht’s birth year was incorrect in the acknowledgment, and I’ve fixed it in the repo.

Thanks for catching it.

Good point, thanks for flagging it.

wanctl has been updated to only apply changes when limits actually change (state tracked and persisted), so there are no repeated RouterOS updates every cycle. The fix is now in the repo.
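The change-gate described here is conceptually a one-line comparison against the last applied state. A minimal sketch, where `apply_limits` stands in for the SSH/API call and is hypothetical:

```python
# Write suppression: only push a command to RouterOS when the computed
# limits differ from the last applied ones, so steady-state cycles cause
# no router writes at all.

def maybe_apply(new_limits: dict, last_applied: dict, apply_limits) -> bool:
    """Return True if a router update was issued this cycle."""
    if new_limits == last_applied:
        return False          # steady state: no SSH command, no flash write
    apply_limits(new_limits)  # only touches the router on real changes
    last_applied.update(new_limits)
    return True
```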

This is much, much better. The RB5009 uses a Macronix MX60LF8G18AC (I believe) which is SLC-based and rated for 100,000 P/E cycles, so I doubt you personally would have had any issues for a long while, but not every device has the same hardware under the hood. It’s also not good to rely solely on wear levelling being perfect.


As a follow-up, I added profiling and ran a 24-hour baseline on both WANs.

In steady state, >99.7% of control cycles issue no RouterOS updates at all.
Router changes occurred in 0.3% of cycles, only when bandwidth limits actually changed.

Average control loop time is ~30–45 ms vs a 2-second interval, so there’s significant headroom.

This confirms the flash-wear protection and event-driven behavior are working as intended.
