Frostbyte
Frequent Visitor
Topic Author
Posts: 92
Joined: Mon Dec 25, 2017 1:42 am

Queue Trees, CPU Utilization and Watchdog reboots

Thu Nov 08, 2018 11:14 am

I have a hAP AC unit (RB962UiGS-5HacT2HnT) which has a single-core CPU at 720 MHz.
I am trying to build a configuration that would allow me to divide the WAN1 traffic into three categories (ordered by priority):
  1. All other internet traffic (inbound and outbound) from and to the Streaming PC (e.g. online games, Spotify) [This is determined after filtering out the traffic for item #2 on this list]
  2. Outgoing traffic from the Streaming PC to Twitch ingest servers (e.g. live-ams.twitch.tv)
  3. All other internet traffic from and to other hosts on my network

I currently have a script in place that:
  • Resolves and populates/cleans-up the address list entries for the aforementioned ingest servers.
  • Toggles on/off all mangle rules containing the keyword "LVSTRM".
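
For readers who want to picture that script, here is a minimal sketch of the two steps, not the actual script from this setup. The host name, the TWITCH list and the "LVSTRM" keyword come from this post; the variable names and the single-host lookup are assumptions made for illustration.

```
# Hypothetical sketch: variable names are illustrative, only one ingest host is handled.
:local ingest "live-ams.twitch.tv"
:local enable true

# Clean up stale entries, then re-resolve the ingest host into the TWITCH list.
/ip firewall address-list remove [find where list="TWITCH"]
:do {
    /ip firewall address-list add list=TWITCH address=[:resolve $ingest] comment=$ingest
} on-error={ :log warning ("Could not resolve " . $ingest) }

# Toggle every mangle rule whose comment contains the keyword.
:if ($enable) do={
    /ip firewall mangle enable [find where comment~"LVSTRM"]
} else={
    /ip firewall mangle disable [find where comment~"LVSTRM"]
}
```

Matching on the comment keeps the toggle independent of rule order, so rules can be moved around without touching the script.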

Then there are some permanent address lists, such as STREAM, which is the streaming PC, and LOCAL, which contains all of my internal subnets (it comes into play in the first rule, so everything below that rule is a WAN interaction).
/ip firewall mangle
add action=accept chain=prerouting comment="Exclude LAN traffic" dst-address-list=LOCAL src-address-list=LOCAL
add action=mark-routing chain=prerouting comment="[LVSTRM] Route all internet traffic from the streaming PC through WAN1" new-routing-mark=ToWAN1 passthrough=yes src-address-list=STREAM
add action=mark-connection chain=prerouting comment="Mark inbound WAN1 connections" connection-mark=no-mark in-bridge-port=ge0 in-interface=bridge1 new-connection-mark=WAN1 passthrough=yes
add action=mark-connection chain=prerouting comment="Mark inbound WAN2 connections" connection-mark=no-mark in-interface=vlan2 new-connection-mark=WAN2 passthrough=yes
add action=mark-packet chain=prerouting comment="[LVSTRM] Mark all packets destined to twitch ingest servers through WAN1 from streaming PC" connection-mark=WAN1 dst-address-list=TWITCH new-packet-mark=QOS-GE1-STREAM packet-mark=no-mark passthrough=yes src-address-list=STREAM
add action=mark-packet chain=prerouting comment="[LVSTRM] Mark all other packets destined to WAN1 from the streaming PC" connection-mark=WAN1 new-packet-mark=QOS-GE1-STREAM-OTHER packet-mark=no-mark passthrough=yes src-address-list=STREAM
add action=mark-packet chain=prerouting comment="[LVSTRM] Mark all other packets destined to the streaming PC from WAN1" connection-mark=WAN1 dst-address-list=STREAM new-packet-mark=QOS-GE1-STREAM-OTHER packet-mark=no-mark passthrough=yes
add action=mark-packet chain=prerouting comment="[LVSTRM] Mark all packets destined to WAN1 from other hosts" connection-mark=WAN1 new-packet-mark=QOS-GE0-CATCHALL packet-mark=no-mark passthrough=yes
add action=mark-routing chain=prerouting comment="Attach routing marks to already marked inbound WAN1 connections" connection-mark=WAN1 in-interface-list=hosts new-routing-mark=ToWAN1 passthrough=no
add action=mark-routing chain=prerouting comment="Attach routing marks to already marked inbound WAN2 connections" connection-mark=WAN2 in-interface-list=hosts new-routing-mark=ToWAN2 passthrough=no
add action=mark-routing chain=output comment="Attach routing marks to already marked outbound WAN1 connections" connection-mark=WAN1 new-routing-mark=ToWAN1 passthrough=no
add action=mark-routing chain=output comment="Attach routing marks to already marked outbound WAN2 connections" connection-mark=WAN2 new-routing-mark=ToWAN2 passthrough=no

/queue tree
add max-limit=8800k name=GE0-TX parent=ge0 queue=default
add max-limit=55M name=GE1-RX parent=ge1 queue=default
add limit-at=6250k max-limit=6560k name=GE1-STREAM-TX packet-mark=QOS-GE1-STREAM parent=GE0-TX priority=2 queue=default
add limit-at=1290k max-limit=1430k name=GE1-STREAM-OTHER-TX packet-mark=QOS-GE1-STREAM-OTHER parent=GE0-TX priority=1 queue=default
add name=GE1-STREAM-RX packet-mark=QOS-GE1-STREAM parent=GE1-RX priority=2 queue=default
add name=GE1-STREAM-OTHER-RX packet-mark=QOS-GE1-STREAM-OTHER parent=GE1-RX priority=1 queue=default
add max-limit=55M name=GE0-RX parent=global queue=default
add name=GE0-CATCHALL-RX packet-mark=QOS-GE0-CATCHALL parent=GE0-RX priority=3 queue=default
add limit-at=400k max-limit=480k name=GE0-CATCHALL-TX packet-mark=QOS-GE0-CATCHALL parent=GE0-TX priority=3 queue=default
The additional mangle rules I have included are the only ones that are active 24/7, since this is a dual-WAN environment and everything should exit the way it entered.


Now onward to the actual problem(s):
  1. WAN1 is a 100/10Mbps VDSL connection. My hAP AC can handle the upstream prioritization/shaping and everything works perfectly, but the problems start when someone utilizes the downstream at full blast: CPU utilization immediately spikes to 100% and stays there, and a significant number of packets are dropped from the queues.
  2. If the high CPU utilization is sustained for long enough, the watchdog kicks in and reboots the device. (This has already happened 5 times in 3 days.)

This is what MikroTik support had to say about the watchdog reboots:
Each parent queue (with child queues if there are any) works on single CPU core. All packets that must be processed by this queue go through it in the order and if such queue manages to load single CPU core to 100%, then other cores must also wait for these packets to be processed since packet processing is the highest priority task on router. At that point other CPU cores gets stuck on 100% (basically on waiting state) and as a result single queue may overload router completely. Such situation may lead to a Watchdog timer.

And some questions:
  • Is there a way I can do this a little bit more efficiently and overwhelm the device less, besides limiting the overall downstream to 40Mbps? I clearly must be doing something wrong, there's no way the device can't handle more than that.
  • Why on earth isn't watchdog prioritized over packet processing? I believe it's more catastrophic if the device reboots, than to lose/delay a packet or two.

Thoughts?
Thanks guys!

PS: I have struck through some information that might have been invalid or irrelevant.
Last edited by Frostbyte on Mon Nov 19, 2018 5:02 pm, edited 2 times in total.
 
Frostbyte
Frequent Visitor
Topic Author
Posts: 92
Joined: Mon Dec 25, 2017 1:42 am

Re: Queue Trees, CPU Utilization and Watchdog reboots

Mon Nov 19, 2018 4:59 pm

Anyone? :(
In the event that the rules cannot be further optimized/simplified, looks like I will have to opt for a more powerful router. (RB4011iGS+5HacQ2HnD-IN comes to mind)


With regard to the watchdog reboots, I'm currently troubleshooting this with MikroTik support and so far we've established the following (without reaching a resolution yet):
  1. The router will experience a watchdog reboot even if I just let it sit and avoid hammering it down with utilization. (I have also tested with the mangle rules disabled entirely)
  2. Tried upgrading to v6.44beta31 which supposedly fixes a very similar problem, based on their lab tests. (The reboots occurred every 3-4 days instead of 1-2 days)
  3. Tried swapping the power supply unit with another one (exact same model).
Honestly, I believe it has something to do with v6.43.4 and the fact that it's not as stable as it seems. I believe I saw a few posts from users complaining about the stability of that version.
For the time being I have downgraded back to v6.43.2 and I will observe for a little while. This specific version had been serving me well for at least 3 weeks, before I took the plunge and upgraded.
 
sebastia
Forum Guru
Posts: 1782
Joined: Tue Oct 12, 2010 3:23 am
Location: Antwerp, BE

Re: Queue Trees, CPU Utilization and Watchdog reboots

Mon Nov 19, 2018 10:33 pm

Hi

You could start by investigating where most of your CPU time goes => CPU profile.
But that will probably be firewall... please confirm.
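
A minimal sketch of that check from the terminal, assuming RouterOS v6 as used in this thread. Run it while the downstream is saturated and look at which classifier (for example firewall, queuing or networking) tops the list.

```
# Overall CPU load at a glance.
/system resource print

# Per-task CPU usage; on a single-core device "cpu=all" simply shows that one core.
/tool profile cpu=all
```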

To be honest, a single core @720 MHz is not that much, given what you want to do: load balance & prioritise -> both require mangling packets, which is CPU intensive.
What you could try is:
* define the one WAN where the bulk of your traffic goes -> that will be your default route, the other will be secondary
* within that route define the bulk of your traffic as the default class -> that will be your "no mark" class, which can be FASTTRACKed

Fasttracking will reduce load on the CPU, but it bypasses mangling, hence you need the above assumptions.
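
As a rough sketch of that idea, the pair of filter rules below fasttracks only connections that carry no connection mark. It assumes the bulk traffic is deliberately left unmarked by the mangle rules, which is not how the configuration in the first post currently works, and the comments are illustrative.

```
/ip firewall filter
# Assumption: bulk traffic is left without a connection mark on purpose.
add chain=forward action=fasttrack-connection connection-state=established,related connection-mark=no-mark comment="FastTrack unmarked (bulk) connections"
# Everything else that is established/related keeps going through the normal path.
add chain=forward action=accept connection-state=established,related comment="Accept remaining established/related"
```

Fasttracked packets skip mangle and the queue tree, so only traffic that needs neither marking nor shaping should match the first rule.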

Or just buy a new router and use the hAP ac as an AP only; a hEX gr3 will do what you want and isn't expensive.
 
Frostbyte
Frequent Visitor
Topic Author
Posts: 92
Joined: Mon Dec 25, 2017 1:42 am

Re: Queue Trees, CPU Utilization and Watchdog reboots

Tue Nov 20, 2018 5:40 pm

sebastia wrote:
> Hi
>
> You could start by investigating where most of your CPU time goes => CPU profile.
> But that will probably be firewall... please confirm.

Hi,

I can confirm it's firewall.

sebastia wrote:
> To be honest, a single core @720 MHz is not that much, given what you want to do: load balance & prioritise -> both require mangling packets, which is CPU intensive.
> What you could try is:
> * define the one WAN where the bulk of your traffic goes -> that will be your default route, the other will be secondary
> * within that route define the bulk of your traffic as the default class -> that will be your "no mark" class, which can be FASTTRACKed
>
> Fasttracking will reduce load on the CPU, but it bypasses mangling, hence you need the above assumptions.

Figures. Apparently I'm asking too much from a device that might not have been designed with that kind of workload in mind.
I already have fasttracking in place, which helps with the LAN to LAN traffic.

My network topology and requirements are not really compatible with the above suggestions.
But it's helpful to know that this could've been an option, so thank you for the information.

I really only wanted to focus on the configuration excerpt that I provided, to see if I could simplify it further.
When I created those rules, my reasoning was as follows:
  1. I have a rule at the top of the Mangle table that accepts LOCAL to LOCAL traffic, so all the LAN <-> LAN traffic does not get processed by any other rules.
  2. Since I have a dual-WAN setup, I can reuse the existing connection-marking rules, which force traffic to exit from its point of entry (WAN1 or WAN2).
  3. Then I would focus on the already marked WAN1 connections and sniff out the traffic category that I know both the source and destination for (Streaming PC -> Twitch ingest servers)
  4. Whatever escapes #3 (still has no-mark) but originates-from/goes-to the same source (Streaming PC) can be classified as the WAN1 gaming traffic, since it's not possible to know every IP address and Port for every single game and game server out there.
  5. Whatever escapes #4 (still has no-mark) can be classified as WAN1 traffic originating-from/going-to other Hosts/VLANs.
Obviously #4 and #5 require pairs of rules so we can capture both ways. For #3 we are only interested in the upstream, so all we really need is one.
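
One possible rearrangement, shown only as a sketch: move the four packet-marking rules into a custom chain, so that packets which do not belong to a WAN1 connection are checked against a single jump rule instead of four mark-packet rules. The chain name "qos-wan1" is hypothetical; the marks and address lists are the ones from the first post. Whether this makes a measurable difference on the hAP ac has not been tested.

```
/ip firewall mangle
# Hypothetical chain name "qos-wan1". The jump rule would need to sit after the
# connection-marking rules and before the mark-routing rules in prerouting.
add action=jump chain=prerouting connection-mark=WAN1 jump-target=qos-wan1 comment="[LVSTRM] Classify WAN1 packets"
# Step 3: Streaming PC -> Twitch ingest servers.
add action=mark-packet chain=qos-wan1 packet-mark=no-mark src-address-list=STREAM dst-address-list=TWITCH new-packet-mark=QOS-GE1-STREAM passthrough=yes
# Step 4: anything else from or to the Streaming PC.
add action=mark-packet chain=qos-wan1 packet-mark=no-mark src-address-list=STREAM new-packet-mark=QOS-GE1-STREAM-OTHER passthrough=yes
add action=mark-packet chain=qos-wan1 packet-mark=no-mark dst-address-list=STREAM new-packet-mark=QOS-GE1-STREAM-OTHER passthrough=yes
# Step 5: everything that is still unmarked.
add action=mark-packet chain=qos-wan1 packet-mark=no-mark new-packet-mark=QOS-GE0-CATCHALL passthrough=yes
```

When the end of the custom chain is reached, processing returns to prerouting, so the mark-routing rules still apply. Disabling the jump rule alone switches off the whole classification, which fits the existing "LVSTRM" toggle.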

Could I somehow achieve this with fewer or simpler rules? I believe it might not be possible.

sebastia wrote:
> Or just buy a new router and use the hAP ac as an AP only; a hEX gr3 will do what you want and isn't expensive.

Well, I'd rather not introduce additional hops and devices (which can be potential points of failure) within the network.
Getting a RB4011iGS+5HacQ2HnD-IN unit could be a good drop-in upgrade on all aspects. I could then decommission and sell the hAP ac units.

Thanks!
 
ivicask
Member
Posts: 422
Joined: Tue Jul 07, 2015 2:40 pm
Location: Croatia, Zagreb

Re: Queue Trees, CPU Utilization and Watchdog reboots

Tue Nov 20, 2018 6:16 pm

I actually have the same issue with the exact same router: I've had 3 random watchdog reboots in the past 10 days. This has only started happening to me since the latest update (44beta28), but I haven't had much time to debug it or change versions.
 
Frostbyte
Frequent Visitor
Topic Author
Posts: 92
Joined: Mon Dec 25, 2017 1:42 am

Re: Queue Trees, CPU Utilization and Watchdog reboots

Wed Nov 21, 2018 4:47 pm

At the very least, that's a sign that my hardware isn't failing.. they'll eventually fix it, so it's gonna be a-okay.

A colleague of mine was also experiencing random reboots on v6.43.4 with an RB3011UiAS-RM unit, but they opted to send it back to MikroTik.
v6.44beta31 still causes reboots for me though, it's just that they're not as frequent.
 
sebastia
Forum Guru
Posts: 1782
Joined: Tue Oct 12, 2010 3:23 am
Location: Antwerp, BE

Re: Queue Trees, CPU Utilization and Watchdog reboots

Wed Nov 21, 2018 5:30 pm

If these reboots are just because the router is slow to respond due to high CPU load, but does respond, you could disable the watchdog for the time being...
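
A minimal sketch of how that would be done from the terminal. This only makes sense as a temporary diagnostic step, since a real crash will then leave the router hung instead of rebooting it.

```
# Show the current watchdog settings.
/system watchdog print

# Temporarily disable the watchdog timer.
/system watchdog set watchdog-timer=no

# Re-enable it once testing is done.
/system watchdog set watchdog-timer=yes
```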
 
ivicask
Member
Member
Posts: 422
Joined: Tue Jul 07, 2015 2:40 pm
Location: Croatia, Zagreb

Re: Queue Trees, CPU Utilization and Watchdog reboots

Wed Nov 21, 2018 5:34 pm

sebastia wrote:
> If these reboots are just because the router is slow to respond due to high CPU load, but does respond, you could disable the watchdog for the time being...

I did that, then the router froze and was not accessible for 5 minutes, until I force-rebooted it by cutting the power. It still switched traffic to the access point connected to it, but the router itself was dead.

So at least the watchdog reboots it; better not to turn that off.
 
Frostbyte
Frequent Visitor
Topic Author
Posts: 92
Joined: Mon Dec 25, 2017 1:42 am

Re: Queue Trees, CPU Utilization and Watchdog reboots

Thu Nov 22, 2018 11:06 am

sebastia wrote:
> If these reboots are just because the router is slow to respond due to high CPU load, but does respond, you could disable the watchdog for the time being...

We concluded that the reboots are a totally unrelated issue (most likely a bug in ROS v6.43.4), which is why I struck through those lines in my original post.
They can occur even when the mangle rules are disabled entirely and there's minimal load (4-5%) on the router.

At this point I should clarify that these reboots are the result of watchdog defibrillating the device from actual software crashes.
As ivicask has already pointed out, disabling it will leave the router in an unresponsive state, when a crash comes around the corner again.
So unless debugging is being carried out to find the cause of said crash, it's best to keep it enabled.

MikroTik support has already been given access to my device and are currently investigating it. They also said they will provide a fix once they're done.

To get back on topic: given the configuration and limitations I have already described, I suppose these rules are as good as they can get, right?
(Not fussed, new devices soon :) )
 
Steveocee
Forum Guru
Posts: 1120
Joined: Tue Jul 21, 2015 10:09 pm
Location: UK

Re: Queue Trees, CPU Utilization and Watchdog reboots

Thu Nov 22, 2018 11:23 am

Your rules don't look especially heavy. I've had a similar number running on an RB951Ui, which on paper has a worse CPU (note the "on paper" part, though). That said, in my testing the RB951Ui often performs better with Btest than the hAP ac, which should be better due to its newer and faster CPU.

You shouldn't be stressing the router to 100% with what you have unless you are running very connection-heavy applications (gaming may possibly do this, and P2P can often ramp up like this).
 
Frostbyte
Frequent Visitor
Topic Author
Posts: 92
Joined: Mon Dec 25, 2017 1:42 am

Re: Queue Trees, CPU Utilization and Watchdog reboots

Mon Dec 10, 2018 11:42 am

Thanks for the insight, Steveocee. Unfortunately, the high utilization can be observed even in simple bandwidth tests, which I believe use a single connection.
I should point out, however, that the CPE and the PC in question are in different VLANs, and this definitely puts a burden on the CPU. Whether it's abnormally high or not, I do not know.

An update to the reboot situation now:
With RouterOS v6.43.7, which was released last week, I was able to survive for 7 days without a watchdog-timer reboot, but sadly it happened again.
It's more stable, in the sense that I'm not getting 3 reboots per day as with v6.43.4, but it's still annoying.
EDIT: I take back what I said there, I just got another one within 30 minutes...
EDIT2: And another two, a few hours later.

Even though support had asked for remote access to monitor my device, they were just opening an SSH connection every other day in the morning and they never managed to catch anything useful to the case. Nobody was actively monitoring the device (I'm not unreasonable, but why not check back every 1-2 hours?) or cared that it was suffering 3-4 reboots daily, to help figure out the problem. They relied on getting the information they needed on the first try, every time. When I asked for confirmation that anybody was bloody looking at these, I was told that "MikroTik support usually reply within three business days" and was ultimately left with an unstable device.
Last edited by Frostbyte on Fri Dec 28, 2018 10:48 am, edited 1 time in total.
 
Frostbyte
Frequent Visitor
Topic Author
Posts: 92
Joined: Mon Dec 25, 2017 1:42 am

Re: Queue Trees, CPU Utilization and Watchdog reboots

Fri Dec 14, 2018 1:17 pm

Support contacted me once again, they told me that they managed to get a reproduction in their lab environment.
They suspect that it's highly possible that their findings, my issue and the issue denoted by others here, may in fact be the same thing.
They stated that they'd rather not jump to conclusions until they find the root cause and confirm the above statement.

Will keep you guys posted.
 
Frostbyte
Frequent Visitor
Topic Author
Posts: 92
Joined: Mon Dec 25, 2017 1:42 am

Re: Queue Trees, CPU Utilization and Watchdog reboots

Fri Dec 28, 2018 10:58 am

Well, it's been two weeks of silence. I wanted to chime in and say that I ultimately solved my issues by migrating over to 2x RB4011iGS+5HacQ2HnD-IN units.
I've sent another email to support, letting them know that I'd be willing to test any potential fixes, until I sell my hAP ac units, that is.

Steveocee: You're definitely right that the rules aren't particularly heavy.
With the hAP ac units, the CPU would max out when the Queues were being hit by >=40Mbps traffic.
With the RB4011 units, the CPU doesn't even get past 6% even when the Queues are being hit by the full 90Mbps traffic that my primary line can offer.
