We have several CCRs and noticed small amounts of packet loss (anywhere from 0.3% up to 2%) whenever CPU utilization is above 5%.
One thing we noticed is that the only scenario in which the packet loss doesn’t happen is when the CCR is using Fast Path for all its traffic. If Fast Path is disabled (by firewall rules, QoS, etc.) the packet loss will happen at pretty much any CPU level, starting at about 5% CPU, even when fasttracking all connections.
Typically such levels of packet loss wouldn’t be a huge deal, but when you have 3 or more CCRs making up the backbone of your network and packets going through all of them collectively accumulate 2%+ packet loss, that’s a big problem for us.
Does anyone know why this micro packet loss happens? Is there a way to even avoid it?
To give you some more info, we even tried a CCR that had Fast Path enabled, 1-2% CPU, and 0% packet loss, and the loss appeared after simply adding the firewall rule
ip firewall filter add chain=forward action=fasttrack-connection
Under System->Resources->CPU, see whether a specific CPU is hitting 100% at the time of the packet loss.
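If you prefer the CLI, the per-core load can also be read there (exact columns vary a bit between RouterOS versions):

/system resource cpu print

Watch it during the loss; a single core pinned at or near 100% while the overall average looks low is the usual signature.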
The biggest challenge with the CCR’s is that they have a large number of relatively weak CPUs, so a single CPU can get overloaded well before the entire unit does. Usually this is because something is configured in a way that is not optimal for multi-threading, and there is often a different configuration that will work that is more friendly for multi-threading.
From checking the logs this might indeed be the problem (the current RouterOS version doesn’t support this, I’m yet to update it).
Also, I noticed that disabling IP Flow in this CCR (1016) reduced CPU usage from ~20% to ~15%, while on another CCR (1072) enabling it had no notable change in the CPU which was still at 1-2%, both with about the same amount of traffic.
Additionally, this CCR had no firewall nor queues up and had Fast Path enabled for all its traffic. The only configurations are VLANs, bridges, bonding, and OSPF routing.
Is there any way to prevent individual CPUs from reaching 100% usage with this kind of setup? Such poor multi-threading seems to me like a fault of either the Linux kernel used or RouterOS itself.
Yes, certain ways of configuring things result in excessive load on a single core. The first thing to determine is what element of your config is the culprit - use the profiler tool in RouterOS to determine what process is responsible for the high load on one CPU, that’s a good place to start.
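From the CLI that would be something like the following (the per-core breakdown needs a reasonably recent 6.x release):

/tool profile cpu=all

Note which process name dominates on the busy core.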
Once you know what is putting a large load on one CPU you can look at how to reconfigure it to make it more multicore friendly.
For instance, one common way of setting up queueing is a queue tree with parent=global; there are many scripts out there that do this, and it works great on a MikroTik home router. For a CCR it is terrible, because all queue trees with a common parent are processed on a single CPU core, and it is better to use other queue configurations that accomplish the same task but distribute the work across the cores. Other processes can similarly bog down a single core; you just have to track down what is causing it.
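As a rough sketch (interface names and limits here are made up), the difference is between:

/queue tree add name=all-down parent=global max-limit=100M

which funnels all queued traffic through one core, and per-interface parents such as:

/queue tree add name=down-e1 parent=ether1 max-limit=100M
/queue tree add name=down-e2 parent=ether2 max-limit=100M

which should at least let different interfaces’ queues land on different cores.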
I need to update RouterOS, the current version in that CCR doesn’t support per-core profiling.
Will do that later tonight and come back with more info.
OK.. also BTW I think you misunderstand fastpath - fastpath is automatically active when it is enabled and you have no firewall rules.
Fasttrack is kind of a ‘fastpath-lite’: you can fastpath some traffic in situations where you need to have firewall rules and other such things. It is not as efficient as just having everything fastpath’ed.
That is why adding the rule:
ip firewall filter add chain=forward action=fasttrack-connection
actually increases your CPU usage: by doing that you are disabling full-blown fastpath (as soon as you have firewall rules, fastpath is disabled) and enabling fasttrack.
Yes, I am fully aware about all of that. That firewall rule was for testing purposes to help identify why I noticed such packet loss especially when the firewall was enabled (which means Fast Path disabled), and just to make sure it wasn’t my filtering rules causing the issue I experimented with that simple rule and even then I had the exact same packet loss.
OK.. also, it’s a good idea to check for certain bad settings that can kill your performance. For instance, turning on “Use IP firewall” in the bridge settings kills the CPU; good to identify whether there might be some issue like that.
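A quick way to check and fix that particular one from the CLI (setting name as in 6.x):

/interface bridge settings print
/interface bridge settings set use-ip-firewall=no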
Do you have any masquerade rules? What is your OSPF topology like? Is there anything that would cause frequent LSAs, such as /32 routes for PPP tunnels?
Also, I am wondering about the bonding: fastpath apparently only works with bonded interfaces on receive, and apparently even then only since RouterOS 6.30.
Regarding NAT, like I said I tried with the firewall completely empty save for that fasttrack rule and the problem still occurred.
Our network has 700ish OSPF routes, no other routing protocols besides OSPF.
There are no flapping LSA advertisements that I can see. Also no PPP tunnels.
RouterOS version on this one is 6.34.6, not sure if Fast Path is actually active on the bonding interfaces since there is no indicator, but there is basically no traffic on them to begin with, 200 Mbps up+down tops.
Do you have connection tracking turned on? How big is your connection tracking table? I previously found that port scanning created many connections to port 445, filling up the connection tracking table and causing CPU spikes. Not sure if this is exposed to the Internet or whether you are blocking those ports.
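For reference, counting the conntrack entries and dropping that kind of scan traffic before it is tracked would look roughly like this (note the raw table only exists from RouterOS 6.36 onward):

/ip firewall connection print count-only
/ip firewall raw add chain=prerouting protocol=tcp dst-port=445 action=drop comment="drop SMB scans before conntrack"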
Also, what about your ARP table? If you have an interface name instead of an IP as the next hop, ARP entries for remote hosts (e.g. on the Internet) accumulate in the ARP table on the local device; often it is just due to a mistake.
conntrack is set to auto, and since I currently have no filter/NAT/mangle/raw rules it is off.
when it is on there are probably anywhere from 150k to 500k connections.
all routes in the routing table have IPs as the gateway (nexthop), with the exception of connected routes.
by bridge table do you mean bridge>hosts? currently 464.
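For the record, both table sizes can be pulled from the CLI with:

/ip arp print count-only
/interface bridge host print count-only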
Recently I had an issue where I found I needed to disable IP Route Cache under IP settings. Routers would hang for no reason, with multiple OSPF adjacency changes. The IP route cache, normally sitting between 90 and 400, would quickly climb and CPU usage would be high; it would then lock up. I have since turned off IP route cache on all my routers. No problems since, and CPU usage is down as well.
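In case it helps anyone, the setting referred to (present in RouterOS 6.x; I believe the route cache was removed in later versions) is:

/ip settings set route-cache=no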
Hi Railander, I see the same behaviour on my network.
On a CCR1016-12S-1S+ I have small packet loss, and at the next hop through another 1016 the packet loss adds up.
No problem on a CCR1072.
I tried recording the CPU usage with Camtasia, but I never saw a core go to 100%.
See the blue lines on smokeping:
pl.png
Have you found a solution?
Are you sure that the 1016’s themselves are the cause of the loss, and not the intervening connectivity? In general, loss is associated either with a single core being maxed out or close to maxed out (90% or higher), or congestion.