We are currently stress testing our CCR1016-12G in order to see where its limits are. We are sending more than 1M PPS (64-byte UDP packets) from a host directly connected to the CCR. The router happily routes all this traffic, CPU load is between 0 and 10%.
Very suddenly (after a random interval of seconds or minutes) the CPU load jumps to 100% and stays there. When this happens, the router forwards only 250K PPS; the rest is lost. The router then stays in that state as long as we continue sending that much traffic. When we stop and start over again, the router routes the full 1M PPS without losing anything.
We have observed this with RouterOS 6.34.2 and 6.35rc29.
If we enable profiling we see the following when the router routes 1M PPS:
NAME CPU USAGE
firewall-mgmt all 0%
www all 0%
ethernet all 2.6%
console all 0%
networking all 18.8%
management all 0%
idle all 74.9%
bridging all 3.3%
unclassified all 0%
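For reference, the figures above come from the built-in RouterOS profiler; this is roughly how they can be collected from the CLI (syntax as in RouterOS 6.x):

```
# Show per-process CPU usage; press Q to stop the live view
/tool profile
```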
When the CPU goes to 100%, profiling shows the following:
NAME CPU USAGE
firewall-mgmt all 0%
www all 0%
ethernet all 0.5%
console all 0%
networking all 95.9%
management all 0%
profiling all 0.1%
queuing all 2.5%
bridging all 0.5%
unclassified all 0%
Any ideas on what could be causing this? Why would the router route up to 1M PPS and then drop to 250K and stay there?
Are there any firewall filters, mangle rules or QoS being applied?
I’d generate one supout while the router is passing 1M PPS and another while the problem is occurring, then send an email to MikroTik support describing the situation (or linking to this post) and attaching both supout files, as this could be a hardware problem.
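For completeness, a supout file can also be generated from the CLI; a sketch, assuming RouterOS 6.x syntax and a hypothetical file name:

```
# Generate supout.rif on the router; download it from Files afterwards
/system sup-output name=supout-1mpps
```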
Since you aren’t providing an export or any configuration details here, all I can suggest in the meantime is trying a different queue type under Queues > Interface Queues for the involved interfaces, e.g. changing from only-hardware-queue to ethernet-default, and seeing whether it makes a difference that could provide leads about the nature of the problem.
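From the CLI, the per-interface queue type can be changed roughly like this (the interface name ether1 is a placeholder for whichever ports carry the test traffic):

```
# Inspect the current per-interface queue assignments
/queue interface print
# Try a software queue instead of the hardware queue on the test interface
/queue interface set [find name=ether1] queue=ethernet-default
```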
Also enable debug logging for interfaces.
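A minimal sketch of such a logging rule (topic and action names as in RouterOS 6.x):

```
# Log interface debug messages to memory; view them with /log print
/system logging add topics=interface,debug action=memory
```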
Analyzing a packet capture while the problem is happening may also lead to where the problem is.
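The built-in sniffer can capture to a file for offline analysis in Wireshark; a rough example, assuming ether1 is the ingress interface:

```
# Capture traffic on the ingress interface to a .pcap file
/tool sniffer set filter-interface=ether1 file-name=stress-test.pcap
/tool sniffer start
# ... let it run while the problem is happening, then:
/tool sniffer stop
```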
There is no firewalling, mangling or QoS configured, and I haven’t changed any of the queueing settings. We use LACP, bridging, VRRP and BGP.
I have the feeling that the problem is less likely to happen right after the router is rebooted. Before the last reboot, the problem appeared after a few seconds; now, after rebooting, we have been routing 1M PPS for a couple of minutes without any problem at all.
I will observe this further and try to play with the different queue types before contacting Mikrotik support.
No, unfortunately not. I haven’t been able to reproduce it reliably in order to file a bug report.
I still have this on my to-do list, but at the moment my customer doesn’t want me to conduct more tests on his infrastructure, since it will go into production soon and they consider the risk of being hit by this problem in production to be low.
Does this only occur during testing with a traffic generator, or also in a production environment?
It could be that the testing tool sends traffic with random port numbers and/or addresses, and the router chokes on the ever-increasing number of connection tracking entries. In normal use that would not happen so easily.
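One way to check this hypothesis while the test is running (commands as in RouterOS 6.x; disabling connection tracking is only safe if no NAT or stateful firewall rules depend on it):

```
# Watch the connection tracking table grow during the test
/ip firewall connection print count-only
# If the table is the culprit, disabling tracking should make the problem disappear
/ip firewall connection tracking set enabled=no
```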