I think this problem appeared in some recent versions of ROS.
It definitely exists on 6.35.4 and 6.38.1.
When we have network outages, some PPPoE sessions are disconnecting with ‘peer not responding’ errors.
At these moments the CCR CPUs are 100% utilised, so the router almost stops passing any traffic.
This can continue for several minutes. The router loses its OSPF neighbors, and it all becomes a catastrophe!
If the number of disconnecting sessions exceeds 200-300, it collapses 100%.
If you are using Masquerade on the router, that is the problem.
When using Masquerade, RouterOS has to do a full connection tracking recalculation on EACH interface connect/disconnect.
So if you have lots of PPPoE sessions connecting/disconnecting, connection tracking will constantly be recalculated, which can cause high CPU usage.
Solution:
Stop using Masquerade on routers that have a lot of dynamic interfaces.
Either use src-nat, or fix your architecture (use routing).
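As a rough sketch of what swapping Masquerade for src-nat could look like (the interface name, private subnet and public address below are made up for illustration and are not taken from anyone's config in this thread):

```
/ip firewall nat
# Before (masquerade picks the address from the outgoing interface):
# add chain=srcnat out-interface=ether1-wan action=masquerade

# After: src-nat with a fixed public address.
# ether1-wan, 10.0.0.0/16 and 203.0.113.1 are hypothetical values.
add chain=srcnat src-address=10.0.0.0/16 out-interface=ether1-wan \
    action=src-nat to-addresses=203.0.113.1
```

This only makes sense when the WAN address is static; with a dynamic WAN address src-nat would need to be updated whenever the address changes.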
In my case, I do not use masquerading, just src-nat.
Even when I select just some sessions (e.g. 20-30 of them) that use public IP addresses from the pool, the CPU rises and takes a while to go down.
Months ago I used Interim Update. It was a disaster: lots of load on the CCR.
Putting two and two together, ‘maybe’ it’s related to accounting, or to the packet counts that the sessions send to the RADIUS server (the counting process itself).
Let’s hope the support/development staff can dig deeper and show us a fix.
Yes, Masquerade and routing are both in one box.
The scenario is something like this:
4 WAN DSL links configured with PCC using the src-address approach, for a specific group of users.
1 WAN link for routing the users with public IPs.
Hello, we have the same problem, using src-nat and not Masquerade. We have already reviewed our architecture and cannot find a problem. On a box with 1000 sessions, when between 100 and 200 sessions drop, all the routing and tunnels go down and the CPU rises to 90-95%.
I have the same problem, and I’m not using any Masquerade.
My PPPoE clients are connected via VPLS, and each VPLS handles about 100-200 clients. If one of them disconnects because of a network fault, CPU usage increases to 100% and the router freezes.
ROS: 6.38.7
Devices: 3x CCR1036
Each device handles about 1K PPPoE clients
**If I disable connection tracking, the problem is solved, but then I can’t use src-nat.**
(Attachment: cpu_usage_pppoe_disconnect.png) Does anybody know what the problem is?
If your clients are on public IPs for the most part, you can have connection tracking turned off for some things and on for other things, controlling that with the Raw table.
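Something along these lines is what I have in mind. Treat it as a sketch: the address list name public_pools and the 203.0.113.0/24 range are placeholders, and as far as I know the Raw table needs RouterOS 6.36 or later.

```
/ip firewall address-list
# Hypothetical list holding the public ranges handed out to PPPoE clients
add list=public_pools address=203.0.113.0/24

/ip firewall raw
# Skip connection tracking for traffic from and to the public pools
add chain=prerouting action=notrack src-address-list=public_pools
add chain=prerouting action=notrack dst-address-list=public_pools
```

Keep in mind that untracked traffic cannot be NATed, and filter rules that match on connection-state will not see it as established/related, so check your filter rules before rolling this out.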
If your clients are on private IPs, a good workaround might be to do the NAT on a separate router rather than the same device.
Yes. Also note that if you want some contents of your public_pools to still be processed by connection tracking, you can “accept” that traffic above the notrack rule. Accepted traffic is still processed by the main firewall as well, so accept doesn’t mean “I trust this”; it means “I want to track this”.
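For example, a minimal sketch of that ordering, assuming an address list called public_pools already exists and using 203.0.113.10 as a made-up address that should stay tracked:

```
/ip firewall raw
# These accept rules must sit ABOVE the notrack rules.
# 203.0.113.10 is a hypothetical client that still needs connection tracking.
add chain=prerouting action=accept src-address=203.0.113.10 comment="keep tracked"
add chain=prerouting action=accept dst-address=203.0.113.10 comment="keep tracked"

# Everything else in public_pools bypasses connection tracking
add chain=prerouting action=notrack src-address-list=public_pools
add chain=prerouting action=notrack dst-address-list=public_pools
```

Rules are appended in the order they are added, so this works as written on an empty Raw table; if you already have notrack rules there, move the accept rules above them.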