flameproof
Frequent Visitor
Topic Author
Posts: 80
Joined: Tue Sep 01, 2015 3:17 pm

CCR1072 running out of CPU, what next for a PPPoE ISP?

Tue May 05, 2020 12:18 pm

Hi all,

I'm writing to pick the collective mind on our conundrum. We're a small ISP in Kenya and have been in contact with Mikrotik support about a problem in some of our networks: with 1400 to 2000 CPEs connected over PPPoE, we see "flaps", where the CCR gets into a state where:

- PPPoE sessions get dropped
- RADIUS timeouts increase drastically, even though RADIUS is functioning fine (we use RADSEC)
- WebFig and Winbox both work, but all sections of the configuration are empty - no interfaces, no firewall rules, nothing
- Generating a supout.rif sometimes results in an "empty" file - it's still 5MB in size, but it contains no configuration details
- The CCR is unresponsive over API, SSH or telnet, where login prompts "freeze" after connection

The official answer from Mikrotik is, essentially, "you have run out of CPU".

Thus, I'd like to get people's views on:

- Are we really maxing out a CCR1072 with 1500-2000 PPPoE clients, each on a 4Mbps plan with burst to 5Mbps and each with its own simple queue?
- Has anyone experienced the "freezing" issue before? It causes all traffic to drop at once, and slowly build up again over the course of 10 minutes or so
- Can anyone recommend PPPoE concentrators in the "low cost ISP" range?

We have contacted a couple of Mikrotik consultants who came up empty. If anyone has the experience and wants to take a stab at improving our configs & solving the issue, we're happy to set up a consultancy gig.
 
sindy
Forum Guru
Posts: 5394
Joined: Mon Dec 04, 2017 9:19 pm

Re: CCR1072 running out of CPU, what next for a PPPoE ISP?

Tue May 05, 2020 1:01 pm

First of all, two cases are possible:
  • you are cutting the edge all the time, which /tool profile or, even better, CPU usage graph should show,
  • you use an action=masquerade rule to src-nat your clients (speculating, as you gave no details), and this issue is triggered by a glitch of your uplink interface (or a change of its address - again, details missing), which causes all the existing connections to get removed from the connection tracking table at the same time, which causes a huge CPU load.

If it is the first case (constant load), 4 Mbit/s per each of 2000 clients means 8 Gbit/s in total, which is an order of magnitude below the declared 75 Gbit/s for 512-byte packets.
However:
  • there is the PPPoE header manipulation which is not part of the throughput test and on other models it has a noticeable impact on the throughput,
  • probably even more important, the 2000 individual simple queues make me cautious. The rules defining the simple queues are matched like firewall rules, one by one from the top until first match, for every single packet, so it may slow down the packet processing significantly. Just have a look at the test results on the product page - for 512-byte packets, already 25 simple queues reduce the throughput to 2/3 of the one stated for routing alone (no simple queues, no firewall rules).

    Again, /tool profile should show which type of packet processing takes the most CPU. So you may consider replacing the simple queues with PCQ handling (where the mapping of a packet to the virtual flow handled by the virtual queue is determined internally, so higher efficiency can be expected compared to simple queue matching). I'm not an expert here, so I only suppose that you can create several "normal" queues as parents and put all clients with the same bandwidth contract into the same PCQ queue (all flows in a PCQ queue have the same limit and max-limit parameters).

Lastly, also track the memory consumption using the /tool graph facility. If the average memory usage keeps growing over time, there is a memory leak in one of the processes your deployment uses.
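
As a rough sanity check of the constant-load arithmetic above, here is a minimal Python sketch. It only multiplies the figures quoted in the post (2000 clients, 4 Mbit/s each, 75 Gbit/s declared for 512-byte packets); the constant and function names are illustrative, and the result says nothing about PPPoE or queueing overhead.

```python
# Back-of-envelope check of the peak aggregate load quoted in the post.
CLIENTS = 2000        # upper end of the client count from the thread
RATE_MBPS = 4         # per-client rate limit from the thread
DECLARED_GBPS = 75    # declared routing throughput for 512-byte packets


def peak_aggregate_gbps(clients, rate_mbps):
    """Aggregate load in Gbit/s if every client used its full rate at once."""
    return clients * rate_mbps / 1000


peak = peak_aggregate_gbps(CLIENTS, RATE_MBPS)
headroom = DECLARED_GBPS / peak

print(f"peak aggregate: {peak} Gbit/s")
print(f"declared throughput / peak: {headroom:.1f}x")
```

This reproduces the 8 Gbit/s figure and shows a ratio of a little over 9 to the declared throughput, i.e. roughly the "order of magnitude" mentioned above.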
Instead of writing novels, post /export hide-sensitive. Use find&replace in your favourite text editor to systematically replace all occurrences of each public IP address potentially identifying you by a distinctive pattern such as my.public.ip.1.
 
flameproof
Frequent Visitor
Topic Author
Posts: 80
Joined: Tue Sep 01, 2015 3:17 pm

Re: CCR1072 running out of CPU, what next for a PPPoE ISP?

Tue May 05, 2020 1:17 pm

Hi sindy,

Thank you for your very detailed and quick reply. I'll quote and reply in blocks:

you are cutting the edge all the time, which /tool profile or, even better, CPU usage graph should show

We don't see any single CPU loaded to 100% for extended periods of time; it's more like many CPUs spike to 100% for 2-5 seconds, then drop back. Also, nothing in the list of processes shows a higher load than normal; there is literally nothing that spikes or gets out of control during these events.

you use an action=masquerade rule to src-nat your clients (speculating, as you gave no details), and this issue is triggered by a glitch of your uplink interface (or a change of its address - again, details missing), which causes all the existing connections to get removed from the connection tracking table at the same time, which causes a huge CPU load.

All the PPPoE clients have src-nat rules, no masquerade, we only have one final masquerade rule for "catch-all" e.g. internal network devices like P2P radios and sectors, switches, etc.

If it is the first case (constant load), 4 Mbit/s per each of 2000 clients means 8 Gbit/s in total, which is an order of magnitude below the declared 75 Gbit/s for 512-byte packets.

The clients of course have a varying traffic profile, they are not doing 4 Mbps 24x7 - they average out at 0.7Mbps during a 24-hour period.
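
To put the 0.7Mbps average into perspective, a small Python sketch of the average aggregate load for the client counts mentioned in this thread. The names are illustrative, and the calculation assumes the average applies uniformly to every client.

```python
# Average aggregate load, using the 24-hour per-client average from the post.
AVG_KBPS = 700  # 0.7 Mbps expressed in kbit/s to keep the arithmetic in integers


def average_aggregate_gbps(clients, avg_kbps=AVG_KBPS):
    """Aggregate average load in Gbit/s for a given number of clients."""
    return clients * avg_kbps / 1_000_000


for clients in (1500, 2000):
    print(f"{clients} clients: {average_aggregate_gbps(clients):.2f} Gbit/s on average")
```

This is only a daily average; busy-hour load will be higher, so it does not by itself rule out short CPU spikes.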

there is the PPPoE header manipulation which is not part of the throughput test and on other models it has a noticeable impact on the throughput,

Maybe this is what doesn't get measured by any profiling or debugging tool...

probably even more important, the 2000 individual simple queues make me cautious. The rules defining the simple queues are matched like firewall rules, one by one from the top until first match, for every single packet, so it may slow down the packet processing significantly. Just have a look at the test results on the product page - for 512-byte packets, already 25 simple queues reduce the throughput to 2/3 of the one stated for routing alone (no simple queues, no firewall rules).

So I focused on this one, and tried using PCQ or simple queues matching the target subnets, but then traffic rates dropped to almost nothing - I'm not an expert in queues and haven't had the time to experiment more, so this is where someone with good knowledge of queues in high-volume scenarios would come in handy :-)

Lastly, also track the memory consumption using the /tool graph facility. If the average memory usage keeps growing over time, there is a memory leak in one of the processes your deployment uses.

No leaks in memory that we could spot either.

Thanks again!
 
sindy
Forum Guru
Posts: 5394
Joined: Mon Dec 04, 2017 9:19 pm

Re: CCR1072 running out of CPU, what next for a PPPoE ISP?

Tue May 05, 2020 1:32 pm

We don't see any single CPU loaded to 100% for extended periods of time; it's more like many CPUs spike to 100% for 2-5 seconds, then drop back. Also, nothing in the list of processes shows a higher load than normal; there is literally nothing that spikes or gets out of control during these events.
What is this "normal" load of the cores? 1%, 10%, 50%? The events most likely occur so fast that everything breaks before the rise in CPU consumption can get recorded. If the "normal" load is 50%, it is already bad, as the margin for surprises is very low. These effects grow exponentially rather than linearly.

...tried using PCQ or simple queues matching the target subnets, but then traffic rates dropped to almost nothing
I've recently come across a statement that the use of simple queues prevents the regular ones from working, but I cannot find it now. So if you tried to migrate some of the clients to PCQ while others stayed on simple queues, that might explain the issue. I'd have to see the configuration you tried in order to possibly spot a flaw in it.

Do you happen to have a smaller 'Tik or a CHR to do the testing using just a few test clients rather than annoying the live people?
 
flameproof
Frequent Visitor
Topic Author
Posts: 80
Joined: Tue Sep 01, 2015 3:17 pm

Re: CCR1072 running out of CPU, what next for a PPPoE ISP?

Tue May 05, 2020 1:40 pm

So in terms of the loads, the CPUs are in the low 10s at most, except the ones that get to 100%.

In terms of queues, I have a lab setup where I tested queues using 5 hAPs repeatedly downloading a large file from an HTTP server. I got the queues to work (PCQ queue type holding the rate limit + simple queue targeting the subnet).

I then translated this to one of the small networks with ~200 CPEs on a 1036, and they all got like 10-50kbps, it was all over the place. After that, I gave up...
 
andriys
Forum Guru
Posts: 1353
Joined: Thu Nov 24, 2011 1:59 pm
Location: Kharkiv, Ukraine

Re: CCR1072 running out of CPU, what next for a PPPoE ISP?

Tue May 05, 2020 1:58 pm

The rules defining the simple queues are matched like firewall rules, one by one from the top until first match, for every single packet, so it may slow down the packet processing significantly.
It used to be the case in RouterOS v5, but since early v6 it is not the case anymore. Simple queues are now hash-based.
 
sindy
Forum Guru
Posts: 5394
Joined: Mon Dec 04, 2017 9:19 pm

Re: CCR1072 running out of CPU, what next for a PPPoE ISP?

Tue May 05, 2020 2:07 pm

The rules defining the simple queues are matched like firewall rules, one by one from the top until first match, for every single packet, so it may slow down the packet processing significantly.
It used to be the case in RouterOS v5, but since early v6 it is not the case anymore. Simple queues are now hash-based.
In that case, the manual should be updated accordingly... it still says: "Simple queues have a strict order - each packet must go through every queue until it reaches one queue which conditions fit packet parameters or until the end of the queues list is reached. (In case of 1000 queues, a packet for the last queue will need to proceed through 999 queues before it will reach the destination)"
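
The difference between the two matching models discussed here can be sketched in Python. This is a toy model only, not RouterOS internals: the queue names, the 10.0.0.x addresses and both functions are hypothetical, and whether current RouterOS actually uses a hash is the claim made in this thread, not something the sketch proves.

```python
import ipaddress


def build_queues(n):
    """Hypothetical list of (client address, queue name) pairs, in rule order."""
    base = int(ipaddress.ip_address("10.0.0.1"))
    return [(str(ipaddress.ip_address(base + i)), f"queue-{i}") for i in range(n)]


def linear_match(queues, addr):
    """Top-down first match; returns (queue name, non-matching queues passed)."""
    for skipped, (target, name) in enumerate(queues):
        if target == addr:
            return name, skipped
    return None, len(queues)


def hashed_match(index, addr):
    """Single dictionary lookup, independent of the queue's position."""
    return index.get(addr)


queues = build_queues(1000)
index = dict(queues)          # address -> queue name
last_addr = queues[-1][0]     # the client served by the last queue

name, skipped = linear_match(queues, last_addr)
print(f"linear: {name} after passing {skipped} other queues")
print(f"hashed: {hashed_match(index, last_addr)} in one lookup")
```

With 1000 queues, the linear model passes 999 non-matching queues for the last client, which is the case the manual text describes, while the dictionary lookup costs the same for any client.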
 
flameproof
Frequent Visitor
Topic Author
Posts: 80
Joined: Tue Sep 01, 2015 3:17 pm

Re: CCR1072 running out of CPU, what next for a PPPoE ISP?

Tue May 05, 2020 2:26 pm

In that case, the manual should be updated accordingly... it still says: "Simple queues have a strict order - each packet must go through every queue until it reaches one queue which conditions fit packet parameters or until the end of the queues list is reached. (In case of 1000 queues, a packet for the last queue will need to proceed through 999 queues before it will reach the destination)"
Totally agree - this is also something that threw me off and took me down the rabbit hole of PCQ. Also, Mikrotik support didn't flag the number of queues as an issue to look into.
 
paulct
Member
Posts: 324
Joined: Fri Jul 12, 2013 5:38 pm

Re: CCR1072 running out of CPU, what next for a PPPoE ISP?

Tue May 05, 2020 3:42 pm

Simple question, why have you not split it over 2 x 1072s?
 
flameproof
Frequent Visitor
Topic Author
Posts: 80
Joined: Tue Sep 01, 2015 3:17 pm

Re: CCR1072 running out of CPU, what next for a PPPoE ISP?

Tue May 05, 2020 3:55 pm

Simple question, why have you not split it over 2 x 1072s?

Cost, essentially. The 1072 costs over $4000 in Kenya, so it's quite a heavy hit unless we can also add enough customers to support the CAPEX. However, that is one test we are going to do, with a 1036 (we don't have that many 1072s lying around!). Any suggestions as to using it just as a PPPoE concentrator? We only have one 10Gbps fiber connection going upstream, which terminates physically at the 1072, so what would be the most efficient way to use the 1036 in between CPEs and the 1072?
 
sindy
Forum Guru
Posts: 5394
Joined: Mon Dec 04, 2017 9:19 pm

Re: CCR1072 running out of CPU, what next for a PPPoE ISP?

Tue May 05, 2020 4:13 pm

what would be the most efficient way to use the 1036 in between CPEs and the 1072?
Well, this changes the perspective a bit. You can either cascade the machines, clearly splitting the tasks between them, i.e. the 1036 would only deal with PPPoE and the 1072 would only deal with the queueing and NAT (provided there is enough capacity between the two so that the not-yet-curbed upload traffic from the CPEs doesn't clog the interconnection, but upload is rarely an issue with residential clients). Or, better, you can let them work in load sharing if you can redirect part of the PPPoE clients to the 1036 at L2 level (this depends on your access network architecture). In this case, each machine would take care of both PPPoE and queueing for "its own" clients, and the 1072 would forward traffic to/from the 1036 bypassing any queues. This would require dedicating different address ranges to the PPPoE clients of the two machines to keep the routing rules simple. You cannot split the load 2:1, as the 1072 will need some capacity to do the connection tracking, which is not necessary at the 1036 if you let it do only PPPoE and queues but no stateful firewalling.
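
The idea of giving each machine its own client address range can be sketched in Python. Everything specific here is an assumption for illustration: the 100.64.0.0/20 pool, the router labels and the even split are hypothetical, and the real ratio would have to account for the connection tracking load on the 1072 mentioned above.

```python
import ipaddress

# Hypothetical client pool; replace with the real PPPoE address plan.
POOL = ipaddress.ip_network("100.64.0.0/20")

# Split the pool into two equal halves, one per PPPoE concentrator.
first_half, second_half = POOL.subnets(prefixlen_diff=1)
ASSIGNMENT = {"ccr1072": first_half, "ccr1036": second_half}


def owner(addr):
    """Return the label of the router whose range contains addr, or None."""
    ip = ipaddress.ip_address(addr)
    for router, network in ASSIGNMENT.items():
        if ip in network:
            return router
    return None


for router, network in ASSIGNMENT.items():
    print(f"{router}: {network} ({network.num_addresses} addresses)")
print(owner("100.64.9.10"))
```

Because each machine owns one contiguous prefix, the 1072 would need only a single route towards the 1036 for the second range, which is what keeps the routing rules simple.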
