flameproof
Member Candidate
Topic Author
Posts: 128
Joined: Tue Sep 01, 2015 3:17 pm

CCR1072 running out of CPU, what next for a PPPoE ISP?

Tue May 05, 2020 12:18 pm

Hi all,

I'm writing to pick the collective mind on our conundrum. We have been in contact with Mikrotik support about a problem we are having in some of our networks (we're a small ISP in Kenya): with 1400 to 2000 CPEs connected over PPPoE, the CCR "flaps", getting into a state where:

- PPPoE sessions get dropped
- RADIUS timeouts increase drastically, even though RADIUS is functioning fine (we use RADSEC)
- WebFig and Winbox both work, but all sections of the configuration are empty - no interfaces, no firewall rules, nothing
- Generating a supout.rif sometimes results in an "empty" file - it's still 5MB in size, but it contains no configuration details
- The CCR is unresponsive over API, SSH or telnet, where login prompts "freeze" after connection

The official answer from Mikrotik is, essentially, "you have run out of CPU".

Thus, I'd like to get people's views on:

- Are we really maxing out a CCR1072 with 1500-2000 PPPoE clients, each on a 4Mbps plan with burst to 5Mbps, each having a simple queue?
- Has anyone experienced the "freezing" issue before? It causes all traffic to drop at once, and slowly build up again over the course of 10 minutes or so
- Can anyone recommend PPPoE concentrators in the "low cost ISP" range?

We have contacted a couple of Mikrotik consultants who came up empty. If anyone has the experience and wants to take a stab at improving our configs & solving the issue, we're happy to set up a consultancy gig.
 
sindy
Forum Guru
Posts: 10205
Joined: Mon Dec 04, 2017 9:19 pm

Re: CCR1072 running out of CPU, what next for a PPPoE ISP?

Tue May 05, 2020 1:01 pm

First of all, two cases are possible:
  • you are running at the limit of capacity all the time, which /tool profile or, even better, the CPU usage graph should show,
  • you use an action=masquerade rule to src-nat your clients (speculating, as you gave no details), and this issue is triggered by a glitch of your uplink interface (or a change of its address - again, details missing), which causes all the existing connections to get removed from the connection tracking table at the same time, which causes a huge CPU load.

If it is the first case (constant load), 4 Mbit/s per each of 2000 clients means 8 Gbit/s in total, which is an order of magnitude below the declared 75 Gbit/s for 512-byte packets.
However:
  • there is the PPPoE header manipulation which is not part of the throughput test and on other models it has a noticeable impact on the throughput,
  • probably even more important, the 2000 individual simple queues make me cautious. The rules defining the simple queues are matched like firewall rules, one by one from the top until first match, for every single packet, so it may slow down the packet processing significantly. Just have a look at the test results on the product page - for 512-byte packets, already 25 simple queues reduce the throughput to 2/3 of the one stated for routing alone (no simple queues, no firewall rules).

    Again, /tool profile should show which type of packet processing takes the most CPU. So you may consider replacing simple queues with PCQ handling (where the relationship of a packet to the virtual flow handled by the virtual queue is determined internally, so higher efficiency can be expected compared to simple queue matching). I'm not an expert here, so I only suppose that you can create several "normal" queues as parents and put all clients with the same bandwidth contract into the same PCQ queue (all flows in the PCQ queue have the same limit and max-limit parameters).

Lastly, also track the memory consumption using the /tool graph facility. If you see the average memory usage growing all the time, there is a memory leak in one of the processes your deployment uses.
 
flameproof
Member Candidate
Topic Author
Posts: 128
Joined: Tue Sep 01, 2015 3:17 pm

Re: CCR1072 running out of CPU, what next for a PPPoE ISP?

Tue May 05, 2020 1:17 pm

Hi sindy,

Thank you for your very detailed and quick reply. I'll quote and reply in blocks:

you are running at the limit of capacity all the time, which /tool profile or, even better, the CPU usage graph should show

We don't see any single CPU get loaded to 100% for extended periods of time; it's more like many CPUs spike to 100% for 2-5 seconds, then drop back down. Also, nothing in the list of processes shows a higher load than normal; there is literally nothing that spikes or gets out of control during these events.

you use an action=masquerade rule to src-nat your clients (speculating, as you gave no details), and this issue is triggered by a glitch of your uplink interface (or a change of its address - again, details missing), which causes all the existing connections to get removed from the connection tracking table at the same time, which causes a huge CPU load.

All the PPPoE clients have src-nat rules, no masquerade, we only have one final masquerade rule for "catch-all" e.g. internal network devices like P2P radios and sectors, switches, etc.

If it is the first case (constant load), 4 Mbit/s per each of 2000 clients means 8 Gbit/s in total, which is an order of magnitude below the declared 75 Gbit/s for 512-byte packets.

The clients of course have a varying traffic profile, they are not doing 4 Mbps 24x7 - they average out at 0.7Mbps during a 24-hour period.
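For anyone following the numbers, here is a quick back-of-the-envelope check in Python. It only uses figures quoted in this thread (2000 clients, 4 Mbps plans, the 0.7 Mbps 24-hour average, and the declared 75 Gbit/s for 512-byte packets); the variable names are illustrative, and it ignores PPPoE, queue and NAT overhead, which is exactly what the headline figure does not cover.

```python
# Rough capacity check using only figures quoted in the thread.
# It ignores PPPoE, queueing and NAT overhead, so it is an upper bound
# on how relaxed the box should be, not a performance prediction.
clients = 2000
plan_mbps = 4.0      # contracted rate per client
avg_mbps = 0.7       # observed 24-hour average per client
rated_gbps = 75.0    # declared routing throughput for 512-byte packets

peak_gbps = clients * plan_mbps / 1000   # every client at full rate at once
avg_gbps = clients * avg_mbps / 1000     # typical sustained load

print(f"worst case:  {peak_gbps:.1f} Gbit/s ({peak_gbps / rated_gbps:.0%} of rated)")
print(f"24h average: {avg_gbps:.1f} Gbit/s ({avg_gbps / rated_gbps:.0%} of rated)")
```

Either way the raw forwarding rate is far below the declared figure, which suggests the bottleneck is per-session or per-packet processing rather than plain throughput.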

there is the PPPoE header manipulation which is not part of the throughput test and on other models it has a noticeable impact on the throughput,

Maybe this is what doesn't get measured by any profiling or debugging tool...

probably even more important, the 2000 individual simple queues make me cautious. The rules defining the simple queues are matched like firewall rules, one by one from the top until first match, for every single packet, so it may slow down the packet processing significantly. Just have a look at the test results on the product page - for 512-byte packets, already 25 simple queues reduce the throughput to 2/3 of the one stated for routing alone (no simple queues, no firewall rules).

So I focused on this one, and tried using PCQ or simple queues matching the target subnets, but then traffic rates dropped to almost nothing - I'm not an expert in queues and haven't had the time to experiment more, so this is where someone with good knowledge of queues in high-volume scenarios would come in handy :-)

Lastly, also track the memory consumption using the /tool graph facility. If you see the average memory usage growing all the time, there is a memory leak in one of the processes your deployment uses.

No leaks in memory that we could spot either.

Thanks again!
 
sindy
Forum Guru
Posts: 10205
Joined: Mon Dec 04, 2017 9:19 pm

Re: CCR1072 running out of CPU, what next for a PPPoE ISP?

Tue May 05, 2020 1:32 pm

We don't see any single CPU get loaded to 100% for extended periods of time; it's more like many CPUs spike to 100% for 2-5 seconds, then drop back down. Also, nothing in the list of processes shows a higher load than normal; there is literally nothing that spikes or gets out of control during these events.
What is this "normal" load of the cores? 1%, 10%, 50%? The events most likely occur so fast that everything breaks before the rise in CPU consumption can get recorded. If the "normal" load is 50%, it is already bad, as the margin for surprises is very low. These effects grow exponentially rather than linearly.

...tried using PCQ or simple queues matching the target subnets, but then traffic rates dropped to almost nothing
I've recently come across a statement that use of simple queues prevents the regular ones from working, but I cannot find it now. So if you've tried to migrate some of the clients to PCQ while others stayed on simple queues, that might explain the issue. I'd have to see the configuration you've tried in order to possibly spot a flaw in it.

Do you happen to have a smaller 'Tik or a CHR to do the testing using just a few test clients rather than annoying the live people?
 
flameproof
Member Candidate
Topic Author
Posts: 128
Joined: Tue Sep 01, 2015 3:17 pm

Re: CCR1072 running out of CPU, what next for a PPPoE ISP?

Tue May 05, 2020 1:40 pm

So in terms of the loads, the CPUs are in the low 10s at most, except the ones that get to 100%.

In terms of queues, I have a lab setup where I tested queues using 5 hAPs repeatedly downloading a large file from an HTTP server. I got the queues to work there (PCQ queue type holding the rate limit + simple queue targeting the subnet).

I then translated this to one of the small networks with ~200 CPEs on a 1036, and they all got like 10-50kbps, it was all over the place. After that, I gave up...
 
andriys
Forum Guru
Posts: 1526
Joined: Thu Nov 24, 2011 1:59 pm
Location: Kharkiv, Ukraine

Re: CCR1072 running out of CPU, what next for a PPPoE ISP?

Tue May 05, 2020 1:58 pm

The rules defining the simple queues are matched like firewall rules, one by one from the top until first match, for every single packet, so it may slow down the packet processing significantly.
It used to be the case in RouterOS v5, but since early v6 it is not the case anymore. Simple queues are now hash-based.
 
sindy
Forum Guru
Posts: 10205
Joined: Mon Dec 04, 2017 9:19 pm

Re: CCR1072 running out of CPU, what next for a PPPoE ISP?

Tue May 05, 2020 2:07 pm

The rules defining the simple queues are matched like firewall rules, one by one from the top until first match, for every single packet, so it may slow down the packet processing significantly.
It used to be the case in RouterOS v5, but since early v6 it is not the case anymore. Simple queues are now hash-based.
In that case, the manual should be updated accordingly... it still says: "Simple queues have a strict order - each packet must go through every queue until it reaches one queue which conditions fit packet parameters or until the end of the queues list is reached. (In case of 1000 queues, a packet for the last queue will need to proceed through 999 queues before it will reach the destination)"
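To make the difference between the two claims concrete, here is a toy model in Python. It is only an illustration of first-match scanning versus a hash lookup; it says nothing about how RouterOS actually implements simple queues, and the queue names and addresses are made up.

```python
import ipaddress

def linear_first_match(queues, addr):
    """Scan the queue list top-down; return (queue name, rules checked)."""
    for checked, (name, target) in enumerate(queues, start=1):
        if addr == target:
            return name, checked
    return None, len(queues)

def hashed_match(index, addr):
    """Single dictionary lookup, independent of the number of queues."""
    return index.get(addr)

# 1000 hypothetical per-client queues, one target address each.
base = ipaddress.ip_address("10.0.0.0")
queues = [(f"q{i}", base + i) for i in range(1, 1001)]
index = {target: name for name, target in queues}

last_client = queues[-1][1]
name, checked = linear_first_match(queues, last_client)
print(name, "found after checking", checked, "rules")
print(hashed_match(index, last_client), "found with one hash lookup")
```

In the linear model the cost for the last client grows with the list length (999 misses before the hit, as in the manual's example), while the hashed lookup stays flat. Which model applies to a given RouterOS version is the open question here.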
 
flameproof
Member Candidate
Topic Author
Posts: 128
Joined: Tue Sep 01, 2015 3:17 pm

Re: CCR1072 running out of CPU, what next for a PPPoE ISP?

Tue May 05, 2020 2:26 pm

In that case, the manual should be updated accordingly... it still says: "Simple queues have a strict order - each packet must go through every queue until it reaches one queue which conditions fit packet parameters or until the end of the queues list is reached. (In case of 1000 queues, a packet for the last queue will need to proceed through 999 queues before it will reach the destination)"
Totally agree - this is also something that threw me off and took me down the rabbit hole of PCQ. Also, Mikrotik support didn't flag the number of queues as an issue to look into.
 
paulct
Member
Posts: 336
Joined: Fri Jul 12, 2013 5:38 pm

Re: CCR1072 running out of CPU, what next for a PPPoE ISP?

Tue May 05, 2020 3:42 pm

Simple question, why have you not split it over 2 x 1072s?
 
flameproof
Member Candidate
Topic Author
Posts: 128
Joined: Tue Sep 01, 2015 3:17 pm

Re: CCR1072 running out of CPU, what next for a PPPoE ISP?

Tue May 05, 2020 3:55 pm

Simple question, why have you not split it over 2 x 1072s?

Cost, essentially. The 1072 costs over $4000 in Kenya, so it's quite a heavy hit unless we can also grow the number of customers to support the CAPEX. However, that is one test we are going to do, with a 1036 (we don't have that many 1072s lying around!). Any suggestions as to using it just as a PPPoE concentrator? We only have one 10Gbps fiber connection going upstream, which terminates physically at the 1072, so what would be the most efficient way to use the 1036 in between CPEs and the 1072?
 
sindy
Forum Guru
Posts: 10205
Joined: Mon Dec 04, 2017 9:19 pm

Re: CCR1072 running out of CPU, what next for a PPPoE ISP?

Tue May 05, 2020 4:13 pm

what would be the most efficient way to use the 1036 in between CPEs and the 1072?
Well, this changes the perspective a bit. You can either cascade the machines, clearly splitting the tasks between them, i.e. the 1036 would only deal with PPPoE and the 1072 would only deal with the queueing and NAT (provided there is enough capacity between the two so that the not-yet-curbed upload traffic from the CPEs wouldn't clog the interconnection, but upload is rarely an issue with residential clients). Or, better, you can let them work in load sharing if you can redirect part of the PPPoE clients to the 1036 at L2 level (depends on your access network architecture). In this case, each machine would take care of both PPPoE and queueing for "its own" clients, and the 1072 would forward traffic to/from the 1036 bypassing any queues. This would require dedicating different address ranges to the PPPoE clients of each machine to keep the routing rules simple. You cannot split the load 2:1, as the 1072 will need some capacity for connection tracking, which is not necessary on the 1036 if you let it do only PPPoE and queues but no stateful firewalling.
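A minimal Python sketch of the "different address ranges per machine" idea from the load-sharing option. The pool 100.64.0.0/16, the box labels and the even split are assumptions for illustration only; the actual ranges and the share given to each box would depend on the network and on how much headroom the 1072 needs for connection tracking.

```python
import ipaddress

# Hypothetical client pool, split into one contiguous range per concentrator
# so that the upstream box needs a single route per PPPoE machine.
pool = ipaddress.ip_network("100.64.0.0/16")
concentrators = ["ccr1072", "ccr1036"]   # labels only, not device config
ranges = dict(zip(concentrators, pool.subnets(prefixlen_diff=1)))

def owner(addr):
    """Return which concentrator's range an address falls into, if any."""
    ip = ipaddress.ip_address(addr)
    for box, net in ranges.items():
        if ip in net:
            return box
    return None

for box, net in ranges.items():
    print(box, "serves", net)
print("100.64.200.5 belongs to", owner("100.64.200.5"))
```

With contiguous ranges like these, forwarding to the second machine can be decided on the destination prefix alone, which is what keeps the routing rules simple.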
 
xormac
just joined
Posts: 1
Joined: Sat Jul 11, 2020 10:28 am

Re: CCR1072 running out of CPU, what next for a PPPoE ISP?

Wed Dec 02, 2020 10:28 am

Hi flameproof,

Did you guys ever find a solution to the problem? We seem to be seeing the same behaviour on our 1072 with 1000+ PPPoE clients.

Best Regards,

Michael
 
flameproof
Member Candidate
Topic Author
Posts: 128
Joined: Tue Sep 01, 2015 3:17 pm

Re: CCR1072 running out of CPU, what next for a PPPoE ISP?

Wed Dec 02, 2020 10:45 am

Hi Michael - we are currently testing a split-duties configuration on one of our smaller networks. In essence, we will use a 1036 as NAT/router, and one or more 1016s as PPPoE concentrators enforcing rate limits via queues. We can thus scale with more 1016s (cheaper) as required.

I'll post here when we see results, the next step will be to implement on our network with 1500+ CPEs.

A couple of mitigating actions we have taken, that have helped reduce the duration of "flaps" and service loss from 15-20 minutes to around 2 minutes:

1. Reduce all the connection tracking timeouts to the bare minimum, see ours:

Screen Shot 2020-12-02 at 09.42.08.png

2. Offload DNS to an external server (e.g. Bind9 or Unbound), thus freeing up the CCR.
3. Set the RADIUS timeout to 1000ms, and go with RADSEC instead of UDP. We find TCP a lot more solid at times of reconnection floods.
4. Make sure your RADIUS server is tuned to cope with large peak request volumes.

Good luck!
 
sindy
Forum Guru
Posts: 10205
Joined: Mon Dec 04, 2017 9:19 pm

Re: CCR1072 running out of CPU, what next for a PPPoE ISP?

Wed Dec 02, 2020 11:03 am

Did you guys ever find a solution to the problem? We seem to be seeing the same behaviour on our 1072 with 1000+ PPPoE clients.
You may be interested in another topic dealing with similar symptoms. In general and short: when the load of a CCR exceeds some critical threshold, things go really wrong really fast, because handling of PPPoE management traffic has no precedence over payload traffic and is quite CPU-hungry on its own, and configuration errors which remain unnoticed during normal operation cause even more load to handle if PPPoE session management starts to tear down and re-establish the sessions due to the initial overload caused by something else.

What I'm trying to say is that there are many possible root causes leading to the same symptoms, so the complete configuration of this particular PPPoE AC machine needs to be analysed. On the other hand, the final conclusion may still be that there is nothing to improve in the configuration and still a "natural" peak of traffic can trigger the disaster.
 
bugino
newbie
Posts: 29
Joined: Tue Aug 08, 2006 12:05 am

Re: CCR1072 running out of CPU, what next for a PPPoE ISP?

Thu Dec 03, 2020 5:30 pm

The threshold for a PPPoE AC is somewhere near 1.4k active connections. More active client connections than that make a mess. It is a problem on both the CCR1036 and the CCR1072; the model doesn't matter. I even turned off NAT, connection tracking and the firewall, as I have a separate CCR1072 for those. With only simple queues and PPPoE running, it still drops active clients and makes tunnels without router, so clients have trouble getting a connection once it reaches 1.4k connections.

Now I am running 3 PPPoE ACs (2 in production and 1 as backup). I mirror secrets by script.
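As a rough sizing aid, the rule of thumb above can be written down in a few lines of Python. The 1400-session limit is the empirical figure from this post, not a documented hardware limit, and the function name and the single spare are assumptions for illustration.

```python
import math

def acs_needed(subscribers, per_ac_limit=1400, spares=1):
    """Return (active ACs, total ACs including spares) for a subscriber count.

    per_ac_limit is the empirical ~1.4k session threshold reported above.
    """
    active = math.ceil(subscribers / per_ac_limit)
    return active, active + spares

for subs in (1000, 2000, 3000):
    active, total = acs_needed(subs)
    print(f"{subs} subscribers: {active} active AC(s), {total} including backup")
```

For about 2000 subscribers this gives 2 active ACs plus 1 backup, which matches the 3-box setup described here.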

viewtopic.php?f=2&t=168602&p=827251&hilit=pppoe#p827251
 
hssindigo
just joined
Posts: 1
Joined: Sat Dec 07, 2019 7:29 pm

Re: CCR1072 running out of CPU, what next for a PPPoE ISP?

Thu Mar 04, 2021 5:54 am

Hello,

We have the same issue with 1000+ clients.

But we were able to find a temporary solution:
1. Do not use a bridge; use a single port for src-nat (you may connect it to an SFP switch next).
2. Disable STP on all interfaces.

We have found this to be stable, with no freezes for at least a week.

If someone who is having a similar issue can confirm this, it would be helpful.
 
flameproof
Member Candidate
Topic Author
Posts: 128
Joined: Tue Sep 01, 2015 3:17 pm

Re: CCR1072 running out of CPU, what next for a PPPoE ISP?

Thu Mar 04, 2021 11:39 am

Greetings,

We have a few more CCRs that have fallen into this problem. CPU #1 gets stuck at 100% on "networking", and all PPPoE sessions are eventually dropped and re-established. We use a software bridge to tie more than one Ethernet port together, as our total subscriber throughput exceeds 1Gbps.

Thus, we have two or three 1Gbps interfaces in a bridge, and the PPPoE server runs on the bridge interface. Tonight we are going to run an exercise to split a CCR with this setup, where two PPPoE servers will each serve a single Ethernet interface, and src-nat upstream.

I'll report back on our findings, thanks for your very well timed input!
 
JaviGL93
just joined
Posts: 12
Joined: Thu Mar 03, 2022 7:13 pm

Re: CCR1072 running out of CPU, what next for a PPPoE ISP?

Thu Mar 03, 2022 7:37 pm

Greetings,

We have a few more CCRs that have fallen into this problem. CPU #1 gets stuck at 100% on "networking", and all PPPoE sessions are eventually dropped and re-established. We use a software bridge to tie more than one Ethernet port together, as our total subscriber throughput exceeds 1Gbps.

Thus, we have two or three 1Gbps interfaces in a bridge, and the PPPoE server runs on the bridge interface. Tonight we are going to run an exercise to split a CCR with this setup, where two PPPoE servers will each serve a single Ethernet interface, and src-nat upstream.

I'll report back on our findings, thanks for your very well timed input!
any new information? ^_^
 
flameproof
Member Candidate
Topic Author
Posts: 128
Joined: Tue Sep 01, 2015 3:17 pm

Re: CCR1072 running out of CPU, what next for a PPPoE ISP?

Fri Mar 04, 2022 9:46 am

It has been a year, and no. What seems clear is Mikrotik is moving away from Tile, has been developing new hardware for months, and we're going to test the new models.
 
chechito
Forum Guru
Posts: 2989
Joined: Sun Aug 24, 2014 3:14 am
Location: Bogota Colombia

Re: CCR1072 running out of CPU, what next for a PPPoE ISP?

Fri Mar 04, 2022 5:28 pm

divide and conquer

you need to scale your design beyond one box to do everything

leave that CCR1072 only for core and internet Border Tasks

Split your PPPoE Load between 2 additional CCR1036 8g 2s+

enjoy
 
chechito
Forum Guru
Posts: 2989
Joined: Sun Aug 24, 2014 3:14 am
Location: Bogota Colombia

Re: CCR1072 running out of CPU, what next for a PPPoE ISP?

Fri Mar 04, 2022 5:32 pm

divide and conquer

you need to scale your design beyond one box to do everything

leave that CCR1072 only for core and internet Border Tasks

Split your PPPoE Load between 2 additional CCR1036 8g 2s+

enjoy
another option:

1 x CCR1036 8g 2s+ only for core and internet Border Tasks (replacing ccr1072)
2 x CCR1036 8g 2s+ to Split your PPPoE Load

sell ccr1072
 
flameproof
Member Candidate
Topic Author
Posts: 128
Joined: Tue Sep 01, 2015 3:17 pm

Re: CCR1072 running out of CPU, what next for a PPPoE ISP?

Fri Mar 04, 2022 5:42 pm

That's what we ended up doing ;-)
 
chechito
Forum Guru
Posts: 2989
Joined: Sun Aug 24, 2014 3:14 am
Location: Bogota Colombia

Re: CCR1072 running out of CPU, what next for a PPPoE ISP?

Fri Mar 04, 2022 10:02 pm

That's what we ended up doing ;-)

Try to modify your configuration so that you can disable every other function on the PPPoE routers, leaving only PPPoE and simple queues; you will see great results.

For GPON, for example, you can try controlling bandwidth on the OLT and then remove the simple queues, leaving only PPPoE running; this way you can max out your PPPoE performance and stability.

Also, when you have so many interfaces (1000s of PPPoE sessions), it can be beneficial to do some tasks using the command line interface; the Winbox graphical interface can be slow and uses some router resources to show and update the status of thousands of interfaces.
 
shailparmar
Frequent Visitor
Posts: 97
Joined: Wed Aug 20, 2014 6:07 pm
Location: GB

Re: CCR1072 running out of CPU, what next for a PPPoE ISP?

Wed Nov 30, 2022 2:35 pm

We are using a 1036 as BNG + a 1072 as NAS, with 1700+ PPPoE users and 4.5Gbps of peak traffic. Running smoothly with 25% CPU usage on the 1072.

You can use this guide for optimization.

https://www.daryllswer.com/edge-router- ... -for-isps/
