PPPoE customers disconnecting on CCR1036

I have a serious problem with two PPPoE concentrators (PPPoE servers). They are two CCR1036-12G units (A and B), paired on the same VLAN with OSPF. Without us doing anything, all PPPoE clients are disconnected from concentrator A, its CPU load goes to nearly 100%, it stops responding to ping, and we lose access to it. The clients then connect to concentrator B. After a little while, the clients are disconnected from B with the same CPU problem, and they authenticate on concentrator A again. I do not know what to do. I have been waiting a week for a response from you, without success. I am an internet provider and we are losing a lot of customers because of this. We have a third CCR1036 concentrator, C, which is also remote and isolated, with the same settings, and the problem does not happen on it.

We have a third CCR1036 concentrator, C, which is also remote and isolated, with the same settings, and the problem does not happen on it.

And same ROS and firmware?

Yours isn’t a simple scenario, and you haven’t provided detailed information (and what you did provide is confusing), but my first reaction would be to check L2 from the CCR down to your clients.

The same firmware version and RouterOS. What more information do you need?

My scenario:

Border router CCR1072 (services: BGP + IPv6 + MPLS) → CCR1036 SFP+ (services: OSPF + IPv6 + MPLS) → CCR1036 PPPoE servers (services: PPPoE server + IPv6 + MPLS).

When the problem occurs, I see that nearly all of the processor usage goes to the firewall (I only have NAT rules mapping public IPs to private IPs, and all the CCR1036s are set up this way), and then all clients are disconnected.

I had a CCR1036-12G-4S that had this behavior. I was able to fix it by switching to the 6.36rc12 (testing) firmware, but that was months ago, and I no longer use the CCR as a PPPoE aggregator. Take this with a grain of salt since you said all three have the same firmware.

Based on the process of elimination, it would seem like your issue might be caused by your redundancy setup since, by your assessment, everything else is identical to the third site.

Thanks for trying to help. Before we even read your answer, we took a new CCR1036, installed version 6.36.3 on it, and had it take over from the two redundant units. The same problem just happened: the firewall at 99% CPU and various PPPoE clients disconnected. I do not know what else to do.

I guess VPLS tunnels are brought to the CCRs onto a bridge and then the PPPoE server is run on top? (That’s the kind of detail I need.)

If so, have you checked whether any tunnel drops out of its running state?

The fact that the firewall is peaking the CPU could indicate some kind of issue with it. Do you have “Use IP Firewall” enabled in the bridge settings?

Hello,

About MPLS: it is enabled, but we have no tunnels in operation. We enabled it only to test whether VLAN or MPLS better serves customer LAN-to-LAN links. The problem already existed before we enabled it.

About the firewall: the only rules we have come from our management system, which changes the pool, reduces the default client queue to 10% of the contracted speed, and puts the client in an address list used in mangle.
The other rules are NAT from public IPs to private IPs. All the CCR1036s have the same rules; the only difference between the CCR1036s that have the problem and the one that works 100% is the public IPs in the NAT rules.

This problem is very strange. Since yesterday, when I posted here in the forum about the last drop of PPPoE clients, the CCR1036 has been up and operating normally at 15% CPU. The problem will most likely happen again by this evening.

Note: we do not have a bridge.


At the time of the problem:
[fernando@CONCENTRADOR-DC-1-A] > /tool profile
NAME CPU USAGE
ppp all 1.2%
firewall-mgmt all 0%
snmp all 0%
spi all 0%
ethernet all 0%
console all 0%
firewall all 79%
networking all 1.9%
winbox all 0%
mpls all 0%
management all 0.3%
routing all 0%
idle all 12%
profiling all 0%
queuing all 1.4%
unclassified all 0.4%

When the problem is not occurring:
[fernando@CONCENTRADOR-DC-1-A] > /tool profile
NAME CPU USAGE
ppp all 0%
firewall-mgmt all 0%
snmp all 0%
spi all 0%
ethernet all 0%
console all 0%
firewall all 3%
networking all 0.9%
winbox all 0%
mpls all 0%
management all 0.5%
routing all 0%
idle all 94.4%
profiling all 0%
queuing all 0.8%
unclassified all 0%
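
To make the comparison between the two snapshots explicit, here is a minimal Python sketch that parses `/tool profile` output and ranks the classifiers by how much their CPU share changed. The function names `parse_profile` and `biggest_changes` are hypothetical helpers written for this illustration, and the sample strings contain only the non-zero rows copied from the two outputs above. It assumes each data row has the form `name all N%`.

```python
def parse_profile(text):
    """Parse pasted `/tool profile` output into {classifier: cpu_percent}."""
    usage = {}
    for line in text.strip().splitlines():
        parts = line.split()
        # Data rows look like "firewall all 79%"; the header row and the
        # console prompt do not end in "%" and are skipped.
        if len(parts) == 3 and parts[2].endswith("%"):
            usage[parts[0]] = float(parts[2].rstrip("%"))
    return usage


def biggest_changes(normal, problem, top=3):
    """Return the `top` classifiers with the largest absolute change."""
    names = set(normal) | set(problem)
    deltas = {
        name: round(problem.get(name, 0.0) - normal.get(name, 0.0), 1)
        for name in names
    }
    return sorted(deltas.items(), key=lambda kv: abs(kv[1]), reverse=True)[:top]


# Non-zero rows copied from the two snapshots in this thread.
PROBLEM = """
NAME CPU USAGE
ppp all 1.2%
firewall all 79%
networking all 1.9%
management all 0.3%
idle all 12%
queuing all 1.4%
unclassified all 0.4%
"""

NORMAL = """
NAME CPU USAGE
firewall all 3%
networking all 0.9%
management all 0.5%
idle all 94.4%
queuing all 0.8%
"""

if __name__ == "__main__":
    changes = biggest_changes(parse_profile(NORMAL), parse_profile(PROBLEM))
    for name, delta in changes:
        print(f"{name:12s} {delta:+.1f} points")
```

On these two snapshots, the drop in `idle` is almost entirely accounted for by the rise in `firewall`; `ppp`, `queuing` and `networking` barely move.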

This seems to be related to the firewall. Could you elaborate a little more on the “management system amending the pool and reduces the queue of the default client”? Maybe the root cause lies there, so the first step would be isolating that; if it proves to be the problem, you’ll have to send a detailed description to support along with supouts and a link to this post.

My system manages PPPoE authentication (RADIUS server) and reduces the rate of delinquent customers (it changes the queue to 10% of the contracted speed) by putting those customers in an address list with a different pool. It also inserts firewall rules to show warning and block pages.
I do not believe the system is the cause, since one of the CCR1036 concentrators is working 100% with the same settings. I am running a test: I turned MPLS off in the PPPoE server profile, because that was the only difference I found between the CCR1036 that works perfectly and the CCR1036 that has this problem. I am on the second day of the test with no drops, but I am still monitoring, since there have been cases where 3 days passed without the problem occurring.