Total PPPoE crashing with CCR1072 and x86 - With all RouterOS 6.3x.xx versions available

We have been using a CCR1072 for a few days, and as soon as we connected around 100-200 PPPoE users we started experiencing strange problems:

  1. After a few hours of normal operation, most users start disconnecting and reconnecting every few minutes or seconds.
  2. Some users show up as logged in twice, e.g. username “test” and also “test-1” (maybe as a result of 1).
  3. Almost all dynamic simple queues are red.

The issue occurs after a few hours of normal running, and so far we can’t find what triggers it.

We tried both 6.33.3 and 6.32.3, but the issue remains the same.

Rebooting the device solves the issue immediately, so we can’t blame the network for the connectivity flapping.

Does anyone have ideas about what the issue could be, or what can be checked further?

Are you using local users (secrets) or RADIUS for AAA? What is the “Only one” parameter set to under PPP > Profile > Limits?

Is the CCR firmware up to date?

Hi,

Authentication happens through RADIUS.

In the PPPoE profile, the “Only one” option is set to “default”.
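
For reference, the “Only one” setting can also be inspected and changed from the console. This is only a minimal sketch: the profile name `default` is an assumption (use whichever profile the PPPoE server actually references), and whether `only-one=yes` helps with the duplicate “test”/“test-1” sessions in this setup is untested.

```
# Show all PPP profiles with their settings, including only-one
/ppp profile print detail
# Assumption: the PPPoE server uses the profile named "default"
/ppp profile set [find name="default"] only-one=yes
```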

Under normal circumstances, when the problem is not triggered, everything is smooth, but once the problem starts, nothing but a reboot helps.

I tested versions 6.33.3 and 6.32.3, if that is what you mean by firmware…

I mean System > Routerboard firmware.

It looks like the first PPPoE client connection gets into a “stale” state, and the client then establishes a second connection.

It could be due to some sort of Layer 2 problem unrelated to the CCR, or to the CCR itself.

Check that the firmware is up to date.
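
From the console on the CCR, the installed and available RouterBOOT versions can be compared roughly like this (a sketch; the upgrade only takes effect after a reboot, which will drop all PPPoE sessions, so plan it for a maintenance window):

```
# Compare current-firmware with upgrade-firmware
/system routerboard print
# Flash the newer RouterBOOT, then reboot to apply it
/system routerboard upgrade
/system reboot
```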

Hi,

The firmware is 3.27.

The 200 customers that are connecting are split across 5 different VLANs, and when the issue starts, all users on all VLANs experience it. When I reboot the CCR1072, it’s immediately resolved.

This is the reason I suspect the CCR1072, and not something else.

If that’s the case, then contact Mikrotik Support.

You should first do an export, reset to no defaults, and then reload the configuration. You may also be asked to do a netinstall, so if you can, do it and test whether the problem persists afterwards.

If it does, try to generate a supout file when no issues are happening and another when issues are present, then submit both to support with a detailed explanation of the setup (switches connected to it, etc.).
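
The steps above could look roughly like this on the console. It is a sketch, not a tested procedure: the file names `pre-reset`, `supout-normal.rif` and `supout-issue.rif` are made up for illustration, and because the reset wipes the IP configuration you would need serial or MAC-level access to get back in and run the import.

```
# 1. Save the current configuration as a script (creates pre-reset.rsc)
/export file=pre-reset
# 2. Wipe the configuration without loading the factory defaults (the router reboots)
/system reset-configuration no-defaults=yes skip-backup=yes
# 3. After the reboot, load the saved configuration again
/import file-name=pre-reset.rsc

# Supout while everything works normally
/system sup-output name=supout-normal.rif
# Supout while the sessions are flapping
/system sup-output name=supout-issue.rif
```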

I have even installed one more NAS, based on the x86 platform, but the same problem occurred again after around 30 hours of running.

I’m now running the two NASes (one CCR1072 and one x86) in parallel and have created a supout on both while everything is fine; when the issue occurs I will create a second one.

No other options for now…

I have even installed 1 more NAS that is based on x86 platform, but again same problem occurred after around 30 hours of running.

Which ROS version? Where is it connected?

I’m afraid that could point to L2 problems further down your network, not related to the CCR/x86…

What’s behind the CCR/x86? Network topology?

The problem happened on 6.33.3 on the x86 NAS; I have now downgraded to 6.32.3.

The network is simple.

One data center, where the 2 NASes are connected to a Huawei S6700 switch.

Some of the users are directly connected to the same Huawei, on a VLAN.

The rest of the users come in through VLANs carried over point-to-point links provided by telcos to distant areas.

And are you sure this isn’t a “network glitch” on your provider network or Huawei switch? Try to come up with a test to prove whether that’s the case…

Do you have any sort of HA setup?

Double check physical connections to the core network.

So ALL VLANs are failing?

I don’t think it’s a network-related issue, because:

  1. Some of the users (50% of them, those directly on the Huawei switch through fiber) were previously running on another NAS (Linux based), also on PPPoE, and had no issues of this kind.

  2. On reboot of the Mikrotik, the issue is immediately resolved. Without a reboot, it can continue for hours.

  3. When the issue occurs, all VLANs are affected (the ones running directly plus those through VPLS).

So it might be something on the network that triggers the issue, but Mikrotik is definitely also to blame given the facts above.

Otherwise, I don’t have HA (high availability) running.

I have opened communication with the Mikrotik support team.

I hope to get a resolution fast, as it seems to be a software-related issue/bug.

Unfortunately there is still no solution for this issue, and it seems strange to me that no one else is running into it.

We suffer from 10-15 occurrences

Support says that the R&D team is working on the issue, but there is no visibility into the progress…

As per the last communication from Mikrotik support 2 days ago:

  1. The issue is a well-known old bug, but it is unsolved for now.

  2. It happens when someone accesses PPP->Active Connections through Winbox, possibly together with some other unknown condition, because not every access of PPP->Active Connections triggers the issue.

Workaround suggested by them:

  1. Do not use Winbox to access the PPP->Active Connections menu.
  2. Use WebFig or the console if we need to see the PPP->Active Connections menu.
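
For the console option, the same information is available from the terminal. A minimal sketch follows; the username `test` is just the example name used earlier in this thread, and whether the console view avoids the trigger is not confirmed here.

```
# List all active PPP sessions
/ppp active print
# Only show how many sessions are active
/ppp active print count-only
# Look for duplicate sessions of one user, e.g. "test" and "test-1"
/ppp active print where name~"test"
```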


I have had Winbox access completely disabled in the IP->Services menu for 2 days, but unfortunately the same issue occurred again today, so I’m waiting for their next ideas…

The issue keeps happening on both the CCR1072 and the x86.

Still no positive results, and no recent replies from Mikrotik support either.

What do Mikrotik users do in such situations?

Downgrade to 6.29.1 :) I did not notice any issues.
Uptime 200d.

6.29.1 is from June 2015; are you sure it supports the CCR1072?

Hi,

I’m ready to use only an x86-based solution if I have a way to stabilise it.

Why do you think an older version, and 6.29.1 in particular, will resolve my issue?

We have had the same experience since the first versions of ROS v6 on x86 and different CCRs. Accessing the Active Connections tab of the PPP menu in Winbox sometimes crashes PPP connections (PPTP/L2TP/PPPoE, etc.) that use RADIUS auth. I made numerous supouts and emailed support, but never got a reply stating that it is a long-known bug. Support suggested tuning our RADIUS, making some more supouts, and so on.
As a workaround I added the missing columns (IP and uptime) to PPP->Interfaces in Winbox, and the crashing is gone.
To fix connections after you get red simple queues and red dynamic IPs, you can use the two scripts below. This way you don’t have to reboot the box after a PPP crash. Also take into account that if you don’t use the “Active Connections” tab there are almost no issues with red simple queues (maybe once in 6 months).

# red_simple_remove: every minute, remove enabled simple queues that are flagged invalid (shown red)
# remove_dynamic_invalid_ips: every minute, remove dynamic IP addresses that are flagged invalid
/system scheduler
add interval=1m name=red_simple_remove on-event="/queue simple remove [find invalid=yes disabled=no]" policy=ftp,reboot,read,write,policy,test,password,sniff,sensitive start-time=startup
add interval=1m name=remove_dynamic_invalid_ips on-event="/ip address remove [find dynamic=yes invalid=yes]" policy=ftp,reboot,read,write,policy,test,password,sniff,sensitive start-time=startup
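
To see whether the cleanup is actually needed or working, the same `find` filters can be used read-only from the terminal. This is just a sketch built on the filters used in the scheduler entries; it changes nothing on the router.

```
# Number of simple queues currently flagged invalid
:put [:len [/queue simple find invalid=yes]]
# Number of dynamic addresses currently flagged invalid
:put [:len [/ip address find dynamic=yes invalid=yes]]
# Confirm that both scheduler entries exist and check their run counts
/system scheduler print
```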

Hi Dude,

It’s good to see someone else is also aware of this issue.

It’s strange to me that the Mikrotik R&D team has not found a way to either fix it or disable the view that triggers it.

In almost all cases, when you use this view on a NAS that has been running for 2-3 days, the issue is triggered, at least in our deployment :)

I should also mention that their support convinced me that the issue happens only through Winbox, but that’s not true: when using PPP->Active Connections through the WebFig interface, the issue is also triggered. It has already happened 2-3 times.

Otherwise, thanks for the script; it will most probably help avoid rebooting the device. I have implemented it and will monitor over the next days.