The 200 connecting customers are split across 5 different VLANs, and when the issue starts, all users on all VLANs are affected… When I reboot the CCR1072, it’s immediately resolved.
This is the reason I suspect the CCR1072, and not something else.
If that’s the case, then contact Mikrotik Support.
First export the configuration, reset the router with no defaults, then reload the exported configuration. You may also be asked to do a netinstall, so if you can, do it and test whether the problem persists afterwards.
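A rough sketch of that export/reset/reload sequence from the RouterOS console. The file name `pre-reset` is only a placeholder, and you should copy the exported file off the router before resetting. Keep in mind that a plain export normally does not carry user passwords, so check the result before relying on it.

```
# Save the current configuration as a script file (creates pre-reset.rsc)
/export file=pre-reset

# Wipe the configuration without loading the factory defaults (the router reboots)
/system reset-configuration no-defaults=yes

# After the reboot, load the saved configuration back
/import file-name=pre-reset.rsc
```

After a reset with no defaults the router has no IP configuration, so you will need MAC-based or serial access to run the import.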
If it does, try to generate a supout file when no issues are happening, and another when issues are present, then submit both with a detailed explanation of the setup (switches connected to it, etc) to support.
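Something along these lines should produce the two supout files from the console. The file names are placeholders chosen for illustration.

```
# Run while everything is working normally
/system sup-output name=supout-ok.rif

# Run again while the problem is present, before any reboot
/system sup-output name=supout-issue.rif
```

Both files end up in the router's file list and can be downloaded from there and attached to the support ticket.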
I have even installed one more NAS based on the x86 platform, but the same problem occurred again after around 30 hours of running.
I’m now running the 2 NASes (one CCR1072 and one x86) in parallel and have created a supout on both while everything is fine; when the issue occurs I will create a second one.
And are you sure this isn’t a “network glitch” on your provider network or Huawei switch? Try to come up with a test to prove whether that’s the case…
Do you have any sort of HA setup?
Double check physical connections to the core network.
I don’t think it’s a network-related issue, because:
Some of the users (50% of them, connected directly to the Huawei switch through fiber) were previously running on another NAS (Linux based), also on PPPoE, and had no issues of this kind.
On reboot of the Mikrotik, the issue is immediately resolved. Without a reboot, it can continue for hours.
When the issue occurs, all VLANs are affected (the ones connected directly plus those through VPLS).
So it might be something on the network that triggers the issue…, but given the facts above, the Mikrotik is definitely also to blame.
Otherwise I don’t have HA (high availability) running.
As per the last communication from Mikrotik support 2 days ago:
The issue is a well-known old bug that is still unsolved.
It happens when someone accesses PPP->Active Connections through Winbox, possibly combined with some other unknown condition, because not every access to PPP->Active Connections triggers the issue.
Workaround suggested by them:
Do not use Winbox to access PPP->Active Connections menu
We can use Webfig + console in case we need to see PPP->Active Connections menu
I have had Winbox access completely disabled from the IP->Services menu for the last 2 days,
but unfortunately the same issue occurred again today, so I’m expecting further ideas from them…
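For reference, the workaround described above roughly corresponds to the following console commands. This is only a sketch of the suggested steps, not a fix for the underlying bug.

```
# Disable the Winbox service (same effect as disabling it under IP->Services)
/ip service disable winbox

# Confirm the service now shows as disabled (flag X)
/ip service print

# Inspect active PPP sessions from the console instead of Winbox
/ppp active print

# Or just count the active sessions
/ppp active print count-only
```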
We have had the same experience since the first versions of ROS v6, on x86 and on different CCRs. Accessing the Active tab of the PPP menu in Winbox sometimes crashes PPP connections (pptp/l2tp/pppoe/etc) that use RADIUS auth. I’ve made numerous supouts and emailed support, but never got a reply stating that it is a long-known bug. Support suggested tuning our RADIUS, making some more supouts and so on.
As a workaround I’ve added the missing columns (IP & uptime) to PPP->Interfaces in Winbox, and the crashing is gone.
To fix connections after you get red simple queues and red dynamic IPs, you can use two scripts. This way you don’t have to reboot the box after a PPP crash. Also take into account that if you don’t use the “Active Connections” tab there are almost no issues with “red simple queues” (maybe once in 6 months).
It’s good to see that someone else is also aware of this issue.
It’s strange to me that the Mikrotik R&D team has not found a way to either fix it or disable the view that triggers the issue.
In almost all cases, using this view on a NAS that has been running for 2-3 days triggers the issue, at least in our deployment.
I also have to mention that their support convinced me the issue happens only through Winbox, but that’s not true. When using PPP->Active Connections through the WebFig interface, the issue is also triggered; it has already happened 2-3 times.
Otherwise, thanks for the script; it will most probably help to avoid rebooting the device. I have implemented it and will monitor over the next few days.