I have a problem with an RB1000 managing some 300+ PPPoE connections.
At times it suddenly closes tens of connections. In the logs (and the FreeRADIUS logs) the disconnection cause is logged as “User request” (as opposed to the usual “Peer is not responding” when a modem or line is faulty). The connections come from several DSLAMs on different VLANs and physical interfaces, and some from a wireless distribution built with another MikroTik.
Each VLAN has its own separate PPPoE server, and we observed that when the drops occur, they all occur on the same PPPoE server, although every one of the servers closes sessions in this manner from time to time.
We are not completely sure, but it seems to happen when there is a burst of failed authentications (maybe that is a coincidence, as we couldn't reproduce the problem by forcing a client to fail authentication).
We tried different versions of ROS (3.22, 3.30 and 4.3) and 2 different RB1000s, and the problem is consistent across all of them.
Has anybody seen a problem like this? Any suggestions?
Javier, it is very odd that only 10 connections are closed.
Perhaps you can enable pppoe,debug logs and find the reason why all 10 clients were disconnected simultaneously.
Otherwise it is very hard to guess what the reason could be.
Just curious, do you have a RADIUS server for these 300 clients?
Well, maybe because English is not my native language I was not clear. When I say “tens”, I mean 23 one time, 45 the next, 30 the next and so on. (In Spanish the word is “decenas”, and I always thought the English word “tens” meant the same.)
With some hundreds of users, and given that these disconnections happen sparsely and randomly, it is somewhat difficult to enable full PPPoE logs without affecting performance, and the log file on my server (I log to a server with syslog) then gets huge. Anyway, in both the RB's logs and the server's logs, the disconnection cause is “User Request”. The same goes for my RADIUS logs.
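One way to look for the bursts without full pppoe,debug logging is to post-process the disconnect records that are already being collected. Below is a minimal sketch in Python, assuming the syslog or RADIUS accounting records have already been parsed into (timestamp, service, cause) tuples. The tuple layout, the function name and the threshold of 10 are illustrative assumptions, not something RouterOS or FreeRADIUS provides.

```python
from collections import Counter
from datetime import datetime


def find_disconnect_bursts(events, threshold=10, cause="User Request"):
    """Group disconnects by (minute, PPPoE service) and return the busy ones.

    events: iterable of (timestamp: datetime, service: str, cause: str).
            This tuple layout is an assumption; adapt the parsing to
            whatever your syslog or accounting records actually contain.
    Returns a sorted list of (minute, service, count) with count >= threshold.
    """
    buckets = Counter()
    for timestamp, service, event_cause in events:
        # Only count the cause under investigation, ignoring letter case.
        if event_cause.lower() != cause.lower():
            continue
        # Truncate the timestamp to the start of its minute.
        minute = timestamp.replace(second=0, microsecond=0)
        buckets[(minute, service)] += 1
    return sorted(
        (minute, service, count)
        for (minute, service), count in buckets.items()
        if count >= threshold
    )
```

If the drops really do cluster on one PPPoE server at a time, each burst should show up as a single (minute, service) entry with a count in the tens, which can then be lined up against failed-authentication records from the same minute.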
And yes, I use 3 RADIUS servers that get user data from replicated LDAP bases.
Some data that I think could be relevant to this issue.
“Peer is not responding” on tens of PPPoE sessions, and sometimes on all sessions, disconnecting intermittently, is still an issue for me. I saw it happen live once: about 100 users were connected and suddenly everyone got disconnected, with “Terminating… disconnected” in the log. The PPPoE service went crazy and collapsed, but the RB was still alive. It then took about 30 seconds before everyone logged back in. I can have this issue 3 times a day or once a week; it depends on something that I don't know, and it's very frustrating. Clients complain; I have about 250 users on an RB1000. At first I thought it was a bridge problem, so I separated the PPPoE service through an ether interface, but the issue is still the same. I have looked around the forum but found nothing. Has someone found a fix, or do I have to switch to a solution other than MikroTik?
So I'm not the only one experiencing this problem, and it is a real problem when one has 300+ angry users.
Can someone @mikrotik look into this issue?
Same problem here… I have a dozen RB1000s working as PPPoE servers… The CPU goes to 100% for about 20 seconds and all users get “terminated”.
Please find a solution, because otherwise I'll have to stop working with you MikroTik guys…
The same problem occurs here too, with about 100 users on an RB1000 PPPoE server. All sessions disconnect simultaneously.
The RB1000 also has two uplinks to two different providers, and these use PPPoE for their connections as well. The uplink PPPoE sessions also disconnect from time to time.
For two weeks there was no problem, no disconnects, and today all PPPoE sessions (incoming and outgoing) disappeared within 3 hours. This is horrible…
The 100 users are on one physical interface and shared across two VLANs. Each of the two uplink PPPoE sessions goes over a separate physical interface.
We tried every firmware version from 3.22 up to 4.6, with the same problem… We also completely replaced the RB1000, without any success.
rborz, it would be great if you could contact MikroTik support (support@mikrotik.com) with a detailed problem description and a support output file generated while the problem with PPPoE users is present.
As SergeJS advised, I opened a ticket @mikrotik. After seeing my supout file, their suggestion was:
1. Replace all your dynamic change-mss rules with one global change-mss rule.
2. Check that you use the latest Winbox loader (clear the cache after upgrading).
3. Think about switching from dynamic simple queues to a dynamic address-list and queue tree with PCQ.
#1 was relatively easy to do. #2 I did as asked.
So far the problem persists. I even tried splitting the users between 2 RBs: some 100+ on a 600, while some 180+ stayed on the RB1000. It happened less frequently, but both had massive disconnections.
#3 I don't understand what I'm being asked to do. I have hundreds of users and rely on RADIUS to pass each customer's bandwidth parameters to the PPPoE concentrators. Can I pass an address list instead of “Mikrotik-Rate-Limit” attributes? How?
There is a “Mikrotik-Mark-Id” attribute. Is it meant for that? If so, how do I use it?
I am not using an RB1000, but x86-based RouterOS on a PowerRouter 2242. Same issue here. We terminate 600+ PPPoE customers, and the queues fail, disconnecting all users until the unit is rebooted. We have tried several versions of RouterOS, including 4.9, and the problem persists.
The previous suggestions from MikroTik staff either do not work or are not usable in our environment. (We need simple queues to assign the specific bandwidth profile each customer is paying for, which is passed from our RADIUS server.)
This can happen once a week or a few times a day, it is very random. It does seem to happen more often if we have multi cpu support enabled. We cannot contact Mikrotik support since we purchased our license through a 3rd party. This is becoming very frustrating for us, but even more so for our customers.
I am using RB1000, x86 and RB433AH units as PPPoE servers with centralized RADIUS accounting, and I don't encounter similar problems. I use simple queues, so your issues might not be related to simple queues.
I had a similar problem a while ago, but I traced it to a flapping wireless link. Fixed the link, fixed the issue. Did you try looking for packet loss on the path to the clients?
Edit: My PPPoE servers ask for interim updates; might this help?
interim-update: defines the time interval between communications with the router. If this time is exceeded, the RADIUS server will assume that the connection is down. It is suggested that this value be no less than 3 minutes.
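The timeout behaviour described in that definition can be illustrated from the RADIUS side. This is a minimal sketch in Python, assuming the accounting backend keeps the time of the last accounting packet per session. The dictionary layout, the function name and the `grace` multiplier are hypothetical and do not reflect how any particular RADIUS server implements this.

```python
from datetime import datetime, timedelta


def stale_sessions(last_update, now, interim_update=timedelta(minutes=5), grace=2):
    """Return session IDs whose last accounting update is too old.

    last_update:    dict mapping session ID -> datetime of the last
                    accounting packet seen (assumed data layout).
    interim_update: the interval configured on the router.
    grace:          how many intervals may be missed before the session
                    is presumed down (an illustrative assumption).
    """
    limit = interim_update * grace
    return sorted(sid for sid, seen in last_update.items() if now - seen > limit)
```

Running a check like this after a mass disconnect would show whether the RADIUS side still considers the dropped sessions open, that is, whether the router sent accounting stops for them or simply went silent.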
sergejs, I already contacted support multiple times… the last hint was to upgrade to 5.0 beta. But I'm afraid of doing that, as the RouterBOARD is on a production network serving about 120 PPPoE clients at the moment.
After thinking about it a lot, my latest thought yesterday was: is everybody who has this issue perhaps using an external RADIUS server? If so, I think most of these users will be running FreeRADIUS (as we do). The FreeRADIUS default configuration states this:
# max_requests: The maximum number of requests which the server keeps
# track of. This should be 256 multiplied by the number of clients.
# e.g. With 4 clients, this number should be 1024.
#
# If this number is too low, then when the server becomes busy,
# it will not respond to any new requests, until the 'cleanup_delay'
# time has passed, and it has removed the old requests.
#
# If this number is set too high, then the server will use a bit more
# memory for no real benefit.
#
# If you aren't sure what it should be set to, it's better to set it
# too high than too low. Setting it to 1000 per client is probably
# the highest it should be.
#
# Useful range of values: 256 to infinity
#
max_requests = 1024
Maybe, together with interim updates, this value is too low… maybe this has something to do with the issue. In my case, with about 120 PPPoE clients and approximately 4 SIP accounts per client, this may lead to 600 simultaneous requests (in the worst case). Given that, maybe this has nothing to do with the connection drops… just my two cents…
EDIT: Sometimes we have brute-force attacks with about 500 requests per second against our SIP gateways… and each register/login attempt also leads to a RADIUS request. Now the above becomes more plausible… So the main question is: does the PPPoE server on MikroTik drop connections if there are timeouts on interim updates…?
EDIT: OK, a few minutes ago all my PPPoE sessions were gone again… so this time I checked all the logs: no brute force or anything like that leading to a DoS on the RADIUS server. So this must be another issue…
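The arithmetic behind that estimate can be written out. Below is a minimal sketch in Python, assuming that “clients” in the FreeRADIUS comment means RADIUS clients (the NAS devices listed in clients.conf, such as each PPPoE concentrator or SIP gateway) rather than end users, and assuming the 600 figure comes from one request per PPPoE session plus one per SIP account. Both function names are made up for illustration.

```python
def suggested_max_requests(radius_clients, per_client=256):
    """max_requests following the rule in the quoted radiusd.conf comment.

    The comment recommends 256 per client and names 1000 per client
    as probably the highest sensible value.
    """
    if radius_clients < 1:
        raise ValueError("need at least one RADIUS client")
    if not 256 <= per_client <= 1000:
        raise ValueError("per_client outside the range the comment suggests")
    return radius_clients * per_client


def worst_case_requests(pppoe_clients, sip_accounts_per_client):
    """One request per PPPoE session plus one per SIP account (an assumption)."""
    return pppoe_clients * (1 + sip_accounts_per_client)


# Figures from the post: 4 RADIUS clients in the default, 120 PPPoE users.
limit = suggested_max_requests(4)      # 4 * 256 = 1024
demand = worst_case_requests(120, 4)   # 120 * (1 + 4) = 600
headroom = limit - demand              # 424 requests to spare
```

On these numbers the default of 1024 covers the estimated worst case on its own. A sustained burst of about 500 requests per second is a different matter, since tracked requests stay counted until they are cleaned up after `cleanup_delay`.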
I have the same problem with 2 RB1000 units. However, none of my PCQ options are in use as I have disabled all the Queues. I have 600+ angry customers and would really like a fix. Has anyone found anything that actually works, or have any information from Mikrotik about what the problem could be? The only thing that I have noticed thus far is that in the PPPoE Servers tab, some of the interfaces display “unknown” when the sessions are dropped.