PPPoE + queuing bug?

Over the years of using Mikrotik as our PPPoE server I’ve noticed a strange occurrence: every once in a while I’ll see a user pulling 4-10Mbps, well over their defined SLA, when they’re subscribed for, say, 512kbps. It doesn’t really matter what speed they’re subscribed for; it just happens randomly. It doesn’t happen that often, and normally I forget to put up a post to see if anyone else has experienced this.

I have the SLAs defined so that they have a MIR of whatever they’re subscribed for and a CIR of a fraction of the MIR (for example, 512kbps has a CIR of 256kbps). I also have burst limits defined so they can burst up to the next SLA (e.g. if they’re subscribed for 512kbps they can burst to 768kbps) for 20 seconds or ‘x’ amount of bits. The SLAs are all defined by the RADIUS server and work fine; it’s just that every once in a while one connection seems to think it’s unlimited.
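To make the SLA layout above concrete, here is a minimal Python sketch that turns one tier into a rate-limit string of the kind a RADIUS server can hand to the router. The field order (rate, burst rate, burst threshold, burst time, priority, minimum rate) follows my understanding of the Mikrotik rate-limit format; older releases such as 2.9 may not accept the trailing priority and minimum-rate fields. The `Sla` class, the symmetric rx/tx rates, and the default burst threshold of three quarters of the MIR are all assumptions made for illustration, not values from this thread.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Sla:
    """One subscription tier (hypothetical helper, not part of RouterOS)."""
    mir_kbps: int                      # maximum rate the user is subscribed for
    cir_kbps: int                      # committed rate, a fraction of the MIR
    burst_kbps: int                    # the next tier up, reachable while bursting
    burst_time_s: int = 20             # burst window from the post above
    burst_threshold_kbps: Optional[int] = None  # assumption: defaults to 3/4 of MIR


def rate_limit(sla: Sla, priority: int = 8) -> str:
    """Build a symmetric rx/tx rate-limit string for the given tier."""
    threshold = sla.burst_threshold_kbps
    if threshold is None:
        threshold = sla.mir_kbps * 3 // 4  # assumed default, tune to taste
    return (
        f"{sla.mir_kbps}k/{sla.mir_kbps}k "
        f"{sla.burst_kbps}k/{sla.burst_kbps}k "
        f"{threshold}k/{threshold}k "
        f"{sla.burst_time_s}/{sla.burst_time_s} "
        f"{priority} "
        f"{sla.cir_kbps}k/{sla.cir_kbps}k"
    )


if __name__ == "__main__":
    # The 512kbps tier described above: CIR 256kbps, burst to 768kbps for 20s.
    print(rate_limit(Sla(mir_kbps=512, cir_kbps=256, burst_kbps=768)))
    # 512k/512k 768k/768k 384k/384k 20/20 8 256k/256k
```

If a connection comes up without a string like this applied, nothing limits it, which matches the "unlimited" symptom.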

Like I said before, it doesn’t happen all that often; I catch it maybe a couple of times a month, if that.

To fix it I just kill their PPPoE session; their router reconnects shortly thereafter and the speeds are back to normal.

Not a big issue, but it would be nice if it didn’t happen. :)

Anyone else experience this issue?

All my PPPoE servers are running the latest stable release 2.9.46. I’ve started noticing this since ~2.9.40…

I have seen this happen 2 or 3 times as well, but I never make a point of going to look for it. The times I have seen it (by random coincidence, it happened while I was testing a specific account as changes were being made to it), it had failed to create the dynamic queue for the user. A quick disconnect and reconnect of the user fixed it for me too.

Yeah, I believe I saw the same problem: no dynamic queue was created for that user, or it was somehow removed. So that raises the question: why is the dynamic queue not being created, or why is it being lost? Possible bug? Or could it just be that it didn’t receive the parameters from the RADIUS server quickly enough? I have redundant RADIUS servers and the PPPoE routers are only 1-10ms away.
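Since the symptom in both cases is an active session with no dynamic queue, one way to catch it without waiting to stumble on it is to compare the list of active PPPoE users against the list of queue names. Here is a minimal Python sketch of that comparison. It assumes you have already exported both lists from the router by some means (not shown here), and the `<pppoe-{user}>` name pattern is an assumption about how the dynamic queues are named on your box; adjust `name_format` to whatever your router actually shows.

```python
from typing import Iterable, List


def sessions_missing_queue(
    active_users: Iterable[str],
    queue_names: Iterable[str],
    name_format: str = "<pppoe-{user}>",  # assumed naming pattern, check your router
) -> List[str]:
    """Return the active users that have no matching dynamic queue."""
    existing = set(queue_names)
    return sorted(
        user for user in active_users
        if name_format.format(user=user) not in existing
    )


if __name__ == "__main__":
    active = ["alice", "bob", "carol"]
    queues = ["<pppoe-alice>", "<pppoe-carol>"]
    # bob is online but has no queue, so he is effectively unlimited
    print(sessions_missing_queue(active, queues))  # ['bob']
```

Run periodically, something like this would at least show how often it really happens and whether it correlates with reconnects.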

I assumed that it was due to something on the RADIUS side, since I noticed it while kicking an account that we had changed just moments before. The RADIUS servers would have had to pick up that change so the new bandwidth package would take effect. But I could not, and still cannot, come up with any way that it would authenticate the user and send the authenticated status and the framed pool (we use that to define what IP pool to assign IPs from), but not the rate-limit values…

Ditto on the redundant RADIUS servers, and they are also <10ms from the PPPoE servers…

I wonder if that’s it. I use a PHP script to force the RADIUS server to re-read its configs (SIGHUP). Maybe if a user tries to connect while the configs are being reloaded after I create or modify an account, the user gets authorized with no queueing information. Anyway, it’s something to test, I guess.

Actually, I take that back. I used to have it configured like that, but now I’m using a MySQL backend, and the script just inserts the required info into the required tables. I don’t believe the RADIUS server needs a SIGHUP when using SQL as a backend. I’d have to double-check my script to verify exactly what it’s doing.


UPDATE: Yeah, the script just updates the SQL tables. There’s no SIGHUP required, so that kind of rules out that theory.

We use IAS for RADIUS (built into Windows Server 2003) and Active Directory as our LDAP database, which stores our account information, including what bandwidth package to deliver to the customer.

Since we’ve both seen it using different databases and RADIUS servers, I’m sure it’s got to be some sort of bug in the PPPoE server creating the dynamic queue. I suspect that it may be related to quick disconnects and reconnects. UserA disconnects (administratively through winbox, session removed), so the PPPoE session is terminated, RADIUS accounting records are transmitted, and then the dynamic queue named for UserA is deleted.

BUT

What if UserA reconnects very quickly, before the RADIUS accounting update is completed? Could it be possible in certain situations that a new dynamic queue with the same name is created before the removal still pending from the disconnect is processed, and when that removal does run, it removes both dynamic queues?
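To spell out that suspected ordering problem, here is a toy Python model. It is purely hypothetical: it assumes queues are removed by name and that the removal from the old session can be processed after the new session's queue has been created. Nothing here reflects how RouterOS is actually implemented; it only shows why that ordering would leave the user with no queue at all.

```python
from typing import List, Tuple

Event = Tuple[str, str]   # (action, queue name)
Queue = Tuple[int, str]   # (queue id, queue name)


def replay(events: List[Event]) -> List[Queue]:
    """Apply create / remove-by-name events in order and return surviving queues."""
    queues: List[Queue] = []
    next_id = 1
    for action, name in events:
        if action == "create":
            queues.append((next_id, name))
            next_id += 1
        elif action == "remove_by_name":
            # Removal matches on the name only, so it cannot tell old from new.
            queues = [q for q in queues if q[1] != name]
        else:
            raise ValueError(f"unknown action: {action}")
    return queues


NAME = "queue-UserA"  # placeholder name for illustration

# Normal case: the old queue is removed before the user reconnects.
normal = [("create", NAME), ("remove_by_name", NAME), ("create", NAME)]

# Suspected race: the reconnect creates its queue before the old removal runs.
raced = [("create", NAME), ("create", NAME), ("remove_by_name", NAME)]

if __name__ == "__main__":
    print(replay(normal))  # [(2, 'queue-UserA')]  the new session keeps its queue
    print(replay(raced))   # []                    both queues gone, user unlimited
```

If the real cause is something like this, it would also explain why busier servers with more reconnects hit it more often.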

We are seeing this ALL THE TIME. It’s not a RADIUS issue, because we do not use an external RADIUS server; we just create PPPoE users in winbox. The more heavily used PPPoE server seems to do it more frequently than the less loaded server.

Does Mikrotik have any ideas how to fix this? We have generated the support-out file and submitted it a few times. Does Mikrotik 3.0RC do this as well? We only run stable 2.9 releases.

Matt

Has anyone heard from Mikrotik? Do they have plans to fix this? I sent them support-out files a number of times and got the impression I was the only one seeing this issue. Clearly it’s affecting more users than just me.

I heard someone mention on the forum that a faster CPU helps, so I’m thinking about upgrading the router. Right now it’s something like a 2GHz P4, though, and CPU load is not that high.

Matt

Nope, haven’t heard anything..

I’m using P4 2GHz processors with 1GB RAM.. Generally under 30-40% CPU usage.

I’m using 2.8GHz P4s with 256MB at my bigger tower sites, and 1GHz P3s with 256MB at the smaller ones. The problem has only been noticed 2 or 3 times, and on the 2.8GHz machines, if I recall…

What RADIUS server are you using? I had the same problems because of a slow RADIUS server.

FreeRADIUS. One of them was on a virtual machine, and was running pretty slowly compared to the one on a non-VMware machine.

Now they’re both on standalone boxes. The response times are much better, under 100ms RTT.

I had thought the same thing: the RADIUS server just wasn’t getting the parameters to the NAS quickly enough…

I have two standalone servers, both running Microsoft IAS; response times are usually less than 50ms.

I thought it may have been slow RADIUS as well, but the user could not have authenticated to begin with if RADIUS had timed out, so I ruled that out. (I do not claim to be an expert on how RADIUS sends its response, though. It might send the “user authenticated” and the “here’s the rate-limit info” messages separately, and thus allow the user on without receiving the info to limit them. I’m not sure.)
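On that last point: in RADIUS as specified (RFC 2865), reply attributes such as a rate limit are carried inside the Access-Accept packet itself, not in a separate message, so a slow server should delay the whole accept rather than deliver half of it. What can happen is an Access-Accept that simply does not contain the attribute. Below is a minimal Python sketch that checks a raw reply for a Mikrotik rate-limit attribute. The vendor ID 14988 and vendor attribute type 8 are what I recall from the FreeRADIUS Mikrotik dictionary, so treat them as assumptions and verify against your own dictionary; `build_accept` is a hypothetical helper that only exists to produce test packets (it uses a zeroed authenticator and is not a valid signed reply).

```python
import struct
from typing import Optional

ACCESS_ACCEPT = 2            # RADIUS packet code for Access-Accept
VENDOR_SPECIFIC = 26         # RADIUS attribute type for vendor-specific data
MIKROTIK_VENDOR_ID = 14988   # assumption: Mikrotik's vendor ID
MIKROTIK_RATE_LIMIT = 8      # assumption: Mikrotik-Rate-Limit vendor type
HEADER_LEN = 20              # code + id + length + 16-byte authenticator


def rate_limit_from_reply(packet: bytes) -> Optional[str]:
    """Return the rate-limit string in an Access-Accept, or None if absent."""
    code, _ident, length = struct.unpack("!BBH", packet[:4])
    if code != ACCESS_ACCEPT:
        return None
    pos = HEADER_LEN
    end = min(length, len(packet))
    while pos + 2 <= end:
        attr_type, attr_len = packet[pos], packet[pos + 1]
        if attr_len < 2:
            break  # malformed attribute, stop parsing
        value = packet[pos + 2:pos + attr_len]
        if attr_type == VENDOR_SPECIFIC and len(value) >= 6:
            vendor_id = struct.unpack("!I", value[:4])[0]
            sub_type, sub_len = value[4], value[5]
            if vendor_id == MIKROTIK_VENDOR_ID and sub_type == MIKROTIK_RATE_LIMIT:
                # sub_len counts the two sub-attribute header bytes as well
                return value[6:4 + sub_len].decode("ascii")
        pos += attr_len
    return None


def build_accept(rate_limit: Optional[str] = None) -> bytes:
    """Hypothetical test helper: an unsigned Access-Accept, optionally with a rate limit."""
    attrs = b""
    if rate_limit is not None:
        data = rate_limit.encode("ascii")
        sub = bytes([MIKROTIK_RATE_LIMIT, len(data) + 2]) + data
        vsa = struct.pack("!I", MIKROTIK_VENDOR_ID) + sub
        attrs = bytes([VENDOR_SPECIFIC, len(vsa) + 2]) + vsa
    header = struct.pack("!BBH", ACCESS_ACCEPT, 1, HEADER_LEN + len(attrs))
    return header + bytes(16) + attrs


if __name__ == "__main__":
    print(rate_limit_from_reply(build_accept("512k/512k")))  # 512k/512k
    print(rate_limit_from_reply(build_accept()))             # None
```

An accept without the attribute is still a perfectly valid accept, so logging replies where this returns None would show whether the RADIUS side ever sends one.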

bump

We have had the same problem for a long time.

The problem is in the 2.9.x versions; I have noticed it in every version we used, from 2.9.1 to 2.9.50. After every 20-30 days of uptime, at some point all users start to get unlimited bandwidth when they connect from that moment on. This is not a RADIUS problem, as RADIUS works the whole time.

Rebooting the MT solves the problem.

I didn’t notice it in v3, probably because we didn’t run it for more than 3 days: it had a packet loss problem with more than 150 concurrent PPPoE tunnels, so we downgraded to v2.9.x, where the problem with unlimited queues remains.

Also, as somebody mentioned here, the problem is more frequent and shows up sooner (15-20 days) if the PPPoE concentrator is heavily used.

I have not heard of that packet loss problem. We have been running v3.0 for the last 5 days with no issues and no queue problems so far, hitting well over 400 connections at peak. 2.9 frequently gave us the queue problem.

Matt

If you run over 400 PPPoE tunnels, can you check your RouterOS box for packet loss?

The best tool for checking is SmokePing.

We have the same problem, but on busy hotspot servers: sometimes the queue mechanism seems to just break down and let unlimited bandwidth through to users. Quite annoying.

jm

What Mikrotik release? We saw this all the time on 2.9.x, but so far we have not seen it on 3.0 stable.

Matt