PPPoE and CPU load, since 2.9.3x, up to 2.9.44

vgs · August 10, 2007, 2:46pm

Hey folks.

We’re still experiencing weird behavior with 600+ PPPoE connections on our MT router. (L6 license)
If more than about 100-200 users reconnect at the same time, CPU goes to 100%, PPPoE locks up, and we have to reboot. The PPPoE server list page shows up blank, existing PPPoE sessions remain connected but do not pass any traffic, yet regular static routing keeps working fine. Sometimes it takes 3-4 reboots to get it to stabilize. Machine takes about 3 minutes to process a reboot command.

Normal CPU usage is around 30-50% when things are running smoothly. Aggregate traffic is around 15Mbps.

The machine is a P4 3.0GHz with 1GB RAM. Problem appeared first in the 2.9.3x series, don’t remember which one. We are now on 2.9.44.

We use queues (bandwidth limit via PPPoE profiles). No scripts, or any other unusual stuff. There are no firewall rules on this server (actually, we have 2 MT boxes that are doing the same thing, both with about 700 PPPoE sessions.) We do run multiple PPPoE server instances because of multiple ethernet interfaces. There are no wireless interfaces. RAM has been tested with memtest86+.

Looking at the changelog for 2.9.45, it does not seem that this issue was addressed.

Any clues?

Thanks!

-vlad

P.S.: has anyone tested 3.0 RC1 with hundreds of PPPoE sessions yet?

Giepie · August 13, 2007, 12:23am

Do the clients connect via wireless or Ethernet?

If wireless, I would suggest adding more antennas. This will reduce the load per wireless card too.

Then I’d suggest writing a script to disable all PPPoE servers and start them up again in increments of, e.g., 2 minutes.
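
A rough sketch of that staggered restart, written in Python rather than as a RouterOS script. It only works out the order and timing of the steps; the service names and the 120-second gap are placeholders, and the part that actually disables and enables the server instances on the router is deliberately left out.

```python
from typing import List, Tuple


def staged_start_plan(
    servers: List[str], interval_s: int = 120, batch_size: int = 1
) -> List[Tuple[int, List[str]]]:
    """Return (offset_seconds, server_names) steps for a staggered start.

    All servers are assumed to be disabled first; each step then enables
    the next batch, interval_s seconds after the previous one.
    """
    if interval_s < 0 or batch_size < 1:
        raise ValueError("interval_s must be >= 0 and batch_size >= 1")
    plan = []
    for step, start in enumerate(range(0, len(servers), batch_size)):
        plan.append((step * interval_s, servers[start:start + batch_size]))
    return plan


if __name__ == "__main__":
    # Hypothetical service names, one per PPPoE server instance.
    for offset, names in staged_start_plan(["pppoe-a", "pppoe-b", "pppoe-c"]):
        print(f"t+{offset:>4}s  enable {', '.join(names)}")
```

Staggering like this can only spread the load if the clients are spread over several server instances; if one instance carries nearly all sessions, the reconnect burst stays about the same.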

vgs · August 13, 2007, 2:28pm

We do not use wireless inside of mikrotik. We have a Motorola Canopy wireless system that all the clients are connected with.
The server that’s giving us the most problems only has 1 pppoe interface and 2 pppoe servers (one with regular MTU, and one with lowered MTU for a small handful of customers).

so… one pppoe instance has maybe 30-40 clients and the other - close to 700.

troy · August 18, 2007, 4:41pm

We’ve experienced similar problems, currently at 2.9.42 on a P4/2GHz/512MB with normal CPU load ranging from 5-10%.

With about 100 client connections (80 or so with queues assigned), we’re having a problem with the clients being dropped due to excessive data loss. This was happening every 2-5 minutes; since a reboot, it’s happening every 15-90 minutes. 3-4 clients may survive a drop once, but eventually, every client is dropped.

We’ve examined every part of the network and have yet to find anything wrong. Our main tower has 3 fully loaded WAR4 boards, with a pair of WAR2 boards for the backhaul. From there, ethernet to the MT PPPoE server. No single part of the link exhibits a problem. If we bypass the PPPoE server (still going through the same router though), we have no identifiable packet loss.

mneumark · August 18, 2007, 7:27pm

I also had the exact same problem in the past. I ended up finding out it was the NIC I had on the PPPoE concentrator. I replaced it with an Intel PCI-e dual-port NIC and that thing works wonderfully. I haven’t had any problems with it, or with speed, since.

Matt

troy · August 20, 2007, 4:18pm

What NIC(s) did you have in there before?

This box has like 12 VIA (RhineIII) NICs in it. (not sure what brand of multi-port cards we’re talking about here).

Unfortunately, I’m not in control of the router, nor did I build it (I’ve been partial to Intel NICs for years, and would never dream of using anything else).

Are there any known issues with VIA NICs in RouterOS boxes or with Linux in general? (I’m a BSD man, and have very little experience with Linux).

mneumark · August 20, 2007, 4:35pm

Troy,

I was using RB44s before the Intel PCI-e dual-port NICs. It seems like they aren’t able to handle high pps and throughput. They were causing speed issues, disconnections, and so on. Since I put in the Intel, PPPoE has never been better.

BTW, the NIC chipset on the RB44 is VIA Rhine.

Matt

vgs · August 21, 2007, 12:25am

Our problem is a bit different:

The NICs do not stop working: regular non-PPPoE encapsulated traffic continues working at (almost) full speed, through the same interface. However… the PPPoE sessions just cease to pass traffic (although they do not disconnect), and no new connections are accepted. CPU goes to 100%. If I go to the PPPoE server list, the window just comes up blank. Everything other than PPPoE remains in operation.

This only happens when a large number of sessions reconnect at the same time, although not consistently. Usually, after a couple of reboots (dumb luck, I suppose) the router stabilizes and happily routes 700 or so PPPoE sessions at the same time, with 40-50% CPU load. This never happened with the 2.8 series, and only started with 2.9.3something (just after the ghost PPPoE session bug was fixed). So, we traded one bug for another.

vgs · August 21, 2007, 12:26am

oh yeah, almost forgot: one of the servers that has this problem is strictly Intel, CPU, motherboard and the NICs.

mneumark · August 21, 2007, 1:04am

Do you have any firewall rules, queues, or anything else running, like BGP? I have had very good luck with my 2.9.44 PPPoE concentrator. I would highly recommend upgrading to 2.9.46, as they have fixed a lot of bugs, even though they haven’t put them all in the changelogs. I believe in 2.9.45 they fixed a routing-test package issue that was causing a memory leak.

Matt

sten · August 25, 2007, 11:03pm

I don’t know if this helps much, but the authentication issue has been a problem since forever on 2.9.x (not on 2.8.x; 2.8.x was slow but solid).
It just turned much worse after 2.9.25, when they changed the way they synchronize the login procedures.

http://forum.mikrotik.com/t/heavy-loaded-pppoe-server-troubles/13065/1

The problem, as I see it, is race conditions during login setup that happen under high CPU load (and, btw, authentication is apparently quite the CPU hog in RouterOS). I know how to minimize it but don’t know how to completely remove it.

support@ proclaimed that the /ppp secret list was not designed to handle this, but I’ve seen the effects when using RADIUS as well.
Anything that slows the login procedure will only make it worse.
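
As a back-of-the-envelope illustration of why a slower login makes a reconnect burst worse, here is a small Python sketch. It assumes a strictly serial login queue and a fixed client timeout; both are simplifying assumptions for the sake of the arithmetic, not a description of how RouterOS actually handles logins, and the numbers are made up.

```python
def logins_late(clients: int, login_ms: int, timeout_ms: int) -> int:
    """Count clients whose login would finish after their own timeout.

    Assumes a strictly serial server: the i-th client (1-based) is done
    at i * login_ms. All inputs are illustrative, not measured values.
    """
    if clients < 0 or login_ms <= 0 or timeout_ms < 0:
        raise ValueError("clients/timeout_ms must be >= 0, login_ms > 0")
    on_time = min(clients, timeout_ms // login_ms)
    return clients - on_time


if __name__ == "__main__":
    # Hypothetical: 700 clients reconnect at once, 10-second client timeout.
    for login_ms in (10, 25, 50):
        late = logins_late(700, login_ms, 10_000)
        print(f"{login_ms} ms per login: {late} clients time out and retry")
```

Under these assumptions every timed-out client comes back and joins the next burst, so even a modest increase in per-login cost can keep the queue from ever draining.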

I have no trouble having 700-800 users log in at the same time (on 2.9.24), but every 3-5 months, when the server has had time to leak enough memory and thus slow things down, a single user can destabilize the whole server. I’m sure if we replaced the CPU with something faster than the 2GHz Xeon (P4) with 512 KB cache, we wouldn’t see it that often, but we have adopted a different strategy altogether.

uldis · August 27, 2007, 6:12am

Please create the support output file when you have this problem so we can look into it and try to fix it. Send that file to support@mikrotik.com

vgs · September 4, 2007, 2:44am

well, I have sort of an update – it may be too early to tell, but here’s what I’ve done just on a hunch -

I created multiple (in my case, 4) PPPoE server instances with exactly the same parameters, other than the service name.
That allowed for a reboot without a lockup and CPU utilization is down from 70% to about 35%.
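
For anyone wanting to try the same thing, here is a minimal Python sketch that prints one "add" command per instance, identical except for the service name. The command path and parameter names follow the RouterOS console form as I understand it and should be checked against your version; the interface name, profile and service names are placeholders, not the values used here.

```python
from typing import Dict, List


def pppoe_server_commands(
    base_name: str, count: int, params: Dict[str, str]
) -> List[str]:
    """Build one 'add' command per instance; only service-name differs."""
    if count < 1:
        raise ValueError("count must be >= 1")
    # Shared settings are rendered once and reused for every instance.
    shared = " ".join(f"{key}={value}" for key, value in params.items())
    prefix = "/interface pppoe-server server add"  # assumed console path
    return [
        f"{prefix} service-name={base_name}{n} {shared}".rstrip()
        for n in range(1, count + 1)
    ]


if __name__ == "__main__":
    # Placeholder values; substitute your own interface and profile.
    shared_params = {"interface": "ether1", "default-profile": "default"}
    for line in pppoe_server_commands("pppoe", 4, shared_params):
        print(line)
```

This only reproduces the configuration; it says nothing about why splitting the sessions over several instances lowered the CPU load.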

weird…

I did not manage to get a support file out of it during the PPPoE lockup, but I’ll watch this for a couple of days, try the trick on our other PPPoE server, and see how it goes.

ropebih · October 17, 2007, 3:00pm

I have the same problem.

vgs · October 19, 2007, 3:15pm

Update:

As per Mikrotik support, the problem seems to be in the number of entries in the Secrets database. We trimmed ours down quite a bit, and things have smoothed out a lot.

We are using FreeRadius for authentication, and it seems to work great for PPPoE sessions.

We are not experiencing the 100% CPU and PPPoE lockup at this point.

vgs · November 20, 2007, 11:59pm

Update, again…

Still locking up, last pppoe debug message was something like “no pppoe servers available”. CPU shooting to 100%, same thing as before.

P.S. there are about 40 secret entries still in the database.

sten · November 21, 2007, 6:28am

There is a race condition in the setup (adding queues etc) of the session, not the authentication.
This problem will appear as soon as there is a lot of hardware triggered load and people connecting and reconnecting often.
Even if you only use RADIUS.

vgs · November 21, 2007, 4:34pm

that makes sense. even if I remove all entries from the secrets, when 700+ PPPoE devices slam the server at the same time, it locks up. Then, it’s a vicious cycle of curse-reboot-curse-reboot… until it finally stabilizes.

I really hope that MT team will address this soon.

hci · November 21, 2007, 6:57pm

I wonder if running Mikrotik v3.x with dual-core support would help? We rarely ever see the lockup, but we often see users that are supposed to be capped at, say, 512k running at over 3M.

Matt

vgs · November 21, 2007, 9:19pm

we haven’t tried 3.0 in the field yet, although we have an L6 with 3.0 release candidate ready to go, just waiting for the release to mature a bit… As far as processing power is concerned - during peak hours, CPU utilization on the current setup is 50-70% (P4 3GHz).

Does the PPPoE stack in 3.x differ much from 2.9?