~~ Problem ~~
We just deployed a dual-core, 1 GB RAM rack-mount router running ROS v2.4.48 with a Level 6 license. Because I don’t know of a way to load-test PPPoE (if anyone has any ideas on this one, please let me know), I could only test with my laptop to make sure the configs were working for PPPoE authentication. Here is the failure point. When the router was installed into a production network with 1000+ users (fewer than 1500), the PPPoE server showed clients authenticating, RADIUS showed authentication sessions, and we had 9 Mb of traffic on a 30 Mb circuit. The next morning our help desk was SLAMMED with calls from users saying they couldn’t get on the internet.
I checked the outside route servers to make sure our BGP routing was operational, then checked that OSPF routing on the network was operational. All routers in the network were seeing the OSPF routes, and we were able to ping anything we tested, even from a Canada route server via the public internet. When I spoke with our help desk guys (I’m an engineer), they said that some customers were calling in saying they could get web sites, just extremely SLOWLY!!! When I checked the IPs that were issued to the customers complaining of slow speeds, I was able to see a little bit of traffic (56k speeds) on their PPPoE interfaces.
Network Setup (Short version):
LAN (Eth 2 – PPPoE interface) > POP site (MikroTik Router [Eth 1 – WAN Interface]) > 30 Mb fiber to NOC > Cisco Core Routers > Internet BGP feeds
PPPoE auth is done via RADIUS, which sits at the NOC on Linux servers.
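As a rough sanity check on the figures above, here is a back-of-the-envelope sketch in Python. It assumes the 9 and 30 are both Mbit/s and that somewhere between 1000 and 1500 sessions were active at the time; the function name is just for illustration.

```python
def per_user_kbps(aggregate_mbps: float, users: int) -> float:
    """Average throughput per PPPoE session in kbit/s."""
    return aggregate_mbps * 1000 / users

# Assumed units: Mbit/s for both the observed traffic and the circuit size.
AGGREGATE_MBPS = 9
CIRCUIT_MBPS = 30

utilization = AGGREGATE_MBPS / CIRCUIT_MBPS
print(f"circuit utilization: {utilization:.0%}")

# Lower and upper bounds on the user count given in the post.
for users in (1000, 1500):
    avg = per_user_kbps(AGGREGATE_MBPS, users)
    print(f"{users} sessions: {avg:.1f} kbit/s average each")
```

Under those assumptions the circuit was at about 30% and the average per session was only 6 to 9 kbit/s, which suggests the fiber itself was not saturated.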
/------------------------------------/
From what I can tell, the PPPoE server has trouble passing traffic once it reaches a certain user load. The part that I don’t understand is that OSPF is seeing everything on the network, BGP is seeing the IP routes, and the internet (tested from Canada) can see and ping the end-user devices… and yet the users can’t get out. There are NO firewall / filter rules set up on the ROS system. Everything is on public IPs, except our Mgmt WAN layer, which is on an isolated VLAN.
/------------------------------------/
Before I pulled the router back out of the network, I logged into it and shut down the PPPoE server, waited for my RADIUS server to clear all the connections, and then re-enabled the PPPoE server. As soon as I re-enabled it, my CPU went to 100% (normal for initial PPPoE requests) and the router seemed to “hang”. I went back to the PPPoE server to disable it, and the router no longer showed a PPPoE server at all. I rebooted the hardware and it did the same thing (3 times). I then went to the site and plugged directly into the switch to see if I could reproduce the problem. YUP! Same problem for me when directly connected to the switch, and again when directly connected to the router LAN port.
/------------------------------------/
Any help would be appreciated. If I can get this solution working, it will save us 20k in capital expense per site; otherwise I am looking at using Cisco 7200 hardware at the POP sites for the PPPoE servers. Most sites have 1000+ users.
/------------------------------------/
If you would like to see a config of the router please let me know. Thanks for the help.




