Complete PPPoE Router Failure -- Looking for ideas...

~~ Problem ~~
We just implemented a dual-core / 1 GB RAM rack-mount router with ROS v2.9.48, Level 6 license. Because I don’t know of a way to load-test PPPoE (if anyone has any ideas on this one please let me know), I could only test with my laptop to make sure the configs were working for PPPoE authentication. The failure point is this: when the router was installed into a production network with 1000+ users (< 1500 users), the PPPoE server showed clients authenticating, RADIUS showed authentication sessions, and we had 9 Mbps of traffic on a 30 Mbps circuit. The next morning our help desk was SLAMMED with user calls saying they couldn’t get on the internet.

I checked the outside route servers to make sure our BGP routing was operational, then checked that OSPF routing inside the network was operational. All routers in the network were seeing the OSPF routes and we were able to ping anything we tested, even from a Canada route server via the public internet. After speaking with our help desk guys (I’m an engineer), I learned that some customers were calling in saying they could reach web sites, just extremely SLOWLY!!! When I checked the IPs that were issued to the customers complaining of slow speeds, I was able to see a little bit of traffic (56k speeds) on their PPPoE interfaces.

Network Setup (Short version):
LAN (Eth 2 – PPPoE interface) > POP site (Mikrotik Router [Eth 1 – WAN Interface]) > 30MB fiber to NOC > Cisco Core Routers > Internet BGP feeds
PPPoE auth is done via radius which sits at the NOC on Linux servers.

/------------------------------------/

From what I can tell, it appears that the PPPoE server is having trouble passing traffic once it gets to a certain user load. The part that I don’t understand is that OSPF is seeing everything on the network, BGP is seeing the IP routes, and the internet (tested from Canada) can see and ping the end user device… and yet the users can’t get out. There are NO firewall / filter rules set up on the ROS system. Everything is on public IPs, except our Mgmt WAN layer, which is on an isolated VLAN.

/------------------------------------/

Before I pulled the router back out of the network, I logged into it and shut down the PPPoE server, waited for my radius server to clear all the connections, and then re-enabled the PPPoE server. As soon as I re-enabled it, my CPU went to 100% (normal for initial PPPoE requests) and my router seemed to “hang”. I went back to the PPPoE server to disable it and it now does not show a PPPoE server. I rebooted the hardware and it did the same thing (3 times). I then went to the site and plugged directly into the switch to see if I could emulate the problems. YUP! same problem for me directly connected to the switch and then directly connected to the router LAN port.

/------------------------------------/

Any help would be appreciated. If I can get this solution working it will save us 20k in capital expense per site; otherwise I am looking at using Cisco 7200 hardware at the POP sites for the PPPoE servers. Most sites have 1000+ users.

/------------------------------------/

If you would like to see a config of the router please let me know. Thanks for the help.

Here is the config so that you can see it.

/--------------------------------------/

# Jan 16, 2008 12:31pm by RouterOS 2.9.48
# software id = WZER-R9T
#
/ interface ethernet 
set E1-WAN name="E1-WAN" mtu=1500 mac-address=00:60:E0:42:B0:81 arp=enabled disable-running-check=yes auto-negotiation=yes \
    full-duplex=yes cable-settings=default speed=100Mbps comment="" disabled=no 
set E2-LAN name="E2-LAN" mtu=1500 mac-address=00:60:E0:42:B0:82 arp=enabled disable-running-check=yes auto-negotiation=yes \
    full-duplex=yes cable-settings=default speed=100Mbps comment="" disabled=no 
set E3-MGMT name="E3-MGMT" mtu=1500 mac-address=00:60:E0:42:B0:83 arp=enabled disable-running-check=yes auto-negotiation=yes \
    full-duplex=yes cable-settings=default speed=100Mbps comment="" disabled=no 

/ interface pppoe-server server 
add service-name="pppoe" interface=E2-LAN max-mtu=1480 max-mru=1480 authentication=pap,chap,mschap1,mschap2 \
    keepalive-timeout=10 one-session-per-host=yes max-sessions=0 default-profile=pppoe disabled=no 

/ ip pool 
add name="pool1" ranges=70.xx.116.1-70.xx.116.254 next-pool=pool2 
add name="pool2" ranges=70.xx.117.1-70.xx.117.254 next-pool=pool3 
add name="pool3" ranges=70.xx.118.1-70.xx.118.254 next-pool=pool4 
add name="pool4" ranges=70.xx.119.1-70.xx.119.254 next-pool=pool5 
add name="pool5" ranges=70.xx.126.1-70.xx.126.254 next-pool=pool6 
add name="pool6" ranges=70.xx.127.1-70.xx.127.254 

/ ip dns 
set primary-dns=64.xx.xx.138 secondary-dns=64.xx.xx.139 allow-remote-requests=no cache-size=2048KiB cache-max-ttl=1w 

/ ip address 
add address=64.xx.xx.33/27 network=64.xx.xx.32 broadcast=64.xx.xx.63 interface=E2-LAN comment="Primary LAN (PPPoE) Address" disabled=no 
add address=10.1.56.1/21 network=10.1.56.0 broadcast=10.1.63.255 interface=E2-LAN comment="" disabled=no 
add address=10.1.32.1/21 network=10.1.32.0 broadcast=10.1.39.255 interface=E2-LAN comment="" disabled=no 
add address=10.0.112.1/21 network=10.0.112.0 broadcast=10.0.119.255 interface=E2-LAN comment="" disabled=no 
add address=10.1.24.1/21 network=10.1.24.0 broadcast=10.1.31.255 interface=E2-LAN comment="" disabled=no 
add address=10.0.16.1/21 network=10.0.16.0 broadcast=10.0.23.255 interface=E2-LAN comment="" disabled=no 
add address=10.0.192.1/21 network=10.0.192.0 broadcast=10.0.199.255 interface=E2-LAN comment="" disabled=no 
add address=172.16.10.10/29 network=172.16.10.8 broadcast=172.16.10.15 interface=E1-WAN comment="WAN Uplink \(E1-WAN\)" \
    disabled=no 
add address=192.168.254.200/24 network=192.168.254.0 broadcast=192.168.254.255 interface=E3-MGMT comment="" disabled=no 


/ ip neighbor discovery 
set E1-WAN discover=yes 
set E2-LAN discover=yes 
set E3-MGMT discover=yes 

/ ip route 
add dst-address=0.0.0.0/0 gateway=172.16.10.9 check-gateway=ping distance=1 scope=255 target-scope=10 comment="" disabled=no 

/ ip firewall service-port 
set ftp ports=21 disabled=no 
set tftp ports=69 disabled=no 
set irc ports=6667 disabled=no 
set h323 disabled=yes 
set quake3 disabled=no 
set gre disabled=yes 
set pptp disabled=yes 

/ ip firewall connection tracking 
set enabled=yes tcp-syn-sent-timeout=5s tcp-syn-received-timeout=5s tcp-established-timeout=1d tcp-fin-wait-timeout=10s \
    tcp-close-wait-timeout=10s tcp-last-ack-timeout=10s tcp-time-wait-timeout=10s tcp-close-timeout=10s udp-timeout=10s \
    udp-stream-timeout=3m icmp-timeout=10s generic-timeout=10m tcp-syncookie=no 

/ system logging 
add topics=info prefix="" action=memory disabled=no 
add topics=error prefix="" action=memory disabled=no 
add topics=warning prefix="" action=memory disabled=no 
add topics=critical prefix="" action=echo disabled=no 
add topics=ospf prefix="" action=memory disabled=yes 
add topics=pppoe prefix="" action=memory disabled=yes 
add topics=ppp prefix="" action=memory disabled=yes 

/ system logging action 
set memory name="memory" target=memory memory-lines=100 memory-stop-on-full=no 
set disk name="disk" target=disk disk-lines=100 disk-stop-on-full=no 
set echo name="echo" target=echo remember=yes 
set remote name="remote" target=remote remote=64.xx.xx.253:514 

/ system upgrade mirror 
set enabled=no primary-server=0.0.0.0 secondary-server=0.0.0.0 check-interval=1d user="" 

/ system clock manual 
set time-zone=+00:00 dst-delta=+00:00 dst-start="jan/01/1970 00:00:00" dst-end="jan/01/1970 00:00:00" 

/ system watchdog 
set reboot-on-failure=yes watch-address=none watchdog-timer=yes no-ping-delay=5m automatic-supout=yes auto-send-supout=no 

/ system console 
add port=serial0 term="" disabled=no 
set FIXME term="linux" disabled=no 
set FIXME term="linux" disabled=no 
set FIXME term="linux" disabled=no 
set FIXME term="linux" disabled=no 
set FIXME term="linux" disabled=no 
set FIXME term="linux" disabled=no 
set FIXME term="linux" disabled=no 
set FIXME term="linux" disabled=no 

/ system console screen 
set line-count=25 

/ system identity 
set name="SITE--RTA" 

/ system note 
set show-at-login=yes note="" 

/ system health 
set state-after-reboot=enabled 

/ system routerboard bios 
set 

/ system ntp server 
set enabled=no broadcast=no multicast=no manycast=yes 

/ system ntp client 
set enabled=no mode=unicast primary-ntp=0.0.0.0 secondary-ntp=0.0.0.0 

/ port 
set serial0 name="serial0" baud-rate=9600 data-bits=8 parity=none stop-bits=1 flow-control=hardware 
set serial1 name="serial1" baud-rate=9600 data-bits=8 parity=none stop-bits=1 flow-control=hardware 

/ ppp profile 
set default name="default" use-compression=default use-vj-compression=default use-encryption=default only-one=default \
    change-tcp-mss=yes comment="" 
add name="pppoe" local-address=64.xx.xx.33 remote-address=pool1 use-compression=yes use-vj-compression=yes \
    use-encryption=default only-one=yes change-tcp-mss=default dns-server=64.xx.xx.138,64.xx.xx.139 comment="" 
set default-encryption name="default-encryption" use-compression=default use-vj-compression=default use-encryption=yes \
    only-one=default change-tcp-mss=yes comment="" 

/ ppp aaa 
set use-radius=yes accounting=yes interim-update=0s 

/ queue type 
set default name="default" kind=pfifo pfifo-limit=50 
set ethernet-default name="ethernet-default" kind=pfifo pfifo-limit=50 
set wireless-default name="wireless-default" kind=sfq sfq-perturb=5 sfq-allot=1514 
set synchronous-default name="synchronous-default" kind=red red-limit=60 red-min-threshold=10 red-max-threshold=50 red-burst=20 \
    red-avg-packet=1000 
set hotspot-default name="hotspot-default" kind=sfq sfq-perturb=5 sfq-allot=1514 
add name="default-small" kind=pfifo pfifo-limit=10 

/ queue interface 
set E1-WAN queue=ethernet-default 
set E2-LAN queue=ethernet-default 
set E3-MGMT queue=ethernet-default 

/ user 
add name="admin" group=full address=0.0.0.0/0 comment="system default user" disabled=no 

/ user group 
add name="read" policy=local,telnet,ssh,reboot,read,test,winbox,password,web,sniff,!ftp,!write,!policy 
add name="write" policy=local,telnet,ssh,reboot,read,write,test,winbox,password,web,sniff,!ftp,!policy 
add name="full" policy=local,telnet,ssh,ftp,reboot,read,write,policy,test,winbox,password,web,sniff 

/ user aaa 
set use-radius=no accounting=yes interim-update=0s default-group=read 

/ radius 
add service=ppp called-id="" domain="" address=64.xx.xx.132 secret="^Radius$" authentication-port=1645 \
    accounting-port=1646 timeout=300ms accounting-backup=no realm="" comment="" disabled=no 

/ radius incoming 
set accept=no port=1700 

/ driver 

/ snmp 
set enabled=yes contact="noc@domain.com" location="Site Location" 

/ snmp community 
add name="SNMP-ReadString" address=0.0.0.0/0 read-access=yes 

/ tool bandwidth-server 
set enabled=yes authenticate=yes allocate-udp-ports-from=2000 max-sessions=10 

/ tool mac-server ping 
set enabled=yes 

/ tool e-mail 
set server=0.0.0.0 from="<>" 

/ tool sniffer 
set interface=all only-headers=no memory-limit=10 file-name="" file-limit=10 streaming-enabled=no streaming-server=0.0.0.0 \
    filter-stream=yes filter-protocol=ip-only filter-address1=0.0.0.0/0:0-65535 filter-address2=0.0.0.0/0:0-65535 

/ tool graphing 
set store-every=24hours 

/ tool graphing queue 
add simple-queue=all allow-address=0.0.0.0/0 store-on-disk=yes allow-target=yes disabled=no 

/ tool graphing resource 
add allow-address=0.0.0.0/0 store-on-disk=yes disabled=no 

/ tool graphing interface 
add interface=E1-WAN allow-address=0.0.0.0/0 store-on-disk=yes disabled=no 

/ routing ospf 
set router-id=172.16.10.10 distribute-default=never redistribute-connected=as-type-2 redistribute-static=as-type-1 \
    redistribute-rip=no redistribute-bgp=no metric-default=1 metric-connected=20 metric-static=20 metric-rip=20 metric-bgp=20 

/ routing ospf area 
set backbone area-id=0.0.0.0 type=default translator-role=translate-candidate authentication=none disabled=no 
add name="area1" area-id=0.0.0.1 type=stub translator-role=translate-always authentication=none summary=no default-cost=10 \
    disabled=no 

/ routing ospf network 
add network=172.16.10.8/29 area=backbone disabled=no 

/ routing bgp instance 
set default name="default" as=65530 router-id=0.0.0.0 redistribute-connected=no redistribute-static=no redistribute-rip=no \
    redistribute-ospf=no redistribute-other-bgp=no out-filter="" client-to-client-reflection=yes ignore-as-path-len=no comment="" \
    disabled=no 

/ routing rip 
set distribute-default=never redistribute-static=no redistribute-connected=no redistribute-ospf=no redistribute-bgp=no \
    metric-default=1 metric-static=1 metric-connected=1 metric-ospf=1 metric-bgp=1 update-timer=30s timeout-timer=3m \
    garbage-timer=2m 

/ routing rip interface 
add interface=all receive=v2 send=v2 authentication=none authentication-key="" key-chain="" in-filter="" out-filter="" \
    disabled=no 

/--------------------------------------/

Everything looks fine, IMHO. The only things I would do differently are to remove all the addresses from E2-LAN, since they are not needed to make PPP work, and to adjust the MTU to 1420.

Since you are not using any firewall rules disable connection tracking. It will save a good amount of memory and cpu.
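
A minimal sketch of that change, using the same menu path and `enabled` property that appear in the export above. It assumes no NAT or stateful firewall rules are in use (the export shows none), since those depend on connection tracking.

```
# Turn off connection tracking; the export above shows enabled=yes.
/ip firewall connection tracking set enabled=no
# Print the settings to confirm the change.
/ip firewall connection tracking print
```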

I would also turn compression off for your pppoe clients to save extra processing power.
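
As a sketch, the two compression properties shown as `yes` on the "pppoe" profile in the export could be switched off like this (the profile name is taken from the export; adjust it if yours differs):

```
# Disable both compression options on the "pppoe" profile.
/ppp profile set pppoe use-compression=no use-vj-compression=no
# Print the profiles to confirm the change.
/ppp profile print
```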

-Gerard

bbmj … just for reference… the IP addresses on the LAN interface ARE required. They provide our management layer to the CPE devices. They have nothing to do with the PPPoE termination.

Connection tracking is a big thing, if you turn it off, you should save quite a bit of CPU time.

You will also need v3 for dual-core support. It would be worth upgrading to v3 and turning off connection tracking to see how that does.

Make sure you turn ON multi-processor support in v3 though; it’s not on by default.
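
For reference, a sketch of where that switch lives in v3, as far as I recall. Treat the menu and property name as an assumption and check them against your build. Also note that later posts in this thread report hangs under PPPoE load with it enabled.

```
# Assumed v3 syntax: enable multi-CPU (SMP) support.
/system hardware set multi-cpu=yes
# Print the hardware settings to confirm the change.
/system hardware print
```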

I agree with bbmj: I wouldn’t put 64.xx.xx.33 on interface E2-LAN, and if this IP is required there, I would change the local address in the PPPoE profile to any other free IP.

I would also disable change-tcp-mss, since it creates two dynamic mangle rules for every user.
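
A rough sketch of both suggestions against the "pppoe" profile from the export. The replacement local address is deliberately left as a placeholder because the thread does not name one. Other posters in this thread warn that turning off MSS adjustment can break some websites, so treat it as something to test rather than a fix.

```
# Stop the profile from adding per-session dynamic MSS mangle rules.
/ppp profile set pppoe change-tcp-mss=no
# If 64.xx.xx.33 must stay on E2-LAN, point the profile at another free address.
# FREE-ADDRESS is a placeholder, not a value from the export:
# /ppp profile set pppoe local-address=FREE-ADDRESS
```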

Multi-processor on ROS v3 is not working: ROS just stops responding to PPPoE then hangs or reboots:
20:57:17 system,error,critical router was rebooted without proper shutdown

Under a not so heavy load (<200 users) even with multi-cpu turned off there’s a problem with dropped packets (5% or more).

It’s a pity, but 3.0 doesn’t really look like a release, just another not-so-stable beta ;(

Maybe this is all specific to PPPoE. Again, we have clients running 300 Mbit of routing (not PPPoE) on 3.0, and they save LOTS of CPU time by having multi-cpu on!

I understand you are saying it’s not working, but I don’t know if that is the cause of your issues. I have not had issues with it yet! But of course, most of the clients are not running 1500+ PPPoE sessions. Just routing.

Well… just to post a reply to this topic. Here is what I’ve decided to do, with permission of my CTO.

~~ Hardware ~~
PowerRouter 732
• Dual-Core CPU (3.0GHz)
• 2 GB DDR2 Memory
• Mikrotik ROS v3.0 (multi-cpu enabled)

~~ Interface Setup ~~
E1 → VLAN WAN LINK w/Public IP
E2 → PPPoE Interface w/VLANs
E3 → PPPoE Interface w/VLANs
E4 → PPPoE Interface w/VLANs

Because I can’t justify the 10K+ that it would take to purchase load testing equipment for my lab, here is what I’m going to test, one step at a time…

• OSPF – peering with two Cisco 7206-G1 routers and Cisco 3550 switches (routing mode)
(note: One 7206-G1 router is currently terminating 3000+ PPPoE sessions w/OSPF & Full BGP from two Tier1 internet uplinks)
• Take smallest site (50+ PPPoE sessions) and move the VLAN to the second interface (E2) … watch load / resources
• Take next smallest site (150+ PPPoE sessions) and move the VLAN to the second interface (E2) … watch load / resources
• Take next site (400+ PPPoE sessions) and move the VLAN to the second interface (E2) … watch load / resources
— Depending on how this is working will determine if I continue. With these settings I will have approx 600+ PPPoE sessions terminated on the router.
— This will test two things…

  1. 500+ users (already accomplished with v2.9.48, then started seeing problems after 500 users)
  2. OSPF peering with Cisco equipment (previous problem in < v2.9.48)

• The big step… take the next site (1500+ PPPoE sessions) and move the VLAN to the third interface (E3) … watch load/resources … wait for failure… keep waiting… (we are now looking at 2000+ PPPoE sessions on this device …)
• Take the next site … (it just keeps going from there!)
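
For the "watch load / resources" part of each step, a few console commands along these lines could be run after every site is moved. This is only a sketch: `ether2` is a placeholder interface name, not one taken from this router.

```
# CPU load, free memory and uptime at this step.
/system resource print
# Number of PPP sessions currently terminated on the router.
/ppp active print count-only
# State of the OSPF adjacencies with the peer routers.
/routing ospf neighbor print
# Live throughput on the interface the site was moved to
# (ether2 is a placeholder; use the actual interface name).
/interface monitor-traffic ether2
```

Recording these numbers at each step makes it easier to see at what session count the load starts to climb.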

Let me know if you have a timetable, and I can see about being available to watch progress and provide some support if necessary.

I would try without connection tracking. Multi-CPU has shown a great performance increase. There was also some talk about removing the change TCP MSS value as it creates lots of rules etc.

This is not a solution. You will have a lot of problems by turning this off: some websites won’t open, MSN Messenger doesn’t work, etc.

Under a not so heavy load (<200 users) even with multi-cpu turned off there’s a problem with dropped packets (5% or more).

Yes, this is a big problem for V3.

As of this morning… testing has been completed. I will post details once I’ve had more sleep.

Well… We completed testing last night with PPPoE load. Using the steps listed above we moved traffic over to 600+ users with the CPU averaging 20-23% utilization. We then moved over a big site (1000+ PPPoE sessions). The CPU was running well at 33% utilization, memory 90% free, …

20min into it, approx 1600 PPPoE sessions, traffic throughput of 20-30mb (at 2:30AM that was good) … the winbox and telnet dropped. Checked my rolling pings and I still had communication with my router interface, checked my SNMP monitoring system of switch interfaces and ALL traffic had flat-lined. … … oh boy… this is not good.

We were able to get back into the router via telnet and generate a supout file (took over 20min to build it) … so now we are back to the testing board.

(supout has been sent to mikrotik for review)

The only response you get from support will be “turn off multi-cpu”.
And on the question, when this will be fixed:
“About multi-core - sorry, I can’t give you any estimates.”

The only response you get from support will be "turn off multi-cpu".

... That is the response we received...

"About multi-core - sorry, I can't give you any estimates."

... and yes, this was the response as well.


.......... Now on to further testing results ...........

The test results are in. We disabled the multi-cpu support on this device and loaded up the system. The test went very well! We were astonished at the results. We loaded up the system with 2,631 PPPoE sessions and 3,464 OSPF routes from two peer routers, while maintaining a 62% CPU load average for 30 minutes. The initial sessions were running for over 2 hours, with constant traffic running through the system.

View the attached graphs / screenshots for the proof!!!


I want to throw out a BIG thank you to Dennis Burgess with LinkTechs. LinkTechs is the hardware vendor for the PowerRouter 732. He worked with us during this testing process in our network maintenance window. I highly recommend this router appliance as a 1u, stable solution. Check them out for your next router needs. http://www.linktechs.net

Now if we could just get Mikrotik to fix the SMP problems, this would be an excellent appliance for terminating PPPoE sessions with a load-balanced cluster solution.

Lucky man you are… We have a much heavier load, 'cause we need shaping and filtering.

Well… We completed testing last night with PPPoE load. Using the steps listed above we moved traffic over to 600+ users with the CPU averaging 20-23% utilization. We then moved over a big site (1000+ PPPoE sessions). The CPU was running well at 33% utilization, memory 90% free, …

20min into it, approx 1600 PPPoE sessions, traffic throughput of 20-30mb (at 2:30AM that was good) … the winbox and telnet dropped. Checked my rolling pings and I still had communication with my router interface, checked my SNMP monitoring system of switch interfaces and ALL traffic had flat-lined. … … oh boy… this is not good.

I don’t buy this. You should:

Enable the Change MSS option in the PPP profiles. As you probably know, there are a lot of problems with PPP if you don’t enable this.

You also have just 20-30 M of traffic for 1600 PPPoE connections. Try to get at least 100 M so we can see whether it still runs stable : )

After this you should post your CPU usage. Also, please post statistics of LOST packets if you use V3!

We pinged a good number of customers from remote off-net IPs and never lost more than a single packet. The loss was well under 1%. Nothing like what was reported of 15+% ..

Which RouterOS version are you running? How many PPPoE tunnels did you have active at the time of your ping test?