ROS 5.10 - 5.12 very unstable as a PPTP BRAS on x86.

ROS 5.10 - 5.12 is very unstable as a PPTP BRAS on x86: every 5-12 hours I see a soft lockup on a random CPU core, after which the whole system hangs. ROS 5.11 gives noticeably better performance on my BRAS (about 150+ Mbps, until it hangs). A supout.rif is not created when the hang happens.
ROS 5.6 seems to be very stable (uptime is generally over 2 weeks), but in PPTP mode performance is very low, below 130 Mbps.
Configuration: x86, Core 2 Quad Q6600 + dual-port Intel 82576 NIC + an RTL8169 NIC for communication with the billing server (RADIUS protocol). Billing: Abills 0.52b. A simple queue for each user, connected via PPTP.
With 120+ users online I see that the load on one of the CPU cores reaches 100% (98-100% from interrupts). This 100% load moves from one core to another, and the total throughput of the BRAS is limited to around 130 Mbps. Meanwhile the other cores are loaded at 5-20% (System > Resources > CPU), and in Tools > Profile I see that the busy core is loaded mostly by queuing and firewall.
Today I bound the TX/RX queues (IRQs) of the NICs to different CPU cores and increased the ethernet-default queue to 400 and the default-small queue to 50. Now I am gathering load statistics.
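For reference, the queue-size part can be done from the CLI with something like this (a minimal sketch; I only touch the stock queue types named above, and menu details can differ slightly between ROS versions):

    # raise the packet limit of the stock pfifo queue types
    /queue type set [find name="ethernet-default"] pfifo-limit=400
    /queue type set [find name="default-small"] pfifo-limit=50

    # check the result
    /queue type print where name~"default"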
On this forum I read that I should disable RPS on my Ethernet interfaces. How can I do this? I can't find it.
Any ideas for increasing PPTP performance? (Linux accel-ppp + Quagga?)

P.S. A cracked ROS 3.30 was very stable, but it doesn't support the 82576.
I bought an L5 license for ROS 5.6, and I don't see why I should pay for such instability.
P.P.S. Sorry for my English, I'm from Ukraine.

I found where I can disable RPS:
System > Resources > RPS
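A possible console equivalent, for anyone searching later (the exact CLI path and verbs are a guess on my part; the Winbox path above is what I actually used):

    # disable all Receive Packet Steering entries
    # (Winbox: System > Resources > RPS; the CLI path here is an assumption)
    /system resource irq rps print
    /system resource irq rps disable [find]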

This is the 100% load on a random CPU core:
mtk-la100.JPG

Another one:
mtk3-la100.JPG

This is the daily load graph of the backbone interface (ether4).
The interface is limited to 200 Mbps by our provider.
c_3569_day.png

Hey man I was bored to death and saw you had a problem that no one helped with.

I see you use an RTL8169. Have you tried removing it and replacing it with a card based on a totally different chipset?

I will second NetworkPro's suggestion. I tried various ROS versions, upgrading and downgrading, but there was always a problem. Then I changed the card for a NIC with an Intel-based chipset and had no more problems. It might be a problem with the hardware chipset, it might be a ROS driver problem. Who cares; I now know to stay away from the Realtek chipset. I haven't got the time to investigate every reason why something doesn't work! :slight_smile:

"+ an RTL8169 NIC for communication with the billing server (RADIUS protocol)"
That is the ether6 interface, with traffic of 30-40 kBps (max 50 pps).

If you take two or three supouts when this problem happens and send them to support, what is their answer about what is causing the 100% CPU there?

Here is their answer:
Hello,

  1. in /queue interface menu set all queues to multi-queue-ethernet-default

I changed the Ethernet interfaces' queues to multi-queue-ethernet-default; it did not affect the router's behavior.
I don't know how I can change the queue type from "default-small" to multi-queue-ethernet-default on my dynamic PPTP user interfaces.
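For the Ethernet interfaces the change looks roughly like this (a sketch; [find] simply selects every entry in the Interface Queues list):

    # point every entry in Queues > Interface Queues at the type support suggested
    /queue interface set [find] queue=multi-queue-ethernet-default

    # verify which queue type is actually active per interface
    /queue interface print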

  2. in /system resources rps menu disable all entries

I did this 3 days before sending the supouts to support; it did not affect the router's behavior.

  3. in /system resources irq try to allocate cores manually (by default it allocates at least one core per interface)

I did this 3 days before sending the supouts to support; it did not affect the router's behavior.
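Roughly what I did there, as a sketch (I am assuming the per-entry property for the target core is called "cpu"; check what your version actually shows in that menu):

    # list the IRQs and which core currently services them
    # (Winbox: System > Resources > IRQ)
    /system resource irq print

    # pin one IRQ entry to a specific core, e.g. entry 0 to core 1
    # ("cpu" as the property name is an assumption)
    /system resource irq set 0 cpu=1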
Examine this presentation:
http://www.tiktube.com/index.php?video=JGdn3goDdEHnnpLLIEDwGtnwolLonKGo=

Regards,
Janis Megis

Nothing new for me there; I have already done all of these optimisations.

It seems to me that it is not necessary to attach supouts; MikroTik support doesn't use them when analysing user complaints.

I like these optimisation tips. Good info.

When you allocated IRQs to cores, can you tell if the core that goes to 100% is the one with the RTL8169 IRQ allocated to it?

Does multi-queue-ethernet-default work with the RTL8169?

I don't know. No errors are displayed when I change the RTL queue to multi-queue-ethernet-default.
But with multi-queue-ethernet-default the kind of buffer shows as "unknown", instead of "pfifo" as with ethernet-default.

Use the multi-queue. Does the problem still happen?

Send one or two supouts when it is not happening and two or three supouts when it is happening. Maybe they could catch it by comparison.

Yes, it still happens.
The problem appears when any PPTP user (it may be only 1 user, or a few users) exceeds a speed of 20-25 Mbps: 100% load on one of the CPU cores, moving from one core to another.
When all users are running at 6 Mbps everything is fine and the load is balanced normally across the CPU cores. But in that case the overall load does not exceed 100-110 Mbps, and I don't know what the result will be when total traffic exceeds, for example, 150 Mbps.
It seems to me that this is a limitation of the userspace PPTP server software in MikroTik ROS.

Can you make a Linux installation with the same router configuration and boot it on the same hardware to test? :slight_smile:

I moved from Slackware (2.6.23 kernel + accel-pptp 0.8.5) with:

  • self-made billing (all settings, DB and statistics in MySQL, lots of Perl scripts) with NAT, on this same Q6600
  • 2 x 100 Mbit Intel desktop NICs and 80-95 Mbps peak traffic load

to ROS 3.30 + Abills billing + 2 x D-Link DGE-530T NICs (Marvell?) + a BGP default route (to handle a real IP per user).
Then I bought the 82576 NIC and ROS 5.6 (needed to handle the 82576), and the troubles began.
On the attached interface utilisation diagram:
Apr 26, 2011 - Linux replaced by ROS 3.30
Sept 17, 2011 - ROS 3.30 upgraded to ROS 5.6, NICs upgraded to the dual-port PCIe 82576.

I moved to ROS because my self-made billing can't handle users with real (public) IPs (it was built back in 2002-2004).
Today I am preparing my network to move to PPPoE, and later to IPoE (I have 4 routed /24 subnets).
c_3569_year.png

What's the difference between real IPs and... uhm... unreal (?) IPs? :slight_smile:

That system has built-in NAT and no BGP.
It was self-made, but not by me.

Can you swap the motherboard for a different model to test? That is what I would do if I got stuck like this with a hunch about hardware + driver issues.

Similar results on Intel G33 (Gigabyte) and Intel G41 (Asus) boards.

It seems to me that it is the user-level code of the PPTP server software in MikroTik.

I've tested this box with both PPTP (all users) and 2-3 PPPoE tunnels (test users: 1 with MS-CHAPv2 + MPPE, and 2 with MD5 CHAP without MPPE). With PPPoE the throughput was near 100M and this load did not overload the CPU; I achieved 65-70 Mbps download / 25-60 Mbps upload per user (a netbook with an Intel Atom N450 CPU) in torrents, and there were no soft lockups on the BRAS CPU.