CPU spikes on pptp and l2tp VPN server

Hello,

I'm seeing somewhat strange CPU usage on a nice white CRS125-24G-1S: the CPU load goes up and down all the time, and even changing the CPU frequency doesn't help. I tried going from 600 to 750 MHz, which is a 25% difference, but the spikes are the same.

The router itself works as a "one WAN, many LAN" router and switch (no sophisticated setup, just simple NAT), and also as a pptp/l2tp VPN concentrator. It authenticates users against RADIUS, which is Windows IAS. During the day there can be up to 150 VPN users online, but the total traffic on WAN is not really big: I see a mean usage of 10 Mbit/s and rare spikes to 50 Mbit/s.

When I try to find out what uses the most CPU, I see the "management" process and nothing more. This is like a black box, a kind of "default" category where all unidentified load goes. As far as I understand, pptp/l2tp is CPU-based with no hardware acceleration available, so this CPU usage could be caused by the VPN.

But I would like to ask: are there any tricks to lower CPU usage on the device? From what I see, fasttrack won't help me with pptp packets, and I'm not sure about l2tp.

management is your own Winbox management session, it could be the tables you are displaying, etc.

Try using system resource monitor in the CLI to see if CPU usage is the same.
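For reference, a minimal sketch of checking this from a terminal rather than Winbox. The 10-second duration is only an example value, not something from this thread:

```
# live CPU and memory summary, refreshed until interrupted
/system resource monitor
# per-classifier CPU breakdown, sampled over 10 seconds (example duration)
/tool profile duration=10s
```

Running these over SSH with Winbox closed should show whether the management load comes from the GUI session itself.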

I'm amazed, however, that a CRS is able to cope with 150 tunnels. Do those spikes reach 100% CPU usage?

I'm amazed too, and to my even greater surprise the routing works well (no complaints from LAN users). But each tunnel carries very little traffic, so I think the whole picture might be different if heavy traffic went over these VPNs.

Yes, spikes are at 100%.

And yes, profiling in the CLI gives me lower CPU usage, thank you for the advice. I wish I could afford to gather this CLI output once a second so I could create a per-type graph later, but I think this would create additional load.
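One low-effort way to sample the load on the device itself is a scheduler entry that writes the overall CPU load to the log. This is only a sketch: the name cpu-sample and the 10-second interval are assumptions, it records total load rather than the per-classifier breakdown, and it adds some load of its own.

```
/system scheduler
# "cpu-sample" and interval=10s are example values, not from this thread
add name=cpu-sample interval=10s on-event=":log info (\"cpu-load=\" . [/system resource get cpu-load])"
```

A longer interval keeps the overhead down; the entry can be removed again with /system scheduler remove cpu-sample.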

Are there any tools to profile deeper? Maybe there are some extra tricks to ease the CPU load in this scenario?

The problem is that we could raise the load by monitoring… several alternatives come to mind, but you'll have to try which one has the least impact on load:

Use the PHP API to fetch CPU usage and throughput, for example, and feed an RRD graph
Activate SNMP and either feed an RRD graph, or use The Dude to graph
Graph CPU usage on the CRS itself (I think this isn't a good idea)

STG is a lightweight, simple, no install required tool great for graphing directly from SNMP when you’re in a hurry or for a non permanent setup.
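A minimal sketch of turning SNMP on for polling, assuming a read-only community restricted to a single poller. The community name monitoring and the address 192.0.2.10 are placeholders, not values from this thread:

```
/snmp community
# placeholder community name and poller address; read-only access
add name=monitoring addresses=192.0.2.10/32 read-access=yes write-access=no
/snmp
set enabled=yes
```

Restricting the community to the poller's address avoids answering walks from anyone else, which matters on a device that is already short on CPU.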

To get the oids open a terminal, and issue

system resource cpu print oid

You can also get the OIDs from a menu, e.g.

/interface
print oid

Which process did get the most CPU utilization when profiled through CLI?

Indeed. A freshly installed CRS with no traffic and only the LCD turned on eats 6-11% of CPU. When I turned the LCD off, CPU usage dropped to 0%. I don't know how they implemented the LCD stats, but it is a hungry thing! Monster LCD! :slight_smile:
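For anyone who wants to try the same thing, the LCD can be switched off from the terminal as well (a one-line sketch for RouterOS v6):

```
# disable the front-panel LCD; set enabled=yes to turn it back on
/lcd set enabled=no
```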


Thank you for directing me to STG, I will take a look! I use Zabbix to get CPU and interface usage. I believe SNMP may use a bit more CPU, and I also don't know how accurately SNMP polls the device (I once saw a switch that used 100% of CPU during each SNMP walk over it, and two concurrent SNMP walks made it hang for minutes!), but I'm not sure that using the API will use fewer resources.

The most intensive processes were 'firewall' and 'idle', but neither management nor routing nor anything else. Moreover, the CLI shows the device as clearly idle (hard to believe), so I doubt it shows the right info. I would also like to know how CPU-based encryption is displayed in that list.

Firewall? Maybe rules can be optimized? Which ROS version? Are you doing QoS?

Regarding API load, a forum mate, boen_robot, is the PEAR_php developer, so he should have first-hand experience…

No QoS, and not many rules, but yes, I will take a look at them (I do use fasttrack, but with no visible success).
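For comparison, the usual fasttrack pair looks like the sketch below. It assumes a plain forward chain with no other rules that need to see established traffic first, and is not taken from the actual config:

```
/ip firewall filter
# fasttrack established/related connections, then accept what fasttrack cannot handle
add chain=forward action=fasttrack-connection connection-state=established,related
add chain=forward action=accept connection-state=established,related
```

Both rules have to sit near the top of the forward chain to have any effect on the firewall load.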

I have used every ROS version from 6.20 to the current RC (6.30rc13), again with no visible success in lowering CPU usage.

I just checked again (for much longer), and /tool profile in the CLI shows "management" with CPU up to 80%. The output is very "impulsive", but it looks much like the GUI profile window.

From what I know from the wiki http://wiki.mikrotik.com/wiki/Manual:Tools/Profiler#Classifiers, "management" is "RouterOS management processes that do not fall into any other classifier. For example, when routes added to kernel, internal messaging exchange between RouterOS applications, etc.". 80% for internal messages is a bit confusing, but if it covers handling tunnels (I mean control traffic, not CPU-based encryption of user data), this could be true.

Which kind of tunnels are you using? Are you using encryption?

As I've said, pptp and l2tp. Take a look:

/interface pptp-server server
  set enabled=yes
/interface l2tp-server server
  set enabled=yes
/ppp aaa
  set use-radius=yes
/radius
  add address=x.x.x.x secret=xxx service=ppp

Just like that, pretty much all defaults. l2tp doesn't use IPsec; it is the "last-resort" way to connect via VPN if the remote network won't pass pptp.

I also noticed that every VPN connection adds two mangle rules to adjust the MSS. I can't figure out whether that is "expensive" in terms of CPU.

By the way, how high should I set max-mtu and max-mru in the pptp and l2tp server settings: 1450 or 1460? Both are the default in different ROS versions.

I'd try to set just one MSS clamp rule (copy a dynamic one and change the interface to "all ppp") and disable the dynamic MSS clamp creation to see if that relieves the CPU spikes. Regarding the max MTU/MRU, 1460 is the default; bearing in mind this is a hub-and-spoke server with unpredictable client devices, I'd leave it at that default.
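A rough sketch of what that could look like in the CLI. It assumes the servers use the default-encryption profile, and the new-mss value of 1410 (which corresponds to an MTU of 1450 minus 40 bytes of IP and TCP headers) is only an example; copy the real value from one of the existing dynamic rules.

```
/ppp profile
# stop creating two dynamic mangle rules per tunnel (profile name is an assumption)
set [find name=default-encryption] change-tcp-mss=no
/ip firewall mangle
# one static pair covering all PPP interfaces; 1410 is an example value
add chain=forward protocol=tcp tcp-flags=syn in-interface=all-ppp tcp-mss=1411-65535 action=change-mss new-mss=1410 passthrough=yes
add chain=forward protocol=tcp tcp-flags=syn out-interface=all-ppp tcp-mss=1411-65535 action=change-mss new-mss=1410 passthrough=yes
```

With 150 tunnels this replaces roughly 300 dynamic rules with two static ones, so each forwarded SYN packet is checked against far fewer entries.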

I did that and also inspected the overall config. You can imagine my surprise when I found that netflow was set to be sent from bridge1 rather than from ether2 (which is the master port for all inside ports). After I changed that, CPU usage dropped by maybe 10-20% (judging by eye), but the spikes are still there. So I'd say we still have things to tune. But after this reconfiguration (netflow sent from the ASIC-driven "ether" port rather than from the CPU-driven "bridge" one) I think the device will be able to handle something like 20-30 extra VPNs. Or at least I'd like to believe so :slight_smile:. Nice device to have for its price, really!
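In CLI terms the netflow change amounts to something like the following sketch, assuming ether2 is the master port as described above:

```
/ip traffic-flow
# collect flows on the switch master port instead of the bridge
set enabled=yes interfaces=ether2
```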

What I still don't like are these spikes. It looks like some process eats CPU from time to time. One candidate is SNMP (and I'll try to get around this by switching to the API); another is the internal IRQ and process managers.

Here is what I figured out: for hours after a device reboot I see no huge CPU spikes, but after a while the CPU goes up. Here is the graph:

(the same graph, but bigger: https://monosnap.com/file/70VJ1yim5hk9FsHaKMxQ2Vvp0bUeYd.png)

These are two consecutive days, each graph covering 24 hours. The first shows how the router ordinarily behaves over a day. The second shows the CPU usage after a reboot at about 22:20. That reboot was an upgrade to the newest ROS RC (6.30rc19), which I tried in order to get rid of this strange behavior.

No idea why it goes this way. The usual load profile is that users connect in the morning (around 8-9 on the graph) and disconnect at 21-22. The traffic on both days was quite small, at most 3 Mb/s. The number of VPN users online right now is 148.

I'd supply some debug info, but I simply don't know which log would help.