Problem: CHR CPU at 100% with 27 GHz of Xeon capacity

Hi,
we have installed a CHR release of RouterOS in a VMware VM on a dedicated physical host machine in our datacenter.
It acts as the PPPoE server on our network, with 1850 active subscribers.
On peak hours, subscribers see packet loss when they ping hosts on the internet (i.e. when traffic passes through the PPPoE server). In these moments we observe saturation of some of the virtual CPUs in the CPU usage table in the “Resources” section of RouterOS.
We have excluded other kinds of transport issues; the proof of this is that customers ping the PPPoE server's LAN interface perfectly, and the server (from its terminal) pings the internet smoothly.
It seems CHR isn't able to manage the available cores efficiently (it apparently spreads the load, but in an unbalanced way, leading to saturation of cpu0 and cpu1). As a result, some of the vCPUs remain partially idle. Performance graphs from vSphere show at most 6000 MHz of 27000 MHz in use.
Can you find any incorrect settings in our configuration? We attach a supout, the exported config, and some screenshots for your convenience.

Thanks in advance and best regards
Attachments: server.JPG, performance.JPG, cpu.jpg, mikrotik.JPG, core.JPG, cpu vmware.JPG

Some thoughts to lower the CPU load on your CHR. This is what I do on my CHR hosted on VMware ESXi.

1)- Consider swapping out both of your Intel Xeon E5-2637 processors for Xeon E5-2690s.
I have found that CPU built-in cache is more important than CPU clock speed - especially when running tiny hosts which mostly run in CPU cache.

2)- It looks like you have hyper-threading enabled. Disable it. Hyper-threading cuts your built-in Xeon CPU cache in half and also slightly slows down the system because your Xeon processor is emulating twice as many processors. This costs CPU throughput. Never use hyper-threading. Hyper-threading is only good for systems that cannot be hardware-upgraded where you need more CPUs (at the cost of slowing down the entire system to make it appear that it has more CPUs).
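If you prefer to do it from the ESXi shell rather than the host BIOS, recent ESXi releases expose hyper-threading as a kernel setting; this is from memory, so verify it on your release (the BIOS switch is the surer route):

    esxcli system settings kernel set -s hyperthreading -v FALSE

A host reboot is required either way.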

3)- There may be some issues with your five network cards.
Each network card should use a unique CPU interrupt. If I remember correctly, you can go up to four network cards with unique interrupts; five network cards and more will start using shared interrupts. Shared interrupts slow down the system because the operating system now has the added job of determining which device triggered an interrupt. You can check this from the RouterOS console, as shown below.
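A quick way to see whether your cards ended up on shared interrupts (output layout varies by RouterOS version):

    /system resource irq print

If two ethernet entries list the same IRQ number, they are sharing an interrupt.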

4)- Configure your VMware ESXi so that the next time you boot/power on your CHR, you end up in the BIOS.
In the BIOS, disable/remove the following:
serial ports
floppy drive
CD drive
Each of these unnecessary, unused devices has an interrupt associated with it. There is no point in having the CHR check these devices to see if they generated an interrupt.

5)- Network cards for CHR via VMware ESXi:
— If possible, drop down to a max total of only 4 network cards !!!
— If you can, use 802.1Q VLANs on your CHR - it works pretty well (see the sketch below)
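A minimal RouterOS sketch of replacing extra physical NICs with 802.1Q VLANs on one trunk port (the interface name, VLAN IDs and address are placeholders):

    /interface vlan add interface=ether1 vlan-id=100 name=vlan100
    /interface vlan add interface=ether1 vlan-id=200 name=vlan200
    /ip address add address=192.0.2.1/24 interface=vlan100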

6)- After all of the above … perform a btest (UDP send) to 127.0.0.1.
With your older/slower Xeon processor, I would guess you should hit 2-Gig to 11-Gig.
My CHR btest to 127.0.0.1 reports 17 to 19 Gig.
FYI - 127.0.0.1 is an internal IP address in your CHR. Kinda like a built-in internal virtual network card.
If you tweak any settings to improve your CHR, get a baseline btest to 127.0.0.1 prior to making any changes. The faster the btest to 127.0.0.1, the faster your CHR is running.
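A minimal baseline test from the RouterOS terminal (parameter names from memory; check /tool bandwidth-test in your version, and add user/password if your btest server requires authentication):

    /tool bandwidth-test address=127.0.0.1 protocol=udp direction=transmit duration=30s

Note the tx rate it settles at, then re-run it after every tweak.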

7)- If you have other virtual hosted machines on your VMware ESXi server … then as a test, stop/shut down the other servers, then run a btest to 127.0.0.1 and see if there is any difference.

8)- You should always get much better/faster network throughput if you use Intel 10-Gig network cards.

9)- On your VMware ESXi server, go into the BIOS on your physical machine. Delete/remove any devices you do not need (you want to keep the physical interrupt count down so that your VMware ESXi physical server does not spend a lot of time processing interrupts - which in turn slows down everything it hosts). Disable all power management and set the BIOS for maximum performance.

10)- In VMware ESXi, edit your CHR settings (console - Edit - Options - General … then disable logging). The .vmx equivalent is in the sketch after item 12.

11)- If you know how to, convert each hosted machine's “Hard disk” from “thick” to “thin” provisioning. Do this for all guest machines on your physical VMware ESXi server.

EDIT:
12)- Every hosted guest machine, including the CHR, should be using VMXNET-3 network cards. This is a paravirtualized network card, optimized to enhance network throughput and lower the CPU cost of operating the network interface.
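For items 10 and 12, the equivalent lines in the VM's .vmx file look roughly like this (edit with the VM powered off; ethernet0 stands for whichever adapter you are changing):

    ethernet0.virtualDev = "vmxnet3"
    logging = "FALSE"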


Post something if anything here helps

North Idaho Tom Jones

Hi,
after your post I changed the ethernet driver to VMXNET-3 and I can see a slight improvement. The strange thing is that I need to keep RPS enabled to have the load spread over all the vCPUs. Without it, only CPU0 and CPU1 are loaded; the others remain idle. I've read it would be better to disable RPS when using VMXNET-3.

I've also physically removed two of the five ethernet cards.

Throughput tests to the loopback address are around 12 Gbps.

If I increase the vCPUs assigned to the VM inside ESXi from 2 to 4 to 8 to 16, CHR performance gets worse.
MikroTik's cumulative CPU usage (from Resources) remains at about 95%, while the value in the CPU summary of ESXi is as low as 20%. This is the most evident issue.
I would like to see the physical CPUs of the host machine loaded a lot more than 20%.

I guess that if I were able to install RouterOS x86 directly on the physical machine I'd get all the CPU cycles used, but I have a problem with RAID support (the controller is not recognized).

Thanks for your suggestions and tips. The load has been lowered by about 10% this way, but the aim of my experimentation is to take advantage of the whole power of the physical machine.

Maybe you should consider a CCR1036-8G-2S+EM ?

I tried a CCR1072-1G-8S+: 100% CPU with various continuous PPP disconnects. I removed it after 10 minutes because of the network instability.

That sounds like the “known” problem of having so much processing associated with disconnects that the system overloads when users disconnect, causing more users to disconnect due to the overload, resulting in an avalanche.
A couple of known causes:

  • using MASQUERADE instead of SRC-NAT (see the sketch after this list)
  • using a routing protocol that creates routes for each individual IP (e.g. for fail-over)
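A sketch of the first cause in RouterOS syntax (the WAN interface name and the public address are placeholders). Masquerade re-resolves the source address and flushes related connection-tracking entries whenever a route or link flaps, which is exactly what a PPPoE disconnect storm provokes; src-nat to a fixed address avoids that work:

    # avalanche-prone under mass PPPoE reconnects:
    /ip firewall nat add chain=srcnat out-interface=ether1-wan action=masquerade
    # cheaper and stable when you have a fixed public address:
    /ip firewall nat add chain=srcnat out-interface=ether1-wan action=src-nat to-addresses=203.0.113.1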

scara – install RouterOS x86 on the physical machine
ROS x86 is OK; however, in my environment I found ROS x86 was prone to locking up.

I run the Mikrotik public accessible btest server ( 207.32.195.2 ).
In the beginning when I set it up, I was running x86 ROS. My physical VMware ESXi servers had/have Intel 2-port 10-Gig SFP+ cards. My Internet feed is also connected via a 10-gig port, and my Cisco switches are also 10-Gig. My 207.32.195.2 x86 ROS btest server could hit up to around 6-Gig to/from the Internet on btests. I experienced a problem almost on day 1 where my x86 ROS btest server would just lock up. All networking would halt - I could not even ping 127.0.0.1 - although the x86 ROS console was still talking. It hit the point that I would have to reboot the x86 ROS btest server almost every other day. (I am guessing the problem may be somewhat related to there being no paravirtualized Ethernet drivers.)

Months later, I changed from x86 ROS (32-Bit) to CHR 64-bit. The CHR has never locked up as a btest server.

In conclusion, I found the following:

  • For almost everything, x86 ROS (32-Bit) and CHR (64-Bit) have almost the same throughput ability
  • As a btest server, x86 ROS is prone to random crashes
  • As a btest server, CHR has proven 100 percent stable.

Now - re your possible decision to try/run x86 ROS, I would say “If it is not critical core, then give x86 ROS a try”.

North Idaho Tom Jones

How does the CPU of the CHR average 90% while the CPU of the physical machine stays at 20%, when it's all devoted to the CHR?

Have you tried setting Latency Sensitivity: High on VM settings ?
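If you set it via the .vmx file rather than the vSphere UI, the key should be the one below; note that “high” latency sensitivity expects full CPU and memory reservations for the VM:

    sched.cpu.latencySensitivity = "high"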

Also, what NIC cards do you have?
Maybe you should change the NIC driver kernel parameters in ESXi so it has more tx/rx queues.
RPS should be disabled, and the multi-queue interrupt default should be set on the NIC interface.
Besides that, check Resources -> IRQ to see how many tx/rx queues of vmxnet3 are available (see the sketch below).
I think 35% ethernet usage is too much…
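On the RouterOS side, RPS lives under the same IRQ menu; this is from memory (v6 menu paths), so verify it on your console:

    /system resource irq rps print

Each listed interface can then be switched with the enable/disable commands in that menu; the entries under /system resource irq show how many vmxnet3 tx/rx queues ESXi actually handed to the guest.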

I can help you with that.

To the original poster - did you ever try disabling Hyper-Threading ?

FYI: Hyper-Threading is a built-in function of many newer Intel CPUs - a semi-software/semi-firmware/semi-hardware feature which makes a single CPU behave as if it were two CPUs. If you have a multi-core processor, then Hyper-Threading makes the CPU count appear twice as large.

Hyper-Threading uses internally timed CPU interrupts which act like this:
#1 run Virtual CPU #1 (of 2)
#2 a few moments later, stop Virtual CPU #1 (of 2)
#3 copy the contents of the CPU registers used for Virtual CPU #1 (of 2) to a 1st memory location (aka pop the stack)
#4 copy the contents of a 2nd memory location to the CPU registers to be used by Virtual CPU #2 (of 2) (aka push the stack)
#5 run Virtual CPU #2 (of 2)
#6 a few moments later, stop Virtual CPU #2 (of 2)
#7 copy the contents of the CPU registers used for Virtual CPU #2 (of 2) to the 2nd memory location (aka pop the stack)
#8 copy the contents of the 1st memory location to the CPU registers to be used by Virtual CPU #1 (of 2) (aka push the stack)
#9 jump to #1 (and continue looping #1 through #9)

The issue here is that both virtual CPUs have a combined total processing power of just slightly less than a single non-Hyper-Threading CPU, because of the following:

  • The CPU has to pop and push the stack non-stop. This takes away CPU processing time that could have been used doing something useful.
  • The CPU's built-in cache is constantly re-learning and re-populating its contents. The memory being cached is now shared and constantly re-written. With a non-Hyper-Threading CPU, the CPU cache has a better chance of cache hits. CPU cache runs at CPU speed: on a cache hit, the processor can process at CPU speed; on a cache miss, the CPU has to slow down to memory speed (plus any wait-states imposed on memory). So, by having as many cache hits as possible, you actually run much faster. If the operating system memory and/or program used is small enough, it is possible to achieve a near-100-percent cache-hit system. This can really make a system go much, much faster.

There are pros and cons to Hyper-Threading. If you plan out your hardware and software/programs correctly, a non-Hyper-Threading system can be measurably faster.


FYI - one of the reasons I prefer newer Xeon processors with the largest amount of CPU cache is this:
A - A slower Xeon processor with lots of CPU cache can give you a near-100-percent cache-hit system running almost always at CPU clock speed.
B - A faster Xeon processor with a small amount of CPU cache might give you an almost-always cache-miss system, which results in the CPU processing memory at memory speed plus memory wait-states.

In general, when selecting components for a bad-ass computer, I always prefer the Xeon processor with the largest amount of CPU cache first, and clock speed second.

North Idaho Tom Jones

We are having the same problem: users are getting slow service and huge ping drops once we are over 3000 customers.
The CHR uses 40% CPU and host CPU utilization does not exceed 65%. If 2000-plus customers are connected the results are fine, but over 3000 users the CPU and data utilization on the CHR remain the same while services slow down and we see huge ping drops.

Along with all of the above items I have posted here …
If your CHR is running on a hypervisor (aka something like VMware ESXi), there are some additional tricks you can do:

  • On the physical BOX, remove all devices not needed. (CDROM, Serial Ports, Parallel ports, Floppy drives …)
  • On the physical BOX, in the BIOS setup, disable all non-necessary devices (CDROM, Serial Ports, Parallel ports, Floppy drives …)
  • On the physical BOX, in the BIOS setup, set all settings to performance instead of the default slower-power-save settings.
  • On the physical BOX, whenever possible, try not to have any two devices use the same CPU interrupt. For example, two network cards using the same interrupt force the host operating system to spend time figuring out which device needs attention - this takes CPU processing time away from doing other things.
  • On the physical BOX HyperVisor, convert all virtual drives to THIN for all virtual machines.
  • On the physical BOX HyperVisor, disable logging.
  • On the physical BOX HyperVisor, try to operate two physical boxes: one box for your essential fast virtual stuff and a second physical box for all other non-essential guest operating systems that do not need to be top priority.
    Also …
  • On the Virtual box, remove all virtual devices not needed. (CDROM, Serial Ports, Parallel ports, Floppy drives …)
  • On the Virtual BOX, in the virtual BIOS setup, disable all non-necessary devices (CDROM, Serial Ports, Parallel ports, Floppy drives …)
  • When possible, use ParaVirtual devices (device drivers). ParaVirtual devices are virtual devices whose drivers are optimized to perform faster/better when running in a HyperVisor environment. Examples of ParaVirtual devices on VMware ESXi are the “VMXNET-3” network card and the “VMware Paravirtual” SCSI controller.
  • When possible, if you have a 10-core Xeon CPU (with Hyper-Threading disabled), the number of CPUs assigned across all guest operating systems should be 9 or fewer. Avoid over-subscribing the CPUs and memory; over-subscription of CPUs and memory causes the physical HyperVisor to start swapping. (A reservation sketch follows.)
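For that last point, reservations can also be pinned in the guest's .vmx file (values are placeholders; sched.cpu.min is in MHz, sched.mem.min in MB):

    sched.cpu.min = "8000"
    sched.mem.min = "4096"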

One thing I think would make a really fast CHR would be to try this:

Convert the virtual CHR system to a physical box (no hypervisor - the CHR is the only booted operating system).

Also - somewhat related:
In ROS or CHR ROS, avoid the use of bridges whenever possible. Software bridges take CPU time away from other work. If you need to bridge a network, use a physical Ethernet switch. If possible, put IP addresses directly on an interface and not on a bridge (see the sketch below). Also, the more firewall filters you have, the more everything slows down.
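A quick illustration of the bridge point (names and addresses are placeholders); the second form keeps the software bridge out of the forwarding path:

    # address on a single-port bridge - extra per-packet CPU work:
    /interface bridge add name=br-lan
    /interface bridge port add bridge=br-lan interface=ether2
    /ip address add address=192.0.2.1/24 interface=br-lan

    # address directly on the interface - no bridge in the path:
    /ip address add address=192.0.2.1/24 interface=ether2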

Also - a decent CHR or x86 ROS system should be able to perform a “UDP send” tools-btest to 127.0.0.1 and achieve and hold a rate faster than 17-Gig !!! If you can’t hit that mark, then you have a slow physical box.

And also - if possible, avoid using 1-Gig Ethernet interfaces. A high-throughput data center should be using 10-Gig switches and 10-Gig Ethernet interfaces on everything.

Dear Tom,

Yes, I knew about VM tweaking; in fact we have used a specialized high-end computing network card, but still no luck.
One of my tweaks pushed my host CPU to 100% at 2 Gbps, but the situation stayed the same: over time more users connected and data throughput did not cross 2 Gbps.
One more issue I observed: I got a message that 1522-byte frames cannot be sent over a 1500 MTU on my host, as the CHR L2 MTU is 0 and MikroTik has disabled the edit option in this section.
I don't know why they did not set a 10G L2 MTU on a 10G sync interface; they just hard-coded it to 0, which is completely unpredictable.

I have run btest with three of our CCRs and easily got 3 Gbps over UDP, but under user load it suffers, as I mentioned above.

After my second round of tweaks I got lower CPU usage on the host, but it lowered CHR throughput as well; after my last tweaks the CHR even drops data above 1400 to 1500 Mbps.
I also have a CCR1072 with similar issues, which other users have also reported in production environments.
I was hoping the CHR would not have such issues, since MikroTik claims it is a virtualization-aware/tuned appliance.

This is somewhat interesting.
I have been able to achieve faster than 8-Gig sustained when routing between a virtual CHR hosted on one VMware ESXi machine and a virtual x86 ROS hosted on a totally different VMware ESXi machine, with two other x86 virtual servers performing a btest where the btest sessions were layer-3 routed.
VMware ESXi server #1: btest server (connected via 10-Gig Ethernet to #2)
VMware ESXi server #2: CHR routing the non-NATted LAN from server #1 to server #4
VMware ESXi server #3: x86 ROS routing the non-NATted LAN from server #4 to #1 (via #2)
VMware ESXi server #4: x86 ROS system performing the btest to the btest server on #1

Everything is 10-gig
All x86 ROS and CHR systems hosted on four different VMware ESXi physical systems.
No NAT - No Firewalls

The btest sessions are routed from #4 to #1 (no btest sessions running on the same network - thus forcing routing)
I was able to hold a sustained UDP send or receive faster than 8 gig.

North Idaho Tom Jones

Sheezzzeee

I am totally convinced that VMware running on a XEON E5410 @ 2.33 GHz (2 physical CPUs with 4 cores each, Hyper-Threading disabled) on a Dell PowerEdge 2950 with 16 gigs of RAM is the slowest thing I have ever encountered. It takes forever to do anything.

My SuperMicros with XEON E5-2690 V2 processors running at 3.00 GHz totally scream crazy fast compared to the Dell with the XEON E5410.

I don't think I have a misconfiguration. I would suggest that anybody using a XEON E5410 (version 1) dump the physical box and use a faster box for their HyperVisor to run on - especially if you are running a virtual router such as CHR or x86 ROS.

North Idaho Tom Jones

Dear Tom/ Experts,

You may have achieved that throughput on L3 traffic with btest; I can also get approximately 3 Gbps+ with our three CCRs as clients and the CHR as server, but in a real production environment the CHR has really disappointed me with tunnel traffic.
I have now tested PPTP as well as PPPoE; as I said earlier, the CHR drops packets in both tunnel types, with PCQ and with dynamic simple queues alike.
I have checked the hypervisor software, and we do not observe a single packet drop on the main interface of the ESXi host.
Furthermore, I am using an HP G8 series server with dual octa-core processors.

Dear Tom,

Kindly share your VMX config file, if possible.