CHR kernel crash under heavy traffic

Hi, I have a CHR running the latest version (v6.42.1) on a VMware ESXi 6.5 server.

I found that heavy traffic on an interface (>5 Gbps) causes the CHR kernel to crash.

The console shows “No irq handler for vector”.

The ESXi server log shows “The CPU has been disabled by the guest operating system. Power off or reset the virtual machine.”

Resources allocated to the CHR:
12 Cores of Intel(R) Xeon(R) CPU E5-2640 v3 @ 2.60GHz
Memory 2048GB
Network VMware VMXNET 3 x 10 (QLogic Corporation NetXtreme II BCM57810 10 Gigabit Ethernet)

How do I solve this problem?

We have observed 6.42.1 locking up on KVM with VirtIO drivers as well. Our throughput is considerably lower: 1.4 Gbps with 4 x Intel 2640v4 cores.

6.41.4 was stable prior to this…

i.e., me too…

PS: I have not observed any messages on the VM console; the screen doesn’t wake from blanking…

Disable connection tracking (conntrack).

Hello, I have also noticed that disabling connection tracking improves the stability of CHR. However, I need this feature since the CHR runs WebProxy.

What is the root cause of this issue, and can it be fixed?

I have had 2 crashes followed by automatic reboots on the same version using AHV (KVM).
Mine have been happening since I created an additional VRF (only 2 in total!!).
Did you upgrade? Was the later release more stable?

Thanks.

I’m having the same problem on CHR 6.42.9.

1 Gbps of traffic. Running on ESXi 6.7.

Any word from MikroTik?

I have moved my CHRs back to ESXi 6.0 and they have been stable since.
MikroTik support said they would help troubleshoot if needed once I was back on ESXi.

You can check your CPU ready stats by running the command:

esxtop

If %RDY (CPU ready) is above 20%, reduce the number of cores assigned to your CHR to better fit your hardware CPU.
All our CHRs run better with 2-4 cores than with 10 or 20.

When CPU ready rises above 100%, the CHR crashes.
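
As a rough sketch of the check above: it assumes the ready value esxtop shows for a VM is summed over its vCPUs, which would explain readings above 100% on a VM with many cores. Under that assumption the value can be averaged per vCPU before comparing it with the 20% rule of thumb. Whether the 20% figure is meant per vCPU or for the whole VM is not stated here, so applying it per vCPU is also an assumption. The function names and sample numbers are hypothetical, not measurements from this thread.

```python
def ready_per_vcpu(group_rdy_percent: float, vcpus: int) -> float:
    """Average ready % per vCPU, assuming the VM-level value is a sum over vCPUs."""
    if vcpus <= 0:
        raise ValueError("vcpus must be positive")
    return group_rdy_percent / vcpus


def too_many_cores(group_rdy_percent: float, vcpus: int, threshold: float = 20.0) -> bool:
    """True when the per-vCPU ready % exceeds the threshold (20% rule of thumb)."""
    return ready_per_vcpu(group_rdy_percent, vcpus) > threshold


# Hypothetical readings for illustration only.
print(too_many_cores(group_rdy_percent=300.0, vcpus=12))  # 25% per vCPU
print(too_many_cores(group_rdy_percent=30.0, vcpus=4))    # 7.5% per vCPU
```

If the first case matches what you see, dropping the CHR to fewer cores and reading the value again would be the next thing to try.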

Best regards,
Julien

Also having this issue.

The VM has 2 vCPUs; the host has 40 and is not under heavy load.
We have multiple CHRs; only one is doing this.
Latest bugfix (6.44.6).
Has anyone got a solution? It happens under heavy throughput.