CHR + ESXi 6.7 U3 tx-drops with VLANs

Throwing this out here: has anyone else experienced tx-drops on VLAN interfaces in CHR running on ESXi?

The configuration is basically a vSwitch with a single 10G port from an Intel XL710 adapter as its uplink, MTU set to 1600 and promiscuous mode set to Accept. A PortGroup is then configured with VLAN ID 4095, so tagged frames for all VLANs are passed through to the guest. The PortGroup is added to the CHR instance as a VMXNET3 device.

In CHR the VLANs are defined as needed on the parent interface. RPS is disabled. Interface queues set to multi-queue-ethernet-default.
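
As an illustration of defining VLANs on the parent interface, here is a minimal sketch that generates the RouterOS commands for a list of VLAN IDs. The parent interface name `ether1`, the VLAN IDs, and the `vlan<ID>` naming scheme are assumptions for the example, not values from the setup described here.

```python
def vlan_commands(parent, vlan_ids):
    """Return one RouterOS '/interface vlan add' line per VLAN ID.

    `parent` is the name of the parent interface inside CHR
    (hypothetical here); the interface names follow an assumed
    'vlan<ID>' convention.
    """
    cmds = []
    for vid in vlan_ids:
        # 802.1Q allows IDs 1-4094; 0 and 4095 are reserved.
        if not 1 <= vid <= 4094:
            raise ValueError(f"invalid VLAN ID: {vid}")
        cmds.append(
            f"/interface vlan add name=vlan{vid} vlan-id={vid} interface={parent}"
        )
    return cmds


if __name__ == "__main__":
    # Hypothetical parent interface and VLAN IDs.
    for line in vlan_commands("ether1", [100, 200]):
        print(line)
```

The printed lines can be pasted into a RouterOS terminal. Because the VLANs live inside CHR, adding one this way does not require touching the VM's virtual hardware.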

In my test lab I am running two CHR instances on two different physical host systems, both on ESXi 6.7 U3. One box is a Vengeance 2 and the other is a Lanner 6210. CHR is on version 6.47.9. I also have an RB4011 with 10G connected to each CHR instance for running traffic-generator. Both boxes have Hyper-Threading disabled.

As soon as you push any load over about 2Gbps the VLAN interface starts to clock tx-drops. The parent interface does not. The higher the traffic load, the more drops and the more unstable the CHR instance becomes.

[Image: lanner-txdrop_1.PNG]
[Image: veng2-txdrop_1.PNG]

On the ESXi side, ‘esxtop’ doesn’t show excessive %DRPTX or %DRPRX values.
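
To put a number on the VLAN tx-drops, one option is to sample the interface counters twice under load and compare the deltas. The sketch below assumes the counters are named `tx-packet` and `tx-drop` and that `tx-packet` counts only frames that were actually sent. The sample values are made up for illustration and are not measurements from these hosts.

```python
def tx_drop_ratio(before, after):
    """Fraction of attempted transmits dropped between two counter samples.

    Each sample is a dict with 'tx-packet' and 'tx-drop' counters.
    Assumes 'tx-packet' excludes dropped frames, so the number of
    attempted transmits is sent + dropped.
    """
    sent = after["tx-packet"] - before["tx-packet"]
    dropped = after["tx-drop"] - before["tx-drop"]
    attempted = sent + dropped
    # Avoid dividing by zero when the interface was idle.
    return dropped / attempted if attempted > 0 else 0.0


if __name__ == "__main__":
    # Hypothetical samples taken a few seconds apart.
    t0 = {"tx-packet": 1_000_000, "tx-drop": 500}
    t1 = {"tx-packet": 1_900_000, "tx-drop": 100_500}
    print(f"tx drop ratio: {tx_drop_ratio(t0, t1):.2%}")
```

Running the same comparison on the parent interface and on the VLAN interface at several load levels would show whether the drop ratio really scales with throughput above roughly 2Gbps.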

If you use VST to define the VLAN (instead of defining it in CHR), the problem goes away. However, this is a poor workaround because it limits your ability to dynamically add VLANs without stopping the VM. On top of that, each VLAN you need exposes a new interface to CHR, and with it more IRQs.

I have SUP-37609 open with MikroTik about this, but so far no resolution.

edit 1: my imgur links didn’t work. Uploaded the images instead and placed them inline.

I also get TX drops on VLANs using Proxmox. My setup is an Intel 82599ES card with 2 x E5-2690 v2. I tried PCI passthrough first and noticed tx drops on the VLANs. I then moved the interface to the Proxmox bridge: the drops on the MikroTik VLANs were gone, but I started seeing drops on the Proxmox bridge itself. I run PPPoE on those VLANs. So far I haven’t been able to figure out the cause.

Consider testing with e1000e driver instead of VMXNET3.

Just be aware that e1000 is a legacy driver that should be used only for troubleshooting the root cause of issues with VMXNET3, which is the preferred driver for production use.