Mikrotik CHR P1 Licence - packet loss [FIXED]

Just wanted to warn and inform everyone (perhaps someone has a similar problem) about a really bad bug in Mikrotik CHR licensing.

We launched a new BGP CHR server instance (v6.37.1) in December and obtained a 60-day P-unlimited trial licence.
Everything worked like a dream during the trial period: no packet loss, no hangs, perfect throughput, and BGP session uptime of more than 30 days. All perfect.
Then we decided to buy a P1 licence, as our router interfaces are only 1 Gbit for now.

After installing the licence, some of our customers started to complain about VPN disconnections and hangs.
We found packet loss that was caused by the CHR instance. The actual loss was between 0.5% and 5%. For example:

--- 185.28.167.66 ping statistics ---
1000 packets transmitted, 987 received, 1% packet loss, time 11987ms
489 packets transmitted, 480 received, 1% packet loss, time 5968ms
1000 packets transmitted, 996 received, 0% packet loss, time 11987ms
1000 packets transmitted, 996 received, 0% packet loss, time 11987ms
1000 packets transmitted, 994 received, 0% packet loss, time 12948ms
1000 packets transmitted, 993 received, 0% packet loss, time 11996ms
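The percentage printed by ping in these summaries is a whole number and looks truncated (996 of 1000 received is reported as 0%), so it hides most of the loss. A minimal sketch, assuming the summary lines have exactly the format shown above, that recomputes the loss from the transmitted/received counters (the function names are only illustrative):

```python
import re

# Summary lines copied verbatim from the ping output above.
SUMMARY_LINES = [
    "1000 packets transmitted, 987 received, 1% packet loss, time 11987ms",
    "489 packets transmitted, 480 received, 1% packet loss, time 5968ms",
    "1000 packets transmitted, 996 received, 0% packet loss, time 11987ms",
    "1000 packets transmitted, 996 received, 0% packet loss, time 11987ms",
    "1000 packets transmitted, 994 received, 0% packet loss, time 12948ms",
    "1000 packets transmitted, 993 received, 0% packet loss, time 11996ms",
]

COUNTERS = re.compile(r"(\d+) packets transmitted, (\d+) received")


def parse_counters(line):
    """Return (sent, received) from one ping summary line."""
    match = COUNTERS.search(line)
    if match is None:
        raise ValueError(f"not a ping summary line: {line!r}")
    return int(match.group(1)), int(match.group(2))


def loss_percent(line):
    """Loss for a single run, as a float percentage."""
    sent, received = parse_counters(line)
    if sent == 0:
        raise ValueError("no packets transmitted")
    return 100.0 * (sent - received) / sent


def aggregate_loss_percent(lines):
    """Loss over all runs combined, weighted by packets sent."""
    totals = [parse_counters(line) for line in lines]
    sent = sum(s for s, _ in totals)
    received = sum(r for _, r in totals)
    return 100.0 * (sent - received) / sent


if __name__ == "__main__":
    for line in SUMMARY_LINES:
        print(f"{loss_percent(line):.2f}%  <-  {line}")
    print(f"combined: {aggregate_loss_percent(SUMMARY_LINES):.2f}%")
```

Over the six runs above this gives 43 lost packets out of 5489, i.e. roughly 0.8% combined, even though four of the six summaries print 0%.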

We tested the loss rate at night (when total router throughput was about 5-10 Mbit/s), and the loss was much smaller (around 0.1%).
The loss rate rose again once traffic exceeded 80 Mbit/s.
Our conclusion was that there is either some underlying queue on the CHR instance that is dropping packets, or a hardware problem.
We carefully examined the ESXi stats and the Mikrotik interface/VLAN stats: no drops, no errors. Nothing.

Then I created a new CHR instance, installed the same 6.37.1 version, obtained a new 60-day P-unlimited trial, and imported all settings from the P1-licensed server.
Voilà, everything is back to normal: no VPN drops, no loss!

--- 185.28.167.66 ping statistics ---
1000 packets transmitted, 1000 received, 0% packet loss, time 13303ms
1000 packets transmitted, 1000 received, 0% packet loss, time 13238ms
1000 packets transmitted, 1000 received, 0% packet loss, time 12970ms

Conclusions:

  1. There must be some underlying, hidden queue in the P1 (and most likely P10) licence that limits interface speed to the licensed level. It is most likely improperly configured and drops packets even when our interface rates are far below the 1 Gbit ceiling.
  2. Mikrotik engineers should NOT place any underlying queues at the bottom of CHR. To me this is unprofessional and should never happen. The interface limit should be enforced by a software/driver modification, so that the interface does not negotiate speeds above the licensed one: for example, by masking the advertised link speeds so that only speeds at or below 1000 Mbps full duplex can be negotiated, and applying an appropriate patch to the driver or to the interface between Winbox and the driver. This would be the best way, as it does not harm or influence the traffic passing through the router in any way.
  3. I have created a support ticket, but it has not been processed (and most likely will not be), because they ask for a supout.rif. Sorry, I can’t run the old buggy P1 instance on a production server.
  4. Apart from the packet loss on the P1-licensed instance, we didn’t have any other problems with VoIP or games (at least customers didn’t report any). However, the UDP VPN tunnels of Apollo Games machines seem to have suffered a lot (probably also a design fault in these machines: it looks like UDP over UDP, and perhaps they don’t use TCP at the bottom VPN layer). So you may be experiencing loss on your instance and not even know it.
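To illustrate how a rate-limiting queue could drop packets while the average rate stays far below its ceiling, here is a minimal token-bucket sketch. It is not Mikrotik's implementation, which is not described anywhere in this thread; the bucket sizes, burst pattern, and function name are assumptions chosen only to show the effect.

```python
def count_drops(rate_bps, bucket_bytes, arrivals):
    """Simulate a token-bucket policer and return the number of dropped packets.

    rate_bps     -- sustained limit in bits per second
    bucket_bytes -- maximum burst the bucket can absorb (hypothetical value)
    arrivals     -- list of (time_in_microseconds, packet_size_in_bytes),
                    sorted by time
    """
    tokens = bucket_bytes          # bucket starts full
    last_us = 0
    dropped = 0
    for now_us, size in arrivals:
        # Refill in whole bytes: rate_bps / 8 bytes per second.
        refill = rate_bps * (now_us - last_us) // 8_000_000
        tokens = min(bucket_bytes, tokens + refill)
        last_us = now_us
        if size <= tokens:
            tokens -= size         # packet passes
        else:
            dropped += 1           # packet is policed (dropped)
    return dropped


def bursty_traffic(bursts, packets_per_burst, gap_us, size=1500):
    """Back-to-back packets arriving at the same instant, once per gap."""
    return [(b * gap_us, size)
            for b in range(bursts)
            for _ in range(packets_per_burst)]


if __name__ == "__main__":
    # 20 x 1500-byte packets every 10 ms: about 24 Mbit/s on average.
    traffic = bursty_traffic(bursts=100, packets_per_burst=20, gap_us=10_000)
    limit = 1_000_000_000  # 1 Gbit/s ceiling, as with a P1 licence

    for bucket in (27_000, 60_000):  # assumed bucket sizes
        drops = count_drops(limit, bucket, traffic)
        print(f"bucket={bucket} bytes: {drops}/{len(traffic)} dropped")
```

In this toy setup the 27,000-byte bucket drops 2 packets out of every 20-packet burst although the average load is about 2.4% of the limit, while the 60,000-byte bucket drops nothing. Whether anything like this happened inside CHR is only a guess.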

“interface should not negotiate speeds above licenced ones”

Have fun with fiber/SFPs :)

We have found an issue in the queue which creates the license speed limitations. The next RouterOS build should have a fix. Thanks for reporting it.

Hi Normis

Any details on the issue and when the fix will be out? I’m wondering if that’s why I was seeing packet loss on my VDSL even when I shaped it to below the maximum (it’s 68/17).

I had assumed it was just a performance bottleneck of the SoC (N3150) or KVM (Proxmox).

This bug only affected the CHR speed limitation mechanism. It is not related to non-virtual (real) devices.

Yes, my home router happens to be a CHR running in Proxmox (KVM) on an N3150 Celeron with 2 cores allocated to the VM.

There’s also an instance of OPNsense doing IPS for the IoT (Internet of Things) VLAN (hardly any traffic, usually kilobits) and a Linux VM running RADIUS/Samba 4 as a DC (again mostly idle).

It might explain why I was dropping packets even though the link wasn’t fully saturated and the CPU on the host and VM wasn’t maxed out.
Of course, crappy Realtek NICs could also explain that.

Dear mods,
the thread is marked as [FIXED],
however at the moment I do not see any new version released with the problem fixed.
I still need to use the P-unlimited trial licence.

Sorry about that, was a bit hasty. This will be included in the RC release later today:

*) chr - fixed problem when transmit speed was reduced by interface queues;

Has anyone been able to confirm this is fixed after installing the RC? I haven’t had a chance to test in my lab yet.

I don’t dare to test it until it is confirmed and released as stable.
Every day we find new nasty bugs. Moving to each new version is too much of a pain.

You can spin up any number of CHR trial instances; this is the beauty of CHR. It needs no license and it is virtual.

Hello,

Short and practical, from software development:

An issue is ONLY considered fixed once it has been proven fixed (at least by the originator of the issue!).

So as long as there’s no proof of a fix, the issue needs to be kept open.

That’s common practice not only in software development but also in any other kind of issue tracking, and it is ITIL best practice too.

So please remove the [FIXED] tag, thanks

greets

Hello,
Normis, can I assume that “Ticket #2016062166000118” is fixed too?

Thanks.

We already fixed it and tested it. The poster in this thread asked if any OTHER user had also tested it. Sure, you can wait until 5 or 10 more people have also tested it, but where is the line?

Hi,

I don’t know if this is related, but I am seeing really bad performance (especially on GRE tunnels which then show some TX drops, but also on direct ethernet routing) with the CHR on ESXi when I give 2 vCores or more to the VM.

With only 1 vCore however, everything works correctly, with good rates (except that the BGP process takes processing speed away that might prevent traffic from flowing correctly).

I’ve seen this on multiple platforms, with the same symptoms every time, so I’m asking here before starting a new thread.

You are right, this is a big advantage of CHR. However, I meant, for example, the x86 version.

For example, we found a simple queues bug in the last few days. On v5, the router keeps rebooting: sometimes every hour, sometimes every 4 hours, sometimes once a day.
I caught a kernel panic right before a reboot, but I didn’t have enough time to dump the stack trace (at the bottom there was Routing rebooting in 1 seconds…).
I removed all simple queues and there were no more kernel panics (btw, the logs contained only a message about an improper shutdown). I decided not to provoke a reboot again to dump the stack trace, so as not to annoy our customers, and Mikrotik probably will not solve this issue anyway, because AFAIK the v5 development line is closed.

It’s a pity, because before that we had a very good experience with v5. We had something like 200 days of uptime without a single problem. The problems started after we began using simple queues.
Maybe MT should think about releasing a bugfix branch for v5?

Now we would upgrade our router to v6.x, however I saw that v6.x has a problem with Intel 82571EB support (packet loss). I will not risk upgrading in this case, and this is not a CHR where we can easily revert.
We will try queue trees and hope they will not keep causing router reboots.

Sorry, but this is not related to this topic. The topic is specifically about CHR and trial speed limits. You should make a separate topic about the x86 issues, or better, email support directly.

Router migrated to 6.37.5 and P1 licence applied.
It seems the packet loss problem has been fixed.
Thx

We are using a CHR P10 license with ROS v6.49.8 (latest stable as of today, 2023-07-31). With 400-500 Mbit/s of traffic we see 4-5% loss, and maximum traffic never exceeds 1 Gbit/s.

With a trial unlimited license all problems and traffic loss are gone, and we easily go past 2 Gbit/s of up/down traffic, with a few percent lower CPU usage as well.

So the problem persists, was not fixed, or was re-introduced… I hope this will help others with packet loss on CHR too.

P.S.: This created a lot of problems and cost a lot of debugging time until I found this thread, so thank you @karwos for finding the workaround.
We are using a CHR instance on a Dell R730 (was an R620 until a few months ago) with 10G interfaces under XCP-ng (works better and with lower load than VMware).

@karwos or @normis
Please edit the topic and remove “[FIXED]”: it is not fixed. The problem is reproducible even on a P10 trial license, and a new trial unlimited license solves the packet loss problem on version 6.49.8.
We can’t use v7 yet as it is lacking some features…