RB3011 - low routing performance with low CPU usage?

millenium7 · Fri Apr 03, 2020 6:30 am

Ok so riddle me this
We've been encountering speed issues since starting to migrate off PPPoE client connections inside VPLS tunnels to a PPPoE concentrator, and onto DHCP instead
Why are we doing this?
- Faster recovery if a link goes down
- Faster failover
- Multi path selection
- Simpler topology without the need for tunnels
- Policy based routing anywhere in the network
- QoS tags can be read by all radios in our network (not all of them can read the tags when the traffic is inside a PPPoE tunnel)
- VPLS bugs have been driving me insane, tunnels not connecting until a reboot, random addresses not being added as MPLS tags etc
- Customer can install any router out-of-the-box and it 'just works', no need for usernames/passwords
- Customer can 'accidentally' reset their router as much as they want. Less support calls to us
- Lower traffic overhead

The big hurdle was getting /32 public addresses assigned to customers via DHCP, and getting routing to them working correctly. That has been solved and everything is looking good, except speed tests have been way, way lower than they should be, especially in the upload direction. I've been scratching my head for days, and I found it's the RB3011s we're using. I'm not sure if this also applies to other models, but CCRs have no problem

So here's an example network diagram

CustomerA only has CCR routers in his path to the internet. I can go to speedtest.net on a computer and get the full speed (300/300mbit/s) no worries. I can also log into his router and run a speed test to RouterC/B/A, which also gets 300mbit/s no problem
If I run a speed test from RouterC to RouterB - 1gbit/s
Speed test from RouterC to RouterA - 1gbit/s. Everything fine

However CustomerB gets maybe 200mbit/s download and 40mbit/s upload. Here's where it gets weird
If I run a MikroTik speed test from CustomerB's router to RouterE, I get ~1gbit/s, this is fine
CustomerB to RouterD/RouterA - 200/40mbit/s
RouterE to RouterD - 1gbit/s
RouterE to RouterA - ~200/40mbit/s (wtf????)
RouterD to RouterA - 1gbit/s

The numbers are not exact but they're relatively correct. The weird thing is when passing packets 'through' RouterE or through any more than 1 hop/direct connection from RouterE, it's really slow. But if directly connected, it's fine
CPU usage should not be the problem, because it can do 1gbit/s when running a speed test between it and directly connected routers, which is way more CPU load than passing packets 'through' the router. Yet the speeds through it are totally screwed up. Using UDP or more TCP connections helps, but it still only reaches ~85mbit/s in the upload direction

However if I move CustomerB off DHCP and back onto PPPoE (carried via VPLS to RouterA) RouterE is happy to pass traffic through it very quickly, and he can get the full ~300/300mbit/s on speed tests no worries

------------------------------------------

Network topology is OSPF + MPLS everywhere, and there are VPLS tunnels from every router to RouterA for all existing PPPoE clients
If I export the config on RouterE and replace it with a CCR router, full speeds no problem. But at no point does RouterE ever reach 100% or even close to 100% CPU usage on any core so it shouldn't be a bottleneck

Is there something very different with the RB3011 hardware or packet processing that is causing the issue here? It's like the routing engine performance is just utter crap, but everything else seems to be ok
At the moment it looks like we might have to replace all 3011s in our network with CCRs, which would get quite expensive. I want to know if there's something we can do
Even a HEX is capable of passing traffic through way faster, but it doesn't have enough ports for us to use

Zacharias · Fri Apr 03, 2020 12:49 pm

If your topology is OSPF, is that the actual diagram? Do we see one area or the whole OSPF network?
Did you check CPU usages?

millenium7 · Wed Apr 08, 2020 3:22 am

CPU usage is very low ~5-15%. It's not even close to maxing out 1 core, yet the actual results of passing packets through it look very much like a lack of processing power
So either CPU usage is reported completely incorrectly for routed traffic, or there's something else going on

Note that if I use a PPPoE session from CustomerB to RouterA over a VPLS tunnel, it works perfectly fine. Something funky is going on, because it can pass traffic through it fast, but not when it's IP traffic. When it's encapsulated in PPPoE and just shoved through, it seems to be fine with it
But when it comes to IP traffic, basically anything where RouterE doesn't have to look up a destination any further than a directly connected peer, is fine and totally fast. If it has anything further than directly connected, it's slow as balls

Zacharias · Wed Apr 08, 2020 12:39 pm

I am not sure if changing the network type from broadcast to point to point would make a difference...

millenium7 · Wed Apr 08, 2020 1:13 pm

It wouldn't. But everything is already set to point to point

Kindis · Wed Apr 08, 2020 2:39 pm

This looks really strange. Even though I do not fully get what you have done, it should not behave like this, and as you say the CPU does not seem to be the bottleneck.
Most CCR routers do not have a switch chip in them. I wonder if your port layout or something else is causing this issue.
Having a look at the block diagrams of a 3011 and a CCR1009-7G-1C-1S+PC, you can see that all ports on the CCR are attached to the CPU, and that is not the case on the 3011.

Regardless, posting your config might give more clues as to what is wrong, or whether this is a feature.
Also, have you contacted support@mikrotik.com?

millenium7 · Thu Apr 09, 2020 10:34 am

Ok very interesting

I set up a lab with a CCR as the core,
and a HEX/RB2011/RB3011 as routers connected to it through a gigabit switch

Then another CCR behind it that I used as a customer to simulate this

In my initial testing I got the expected behaviour, which is slower tests 'from' the router and faster traffic 'through' the router
Then I made the 'customer CCR' part of the OSPF network and got the same throughput
However, once I enabled MPLS things got really weird

The HEX went from ~860mbit/s of throughput to ~900mbit/s with MPLS on, a small but noticeable gain
The RB2011 gained massively, from ~350mbit/s to a whopping 680mbit/s
The RB3011 lost a huge amount of performance, from ~1300mbit/s down to 850mbit/s, and the CPU usage is not reaching saturation on any core
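
To put those lab numbers side by side, here is a quick sketch that works out the relative change with MPLS enabled. It uses only the approximate figures quoted above (read as Mbit/s); the `results` table and the `percent_change` helper are names made up for illustration, not anything from RouterOS.

```python
# Approximate lab throughput in Mbit/s, as quoted in this post:
# (without MPLS, with MPLS). These are rough figures, not precise measurements.
results = {
    "HEX": (860, 900),
    "RB2011": (350, 680),
    "RB3011": (1300, 850),
}


def percent_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# Relative change per model when MPLS is switched on.
changes = {model: percent_change(off, on) for model, (off, on) in results.items()}

for model, change in changes.items():
    off, on = results[model]
    print(f"{model}: {off} -> {on} Mbit/s ({change:+.1f}%)")
```

That works out to roughly +5% for the HEX, +94% for the RB2011 and -35% for the RB3011, so the 3011 is the only one of the three that gets slower with MPLS on.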

So it seems MPLS is a bit broken on 3011s. Unfortunately I can't just immediately turn it off throughout our production network, because we rely on VPLS tunnels at the moment

Kindis · Fri Apr 10, 2020 11:28 am

This sounds like a case for support@mikrotik.com, with a supout file generated while you have the issue.
