Ok so riddle me this
We've been running into speed issues since we started migrating customers off PPPoE (client connections carried inside VPLS tunnels to a PPPoE concentrator) and onto DHCP instead
Why are we doing this?
- Faster recovery if a link goes down
- Faster failover
- Multi path selection
- Simpler topology without the need for tunnels
- Policy based routing anywhere in the network
- QoS tags can be read by all radios in our network (not all of them can read the tags when the traffic is inside a PPPoE tunnel)
- VPLS bugs have been driving me insane: tunnels not connecting until a reboot, random addresses not being added as MPLS tags, etc
- Customer can install any router out-of-the-box and it 'just works', no need for usernames/passwords
- Customer can 'accidentally' reset their router as much as they want, which means fewer support calls to us
- Lower traffic overhead
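
To put a rough number on the overhead point, here is a minimal Python sketch of the extra bytes per packet for PPPoE carried over VPLS compared with plain routed IP. The header sizes are the standard ones, but the label count and whether the pseudowire control word is enabled are assumptions that depend on the config (the transport label may be popped on the last hop), and routed traffic on an MPLS network may carry a label of its own, which is not counted here. Nothing in it was measured on our network.

```python
# Standard per-packet header sizes in bytes.
PPPOE_HEADER = 6      # PPPoE session header
PPP_PROTOCOL = 2      # PPP protocol field
MPLS_LABEL = 4        # one MPLS label stack entry
INNER_ETHERNET = 14   # customer Ethernet header carried inside the VPLS tunnel
CONTROL_WORD = 4      # optional pseudowire control word


def extra_bytes(labels: int = 2, control_word: bool = True) -> int:
    """Bytes added per packet by PPPoE-over-VPLS versus plain routed IP.

    labels and control_word are assumptions about the tunnel config.
    """
    total = PPPOE_HEADER + PPP_PROTOCOL + INNER_ETHERNET + labels * MPLS_LABEL
    if control_word:
        total += CONTROL_WORD
    return total


def overhead_percent(ip_packet_size: int, labels: int = 2,
                     control_word: bool = True) -> float:
    """Extra bytes as a percentage of the IP packet size."""
    return 100.0 * extra_bytes(labels, control_word) / ip_packet_size


if __name__ == "__main__":
    for size in (64, 576, 1500):
        print(f"{size:>5} byte packet: +{extra_bytes()} bytes "
              f"({overhead_percent(size):.1f}%)")
```

The saving is small on full-size packets and much larger in relative terms on small ones, so this is only a minor item on the list.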
The big hurdle was getting /32 public addresses assigned to customers via DHCP and getting the routing to them working correctly. This has been solved and everything is looking good, except that speed tests have been way, way lower than they should be, especially in the upload direction. I've been scratching my head for days, and I found it's the RB3011s we're using. I'm not sure if this also applies to other models, but CCRs have no problem
So here's an example network diagram
CustomerA only has CCR routers in his path to the internet. I can go to speedtest.net on a computer and get the full speed (300/300mbit/s) no worries. I can also log into his router and run a speed test to RouterC/B/A, which also gets 300mbit/s no problem
If I run a speed test from RouterC to RouterB - 1gbit/s
Speed test from RouterC to RouterA - 1gbit/s. Everything fine
However, CustomerB gets maybe 200mbit/s download and 40mbit/s upload. Here's where it gets weird
If I run a MikroTik speed test from CustomerB's router to RouterE, I get ~1gbit/s, which is fine
CustomerB to RouterD/RouterA - 200/40mbit/s
RouterE to RouterD - 1gbit/s
RouterE to RouterA - ~200/40mbit/s (wtf????)
RouterD to RouterA - 1gbit/s
The numbers are not exact but they're roughly right. The weird thing is that traffic passing 'through' RouterE, or going from RouterE to anything more than 1 hop away, is really slow. But if the test is between directly connected routers, it's fine
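
To make the pattern easier to see, here is a small Python sketch that tabulates the RouterE-side tests and flags the slow ones. The figures are the approximate ones quoted above. Treating the ~1gbit/s results as symmetric and marking each test as direct or not are my own reading of the results, not something the test tool reports, and the 300mbit/s threshold is just the customer's plan speed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpeedResult:
    src: str
    dst: str
    down_mbps: int
    up_mbps: int
    direct: bool  # assumption: True if src and dst are directly connected


# Approximate figures from the tests above (1gbit/s assumed symmetric).
RESULTS = [
    SpeedResult("CustomerB", "RouterE", 1000, 1000, True),
    SpeedResult("CustomerB", "RouterD", 200, 40, False),
    SpeedResult("CustomerB", "RouterA", 200, 40, False),
    SpeedResult("RouterE", "RouterD", 1000, 1000, True),
    SpeedResult("RouterE", "RouterA", 200, 40, False),
]


def slow_results(results, expected_mbps=300):
    """Return the tests where either direction is below the expected speed."""
    return [r for r in results
            if min(r.down_mbps, r.up_mbps) < expected_mbps]


if __name__ == "__main__":
    for r in slow_results(RESULTS):
        kind = "direct" if r.direct else "not direct"
        print(f"{r.src} -> {r.dst}: {r.down_mbps}/{r.up_mbps} mbit/s ({kind})")
```

Every slow result is one where the two ends are not directly connected, and every direct test runs at full speed.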
CPU usage should not be the problem, because RouterE can do 1gbit/s when running a speed test to/from directly connected routers, which is way more CPU load than passing packets 'through' the router. Yet the speeds are totally screwed up. Using UDP or more TCP connections helps, but it still only reaches ~85mbit/s in the upload direction
However, if I move CustomerB off DHCP and back onto PPPoE (carried via VPLS to RouterA), RouterE is happy to pass traffic through very quickly, and he can get the full ~300/300mbit/s on speed tests no worries
------------------------------------------
Network topology is OSPF + MPLS everywhere, and there are VPLS tunnels from every router to RouterA for all existing PPPoE clients
If I export the config from RouterE and load it onto a CCR put in its place, we get full speeds no problem. But at no point does RouterE ever reach 100%, or even close to 100%, CPU usage on any core, so it shouldn't be a bottleneck
Is there something very different with the RB3011 hardware or packet processing that is causing the issue here? It's like the routing engine performance is just utter crap, but everything else seems to be ok
At the moment it looks like we might have to replace all the RB3011s in our network with CCRs, which would get quite expensive. I want to know if there's something we can do
Even a hEX is capable of passing traffic through way faster, but it doesn't have enough ports for us to use