Low performance on RB5009 with machine behind NAT

I just got an RB5009 and I’m trying to figure out how to get decent performance when connecting to many different hosts/ports. I have a server behind NAT that needs to check for open ports on a large subnet, but I’m only reaching ~150K pps / ~80 Mbps.

Using /tool/profile shows firewall at roughly 75% and networking at roughly 15% CPU with 300K pps coming in from the server.

The interface the server is connected to (which is part of the bridge) reports: RX 300K pps, FP RX 150K pps
The bridge: RX 150K pps, FP RX 150K pps
The WAN interface: TX 150K pps

Why are only 50% of the packets showing up as FP and on the bridge? Is there any way to make it faster?

Why isn’t it bridged to the LAN it needs to examine instead?


Why are only 50% of the packets showing up as FP

Because that decision can’t be made until after the first SYN is seen, which is when the default firewall applies the fasttrack-connection flag.

This is one of many costs of running traffic through a firewall on the CPU instead of bridging it directly.
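For reference, the relevant defconf rules look roughly like this (paraphrased from a stock v7 default configuration; exact comments and flags vary by version):

/ip firewall filter
add chain=forward action=fasttrack-connection connection-state=established,related comment="defconf: fasttrack"
add chain=forward action=accept connection-state=established,related,untracked comment="defconf: accept established,related,untracked"

The first SYN of every probe has to walk the whole filter (and NAT) chain on the CPU; only packets belonging to a connection already flagged by that first rule can take the fast path.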

The interface the server is connected to is part of the default LAN bridge, if that’s what you mean. I’m using defconf and have only enabled IPv6, added Hairpin NAT and some port forwards. If what you’re asking is why I don’t scan from within the same network then that’s because I have firewalls upstream, so that would not give me the correct results.
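To be concrete, the additions on top of defconf are roughly of this shape (addresses and ports here are made up for illustration, not my real config):

/ip firewall nat
add chain=dstnat in-interface-list=WAN protocol=tcp dst-port=443 action=dst-nat to-addresses=192.168.88.10 comment="port forward"
add chain=srcnat src-address=192.168.88.0/24 dst-address=192.168.88.10 protocol=tcp dst-port=443 action=masquerade comment="hairpin NAT"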


Yeah, that makes sense, so only the SYN-ACK packet uses fast path. Thank you for the explanation. :slight_smile:


What do you mean by bridging it directly? I’d love it if I could somehow speed up forwarding of these packets.

So you’re measuring the speed of the firewalls, not the speed of the network.

Take a look at the RB5009 test results. Your application corresponds to the lower-rightmost number in the first table: tiny packet sizes, so almost nothing gets fast-tracked, and you have as close to 100% packet overhead as possible.

Now yes, that same table claims the RB5009 can do better than this, but that’s an aggregate multi-port test. Atop that, do you expect that this effect doesn’t apply to these other firewalls?

In debugging, you simplify the problem to the point that you have one variable at a time, even if that makes the result “inaccurate” by some standard. If the RB5009 bridged to the test LAN is fast, the problem isn’t the RB5009. If you back it off one level and now it’s slow, you have the culprit.

Bisect and conquer!

Not even that. The tests use normal long-lived connections, so even the tests with tiny packets can benefit from fast-tracking.

OP is doing port scanning, which means that roughly every third packet starts a new connection. Not only does this skip fast-tracking, the connection-tracking machinery also has much more to do: deciding that a packet opens a new connection is far more work than matching an existing one, and a new tracking structure has to be allocated each time. On top of that it may have to drop old structures left over from ports scanned a few tens of seconds ago, because the connection-tracking table grows huge and consumes too much of the router’s limited resources.

So yes, it’s no wonder things are not going wirespeed.
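If you want to at least keep that table pressure in check, one thing you could try (just a sketch, it won’t lift the pps ceiling) is shortening the conntrack timeouts so half-open scan entries expire quickly, and keeping an eye on how big the table actually gets:

/ip firewall connection tracking set tcp-syn-sent-timeout=5s udp-timeout=10s
/ip firewall connection print count-only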

Considering the CPU is pinned and the RB5009 becomes effectively unresponsive, the problem is clearly the routing performance of the RB5009, and that is why I created this post: so somebody who knows MikroTik can hopefully explain why the performance is so bad.


From what I can see, the tests are not using long-lived connections at all. The test seems to be using UDP. I guess that makes things quite a bit easier to handle. However, I don’t need connection tracking for this traffic. The test results state 761K pps at 64-byte packets with 25 IP filter rules, without fast path, and that’s around twice what I’m seeing, even though fast path is used for half of my packets.

Is it possible to disable connection tracking for the scanner while still swapping the LAN IP with the WAN IP? If I use Raw in the firewall to set “no track” on this traffic, the NAT rules no longer seem to be applied, but I need netmap to swap the IPs back and forth.

Is there some other way to swap the addresses that works without connection tracking?
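For clarity, the no-track attempt looked something like this (the scanner address is just an example):

/ip firewall raw
add chain=prerouting src-address=192.168.88.10 action=notrack comment="scanner traffic"

With that in place the packets bypass connection tracking entirely, and the netmap/src-nat rules in /ip firewall nat simply never fire for them.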

You already got that: to a lesser extent from me, and then from mkx, who’s about as knowledgeable as it gets around here.


The test seems to be using UDP. I guess that makes things quite a bit easier to handle.

No, worse. mkx’s point about long-lived connections is that the longer a TCP connection lasts, the closer to zero the amortized overhead of setting up fast-tracking gets. With UDP and no predictable flows following, the router becomes 100% CPU-bound.


The test results states 761k pps

Yes, and as I pointed out, that’s a multi-port aggregate test, not a single-stream single-port test. mkx’s point builds atop that.

First mistake, not using IPv4 :slight_smile: ( Dark Nate is going to crucify me )

Nope, NAT relies on connection tracking. So no connection tracking, no NAT. At least in ROS.

What you’re saying makes no sense. It’s not like each interface is dedicated to its own single CPU core, so using more ports won’t make the CPU process the packets any faster.


Well, that sucks, guess this was a bad choice then. It’s kinda weird that it’s not possible to swap the address of packets without using connection tracking. I guess I can find some other use for it.

You’re presuming an implementation. I thought you came here to ask how RouterOS works, not tell us.

We forum denizens are fellow end users for the most part, not RouterOS software engineering insiders, but one thing I can confidently predict from past experience as an outsider is that the more independent packet flows, the greater freedom you give the routing engine to divide the work among the cores. We see that in independent testing again and again.

Easy disconfirming test for you to go and prove me soundly wrong: try running this scanner of yours from a second host connected to a second port, in parallel with the first. Try it both against the same target and an independent target. Does the aggregate PPS rate go up, down, or stay the same?

I do not expect to be alone in my interest to read your results.


It’s kinda weird that it’s not possible to swap the address of packets without using connection tracking.

Even with UDP, changing a packet’s destination address requires changing its checksum field, which requires CPU resources unless you’re on a CRS3xx class device and can make use of IPv4 NAT offloading, neither of which is true in your case.

(Replacing your RB5009 with a CRS309 fixes only one out of the three obstacles you’ve set for yourself, the others being IPv6 and the inability to get flows into the fast-track path.)

What might work for you is to put an Ethernet switch rule ahead of the NAT rule, matching your scanner’s packets and preempting the routing layer with a “new-dst-ports” directive.

Note that “ports” in this instance refers to one or more physical device ports to copy the packet to, not to rewriting the UDP destination port, contravening my earlier point.
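For what it’s worth, on hardware whose switch chip exposes ACL rules, the kind of rule being described would look roughly like this (the port names and scanner address are placeholders, and I can’t promise the RB5009’s switch chip accepts it):

/interface ethernet switch rule
add switch=switch1 ports=ether2 mac-protocol=ip src-address=192.168.88.10/32 new-dst-ports=ether1

Note that this only forwards the frames in hardware to another physical port; it does not rewrite any addresses, so it sidesteps the NAT step rather than replacing it.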

Packet processing (e.g. NAT) adds some latency to the end-to-end packet flow, so if you do the port scanning in sequence, that end-to-end delay will severely reduce the pace of scanning.

Due to problems that truly parallel processing of packets could introduce (e.g. out-of-order delivery), in ROS the same CPU core processes all packets belonging to the same connection. With a multi-core CPU you may therefore see a single core get hit the most while the other cores stay almost idle. The core being “hammered” will likely change over time, but the pattern remains. If you ran the port scanning in parallel (e.g. sending probes to multiple remote ports at once), the traffic would be handled by multiple CPU cores in parallel, somewhat increasing overall throughput (interrupt handling would then become the bottleneck; I’m talking about one CPU interrupt per packet received, and with small packets there are many interrupts at a relatively low bps).
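An easy way to see that effect is to watch the per-core load while the scan runs; if the above is right, you’ll see one core saturated while the rest stay mostly idle:

/system resource cpu print
/tool profile cpu=all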

Even high-end firewall devices can be overwhelmed under attack, without mitigation, once cores are loaded and buffers are full. The 5009 is a small wonder of a router, but it’s still a low-power ARM device. Basically, if it can’t be done by the ASIC, it has to be done on the CPU. If you know how to run software another way, I’d like to learn.

The knowledgeable users have already given you sound answers…

What’s up with this toxicity? I’m not presuming, I checked, and it’s true. I am here to ask, but when you say the reason for the problem is something that is not true and doesn’t make any sense then it must be possible to say so.

That’s not the intent. I’m reacting to a combination of things. You currently have a post count of five, yet you are insisting that you know how RouterOS works internally. I believe my years of experience count for something here, but at the same time, I’ve taken a properly scientifically skeptical position above. I’m willing to be swayed.

To that end, I asked you for a test result. What happened when you doubled the number of scanning hosts?

Argument from incredulity is not science.

OP is yet another victim of the configuration abstraction complexity of MikroTik, again.

Root cause can’t be determined without config dump, but this is screaming typical Linux bridge misconfiguration. But OP is clearly an expert in switchdev/Linux DSA paradigm, so I’ll leave it here.

One problem with the RB5009 you need to be aware of is that it has 4 cores and variable clock speed.
It will normally run at 350 MHz but it can kick up to 1400 MHz when the OS decides that this is required.

Unfortunately the mechanisms used to do this speed governing seem to be not optimal for routers, and certainly not for the test being done:

  1. the governor seems to work with total system load, so when a task loads only one core, it sees a system load of 25% at most and is reluctant to increase the clock
  2. the switching up/down is quite rapid; with a bursty load it does not seem to remain at the high frequency for the duration of the test

One way to circumvent this is to just set the CPU speed to 1400 MHz instead of “auto”. In theory it will run hotter; in practice there does not seem to be nearly as much of a difference as there is with Intel processors.
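On boards that expose the setting, pinning the clock is a one-liner (the exact values accepted depend on the model and RouterOS version):

/system routerboard settings set cpu-frequency=1400MHz

Setting cpu-frequency=auto afterwards restores the default governor behaviour.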

The way I summarize that thread’s application to this one is that there is some RouterOS configuration change that would somehow cause the OP’s application to proceed much faster, and the only reason it isn’t being done is that there are too many possible ways to do it, and the OP has hit on the wrong one. Have I misapplied your other thread’s thesis here, @DarkNate?

I’ll grant that I may be overlooking something due to not having studied a config /export, but I don’t see any possible configuration change you could in principle make to overcome the combination of software NAT (RB5009 can’t HW offload NAT), tiny packet sizes, a single packet source, and no predictable connection flows, thus no fast-tracking.

The OP scarcely could’ve invented a worse torture test for a NAT router, with intent and malice aforethought.

That linked thread of mine isn’t a thesis; it’s about the UI/UX design flaws of RouterOS, and that includes the complex Linux bridging concept itself. It is complex at its roots: the Linux bridge doesn’t have good UI/UX design nor good documentation in the Linux man pages to begin with, and this trickles down to MikroTik as well, since RouterOS relies on the original Linux kernel “data plane” for layer 2 switching (bridge, VLANs, STP, VLAN filtering, etc.).

It’s not just about L3HW offloading of this or that; it also involves the intricate and complex layer 2 offloading AND the layer 2 fast path (single bridge + VLAN filtering on most MikroTik hardware).

There’s no one-size-fits-all for MikroTik, but that’s also the problem with Linux switchdev/DSA/bridging itself.

On Juniper, Nokia, Cisco or Huawei, this issue doesn’t exist because the network software programmers at those companies built their own platform-specific layer 2 implementations that never had this complexity in the first place. That is why on Cisco/Juniper and the like, the concept of bridging (called IRB) is more or less unified and configured similarly on all of their modern hardware.

MikroTik cannot fix this design flaw without a complete overhaul of their source code, which obviously isn’t going to happen; maybe it will happen in ROSv8. But it would also mean a complete redesign of their CLI into a modern declarative config (Juniper style) and a modernised API/REST API, which is not cheap either, and I doubt MikroTik has the financial resources to hire the kind of expert network programmers who in the US market command something like $500k a year in base comp (I have friends in this segment of the industry, which is how I know what they get paid). MikroTik is a small European company, and pretty much no European tech company can afford to pay a lead network programmer $500k or more.

I’m using that word in the “proposition stated as the basis of an argument to be proven” sense, not the “doctoral dissertation” sense.

I do assume you are interested in reasoned argumentation over mere argumentativeness, yes?


Linux bridge doesn’t have good UI/UX design nor good docs on the Linux man pages to begin with, and this trickled down to impact MikroTik as well

If we’re agreeing that nothing the OP can do with the stated configuration will get the packets off the CPU, then I don’t see how MT can fix this thread’s symptom with a better software bridge design. The hardware’s PPS rate limitations are fixed at design time, modulo details like the clock rate setting pe1chl brought up.

If instead you’re suggesting that a better software design would somehow offload a configuration like this to the RB5009’s preexisting switch chip and allow it to proceed at line rate, I think you’re overlooking the heterogeneous nature of MT hardware. Unlike the big guys you revere, MT doesn’t get to design custom ICs that support their idealized software designs. The only way to prevent the plumbing from poking up through the porcelain in places when you use this many different COTS chip designs is to nerf all designs to a least-common-denominator level. Under that type of design, the only HW features exposed are those present in all chips used.

MT took the opposite path: expose all chip features, requiring the user to know what those are and avoid designs that require RouterOS to activate one of the abstractions you rail against, in order to emulate a missing ASIC feature in software.


MikroTik cannot fix this design flaw without a complete overhaul of their source code

If you add “custom ASICs” to support that software, then yes, I agree that would result in a cleaner implementation…

…but you then wouldn’t have a $59 hEX on offer.