All is okay, but when enabling DHCP-Option 82 the performance drops immediately.
But the device does not have to forward DHCP packets arriving on sfp1 to the CPU; only packets that arrive on ether1 have to be redirected to the CPU by the switch chip.
So activating DHCP-Option 82 should not have any impact on the bridging performance.
MikroTik staff don’t regularly read this user forum, so feature requests should be sent directly to them. Missing hardware-offload features are unlikely to be backported to older switch chips such as the various Atheros/Qualcomm ones. The embedded switch in the MT7621A used by the hEX S may gain some, as it is basically the same as the one in the EN7562CT used by the new hEX refresh.
You can see the list of products that contain each switch chip model, and which particular interfaces/ports are attached to those switch chips, right below the switch chip feature matrix.
Also, please note that for the devices with those switch chips (RB5009, CCR2004-16G-2S+, L009), although turning on DHCP Snooping keeps hardware offload on the bridge, it makes fast path non-functional on the bridge, and it affects fasttrack too. If you have a WAN port outside of the bridge, then fasttrack would still work for connections between the WAN port and the interfaces on the bridge (including VLANs), but both fast path and fasttrack will be ineffective for any traffic between members of the bridge (such as inter-VLAN traffic). I have first-hand experience with the issue.
Okay, but why is it okay with v6.69.6?
Activating DHCP Snooping on v6.69.6 does not decrease the speed to roughly 300 Mbit/s.
The 849,70 MBit/s shown in that screenshot is 100% of the ISP speed.
Do note that it is still true in this screenshot that bridge fast path is not working…fast path counters in Bridge > Settings are all at 0. Also, no ports in the bridge are showing hardware-offload either. Finally, one of the four CPU virtual “cores” (threads) is nearly maxed out during the transfer.
It is interesting, though, that you have that much of a difference in forwarding performance between RouterOS 6 and 7 on the hEX S when using a software bridge with no hardware offload and no fast path.
Are you using the exact same config when comparing ROS 6 and 7 on the same device? If you take your hEX S running 6.49, upgrade it to 7 and do not change the config at all, does it drop down to 300 megs?
Re-reading the OP: he/she has a setup where sfp1 is part of the bridge, and the tests are done between sfp1 and the other ethernet ports. This means that even with DHCP Snooping disabled, those tests were never hardware offloaded, because sfp1 is outside of the switch chip and is always handled by the CPU: https://cdn.mikrotik.com/web-assets/product_files/RB760iGS-esw3_190600.png
I don’t have the RB760iGS, only an RB750Gr3, which means I cannot recreate the same test condition (with one port outside of the switch chip), but the following results with the MT7621A in the hEX under RouterOS 7.18.2 somewhat match OP’s experience. The firewall configuration used is the one from defconf.
With hardware offload fully available on all bridge ports, with or without Bridge VLAN Filtering, switching throughput between two ports of the bridge (both connected to the switch chip) easily reaches wirespeed (950Mbps iperf3 throughput, 1Gbps on the ethernet port). CPU usage is negligible. Bridge Fast Path and Fast Forward counters do not increase.
With hardware offload available and Bridge VLAN Filtering enabled, inter-VLAN routing throughput with fasttrack active reaches 930Mbps iperf3 throughput (ethernet port @980Mbps). Two CPU cores are fully loaded, and both Bridge Fast Path and Fasttrack counters increase accordingly. These are the same results as with fasttrack’ed LAN-WAN traffic.
If I disable hardware offload by turning on IGMP Snooping, inter-VLAN routing throughput with fasttrack active reaches 944Mbps iperf3 average throughput. Two CPU cores are fully loaded, and both Bridge Fast Path and Fasttrack counters increase accordingly. This result is reproducibly better than the same test with hardware offload enabled! My guess is that on the RB750Gr3, not using the switch chip gives two 1Gbps links from the CPU to the ports (my tests are between ether2 and ether3), while with hardware offload the switch chip only has one 1Gbps link to the CPU, which must be shared when routing inter-VLAN.
With hardware offload still disabled by turning on IGMP Snooping, switching (within same L2) throughput averaged 940Mbps with iperf3, slightly slower than inter-VLAN routing! One CPU core is fully loaded and another one only at 50%. Fasttrack counter understandably does not increase, Bridge Fast Path counter increases as expected. If only two ports are connected, then the Bridge Fast Forward counter increases instead.
=> Conclusion until this point: the RB750Gr3 and probably the RB760iGS with the same CPU and switch chip are fast enough for switching or routing (with fasttrack) iperf3 throughput >= 930Mbps, with or without hardware offload on the bridge.
Now if I keep hardware offload disabled, by turning off IGMP Snooping but turning on DHCP Snooping (with or without enabling Option 82) switching performance (same L2) drops significantly. iperf3 throughput averages to about 420Mbps, with 1.6 CPU cores loaded. Bridge Fast Path counter does not increase (as documented by MikroTik).
Same setting but with inter-VLAN routing, iperf3 throughput averages about 280Mbps. Neither the Fast Path nor the Fasttrack counter increases, as expected. This performance is in line with my previous LAN-WAN tests on the router with fasttrack disabled, and somewhat matches the 512-byte, 25-rule figure published by MikroTik.
=> Conclusion: On RouterOS 7, Bridge Fast Path improves switching performance a lot when hardware offload is not available, and fasttrack requires a working Bridge Fast Path. Turning on DHCP Snooping disables Bridge Fast Path (and makes fasttrack non-functional when routing between the ports of the bridge).
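To put the measurements above side by side, here is a small Python sketch that tabulates the iperf3 averages reported in this post and expresses each one as a fraction of a chosen baseline. The scenario labels and the `relative_to` helper are names I made up for illustration; the numbers are the ones quoted above.

```python
# iperf3 averages (Mbps) quoted in this post for the RB750Gr3 on RouterOS 7.18.2.
# The scenario labels are illustrative names, not RouterOS terms.
RESULTS_MBPS = {
    "hw offload, switching": 950,
    "hw offload, inter-VLAN + fasttrack": 930,
    "no offload (IGMP snooping), inter-VLAN + fasttrack": 944,
    "no offload (IGMP snooping), switching": 940,
    "no offload (DHCP snooping), switching": 420,
    "no offload (DHCP snooping), inter-VLAN": 280,
}


def relative_to(baseline: str, results: dict[str, int]) -> dict[str, float]:
    """Return each result as a fraction of the named baseline scenario."""
    base = results[baseline]
    return {name: round(mbps / base, 2) for name, mbps in results.items()}


if __name__ == "__main__":
    ratios = relative_to("no offload (IGMP snooping), switching", RESULTS_MBPS)
    for name, ratio in ratios.items():
        print(f"{name:55s} {RESULTS_MBPS[name]:4d} Mbps  {ratio:.0%}")
```

With the software-bridged switching case (Fast Path working) as the baseline, the two DHCP Snooping rows come out at well under half of it.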
I have no idea why the switching performance without Fast Path and hardware offload is so much better in RouterOS 6, though. Those numbers are still lower than RouterOS 7 with Fast Path available, so even with RouterOS 6 it would be better not to use DHCP Snooping on the hEX / hEX S.
The block diagram for the case when switching is disabled shows that both switch-CPU interconnects are used in an interleaved fashion … and using two adjacent ports can indeed distribute traffic between both links. If you used only even or only odd ports (ether2+ether4 or ether1+ether3), the results would likely be more in line with the results you got with HW offload enabled (i.e. with switching).
The block diagram of the hEX S is almost identical, with the exception of one switch-CPU interconnect being dedicated to the SFP port if a module is plugged in (in which case the switching and non-switching cases become similar with regard to available switch-CPU bandwidth).
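The interleaving on the plain hEX can be pictured with a tiny Python sketch. The parity rule in `cpu_link` is my assumption, chosen only because it matches the even/odd port pairs mentioned above; it is not taken from MikroTik documentation, and it does not model the hEX S with an SFP module plugged in.

```python
# Assumed model of the hEX (RB750Gr3) with switching disabled: ether ports
# alternate between the two switch-CPU links, so port parity picks the link.
# This mapping is an illustration, not taken from MikroTik documentation.
def cpu_link(port: int) -> int:
    """Return 0 or 1: the assumed CPU link serving etherN."""
    return port % 2


def share_link(port_a: int, port_b: int) -> bool:
    """True if both ports would contend for the same 1 Gbps CPU link."""
    return cpu_link(port_a) == cpu_link(port_b)


if __name__ == "__main__":
    for a, b in [(2, 3), (2, 4), (1, 3)]:
        verdict = "same link" if share_link(a, b) else "separate links"
        print(f"ether{a} <-> ether{b}: {verdict}")
```

Under this model a test between ether2 and ether3 gets a link per port, while ether2+ether4 or ether1+ether3 would share one link, much like the hardware-offloaded case.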
I think the conclusion is missing the point. Yes, there is a slight drop with no bridge fast path in ROS 6 on this hardware. But OP seems to be fine with the relative performance in that scenario. Still seeing mid-800s, so, nearly a gig. OP didn’t seemingly have a complaint about the performance until after the ROS 7 upgrade, when it dropped to ~300, which is less than half of what was being achieved with the older software running the same config (and from your report it sounds like it is consuming even MORE CPU in order to move FEWER packets). I don’t see how you can look at that and not conclude that some sort of major regression has occurred. The more interesting question to me is whether this is a hardware-specific regression or a general one…
As far as I can see, v6.49 uses all 4 CPUs for download and upload traffic (sfp1 > ether1 = download, ether1 > sfp1 = upload).
With v7.x (7.18.2 in my tests, but other versions behave the same), download is handled by CPU 0 only, while upload is handled by CPUs 0, 1, 2 and 3, so 100%.
Traffic always flows through the CPU; that is normal because of the board layout.
An 800 MHz CPU should be able to handle 250-320 MBit/s. So 4x 800 MHz should be able to handle 1000-1280 MBit/s.
The only issue between v6.49 and 7.18 is that downloads are nailed to CPU#0 only.
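The arithmetic behind that estimate, as a minimal Python sketch. It takes the 250-320 MBit/s per-core figure as given and assumes perfectly linear scaling across cores, which is a simplification; `aggregate_range` is just an illustrative helper name.

```python
# Per-core forwarding estimate quoted in the post (Mbit/s), taken as given.
PER_CORE_MBITS = (250, 320)


def aggregate_range(cores_used: int, per_core=PER_CORE_MBITS) -> tuple[int, int]:
    """Estimated throughput range, assuming perfectly linear scaling across cores."""
    low, high = per_core
    return (low * cores_used, high * cores_used)


if __name__ == "__main__":
    print("all 4 cores:", aggregate_range(4))  # upload case described in the post
    print("1 core only:", aggregate_range(1))  # download case on v7.x
```

If only one core does the work, the estimate falls back to the per-core range, which is in the same ballpark as the roughly 300 MBit/s seen on v7.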
This is not what the screenshot that you posted on March 31 shows. It is a screenshot of a router running 6.49, you are downloading 850Mbit/s over Ookla Speedtest, and CPU 0 is at 81.5%, CPU 1 at 6%, CPU 2 at 7.5%, and CPU 3 at 9.5%.
So even with 6.49, download is basically getting handled by a single CPU core only (at least, a single download from a single source). Since this happens with both 6.49 and 7.x, but 7.x still performs much worse, that means the issue with 7.x seems to be one of worse efficiency of CPU usage, not how well the workload is distributed across multiple cores/threads.
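To make "basically a single CPU core" concrete, here is a quick Python check using the four per-CPU percentages read off that screenshot. The `busiest_share` helper is only an illustrative name.

```python
# Per-CPU load (%) as read from the 6.49 screenshot discussed above.
CPU_LOAD = {0: 81.5, 1: 6.0, 2: 7.5, 3: 9.5}


def busiest_share(load: dict[int, float]) -> tuple[int, float]:
    """Return (cpu_index, fraction of the summed load carried by that CPU)."""
    cpu = max(load, key=load.get)
    return cpu, load[cpu] / sum(load.values())


if __name__ == "__main__":
    cpu, share = busiest_share(CPU_LOAD)
    print(f"CPU {cpu} carries {share:.0%} of the total load")
```

By this measure CPU 0 is doing roughly three quarters or more of the work during the download, even on 6.49.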
Also, the part of my comment that you were responding to (“a general one”) was questioning whether this performance drop-off only happens on “MMIPS” devices, or on all devices. To determine this, someone should probably take like a RB951 and run the same experiment on it. (951 is a good candidate, because anything with a faster CPU might not break a sweat trying to move 1Gbit/s through the CPU without fast path.) I should be able to run a side-by-side test like this between hEX S and RB951 soon.
Not necessarily. I suspect that not all traffic forwarding can necessarily be scaled infinitely over multiple threads. Similarly to why “round-robin” load balancing across more than one interface can be problematic, I would guess that there is a risk that packets within a single conversation or “flow” could arrive out-of-order if multiple CPU threads are participating in the forwarding of the traffic of that flow. In the case of software bridging, it could very well be the case that any set of frames that bear the same src-mac and dst-mac will likely have an affinity to a single core. Likewise, for L3 forwarding, same src-ip (+ maybe src-port) and dst-ip (+ maybe dst-port).
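A minimal Python sketch of that kind of per-flow affinity. I do not know how RouterOS actually distributes packets across cores, so this is purely hypothetical: `core_for_flow` is a made-up helper, and CRC32 stands in for whatever hash a real forwarding path might use. It is similar in spirit to the flow hashing used for receive-side scaling.

```python
import zlib


def core_for_flow(src: str, dst: str, n_cores: int = 4) -> int:
    """Map a (source, destination) pair to a core index via a stable hash.

    Hypothetical helper: CRC32 stands in for whatever hash a real
    forwarding path might use.
    """
    key = f"{src}>{dst}".encode()
    return zlib.crc32(key) % n_cores


if __name__ == "__main__":
    flow = ("aa:bb:cc:00:00:01", "aa:bb:cc:00:00:02")
    # Every frame of the same flow lands on the same core, so ordering is preserved.
    cores = {core_for_flow(*flow) for _ in range(1000)}
    print("cores used by one flow:", len(cores))
    # Many distinct flows, by contrast, can spread over the available cores.
    many = {core_for_flow(f"host{i}", "server") for i in range(100)}
    print("cores used by 100 flows:", len(many))
```

The trade-off is the one described above: a single speed test between two hosts can never use more than one core, while many concurrent flows can.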