Focusing on traffic/packet throughput with MikroTik RouterOS v7 installed directly as a bare-metal system, which x86 CPU would be more suitable for a full-table BGP router:
Intel Xeon E5-2690 v4 with 14 cores at 2.60 GHz
or
Intel Xeon E5-2699 v4 with 22 cores, but at 2.20 GHz
BGP convergence time is always important but not the top priority.
I’m fairly sure that more cores can handle more simultaneous traffic flows, but I’m not sure about the v7 architecture and how traffic is divided between CPU cores. Could a lower CPU core clock lead to lower single-flow throughput? And what kind of difference are we talking about?
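To put rough numbers on that (purely a back-of-envelope sketch, assuming single-flow throughput scales roughly with per-core clock and aggregate throughput with cores × clock; real RouterOS v7 forwarding with NIC queues, drivers and fasttrack won’t follow this exactly):

[code]
# Hypothetical scaling sketch: single-flow ~ per-core clock, aggregate ~ cores * clock.
cpus = {
    "E5-2690 v4": {"cores": 14, "ghz": 2.6},
    "E5-2699 v4": {"cores": 22, "ghz": 2.2},
}

for name, c in cpus.items():
    aggregate = c["cores"] * c["ghz"]  # relative many-flow capacity
    print(f'{name}: single-flow index {c["ghz"]:.2f}, aggregate index {aggregate:.1f}')

# E5-2690 v4: single-flow index 2.60, aggregate index 36.4
# E5-2699 v4: single-flow index 2.20, aggregate index 48.4
# i.e. roughly 15% less per-flow headroom but about 33% more aggregate headroom on paper.
[/code]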
I expect the lower clock speed to affect single-packet latency.
I expect 22 cores to allow more concurrent packets in flight.
RouterOS v6 and v7 use different Linux kernel versions, though I don’t recall which.
When port speed is the bottleneck, CPU speed and processor count are irrelevant.
Case in point: examine hAP ax3 performance: https://mikrotik.com/product/hap_ax3#fndtn-testresults
Those numbers look very close to the 2.5 Gbps port speed, IMO.
I have 4 x RouterOS v7 bare metal routers running on Intel E5-2699 CPU.
They load a full Internet BGP table in like 15-20 seconds.
A 10G both-directions (20G aggregate) TCP speed test in WinBox from MikroTik to MikroTik (both E5-2699) shows about 20% load.
I’m using a few servers with v7 and the E5-2690, and the overall performance and BGP performance look impressive.
But without a lab I can’t check what the bottleneck would be: the E5-2690 or the 2x10GE ports.
Hi, could you please share how you run RouterOS v7: CHR or bare metal?
And which SFP+ NICs are you using?
I am having trouble with an old R620 with E5-2697 v2 CPUs, 128 GB RAM and Intel X520 cards: I am getting RX errors at low traffic, averaging 130 Mbps.
I ran a bandwidth test beforehand from two CCR1036s via a CRS317 switch in SwOS mode, forwarding more than 8 Gbps of TCP traffic to the server’s WAN port, and not a single RX error showed up in 30 consecutive minutes. After I started the PPPoE server connected to client devices, at 130 Mbps I start getting RX errors on the WAN link port.
I have upgraded the Intel NIC firmware from the Dell website to the latest firmware available (April 2023).
I am running out of ideas. I am running bare-metal v7.12.1.
BTW: my issue is not CPU-wise. I have played with this same server with a Mellanox MCX455 100G card connected in Ethernet mode, and on the other side an R420 with a Mellanox card as well. An aggregate bandwidth test from one machine to the other reached 64 Gbps of throughput, which is the maximum PCIe 3.0 x8 can handle, with over 3 million pps being sent and received on both servers at around 30% average CPU usage, without a single RX error. As soon as we fire up the Intel cards, we start getting issues on the WAN side. We tested both cards on the Dell server.
As for testing the bottleneck on these MikroTik servers: I found out the hard way that you need enough RAM, populated in the correct slots, before the PCIe slots can push full throughput to the CPU. I started with only 16 GB of RAM on the server, and with the 100G cards I could not forward more than 15 Gbps aggregate in a bandwidth test. Once I upgraded to 128 GB in the correct slot positions on both servers, we fired up a new bandwidth test and got the full 64 Gbps of throughput on both servers, passing through the PCIe 100G NICs in Ethernet mode.
I have heard a PCIe 3.0 x16 slot can do up to about 128 Gbps of throughput, but I only have one 100G x16 card; on the other side I was running a Mellanox 40 Gbps card in a PCIe 3.0 x8 slot, so both machines were linked at 40 Gbps.
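For reference, the theoretical PCIe 3.0 numbers line up with that: 8 GT/s per lane with 128b/130b encoding, before protocol overhead. A quick sketch of the per-direction figures:

[code]
# Theoretical PCIe 3.0 bandwidth per direction, before protocol overhead.
GT_PER_S = 8.0        # PCIe 3.0: 8 gigatransfers/s per lane
ENCODING = 128 / 130  # 128b/130b line encoding

def pcie3_gbps(lanes: int) -> float:
    return GT_PER_S * ENCODING * lanes

for lanes in (4, 8, 16):
    print(f"x{lanes}: {pcie3_gbps(lanes):.1f} Gbps per direction")

# x4:  31.5 Gbps
# x8:  63.0 Gbps   <- matches the ~64 Gbps ceiling seen on the x8 card
# x16: 126.0 Gbps  <- the "~128 Gbps" figure for an x16 slot
[/code]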
A suggestion is to start by focusing on the network interface, which is generally the most crucial component, whether it is used bare metal or as a virtual NIC (vNIC) in CHR. A well-developed driver is also a prerequisite and can be a showstopper, determining whether the NIC can be used with SR-IOV and so on. The rest is just raw CPU power, which typically suffices regardless of the model and usually has significantly more internal throughput than the NICs.
I probably don’t need to point out that production testing is of course a necessity.
I am using bare metal with Mellanox ConnectX-5 series cards. I initially tried newer Intel cards but had odd issues. Support stated that there were issues with the Intel driver, so I swapped over to Mellanox and haven’t had any issues since. The other thing to keep in mind is NUMA. I converted my dual-CPU setup to a more powerful single CPU and moved the NICs to the slots that are connected to the installed CPU. Since then it has been consistent. With dual CPUs I never could get the performance, as it was constantly crossing the QPI bus and crippling performance.
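If anyone wants to check slot placement on similar hardware: on a plain Linux install (not from inside RouterOS, which doesn’t expose this kind of shell), sysfs reports which NUMA node each NIC hangs off. A minimal sketch:

[code]
# Minimal sketch: print the NUMA node each network interface's PCIe device
# is attached to, by reading sysfs on a generic Linux host (not RouterOS itself).
import glob

for path in sorted(glob.glob("/sys/class/net/*/device/numa_node")):
    iface = path.split("/")[4]
    with open(path) as f:
        node = f.read().strip()  # "-1" means no NUMA info / single-node system
    print(f"{iface}: NUMA node {node}")
[/code]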
Hi, SR-IOV is disabled because we decided to run bare metal on HDD, no VMware or CHR. I have a couple of Broadcom BCM57xxx dual SFP+ cards and will give them a go tomorrow: I will try removing the Dell mezzanine i350 card with the 2 Intel SFP+ ports, remove the X520-DA2 as well, and put in 2 Broadcom dual SFP+ cards just for the sake of testing.
Interesting, I will try out the dual SFP+ Mellanox card as well; I think I have one in stock just in case. I need to figure out this odd issue, why it is throwing the RX errors at all, and why after about 200k errors it crashes the cards.
Modern CPUs have the memory controller built in, so memory banks are connected directly to the CPU. In a multi-CPU machine, memory banks are evenly distributed over all CPUs so each CPU controls part of the memory.
When a process executing on one CPU needs to access memory managed by that same CPU, it can do so directly. If the memory being accessed happens to be controlled by another CPU, the access has to cross the inter-CPU bus (QPI) and the other CPU’s memory controller. Such access adds latency and may reduce memory bandwidth (depending on QPI speed and how badly memory placement and process execution are mismatched). Modern kernels (the Linux kernel included) try to reduce the mismatch by keeping processes running close to the memory they use, but that’s not always possible, or it runs into CPU performance limits (e.g. when a number of processes all have their memory controlled by the same CPU and compete for its cycles).
The same principle applies to DMA and IRQs: hardware (e.g. network interfaces) gets mapped into the memory address space and IRQ lines of some CPU, and if the process accessing/serving those resources happens to run on a different CPU, the communication again involves QPI and the other CPU’s controllers.
So if there’s no need for a large number of CPU cores, it’s often better to go with a single CPU. And I’m talking about the CPU package; it can contain many CPU cores and that’s fine. It’s also often better to go with a smaller number of faster CPU cores in use cases like ours (routing/firewalling).
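To see that local/remote access cost on a dual-socket box (again on a generic Linux host, not RouterOS), the kernel exposes a NUMA distance matrix in sysfs; a minimal sketch to dump it:

[code]
# Minimal sketch: print the NUMA distance matrix from sysfs on a generic Linux host.
# A node's distance to itself is 10; a larger value (typically 20-21) means the
# access crosses the inter-socket link (QPI/UPI).
import glob

for path in sorted(glob.glob("/sys/devices/system/node/node*/distance")):
    node = path.split("/")[-2]  # e.g. "node0"
    with open(path) as f:
        print(node, "->", f.read().split())
[/code]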
So should NUMA be disabled or enabled? I don’t get it.
And with 2 sockets, how can I tell which CPUs belong to socket 1 and which to socket 2 (I see a list of 72 CPUs, where I have 2 sockets with 36-core processors), so I can assign the appropriate IRQs/queues and avoid QPI?
Wikipedia says, non-uniform memory access (NUMA) is a computer memory design used in multiprocessing, where the memory access time depends on the memory location relative to the processor. Under NUMA, a processor can access its own local memory faster than non-local memory (memory local to another processor or memory shared between processors). NUMA is beneficial for workloads with high memory locality of reference and low lock contention, because a processor may operate on a subset of memory mostly or entirely within its own cache node, reducing traffic on the memory bus.
When you have a machine with two CPU packages installed, you have NUMA. With a single CPU package (even if the main board supports two or more), there is no NUMA. You don’t get to enable/disable it at will; your only choice is how many CPU packages you install.
Depends on the OS/kernel. Linux shows CPU core locations in /proc/cpuinfo (the “physical id” field). Beyond that you may have only a very limited choice in how HW resources (DMA, IRQs) get mapped to particular CPU cores (and consequently to CPU packages), and similarly limited options for pinning certain processes to particular CPU cores.
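As a rough sketch of reading that on a generic Linux host (not something you can run on RouterOS itself), grouping the logical CPUs by socket using the “physical id” field looks like this:

[code]
# Minimal sketch: group logical CPU numbers by socket ("physical id")
# by parsing /proc/cpuinfo on a generic Linux host.
from collections import defaultdict

sockets = defaultdict(list)
cpu = None
with open("/proc/cpuinfo") as f:
    for line in f:
        if line.startswith("processor"):
            cpu = int(line.split(":")[1])
        elif line.startswith("physical id"):
            sockets[int(line.split(":")[1])].append(cpu)

for socket_id, cpus in sorted(sockets.items()):
    print(f"socket {socket_id}: CPUs {cpus}")
[/code]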
So you may want to disable it and see how it behaves afterwards. You may also want to check your appliance’s documentation to see what the setting affects.
Unfortunately, that’s true. Are you running RouterOS on the appliance bare metal, or as CHR on a virtualization platform? If the latter, the virtualization platform will mess with NUMA on its own and there’s not much CHR can do about it.