How to reduce CPU utilization when doing bandwidth control

My LAN has 2000 users, Wi-Fi + wired
The WAN has 4 Gbps
My router is a CCR1036-8G-2S+; LAN and WAN both use the 10G SFP+ ports to a 10G switch

With the simple configuration (LAN IP, WAN IP, DHCP, NAT, default route) the network worked well: high bandwidth and low latency. With 1500 users online and 2 Gbps of traffic, CPU utilization was about 10%-15%. Nice!

But now I need bandwidth control: a download max-limit of 50 Mbps per IP. So I used mark-packet + queue tree + PCQ to do it.
I have done this in the past without any problems, although the number of users and the bandwidth were not as large as they are now.

Now, as the number of users increases, CPU utilization rises linearly. At 1000 users and 1 Gbps it reaches 100%.

My script:
/ip firewall address-list
add list=que-104 address=10.0.0.0/19
/ip firewall mangle
# the "local" address list is referenced here but defined elsewhere (not shown)
add chain=forward action=mark-packet new-packet-mark=que-104-down \
    passthrough=no src-address-list=!local dst-address-list=que-104 \
    out-interface=vlan0002 log=no log-prefix=""
add chain=forward action=mark-packet new-packet-mark=que-104-up \
    passthrough=no src-address-list=que-104 dst-address-list=!local \
    in-interface=vlan0002 log=no log-prefix=""
/queue type
add name="que-104-up" kind=pcq pcq-rate=50M pcq-limit=500KiB \
    pcq-classifier=src-address pcq-total-limit=1000000KiB pcq-burst-rate=0 \
    pcq-burst-threshold=0 pcq-burst-time=10s pcq-src-address-mask=32 \
    pcq-dst-address-mask=32 pcq-src-address6-mask=128 \
    pcq-dst-address6-mask=128
add name="que-104-down" kind=pcq pcq-rate=50M pcq-limit=500KiB \
    pcq-classifier=dst-address pcq-total-limit=1000000KiB pcq-burst-rate=0 \
    pcq-burst-threshold=0 pcq-burst-time=10s pcq-src-address-mask=32 \
    pcq-dst-address-mask=32 pcq-src-address6-mask=128 \
    pcq-dst-address6-mask=128
/queue tree
add max-limit=4G name=all-down parent=global queue=default
add max-limit=2G name=all-up parent=global queue=default
add limit-at=2M max-limit=4G name=que-104-down packet-mark=que-104-down parent=all-down queue=que-104-down
add limit-at=2M max-limit=2G name=que-104-up packet-mark=que-104-up parent=all-up queue=que-104-up

Is there a more efficient way to limit speed at this traffic volume?

Thank you very much for your guidance.

Still unresolved

Bandwidth enforcement is CPU intensive, so you’ll have to pour in more CPU. There is no way to optimize the current setup without affecting its behaviour (PCQ per individual user simply must create as many virtual queues as there are users). And yes, it will require a reconfiguration of the topology: at the very least you’ll have to give half the users another gateway IP, so that they send their traffic through the second router.
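
A rough sketch of how the shaped address range could be halved between two routers, so that each one only builds PCQ sub-queues for its own users. The /20 split of the original 10.0.0.0/19 and the "router A" / "router B" labels are assumptions for illustration, not something taken from the original setup; the mangle rules, queue types and queue tree from the first post would be repeated unchanged on both routers.

```
# Router A (hypothetical): shapes only the lower half of 10.0.0.0/19
/ip firewall address-list
add list=que-104 address=10.0.0.0/20

# Router B (hypothetical): shapes only the upper half of 10.0.0.0/19
/ip firewall address-list
add list=que-104 address=10.0.16.0/20
```

Keeping the list name que-104 on both routers means the existing rules can be copied over without edits; only the address range differs.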

So that means neither the CCR1036 nor the CCR1072 may be enough.
I would need an x86 server with dual Xeon Gold CPUs running RouterOS to meet the need.

A scalable solution with load distribution is always better than replacing one all-in-one-box by another single one with more horsepower. The highest step to cross is the one between “one” and “more than one”; whether “more than one” is actually two or twenty doesn’t matter much.

Both PPPoE and DHCP are designed for redundancy and load balancing - you can run multiple servers in the same L2 domain, all of them will respond to clients’ discovery requests with a random delay, and each client will choose the first response and finish the address acquisition/tunnel establishment with the server that has sent that response. So at the client-facing side of the server cluster, the load distribution among the individual servers is this easy. In the DHCP case, each DHCP server has to assign itself as the default gateway to each served client; if you assign public IPs to your clients, the task is a bit more complex as you’ll need dynamic routing to divert traffic from one router to another at the internet-facing side, or in turn simpler if you statically link particular clients to a particular router instead, thus losing redundancy for them. With CGNAT addresses, it makes no difference for the clients what address they get, so a different subnet for each server and static routing at their internet facing side will do.
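
A minimal sketch of the DHCP variant described above, with two routers in the same L2 domain, each serving its own subnet and announcing itself as the default gateway. All specifics here are assumptions for illustration: the router labels, the pool and server names, the /20 subnets carved out of 10.0.0.0/19, the gateway addresses, the lease time, using the router itself as DNS server, and vlan0002 as the client-facing interface.

```
# Router A (hypothetical): serves 10.0.0.0/20 and is the gateway for its clients
/ip address add address=10.0.0.1/20 interface=vlan0002
/ip pool add name=pool-a ranges=10.0.0.10-10.0.15.254
/ip dhcp-server add name=dhcp-a interface=vlan0002 address-pool=pool-a \
    lease-time=1h disabled=no
/ip dhcp-server network add address=10.0.0.0/20 gateway=10.0.0.1 \
    dns-server=10.0.0.1

# Router B (hypothetical): serves 10.0.16.0/20 and is the gateway for its clients
/ip address add address=10.0.16.1/20 interface=vlan0002
/ip pool add name=pool-b ranges=10.0.16.10-10.0.31.254
/ip dhcp-server add name=dhcp-b interface=vlan0002 address-pool=pool-b \
    lease-time=1h disabled=no
/ip dhcp-server network add address=10.0.16.0/20 gateway=10.0.16.1 \
    dns-server=10.0.16.1
```

Whichever server answers a client's discovery first hands out a lease from its own subnet, so the client's traffic goes through that router. If one router fails, its clients should pick up a lease from the other one once their current lease runs out, which is why a fairly short lease time is assumed here.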

Thank you for your suggestion; load balancing is the better choice.