One queue structure is limited to one CPU core. You have 2 queue structures (the main parent queues under global), so out of all your cores the queues can use only 2. As soon as those two become a bottleneck, traffic is delayed and all the other cores show as fully busy while they wait on traffic.
Bottom line: your queue implementation is far from optimal for your hardware. On x86, where an individual core is powerful, this would work with no problems, but on CCR you need to adjust.
1) Move away from parent=global to parent=<interfaces>. That should allow you to have more parent-level queues, which means more cores in use and a less likely bottleneck.
If that doesn't solve the problem:
2) Consider changing the queueing strategy. For CCR the best setup is a few thousand simple queues on the same level, for example one limit per client IP.
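
As a rough sketch of the two options above, not a drop-in config: the interface names (ether1 as WAN, ether2 as LAN), the packet marks (client-up, client-down), the client addresses and all the limits are assumptions made up for illustration. The queue tree entries also assume you already have mangle rules that set those packet marks.

```
# Option 1: parent queues attached to interfaces instead of global.
# A queue with parent=<interface> shapes traffic leaving that interface,
# so download goes on the LAN side and upload on the WAN side.
/queue tree
add name=download-lan parent=ether2 packet-mark=client-down max-limit=500M
add name=upload-wan parent=ether1 packet-mark=client-up max-limit=500M

# Option 2: flat simple queues, one per client IP, all on the same level.
# max-limit here is upload/download.
/queue simple
add name=client-192.168.88.10 target=192.168.88.10/32 max-limit=10M/10M
add name=client-192.168.88.11 target=192.168.88.11/32 max-limit=10M/10M
```

With option 2 you would normally generate the per-client entries from a script or your billing system rather than typing them by hand.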