I had a simple config with queue tree, PCQ and mangle on the CCR1036. Just the most basic stuff to limit bandwidth. Once the bandwidth went above 1G (most likely ~1370Mbit/s or so) we hit the single-core performance limit on our CCR and I had the same problems. So what I did was create more of the same queue trees, with the ethernet/SFP interfaces as parent interfaces, to balance the load across the cores. That solved the problem until the 10G SFP interface hit the single-core limit again. It turns out that queue tree can only use one core per queue tree (or per multiple queue trees?) on the interface.
MikroTik made some changes to simple queues years ago that improved performance and gave them multi-core support (each simple queue item, or maybe a group of them, gets a core for itself). So the solution was to use simple queues instead of queue trees. Oh boy, do I hate simple queues... Well, it turns out that adding about 37500 simple queues half-bricks the router (reset needed), and the single-core limit of ~1370Mbit/s still applies, so a single simple queue item cannot limit more than ~1370Mbit/s on the CCR1036. You also need more than 32 simple queues for optimal load balancing across the cores, though it works OK with fewer.
Instead of just stacking simple queues, you can improve simple queue performance some more by grouping them under parent queues. It's hard to find any documentation on it, so I can only point you at this video
https://www.youtube.com/watch?v=Ro3B1kQUokE as a reference.
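To make the parent/child grouping a bit more concrete, here is a rough Python sketch that prints RouterOS `/queue simple add` commands: one parent queue per subnet and one child queue per host under it. The function name, the queue naming scheme, the subnet and the limits are all made up for illustration, and I have not checked that this layout matches what the video recommends, so treat it as a starting point only.

```python
import ipaddress


def simple_queue_commands(subnet, parent_limit, host_limit):
    """Return RouterOS commands for one parent queue plus one child per host.

    All names and limits here are illustrative assumptions, not a tested config.
    """
    net = ipaddress.ip_network(subnet)
    parent = f"parent-{net.network_address}"
    # The parent queue is emitted first so it exists before children refer to it.
    cmds = [
        f"/queue simple add name={parent} target={net} "
        f"max-limit={parent_limit}/{parent_limit}"
    ]
    for host in net.hosts():
        cmds.append(
            f"/queue simple add name=host-{host} target={host}/32 "
            f"parent={parent} max-limit={host_limit}/{host_limit}"
        )
    return cmds


# Example: a tiny /30 so the output stays short.
for line in simple_queue_commands("10.0.1.0/30", "1G", "50M"):
    print(line)
```

Keep each parent's limit under the ~1370Mbit/s single-core ceiling mentioned above, and try the generated commands on a lab router before pasting them into production.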
As for the L7, can't you just skip it, pull all the Google/Netflix IPv4/IPv6 ranges into an address list and shape traffic with that
https://www.gstatic.com/ipranges/goog.json and just update the list from time to time? No L7, fewer issues. Or combine them both: if the destination is not a Google IP, there is no need to run the L7 check on the packet.
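Updating the list from time to time could look something like the Python sketch below. It assumes goog.json keeps its current shape (a `prefixes` array whose entries carry either `ipv4Prefix` or `ipv6Prefix`) and simply turns each prefix into a RouterOS address-list command. The list name `google` and both function names are my own placeholders.

```python
import json
import urllib.request

GOOG_URL = "https://www.gstatic.com/ipranges/goog.json"


def fetch_ranges(url=GOOG_URL):
    """Download and parse the published range file (needs network access)."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        return json.load(resp)


def address_list_commands(data, list_name="google"):
    """Turn the parsed JSON into RouterOS address-list add commands.

    Assumes each entry in data["prefixes"] has an "ipv4Prefix" or
    "ipv6Prefix" key; entries with neither are skipped.
    """
    cmds = []
    for entry in data.get("prefixes", []):
        if "ipv4Prefix" in entry:
            cmds.append(
                f"/ip firewall address-list add list={list_name} "
                f"address={entry['ipv4Prefix']}"
            )
        elif "ipv6Prefix" in entry:
            cmds.append(
                f"/ipv6 firewall address-list add list={list_name} "
                f"address={entry['ipv6Prefix']}"
            )
    return cmds
```

Calling `address_list_commands(fetch_ranges())` gives you the lines to import on the router. You would probably want to clear the old entries from the list first so stale prefixes don't pile up. Netflix does not publish a file in this format as far as I know, so its ranges would have to come from somewhere else.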