I don't have experience with that configuration.
I have a CCR1009 running an address-list-based queue tree setup that handles traffic for about 180 clients: 8 firewall rules in the forward chain, 2 NAT rules (not heavily utilized), 90 mangle rules, and 30 PCQ queues under each of 5 parent interfaces. Only two of the parent interfaces see significant traffic at any one time. According to MRTG, that CCR1009 runs at less than 20% CPU, except for random spikes to 40%. That's not a terribly useful statistic, though, since I'm only monitoring cpu0 (oid .18.104.22.168.22.214.171.124.126.96.36.199). The router also passes traffic for the backhauls to other towers. We run around 230Mbps (5-minute average) on the main backhaul for that tower, and 148Mbps (5-minute average) of traffic is going through the queue trees during peak times.
It is my understanding that having only one parent queue will limit you to one CPU core for queue processing, given the current simple queue handling in RouterOS 6.
How much traffic do you think your 400 queues will utilize?
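As a rough yardstick for that question, here is a back-of-the-envelope sketch that scales the per-client average from my numbers above (148Mbps across about 180 clients) to a larger client count. It assumes one client per queue and that load scales linearly with client count, neither of which may hold for your network; the function name is just illustrative.

```python
# Figures quoted earlier in this post (5-minute averages at peak).
QUEUED_MBPS = 148.0  # traffic passing through the queue trees
CLIENTS = 180        # approximate number of clients behind those queues


def estimate_queued_mbps(n_clients: int) -> float:
    """Scale the observed per-client average linearly to n_clients.

    Assumption: usage per client stays the same as the count grows.
    """
    per_client = QUEUED_MBPS / CLIENTS
    return per_client * n_clients


if __name__ == "__main__":
    print(f"average per client: {QUEUED_MBPS / CLIENTS:.2f} Mbps")
    # 400 is the queue count from the question; one client per queue
    # is an assumption, not something stated in the thread.
    print(f"estimate for 400 clients: {estimate_queued_mbps(400):.0f} Mbps")
```

That works out to a bit under 1Mbps per client on average, so treat the 400-client figure as a starting point for discussion rather than a prediction.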