CCR1036 - still restricting queue tree performance?

I have previously described my problems with the dynamic simple queues that PPPoE auto-generates: http://forum.mikrotik.com/t/ccr-1036-pppoe-server/62520/8
At any given time the load on one single CPU core (not always the same one) would skyrocket, and I suspected this was limiting my total throughput.

Then I changed to queue trees: http://forum.mikrotik.com/t/simple-queues-bottleneck-on-ccr-1036-only-1-core-employed/75676/1 and at first glance it appeared that my problem was solved. CPU load went down and download speeds were fine when testing.
However, total throughput never seems to exceed 70-80 Mbps, although 200 Mbps is available. As an experiment, I raised the download limit for all 150 customers to 10M or better, but total download still did not increase.
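To put rough numbers on that experiment, here is a back-of-the-envelope check in Python. It only uses the figures from this post; the function and variable names are made up for illustration, and 10M is treated as a lower bound because some customers have more.

```python
# Figures from the post. Names are illustrative, not RouterOS terms.
CUSTOMERS = 150
PER_CUSTOMER_LIMIT_MBPS = 10     # "10M or better", so this is a lower bound
LINK_CAPACITY_MBPS = 200
OBSERVED_MBPS = (70, 80)         # observed ceiling on total download


def sum_of_limits(customers, limit_mbps):
    """Aggregate of all per-customer download limits, in Mbps."""
    return customers * limit_mbps


def utilisation(observed_mbps, capacity_mbps):
    """Fraction of the available link that is actually used."""
    return observed_mbps / capacity_mbps


total_limit = sum_of_limits(CUSTOMERS, PER_CUSTOMER_LIMIT_MBPS)
oversubscription = total_limit / LINK_CAPACITY_MBPS
low, high = (utilisation(m, LINK_CAPACITY_MBPS) for m in OBSERVED_MBPS)

print(f"Sum of per-customer limits: at least {total_limit} Mbps")
print(f"That is {oversubscription:.1f}x the {LINK_CAPACITY_MBPS} Mbps link")
print(f"Observed utilisation: {low:.0%} to {high:.0%} of the link")
```

So the per-customer limits add up to at least 1500 Mbps, far more than the 200 Mbps link, yet the observed total stays at 35-40% of what is available. This does not prove the limit is on the router, since it depends on how much the customers actually request, but it does show that the per-customer limits themselves are not what is holding the total down.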

In this thread: http://forum.mikrotik.com/t/ccr1036-queue-tree-low-speed/72450/7 Normis says that only simple queues are optimized for multiple cores, and that queue trees will therefore suffer on the CCR.

This is the exact opposite of what I have experienced: with simple queues one of the cores went to 100%, but with queue trees the load appears to be distributed better, with no core exceeding 10-20%.
I cannot find anything in the ROS 6 changelog that mentions multi-core improvements for queue trees either.

So can anybody please shed some light on this topic?

  1. Mangle and Queue trees or simple queues?
  2. If simple queues are the better choice, do I have to manually create static simple queues instead of using the auto-created ones?
  3. Will simple queues have to be organized under multiple parents to distribute the CPU load evenly between cores?