average CPU load is not really helpful here: this device has 16 threads, so look at the load of each individual thread and make sure none of them is getting close to 100% (there's always one thread with way higher load than the others).
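a quick way to check per-thread load from the terminal, as a rough sketch assuming RouterOS v7 menu paths (the 10s duration is just an example value):

```routeros
# snapshot of the load on each core, one row per core
/system/resource/cpu/print

# live profiler broken down per core instead of the total
/tool/profile cpu=all duration=10s

# per-process view of the routing tasks (bgp input/output and so on)
/routing/stats/process/print
```

if one core sits near 100% while the rest are idle, that is the single-thread bottleneck to look for.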
regarding your question: yes, make sure your bgp peers use affinity=alone for both input and output. this will significantly speed up convergence and avoid single-thread bottlenecks.
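as a minimal sketch, assuming RouterOS v7 where the bgp peers live under /routing/bgp/connection (the name peer-upstream1 is hypothetical, use your own connection names):

```routeros
# give one peer its own input and output processes
# (peer-upstream1 is a made-up name for illustration)
/routing/bgp/connection
set [find name=peer-upstream1] input.affinity=alone output.affinity=alone

# or apply it to every configured connection at once
set [find] input.affinity=alone output.affinity=alone

# check the result
print detail
```

the sessions may need to be re-established before the new affinity takes effect, so do this in a maintenance window.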
other than that, i'd strongly advise reconfiguring the device to properly make use of its incredible ASIC to accelerate routing (l3hw). follow the documentation and add the interfaces one by one to a single bridge interface with proper vlan tagging for L2, then add the vlan interfaces for L3. keep going until every interface is working, and only then enable l3hw (beware: MAC-based functions such as MAC-telnet and MAC-winbox will stop working when you enable it, but you can use this and this script to keep compatibility with those functions).
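the rough shape of that setup, as a sketch only: the interface names (sfp28-1, sfp28-2), vlan ids (10, 20) and addresses (documentation range 192.0.2.0/24) are all made up for illustration, and your port layout will differ.

```routeros
# one bridge for everything; leave vlan filtering off while migrating
/interface/bridge
add name=bridge1 vlan-filtering=no

# move ports in one at a time, each with its own pvid (hypothetical ports)
/interface/bridge/port
add bridge=bridge1 interface=sfp28-1 pvid=10
add bridge=bridge1 interface=sfp28-2 pvid=20

# L2: vlan membership, with the bridge itself tagged so the CPU/L3 side sees it
/interface/bridge/vlan
add bridge=bridge1 vlan-ids=10 tagged=bridge1 untagged=sfp28-1
add bridge=bridge1 vlan-ids=20 tagged=bridge1 untagged=sfp28-2

# L3: vlan interfaces on top of the bridge carry the addresses
/interface/vlan
add name=vlan10 interface=bridge1 vlan-id=10
add name=vlan20 interface=bridge1 vlan-id=20
/ip/address
add address=192.0.2.1/27 interface=vlan10
add address=192.0.2.33/27 interface=vlan20

# once every port is verified, turn on vlan filtering and then l3hw
/interface/bridge
set bridge1 vlan-filtering=yes
/interface/ethernet/switch
set 0 l3-hw-offloading=yes
```

test connectivity on each port before moving on to the next one, and keep a management path that does not depend on the port you are currently changing.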
we use l3hw on the ccr2216 in a production environment and it works fine. however, our use-case does not overflow the ASIC's memory, which yours (full bgp routes) would. i've heard from others that it works fine even then, and there might be ways to optimize the acceleration for the prefixes where you see the most traffic, if you understand it well enough.
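one possible way to do that kind of optimization, as far as i understand the l3hw documentation, is the suppress-hw-offload route attribute set from a routing filter. this is an untested sketch: the chain name bgp-in-hw, the peer name peer-upstream1 and the /22 threshold are all assumptions picked for illustration.

```routeros
# keep shorter prefixes eligible for hardware, leave longer ones to the CPU
# (the /22 cut-off is arbitrary; base it on where your traffic actually goes)
/routing/filter/rule
add chain=bgp-in-hw rule="if (dst-len > 22) { set suppress-hw-offload yes; accept } else { accept }"

# attach the chain to the peer (this replaces any input filter already set)
/routing/bgp/connection
set [find name=peer-upstream1] input.filter=bgp-in-hw
```

if you already have an input filter on the peer, merge the suppress-hw-offload action into it instead of replacing it.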
after you have l3hw set up and running, replace the firewall raw rules with switch rules.
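the reason is that hardware-routed packets no longer pass through the CPU, so the firewall never sees them. a sketch of the translation, assuming the switch is called switch1 and using a made-up source range (198.51.100.0/24), port (sfp28-1) and comment:

```routeros
# the raw rule being replaced would look something like this:
#   /ip/firewall/raw add chain=prerouting src-address=198.51.100.0/24 action=drop

# switch rule equivalent: an empty new-dst-ports list drops the packet in hardware
/interface/ethernet/switch/rule
add switch=switch1 ports=sfp28-1 src-address=198.51.100.0/24 new-dst-ports="" comment="drop-example"

# once the switch rule is confirmed working, disable the old raw rule
# (assumes the raw rule carries the same hypothetical comment)
/ip/firewall/raw
disable [find comment="drop-example"]
```

switch rules are matched per ingress port, so list every port where that traffic can enter.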