I used a CCR1016 running Traffic Generator (2 UDP streams) as the source for my tests; the sink was also a CCR1016. So the test looks like this:
CCR1016-1 (sfp1) ---> (eth5) CCR1009 (eth1/6) ---> (sfp1) CCR1016-2
All devices were running ROS 6.32.3.
I used a packet size of 64 bytes, because once we know how many packets per second we can forward, it's easy to calculate throughput at any given packet size. I used ether5 on the CCR1009 as the "wan" port, and ether1 and ether6 as the switched and non-switched "lan" ports.
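For reference, the generator side was set up roughly like this. This is only a sketch: the exact parameter names and the start syntax may differ between ROS versions, and the addresses/ports are placeholders rather than my exact values.

/tool traffic-generator packet-template
add name=tpl1 header-stack=mac,ip,udp interface=sfp1 \
    ip-src=192.168.10.2 ip-dst=192.168.20.2 udp-dst-port=5001
/tool traffic-generator stream
add name=stream1 packet-template=tpl1 packet-size=64 pps=1000000
# then start the stream; the exact start/quick command syntax varies by version
/tool traffic-generator start stream1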
Using fastpath routing only, I got the following results:
ether1: 1340 kpps; 12% cpu-used, /tool profile shows 79% idle, 13% networking, 8% ethernet
ether6: 1410 kpps; 11% cpu-used, /tool profile shows 88% idle, 11% networking, 0% ethernet
All CPU cores were more or less evenly loaded. I don't know what's up with the difference between the indicated CPU load (cpu-used) using /system resource monitor and the "idle" stat in /tool profile. One of them is clearly wrong.
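For anyone who wants to compare the two readouts themselves, these are the commands I'm talking about (standard ROS commands):

/system resource monitor     # shows the overall cpu-used value
/system resource cpu print   # per-core load, shows how evenly the cores are hit
/tool profile cpu=all        # per-process usage, including the "idle" entry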
I then repeated the tests with dst-nat and fasttrack activated, so that the CCR1009 would dst-nat the traffic stream to CCR1016-2 instead of just plainly routing it.
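The rules followed the Wiki's fasttrack instructions; as a minimal sketch (the target address here is a placeholder, not my exact config):

/ip firewall nat
add chain=dstnat in-interface=ether5 protocol=udp action=dst-nat \
    to-addresses=192.168.20.2
/ip firewall filter
add chain=forward action=fasttrack-connection connection-state=established,related
add chain=forward action=accept connection-state=established,related

This gave the following results: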
ether1: 300 kpps; 33% cpu-used, /tool profile shows 66% idle, 18% firewall, 4% ethernet, 13% networking
ether6: 320 kpps; 34% cpu-used, /tool profile shows 66% idle, 18% firewall, 14% networking
The CPU load was uneven - only three cores were used, and all of them were running at 100%; the other cores were idle at 0%. The connection showed up as fasttracked in firewall connection tracking, but the ipv4-fasttrack-bytes counter in /ip settings was stuck at 0, so fasttrack does not appear to work with dst-nat? Again, there is some discrepancy between the indicated CPU load and the idle stat in /tool profile.
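To check whether fasttrack actually kicks in, these are the two places I looked (both standard since fasttrack was introduced in ROS 6.29):

/ip firewall connection print detail   # fasttracked connections are marked here
/ip settings print                     # ipv4-fasttrack-active / -packets / -bytes counters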
Summary:
Even if we use the slowest result I was able to get, the potential throughput (300 kpps @ 1500 bytes) with dst-nat is 3.6 Gbit/s. You can only push 1 Gbit/s (2 Gbit/s full-duplex) through the switch-group, so even using the slowest possible configuration the CCR1009 is still about 1.8 times faster than it needs to be. There is a measurable speed difference between using a switchport and a directly connected port, but it's rather small. I am not sure what to make of the CPU load readings though. One of these readings is clearly wrong, but I can't say which one.
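For clarity, the arithmetic behind those numbers:

300,000 pps x 1500 bytes x 8 bits/byte = 3,600,000,000 bit/s = 3.6 Gbit/s
3.6 Gbit/s / 2 Gbit/s (1 Gbit/s full-duplex) = 1.8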
Edit: I ran the dst-nat tests again with fasttrack disabled, and got increased throughput of around 420 kpps on ether1 instead of 300 kpps. I have no idea why. Even following the fasttrack instructions in the Wiki, the dst-nat connections did not get fasttracked.
Edit 2: Okay, fasttrack does not work with Traffic Generator because, apparently, for fasttrack to work there needs to be traffic in both directions (presumably because a one-way stream never sees a reply, so the connection never counts as "established" and the fasttrack-connection rule never matches).
Using Bandwidth-Test in both directions, fasttrack starts working; Bandwidth-Test in one direction (either send or receive only) also does not trigger fasttrack, for some reason. I will leave the "slow" results up though, because I think it's good to show a "worst case". I also want to note that even when fasttrack is working, the CPU load is still spread unevenly: only 3 cores are doing work, the other 6 are idle. Fasttrack with NAT is around 4 times faster than not using fasttrack (16% cpu-used at 600 kpps; Bandwidth-Test is not multithreaded and cannot generate more packets on a CCR1016 using only one core).
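For reference, the kind of bidirectional test that does trigger fasttrack looks something like this, run from one CCR1016 against the other through the router (address and credentials are placeholders; /tool bandwidth-server has to be enabled on the target):

/tool bandwidth-test address=192.168.20.2 protocol=udp direction=both \
    user=admin password=""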