If someone could hear us out… We are in deep trouble. We bought 12 CCR2216s to replace our aging CCR1072 fleet (we are a mid-sized FTTH/wireless ISP). We took the leap because of the ASIC L3HW offloading capabilities of this new generation, which is a standard approach with most vendors, such as Juniper.
Out of 3 deployed so far, we have only been able to run L3HW offloading on the Edge (very simple BGP only config), and it works awesome with very low CPU usage at +18Gbit/s traffic with the full internet IPv4 and IPv6 table. But on the other 2 AGG/BNG routers deployed, we haven’t been able to turn on the L3HW offloading. As soon as we turn it on, a lot of stuff breaks. The config is correct for single linux bridge approach, and as long as we keep it running on CPU, no issues, but they run at +40% CPU usage.
When offloading is enabled, CPU drops as expected and everything seems alright, but a lot of routes break, as if they were incorrectly offloaded to the ASIC. We have a downstream AC router that terminates over 2,100 PPPoE connections, so there are lots of /32 routes there, and a LOT of those routes become unreachable from the upstream CCR2216 BNG as soon as L3HW is turned on. Turn it back off so the CPU takes over, and the problems are solved. There is another weirdness going on with the fasttrack offloading as well, but it is too much to detail here. We opened a ticket with MikroTik, but it hasn’t even been seen yet. We are kind of screwed here, because going back to the 1072s is a no-no: the config was heavily modified to accommodate the single VLAN-filtered bridge approach. These BNG routers run a lot of carrier stuff like MPLS, VPLS transports for commercial customers, OSPF, BFD, CGNAT, DHCPv6 with ND, etc. So whatever is breaking L3HW offloading is a combination of our carrier setup with something going on in RouterOS. We upgraded to the latest version, 7.18.1, in hopes that the bridge and l3hw changes included would fix it, but nope.
I’m at a loss here. Maybe someone here at Mikrotik could hear us out? We are HEAVILY invested in top end MikroTik equipment, and it would be devastating for us to have to scrap it to jump over to Juniper or some other vendor. We’ve been using MikroTik since we started the ISP in 2011.
I opened a forum account in 2017, but never really had any specific reason to post here, just lurked around since everything has worked out pretty nicely for us with Mikrotik until this last generation jump.
They are running the latest 7.18.1 and run OSPF, BFD, MPLS/VPLS, CGNat with FastTrack, DHCP Server for v4 and v6 with ND, PPPoE Server, etc. The firewall filter is only set for fasttrack, and NAT is doing CGNat NetMap, so it is a very simple firewall setup. MPLS and LDP are filtered to only loopback addresses, since we only use them for VPLS services. In general just common carrier services that were already running on the previous CCR1072 for years with no issues. The upstream port that does firewalling is set to “L3-HW-OFFLOADING=no” as expected for this type of config. The rest of the ports, and the downstream port for the downstream AC router (CCR2216) are set to HW Offload. Right now, everything is working perfectly with L3HW-Offloading turned off and all traffic processed by the CPU, which is not sustainable given the very high CPU usage and no room for growth. But as soon as we turn on L3HW Offloading at the switch level, CPU drops as expected, connection fasttracking is offloaded as well, and here is where the weirdness starts…
Some routes, mostly /32 routes pointing to the downstream AC router will become unreachable from the upstream port to the Core Router (which is still a CCR1072 btw), but they are still reachable from the AGG (traceroute will reach it at the downstream AC). As minutes pass, the fasttracked connections will gradually drop from the ASIC and go back to CPU, until there is no benefit from running L3 offloaded fasttrack. And lastly, some customers at the AGG dhcp server (100.64.0.0/19 cgnat local connected route) will not receive incoming UDP 5060 port packets for SIP VoIP services. But the local pppoe connected routes at this router are not affected by anything at all.
And again, we turn off L3HW-Offloading at the switch, and voila, everything gets fixed as soon as the CPU takes over.
1 Depends on the complexity of the routing table. Whole-byte IP prefixes (/8, /16, /24, etc.) occupy less HW space than others (e.g., /22). Starting with RouterOS v7.3, when the Routing HW table gets full, only routes with longer subnet prefixes are offloaded (/30, /29, /28, etc.) while the CPU processes the shorter prefixes. In RouterOS v7.2 and before, Routing HW memory overflow led to undefined behavior. Users can fine-tune what routes to offload via routing filters (for dynamic routes) or suppressing hardware offload of static routes. IPv4 and IPv6 routing tables share the same hardware memory.
2 When the HW limit of Fasttrack or NAT entries is reached, other connections will fall back to the CPU. MikroTik’s smart connection offload algorithm ensures that the connections with the most traffic are offloaded to the hardware.
3 Fasttrack connections share the same HW memory with ACL rules. Depending on the complexity, one ACL rule may occupy the memory of 3-6 Fasttrack connections.
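To make notes 2 and 3 above a bit more concrete, here is a rough Python sketch of the trade-off they describe: ACL rules eat into the same pool as Fasttrack entries, and whatever is left goes to the busiest connections. This is only a toy model. The function name, the slot counts and the "sort by traffic" rule are assumptions for illustration; MikroTik's actual selection algorithm and table sizes are not described in this thread.

```python
def pick_hw_connections(conn_rates, hw_slots, acl_rules=0, slots_per_acl=6):
    """Split connections into (hardware, cpu) lists.

    conn_rates:    dict of connection id -> observed rate (any unit)
    hw_slots:      hypothetical size of the shared HW table, in Fasttrack entries
    acl_rules:     number of ACL rules sharing that table
    slots_per_acl: assumed cost of one ACL rule (the notes quote 3-6 entries)
    """
    # ACL rules are charged first; whatever remains is available to Fasttrack.
    free = max(hw_slots - acl_rules * slots_per_acl, 0)
    # Busiest connections first, so the hardware carries the most traffic.
    ranked = sorted(conn_rates, key=conn_rates.get, reverse=True)
    return ranked[:free], ranked[free:]


# Made-up connection ids and rates, purely for illustration.
conns = {"a": 900, "b": 50, "c": 400, "d": 10}
hw, cpu = pick_hw_connections(conns, hw_slots=8, acl_rules=1, slots_per_acl=6)
print("hardware:", hw)
print("cpu:", cpu)
```

With one ACL rule charged at 6 entries, only 2 of the 8 hypothetical slots remain, so the two busiest connections are offloaded and the rest stay on the CPU.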
Yes, how many routes fit into the switch ASIC’s TCAM for l3hw depends on the kind of routes and the specific TCAM implementation. There is AFAIK also no way to show the available TCAM space on MT switches with l3hw.
Still, it is a reasonable expectation that routes are loaded into the switch ASIC’s TCAM as far as they fit, and the rest is handled by the CPU. What we saw in our tests with l3hw on CCR2x16 is that routing in general starts to struggle once the l3hw TCAM is full and ROS has to dynamically decide which routes are handled by the CPU and which are handed off to the switch ASIC. This is admittedly no trivial problem to solve, but as long as this is an issue, it makes running l3hw risky.
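For anyone wondering why "fill the TCAM as far as it fits" is not trivial, here is a small Python sketch of the behaviour described in the quoted documentation note (longer prefixes offloaded first, shorter ones left to the CPU). It is a toy model under loud assumptions: every route costs exactly one hardware slot, the capacity number is invented, and whole prefix-length groups are kept together. The real cost per prefix varies, and the real RouterOS logic is not public.

```python
import ipaddress


def split_routes(prefixes, hw_capacity):
    """Return (hw, cpu) lists of networks, offloading longest prefixes first.

    Assumption: each route costs one HW slot. A prefix length is offloaded
    only if all routes of that length fit, so the hardware never holds a
    route while a more specific route exists only on the CPU (which would
    break longest-prefix matching).
    """
    by_len = {}
    for p in prefixes:
        net = ipaddress.ip_network(p)
        by_len.setdefault(net.prefixlen, []).append(net)

    hw, cpu = [], []
    used = 0
    full = False
    for plen in sorted(by_len, reverse=True):  # /32 first, /0 last
        group = by_len[plen]
        if not full and used + len(group) <= hw_capacity:
            hw.extend(group)
            used += len(group)
        else:
            full = True  # once one length spills, all shorter ones go to CPU
            cpu.extend(group)
    return hw, cpu


# Illustrative IPv4 routes only; the capacity of 3 is made up.
routes = ["0.0.0.0/0", "10.0.0.0/8", "100.64.0.0/19",
          "100.64.1.10/32", "100.64.1.11/32", "192.0.2.0/24"]
hw, cpu = split_routes(routes, hw_capacity=3)
print("hardware:", [str(n) for n in hw])
print("cpu:", [str(n) for n in cpu])
```

Even in this simplified form, every route add or withdraw can move the cut-off between hardware and CPU, which gives a feel for why a full table with thousands of churning /32 routes is a hard case.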
Yes, we caught that, but in this case, what happens is that it eventually drops to 0 fasttracked connections on the L3Hw-offloading monitor (we have been using it with CLI, the winbox monitor was just added on this last version, and shows the same). I can understand that a limited amount will be candidate for offloaded fasttrack, but it drops until it doesn’t work at all.
Correct, we have been using the CLI tool until 7.18, and the GUI tool still shows the same. When offloading is started, it reports around 3500 fasttracked connections and CPU drops to 0% - 1%, but the routes problem drops at least half the traffic of the downstream AC, and the HW fasttrack count keeps dropping until it reaches 0. You can see the CPU rise as it drops, until there is no more HW fasttrack. We are evaluating building a loaded x86 server to run RouterOS and replace the CCR2216 while this gets sorted out, which will probably take a long time.
talking about Offloading Fasttrack Connections in L3 hw offload scenarios:
are you aware that interfaces (/interface/ethernet/switch/port) involved in NAT need to be configured with l3-hw-offloading=no ?
In general just common carrier services that were already running on the previous CCR1072 for years with no issues. The upstream port that does firewalling is set to “L3-HW-OFFLOADING=no” as expected for this type of config. The rest of the ports, and the downstream port for the downstream AC router (CCR2216) are set to HW Offload.
Quoting part of my original post, which is probably a little bit too long, TL;DR which is understandable. But yes, the config follows the expected parameters for linux bridge approach with vlan filtering. It works perfectly on CPU, but breaks when handed off to the ASIC with L3HW enabled.
I was reading the changelog for the new 7.19, and there seems to be a good amount of new stuff for *bridge, so I’m keeping my fingers crossed to see if I get lucky with this one.
Any packets with a label will hit the router’s CPU. The only saving grace is the limited MPLS/VPLS FastPath support that will somewhat accelerate the software-based forwarding.
Because of this, CCR2116 and CCR2216 are really only suitable for very basic ISPs doing pure L3 forwarding. As soon as you start to scale to more than a single AS ingress/egress point, you will need MPLS for Traffic Engineering, and Mikrotik don’t currently have a router that can provide high-performance MPLS Push/Pop operations.
In theory Mikrotik could add ASIC offload for MPLS label operations on the CCR2116/CCR2216, but as the Marvell Prestera CX/DX switch ASICs are designed for Datacentre/Campus switching, I expect that when used for ISP edge use cases there will be issues with the number of prefixes in the ASIC’s forwarding database, and with the speed of adding/updating/removing prefixes/labels in the ASIC’s FDB. There are also a number of issues with MPLS on the Marvell Prestera CX/DX ASICs which I expect that Mikrotik would also experience if they added support.
Yup, we are aware of this, and I would LOVE that they would add more VPLS/MPLS offloading capabilities (RFC2544 fails on VPLS over CPU). But in our case, we use LDP/MPLS solely for loopback VPLS transport capabilities, it is filtered for loopback addresses only, so no customer or backbone traffic runs over MPLS on our network. In the extensive testing we’ve done trying to figure out what is causing this issue, we even completely disabled LDP/MPLS with absolutely no change in the outcome. Like I mentioned before, offloading does what it is expected to do (drops cpu usage over to the ASIC), but breaks routes in the system.
One curious detail we noticed, is that each time l3hw offloading is enabled, it always breaks different routes. With each ON/OFF cycle, it affects a different group of customers that becomes unreachable. It would appear that when the routing table gets sent to the ASIC, some of the data is incorrect or corrupted. I’m just speculating here. But as I mentioned before, as long as everything stays in the CPU, no issues at all, but CPU usage is high and we cannot keep it like this long term, much less scale for growth.
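One way to turn the "different routes break each time" observation into something that can be attached to a support ticket is to record the unreachable /32s after each ON/OFF cycle and measure how much the sets overlap. Below is a minimal Python sketch of that comparison. It assumes the unreachable addresses are collected separately (for example by a ping sweep, not shown here), and the run labels and addresses are made up.

```python
from itertools import combinations


def jaccard(a, b):
    """Overlap of two collections as |A and B| / |A or B| (1.0 = identical)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def compare_cycles(cycles):
    """cycles: dict of run label -> iterable of unreachable addresses.

    Returns a dict mapping each pair of run labels to their overlap score.
    """
    return {(x, y): jaccard(cycles[x], cycles[y])
            for x, y in combinations(sorted(cycles), 2)}


# Hypothetical results from three offload ON/OFF cycles.
cycles = {
    "run1": ["100.64.1.10", "100.64.1.11"],
    "run2": ["100.64.1.11", "100.64.2.20"],
    "run3": ["100.64.3.30"],
}
for pair, score in compare_cycles(cycles).items():
    print(pair, round(score, 2))
```

Consistently low overlap between runs would back up the impression that the breakage is not tied to specific prefixes, while high overlap would point toward particular routes or next hops. Either way it is only supporting evidence, not a diagnosis.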
If you have not already done so, I would open a support ticket with Mikrotik, work with the support team and if the “root cause” is not found quickly, then ask your local distributor to escalate the ticket to make sure it is taken seriously.
the amount of problems reported in this topic simultaneously (l3 hw off, fasttrack hw offload, mpls) leads me to think that you are combining too many roles on the same device
maybe you must consider in your design the possibility to segregate/separate different roles/functions
the divide and conquer principle
if you are not mixing features in the same machine i think you must create a separated topic for each specific issue to make it clearer
Not really. It might look that way, but no, it is just a common AGG/BNG router. Our network is pretty segmented by function. The Edges have their BGP for the peering/IP-transit role only, the cores are for internal BGP and OSPF routing only, and Aggregators and Access routers deal with CGNAT and customer transit as expected. Nothing is really at a point where there is a box running too many roles, and this is the same setup that has been running trouble-free for at least the last 5-6 years on the CCR1072 routers we are replacing.
Like I said, MPLS is just there for VPLS if needed, and it is filtered for loopback addresses only, and there was no difference with it being turned off. I don’t see how this would be too many roles for an agg router. The current functions on this AGG are:
OSPF with BFD for route propagation.
MPLS/LDP for loopback addresses only in case VPLS service is required for a customer.
Cgnat for the downstream AC routers networks (we also tried with NO nat/firewalling at all, all ports hw offloaded, same problems).
The exported .rsc config for this router is amazingly short. The previous CCR1072 was running all this stuff like nothing, but those units are at least 5 years old, and we went for the upgrade since the CCR2216 was a “CCR1072 drop in replacement”.
Oh well… As for the support ticket that someone else asked about, we opened one 3 weeks ago, no reply…
i think it was a marketing statement and in the perspective of product segmentation it can be true
but
ccr1072 and ccr2216 are fundamentally very different machines, only if a plethora of very specific factors align you can expect it to be a direct swap and replace operation
Yep… What sold us was the ASIC capabilities on the CCR2xxx series, even though the CPU is inferior to the 1072. In the end, the hardware is more than capable of getting the job done, or at least on paper it is… But the software doesn’t want to cooperate with us. Just by looking at the block diagram, they are completely different beasts, but the CCR2216, with the ASIC in the middle, should blow the 1072 out of the water. Like I said, on paper… Software needs to back it up.