CRS326-24S+2Q+RM issues with L3 hardware offload

We have had 2 instances of a CRS326-24S+2Q in our network where L3 hardware offload broke in a very weird way.

Here’s the scenario:

  • It had been running ROS7.9 for more than a month, and L3 hardware offload was stable
  • It runs OSPF (PTP) to learn neighbor loopback addresses, and BGP for the default route
  • We are using 40G for the uplink

It started with the switch not forwarding any traffic to certain destinations, even though there was an ACTIVE, HW OFFLOADED route for each of them and the return path from the destination had the same. To mitigate the problem, we took its downstream path out of service by disabling the entries in the IP ADDRESS section, just so we wouldn’t break L3 hardware offload. A few hours later, the switch crashed. Prior to that, I noticed high CPU usage by the networking process in the profile. After it rebooted, this was the scenario:

  1. No traffic being sent by the switch
  2. Bridge interface is running (/int bridge)
  3. The bridge VLAN interface (/int vlan), configured with a 9000 IP MTU, isn’t running. There are 3 fields in this section: MTU, Actual MTU and L2 MTU. L2 MTU is set to 10K and MTU was manually configured to 9000, even before the crash, yet Actual MTU was set to 1500. That’s the weird part: why would it use 1500? After flapping the VLAN interface, it started to set the Actual MTU to 1500.
  4. Traffic started to work afterwards. However, for our peace of mind, we rebooted it.
  5. After the reboot, issue #3 came back. We deleted the VLAN interfaces (/int vlan), deleted the bridge, disabled L3 HW offload and deleted the L2 VLANs.
  6. Re-added the bridge, L2 VLANs, L3 VLAN interface and IP address
  7. Rebooted the device
  8. Re-enabled L3 hardware offload
  9. Traffic works
  10. Rebooted again. After the reboot, issue #3 was back
  11. Upgraded to ROS7.10.2. Post-upgrade, same issue as #3
  12. Wiped the config using system reset-configuration no-defaults=yes
  13. Loaded the backup config; issue #3 appeared again
  14. Downgraded back to ROS7.9; issue #3 appeared again
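
For anyone who wants to compare against their own unit, the three MTU fields from step 3 can also be read from the CLI. This is only a rough sketch: vlan100 is a placeholder, not our real interface name, and the exact output may differ between ROS versions.

```
# Lists every interface with mtu, actual-mtu and l2mtu side by side
/interface print detail

# Only the L3 VLAN interface (vlan100 is a hypothetical name)
/interface vlan print detail where name="vlan100"

# The bridge the VLAN interface sits on
/interface bridge print detail
```

In our case the thing to look for is mtu=9000 next to actual-mtu=1500 on the VLAN interface.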

After numerous rounds of trial and error, we keep getting the same problem. Note that all physical ports, and the bridge itself, are configured with the highest L2 MTU they support, so there’s no reason why a 9000 IP MTU should not work.

Right now, the switch is only doing L2 switching, and all the routing problems we had have disappeared.

Note that this has already happened twice in our network, on the same device. There were no changes on that switch that could have caused hardware offload to stop working. We can’t see anything that could have triggered this problem. No changes or network events match the timestamp. It just stopped working.
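
If it helps with comparing notes, these are roughly the places to look when checking the offload state and CPU load. Treat it as a sketch under the assumption of a stock ROS7 CLI; the output columns and flags may look different depending on the version.

```
# Switch-level setting: look for l3-hw-offloading=yes
/interface ethernet switch print

# Per-port L3 hardware offload setting
/interface ethernet switch port print detail

# Routing table: the H flag marks hardware-offloaded routes
/ip route print

# Live CPU usage per process (where we saw "networking" running high)
/tool profile
```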

Has anyone had similar problems?