This issue has been the bane of my life for the last year and a half or so. I thought I'd post my experience here, as it's always useful for others to be able to relate.
We were previously running 1036s but "upgraded" to 1072s as a future-proofing measure. These were put in with identical configuration, with the exception of the following:
- Slight change to VRRP priority
We never saw a single unexplained reboot on the 1036 platform, but almost immediately after switching to the 1072s we began to see sporadic watchdog reboots. These could be anywhere from a few hours to a few months apart, so it was really difficult to identify a cause. One trait we did identify was that the reboots only occurred while the network was in use (we didn't see a single out-of-hours reboot).
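For anyone wanting to check for the same pattern on their own routers, here's a rough Python sketch of the kind of tally I mean. The 08:00 to 18:00 weekday window is just an assumption for illustration, and the timestamps are made up rather than taken from our devices; you'd substitute reboot times pulled from your own logs or monitoring.

```python
from datetime import datetime, time

# Assumed business-hours window (hypothetical; adjust to your own site).
BUSINESS_START = time(8, 0)
BUSINESS_END = time(18, 0)


def in_business_hours(ts: datetime) -> bool:
    """True if ts is on a weekday between BUSINESS_START and BUSINESS_END."""
    return ts.weekday() < 5 and BUSINESS_START <= ts.time() < BUSINESS_END


def summarise_reboots(timestamps):
    """Return (in_hours_count, out_of_hours_count, gaps_in_hours)."""
    ordered = sorted(timestamps)
    in_hours = sum(1 for ts in ordered if in_business_hours(ts))
    # Time between consecutive reboots, in hours.
    gaps = [
        (later - earlier).total_seconds() / 3600
        for earlier, later in zip(ordered, ordered[1:])
    ]
    return in_hours, len(ordered) - in_hours, gaps


# Made-up example timestamps, NOT real data from our routers.
example = [
    datetime(2022, 3, 1, 10, 15),
    datetime(2022, 3, 1, 16, 40),
    datetime(2022, 5, 12, 11, 5),
]
in_hours, out_of_hours, gaps = summarise_reboots(example)
print(f"in hours: {in_hours}, out of hours: {out_of_hours}")
print("gaps between reboots (hours):", [round(g, 1) for g in gaps])
```

If every reboot lands in the in-hours bucket over a long enough period, that at least supports the idea that the reboots are tied to traffic rather than to something time-based like a scheduled job.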
A case was opened with support but, predictably, it went unresolved, with the suggestions being almost verbatim what others have posted here (it seems they've seen this enough to have a copy/paste response ready). Things that were suggested (and tried):
- Upgrade to the latest Long Term firmware.
- Disable watchdog reboot and see if anything is echoed to the serial console.
- Upgrade again to the now latest Long Term firmware
- Abandon the Long Term tree and install the latest Stable firmware
It was obvious they had no idea what the issue was and that no in-depth investigation was going to take place, so I pretty much gave up on getting a solution from support.
Off my own bat, I tried the following:
- Rolled back firmware to the version we were previously running on the 1036s before we had experienced any reboots
- Reverted VRRP priority changes that were made when installing the 1072s
- Set up verbose logging for various firewall features
- Minimised SNMP monitoring on the device, as I could see a lot of activity in the verbose logs and wasn't sure if SNMP was single-threaded, leading to a single core becoming overloaded
- Various other config tweaks in a desperate attempt to find something that was causing the behaviour
Having subsequently discovered this thread, I'm certain that this is a 1072 platform issue rather than a configuration/load issue (we are running connection tracking, but we were previously as well). This was suspected from the beginning, but we had no way to validate it without reverting to the 1036s which, given subsequent network expansion, wasn't the preferred option.
We've purchased some 2216s as a potential "solution" to this problem; however, step one is to get to ROS7. We've reluctantly upgraded to 7.5, and the plan now is to evaluate stability on the 1072s, as we'd like to keep using them if they'll remain stable. If we see another reboot, we'll be swapping them out for the 2216s.
Initial stability and performance seem good, although it's early days. CPU usage appears enormously improved in 7.5. Comparing one day to the next, with similar traffic profiles on both days, CPU usage is hovering at around 5% vs around 30% pre-upgrade.
Screenshot 2022-09-23 185148.png
I'll update on this thread again when we have some more long-term information on stability.