Over the past couple years, we have seen an increasing number of devices spontaneously reboot, followed by the "rebooted with proper shutdown by watchdog timer" log entry. We aren't using the watchdog function to ping anything, so I can only assume this is a built-in or hardware watchdog triggering the reboot.
Needless to say, this is extremely disruptive--and especially intolerable for SIP users. We don't want to have to regress to 6.40 or earlier to increase the stability of these devices. Has anyone else been seeing this problem or, better still, found a solution for it?
FWIW, on most of our production RouterBoards, this never happens; it happens only with a few specific RouterBoards, in some cases on a daily basis (numbers are affected/installed):
10/13 OmniTIK 5 PoE ac (RouterBOARD OmniTIK PG-5HacD)
3/18 hEX PoE (960PGS)
3/5 PowerBox Pro (960PGS)
3/8 SXT SA5 ac (SXT G-5HPacD)
1/3 hAP ac (RouterBOARD 962UiGS-5HacT2HnT)
1/1 RB921GS-5HPacD r2 (921GS-5HPacD r2) mANT19S
So, 77% of our 13 installed OmniTIK 5 PoE ac devices have this problem; 26% of our 23 installed RB960PGS boards (hEX PoE and PowerBox Pro combined); 38% of our 8 installed SXT SA5 ac devices; etc. Other devices do this occasionally but it's rare.
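For anyone who wants to double-check those percentages or plug in their own fleet numbers, here is a rough Python sketch. The counts are copied from the list above; the variable and function names are just mine for illustration.

```python
# Affected / installed counts, copied from the list above.
counts = {
    "OmniTIK 5 PoE ac": (10, 13),
    "hEX PoE (960PGS)": (3, 18),
    "PowerBox Pro (960PGS)": (3, 5),
    "SXT SA5 ac": (3, 8),
    "hAP ac": (1, 3),
    "RB921GS-5HPacD r2": (1, 1),
}


def rate(affected, installed):
    """Percentage of installed units affected, rounded half up."""
    return int(100 * affected / installed + 0.5)


# Per-model rate of watchdog reboots.
rates = {model: rate(a, n) for model, (a, n) in counts.items()}

# hEX PoE and PowerBox Pro share the RB960PGS board, so pool them.
rb960 = [v for model, v in counts.items() if "960PGS" in model]
rb960_affected = sum(a for a, _ in rb960)
rb960_installed = sum(n for _, n in rb960)
rb960_rate = rate(rb960_affected, rb960_installed)

for model, pct in rates.items():
    print(f"{model}: {pct}%")
print(f"RB960PGS combined: {rb960_affected}/{rb960_installed} = {rb960_rate}%")
```

The pooled RB960PGS figure is 6 of 23 boards, which is where the 26% comes from.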
The OmniTIK 5 PoE ac is by far the most affected. The first of those started spontaneously rebooting a year or two ago. As we have upgraded ROS in the hope of clearing up these reboots, they have only become more frequent and have started occurring on other devices. In terms of configuration, for the most part the affected devices are running a handful of mangle rules and Queue Trees on some or all of their interfaces for SIP; bridging interfaces; and running nv2 in ap-bridge mode.
Sending supouts to MT won't help; I have already been told that supouts generated by watchdog timeouts don't include debug information that would be useful in determining the cause of the timeout. The only way to get useful debug information is to disable the watchdog, wait for the device to lock up, and physically power-cycle the device--NOT an option in a production environment.