RB2011 watchdog stuck after upgrade

Hello,

During the last 2 weeks I’ve upgraded several RB2011 routers to ROS 6.38.5 and the latest firmware (3.33). All of them dial an OpenVPN connection to a central server for management purposes. All of them also have a watchdog rule to reboot if they cannot ping the VPN management IP.
Yesterday, that management IP became unavailable for a minute and 3 of the routers became unresponsive. They were answering pings, but that’s all; I was no longer able to log in to them and they were not routing traffic. We have set up a syslog server, and I could see there that all 3 of them were sending this to the log every 10s before fully crashing: “cannot ping address , rebooting”.
The only thing that helped was a power cycle, and I noticed the boot time was fairly long (about 1 minute) compared to how quickly they boot after I issue a reboot command.
On one of the routers I could see this line in the log after it was rebooted: router rebooted/kernel failure in previous boot.
The routers have been running the same basic configuration (the notable differences are the LAN range and the OpenVPN certificates) for over a year. We did not make any changes lately, except upgrading ROS from various 6.3x versions. We have frequent watchdog reboots due to some issues with our uplinks (that’s why the watchdog is essential in our setup), but this never happened before; there are about 15 routers that have been running the same basic configuration for 1-2 years and I’ve never had a single router freeze like that, and now there are 3…
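For reference, a minimal sketch of the kind of watchdog rule described above, assuming the standard RouterOS 6.x `/system watchdog` settings. The address 10.8.0.1 and the 5-minute delay are placeholders for illustration, not the actual values from these routers.

```
# Hypothetical values: 10.8.0.1 stands in for the VPN management IP.
# no-ping-delay is how long after boot the router waits before it
# starts pinging watch-address; automatic-supout=yes asks the router
# to generate a support file when the software watchdog reboots it.
/system watchdog
set watchdog-timer=yes watch-address=10.8.0.1 no-ping-delay=5m automatic-supout=yes

# Show the settings currently in effect.
/system watchdog print
```

With `automatic-supout=yes`, a support output file may be available after a watchdog-triggered reboot, which could help when investigating the kernel failure message.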

Is anyone else seeing this behavior? Any idea what I could do, besides disabling the watchdog?
I still have one router that I have not been able to power cycle yet, in case you can suggest any test I could run or any information I could extract after it reboots.

Thank you!

Rebooting and disabling the watchdog fixes the issue only temporarily. One day later, all the windows are empty and the log shows scrolling messages saying “timeout while waiting for program X”, where X=16,20,44,96.
The issue is happening on the same routers that had issues with kernel failure.
[Attached screenshots: Screen Shot 2017-04-04 at 22.10.08.png, Screen Shot 2017-04-04 at 22.10.28.png]

Hello,

I know this is an old topic, but I want to ask whether the problem happened again. Did you find the root cause of the problem?

Any further information would be appreciated.