Two days ago we installed a pair of new RB1100Dx4 Dude Editions (running 6.42.1) to replace some old RB1100AHx2s, and the new units had been running fine since. That was until today: one of the units has stopped responding on SNMP, SSH and Winbox entirely (whilst the other is still running OK). We can still ping all IPs, it’s still forwarding traffic just fine, and BGP and OSPF sessions are all still up - we just can’t SSH/SNMP/Winbox to it. SNMP occasionally responds, but with odd results (e.g. system temperature came back as -240 degrees, and uptime as 0).
We have syslog running, and it’s even still exporting logs to it - but nothing out of the ordinary is mentioned. CPU wasn’t high just before it happened either (last successful poll was about 15%).
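For anyone polling these units, bogus values like the ones above can be a useful early warning in their own right. Below is a minimal Python sketch of a plausibility check on polled readings. The field names (`temperature_c`, `uptime_s`) and the temperature limits are hypothetical, not taken from any MikroTik MIB or tool, and it assumes your poller already hands you the values as plain numbers.

```python
from typing import Dict, List

# Hypothetical plausibility limits (assumption): adjust for your own hardware.
TEMP_MIN_C = -40.0
TEMP_MAX_C = 125.0


def implausible_readings(sample: Dict[str, float]) -> List[str]:
    """Return the names of readings that look bogus rather than merely unusual.

    `sample` is a dict of already-polled values; the key names are
    illustrative only and do not correspond to real OID names.
    """
    problems: List[str] = []

    temp = sample.get("temperature_c")
    if temp is not None and not (TEMP_MIN_C <= temp <= TEMP_MAX_C):
        problems.append("temperature_c")

    # An uptime of zero (or less) on a router that is forwarding traffic
    # suggests the SNMP agent is returning garbage.
    uptime = sample.get("uptime_s")
    if uptime is not None and uptime <= 0:
        problems.append("uptime_s")

    return problems
```

Feeding it the readings described above, `implausible_readings({"temperature_c": -240.0, "uptime_s": 0})`, flags both fields, while a normal sample returns an empty list. It only tells you the agent is misbehaving; it says nothing about why.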
Has anyone ever seen this happen before? It’s almost as if all management processes have crashed… Our next step is to physically power cycle it.
Yes, twice. I logged a ticket with support but have not been able to respond to their reply yet. They asked for a supout file, but in both cases I could not make one while the problem was occurring (in one case because I could not access the router, and in the other because generating the supout file from Winbox or the console produced an error), and I have not had an opportunity to send the supout files that I did make after the fact.
Try to get to the console and generate a supout file before you power cycle the router, if you can, and send it to support. In the body of the message (not subject) you may want to reference my ticket number 2018041922005816, because this is almost certainly the same problem I observed.
I also deployed 2 non-Dude x4 versions around the same time and did not have a problem, but that may just be a coincidence. So far it seems to affect only the Dude versions. All routers were running 6.42 or 6.41.x at the time of the problems. All have since been upgraded to 6.42.1.
Rebooting both routers did fix the problem. One I had to power cycle more than once to get it to respond. That one was programmed but not deployed, so I netinstalled it with 6.42.1, and restored a backup to it (I edited this post to correct this statement. I originally said I had programmed it manually to restore settings after the netinstall but that was a different router).
Thanks for the info. It’s worrying that it’s happened, but I’ll try to console into it for a supout dump. Hopefully it’s just a software bug that can be patched, rather than us having to rip this out again.
So we consoled into it, which was a dead end - it asked for a login, but once we entered it the console just went blank and stopped responding. After that, nothing came up on the console (not even the login prompt for further connections).
Only option was to power cycle it, and it’s since come good. Going to keep a close eye on it to see if this happens again…
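One way to keep an eye on it without relying on SNMP is to probe the management ports directly, since ping and forwarding stay up during the lockup. This is a rough Python sketch, not a tested monitoring setup. It assumes SSH and Winbox are on their default TCP ports (22 and 8291); the function names are made up for illustration. Note that a successful TCP connect does not prove the service behind it is healthy (our console still showed a login prompt before hanging), so treat "open" as a weak signal and "closed" as the strong one.

```python
import socket
from typing import Callable, Dict

# Default ports (assumption): change these if the services were moved.
MGMT_PORTS: Dict[str, int] = {"ssh": 22, "winbox": 8291}


def tcp_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


def check_management(
    host: str, probe: Callable[[str, int], bool] = tcp_open
) -> Dict[str, bool]:
    """Probe each management port and return a name -> reachable map."""
    return {name: probe(host, port) for name, port in MGMT_PORTS.items()}


def management_down(status: Dict[str, bool]) -> bool:
    """True when none of the probed management services accepted a connection."""
    return not any(status.values())
```

Run from cron against the router's management IP, `management_down(check_management(host))` combined with a successful ping would match the state described in this thread: traffic still flowing, management gone.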
I just experienced the same thing on a unit running 6.43.8 with the latest boot loader. Management is locked up but it is still natting traffic for me. I may try getting the support file if I can; otherwise I will likely remove the internal SSD (which we are not currently using).