Right, got an interesting problem and I’m after some ideas on what to test when I go back on the weekend.
Situation:
I’ve got a pretty simple network: six or so remote networks connecting to a central node over the internet, probably 25 devices in total. Over the last 10 days I’ve had 3 die that I could confirm. It now looks like about 3 more have gone, but they’re remote and I can’t confirm until the weekend. All but one of the dying units are in the same network.
Death looks like this. The unit powers on, but none of the link lights come up.
I’ve been able to Netinstall two back to life. The third unit I tossed, but that might have been a bit hasty.
I don’t see any abnormal CPU usage, scripts, users etc. I have a script that does automatic upgrades weekly, so they stay pretty current. The network where they are dying is behind a CG-NAT system, which is pretty tough to get into, so while I can’t rule it out, I see no evidence they have been hacked.
A hack coming from the wireless side is unlikely; this place is in the arse end of nowhere.
So I’m after ideas as to what might be going on and how to test it.
I’m going back on Saturday, so we have a few days to think. I can still see some of these units remotely, so we can do some testing in the meantime.
One idea I had was to take the units I’ve Netinstalled and see if they still work after being off power for a few days.
I'd stop doing the automatic upgrades. Currently it seems that upgrades go wrong more often than we'd all want (and way more often than they used to up to something like 6.47 or so).
Lol Rex, that first pic didn’t show until I started on the reply, nice…
MKX, I certainly respect the opinion, but I’m not getting these issues when doing the upgrade which happens Sunday at 0100hrs. It seems to be on a power cycle at some other random time.
M and K, I haven’t turfed the power adapter for the device I tossed so I’ll check that tomorrow.
Automatic software upgrades are not recommended… on almost ANY platform.
While that’s probably not what’s killing your tiks, it probably doesn’t help. I’d verify that your power at the site is within the expected range and is protected with either a UPS or another battery solution.
I remoted in and found it was only one dead device at the remote network. It was/is feeding PoE power to another device, hence it looked like two dead.
I managed to Netinstall remotely to bring the dead device back to life, so I’ll see how I go with that.
For the unit that was at work, I tested its power supply and it was fine: DC volts good, only 4 mV AC. I didn’t test it under load, so I might still do that.
The two devices that I had Netinstalled were both fine after being powered off for a few days.