Problems with units bricking during upgrades

I’ve recently begun systematically upgrading our wireless backhauls and site routers from various MT versions, ranging from 5.4 to 5.19, to 5.20 (the most recent version we’ve tested as stable). Out of about 50 wireless backhaul units upgraded so far, I’ve had 2 completely brick on me (as in: not detectable via MAC Telnet, not pinging by IP, not linking up via wireless). Both were running in station mode (we run all of our wireless units in WDS bridge mode), but I don’t have a record of what the pre-upgrade firmware version was. I do know both units I lost were RB435G, though that model probably represents at least 30% of our total installed base.

Both units I lost were connected to RB493G routers by Ethernet and powered by Pac Wireless PoE-24i supplies. Both were negotiating at 1 Gbps prior to the upgrade, and I believe the version of the PoE-24i they were using is the Gigabit-capable one. At least one of the two units had been having an issue where the Ethernet port would stop communicating with the RB493G router until the RB435G was hard power cycled, but the lockups had gone away after we changed some settings on the RB435G and RB493G.

I’m trying to do as good a post-mortem as I can while I wait for the climbing guys to get the 2nd unit down and bring the 1st unit back to me… anyone have suggestions on what to look at?

Some more information about the bricks we’ve had:

Units successfully upgraded: 112 (21 are RB435G units)
Units bricked: 3 (all RB435G units)

The three failed RB435G units were running, respectively: 5.18 in bridge mode; 5.18 in station-wds mode; and 5.19 in station-wds mode.
The successful RB435G units include 6 running 5.18, 1 running 5.19, and 14 running older versions (mostly 5.7 or 5.12).
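
For what it's worth, here is a rough back-of-the-envelope check of those counts, written as a small Python sketch. It computes the brick rate overall and for the RB435G alone, then asks how likely it would be for all 3 failures to land among the units that started on 5.18/5.19 if failures were unrelated to the starting version. The grouping into "recent" (5.18/5.19) versus "older" and the helper name `p_at_least` are my own choices for illustration, and with only 3 failures this is a hint at best, not proof of anything.

```python
from math import comb

# Counts taken from the figures posted above.
ok_total = 112        # units upgraded successfully (all models)
ok_rb435g = 21        # successful upgrades that were RB435G
bricked_rb435g = 3    # every bricked unit was an RB435G

# RB435G units split by pre-upgrade version (grouping is an assumption).
ok_recent = 6 + 1     # succeeded, started on 5.18 or 5.19
ok_older = 14         # succeeded, started on older versions
bricked_recent = 3    # failed, started on 5.18 or 5.19
bricked_older = 0     # failed, started on older versions

rb435g_total = ok_rb435g + bricked_rb435g
overall_rate = bricked_rb435g / (ok_total + bricked_rb435g)
rb435g_rate = bricked_rb435g / rb435g_total


def p_at_least(k, group_size, population, failures):
    """Hypergeometric tail: chance that at least k of the failures fall
    inside a group of group_size units, if failures hit units at random."""
    total_ways = comb(population, failures)
    upper = min(group_size, failures)
    ways = sum(
        comb(group_size, i) * comb(population - group_size, failures - i)
        for i in range(k, upper + 1)
    )
    return ways / total_ways


recent_total = ok_recent + bricked_recent
p_value = p_at_least(bricked_recent, recent_total, rb435g_total, bricked_rb435g)

print(f"overall brick rate: {overall_rate:.1%}")
print(f"RB435G brick rate:  {rb435g_rate:.1%}")
print(f"P(all {bricked_rb435g} failures in the 5.18/5.19 group by chance): {p_value:.3f}")
```

With these numbers the RB435G rate works out to 3 of 24 and the chance figure to about 0.059, so the starting version looks suspicious but the sample is far too small to rule out coincidence.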

At least one of the failed units (the one I have back in my hands) is kernel panicking:

RouterBOOT booter 2.41

RouterBoard 435G

CPU frequency: 680 MHz
Memory size: 256 MB

Press any key within 2 seconds to enter setup..
loading kernel from nand… OK
setting up elf image… OK
jumping to kernel code
execve: No such file or directory
Kernel panic - not syncing: Attempted to kill init!

The unit that’s outputting the above was upgraded from 5.19 and is booting off of the 2.41 boot loader.
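
If more units come back, a small script can check captured serial console logs for the same signature. The following is only a sketch in Python: the function name `classify_boot_log` and the `missing_init` flag are made up for illustration, and reading the `execve` error as "the kernel could not find the program it tried to start as init" is my assumption about what that line means, not something confirmed by MikroTik.

```python
import re

# Matches the panic line and captures the stated reason.
PANIC_RE = re.compile(r"Kernel panic - not syncing: (.+)")


def classify_boot_log(text):
    """Return a small summary of a captured boot log (hypothetical helper)."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    for i, ln in enumerate(lines):
        match = PANIC_RE.search(ln)
        if match:
            previous = lines[i - 1] if i > 0 else ""
            # Assumption: an execve ENOENT right before the panic means
            # the init program could not be found on the filesystem.
            missing_init = (
                previous.startswith("execve:")
                and "No such file or directory" in previous
            )
            return {
                "panic": True,
                "reason": match.group(1),
                "missing_init": missing_init,
            }
    return {"panic": False, "reason": None, "missing_init": False}


# Tail of the console output pasted above.
sample_log = """
jumping to kernel code
execve: No such file or directory
Kernel panic - not syncing: Attempted to kill init!
"""

result = classify_boot_log(sample_log)
print(result)
```

Running this over each unit's log would at least confirm whether every failure shows the identical pattern or whether some fail differently.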

I think rebooting first, before uploading the npk, is better than uploading the npk and then rebooting.

The first reboot may help clear out any bugs, memory leaks, or whatever else before the second reboot, which actually installs the new version.

The bricked boards can be reinstalled with NetInstall and work again.

Did you discover anything on the electronics side, either on the boards or on the power supplies they use?

What about humidity and temperature at the time of the upgrade?

Thanks.

All three of the failed units are showing “execve: No such file or directory” right before they kernel panic when I boot them up on the bench. All three of them successfully re-installed with NetInstall and were able to recover/keep the old firmware image. All three units were RB435GAUH units according to the silkscreening on the board, though they don’t all have exactly the same chips on the top surface of the board (I think probably memory? – two units have Nanya chips and one has Hynix chips).

I can provide the unit serial numbers if necessary.

These units were upgraded early to mid-morning, in Ohio weather, so warmer than freezing and colder than room temperature at surface level. All three units were mounted outdoors in the KamBox Kamfab enclosure that has a backplate with pre-drilled holes for a bunch of RB4xx units and the RB800. The box is metal, painted white. The units were 30-100m off the ground, so it is possible the ambient temperature was below freezing where the units were mounted while they were upgrading. As far as I remember, it was not raining on any of these days, so humidity was probably just coming off of the dew point.

When I read your reply, condensation comes to mind, though I don’t know if that sounds logical.