We are having many failures of RB433 and RB411 boards. The boards all run 5.14 with an updated boot loader. Boards boot fine a couple of times, then stop working. The boot loader runs, but the NAND is corrupt, and we have to reflash with Netinstall. It happens with a variety of radios (RB52Hn, XR5, RB52H); it does not matter which. We now have about 10 failures in the last month. Big issue. Power is Ubiquiti 24 V PoE.
Anyone have ideas? Customers are pissed. We have tried everything.
I had one board upgrade to 5.12 with odd problems; I downgraded to 5.11 and the kernel locked up. I Netinstalled 4.17, updated the firmware, installed 5.11, and it works fine now.
I don’t know if these are related or not, but I’m curious.
This is what we see from the boot loader after a board fails. Somehow the NAND is corrupted. These are brand-new boards from the factory. All we do is install 5.14 if it is not already on and update the boot loader to 2.39. If you pull power a few times, the boards will eventually crash.
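For reference, the prep we do on each unit is nothing special. Roughly this, from the v5 console (the exact print output will vary by board):

/system package print
/system routerboard print
# after uploading the 5.14 combined package to Files and rebooting to install it,
# stage the boot loader that ships with the installed ROS (2.39 in our case)
# and reboot once more to apply it
/system routerboard upgrade
/system reboot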
Boards are from different build lots. The only thing they have in common is 5.14. We do not have this issue with other versions of ROS. We have had 10 board failures this month.
It is not related to the power supply, as we have tried several with the same results. It might be related to higher-power miniPCI cards. We see this with the XR5, XR3, and RB52Hn, but not with the RB52n. At least we have not replicated it with a lower-power card. Might just be luck.
RouterBOOT booter 2.39
RouterBoard 433AH
CPU frequency: 680 MHz
Memory size: 128 MB
Press any key within 2 seconds to enter setup..
loading kernel from nand… OK
setting up elf image… OK
jumping to kernel code
Starting…
/etc/rc.d/rc.sysinit: 32: cannot create /var/run/utmp: Directory nonexistent
System halted.
What we are seeing is an increasing bad-block count reported under system resources. It appears that with 5.14 and 5.15, every power-off reboot (pulling power) causes incremental NAND corruption. This continues until it reaches a critical mass of 3 or 4%, at which point the system loses some critical files and has to be reimaged using Netinstall.
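For anyone who wants to watch the same counter, it is visible from the console as well as in Winbox. The value below is only an illustration, not from any particular board:

/system resource print
# the field we are tracking is "bad-blocks", reported as a percentage, e.g.
#   bad-blocks: 0.3%
# the same value can be read in a script with
:put [/system resource get bad-blocks]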
We thought that maybe it was related to SMB, so we disabled all shares and users and made sure SMB was off (it already was). This made no difference.
We thought maybe it was the web proxy store, which is created by default, so we deleted it. This made no difference.
We are not writing any logs to disk.
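In case anyone wants to rule out the same things, the equivalent console commands are roughly as follows (v5 tree; exact defaults vary by build):

/ip smb print
/ip smb shares print
/ip smb users print
/ip smb set enabled=no
# web proxy off and no on-disk cache; if the store shows up under /file it can
# be removed there as well once the proxy is off
/ip proxy set enabled=no cache-on-disk=no
/file print
# confirm no logging rule points at the disk action, only memory/echo
/system logging print
/system logging action print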
We have no idea what is causing this, but it is definitely happening. Anyone else watching their bad blocks and seeing them increase under 5.14?
I had similar issues with 3 boards. I Netinstalled them a few times until I finally got 0 bad blocks, and now they are working fine. These boards showed 3 percent bad blocks the day I got them, brand new. They crashed, and every time I Netinstalled them the bad blocks increased, eventually going up to 60 percent, but then all of a sudden I Netinstalled again and the bad blocks disappeared. The boards have been running fine since.
We know that we can “fix” the problem by reformatting the flash or using check-disk, but this is not a solution. We sell a lot of units to end users and we are having failures in the field. We need to find the root cause so that we can eliminate the issue, not put a band-aid on it.
The boards that are failing are all built on the 94V-0 1111 PCB and use the Samsung K9F1208U0C flash. We have seen no issues with boards built using the Hynix chips.
We can confirm, 100% of the time, that for boards built with the Samsung flash, a power-off restart causes incremental flash corruption: about 0.1% per reset for the first 10 or so resets, then it starts escalating.
We are going to try downgrading to 4.17 and will also look at downgrading the boot loader.
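For anyone trying the same thing, the downgrade itself is just the stock procedure; sketched below, using the usual mipsbe combined npk name for these boards:

# upload routeros-mipsbe-4.17.npk to Files, then have the router install the
# older version on the next reboot
/system package downgrade
# after it comes back up, check which RouterBOOT is actually running
/system routerboard print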
Tried downgrading to 4.17 and the number of bad blocks held at 11.
One thing I noticed was that the system was continuing to hash a DSA SSH key at every reboot. I don’t recall this being the case; the SSH key should only be hashed when the file system is wiped or the system is reinstalled.
I upgraded back to 5.15. Now the number of bad blocks is 0, and the DSA SSH key is no longer being hashed. I think the problem might be related to hashing the SSH key, or perhaps the ongoing corruption is hitting the area where the key is stored. In either case, downgrading and then upgrading seemed to fix the problem.
Does this mean that we have to do this for every system we receive in order to ensure that the file system is not going to corrupt? Major pain if that is what we have to do.
MT support has not responded to my trouble ticket. Any ideas?
There may be a correlation between having Winbox open at the time of the power-off reboot and corruption. When the system is attached only via serial and the console is open, there appears to be no corruption on reset. However, if a Winbox session is open at the time of the reset, we see a 0.1% increase in bad blocks on each reset.
This only happens with 5.14 and 5.15, not with 4.17.
Something seems to be writing to the file system even when you are not saving anything. We have exhausted our ability to diagnose this any further. This happens with all default values.
Here is a repetitive set of reboots. Note that each reboot increments bad blocks by 0.1%. A warm reboot seems less likely to corrupt the flash than a hard power-off.
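If anyone wants to repeat the test, a one-shot scheduler entry along these lines records the counter after each boot (the entry name is arbitrary; bad-blocks is the same resource field we have been quoting):

/system scheduler add name=bb-after-boot start-time=startup \
    on-event=":log warning (\"bad-blocks after boot: \" . [:tostr [/system resource get bad-blocks]])"
# the default memory log is lost on power-off, so either note the value over
# serial each cycle, or point a logging rule at a remote syslog box with
# /system logging so nothing extra gets written to the board's own flash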
We have now confirmed that RB433 boards built with the Samsung NAND and DTC RAM combination are bad. In every single case, regardless of OS version (4.17, 5.14, or 5.15), every power reset causes incremental bad blocks. We have replicated this problem on 15 different boards from inventory.
This does not happen with Samsung NAND paired with Hynix or other memory. Memory tests run from the boot loader report no errors.
Once the NAND corruption reaches critical mass, the board ceases to boot.
Yes, I can also confirm this on 433AH boards. I have had more than 20 failures after the 5.9 update.
No matter what I downgrade to, the failures continue.
Tomorrow I have to change 2 boards. Every 6-8 hours they stop responding, and this started to happen after updating from 5.9.
This DOES seem like a serious issue, and you’ve spent quite some time tracking it down. Please contact the support team with that information; this has to be fixed.
We have sent support full documentation of the failures and all the details. They say they want more information. We suggested that they sample RB433AH boards built with DTC memory and Samsung NAND to verify our test results and then, if appropriate, issue a recall on the boards.
This is a very serious manufacturing failure. It is not easy to detect or diagnose. It was only through weeks of trial-and-error testing that we were able to nail this down to a root cause. It should not be the customer’s responsibility to troubleshoot MT’s boards for them.
While RAM and NAND are commodity items, this proves that not all commodity suppliers are equal. Since I could not find any details on who DTC is, my suspicion is that they are a clone factory in China with poor quality control. My suggestion is to go back to Hynix or other brands that have proven to be reliable.
This is a customer-relations disaster for us, and we are trying our best to repair the damage done. Not easy. What we need now is for MT to acknowledge the problem and fix it.