RB4011 and RB1100 AHx4 "bricks" randomly

Hi everyone,
Since the last week of May, a strange thing has been happening to the RouterBOARDs that I manage.
Issue: the router simply "bricks". What I mean by that is that you cannot connect to the router in any way (it stays on "logging in" and nothing more happens), APs that are connected lose all config from Dude, SNMP stops working, and so on.
At the same time, internet is working from computers which are connected to the switch. I can also ping the router.
This can be resolved only by a hard reset (power cable off/on).
The issue has happened only on RB4011 and RB1100AHx4 Dude Edition.
RouterOS: 6.44.3
This has never happened before.
These RBs are in different countries.

So far this has happened only once, but with every RB4011 we have and one RB1100AHx4.
Has anyone else seen this?

Hello! Do you have bridges in your network implementation? How many hosts are passing traffic through these “randomly bricking” devices? Could you provide us with more info, so we can understand your problem and help you better?

We have a similar issue, and we suspect the bridge host table size and a (possible) memory exhaustion problem. It only happens on the new ARM devices (RB4011 and RB1100x4).

Symptoms are loss of connectivity and manageability: it’s impossible to access the device in any way, but it keeps working as a switch. After a reboot (unplug and replug the power cable) everything works fine again, and we can see a lot of log lines like: snmp, warning timeout while waiting for program XX (where XX is a variable two-digit number).

Regards

Hi ccardenas,
Yes, we have 3 bridges on the RB4011 and on the RB1100AHx4.
Hosts on the RB4011 most of the time: 30-40.
On the RB1100, not more than 5 are connected directly; it is used as a Dude server for monitoring.

One more thing pointing to the problem being in the RB4011 itself: we have a lot of RB2011 with the same config, and they work perfectly, without any problems. The ROS version on all of them is also 6.44.3.

The symptoms are identical to yours.

It seems that the CPU is getting exhausted over time.
This particular ARM-based WAP60g has been up for 149 days.
(Attachment: arm-cpu.jpg)

One more thing pointing to the problem being in the RB4011 itself: we have a lot of RB2011 with the same config, and they work perfectly, without any problems. The ROS version on all of them is also 6.44.3.

Hello!! Yes, totally true. RB2011 and RB1100Hx2 in the same place, in the same network, in the same situation, and nothing happens: they never block. We’ve opened a support ticket with MikroTik and they told us to plug in a serial cable, wait for the device to block, then try to access it via console and make a supout. But we have a couple of them with cables attached and now they never block! :cry:

Other devices within the network keep blocking randomly. In the meantime we have scheduled a reboot at night a couple of times a week (a lame solution, but it saves the day) until we find the real problem. It seems that some process inside the RouterBOARD hangs or exhausts the memory, making the other processes fail in cascade and blocking access to the device.

If someone is experiencing the same problem, please share it with us. Maybe we can find a hint in the meantime, until I can get a good supout file and send it to MikroTik support.

Regards!!

I have an RB1100AHx4 Dude Edition which shows the same behavior. What happens is that memory gets used up and the device becomes inaccessible. I have a script that reboots the device when 70% is used. During the times it was inaccessible I tried to make a supout via console, but that didn’t work.
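
For anyone who wants to reproduce that kind of check from an external monitoring host, here is a minimal sketch of the threshold decision only. It is not the actual script from the device: the function names are made up for illustration, and it assumes you already obtain the total and free memory figures somehow (for example via SNMP or the API).

```python
def memory_used_percent(total_bytes: int, free_bytes: int) -> float:
    """Return the share of memory in use, as a percentage (0-100)."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    if not 0 <= free_bytes <= total_bytes:
        raise ValueError("free_bytes must be between 0 and total_bytes")
    return 100.0 * (total_bytes - free_bytes) / total_bytes


def should_reboot(total_bytes: int, free_bytes: int,
                  threshold_percent: float = 70.0) -> bool:
    """True once memory usage reaches the threshold (70% in the post above)."""
    return memory_used_percent(total_bytes, free_bytes) >= threshold_percent
```

The reboot itself would still have to be triggered while the device is reachable, which is why the threshold sits well below 100%.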

Same issue here with 3 different RB1100AHx4. The situation occurs with < 7 days of uptime.
During the incident, the devices don’t accept new SSH, Winbox, SNMP or PPTP connections to the router itself.
Logins via CLI aren’t possible either (a serial connection is possible, but login doesn’t work). We tried keeping a serial session open on the affected devices. When the issue occurs, we are able to type commands into the CLI, but generating the supout or doing real actions (e.g. initiating a reboot) doesn’t work.
Other traffic goes through the router smoothly. The issue can be fixed temporarily by powering the device off and on. Our temporary solution is a scheduled reboot.
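
The symptom pattern above (traffic still forwarded, management dead) can be told apart from a plain outage with a small external check. The following is only a sketch under assumptions: `RouterState`, `tcp_probe` and `classify` are illustrative names, the ping result is expected to come from whatever monitoring you already run, and a successful TCP connect does not prove that a login would complete, so a stalled login may still need a deeper probe.

```python
import socket
from enum import Enum
from typing import Mapping


class RouterState(Enum):
    HEALTHY = "healthy"
    MGMT_DEGRADED = "some management services unreachable"
    MGMT_HUNG = "answers ping but no management service reachable"
    DOWN = "no ping and no management service reachable"


def tcp_probe(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


def classify(ping_ok: bool, mgmt_ok: Mapping[str, bool]) -> RouterState:
    """Combine a ping result with per-service management probe results."""
    if not mgmt_ok:
        raise ValueError("at least one management probe result is required")
    if all(mgmt_ok.values()):
        return RouterState.HEALTHY
    if any(mgmt_ok.values()):
        return RouterState.MGMT_DEGRADED
    # Nothing on the management side answers; ping decides which case it is.
    return RouterState.MGMT_HUNG if ping_ok else RouterState.DOWN
```

A call could look like `classify(ping_ok, {"ssh": tcp_probe(host, 22), "winbox": tcp_probe(host, 8291)})`, assuming the default SSH and Winbox ports are in use.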

Starting to see this issue on 4011s as well. The last one it happened to was on 6.44.1.
A power cycle restores all functionality.
SNMP polling becomes very intermittent right before this happens, which points to CPU issues.
The only bridge configured on the devices is an empty bridge for loopback.
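
Since SNMP polling gets flaky shortly before the hang, that could serve as an early warning. Below is a rough sketch of a sliding-window failure tracker; the class name, the window size and the threshold are arbitrary assumptions, not values taken from any of the affected setups.

```python
from collections import deque


class PollMonitor:
    """Track recent poll outcomes and flag a rising failure rate."""

    def __init__(self, window: int = 20, max_failure_ratio: float = 0.3):
        if window <= 0:
            raise ValueError("window must be positive")
        self._results = deque(maxlen=window)  # oldest results drop off
        self._max_failure_ratio = max_failure_ratio

    def record(self, success: bool) -> None:
        """Store the outcome of one SNMP poll."""
        self._results.append(success)

    def failure_ratio(self) -> float:
        """Share of failed polls among the recorded ones (0.0 if none yet)."""
        if not self._results:
            return 0.0
        return self._results.count(False) / len(self._results)

    def is_degraded(self) -> bool:
        """True once the window is full and failures reach the threshold."""
        window_full = len(self._results) == self._results.maxlen
        return window_full and self.failure_ratio() >= self._max_failure_ratio
```

Feeding it one `record(...)` call per polling cycle and alerting (or rebooting) when `is_degraded()` turns true might catch the device while it can still be reached.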

What version are you using?

Does everybody experiencing this run Dude on the device?
I had the exact same issue on my RB1100AHx4 shortly after setting up Dude. It happened 3 or 4 times over a month or two before I connected the dots. I disabled the Dude server and have had no hiccup since (6-8 months ago…).

Our latest bad experience was with 6.45.7.
We have not run 6.46 long enough to judge it.

After some emails with MikroTik support during the last month, I finally got this answer:

Unfortunately, this problem seems to be caused by a hardware issue.
Please contact the seller and return the router for warranty repairs, if the router is still covered by it. You can refer to this ticket number - SUP-3012.

So now I am talking to the reseller about a refund.

I had the issues on several devices without running the Dude.

Hi !

Same here with RB4011iGS+5HacQ2HnD and RB1100Dx4.

Both devices "freeze" several times a week.
The RB1100 sometimes freezes two or three times a day.

I tried with ROS 6.45.7 and 6.46beta59.

Has anyone found a solution to this problem?
Maybe disabling SNMP?

I’ve got a problematic 4011, and a few days ago it started to die.

I have two 1100AHx4 with this issue, and an open case with Mikrotik (Ticket#2019100722004559). I have just been able to connect in with a serial cable and generate a supout file.

I am not running Dude on them.
SNMP is enabled though.

Hello!

Same problem here!!! http://forum.mikrotik.com/t/problem-with-rb4011igs-loggin-stop-in-connected/135487/1

I don’t know what else to do. It has already happened to me 4 times this week.
There is no high CPU consumption, no high RAM consumption or anything else strange.

It is a very big problem for me.

It is incredible that this was reported back in June and nobody from MikroTik has done anything.

In my case they suggested connecting via serial cable, but I am 800 km from my node. It is impossible for me to do that. It is easier to restart it, but it does not look very serious for an internet provider to cut the service every day at the same time. Needless to say, if it “hangs” at 10 am I have to wait until the next day to get access again.

Since the last problem, I’ve kept a Winbox session open on screen, and this problematic 4011 has been up and running for 5d+.
I wonder if it has something in common with
http://forum.mikrotik.com/t/rb4011-wlan1-disabling-itself/125605/1

I suspect that they did not have enough information to replicate the issue and develop a fix. Like you, many of my devices are a long way away, and they asked to have a serial cable plugged in. This was difficult, but I now have one connected and have sent a supout to support.

With a few more people reporting the issue now, hopefully they will be able to find a fix.

I wouldn’t be so optimistic about the problem being tracked down quickly.
It seems that a very similar issue with the RB4011
http://forum.mikrotik.com/t/rb4011-wlan1-disabling-itself/125605/1
was reported in Dec ’18, and the first response from MikroTik staff, saying that they recognise the problem and are working on a solution, only came in April ’19,
and it seems that they still haven’t solved it.
Maybe there is some serious flaw in the hardware, as we saw above:

Unfortunately, this problem seems to be caused by a hardware issue.
Please contact the seller and return the router for warranty repairs, if the router is still covered by it. You can refer to this ticket number - SUP-3012.

Either way, I’ve got Winbox open on my problematic 4011 and 7d12h of uptime. Maybe keeping Winbox open does the trick? I doubt it, but whatever works.
Merry Xmas :slight_smile: