Possible fix for hAP ac2 rebooting randomly

Davis · November 17, 2019, 9:24pm

Update! The issue has reappeared (after updating to RouterOS 6.46/6.46.1) and the procedure described in this post didn’t help.
A working solution (verified by several users on several devices) is removal of the NTP package. More info in this post.
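
For reference, a minimal sketch of how an optional package is removed on RouterOS 6.x. It assumes the package is installed under the name ntp (check the package list first); the linked post remains the authoritative description of what was actually done.

```
# List installed packages and confirm that an "ntp" entry exists
/system package print
# Schedule the package for removal; it is removed during the next reboot
/system package uninstall ntp
/system reboot
```

After the removal, time synchronisation should still be possible with the SNTP client that is part of the base system package (/system ntp client).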

One of my hAP ac2 (RBD52G-5HacD2HnD-TC) routers was rebooting randomly (at random intervals varying from a few hours to around 20 days - on average a reboot every 2-7 days). The issue started soon after updating to RouterOS 6.43/6.43.1.
The router had a quite complex configuration (including IPsec and several bridges).
Reinstalling router (/export + netinstall + running the exported configuration script in console) didn’t help.
Changing power supply did not help. Changing router (changing hardware to a new hAP ac2 device) helped for couple of months, then reboots started to happen again.

As I suspected a thermal issue (overheating), I tried to underclock the router to 672 MHz with:
/system routerboard settings set cpu-frequency=672MHz
/system reboot
After underclocking, the random reboots disappeared. After around 30 days I restored the default clock speed of 716 MHz with:
/system routerboard settings set cpu-frequency=716MHz
/system reboot
Restoring the default clock speed did not cause reboots to reappear. Currently there have been no random reboots running at the default clock speed for 45 days.
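
To confirm that a frequency change was actually applied after the reboot, something like the following can be used. This is a minimal sketch, assuming the RouterOS 6.x console:

```
# Show the configured CPU frequency (cpu-frequency setting)
/system routerboard settings print
# Show the frequency the CPU is currently running at, plus the uptime since the last (re)boot
/system resource print
```

The uptime value is also a quick way to see whether an unexpected reboot happened since the last check.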

Has anybody else faced random reboots of hAP ac2 (or possibly other ARM devices)?
I would be glad to know whether underclocking (+ rebooting) + restoring the default clock speed (+ rebooting again) helps anybody else…

P.S. If I have any updates, I will post them here.

Zacharias · November 18, 2019, 10:51am

A reboot can happen if the device stays unresponsive for a minute for some reason. This is due to the watchdog.
So, since you say you have a complex setup, if the CPU load stays high and the device freezes, the watchdog would reboot the device.
But that’s an assumption.
Why don’t you update to the latest firmware and ROS?
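
The watchdog assumption above can be checked from the console. A minimal sketch, assuming RouterOS 6.x; the second command is only a diagnostic option, not a recommended permanent setting:

```
# Show the watchdog settings (watchdog-timer, automatic-supout, ...)
/system watchdog print
# Optional, for diagnosis only: with the software watchdog disabled,
# a frozen router would stay hung instead of rebooting
# /system watchdog set watchdog-timer=no
```

If the reboots continue with watchdog-timer=no, the watchdog is unlikely to be what triggers them.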

Kampfwurst · November 18, 2019, 1:06pm

I had the same.
http://forum.mikrotik.com/t/hap-ac2-crashes-every-week/134158/5

I replaced it and for now it runs stable (for over a week).

Maybe there are some faulty processors in the hAP ac2.

Davis · November 18, 2019, 5:14pm

The router is always kept up to date.
To be more precise, when the reboots reappeared after the hardware swap they became more frequent - the maximum time between reboots went down to 11 days.
And 11 days after underclocking (and the disappearance of the random reboots) I installed RouterOS 6.45.6, so there is a chance that the reboot-free first 11 days were a coincidence and RouterOS 6.45.6 was the actual fix (however, I don’t think this is the case, because the 6.45.6 changelog does not mention stability fixes for issues introduced before 6.45).

Most of the time the router has almost no load (0-10 Mbps of traffic), with occasional traffic spikes lasting a few hours; however, there is no observed correlation between load and the reboots.
The reboots always happened with the message “system;error;critical: router was rebooted without proper shutdown”, and no supout file was ever created by the reboots.
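
For anyone comparing symptoms, the log entry and the absence of a supout file can be checked roughly like this (a minimal sketch, assuming the RouterOS 6.x console and default logging):

```
# Find the entries logged after an unclean restart
/log print where message~"without proper shutdown"
# Check whether any supout file was written
/file print where name~"supout"
```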

Very interesting! Symptoms sound familiar.
If (after some weeks) your new router starts to reboot randomly as well, please try underclocking + rebooting + restoring the default clock speed + rebooting again!
Currently I suspect that underclocking + restoring the default clock speed (and rebooting after each change) is what fixed my router (I don’t think running underclocked for 30 days is what made the difference).

Zacharias · November 18, 2019, 5:23pm

I’ve worked with hundreds of hAP ac devices and I never had a single problem like yours. Of course, this does not mean that your hAP may not have a hardware problem.
However, your problem does not happen every day, which makes it hard for me to believe it is hardware related.
I would netinstall the device and program it from the beginning again…

Davis · November 18, 2019, 6:30pm

The router is hAP ac2 (hAP ac2 and hAP ac have different hardware).
As the OP states: Changing router (changing hardware to a new hAP ac2 device) helped for couple of months, then reboots started to happen again.
As the OP states: Reinstalling router (/export + netinstall + running the exported configuration script in console) didn’t help.
The issue seems to be resolved now. As the OP states: Currently there have been no random reboots running at the default clock speed for 45 days.

Zacharias · November 18, 2019, 8:35pm

When a router reboots after 20 days, it is not a hardware problem to me.
It can’t have faulty hardware and work perfectly for 20 days!

The issue started soon after updating to RouterOS 6.43/6.43.1

Also, since the problem started after an update, it is obvious it is not hardware related…

Davis · November 18, 2019, 8:49pm

Not all hardware issues are equally easy to troubleshoot. There are non-trivial things like single bit errors in memory and overheating…
As swapping hardware helped for some time (and netinstall didn’t help at all), I tend to assume that hardware could be one of the factors. And as underclocking + restoring the default clock speed (+ rebooting after both changes) has fixed the issue (or at least helped for some time), there might be some software factor involved as well.

Zacharias · November 18, 2019, 9:06pm

There are non-trivial things like single bit errors in memory

Memory errors and no supout file?
Also, why would it overheat? Unless it is out in the sun…

Davis · November 18, 2019, 9:21pm

In the same way I could say “Software issue and no supout file?”…
These are just guesses and thanks for your contribution!
The router is in an 18-25 °C room, with no airflow obstructions (not in a closet, not behind curtains etc.), and never gets direct sunlight.

Zacharias · November 18, 2019, 9:58pm

You say you changed the hAP ac2 to a new one…
What are the chances that both the first and the second one are faulty?

sindy · November 18, 2019, 10:03pm

The chances are quite high if they come from the same production lot for which the same faulty lot of some chips was used (field experience, not with Mikrotik in particular).

Davis · November 18, 2019, 10:10pm

A software-only scenario doesn’t explain why swapping hardware helped for some time…
And I don’t think this conversation is providing any new information on the topic (it is just consuming the time of others who will read this topic).

This shouldn’t be the case, as the new hAP ac2 seems to have been manufactured several months later (it had a different amount of RAM and a newer factory software version).

Zacharias · November 18, 2019, 10:11pm

OK, now we have made a whole production lot faulty.
And we have also diagnosed that it is specifically a faulty chip!

Davis · December 14, 2019, 5:10pm

Another 27 days without reboots (in total 72 days running at default clock speed without reboots - more than after changing hardware).

I have also disproved my second hypothesis: that the reboots might have been caused by “/interface bridge nat” rules. As I had changed the “/interface bridge nat” rules at the same time as underclocking, there was some probability that those rules were causing the reboots. To test this, around 28 days ago I reverted the “/interface bridge nat” configuration to the state it was in before underclocking (during the period when the random reboots were happening). I have not observed any reboots in those 28 days (with the same “/interface bridge nat” rules that were present before underclocking).
Other configuration changes done together with underclocking (and during first 11 days after underclocking) were only to “/ip firewall filter” and “/ip firewall mangle” rules that were created after the reboot issue started (so I don’t think changes to these rules might have affected the reboots).

To recap - the reboot issue has been resolved for around 72 days (running at the default clock speed).
Most likely the reboots were fixed by underclocking (+ rebooting) + restoring the default clock speed (+ rebooting again).

Davis · January 4, 2020, 8:41am

Within a few days after installing RouterOS 6.46 similar random reboots started on another hAP ac2 router (with very similar configuration).
After some more days (and after installing 6.46.1) the first router also rebooted randomly; however, the second router experienced the reboots more often.
Underclocking the second router to 672 MHz did not prevent the reboots.
So underclocking is not a solution (at least not with RouterOS 6.46/6.46.1, and the long period without reboots might have been mainly caused by “lucky” software versions).

gotsprings · January 5, 2020, 3:07am

Sure it’s not a voltage thing or something else rebooting?

I have some pretty intricate settings on a few hAP ac2s that have had several months of uptime. I never messed with the processor speed.

I always updated RouterOS and firmware.

I have had some issues with cap radios on the running cap. But nothing that auto rebooted.

Davis · January 6, 2020, 12:08am

Most likely it isn’t a power interruption or something like that. Once the second router rebooted while I was in the room and the ceiling light was on - I didn’t see a glitch in the light. I have also tried changing the power adapter of the second router - it didn’t help either.
The first router is near a UPS (the router itself is not connected via the UPS) and there is no data about power fluctuations in the UPS logs.

Are you using IPsec on your hAP ac2 (RBD52G-5HacD2HnD-TC) routers?

sindy · January 6, 2020, 5:06am

I had an issue like this a few weeks ago, but much less random, on one of my hAP ac2s. Whenever traffic was forwarded between a GRE over IPsec tunnel and a policy-based IPsec tunnel, both passing through the same WAN interface, the router was consistently restarting (although some of the traffic did pass through each time, so I assume it was a particular size or contents of a packet being forwarded that crashed it). The remedy was to replace the policy-based IPsec tunnel with a GRE over IPsec one; replacing the power adaptor was the first thing that came to my mind, but it didn’t help.

As usual with configurations involving IPsec and 3rd party networks, I hesitate to send supout files to support, so I haven’t reported the issue to Rīga, as I haven’t had time yet to reproduce it in the lab.

Davis · January 7, 2020, 1:50am

Could that actually be a race condition (rather than particular traffic)?
Did the reboots occur some seconds after connection establishment? Did the time between connection establishment and reboot vary notably (e.g. sometimes 5 seconds and sometimes 30 seconds)?

What algorithms and PFS group are you using? I am using AES-256(-CBC), SHA256 and modp4096.
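
For comparing setups, the configured algorithms and groups can be listed roughly as follows. This is a sketch that assumes RouterOS 6.44 or later, where the phase 1 settings live in the profile menu (older versions used a different menu name):

```
# Phase 1 settings (hash and encryption algorithms, DH group)
/ip ipsec profile print
# Phase 2 settings (auth and encryption algorithms, PFS group)
/ip ipsec proposal print
```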