Did anyone else see any TILE-based RouterOS devices go unresponsive at the leap second insertion today (00:00 UTC)?
Three of my CCRs (one running 6.27 and two running 6.29.1) became unresponsive at 00:00 UTC on the dot. The LCD screens were also unresponsive, and I couldn’t get any output via the serial console either. The only fix was a hard power cycle.
I can confirm that some CCR units experienced a crash due to the introduction of the leap second.
Only CCR units that use the client from the NTP npk package were affected. It currently seems the issue was in the Linux kernel; the bug was fixed there, but RouterOS did not have this kernel fix yet.
If the CCR uses the default SNTP client (i.e. NTP.npk is not installed), then nothing happened.
All of my border routers (ALL CCR) that were synced with an ntp.org pool crashed. All of my edge routers were synced to the border routers, so I didn’t have to power cycle those. It still caused a significant outage. Unfortunately, this is the last straw and I will be replacing all these devices, border or not, with more reliable Cisco equipment.
If MikroTik worked more closely with the open source Linux community, I’m sure this wouldn’t have happened.
For this one, yes, but the next leap second will be added in around 2 years.
Could you please tell me whether you had the NTP package on all the servers, or whether you used SNTP?
Unless a bug in the hardware driver of some NTP server triggers an unexpected leap second (as happened to me on 1st April, http://forum.mikrotik.com/t/all-ccr-crashed/86731/1), or unless a malicious user wants to bring down an entire ISP network by hacking one public NTP server.
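For anyone who wants to check what an upstream server is actually announcing: the leap second warning travels in the Leap Indicator (LI) field, the top two bits of the first byte of every NTP packet. Below is a minimal Python sketch that decodes those flags from a raw packet. The function names and the sample packet are made up for illustration; this is not how RouterOS or any NTP daemon handles the field internally.

```python
# Meanings of the 2-bit Leap Indicator (LI) field in an NTP header.
LEAP_MEANINGS = {
    0: "no warning",
    1: "last minute of the day has 61 seconds (leap second insertion)",
    2: "last minute of the day has 59 seconds (leap second deletion)",
    3: "alarm condition (clock not synchronized)",
}


def decode_ntp_flags(packet: bytes) -> dict:
    """Decode LI, version, mode and stratum from a raw NTP packet.

    Hypothetical helper for illustration only.
    """
    if len(packet) < 48:
        raise ValueError("an NTP header is at least 48 bytes long")
    first = packet[0]
    return {
        "leap": first >> 6,             # top 2 bits
        "version": (first >> 3) & 0x7,  # next 3 bits
        "mode": first & 0x7,            # low 3 bits (4 = server reply)
        "stratum": packet[1],
    }


def announces_leap(packet: bytes) -> bool:
    """True if the packet warns of a leap second insertion or deletion."""
    return decode_ntp_flags(packet)["leap"] in (1, 2)


# Made-up sample: LI=1, version 4, mode 4 (server), stratum 2, rest zeroed.
sample = bytes([0x64, 2]) + bytes(46)
flags = decode_ntp_flags(sample)
print(flags, "->", LEAP_MEANINGS[flags["leap"]])
```

Logging this value from the servers a router syncs to would show whether a bogus announcement (as on 1st April) is being passed downstream before it takes effect at the end of the day.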
Besides the probable driver/NTPd bug in the kernel, I don’t understand why the routers hung and why they were not restarted by the hardware watchdog. As I remember, Tilera processors have a hardware watchdog, and it seems it doesn’t function properly!
Should we trust the watchdog in these cases? The main problem is that operators had to restart the routers on-site.
Is it just me, or is Normis incapable of saying “sorry - we screwed up”? I have read through all his replies and I don’t see an apology anywhere - but frankly I am not in the least bit surprised…
Let me explain a bit more what happened tonight in my situation:
Being aware of the problems that an accidental leap second caused on my routers on 1st April, I disabled NTP ( /system ntp client set enabled=no ) on every router except those I could reach easily. The result is that the routers with NTP disabled didn’t crash.
The ones where NTP had not been disabled all crashed, including those that did not have the NTP package installed. This means that if the NTP package causes the problem, there is a chance that it causes something else to fail; for example, it could be a BGP routing update that triggers the bug/crash and somehow propagates it to other routers (this is just a guess).
The point of this post is to warn and emphasize that I found MANY routers in an unresponsive state and only a couple of them had the NTP package installed.
That is very interesting; maybe those units used a different NTP server? I ask because the NTP package and kernel were not changed in 6.18, or in any other v6 version for that matter.