all CCR crashed

I am writing from my mobile phone since I still haven’t revived my whole network. Suddenly, soon after 00:00 UTC (maybe between 00:00 and 00:15), all the CCR routers in my network crashed hard at the same time. The only way to get them working again is to physically disconnect the power.

It happened on ~60 routers at the same time, various models. The only thing they have in common: they are all CCRs.

I read quickly online that an unexpected leap second caused trouble for various network equipment. I don’t know yet if this is related.

Is there any info/advice to avoid it happening again? Our network is big and there are routers installed far apart.

Either this is an April 1st joke or, IMHO, there was an attack on your equipment.
I cannot imagine that different CCR models, from different production batches, with different uptimes, with different power sources, with … would all crash at the same time. There must be an external reason.

Brace yourself

It’s not an April fool’s joke. It just happened on the wrong day of the year. Unfortunately I spent the whole night going back and forth between all my nodes to restart the routers.

I would rule out an attack, because it would have had to start from the end nodes and work backwards in order to reach all the routers, and whoever did it would have needed to know the exact topology of the network. This happened to exactly ALL the CCRs at the same time.

If it was a DoS attack, anyway, it means that every CCR router is in danger and can be crashed at any time.

Or it could be something caused by a bug in the routing protocols (OSPF or BGP). That’s the only thing all the routers have in common.

Your CCRs could have been hacked earlier and poisoned today.

It could be anything. If this happened only to me and nobody else, then it’s either an attack directed at my network or something triggered by my setup.

All the other routers, which include RB1100 (various models), RB1200, RB2011 (various models), RB750, RB951, CRS (various models), Netmetal, Netbox, SXT and QRT, were not affected. I suspect this trouble is related to the TILE architecture.

Since it has been mentioned, there is something weird concerning leap seconds that could be related.

We had a number of Linux hosts incorrectly adding a leap second last night:

2015-04-01T01:59:59.003687+02:00 fe-a-01 kernel: [9475817.256006] Clock: inserting leap second 23:59:60 UTC

We are still investigating the cause. I also heard rumors of unrelated (i.e. not MT) network devices having issues at midnight UTC because of spurious NTP leap second insertions, but I can’t comment much on those. I will ask for more details.
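For anyone who wants to check their own hosts, here is a minimal Python sketch that scans syslog lines for the kernel message quoted above. The regular expression is modelled only on that one line, so other kernels or syslog formats may need a different pattern; the function name and the second sample line are made up for illustration.

```python
import re

# Pattern modelled on the single kernel line quoted above (an assumption:
# other kernel versions or syslog formats may word this differently).
LEAP_RE = re.compile(
    r"^(?P<stamp>\S+)\s+(?P<host>\S+)\s+kernel:.*"
    r"Clock: inserting leap second (?P<utc>\d{2}:\d{2}:\d{2}) UTC"
)

def find_leap_insertions(lines):
    """Return (host, local timestamp, UTC time) for each leap second insertion."""
    hits = []
    for line in lines:
        match = LEAP_RE.search(line)
        if match:
            hits.append((match.group("host"), match.group("stamp"), match.group("utc")))
    return hits

sample = [
    "2015-04-01T01:59:59.003687+02:00 fe-a-01 kernel: [9475817.256006] "
    "Clock: inserting leap second 23:59:60 UTC",
    "this made-up line does not mention the clock at all",
]

for host, stamp, utc in find_leap_insertions(sample):
    print(f"{host} inserted a leap second at {utc} UTC (logged {stamp})")
```

In practice you would feed it the lines of your real log file instead of `sample`.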

Not sure whether this applies to every device running Linux, but there are vendors out there who are quite concerned about the “leap second bug”:
https://access.redhat.com/articles/15145

There is at least one report of spurious leap seconds observed in Italy on March 31 at 23:59:60, with a possible explanation:

http://lists.ntp.org/pipermail/pool/2015-April/007338.html

Given what happened, I would try to simulate the addition of a leap second on the CCRs well before June 30.
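As a starting point for such a test, here is a minimal Python sketch of the NTP header field involved. The leap indicator (LI) is the top two bits of the first byte of an NTP packet, and a value of 1 announces that a second will be inserted at the end of the day. The function names and the `host` argument are illustrative assumptions, and this only builds or reads the flag; it is not a full mock server.

```python
import socket

LI_MEANING = {
    0: "no warning",
    1: "insert leap second (last minute has 61 s)",
    2: "delete leap second (last minute has 59 s)",
    3: "alarm (clock not synchronized)",
}

def leap_indicator(packet: bytes) -> int:
    """Extract the 2-bit leap indicator from the first byte of an NTP packet."""
    return packet[0] >> 6

def first_byte(li: int, version: int = 4, mode: int = 4) -> int:
    """Pack LI (2 bits), version (3 bits) and mode (3 bits); mode 4 = server."""
    return (li << 6) | (version << 3) | mode

def query_leap_indicator(host: str, timeout: float = 2.0) -> int:
    """Ask an NTP server (host is whatever you want to check) for its LI."""
    request = bytes([first_byte(0, 4, 3)]) + bytes(47)  # mode 3 = client
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(request, (host, 123))
        reply, _ = sock.recvfrom(512)
    return leap_indicator(reply)

# Header of a server reply that announces a leap second; the remaining
# 47 bytes are zeroed here because only the first byte matters for LI.
fake_reply = bytes([first_byte(1)]) + bytes(47)
print(LI_MEANING[leap_indicator(fake_reply)])
```

A test server would have to answer client requests with LI set to 1 during the day before the simulated leap second; `query_leap_indicator` can be used to see what a given server is currently announcing.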

cheers,
L.

Any news on this?

No news. I wrote a script that connects to every CCR router and removes the NTP servers from its configuration during the last hour of the last day of the month, then adds them back an hour into the first day of the next month.
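The timing part of that workaround can be sketched in a few lines of Python. This is not my actual script, only an illustration of the window check; the function name `in_ntp_blackout` and the one-hour margin on both sides of the month boundary are assumptions, and the part that logs in to the routers is left out.

```python
from datetime import datetime, timedelta, timezone

def in_ntp_blackout(now_utc, margin=timedelta(hours=1)):
    """True if now_utc is within `margin` of 00:00 UTC on the 1st of a month."""
    # Start of the current month and of the next month, both in UTC.
    start_this = now_utc.replace(day=1, hour=0, minute=0, second=0, microsecond=0)
    start_next = (start_this + timedelta(days=32)).replace(day=1)
    just_after_boundary = (now_utc - start_this) < margin
    just_before_boundary = (start_next - now_utc) <= margin
    return just_after_boundary or just_before_boundary

# Example: half an hour before the June 30 leap second.
probe = datetime(2015, 6, 30, 23, 30, tzinfo=timezone.utc)
print(in_ntp_blackout(probe))
```

A scheduler would call this every few minutes and disable the NTP client while it returns True, then enable it again once it returns False.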

I don’t have enough resources to mock an NTP server and introduce a fake leap second to reproduce the bug.

There is an open ticket [Ticket#2015040166000161] where I provided all the details, supouts, and everything I could document about this.

Tonight a leap second will be introduced by NTP servers all over the world. I disabled NTP on all CCRs, except for a few that I have at hand. Let’s see how it goes this time.

Well, at about 00:00 UTC pretty much all of our CCRs crashed and locked up solid, all at the same time.

Thankfully we’ve already removed (and ebay’d) most of the CCRs so these were just a few edge cases and not the whole network

Still a pita :frowning:

I confirm that the bug is due to the leap second. All the CCRs where the NTP configuration had not been removed crashed very hard.

Last time it happened because of a bug in a few Italian NTP servers. Now that the leap second was introduced officially, it may have happened worldwide.

There is a ticket open for this issue, but it hasn’t been given much consideration: [Ticket#2015040166000161].

I think it should be investigated and solved because it’s critical and dangerous. It happens only on CCR routers and not on MIPSBE / PPC.

I also had all CCRs reboot themselves by watchdog at 17:00 PST (probably after locking up). Even the one running 6.29.1 crashed. :angry:

Really disappointing.

Same here :frowning:

3x CCR1009s all hard locked, no Winbox/serial access. Had to go on-site and power cycle.

2x running 6.29.1 + NTP package (not sure whether that makes any difference), and 1x running the latest rc22.

I can confirm all my border CCRs crashed at 01:00 BST. The common factor was BGP and NTP server. All the other CCRs in my network were just using OSPF and NTP client. What a pile of shite - seriously!!!

All were CCR1036-8G-1S running v6.27.

CCR1016-12G
RouterOS 6.27
RouterBOOT 3.19
NTP Server running

At 0:00 UTC (2:00 local time in Spain), it died.

Not reachable by MAC ping, or ping. All interfaces working and blinking. Can’t see it on IP > Neighbors from other devices.

After rebooting, all working again.

We have 3 more CCR1016s, but they’re 12S-S+. All of them kept working fine (they had no NTP package installed).

Any explanation from MikroTik?

WTF!?

I had 7 CCR1009s crash at 5:00 PST!

What is going on?