The big CCR2004 reboot thread (was 2004 hardware issues?)

Hi,

We started deploying 2004s into our network and have issues with one we are trying to add into our bgp core.

It rebooted every 10-14 days, so we took it out and replaced it with another one. We ran a memory test on the first one and got memory errors.

This was 12 days ago…

Today the new one rebooted the same way, causing so many issues in OSPF 0.0.0.0 that we had to reboot 2 other routers running 6.45.9 to get traffic flowing again. Another CCR1016 got so corrupted that /export compact, snmp etc. didn't work.

The second CCR2004 (6.47.1) is connected with a console cable now, and running a remote memory test we get errors on this one too. Broken batch or something else? Feels like too much of a coincidence perhaps?

Error in address=0x00000000C0004768, W=0xC0004768 R=0x00000000 X=0xC0004768
Error in address=0x00000000C000476C, W=0xC000476C R=0x00000000 X=0xC000476C
Error in address=0x00000000C0004770, W=0xC0004770 R=0x00000000 X=0xC0004770
Error in address=0x00000000C0004774, W=0xC0004774 R=0x00000000 X=0xC0004774

etc etc etc.
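
For anyone comparing their own tester output: a minimal Python sketch that parses lines in the format above and checks whether the errors follow a uniform pattern (every read returning zero, consecutive word addresses). It assumes X is the XOR of the written and read values, which the tester output itself does not state; the function names are just illustrative.

```python
import re

# Sample lines copied from the memory tester output quoted above.
SAMPLE = """\
Error in address=0x00000000C0004768, W=0xC0004768 R=0x00000000 X=0xC0004768
Error in address=0x00000000C000476C, W=0xC000476C R=0x00000000 X=0xC000476C
Error in address=0x00000000C0004770, W=0xC0004770 R=0x00000000 X=0xC0004770
Error in address=0x00000000C0004774, W=0xC0004774 R=0x00000000 X=0xC0004774
"""

LINE_RE = re.compile(
    r"Error in address=0x([0-9A-Fa-f]+), "
    r"W=0x([0-9A-Fa-f]+) R=0x([0-9A-Fa-f]+) X=0x([0-9A-Fa-f]+)"
)


def parse_errors(text):
    """Return a list of (address, written, read, x) integer tuples."""
    return [tuple(int(g, 16) for g in m.groups()) for m in LINE_RE.finditer(text)]


def summarize(errors):
    """Collect simple pattern checks over the parsed error records."""
    addrs = [a for a, _, _, _ in errors]
    strides = {b - a for a, b in zip(addrs, addrs[1:])}
    return {
        "count": len(errors),
        "all_reads_zero": all(r == 0 for _, _, r, _ in errors),
        # The written pattern matches the low 32 bits of the address.
        "written_equals_address": all(w == (a & 0xFFFFFFFF) for a, w, _, _ in errors),
        # Assumption: X is the XOR of the written and read values.
        "x_is_w_xor_r": all(x == (w ^ r) for _, w, r, x in errors),
        "strides": sorted(strides),
    }


summary = summarize(parse_errors(SAMPLE))
print(summary)
```

If every check comes out uniform like this, the failures are at least not scattered single-bit errors, which may be worth mentioning when opening a ticket.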

/M

Something like this is better sent to support@mikrotik.com to start a real case - this is a discussion forum not a proper support channel.

Hi,

I asked the question here because it’s a forum. I’m fully aware that it’s not a support channel.

Mikrotik has no answer for the reboots. Seems they were able to reproduce the memory tester issue so it’s confirmed that the memory tester is broken.

They don’t think it’s related to the reboot.

The system reboots randomly but seldom. I’m trying to attach a console to get output but this is at production site.

If anyone else experiences 2004 reboots and tests the memory like I did: the errors come from the broken tester, so they don't mean faulty memory is the cause.

The one we have issues with runs 3 full bgp feeds but not much traffic.

The 2004s with no bgp have not rebooted.

/Mikael

I’m having lockups on one of our two 2004s as well. It’s an edge router with 4 bgp sessions (no full tables). About every 36 hours it locks up: we can’t access it via winbox or ssh, and snmp stops. Sometimes it still passes traffic, other times no traffic will pass.

We got a console server on it now. Last lockup reported nothing at all in the console but I still had console access. When I tried to get it to generate a supout.rif file via console, that failed but immediately after, it came back to life and I could generate one via winbox.

In our crash 2 days ago it was still passing traffic, but we lost all access to it, including console. We had to pull power to reboot it.

Support ticket is open but so far, no info.

I have been told by Mikrotik that the bootloader memory test has been fixed and the fix will be included in the next release.

As for the logging, we have connected to another mikrotik at the same site and are logging to a local file, hoping for a crash to happen.
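
A different way to capture logs off the box, in case it helps someone: a minimal Python sketch of a UDP syslog collector. This is not what we run; it assumes the router has been configured to send its log to a remote syslog target pointing at this host, and the port (5514) and file name are hypothetical.

```python
import socket
from datetime import datetime, timezone


def format_record(payload: bytes, sender: str, when: datetime) -> str:
    """Turn one received datagram into a single timestamped log line."""
    text = payload.decode("utf-8", errors="replace").strip()
    return f"{when.isoformat()} {sender} {text}"


def serve(host: str = "0.0.0.0", port: int = 5514, path: str = "ccr2004.log") -> None:
    """Receive UDP syslog datagrams forever and append them to a file.

    Port and path are hypothetical defaults. Every line is flushed right
    away so the file always ends with the last message that arrived.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock, \
            open(path, "a", encoding="utf-8") as log:
        sock.bind((host, port))
        while True:
            payload, (sender, _port) = sock.recvfrom(65535)
            now = datetime.now(timezone.utc)
            log.write(format_record(payload, sender, now) + "\n")
            log.flush()
```

Calling `serve()` on a host next to the router keeps a timestamped copy of whatever the router manages to send before it dies, which at least narrows down the time of the crash.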


/M

What are you all logging to “echo” in hopes of getting useful info in case of a crash? We have “critical, warning, health, system and event” echoing and nothing was on the console at all for our last crash.

Not from the log, but supposedly some crash information should appear on the console regardless of logging settings.

Bootloader fix was included in 6.47.2

We are seeing the same issues with one of 8 CCR2004s: uninvited reboots 1-2 weeks apart. Each one is running 6.47.1, each one is running BGP, so it would seem that it may be a hardware problem. SFP28 is in use on all, so it’s also not the reason.

I am shipping one 2004 back under RMA and our other one reboots every 1-2 weeks. It was on 6.47 and I just put it on 6.47.2 two days ago. Both are running BGP and OSPF.

Mikrotik is running some special debug packages on one of our routers. It's either a software bug or something deep in the hardware, since they mentioned involving the CPU vendor.

Without the debug packages nothing came on console at crash time, so hoping for a new crash soon so this can be resolved. It has currently been up 7 days.

We have 24 2004s in boxes so would be nice to solve this issue…

We are running the same debug firmware on ours. Every one of our deployed 2004s is having some sort of problem (either random reboots or crashes or both).

2004 that used to reboot every now and then seems stable so far:

/system resource> print 
             uptime: 3w15h17m35s
            version: 6.47.1 (stable)

/system routerboard> print 
             model: CCR2004-1G-12S+2XS
  factory-firmware: 6.46.3
  current-firmware: 6.47.1

I gave it something to do, so it didn’t die out of boredom:

> /tool bandwidth-test (...) direction=both protocol=udp local-tx-speed=20000000000 remote-tx-speed=20000000000
                status: running
              duration: 1w6d20h41m35s
            tx-current: 20.0Gbps
  tx-10-second-average: 20.0Gbps
      tx-total-average: 20.0Gbps
            rx-current: 20.0Gbps
  rx-10-second-average: 19.9Gbps
      rx-total-average: 20.0Gbps
          lost-packets: 0
           random-data: no
             direction: both
               tx-size: 9000
               rx-size: 9000
      connection-count: 20
        local-cpu-load: 70%
       remote-cpu-load: 74%

…but it seems there’s either no correlation between load and reboots, or there is and load prevents reboots :wink:

Anyway, my $0.03 to the case. No instabilities seen on other 2004s either.
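
For those polling uptime to catch these reboots: the uptime and duration values printed above use the compact w/d/h/m/s notation, which is easy to turn into something comparable. A minimal Python sketch, assuming only those five unit letters appear; the function names are made up for illustration.

```python
import re
from datetime import timedelta

# Map each unit letter to the matching timedelta keyword.
_UNITS = {"w": "weeks", "d": "days", "h": "hours", "m": "minutes", "s": "seconds"}
_TOKEN = re.compile(r"(\d+)([wdhms])")


def parse_uptime(value: str) -> timedelta:
    """Parse a duration such as '3w15h17m35s' into a timedelta."""
    tokens = _TOKEN.findall(value)
    # Reject input that is not made up entirely of <number><unit> pairs.
    if not tokens or "".join(n + u for n, u in tokens) != value:
        raise ValueError(f"unrecognized duration: {value!r}")
    return timedelta(**{_UNITS[u]: int(n) for n, u in tokens})


def find_reboots(samples):
    """Return the indexes where uptime is lower than in the previous sample."""
    ups = [parse_uptime(s) for s in samples]
    return [i for i in range(1, len(ups)) if ups[i] < ups[i - 1]]
```

Feeding `find_reboots` a list of uptime strings collected at regular intervals flags every sample where the counter went backwards, i.e. the router restarted in between.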

We are seeing the same problems on two CCR2004s out of 10 deployed. No connection tracking enabled. Support says that unless we have the debug package installed and a console attached, there’s no way to catch the problem. Today it happened again on the same units with version 6.47.2. It looks like a hardware issue to me. We will deploy the debug package and console and see if we can help find the problem.

We just had our 2004 crash with the debug firmware installed. The last line of the console output is:

[admin@AUW-LOOKOUT-EDGE-02] > LOOPER: read_raw read failed: EOF
died with signal

Nothing before that for hours. After the crash, we got 2 physical link up/down messages in console (about 2 minutes after the crash). Nothing else. Router will not respond to console input and we can’t log into it. Nor is it passing traffic. I have sent the debug log to support with our open ticket with them. Hopefully this console message is useful to them and they can get this fixed.

Any new info about this?


I wonder if this has something to do with it →

What’s new in 6.48beta40 (2020-Sep-14 13:34):
*) arm64 - improved reboot reason reporting in log;

That is related. It was added to help support troubleshoot this. We are running that firmware at the request of Mikrotik to gather more information when it crashes. Has not done anything yet to stop the crashing.

We also have issues with CCR2004; there is no BGP/OSPF in our case. Reboots are fully random, sometimes a few times per day, sometimes once in two weeks.
CPU load does not exceed 6%, average load 20-25Mbps with rare spikes to 50Mbps. ROS: 6.47.4.
We love MT devices, …but those reboots are horrible.

Hi,

We are running 6.48beta48 on some of the 2004s that were rebooting. It seems to have solved the reboots, but we now face an issue where the routing protocols stop working instead.

Are any of you experiencing the same?

/Mikael

Just had the first unexpected reboot on our CCR2004. Running 6.47.4, no autosupout.rif on flash. Very light load, 20-30Mbit/s of traffic.