We have several CCR1072s in our core, and in the last 2 days we have had 2 watchdog reboots: one on a router running 6.38.5 and the other on a router running 6.39.2. What could be causing this, and what should I do to prevent it in the future?
We fought the same reboots too. We're talking with support and they suspect a hardware failure, but I can always reproduce the reboots under the same conditions:
- If you use traffic-flow with selected interfaces (I mean anything other than interfaces: all), the router reboots every 3-4 hours.
- If you overclock the CPU to 1200 MHz, it reboots under some conditions.
This is what I did:
- Forced the firmware upgrade (it was already on version 3.33, but I forced a reinstall).
- Reverted the CPU from 1200 MHz back to 1000 MHz.
- Set interfaces to all in the traffic-flow configuration, with these values:
/ip traffic-flow print
enabled: yes
interfaces: all
cache-entries: 2M
active-flow-timeout: 2m
inactive-flow-timeout: 1m
With these changes I have not experienced any more reboots (I am still monitoring it, since only two days have passed, but before that the reboots were more frequent).
If you can, attach a serial console and keep it open. This is what I saw after the reboots:
MikroTik Login: (0,0) hv_warning: L2$ correctable data ECC error at PA 0xf8a8ff30
(0,0) hv_panic: got processor error: PC 0xffff_fff7_0051_e7c0, ICS/PL 0x6
(0,0) SBOX_ERROR: 0x0000_0000_0000_0000
(0,0) MEM_ERROR_CBOX_ADDR: 0x0000_0000_f8a8_fd78
(0,0) MEM_ERROR_CBOX_STATUS: 0x0000_0000_001c_0405
(0,0) L2 data ram 2-bit error detected.
(0,0) MEM_ERROR_MBOX_ADDR: 0x0000_0000_0000_0000
(0,0) MEM_ERROR_MBOX_STATUS: 0x0000_0000_0000_0000
(0,0) XDN_DEMUX_ERROR: 0x0000_0000_0000_0000
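If you save the console output to a file, a small script can pull out the register values so that dumps from different reboots can be compared side by side. This is only a rough sketch in Python: the function names `parse_hv_dump` and `nonzero_registers` are made up for illustration, and the line pattern is an assumption based purely on the lines pasted above, not on any MikroTik documentation.

```python
import re

# Pattern assumed from the console lines pasted above, e.g.
# "(0,0) MEM_ERROR_CBOX_ADDR: 0x0000_0000_f8a8_fd78"
REGISTER_RE = re.compile(
    r"\((\d+),(\d+)\)\s+([A-Z0-9_]+):\s+(0x[0-9a-fA-F_]+)")


def parse_hv_dump(text):
    """Return {register_name: int_value} for every register line found."""
    registers = {}
    for match in REGISTER_RE.finditer(text):
        name = match.group(3)
        # Underscores are only visual separators in the dump.
        registers[name] = int(match.group(4).replace("_", ""), 16)
    return registers


def nonzero_registers(text):
    """Keep only the registers that report a non-zero value."""
    return {name: value
            for name, value in parse_hv_dump(text).items()
            if value != 0}


SAMPLE = """\
(0,0) SBOX_ERROR: 0x0000_0000_0000_0000
(0,0) MEM_ERROR_CBOX_ADDR: 0x0000_0000_f8a8_fd78
(0,0) MEM_ERROR_CBOX_STATUS: 0x0000_0000_001c_0405
(0,0) MEM_ERROR_MBOX_ADDR: 0x0000_0000_0000_0000
"""

if __name__ == "__main__":
    for name, value in nonzero_registers(SAMPLE).items():
        print(f"{name} = {value:#x}")
```

The free-text lines (hv_warning, hv_panic, "L2 data ram 2-bit error detected") are deliberately ignored by this pattern, so keep the raw capture as well when you send it to support.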
Thank you very much for the input. I will see what that does in the next maintenance window.
I too am experiencing this.
We do, however, have our units overclocked to 1200 MHz… I wonder if that may be the issue.
Yes, we have gone back down to 1000 MHz and have not had any more unexpected reboots.
I’m going to give this a try, thanks a lot.
Same symptom as I reported here. Check PSU.
The 1036 has a single PSU; the 1072 has redundant PSUs. Also, we experienced the same issue on 3 different CCRs.
This is a different case.
I have the same issue on a CCR1072: the CPU is at 1000 MHz and the fw is 6.40.1.
Any ideas?
Me too, on a lab router. With the watchdog disabled, we see the CPU go to 100% due to the networking process, and the router becomes unusable.
Downgrading to 6.40 fixed the issue.
I have the same problem on a CCR1072 with 6.40.3. I use it only as a BGP router and do not use traffic-flow. The peak load never exceeds 6% across all CPUs and the bandwidth never exceeds 1 Gbps. Even so, I suffer spontaneous reboots at any time: days can pass without a problem, and then suddenly the watchdog reboots the system. Any ideas?
We started experiencing these watchdog reboots on our CCR1072s running 6.40.1. Is this a reported bug? Does downgrading to 6.40.0 really stabilize it? Is it fixed in newer versions?
nov/02/2017 05:13:15 system,error,critical router was rebooted without proper shutdown, probably kernel failure
nov/02/2017 05:13:15 system,error,critical kernel failure in previous boot
nov/02/2017 05:13:15 system,error,critical router was rebooted without proper shutdown, probably kernel failure
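To see how often this actually happens, it can help to pull the reboot timestamps out of a saved copy of the log. Below is a rough Python sketch: the function name `unclean_reboot_times` is hypothetical, and the entry layout (date, time, comma-separated topics, message) is assumed only from the lines pasted above. It accepts both the terminal-wrapped and the unwrapped form of the message.

```python
import re
from datetime import datetime

MONTHS = {name: number for number, name in enumerate(
    ["jan", "feb", "mar", "apr", "may", "jun",
     "jul", "aug", "sep", "oct", "nov", "dec"], start=1)}

# Assumed entry layout, taken from the lines pasted above:
# "nov/02/2017 05:13:15 system,error,critical <message>"
ENTRY_RE = re.compile(
    r"^([a-z]{3})/(\d{2})/(\d{4}) (\d{2}):(\d{2}):(\d{2}) (\S+) (.*)$")


def unclean_reboot_times(log_text):
    """Return sorted, de-duplicated timestamps of unclean-reboot entries."""
    times = set()
    for line in log_text.splitlines():
        match = ENTRY_RE.match(line.strip())
        if match is None:
            continue  # wrapped continuation, e.g. "down, probably kernel failure"
        month, day, year, hour, minute, second, topics, message = match.groups()
        if month not in MONTHS or "critical" not in topics.split(","):
            continue
        if "kernel failure" in message or "rebooted without proper" in message:
            times.add(datetime(int(year), MONTHS[month], int(day),
                               int(hour), int(minute), int(second)))
    return sorted(times)


SAMPLE = """\
nov/02/2017 05:13:15 system,error,critical router was rebooted without proper shut
down, probably kernel failure
nov/02/2017 05:13:15 system,error,critical kernel failure in previous boot
"""

if __name__ == "__main__":
    for moment in unclean_reboot_times(SAMPLE):
        print(moment.isoformat(sep=" "))
```

Entries that share a timestamp, like the ones above, collapse into a single reboot event, so the length of the returned list is the number of distinct unclean reboots in the file.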
Hi,
At the moment we have 21 CCR1072s on 6.41rc44, all up for 17 days without issue. We run BGP + filtering + OSPF, nothing else. Try upgrading to this release; if you still experience reboots, you can rule out these services as the cause.
We experienced reboots with the CPU raised to 1200 MHz; at 1000 MHz we never experienced them.
We are not currently running any dynamic routing protocols. The purpose of our CCRs is NAT. They also run dhcpd and upnpd, with LACP, multiple RFC 1918 VLANs on 20G in and one VLAN on 20G out.
Hi berlo, same here: when we had the CPU at 1200 MHz we suffered much more frequent reboots. Now we have two 1072s at 1000 MHz and only one of them suffers reboots, and it is the router with the lower workload! So it is a contradiction!
Have you tried disabling the watchdog and keeping a serial console open?
You should see the error.
I just tried disabling the watchdog, but without the console. I'll try it and report back! Thanks
You need to have the console open, because if it is a kernel panic, a memory error or similar, you can't see the error message otherwise.
Is it still working without any issue? What is the uptime of the CCR1072s now?
Yes, and the number of CCRs has now grown to 28 across Europe. All are working fine and we have not experienced any more random reboots. We also saw better performance on routers with more than 1 million (1kk) routes installed after disabling the route cache. You lose some CPU, about 10% more, but you will not experience packet loss or stalled forwarding when the router forwards more than 2 million pps.