2216 stops forwarding all packets

Hello,
We have a CCR2216 in our network and are facing an erratic issue with it. We have multiple ISPs connected to the router. Sometimes the router works for 20+ hours, sometimes for under 5 minutes. The qsfp28 links remain up, but the router stops responding to all ping packets and stops forwarding traffic on all the qsfp28 interfaces. Winbox shows the interfaces in the R (running) state even after traffic stops forwarding. On one of the providers we have a static IP block routed to the interface, and even that stops responding.

The ether mgmt port, which has a LAN connected with an IP scheme from our provider, keeps working: the router can be reached on the ether mgmt IP, and Winbox works as well. Only the main interfaces stop working.


We have tried 7.6 stable and 7.7rc1 through 7.7rc3; all show the same behavior. We have to keep rebooting the router until it eventually becomes stable, and then it can work for 20+ hours before it stops again.
Anyone else facing this sort of issue as well?

Have you been updating the firmware on the hardware when you do the upgrades? I had some strange issues when I first received my 2004, a long time ago now, but they cleared up when I matched the firmware and software versions. I would also make sure that the qsfp28 modules you are using are on the compatibility list.

System->RouterBOARD->Upgrade
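For anyone working from the terminal instead of Winbox, the same check can be sketched as below. The commands are standard RouterOS; the new firmware only becomes active after a reboot, so plan the last step for a maintenance window.

```
# Compare the running firmware with the version bundled in the installed RouterOS
# (look at the current-firmware and upgrade-firmware fields)
/system routerboard print

# Stage the firmware upgrade if the two versions differ
/system routerboard upgrade

# The staged firmware is applied on the next boot
/system reboot
```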

Hi,
Yes, when we upgraded to 7.6 stable, we also upgraded the firmware to 7.6. I have not upgraded the firmware to 7.7rc3, as it's not stable yet.

Hi,

I’m now seeing the same on a hEX S. Brand new unit not even 4 weeks old.

Up until this morning it was stopping every 3-4 days on 7.6, and I had a watchdog in place to just reboot it.

Now we are down to 5-15 minutes before it dies. I have now also upgraded to 7.7 and it still happens.

Remote syslog does not show any errors; it just stops logging, and the first entry after reboot is the NTP correction entry.

@Mikrotik how do we look at this and get this resolved?

Thank you,

JE
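For reference, a watchdog like the one mentioned in this post can be set up roughly as follows. This is only a sketch: 192.0.2.1 is a placeholder watch address, not a value from this thread, and the supout steps are an assumption about what support would ask for rather than something the poster reported doing.

```
# Reboot the router if the watch address stops answering pings
# (192.0.2.1 is a placeholder; use an address that should always be reachable)
/system watchdog set watchdog-timer=yes watch-address=192.0.2.1

# Keep a support output file when the watchdog triggers a reboot
/system watchdog set automatic-supout=yes

# Generate a support output file manually while the router is still reachable
/system sup-output name=supout.rif
```

The resulting supout.rif can be downloaded from Files and attached to a support ticket.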

Do a /export show-sensitive file=backup
Download the backup.rsc to your computer
Do a netinstall of the device without default config
Connect using Winbox to the MAC address and verify the config is empty (open a terminal, type /export, and verify there is nothing; if there is a DHCP client, delete it).
Upload your backup.rsc and /import verbose=yes backup.rsc

This is just a factory reset; it does not address the possible cause.

Hi,

FURTHER UPDATES:

  • Upgraded to 7.7
  • Upgraded board firmware to 7.7
    → Still dies
  • Asked for help (thanks Tim!)
  • Compared running states and found hEX S CPU was running between 20-30%, which for the rules is too high
  • Moved some firewall rules around, and got some improvements
  • Enabled verbose syslog and noticed high volumes of DNS queries against local ROS cache
  • Sent some of the CDN DNS queries directly to the ISP (different project and discussion) and port-forwarded the rest to the internal Pi-hole → CPU down about 10%
  • Then noticed in syslog some delays on ROS CNAME lookups, changed these to A records and now down to 1-2% CPU

OBSERVATION:
There "may" be a problem with the ROS DNS server, specifically with CNAME resolution within itself. I have to use the ROS DNS because the ROS is also the DHCP server and I build static DNS records from the DHCP server.

Hopefully someone can find the same and at least we may be able to move forward.

Thanks,

JE
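For anyone wanting to check for the same pattern, the steps above can be sketched in the terminal as below. The host name alias.lan and the address 192.168.88.10 are made-up examples, not values from this setup, and the sketch assumes static entries of type CNAME already exist in the ROS DNS.

```
# See which processes are using the CPU (sample for 10 seconds)
/tool profile duration=10

# List static DNS entries that are CNAMEs
/ip dns static print where type=CNAME

# Replace one CNAME with a plain A record
# (alias.lan and 192.168.88.10 are hypothetical examples)
/ip dns static remove [find name="alias.lan" type=CNAME]
/ip dns static add name=alias.lan address=192.168.88.10

# Optionally log DNS activity to confirm the query volume drops
/system logging add topics=dns action=memory
```

Remove the DNS logging rule again once done, since it adds load of its own on a small device like the hEX S.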

This is for the hEX S, not for the 2216. It surely helps to solve such issues when the router was originally running v6 and has been upgraded to v7.