Hello,
We have a CCR2216 in our network and are facing an erratic issue with it. We have multiple ISPs connected to the router. Sometimes the router works for 20+ hours, sometimes for under 5 minutes. The qsfp28 links remain up, but the router stops responding to all ping packets and stops forwarding traffic on all the qsfp28 interfaces. Winbox still shows the interfaces in the R (running) state after forwarding has stopped. One of the providers routes a static IP block to its interface, and even that stops responding.
The management Ethernet port, which is connected to a LAN using an IP scheme from our provider, keeps working: we can still reach the router on the management IP, and Winbox works as well. Only the main interfaces stop working.
We have tried 7.6 stable and 7.7rc1 through 7.7rc3, and all show the same behavior. We have to keep rebooting the router until it eventually comes up stable, and then it can work for 20+ hours before it stops again.
Anyone else facing this sort of issue as well?
Have you been updating the firmware on the hardware when you do the upgrades? I had some strange issues when I first received my 2004, a long time ago now, but they cleared up when I matched the firmware and software versions. I would also make sure that the qsfp28 modules you are using are on the compatibility list.
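For reference, a minimal sketch of how the firmware check mentioned above is usually done from the terminal. This assumes a stock RouterOS 7 install; the version numbers shown on your router will differ, and the reboot should be done in a maintenance window.

```
# Show the RouterBOOT versions: compare current-firmware with upgrade-firmware
/system routerboard print

# If upgrade-firmware is newer than current-firmware, stage the firmware upgrade
/system routerboard upgrade

# The staged firmware is only applied on the next reboot
/system reboot
```

After the reboot, running /system routerboard print again should show current-firmware matching the installed RouterOS version.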
Do a /export show-sensitive file=backup
Download the backup.rsc to your computer
Do a netinstall of the device without default config
Connect using Winbox to the MAC address and verify the config is empty (open a terminal, type /export, and verify there is nothing; if there is a DHCP client, delete it).
Upload your backup.rsc and /import verbose=yes backup.rsc
Compared running states and found the hEX S CPU was running between 20-30%, which is too high for this rule set
Moved some firewall rules around, and got some improvements
Enabled verbose syslog and noticed high volumes of DNS queries against local ROS cache
Sent some of the CDN DNS queries directly to the ISP (different project and discussion) and port-forwarded the rest to the internal Pi-hole: CPU down about 10%
Then noticed in syslog some delays on ROS CNAME lookups, changed these to A records, and CPU is now down to 1-2%
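A rough sketch of what the CNAME-to-A change in the last step might look like. The host names and the address below are made up for illustration and are not from my config; adjust them to your own static entries.

```
# List the current static DNS entries and look for type=CNAME rows
/ip dns static print

# Hypothetical example of the kind of entry that was slow to resolve:
# an alias pointing at another static name on the same router
/ip dns static add name=nas.lan type=CNAME cname=server1.lan

# Replacement: remove the alias and point the name straight at the address
# (192.168.88.10 is an assumed address for server1.lan)
/ip dns static remove [find name=nas.lan type=CNAME]
/ip dns static add name=nas.lan address=192.168.88.10
```

The trade-off is that an A record has to be updated by hand if the target's address changes, whereas the CNAME followed the target automatically.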
OBSERVATION:
There may be a problem with the ROS DNS server, specifically with CNAME resolution against its own static records. I have to use the ROS DNS because the router is also the DHCP server, and I build static DNS records from the DHCP leases.
Hopefully someone else can reproduce this, so that at least we may be able to move forward.