CCR1072 Massive CPU spikes

We’ve been using the 1072 series for a bit over a year now, with just below 2 Gbps sustained normal traffic on port 1 coming from a 10G upstream port. Nothing crazy traffic wise.

We are BGP announcing nothing major: 9 routes, 3 of which are IPv6. So nothing major route wise nor traffic wise. Everything has been good, no major issues. We aren’t using queues except for two VLANs, but they have very minimal traffic and we are limiting them to 1 Gbps total each.

All of this being said, nothing has changed in our config or firewall in over 3 months, other than removing one of our IP ranges! The only thing we do week to week is add or remove ranges, that’s it. Firewall rules haven’t changed in months.

As of this morning everything was perfectly fine. At 10AM I took one v4 subnet off and stopped advertising it. That’s it, no other changes. For a reason I still cannot figure out, at noonish the CPU started spiking like no one’s business! The router never reboots or goes completely offline, but traffic gets slow, and if I’m in Winbox I get booted for a few seconds. When I get back in, CPU is back down at 6%, where it normally hovers 99.999% of the time; we never get CPU spikes. When it’s about to boot me it goes up to 80%, then straight to like 95%, and kicks me out of Winbox.

I’ve checked the likely culprits, traffic first: traffic is actually low for today, hovering around 1.1 Gbps. Logs: nothing out of the ordinary, no attack login attempts (which are blocked by the firewall anyways), nothing crazy.

Checked firewall rules, of course no changes there. The only thing that is a bit odd is our rule blocking port 23 (the service is turned off on the CCR anyways): a larger than normal amount of packets today, like 300 in the past hour (that’s not normal for us), but still not a stupid crazy amount.

I checked Torch: nothing crazy traffic wise coming to our interface IP. I checked Profile: right before it spikes, it shows Firewall at 17 and total at 20, everything else under 1; then it spikes and disconnects me, so I can’t see if any specific thing spiked. But when I got back in, the per-rule firewall packet counters hadn’t changed much.

I even went so far as to reboot. I hated to, but I was out of options and figured some bug in the OS was going crazy. :frowning:

It came back up, same deal, nothing changed. I’m at a loss; nothing is jumping out and saying “this is it.” Not to mention, nothing other than removing an IP range has changed today; everything was fine until noon. I figured if we had an attack I’d see traffic and be able to catch it: nothing. The DDoS filters upstream of us don’t catch anything today either; checked logs, nothing reported.

We are using ROS 6.48.6, the same OS we’ve had on here for a year. Any thoughts? We’ve had the same config for a year, no issues, and it always runs at roughly 6%.

Every once in a while when it boots me, Winbox gives me this error: ERROR: Router does not support secure connection, please enable legacy mode if you want to connect anyways.

I simply hit connect again and it goes. Is this a bug??

first upgrade to the latest 6.x
export the config
run the profiler tool

if you want to optimize the cpu utilization, upgrade to v7
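For reference, those steps look roughly like this on RouterOS (a sketch; the file names are just examples):

```
# back up and export the current config before touching anything
/system backup save name=pre-upgrade
/export file=pre-upgrade

# then, while a spike is happening, watch per-process CPU usage
/tool profile cpu=all
```

Profile keeps updating until you quit it, so if the spike disconnects Winbox, running it from an SSH session may help catch which process jumps.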

I know it sounds crazy, but we have had similar issues with the 1072 for the past month without changing anything: for no reason we have been plagued with random packet loss every 30s. It’s about 1%, causing issues with UDP traffic like gaming, VoIP, and RTP. We’re running 6.48.5, which has been rock solid for a long time now.

Same traffic patterns, same BGP peers, same firewall rules, even the same amount of traffic passing.
It just feels like something is causing this behavior, like a bad packet or something in the BGP changed.

Mind you, upgrading to 6.49.10 makes it even worse. The only solution is moving to v7, as that seems to solve the issue, however it’s really hard to do at the moment.

Thank you!!! I’m so glad I’m not crazy and I’m not the only one! It was Ghost CPU usage and made no, no, no sense at all!!! :smiley:

So can you confirm upgrading to 7 fixes it for sure? If so, we might plan a maintenance window, as that was craziness yesterday! We’re planning a secondary route on a CCR2216 within the next 3 months, so I wonder if I can make this one last without major issues till then; once it’s up, just do the upgrade and let auto-failover run.

The thing I did not mention is that we have two of these 1072 devices at our core edge. Each of them upstreams to a different ISP, gets full tables, and they are interconnected with each other.
Both of them started exhibiting the same problem at the same time, running the same version. This is why we suspect it’s something with BGP, or possibly some kind of malformed packet.

Due to our flexibility we managed to migrate one of the two devices to version 7. It was a nightmare converting all the BGP rules, testing, and getting everything working. After weeks of effort and sleepless nights we managed to get it fully running, and it’s working nicely.

So it would seem that the answer is moving to v7. However, the big question still remains: what is going on???
Unfortunately MikroTik has mostly stopped supporting v6, and due to the nature of the problem, as well as how heavily these devices are used, it’s really hard to troubleshoot. Additionally, since v7 fixes the issue, I really doubt anyone will look into it.

We also have a CCR2216 in the DC that we’re trying to put into action. It does not exhibit this issue, probably due to the v7 it’s running.

Hopefully you are running into a similar issue and can migrate to v7. Unfortunately it’s a nightmare if you have many iBGP peers and many filters.

Thanks for the info! It’s very strange that it’s just randomly doing it, almost like there are new packets on the web that v6 just doesn’t like :confused: Very strange.

Thanks for letting me know about v7. I’ve never done a full BGP migration to v7 yet; what’s involved in converting BGP tables and filters to v7?

We currently only have one BGP peer, so no biggie there; we’re planning to implement another at the same time as the 2216, so maybe we’ll wait. That being said, our filters are very small right now, so that’s good, and our BGP table is also relatively small at only 6x /24s and 4x /48 IPv6s. So not a lot to migrate, luckily, but by the sounds of it I don’t want to try it without the secondary in place.
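For what it’s worth, the filter syntax is the part that changes the most between v6 and v7. A rough before/after sketch (the chain name and prefix here are placeholders, not anyone’s actual config):

```
# v6 routing filters: one flat rule per line
/routing filter add chain=bgp-out prefix=203.0.113.0/24 action=accept
/routing filter add chain=bgp-out action=discard

# v7 equivalent: each rule becomes a small script
/routing filter rule add chain=bgp-out rule="if (dst == 203.0.113.0/24) { accept }"
/routing filter rule add chain=bgp-out rule="reject"
```

Peers also move from /routing bgp peer to /routing bgp connection, so every peer effectively has to be recreated by hand.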

Upgrading to v7 on TILE actually goes really smoothly.
If you’re using IPv6, you need to add blackhole routes in order to advertise the prefixes.
For IPv4, make sure network synchronization is off, so that the upgrade will automatically add the blackhole routes for you.
Make sure all peers have an update-source IP set, and avoid using BFD for now, even though it works sometimes.
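Putting those tips together, a minimal v7 sketch might look like this (all addresses, AS numbers, and names below are placeholders):

```
# blackhole routes so the aggregates you announce exist in the routing table
/ip route add dst-address=203.0.113.0/24 blackhole
/ipv6 route add dst-address=2001:db8::/48 blackhole

# v7 peers are "connections"; set the update source (local.address) explicitly
/routing bgp connection add name=upstream1 remote.address=192.0.2.1 \
    remote.as=64500 local.address=192.0.2.2 local.role=ebgp \
    output.network=bgp-nets

# bgp-nets is a firewall address list holding the prefixes to advertise
/ip firewall address-list add list=bgp-nets address=203.0.113.0/24
```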

Since you do not use full tables, you are not going to run into that much trouble.