Hello,
Lately we had a major problem with one of our core routers. This device receives a full feed and sends it to 2 customers, and sends a default route plus a partial set of internal routes (around 12 prefixes) to 4 others. These customers each send us only 3-6 prefixes.
Each customer is filtered: we only accept their previously declared prefixes into our network and nothing else, and we only send them either the full feed, a subset of internal routes (12 routes), or just the default.
The device is a CCR1036 running the latest bugfix release (6.37.4) with 8 GB of RAM. It also runs OSPF and has adjacencies with 2 other devices.
The issue started with an indirect BGP failure for 2 customers: the hold timer expired for both, the sessions went down, and then they came back up. They flapped like this 3 times over a period of about 10 minutes. These customers only receive 12 routes plus the default from us, and we only receive 3 routes from each.
What we find absurd is that both OSPF adjacencies on this core router were lost at the same moment the flapping with these 2 customers started. The BGP sessions went up and down, and the adjacencies went down with them. This repeated 3 times until the peering became stable. The customers are on different ports of the router, going to different devices, and the lost adjacencies are on different physical ports directly connected to other devices (point-to-point links, not via a switch). We don't use wireless at all; assume all connections are either fiber or copper.
There were periods where everything was fine, but about 3 minutes later OSPF went down at the exact moment the BGP hold timer for these customers reached 0 (so the BGP session with them was lost). Then everything was fine again, and 3 minutes later the same thing happened. This all ended when the flapping stopped.
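To back up the claim that the OSPF drops line up with the BGP hold timer expiry, the timestamps from the router log could be compared with something like the sketch below. This is only an illustration: the function name `coinciding_events`, the 2-second default tolerance, and the idea that the log entries have already been extracted into Python `datetime` values are all my own assumptions, not anything RouterOS provides.

```python
from datetime import datetime, timedelta
from typing import List, Tuple


def coinciding_events(
    bgp_down: List[datetime],
    ospf_down: List[datetime],
    tolerance: timedelta = timedelta(seconds=2),  # assumed window, adjust as needed
) -> List[Tuple[datetime, datetime]]:
    """Return (bgp_time, ospf_time) pairs that are at most `tolerance` apart.

    Both inputs are hypothetical lists of timestamps taken from the router log:
    one for BGP session-down events, one for OSPF neighbor-down events.
    """
    pairs = []
    for b in bgp_down:
        for o in ospf_down:
            # abs() on a timedelta gives the distance regardless of ordering
            if abs(b - o) <= tolerance:
                pairs.append((b, o))
    return pairs
```

If every OSPF neighbor-down entry pairs up with a BGP session-down entry, that would support the idea that both protocols are stalling together rather than failing independently.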
I need to know why this is happening. Aside from MPLS, the device doesn't have any other configuration, and I can't believe that a flapping peer makes it behave like this. I am also sure that it didn't lock up, because it kept recording everything in its internal log.
No other devices in the network had any issues at all.
I do know that BGP is slow on the CCR and drives a single CPU core to full usage, but the device still has the other 35 cores to keep working.
We are also noticing that it freezes in Winbox (it won't respond to Winbox, but we can still access it via SSH) whenever anything BGP-related happens. We see this behavior on all the CCRs we have.
I'd appreciate it if someone could help with this issue.