Serious issue with BGP and OSPF

shaoranrch · Sun Feb 19, 2017 3:39 pm

Hello,

Lately we had a major problem with one of our core routers. This device basically receives a full feed and sends it to 2 customers, plus a default route and partial internal routes (only around 12 prefixes) to 4 others. These customers only send us 3-6 prefixes each.

Each customer is filtered: we only allow their previously stated prefixes into our network and nothing else, and we only send them either the full feed, a subset of internal routes (12 routes), or just the default.
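To make the inbound filtering idea concrete, here is a minimal Python sketch of a per-customer prefix allow-list. It is not RouterOS configuration, only an illustration of the logic. The prefixes are documentation ranges chosen for the example, and the helper names `build_allow_list` and `is_accepted` are hypothetical.

```python
import ipaddress

def build_allow_list(prefixes):
    """Parse the prefixes a customer declared in advance."""
    return [ipaddress.ip_network(p) for p in prefixes]

def is_accepted(announced, allow_list, exact=True):
    """Return True if an announced prefix passes the allow-list.

    exact=True  accepts only the declared prefix itself.
    exact=False also accepts more-specific prefixes inside a declared one.
    """
    net = ipaddress.ip_network(announced)
    for allowed in allow_list:
        if exact:
            if net == allowed:
                return True
        elif net.version == allowed.version and net.subnet_of(allowed):
            return True
    return False

# Hypothetical customer prefixes (documentation ranges, not real data).
customer_allow = build_allow_list(
    ["192.0.2.0/24", "198.51.100.0/24", "203.0.113.0/24"]
)

# Hypothetical announcements received from that customer.
announced = ["192.0.2.0/24", "198.51.100.128/25", "10.0.0.0/8"]
accepted = [p for p in announced if is_accepted(p, customer_allow)]
print(accepted)
```

With exact matching, only announcements identical to a declared prefix get through; everything else is dropped before it reaches the routing table.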

The device itself is a CCR1036 running the latest bugfix release (6.37.4) with 8 GB of RAM. It is also configured with OSPF and forms adjacencies with 2 other devices.

The issue started with an indirect BGP failure for 2 customers: the hold timer expired for both, the sessions went down, and then they came back up. They flapped like this 3 times over a period of around 10 minutes. These customers only receive 12 routes plus the default from us, and we only get 3 routes from each.

What we find absurd is that both OSPF adjacencies on this core router were lost at the same time the flapping with these 2 clients started. The BGP sessions went up and down, and the adjacencies went down with them. This repeated 3 times until the peering became stable. The customers are on different ports of the router going to different devices, and the lost adjacencies are on different physical ports directly connected to other devices (point-to-point links, not via a switch). We don't use wireless at all; assume that all connections are either fiber or copper.

We had moments where everything was OK, but 3 minutes later OSPF went down at the very moment the BGP hold timer for these clients reached 0 (so the BGP session with them was lost). Then everything was OK again, and 3 minutes later the same issue occurred. Again, this all ended when the flapping stopped.
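One way to reason about this timing is to compare protocol timers against a hypothetical stall in which the router processes no incoming control traffic. The Python sketch below does that. The timer values are assumptions (a 40 s OSPF dead interval and a 180 s BGP hold time, which would match the 3 minute interval described above); the configured values are not stated here and should be checked on the device.

```python
# Assumed timer values in seconds; not taken from the actual configuration.
TIMERS = {
    "ospf_dead_interval": 40,  # assumption: 4 x 10 s hello interval
    "bgp_hold_time": 180,      # assumption: 3 minutes
}

def sessions_lost(stall_seconds, timers=TIMERS):
    """Names of the timers that expire if nothing is received for stall_seconds."""
    return sorted(
        name for name, limit in timers.items() if stall_seconds >= limit
    )

# Try a few hypothetical stall durations.
for stall in (30, 60, 180):
    print(stall, sessions_lost(stall))
```

Under these assumed values, any stall long enough to expire the BGP hold timer is also long enough to expire the OSPF dead interval, although OSPF would be expected to drop earlier in the stall rather than at the same instant.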

I need to know why this is happening. Aside from MPLS, the device doesn't have any other configuration at all, and I can't believe that a flapping peer makes this device behave like this. I am also sure that it didn't lock up, because it kept registering everything in its internal log.

No other devices in the network had any issues at all.

I do know that BGP is slow on CCR and drives a single CPU core to full usage. But the device still has 35 other cores to keep working.

We are also noticing that it freezes in Winbox (it won't answer Winbox, but we can still access it via SSH) when anything BGP-related happens. We see this behavior on all the CCRs we have.

I'd appreciate it if someone could help with this issue.

StubArea51 · Sun Feb 19, 2017 5:17 pm

Sounds like you need to separate the customer feeds onto another router. 7 peerings that contain full or partial routes will definitely tax the resources on your 1036.

From a design standpoint, you don't want the peerings for customers to be on the edge router anyway. It's much better to connect them to a BGP RR dedicated to customer connectivity with a full table and then you only have 1 full feed that has to leave your edge router. CHR or x86 work very well for the task of BGP peering since the x86 architecture is more powerful for single core performance.

Here is an overview of that type of design using CHRs I presented at MUM Europe 2016.

Video
https://mum.mikrotik.com/2016/EU/agenda/EN

Slides
https://mum.mikrotik.com//presentations ... 817868.pdf

shaoranrch · Sun Feb 19, 2017 5:27 pm

Hi, thanks for the reply. Actually, this is not the device connecting to our carriers; it just receives the routes from our edge devices and replicates them to customers.

There is one point, though. I understand how BGP cripples a single CCR core. But I don't understand why, with other cores available, the OSPF process had this issue as well. In theory, this shouldn't have happened.

Is there any chance that this time the OSPF process was running on the very same core as the BGP process?

Sent from my SAMSUNG-SM-G920A using Tapatalk

StubArea51 · Sun Feb 19, 2017 6:03 pm

It's possible, though not likely. But in general, the more you can distribute workloads, the more stable your network will be.

Don't rule out the possibility of an attack on your customers either - there are many odd behaviors that can happen when a customer experiences a DDoS or other kind of malicious traffic.

shaoranrch · Sun Feb 19, 2017 7:14 pm

Again, thank you very much for your answers. It also comes to my attention that these peers, even though they were flapping, aren't sending or receiving more than 20 routes. If we rule out an attack, does such a small set of prefixes tax the CPU so much?

We will consider the distributed scheme, but for now I'd like to know the limits of this technology. I've worked with these devices for years and this is the first time I've seen something like this.
