OSPF crash with BGP aggregate

Hi MT,

Running 6.12, I have a PPPoE access server at a pop site. It’s an ABR between the backbone and the stub for the pop as well as being the PPPoE server. The router runs OSPF for connectivity only without any redistribution (i.e., no statics, connected, or other redistribution). This works fine and for the most part is stable. OSPF is not affected by any customer routes coming on or off the network.

On top of this, I use BGP to distribute routes around my network, such as PPPoE pools and internet routes. At my pops, I have BGP aggregates for the PPPoE pools. The aggregates are a new addition and also work well, except that OSPF crashes whenever a PPPoE user whose IP address is inside one of the BGP aggregates disconnects. BGP then drops because it no longer has connectivity to the route reflector until OSPF re-establishes.

This is entirely repeatable; it happens every time.

For example, I have 10.128.0.0/24 as a BGP aggregate. OSPF doesn’t know about this range, as OSPF is only configured on the physical interfaces of the routers (essentially 10.0.0.0/9). When a customer with IP 10.128.0.50 connects to the network, 10.128.0.0/24 is distributed via BGP to our route reflector and reflected onwards. When this user disconnects, OSPF (on this router) crashes and restarts every time. If I disable the BGP aggregate, it works fine with no crashing.
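
To make the trigger concrete, toggling the aggregate in the lab can be sketched as below. This assumes RouterOS v6 syntax and the 10.128.0.0/24 aggregate from the example; adjust the prefix to match your own setup.

```
# Disable the aggregate: PPPoE disconnects no longer crash routing
/routing bgp aggregate disable [find prefix=10.128.0.0/24]

# Re-enable it: the crash returns on the next disconnect inside the /24
/routing bgp aggregate enable [find prefix=10.128.0.0/24]
```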

Here’s my config for the PPPoE router in my lab setup replicating this issue (x86):
/interface bridge
add name=lo0-mgmt
/interface ethernet
set [ find default-name=ether1 ] name=ether1-wan speed=1Gbps
set [ find default-name=ether3 ] name=ether2-cust speed=1Gbps
set [ find default-name=ether2 ] name=ether3-failover speed=1Gbps
/ppp profile
add local-address=10.128.0.2 name=internet-ppp-profile
/routing bgp instance
set default redistribute-connected=yes router-id=10.32.0.2
/routing ospf area
add area-id=10.32.0.0 default-cost=1 inject-summary-lsas=yes name=pop type=stub
/routing ospf instance
set [ find default=yes ] router-id=10.32.0.2
/interface pppoe-server server
add disabled=no interface=ether2-cust service-name=internet2
/ip address
add address=10.0.0.33/24 interface=ether1-wan network=10.0.0.0
add address=10.32.0.2/32 interface=lo0-mgmt network=10.32.0.2
add address=10.32.0.66/30 interface=ether3-failover network=10.32.0.64
/ppp secret
add name=ppp2 password=ppp2 profile=internet-ppp-profile remote-address=10.128.0.140
/routing bgp aggregate
add include-igp=yes instance=default prefix=10.128.0.0/24
/routing bgp peer
add name=rasgw1 out-filter=pop-bgp-out remote-address=10.1.0.5 remote-as=65530 update-source=lo0-mgmt
add name=rasgw2 out-filter=pop-bgp-out remote-address=10.1.0.6 remote-as=65530 update-source=lo0-mgmt
add name=pppoe1 remote-address=10.32.0.1 remote-as=65530 update-source=lo0-mgmt
/routing filter
add action=discard chain=pop-bgp-out prefix=10.0.0.0/9 prefix-length=9-32
/routing ospf area range
add area=pop range=10.32.0.0/24
add area=backbone range=10.0.0.0/9
/routing ospf interface
add authentication=md5 authentication-key=xxxxx dead-interval=15s hello-interval=3s interface=ether1-wan
add authentication=md5 authentication-key=xxxxx dead-interval=15s hello-interval=3s interface=ether3-failover
add passive=yes
/routing ospf network
add area=backbone network=10.0.0.0/24
add area=backbone network=10.32.0.0/26
add area=pop network=10.32.0.0/11
/system identity
set name=pppoe2

Does this happen in earlier versions as well?

Yep. Just tried it on 5.26 and same thing happens.

I would be curious to see if the same thing happens when RIP is used to provide reachability for BGP. Not advocating running RIP for production, but it would be helpful to see if the issue is specific to OSPF.

Yes, would be interesting to see. I’ll give it a try.
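
A rough sketch of the swap on the lab router, assuming RouterOS v6 RIP syntax and the interface networks from the config above. The upstream neighbours would need matching RIP config, which is not shown here.

```
# Stop OSPF from running on any network
/routing ospf network disable [find]

# Run RIP on the same links and the loopback range instead
/routing rip network
add network=10.0.0.0/24
add network=10.32.0.0/24
```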

Is this not a fairly standard setup? Maybe I’m unique in having my PPPoE server on the same device as my ABR/ASBR. Just trying to reduce kit/power at the tower.

Rich

Tested with RIP and the same thing happens, although the effects are much less noticeable because RIP has no neighbour adjacencies to tear down the way OSPF does. Routing crashes, all the routes are momentarily lost, and BGP then drops its session with its peer (the route reflector) as it no longer has connectivity.

It’s possible my first thought that OSPF is crashing may be wrong. Maybe BGP is crashing the whole routing process, since the crash only occurs when BGP aggregation is turned on for IPs associated with PPPoE interfaces. Whether RIP or OSPF is used seems unimportant.

I have this problem too, but I wasn’t able to isolate the cause (BGP aggregation). Thanks for the info!

Hope it will get fixed soon!

I’d suggest submitting your findings to support@mikrotik.com, and perhaps including a supout at time of routing crashing if possible.
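
For anyone following along, a support output file can be generated from the terminal right after the crash. A minimal example, assuming RouterOS v6 and a hypothetical file name:

```
# Writes routing-crash.rif to the router's file store
/system sup-output name=routing-crash.rif
```

The resulting file should then show up in the router's file list, from where it can be downloaded and attached to the support email.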

It is interesting to note though. Is there a reason to use aggregates in your setup? Are you just trying to avoid larger route tables with /32’s all over?

There are a few reasons really…

  1. It keeps things clean, like you say. Keeps all those /32’s out of the tables.
  2. It speeds up users’ initial connection. The first user to connect will trigger a /24 aggregate to be announced for the pool. Subsequent users don’t need to wait for their /32 to propagate (although this is only a few seconds for full convergence, it’s still better).
  3. It reduces routing protocol chatter and convergence times. Each of our sites peers with a central route reflector, so without aggregation all those /32’s would be advertised across every site.
  4. Doing it right when we’re small is easier than fixing it when we grow. This solution scales.

Do I need to send this to support if I flagged the post as a bug report?

Rich

It’s best to email support anyway, I think. Include a link to this thread, and more details like supouts if you can.

So are the /24’s moving between routers dynamically?

Assuming not, since pools are generally statically assigned (unless you have funky API stuff going on in a 3rd-party app), moving a /24 from one router to another requires manual work anyway. So perhaps add the /24 to BGP networks, with synchronise turned off so it’s always advertised? When it comes time to move a /24 around, you could add or remove it from there.
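
That suggestion could look roughly like this on the PPPoE router. It is only a sketch, assuming RouterOS v6 syntax and the 10.128.0.0/24 pool and pop-bgp-out filter chain from the config posted earlier. The extra filter rule is an assumption on my part: with redistribute-connected=yes and no aggregate, the customer /32’s would otherwise be advertised alongside the /24.

```
# Take the aggregate out of the picture
/routing bgp aggregate disable [find prefix=10.128.0.0/24]

# Always advertise the pool, whether or not a matching IGP route exists
/routing bgp network
add network=10.128.0.0/24 synchronize=no

# Assumed extra rule: keep the more specific customer routes out of BGP
/routing filter
add action=discard chain=pop-bgp-out prefix=10.128.0.0/24 prefix-length=25-32
```

Note that the filter only applies to peers that use pop-bgp-out, and that with this approach the /24 is announced even when no customer is connected.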

It is only a small workaround until the issue gets fixed, but just another idea.