Random OSPF State Down

We have a central core and several border routers that randomly go offline. When this happens, all OSPF adjacencies report down at the same time. Everything is connected over one large MPLS circuit, but we created a VLAN per site (currently 28 sites, with another 20 planned) and set up /30 point-to-point links on each VLAN to segregate the sites and minimize broadcast traffic. We also have MPLS configured; BGP carries all public addressing, and OSPF is used only for management and loopback addressing. We do not see any interface errors or drops. Does anyone have an idea why this might happen?

Error log

feb/17 21:19:33 route,ospf,info OSPFv2 neighbor 10.200.0.1: state change from Full to Init 
feb/17 21:19:34 route,ospf,info OSPFv2 neighbor 10.200.0.1: state change from ExStart to Down 
feb/17 21:19:48 route,ospf,info OSPFv2 neighbor 10.200.0.1: state change from Exchange to Down 
feb/17 21:19:59 route,ospf,info OSPFv2 neighbor 10.200.0.1: state change from ExStart to Down 
feb/17 21:20:14 route,ospf,info OSPFv2 neighbor 10.200.0.1: state change from Full to Down 
feb/17 21:20:38 route,bgp,info Failed to open TCP connection: Network is unreachable 
feb/17 21:20:38 route,bgp,info     RemoteAddress=10.200.0.1 
feb/17 21:20:49 route,ospf,info OSPFv2 neighbor 10.200.0.1: state change from Full to Down 
feb/17 21:21:04 route,ospf,info OSPFv2 neighbor 10.200.0.1: state change from Full to Init 
feb/17 21:21:05 route,ospf,info OSPFv2 neighbor 10.200.0.1: state change from Init to Down 
feb/17 21:21:26 route,bgp,info Connection opened by remote host 
feb/17 21:21:26 route,bgp,info     RemoteAddress=10.200.0.1

Sanitized Configs
Core

/interface bridge
add fast-forward=no name=LoopBack
/interface ethernet
set [ find default-name=sfp-sfpplus3 ] comment="MPLS" l2mtu=2024 mtu=2000
/interface vlan
add comment="Office - MPLS" interface=sfp-sfpplus3 name="MPLS - vlan3001" vlan-id=3001
/routing bgp instance
set default as=300 router-id=10.200.0.1
/routing ospf instance
set [ find default=yes ] mpls-te-area=backbone mpls-te-router-id=LoopBack redistribute-other-ospf=as-type-1 router-id=10.200.0.1
/ip firewall connection tracking
set enabled=no
/ip address
add address=10.200.0.1 interface=LoopBack network=10.200.0.1
add address=10.0.0.33/30 interface="MPLS - vlan3001" network=10.0.0.32
/mpls interface
set [ find default=yes ] mpls-mtu=2020
/mpls ldp
set enabled=yes lsr-id=10.200.0.1 transport-address=10.200.0.1
/mpls ldp interface
add interface="MPLS - vlan3001"
/routing bgp peer
add default-originate=always name=Core1-Office remote-address=10.200.0.10 remote-as=300 ttl=default update-source=LoopBack use-bfd=yes
/routing ospf interface
add interface="MPLS - vlan3001" network-type=point-to-point priority=2 use-bfd=yes
/routing ospf network
add area=backbone network=10.200.0.1/32
add area=backbone network=10.0.0.32/30

Office Site

/interface bridge
add fast-forward=no name=LoopBack
/interface ethernet
set [ find default-name=combo1 ] l2mtu=2024 mtu=2000
/interface vlan
add comment="Core1 - MPLS" interface=combo1 name="combo1 - vlan3001" vlan-id=3001
/routing bgp instance
set default as=300 router-id=10.200.0.10
/routing ospf instance
set [ find default=yes ] router-id=10.200.0.10
/ip address
add address=10.0.0.34/30 interface="combo1 - vlan3001" network=10.0.0.32
add address=10.200.0.10 interface=LoopBack network=10.200.0.10
/mpls ldp
set enabled=yes lsr-id=10.200.0.10 transport-address=10.200.0.10
/mpls ldp interface
add interface="combo1 - vlan3001"
/routing bgp peer
add name=Core1-Office remote-address=10.200.0.1 remote-as=300 ttl=default update-source=LoopBack use-bfd=yes
/routing ospf interface
add interface="combo1 - vlan3001" network-type=point-to-point use-bfd=yes
/routing ospf network
add area=backbone network=10.200.0.10/32
add area=backbone network=10.0.0.32/30

I have enabled OSPF debug, waiting for it to go down again.
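
For anyone wanting to do the same, this is roughly how OSPF debug logging can be turned on in RouterOS v6. It is only a sketch: the action name ospfdebug and the 5000-line buffer size are placeholder choices, not taken from the setup above.

```routeros
# Hypothetical separate memory buffer so debug output does not push out the normal log
/system logging action add name=ospfdebug target=memory memory-lines=5000
# Send messages carrying both the ospf and debug topics to that buffer
/system logging add topics=ospf,debug action=ospfdebug
# After the next flap, review what was logged
/log print where topics~"ospf"
```

Debug logging can be noisy with 28 neighbors, so it is probably worth removing the rule again once a flap has been captured.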

I have experienced a similar problem and traced the symptoms to issues on the layer 2 circuit between my CE and PE devices. A packet capture on both devices revealed that OSPF multicast traffic from the PE towards the CE was being dropped.

I then implemented a temporary fix by changing the OSPF network type from point-to-point to NBMA (NBMA sends its hello messages as unicast to configured neighbors instead of multicast), and this fixed the issue.

I suggest you temporarily change the OSPF network type on both the CE and PE from P2P to NBMA for one of the affected sites, add the necessary NBMA neighbors on each side, and see if that resolves the problem. If it does not, please do a packet capture on both your CE and PE devices and analyze the OSPF packets.
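
A capture along these lines could be run on each side with the built-in sniffer. This is a sketch assuming RouterOS v6 and the interface name from the core config posted earlier; the file name is a placeholder, and the default file size limit still applies.

```routeros
# Capture only OSPF (IP protocol 89) on the per-site VLAN; file name is a placeholder
/tool sniffer set filter-interface="MPLS - vlan3001" filter-ip-protocol=ospf file-name=ospf-core.pcap
/tool sniffer start
# ... wait for the adjacency to drop, then:
/tool sniffer stop
```

The resulting file can be downloaded from Files and opened in Wireshark to compare what one side sent to 224.0.0.5 with what the other side actually received.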

Thanks, I will test that.

Set up NBMA on our link. Set the core to priority 1 and the site to priority 0. Also disabled BFD. Let's see how that works.
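
For reference, applied to the sanitized configs posted earlier, that change would look roughly like this in RouterOS v6. It is a sketch reconstructed from the description, not the exact commands that were run.

```routeros
# Core (10.0.0.33): switch the test link to NBMA, priority 1, BFD off
/routing ospf interface set [find interface="MPLS - vlan3001"] network-type=nbma priority=1 use-bfd=no
# NBMA has no multicast discovery, so the site router is listed by address
/routing ospf nbma-neighbor add address=10.0.0.34 priority=0

# Office site (10.0.0.34): same link, priority 0 so it never becomes DR
/routing ospf interface set [find interface="combo1 - vlan3001"] network-type=nbma priority=0 use-bfd=no
/routing ospf nbma-neighbor add address=10.0.0.33 priority=1
```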

What kind of router is this? CCR?

Core - CCR1072
Client Sites -
CCR1009
RB4011
RB2011

What is the output of the following command on your core and office site routers?

/routing ospf interface print

The most likely cause is BFD; it may report link-down events on CCR routers even if the link is OK. I would suggest not using BFD on CCRs.

Had another outage last night that spiked the CPUs on the CCR1072 core to 100%. We disabled BFD on all OSPF and BGP links. Hoping that doesn’t happen again. The one test link stayed connected but could not route since the CPUs were maxed.
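
In case it helps someone else, disabling BFD everywhere can be done in bulk rather than per link. A minimal sketch for RouterOS v6, to be run on each router:

```routeros
# Turn BFD off on every OSPF interface and BGP peer that currently uses it
/routing ospf interface set [find use-bfd=yes] use-bfd=no
/routing bgp peer set [find use-bfd=yes] use-bfd=no
# Check that no BFD sessions remain
/routing bfd neighbor print
```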

I have a similar issue, but it did not go away even after changing the OSPF network type from p2p to NBMA. I am running eBGP, iBGP and OSPF.
My setup is:
CCR1036-8G-2S+ ------------------ CHR
|
CRS317-1G-16S+
|
|----------------------|----------------------|
CCR1036-12G-4S         CCR1009-7G-1C-1S+      CCR1009-7G-1C-1S+

The OSPF adjacencies with all adjacent routers keep dropping at once (at the same time), and iBGP drops too.
What could be the issue? I am running ROS long-term v6.44.6. I have ruled out a switch failure, since there is a CHR directly connected to the core router and it still experiences the same problem.
Below is the typical error:
22:53:08 route,ospf,info OSPFv2 neighbor 102.xx.xx.6: state change from Full to Down
22:53:08 route,ospf,info OSPFv2 neighbor 102.xx.xx.17: state change from Full to Down
22:53:08 route,ospf,info OSPFv2 neighbor 102.xx.xx.18: state change from Full to Down
22:53:08 route,ospf,info OSPFv2 neighbor 102.xx.xx.19: state change from Full to Down
22:53:08 route,ospf,info OSPFv2 neighbor 102.xx.xxx.1: state change from Full to Down
22:53:08 route,ospf,info OSPFv2 neighbor 102.xx.xxx.1: state change from Full to Down
22:53:08 route,ospf,info OSPFv2 neighbor 102.xx.xx.6: state change from ExStart to Init
22:53:08 route,ospf,info OSPFv2 neighbor 102.xx.xxx.1: state change from ExStart to Init
22:53:08 route,ospf,info OSPFv2 neighbor 102.xx.xxx.1: state change from ExStart to Init
core
rout ospf int pr
Flags: X - disabled, I - inactive, D - dynamic, P - passive
 #    INTERFACE    COST PRIORITY NETWORK-TYPE   AUTHENTICATION AUTHENTICATION-KEY
 0    vlan51         10        1 nbma           none
 1    vlan50         10        1 point-to-point none
 2    vlan52         10        1 point-to-point none
 3    vlan49         20        1 point-to-point none
 4    vlan1205       25        1 nbma           none
 5    vlan58-CHR     10        1 nbma           none
 6 DP loopback0      10        1 broadcast      none

Did you ever find a solution to this problem?
I suspect we are experiencing the same thing.
All OSPF and BGP sessions drop at the same time and return a few seconds afterwards.

Even BGP sessions that do not depend on OSPF drop.
Running 6.48.6 on CCR1072s and CCR1036s.

This particular problem is solved in ROSv7 by running OSPF and BGP in separate processes.

@Wolfraider

Maybe this is a silly question, but it is part of the troubleshooting.
Can you confirm whether L1 is all good?

Yes, L1 is good. Also, some of the BGP peers that drop are on a separate port and dedicated fiber, outside of the OSPF L1.

The problem happens when a Palo Alto or FortiGate firewall running OSPF with the CCRs fails over between the active and the passive firewall device.
But the FortiGate OSPF adjacency (and the unrelated global BGP on a different L1) should not drop when the Palo Alto does a failover, or vice versa.
It simply seems like the whole routing engine on the CCR dies and restarts.

It is good that OSPF and BGP do not die together in ROS7.

If you want to stick to v6, then increasing OSPF dead-interval and BGP hold-time may help. Also, make sure that timers are at least set to default values or higher.
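
As a rough illustration of that suggestion on v6 (defaults there are a 10s hello and 40s dead interval for OSPF, and a 3m hold time for BGP). The interface and peer names are taken from the core config at the top of the thread, and the raised values are example numbers only, not tested recommendations.

```routeros
# Check the current values first
/routing ospf interface print detail
/routing bgp peer print detail
# Example values only; OSPF hello/dead intervals must match on both ends of a link
/routing ospf interface set [find interface="MPLS - vlan3001"] hello-interval=10s dead-interval=1m
/routing bgp peer set [find name=Core1-Office] hold-time=4m
```

The same OSPF change would have to be made on the office side of the link, otherwise the adjacency will not form because of the interval mismatch.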

No dead timers or hold timers are expiring. At the same instant the failover happens, all BGP and OSPF sessions go down.
Also, the interface list and peer list are empty for a fraction of a second.

We are testing v7 in a lab, so the plan is to eventually move.

Old thread, but we are having a very similar issue. When a customer HA Palo Alto router fails over, it is killing the OSPF process on our side.

Any resolution besides ros7?

We have not yet tested ros7 in this network due to the complexities of OSPF, MPLS, VPLS, and L3VPN.