We have a central core and several border routers that randomly go offline, and we have noticed that all OSPF adjacencies go down at the same time. One large MPLS circuit connects everything together, but we created a VLAN per site (currently 28 sites, with another 20 planned) and set up /30 point-to-point links to segregate the sites and minimize broadcast traffic. We also have MPLS configured; we use BGP for all public addressing and OSPF only for management and loopback addressing. We do not see any interface errors or drops. Does anyone have an idea why this might happen?
Error log
feb/17 21:19:33 route,ospf,info OSPFv2 neighbor 10.200.0.1: state change from Full to Init
feb/17 21:19:34 route,ospf,info OSPFv2 neighbor 10.200.0.1: state change from ExStart to Down
feb/17 21:19:48 route,ospf,info OSPFv2 neighbor 10.200.0.1: state change from Exchange to Down
feb/17 21:19:59 route,ospf,info OSPFv2 neighbor 10.200.0.1: state change from ExStart to Down
feb/17 21:20:14 route,ospf,info OSPFv2 neighbor 10.200.0.1: state change from Full to Down
feb/17 21:20:38 route,bgp,info Failed to open TCP connection: Network is unreachable
feb/17 21:20:38 route,bgp,info RemoteAddress=10.200.0.1
feb/17 21:20:49 route,ospf,info OSPFv2 neighbor 10.200.0.1: state change from Full to Down
feb/17 21:21:04 route,ospf,info OSPFv2 neighbor 10.200.0.1: state change from Full to Init
feb/17 21:21:05 route,ospf,info OSPFv2 neighbor 10.200.0.1: state change from Init to Down
feb/17 21:21:26 route,bgp,info Connection opened by remote host
feb/17 21:21:26 route,bgp,info RemoteAddress=10.200.0.1
I have experienced a similar problem and traced it to issues on the Layer 2 circuit between my CE and PE devices. A packet capture on both devices revealed that OSPF multicast traffic from the PE towards the CE devices was being dropped.
I then implemented a temporary fix by changing the OSPF network type from point-to-point to NBMA (note that NBMA does not use multicast for its hello messages; they are sent as unicast to the configured neighbors), and this fixed the issue.
I suggest you temporarily change the OSPF network type on both the CE and PE from P2P to NBMA for one of the affected sites, add the necessary NBMA neighbors on each side, and see if that resolves the problem. If it does not, please run a packet capture on both your CE and PE devices and analyze the OSPF packets.
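A rough sketch of that test on RouterOS v6, assuming the CE is a MikroTik router. The interface name vlan-site1 is a placeholder, and 10.200.0.1 is simply the neighbor address from the log above; the mirror-image change is needed on the other end of the link.

```
# Hypothetical interface name; use the VLAN that carries the /30 towards the PE.
# If a static OSPF interface entry already exists, use "set" on it instead of "add".
/routing ospf interface add interface=vlan-site1 network-type=nbma
# NBMA neighbors are not discovered via multicast, so the far end is listed explicitly.
/routing ospf nbma-neighbor add address=10.200.0.1
# If the adjacency still flaps, capture OSPF packets (IP protocol 89) on that interface.
/tool sniffer quick interface=vlan-site1 ip-protocol=ospf
```

Comparing the capture on both ends should show whether hellos leave one router but never arrive at the other.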
We had another outage last night that spiked the CPUs on the CCR1072 core to 100%. We have disabled BFD on all OSPF and BGP links and are hoping it does not happen again. The one test link stayed connected but could not route, since the CPUs were maxed out.
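For anyone wanting to try the same thing, this is roughly how BFD could be turned off and the CPU load inspected on RouterOS v6. It is only a sketch and assumes BFD was enabled through the use-bfd property on the OSPF interfaces and BGP peers.

```
# Turn BFD off wherever it is currently enabled (RouterOS v6 menus).
/routing ospf interface set [find use-bfd=yes] use-bfd=no
/routing bgp peer set [find use-bfd=yes] use-bfd=no
# During a spike, check per-core load and which process is consuming the CPU.
/system resource cpu print
/tool profile
```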
I have a similar issue, but it did not go away even after changing the OSPF network type from P2P to NBMA. I am running eBGP, iBGP and OSPF.
My setup is
CCR1036-8G-2S+ ------------------ CHR
       |
CRS317-1G-16S+
       |
       |-------------------------|-------------------------|
CCR1036-12G-4S          CCR1009-7G-1C-1S+         CCR1009-7G-1C-1S+
OSPF keeps dropping for all adjacent routers at once (at the same time), and iBGP drops too.
What could be the issue? I am running RouterOS long-term v6.44.6. I have ruled out a switch failure, since a CHR is directly connected to the core router and still experiences the same problem.
Below is the error I commonly see:
22:53:08 route,ospf,info OSPFv2 neighbor 102.xx.xx.6: state change from Full to Down
22:53:08 route,ospf,info OSPFv2 neighbor 102.xx.xx.17: state change from Full to Down
22:53:08 route,ospf,info OSPFv2 neighbor 102.xx.xx.18: state change from Full to Down
22:53:08 route,ospf,info OSPFv2 neighbor 102.xx.xx.19: state change from Full to Down
22:53:08 route,ospf,info OSPFv2 neighbor 102.xx.xxx.1: state change from Full to Down
22:53:08 route,ospf,info OSPFv2 neighbor 102.xx.xxx.1: state change from Full to Down
22:53:08 route,ospf,info OSPFv2 neighbor 102.xx.xx.6: state change from ExStart to Init
22:53:08 route,ospf,info OSPFv2 neighbor 102.xx.xxx.1: state change from ExStart to Init
22:53:08 route,ospf,info OSPFv2 neighbor 102.xx.xxx.1: state change from ExStart to Init
core
rout ospf int pr
Flags: X - disabled, I - inactive, D - dynamic, P - passive
Did you ever find a solution to this problem?
I suspect we are experiencing the same thing.
All OSPF and BGP sessions drop at the same time and return a few seconds afterwards.
Even BGP sessions that do not depend on OSPF drop.
We are running 6.48.6 on CCR1072s and CCR1036s.
Yes, L1 is good. Also, some of the BGP peers that drop are on a separate port and dedicated fiber, outside of the OSPF L1.
The problem happens when a Palo Alto or FortiGate firewall that runs OSPF with the CCRs fails over between its active and passive units.
But FortiGate OSPF (and unrelated global BGP on a different L1) should not drop when the Palo Alto fails over, or vice versa.
It simply seems like the whole routing engine on the CCR dies and restarts.
It is good that OSPF and BGP do not die together in RouterOS 7.
If you want to stick to v6, then increasing OSPF dead-interval and BGP hold-time may help. Also, make sure that timers are at least set to default values or higher.
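As a sketch of what that could look like on RouterOS v6: the interface name, peer name and chosen values below are placeholders for illustration, not recommendations.

```
# RouterOS v6 defaults are hello-interval=10s and dead-interval=40s.
# OSPF timers must match on both neighbors or the adjacency will not form.
/routing ospf interface set [find interface=vlan-site1] hello-interval=10s dead-interval=60s
# BGP hold-time defaults to 3m; the session uses the lower of the two peers' values,
# so raise it on both sides. "peer1" is a hypothetical peer name.
/routing bgp peer set [find name=peer1] hold-time=4m
```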
No dead timers or hold timers are expiring. At the very instant the failover happens, all BGP and OSPF sessions go down.
Also, the interface list and peer list are empty for a fraction of a second.
We are testing v7 in a lab, so the plan is to move eventually.