I have a recurring problem: one interface on one router stops passing layer 3 traffic (MAC telnet still works). Sometimes restarting the interface (disable, then enable) solves the issue, but sometimes a reboot is required. All the other interfaces keep working as intended. I have tried the following without any improvement:
- Using another interface
- Replacing the router
We have a provider (transit-provider) with whom we run BGP on vlan1200. Two edge routers run VRRP on the provider interface (vlan1200) and OSPF towards all the core routers. rou-edge-02 has an increased cost on the interface, and the core routers also have an increased cost on the OSPF interface facing rou-edge-02. These two routers are only in the backbone area and redistribute the default route.
The core routers all have the same config:
- OSPF in area 0 towards the edge routers or between the cores themselves (rou-core-03<-->rou-core-04 and rou-core-05<-->rou-core-06). Only rou-core-03 redistributes static routes (just one route).
- Every core router also has a different OSPF area (1, 2, 4, 5) that is used to share routes with the rou-access routers. Each area has 2 to 10 routers; in the diagram, each area is represented by a different rou-access color.
- When the router stops passing traffic, I can still connect to it using mac-telnet, but ping, SSH and OSPF stop working (so I assume all layer 3 stops working).
- PPPoE clients and OSPF neighbours connected to the internal area (the purple one) continue to work as if there were no problem with the device.
- A quick (and ugly) fix I applied while investigating the issue is a netwatch on rou-core-01 pointing at the IP of rou-edge-01. When the netwatch goes "down", I disable and re-enable the affected interface (spf-sfpplus2). This works 9 times out of 10, and after 12-15 seconds everything is working "fine" again.
- rou-core-02 is a manual backup of rou-core-01. On the backbone area it has different VLANs, IPs and OSPF interface cost, and it is running (so I can connect to the rou-core-02 loopback IP). On the areaX side it has the same config (various VLANs and PPPoE servers, no addresses declared), but disabled. If anything happens to rou-core-01, I can enable all the disabled config on rou-core-02, and after a few seconds all the PPPoE clients and routers start reconnecting to this router. This procedure has been tested and it works.
- If I swap rou-core-01 (active to standby) and rou-core-02 (standby to active), the problem is also reproduced.
- The problem tends to happen at the start of working hours (08:00-09:00 in the morning, 15:00-16:00 in the afternoon); at night and on weekends it is very rare.
- Also, I cannot force it to happen: it can work perfectly for 10 days, then fail, recover, and fail 4 times in the next 2 days.
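For reference, the netwatch workaround mentioned in the list above could look roughly like the sketch below. It assumes RouterOS v6 syntax (which the OSPF config in this post suggests) and assumes the interface bounce is triggered from the netwatch down-script. The host address 10.8.0.33, the 10s interval, the 2s timeout and the 2s delay are placeholders rather than values from the real config; only the interface name comes from the description above.

```routeros
# Hypothetical sketch for rou-core-01: host, interval, timeout and delay are placeholders.
# When rou-edge-01 stops answering, log the event and bounce the affected interface.
/tool netwatch
add host=10.8.0.33 interval=10s timeout=2s comment="watch rou-edge-01" down-script="/log warning \"netwatch: rou-edge-01 unreachable, bouncing spf-sfpplus2\"; /interface disable [find name=\"spf-sfpplus2\"]; :delay 2s; /interface enable [find name=\"spf-sfpplus2\"]"
```

Logging from the down-script leaves a timestamp for every bounce, which may help to correlate the failures with the 08:00-09:00 and 15:00-16:00 windows.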
#rou-edge-01
/routing ospf instance
set [ find default=yes ] distribute-default=always-as-type-1 redistribute-other-ospf=as-type-1 router-id=10.0.0.0
/routing ospf interface
add network-type=broadcast passive=yes
add interface=vlan30 network-type=point-to-point
add cost=15 interface=40 network-type=point-to-point
add interface=vlan10 network-type=point-to-point
add interface=vlan20 network-type=point-to-point
/routing ospf network
add area=backbone network=10.8.0.32/29
add area=backbone network=10.0.0.0/32
add area=backbone network=10.8.0.48/29
add area=backbone network=10.8.0.16/29
add area=backbone network=10.8.0.0/29
#rou-core-01
/routing ospf area
add area-id=0.0.0.1 default-cost=1 inject-summary-lsas=no name=area1 type=stub
/routing ospf instance
set [ find default=yes ] router-id=10.0.0.6
/routing ospf area range
add area=area1 range=a.b.c.d/25
add area=area1 range=a.b.c.f/27
/routing ospf interface
add network-type=broadcast passive=yes
add interface=vlan30-edge01 network-type=point-to-point
add cost=20 interface=vlan31-edge02 network-type=point-to-point
add interface=vlan100-ptp-location1 network-type=point-to-point
add interface=ipip-location2 network-type=point-to-point
add cost=20 interface=location03-path01 network-type=point-to-point use-bfd=yes
add interface=location03-path02 network-type=point-to-point use-bfd=yes
/routing ospf network
add area=backbone comment=edge01 network=10.8.0.32/29
add area=backbone comment=edge02 network=10.8.0.40/29
add area=backbone comment=loopback network=10.0.0.6/32
add area=area1 network=a.b.j.0/23
add area=area1 network=10.8.10.136/29
add area=area1 network=10.8.10.168/29
add area=area1 network=10.8.200.0/30
add area=area1 network=10.8.200.4/30
add area=area1 network=a.b.k.0/24
add area=area1 network=j.k.f.0/22
#rou-core-03
/routing ospf area
add area-id=0.0.0.2 name=area2
/routing ospf instance
set [ find default=yes ] redistribute-static=as-type-1 router-id=10.0.0.2
/routing ospf area range
add area=area2 range=k.m.l.d/28
add area=area2 range=h.m.t.d/29
/routing ospf interface
add interface=gre-location5 network-type=point-to-point
add cost=20 interface=gre-location7 network-type=point-to-point
add network-type=broadcast passive=yes
add cost=15 interface=vlan11-edge02 network-type=point-to-point
add interface=vlan10-edge01 network-type=point-to-point
add interface=vlan80-core03 network-type=point-to-point
add interface=vlan350-access01 network-type=point-to-point
add interface=vlan351-access02 network-type=point-to-point
/routing ospf network
add area=backbone comment=location5 network=10.55.55.20/30
add area=backbone comment=location7 network=10.55.55.24/30
add area=backbone comment=loopback network=10.0.0.2/32
add area=backbone network=10.8.0.8/29
add area=backbone network=10.8.0.0/29
add area=backbone network=10.8.0.64/29
add area=area2 network=10.8.12.0/22
add area=area2 network=k.m.d.f/22
add area=area2 network=w.z.d.f/24
add area=area2 network=r.g.v.d/23
add area=area2 network=10.10.0.0/16
Any help or advice would be really appreciated.
Thank you!