gerardtik
just joined
Topic Author
Posts: 6
Joined: Wed Nov 07, 2018 2:05 pm

L3 traffic stops passing on specific interface (OSPF related maybe)

Tue Apr 28, 2020 6:36 pm

Hello all,
I've got a recurring problem: one interface of one router stops passing traffic (layer 3 only; MAC telnet still works). Sometimes restarting the interface (interface disable, then interface enable) solves the issue, but sometimes a reboot is required. All the other interfaces work as intended. I have tried the following without any positive effect:
  • Use another interface
  • Change the router
As far as I've read, it may be an OSPF problem (OSPF is running on the affected interface). I'll try to explain how the network is set up, starting with a (simplified) network diagram:

[Image: simplified network diagram]

We have a provider (transit-provider) with whom we do BGP on vlan1200. There are two EDGE routers that do VRRP on the provider interface (vlan1200) and OSPF towards all the core routers (rou-edge-02 has an increased cost on the interface, and the core routers also have an increased cost on the OSPF interface facing rou-edge-02). These two routers are only in the backbone area and redistribute the default route.
The core routers all have the same config:
  • OSPF on area0 to the edge routers or between them (rou-core-03<-->rou-core-04 and rou-core-05<-->rou-core-06); only rou-core-03 redistributes a static route (just one).
  • Every core router also has a different OSPF area (1,2,4,5) that is used to share routes with the rou-access routers. Every area has 2 to 10 routers; in the diagram each area is represented by a different rou-access color.
All the routers work fine except rou-core-01: this is the one that stops passing traffic on the backbone (area0) side of OSPF (the interface that connects to rou-edge-01 and rou-edge-02). Important things to take into consideration:
  • When the router stops passing traffic I can still connect to it using mac-telnet, but ping / ssh / OSPF stop working (so I assume all layer 3 stops working).
  • PPPoE clients and OSPF neighbours connected to the internal area (the purple one) continue to work as if there were no problem with the device.
  • A quick (and ugly) fix I applied while investigating the issue is a netwatch on rou-core-01 pointing at the IP of rou-edge-01: when the netwatch goes "down", I disable and enable the affected interface (spf-sfpplus2). 9 out of 10 times it works, and after 12-15 seconds everything is working "fine".
  • rou-core-02 is a (manual) backup of rou-core-01. On the backbone area it has different VLANs, IPs and OSPF interface cost, and it is running (so I can connect to the rou-core-02 loopback IP). On the areaX side it has the same config (various VLANs and PPPoE servers, no addresses declared) but disabled, so if anything happens to rou-core-01 I can enable all the disabled config on rou-core-02 and after some seconds all the PPPoE clients and routers start to reconnect to this router. This procedure has been tested and it works.
  • If I swap rou-core-01 (active to standby) and rou-core-02 (standby to active), the problem is reproduced as well.
  • The problem tends to happen at the start of working hours (08:00-09:00 in the morning / 15:00-16:00 in the afternoon); during nights and weekends it is very rare.
  • Also, I cannot force it to happen: it can work perfectly for 10 days, then fail, recover and fail 4 times over the next 2 days.
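The netwatch workaround described in the list above could look roughly like the sketch below. It assumes RouterOS v6 syntax. The host address 10.8.0.33 (some address inside the edge01 network 10.8.0.32/29), the interval and the 2-second delay are illustrative assumptions, not values from the real configuration. The interface name is written here in the RouterOS default style (sfp-sfpplus2) and must be adjusted to the actual uplink.

```
# Hypothetical sketch of the netwatch workaround (RouterOS v6 syntax).
# host= is an ASSUMED address for rou-edge-01; replace it with the real one.
# sfp-sfpplus2 is an ASSUMED interface name; use the real uplink name.
/tool netwatch
add host=10.8.0.33 interval=10s timeout=1s comment="edge01 watchdog" \
    down-script=":log warning \"edge01 unreachable, bouncing uplink\"; /interface disable sfp-sfpplus2; :delay 2s; /interface enable sfp-sfpplus2"
```

The extra log line only leaves a timestamp for each bounce, which can later be lined up against the logs of the other routers.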
I'll show the OSPF config of rou-edge-01, rou-core-01, and rou-core-02:
#rou-edge-01
/routing ospf instance
set [ find default=yes ] distribute-default=always-as-type-1 redistribute-other-ospf=as-type-1 router-id=10.0.0.0
/routing ospf interface
add network-type=broadcast passive=yes
add interface=vlan30 network-type=point-to-point
add cost=15 interface=40 network-type=point-to-point
add interface=vlan10 network-type=point-to-point
add interface=vlan20 network-type=point-to-point
/routing ospf network
add area=backbone network=10.8.0.32/29
add area=backbone network=10.0.0.0/32
add area=backbone network=10.8.0.48/29
add area=backbone network=10.8.0.16/29
add area=backbone network=10.8.0.0/29

#rou-core-01
/routing ospf area
add area-id=0.0.0.1 default-cost=1 inject-summary-lsas=no name=area1 type=stub
/routing ospf instance
set [ find default=yes ] router-id=10.0.0.6
/routing ospf area range
add area=area1 range=a.b.c.d/25
add area=area1 range=a.b.c.f/27
/routing ospf interface
add network-type=broadcast passive=yes
add interface=vlan30-edge01 network-type=point-to-point
add cost=20 interface=vlan31-edge02 network-type=point-to-point
add interface=vlan100-ptp-location1 network-type=point-to-point
add interface=ipip-location2 network-type=point-to-point
add cost=20 interface=location03-path01 network-type=point-to-point use-bfd=yes
add interface=location03-path02 network-type=point-to-point use-bfd=yes
/routing ospf network
add area=backbone comment=edge01 network=10.8.0.32/29
add area=backbone comment=edge02 network=10.8.0.40/29
add area=backbone comment=loopback network=10.0.0.6/32
add area=area1 network=a.b.j.0/23
add area=area1 network=10.8.10.136/29
add area=area1 network=10.8.10.168/29
add area=area1 network=10.8.200.0/30
add area=area1 network=10.8.200.4/30
add area=area1 network=a.b.k.0/24
add area=area1 network=j.k.f.0/22

#rou-core-03
/routing ospf area
add area-id=0.0.0.2 name=area2
/routing ospf instance
set [ find default=yes ] redistribute-static=as-type-1 router-id=10.0.0.2
/routing ospf area range
add area=area2 range=k.m.l.d/28
add area=area2 range=h.m.t.d/29
/routing ospf interface
add interface=gre-location5 network-type=point-to-point
add cost=20 interface=gre-location7 network-type=point-to-point
add network-type=broadcast passive=yes
add cost=15 interface=vlan11-edge02 network-type=point-to-point
add interface=vlan10-edge01 network-type=point-to-point
add interface=vlan80-core03 network-type=point-to-point
add interface=vlan350-access01 network-type=point-to-point
add interface=vlan351-access02 network-type=point-to-point
/routing ospf network
add area=backbone comment=location5 network=10.55.55.20/30
add area=backbone comment=location7 network=10.55.55.24/30
add area=backbone comment=loopback network=10.0.0.2/32
add area=backbone network=10.8.0.8/29
add area=backbone network=10.8.0.0/29
add area=backbone network=10.8.0.64/29
add area=area2 network=10.8.12.0/22
add area=area2 network=k.m.d.f/22
add area=area2 network=w.z.d.f/24
add area=area2 network=r.g.v.d/23
add area=area2 network=10.10.0.0/16
All public IPs have been replaced with placeholders such as a.b.f.g.

Any help or advice would be really appreciated.

Thank you!
 
floaty
Member
Posts: 321
Joined: Sat Oct 20, 2018 1:24 am
Location: 52°08'32.34"N 14°39'05.0"E

Re: L3 traffic stops passing on specific interface (OSPF related maybe)

Tue Apr 28, 2020 9:38 pm

Since the problem occurs so rarely, it's unlikely that you'll catch the trigger for the event with a "lucky punch".
Anyway, if you haven't yet, set up centralized syslogging and keep all your routers time-synced.
The rough timestamp of the event and the surrounding syslog events from all the other routers can give you a hint about what's going on (also for future use cases).
Make a dump of the routing and forwarding tables for both cases (working and failed), then check the differences.
If this is a routing failure (which seems highly likely), there should be an oddity in there that explains what's going on.
Check CPU and RAM usage while the router is "off" (should be possible over the still-working mac-telnet).
If you can afford it, switch off the redundant router for a while and check whether the incident still occurs.
In case you are distributing the default route, check that it's done entirely in a "dynamic way" ... you mentioned a static route redistribution ... (which is always a candidate for a loop).
Apart from that ... happy hunting !
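One possible way to take the dumps suggested above is sketched here. It assumes RouterOS v6 menu paths, and the file names are made up for illustration. The idea is to run the same commands once while everything works and once during a failure (over mac-telnet), then compare the resulting files.

```
# Sketch only: RouterOS v6 menu paths, file names are illustrative.
# Run once while the router is in the working state ...
/ip route print detail file=routes-ok
/routing ospf route print file=ospf-routes-ok
/routing ospf lsa print file=ospf-lsa-ok
/routing ospf neighbor print file=ospf-neighbors-ok

# ... and again during the failure, with different file names.
/ip route print detail file=routes-fail
/routing ospf route print file=ospf-routes-fail
/routing ospf lsa print file=ospf-lsa-fail
/routing ospf neighbor print file=ospf-neighbors-fail

# CPU and memory usage at the moment of the failure
/system resource print
```

The files end up in the router's file list and can be downloaded and compared with any text diff tool once connectivity is back.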
 
gerardtik
just joined
Topic Author
Posts: 6
Joined: Wed Nov 07, 2018 2:05 pm

Re: L3 traffic stops passing on specific interface (OSPF related maybe)

Tue Apr 28, 2020 10:35 pm

Hi
Thanks for the response. To go through what you said:
  • All routers use the same NTP server.
  • I have a logging server (Graylog) and I have enabled the OSPF topic without any luck. Do you recommend another topic to be enabled? Also, right after the failure I stop receiving logs (no L3 traffic), so I should install an SD card to write logs locally and check whether there is something weird; before the failure there is nothing weird in the logs.
  • How do I make these route and forwarding table dumps?
  • If I catch another failure I'll try to see if there is any RAM or CPU spike. All the routers have SNMP enabled, and the polling (every 5 min) does not show anything relevant.
  • If I switch to the redundant router the problem still occurs. What I think is also worth mentioning is that the redundant one does not fail until it becomes the "primary", so there must be something in the internal area that causes an error on the backbone area :/
  • How do I check whether it is done in a "dynamic way"? The only static route I redistribute (from router rou-core-03, which is working fine) is this network: "192.168.222.0/24", so I do not think it should be a problem. I can add this route to the edge routers and try to remove it from OSPF, but I doubt there is any problem with it.
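Since remote logging stops together with layer 3, one option is to log to local storage as well. The following is a sketch assuming RouterOS v6 logging syntax. The action name, file name and sizes are illustrative, and the topic list is only a suggestion of what might be worth capturing, not a known fix.

```
# Sketch only: names and sizes are illustrative assumptions.
# If an SD card is installed, prefix disk-file-name with its mount
# name as shown under /file (for example disk1/l3-debug).
/system logging action
add name=localdisk target=disk disk-file-name=l3-debug \
    disk-file-count=5 disk-lines-per-file=5000

# Send OSPF, interface and route events to the local file as well
/system logging
add topics=ospf action=localdisk
add topics=interface action=localdisk
add topics=route action=localdisk
```

Local files keep being written while the uplink is dead, so the minutes right after the failure are no longer a blind spot.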
This router has about 10-12 QinQ VLANs declared, and on each of these inner VLANs there is a PPPoE server. Could this have an effect somehow? Would it be preferable to put all those VLANs on a bridge (configuring the same horizon on every "vlan-port" to ensure some isolation) and then create a single PPPoE server on top of that bridge?

Thanks again.
 
floaty
Member
Posts: 321
Joined: Sat Oct 20, 2018 1:24 am
Location: 52°08'32.34"N 14°39'05.0"E

Re: L3 traffic stops passing on specific interface (OSPF related maybe)

Wed Apr 29, 2020 12:52 am

The main problem is that it's not possible to reproduce the issue with (yet) known tasks or tools ...
Graylog is exactly the tool I would use for such investigations ... did you notice interface ups/downs on any of your routers when you time-framed the occurrence of the incident ...
any route changes?
"recommend another topic"
There are passive interfaces in your OSPF config ... and changes should show up in the OSPF logs ... but it can't hurt to enable logging for changes on the hardware interfaces too.
... OSPF on a bridge interface ... an underlying bridge member skips ... I don't know ... hard to say what could be relevant.
You need to look at your self-built network as an outsider* would (easier said than done).
"How do I make these route and forwarding table dumps?"
You run these routers ... you should get that from the manuals (to be honest: I did my time on OSPF, but not with OSPF on ROS)
... but this is a forum ... the word spreads, maybe there is someone with a set of links and hints to do it quicker (... but sometimes, you should do it from scratch*)
There are differences between the routing table as OSPF represents it and the routing table installed in the kernel (that's not uncommon, but sometimes there are contradictions which shouldn't be in there ...
[like learning a local route that is installed with a foreign gateway from an OSPF source [[not sure if this is a good example : ) ]]).
I think the key for you is to reproduce the issue.
You can have a multi-k-bucks contract on CCO with your IOS-Iron ... if the problem isn't reproducible :( : good night and good luck !
Last edited by floaty on Wed Apr 29, 2020 1:07 am, edited 1 time in total.
 
floaty
Member
Posts: 321
Joined: Sat Oct 20, 2018 1:24 am
Location: 52°08'32.34"N 14°39'05.0"E

Re: L3 traffic stops passing on specific interface (OSPF related maybe)

Wed Apr 29, 2020 1:03 am

and maybe it's worth investigating the "router-isn't-usable-by-IP-anymore"* thing ...
on the local interface? no ping? no ARP? no MAC (! see ... you're doing mac-telnet) ... there is MAC access! what about ARP?
... so every little step ... like an outsider (... maybe like an intruder)
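The step-by-step checks suggested here could be run over mac-telnet while the router is in the failed state. This is a sketch assuming RouterOS v6 commands. The interface name vlan30-edge01 is taken from the posted rou-core-01 config, while the neighbour address 10.8.0.33 is an assumption (some address inside 10.8.0.32/29).

```
# Sketch only: 10.8.0.33 is an ASSUMED address for rou-edge-01.
# Is the edge router still in the ARP table, and is the entry complete?
/ip arp print where interface=vlan30-edge01

# Plain ICMP ping, then an ARP-level ping on the same interface
/ping 10.8.0.33 count=3
/ping 10.8.0.33 arp-ping=yes interface=vlan30-edge01 count=3

# Are any packets arriving on or leaving the interface at all?
/interface monitor-traffic vlan30-edge01 once
/tool sniffer quick interface=vlan30-edge01
```

If the ARP ping answers while the ICMP ping does not, layer 2 on the link is probably fine and the fault is more likely in IP processing on one of the two ends.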
I had things cooking in my own networks ... when I found out, I spent hours laughing about my short-sighted expertise.
 
gerardtik
just joined
Topic Author
Posts: 6
Joined: Wed Nov 07, 2018 2:05 pm

Re: L3 traffic stops passing on specific interface (OSPF related maybe)

Wed Apr 29, 2020 10:26 am

Hi,
"Main problem is, It's not possible to reproduce the issue with (yet) known tasks or tools ..."
Yes, that's my main concern: not being able to reproduce it.
"graylog is exactly the tool I would use for such investigations ... did you notice interface ups/downs on any of your routers when you time-framed the occurrence of the incident ... any route changes?"
Those interface ups and downs are what resolves the issue (not always), and they are done by me (or by the script).
"You need to see your self-build network as an outsider* (easier said than done)."
That's right, it is so difficult for me (also, English is not my mother tongue) to explain it as I understand it. I'll keep trying to improve.
"you run these routers ... you should get that from the manuals (to be honest: I did my time on OSPF, but not with OSPF on ROS) ... but this is a forum ... the word spreads, maybe there is one with a set of links and hints to do it quicker (... but sometimes, you should do it from scratch*)"
Yes, what I meant to ask was whether you meant dumping the OSPF LSA table and route table, but I get what you're saying.

I think my next move will be to wait until it happens again and then enable rou-core-02 as the main router, so I can investigate further what's happening on rou-core-01 (leaving it in this "failed" state until I get more information).

Thanks again.
