OSPF state problems

jkroon · Fri Sep 15, 2017 12:25 pm

OSPF issues have plagued recent versions of RouterOS. Whether it's the state getting corrupted or, as at present, one side believing the adjacency is in Exchange while the other believes it is in Full, it's been a difficult journey over the last year. The current status is that, at least once a day, one of our 40 or so OSPF-enabled Mikrotik routers ends up in a state where it believes the adjacency is Full while its peer reports Exchange. Neither side installs any routes from the other. After some playing around we realized that rebooting the MT on the side claiming Full solves the problem, and further investigation showed that disabling and re-enabling the OSPF instance on that side is enough.
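For reference, the manual workaround described above amounts to something like the following on the router that claims Full. This is only a sketch for RouterOS 6; the instance name "default" is an assumption, so substitute the name of your own instance.

```
# Show the neighbor states as seen from this router.
/routing ospf neighbor print
# Bounce the OSPF instance ("default" is an assumed instance name).
/routing ospf instance set [find name="default"] disabled=yes
/routing ospf instance set [find name="default"] disabled=no
```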

In our setup we always set up a loopback IP address for our routers, and we use that IP as the router-id in OSPF, and that IP gets distributed using OSPF (redistribute connected as type 1). We cooked up the below script to detect this situation and restart OSPF (we don't have the time, or adequate skill, to dig into the RouterOS code to find and repair the bug currently).
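A loopback setup of the kind described above might look roughly like this on RouterOS 6. The bridge name "loopback", the address 10.255.0.1 and the instance name "default" are placeholders chosen for illustration, not values from our network.

```
# Loopback as an empty bridge; name and address are placeholders.
/interface bridge add name=loopback
/ip address add address=10.255.0.1/32 interface=loopback
# Use the loopback IP as router-id and redistribute connected routes as type 1.
/routing ospf instance set [find name="default"] router-id=10.255.0.1 redistribute-connected=as-type-1
```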

On the routers we've tested this on so far, it has worked. There is one case we've not bumped into yet: a multi-path configuration where the peer's loopback is still reachable via another path. In that case we'll probably also have to check that the gateway ($nexthop inside the do={} block of the /ip route check command) matches the address of the neighbor, i.e. we only consider the adjacency up if the peer's router-id is reachable via a nexthop matching that neighbor.
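The extra nexthop check described above could be sketched as follows. This is untested: it assumes that $nexthop as reported by /ip route check compares equal to the neighbor's address property, and the variable name neighboraddr is made up for the example. It only prints a verdict and does not restart anything.

```
:foreach n in=[/routing ospf neighbor find where state="Full"] do={
        :local loopback [/routing ospf neighbor get $n router-id]
        # Interface address of the neighbor, to compare against the route's nexthop.
        :local neighboraddr [/routing ospf neighbor get $n address]

        /ip route check $loopback once do={
                # Up only if the router-id resolves AND the path goes via this neighbor.
                :if (($status != "failed") && ($nexthop = $neighboraddr)) do={
                        :put "$loopback is reachable via $neighboraddr."
                } else={
                        :put "$loopback is not reachable via $neighboraddr."
                }
        }
}
```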

Store the following in a file called check_ospf.rsc and scp it to the routerboard.

# Script to check if OSPF is functioning.  It relies on the fact that each
# adjacent neighbor will be reachable via its router-id, and that if we're
# unable to exchange routes for whatever reason, the router-id will be
# unreachable.  By default a router that comes up will not permit OSPF to be
# restarted until OSPF has come up at least once (the mayreboot flag).  In the
# cases we've seen, the failing side states neighbor state Full, but the
# working side says Exchange.  So this check should prevent restarting OSPF on
# core routers, but alas, I'm not 100% confident of that.
#
# The logic here restarts OSPF if we're in a bad state with at least one peer.
# Possibly this should be changed to only restart if no peers are in a good
# state and at least one is in a bad state (more complex though).
# Restarting gets enabled once at least one peer has managed to come up
# successfully.
#
# In case of multiple OSPF instances, if any one of them is functioning we move
# towards the mayreboot state, but we will only restart non-functioning
# instances.
:if ([/file find name=ospfstatus.txt] = "") do={
        :put "ospf status file doesn't exist - creating."
        /file print file=ospfstatus
        /file set [/file find name="ospfstatus.txt"] contents="no"
        :put "Done, please re-run the script."
        # Continuing with the rest of the script is pointless as our view of /file is
        # a snapshot in spite of /file set ... which above sets it to some arbitrary
        # content (looks like a file list).
} else={
        :local mayreboot [/file get ospfstatus.txt contents]

        # After initial set the file content is completely bogus ...
        :if ($mayreboot != "yes" && $mayreboot != "no") do={
                :set mayreboot "no"
        }

        :put "Checking OSPF (mayreboot=$mayreboot) ..."

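        :foreach n in=[/routing ospf neighbor find where state="Full"] do={
                :local loopback [/routing ospf neighbor get $n router-id]
                # Neighbor's interface address, only used for the status message below.
                :local remoteaddress [/routing ospf neighbor get $n address]

                :put "Remote OSPF $loopback @ $remoteaddress"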
                /ip route check $loopback once do={
                        :if ($status != "failed") do={
                                :put "OSPF is functioning correctly."
                                :if ($mayreboot != "yes") do={
                                        :log info "OSPF restored - restarts re-enabled."
                                        /file set ospfstatus.txt contents="yes"
                                }
                        } else={
                                :put "OSPF is not functioning correctly."
                                :if ($mayreboot = "yes") do={
                                        :local instancename [/routing ospf neighbor get $n instance]

                                        /file set ospfstatus.txt contents="no"
                                        :put "Restarting OSPF"
                                        :log error "OSPF malfunctioned - restarting."
                                        /routing ospf instance set [/routing ospf instance find name=$instancename] disabled=yes
                                        /routing ospf instance set [/routing ospf instance find name=$instancename] disabled=no
                                }
                        }
                }
        }
}

You can enable this to run every 30 seconds with:

/system scheduler add interval=30s name=check_ospf on-event="/import check_ospf.rsc"

Obviously you can embed the entire thing straight into the scheduler too if you prefer, we just found having rsc files makes it easier to bulk-update over an entire network.
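Once the scheduler entry is in place, something like the following can be used to confirm that it is running and to see whether it has ever restarted OSPF. This is just a sketch; it relies on both log messages written by the script containing the string "OSPF".

```
# Confirm the scheduler entry exists and see its run count.
/system scheduler print detail where name="check_ospf"
# Show log lines written by the script.
/log print where message~"OSPF"
```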

changeip · Thu Oct 12, 2017 3:05 am

OSPF issues have plagued recent versions of RouterOS.

Do you know in which version you started seeing this problem? Any reason why you don't roll back? I have been using 6.38.7 with no known OSPF issues...

Thanks for the info, I will be cautious on upgrades now...

Sam

mducharme · Thu Jul 19, 2018 7:15 am

OSPF issues have plagued recent versions of RouterOS. Whether it's that the state corrupts, or as per current state where one side believes we're in Exchange and the other side believes we're in Full, it's been a difficult journey the last year.

We have just run into this with the following topology (OSPFv3):

[ core router with the following: bridge bridge-remote (1 port), port is eoip tunnel to-far-router ] <---------------------------> Internet <-------------------> [ Eoip tunnel to-core | far router ]

We have an OSPFv3 adjacency between interface bridge-remote on the core router and the eoip tunnel to-core on the far router. If we disable the eoip tunnel to-far-router for 60 seconds on the core router, the adjacency gets stuck in the corrupt state that you mention: the core router gets stuck in Exchange and the far side in Full.

However, very strangely, this only happens if the far router's router ID is higher than the core router's. If the far router has the lower router ID, you can disable and re-enable the "to-far-router" eoip tunnel on the core router all you want and it will recover correctly. For some reason, recovery seems to depend on which of the two routers has the higher router ID. I have a ticket open with MikroTik on this.
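The reproduction steps above can be sketched as a short RouterOS 6 script run on the core router, from a session that does not itself depend on the tunnel. The 30 second settle time before checking is an arbitrary assumption; only the 60 second outage comes from the description above.

```
# Take the tunnel down for 60 seconds, then bring it back.
/interface eoip disable [find name="to-far-router"]
:delay 60s
/interface eoip enable [find name="to-far-router"]
# Give OSPFv3 some time to re-form the adjacency (assumed value).
:delay 30s
# In the failure case the neighbor here stays in Exchange instead of reaching Full.
/routing ospf-v3 neighbor print
```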

Poundbury · Wed Dec 09, 2020 6:44 pm

Sorry to resurrect an ancient thread, but did Mikrotik ever acknowledge your router id assertion?
I think I may have an example of the same issue on recent (6.47) ROS.

Mike

jkroon · Mon Jun 06, 2022 10:34 am

Hi,

Not sure if this is still relevant for you (or if you'll even see this reply), but no, they did not. Working through hardware vendors also seems fruitless. And I've just posted another thread about an OSPF issue (potentially related to this one, now that I think about it). I honestly don't know how to approach Mikrotik's RouterOS other than with a dustbin any more, and the value-for-money proposition is quickly becoming less and less attractive.

Still need to scope out RouterOS 7 though, but I'm very, very sceptical and definitely don't want to jump to 7 directly on our CCRs on which our network pivots.

Kind Regards,
Jaco
