My apologies in advance. This is a long post, because it has taken me a long time to work out just where the problem lies, and I don't have many thoughts on how to solve it. I am trying to cover any potential questions in advance.
We have had trouble with OSPF for a while, but it was hard to pin down OSPF as the cause. Let me lead you through why I am now certain that we are having OSPF issues, and then hopefully someone can tell me what to look at to find out why.
We have an assortment of devices that include everything from licensed band links down to a CPE-to-CPE hop that gets around a hill for a single customer. All of this is routed through MikroTik routers via Ethernet.
Private net IPs are used to manage the radios. Public net is assigned to the clients.
The public IPs enter the system through a pair of Cisco 3825s that are at fiber POPs where we are required to use carrier-class equipment. These use BGP to route to the Internet, and OSPF to distribute the default route to the internal network.
Public IPs are used on the links between sites so that traceroutes work from both inside and outside the network.
OSPF settings seem to be pretty normal. The link subnets are added to the backbone area of the routers they are on. Interfaces are added to allow cost control of lower quality links (lower quality = higher cost = backup link). Only the local router and one instance of each neighbor show up in the neighbors list. One router has a PPTP link in the mix to allow pinging of private net IPs on a remote network from our monitoring server. All the others are plain Ethernet links, so there are no other PPP-type links that could be causing OSPF issues. Connected and static routes are redistributed as type 1, but the default route and other routing protocols are not redistributed.
Loops of three to four sites are used so that if any one site fails, there is always another path for the other sites. In other words, we have done our best to eliminate stub sites, and of 22 AP sites, only 2 are stubs, and 1 is completely remote, i.e. not on our main network.
At times, the inevitable happens, and a site fails. OSPF has not always automatically rerouted things. More than once we have been driving out to a site to see what was wrong and why it was no longer accessible, only for it to come back up on its own. So instead of OSPF rerouting within the 40 second dead time in the settings, it has taken an hour to change over routes.
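To make the expectation concrete, here is a minimal Python sketch of the timer behavior I would expect. It is only an illustration and not how RouterOS implements it: the 40 second value is the dead time from our settings, and the function names are made up for the example.

```python
DEAD_INTERVAL = 40  # seconds; the dead time from the settings described above


def neighbor_expired(last_hello_at, now, dead_interval=DEAD_INTERVAL):
    """Return True once no hello has been heard for a full dead interval."""
    return (now - last_hello_at) >= dead_interval


def failover_deadline(last_hello_at, dead_interval=DEAD_INTERVAL):
    """Latest time (in seconds) by which the neighbor should be declared down."""
    return last_hello_at + dead_interval
```

By this logic, a neighbor whose last hello arrived at t = 100 seconds should be dropped by t = 140 seconds, not an hour later.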
Sometimes a reboot was needed to get OSPF working, and since the site was not routing, we could not access the router or the remote reboot switch to do this.
That is the first thing that made me think there was a problem with OSPF. It was working very slowly, or only after a reboot. But other events led me to realize it is not working the way it should.
One of the stub sites used to have a switch. All the routing was done off another site. There was just one AP there, so the expense of a router was saved.
Due to customer growth, we eventually upgraded that site, putting in two more APs to sector the load. At that time we put in a small local router. The router was preloaded with all the IPs and other settings, and should have been a drop-in replacement. The IPs were turned off at the remote router, the new one was put in, and nothing happened.
We found out that the OSPF neighbor relationship was not being established. They did not even show up as being in the init state, so we statically routed at the remote router to get the customers back online. It was late by then, so looking into other settings was postponed to the next day.
That next day, with no settings changes, OSPF was working, routes were propagated, and the OSPF neighbor relationship was full.
Again, this was OSPF working far slower than it should.
Now, we are moving off one site that is in the center of a city, and saturated with interference, to a hill that is about 15 miles away, and free of the problems. When the site was first hooked up, it was connected to the far side of the other stub. I could telnet from site to site, but OSPF did not propagate the routes to elsewhere.
Since the site was still being built out, I left things like that, and did not worry about it.
The next day, another backhaul was installed and aligned. I was able to route to the new site through the stub at that point, so another long-delayed OSPF update had finally happened.
The OSPF had already been set up with the network for the remaining links, and both sites saw each other, and went to init state.
I noticed everything was still routing through the stub, which would remain a stub once the noisy site was decommissioned, so I increased the link cost on both sides of the connection. This did not change the routing.
Suspecting the long delays, I decided to leave things alone and see what happened. Seven hours later, when I left for the day, the link was still routed through the stub. The next morning, it was not routing at all.
The higher link cost caused the new site to want to route through the other link. However, the router on the other side of that other link was still in init state, and never reached full OSPF routing. This caused the new site to become unrouted.
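For anyone less familiar with the neighbor states: per the OSPF specification (RFC 2328), a neighbor sits in Init when its hellos are being received but those hellos do not yet list our own router ID, so communication is only confirmed in one direction. A minimal Python sketch of that check follows. The function name and the router IDs in the usage note are hypothetical, and the later states (ExStart through Full) are deliberately left out.

```python
def hello_state(heard_hello, my_router_id, neighbors_in_hello):
    """Classify a neighbor using only what its hello packets tell us.

    heard_hello: True if a hello arrived from the neighbor within the dead time.
    my_router_id: our own router ID.
    neighbors_in_hello: router IDs the neighbor listed in its latest hello.
    """
    if not heard_hello:
        return "Down"
    if my_router_id in neighbors_in_hello:
        # The neighbor has heard us too, so communication is bidirectional.
        return "2-Way"
    # We hear the neighbor, but it does not list us: one-way communication.
    return "Init"
```

For example, `hello_state(True, "10.0.0.1", ["10.0.0.7"])` returns `"Init"`, which is the situation a router stuck in init state is in: it hears the other side, but the other side is not acknowledging its hellos.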
I rebooted both the new site and the one that was in init state. They negotiated OSPF perfectly and routed fine.
This reminded me that I have had to reboot routers before to get OSPF working on them, and confirmed that something is wrong with OSPF here.
As for why I am still trying to get OSPF working properly, it is simple. It is a link state protocol that allows me to limit the use of lower quality links by increasing their cost. It also has less network overhead than, say, RIP. Lastly, only a few of the MikroTiks we have are BGP-capable, leaving OSPF as the only viable option.
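To illustrate what I mean by cost control, here is a small Python sketch of a shortest path calculation (Dijkstra, the algorithm OSPF routers run over the link state database). The site names and cost values are hypothetical and the topology is far simpler than ours. It is meant only to show the expected behavior, not what the routers are actually doing.

```python
import heapq


def shortest_path(links, source, target):
    """Dijkstra over an undirected graph given as {(site_a, site_b): cost}."""
    graph = {}
    for (a, b), cost in links.items():
        graph.setdefault(a, []).append((b, cost))
        graph.setdefault(b, []).append((a, cost))

    queue = [(0, source, [source])]  # (total cost, current site, path so far)
    visited = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == target:
            return cost, path
        if node in visited:
            continue
        visited.add(node)
        for nxt, link_cost in graph.get(node, []):
            if nxt not in visited:
                heapq.heappush(queue, (cost + link_cost, nxt, path + [nxt]))
    return None  # target unreachable


# Hypothetical topology: two paths from "core" to "new".
links = {
    ("core", "stub"): 10,
    ("stub", "new"): 10,
    ("core", "other"): 10,
    ("other", "new"): 20,
}
```

With these costs the path to "new" goes through "stub" at a total cost of 20. Raising the ("stub", "new") cost to 50 should move traffic to the path through "other" at a total cost of 30, and the stub path (now cost 60) should only be used if the other link disappears. That is the behavior I expect from raising a link cost.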
So my main question is what would cause OSPF to hang up so badly, or be so slow?
Let's say that each site has six ports with three subnets per port. The distribution is actually somewhat different, but the resulting number, 396 subnets, is slightly larger than the actual number of routes in the network.
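The 396 figure is just a multiplication, assuming the 22 AP sites mentioned earlier as the site count. In Python, with variable names chosen only for this example:

```python
sites = 22            # AP sites mentioned earlier in this post
ports_per_site = 6    # simplifying assumption, not the real distribution
subnets_per_port = 3  # simplifying assumption, not the real distribution

estimated_subnets = sites * ports_per_site * subnets_per_port
print(estimated_subnets)  # 396
```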
396 subnets does not seem that large a number, and should not use much in the way of resources. In fact, our routers use between 16 and 32 MB of RAM.
None of the routers seem overloaded. Smaller sites, which use older, three port MikroTiks, use about 16 of their 32 MB of RAM, and float between 25% and 50% CPU utilization. Larger sites have 256 to 512 MB of RAM, and uniformly use about 32 MB of it. The CPUs of larger sites range from 2% to 15% utilization, depending on their speed.
If a site is rebooted, it gets all the OSPF data correctly.
The only time we had a problem that required more than a reboot was when we switched from partial BGP routes to full BGP routes at the border routers. One of them was feeding BGP into OSPF, and it overloaded all the MikroTiks. Once the setting was changed, more than one reboot was needed at some sites to stop the excess routes from propagating, because remote sites had become the OSPF primary routers and were feeding the overloaded routes back into the ones we had already rebooted.
Is there any way other than a reboot to clear the OSPF table in a MikroTik?
Is there anything else I can test to get me more clues?
About the only thing I can think of to do at this point is turn on the OSPF logging at this new site, and the other site it will link to, before we try that last connection. That way I may get a little more information if things happen again.
I don't know the source of the problem. I don't even know if hooking up another link will cause it to happen again or not. But since it has happened that way in the past, it is the only thing I can log, and hope to see results.
Unfortunately, OSPF puts out so much information that logging it all the time adds up very quickly.
Any thoughts or suggestions of what to look into would be greatly appreciated.
And if you got this far in the message, thanks a lot for taking the time to read all this.
Hostmaster for Cybertime