All,
For those of you who remember my post and subsequent thread about half a year ago regarding this problem where OSPF + PPPoE Server = CRASH (can still be found here: http://forum.mikrotik.com/t/bad-2-8-x-and-2-9-x-ospf-bug-was-ip-pool-bug/2564/1 ), I finally solved the mystery. I just sent MikroTik support the following e-mail. I figured for those who are still suffering from various OSPF-related ailments, you might find this informative since I suspect that a lot of the OSPF instability problems that people experience are related to this (Quagga bugs/issues).
The background to this message is that I decided to try my hand at setting up a Linux box that could speak OSPF (using Quagga) as well as act as a PPPoE server that does everything our current MikroTiks do (query RADIUS for authentication, set up queues for each PPPoE tunnel based on RADIUS parameters, inject Framed-Routes into the local routing table and have those get announced over OSPF, etc.) so that I could pit an “open” solution against the MikroTik RouterOS solution to see if the “open” solution had the same problems as MikroTik did. Having some “hands-on” experience with raw Quagga might shed some light on why RouterOS is so unstable when it comes to dynamic routing protocols such as OSPF, and at the very least, if my Linux solution worked, would also give us a platform to migrate TO if MikroTik never managed to come up with a solution.
I was successful in building my Linux-based OSPF+PPPoE server and having it mimic everything that our MikroTik PPPoE concentrators do, and in terms of performance and stability it beat RouterOS hands-down. After I managed to do this, I wanted to find out WHY it was working better than MikroTik RouterOS. Theoretically, since RouterOS itself is built on Linux and Quagga, there really should be no difference, unless the version of Quagga that MikroTik is using is so old (0.96.4 in 2.8.x all the way to .28) that it is simply full of bugs that have since been fixed.
(NB: I have not yet tried 2.9.11 or 2.9.11 + routing-test. It would be interesting to see how those respond to the load that I put on our RouterOS boxes during these tests.)
I sent the following e-mail to MikroTik Support today:
MikroTik Support,
Actually, believe it or not, I am VERY VERY HAPPY to report that I have finally solved the mystery regarding why MikroTik RouterOS OSPF crashes so much when coupled with the PPPoE server, and have been able to work around the issue by changing our configuration. Read on if you are interested in my solution.
As I informed you in my last e-mail, I have been playing around with Quagga a bit on my own outside of MikroTik RouterOS. While doing so, I discovered – since I tried to set up OSPF on Quagga as close to the way that we had it set up on the MikroTiks as possible – that because every single PPPoE tunnel that is brought up shares the same “local” IP, if there is a “network” line in the OSPF configuration that references the subnet that the PPPoE local IP is included in, then OSPF will try to establish neighbor adjacencies across those tunnels. This in itself is not a bug; this is the way OSPF was designed to work. I was just surprised to see that for every PPP interface that was coming up, Quagga OSPF was joining that interface to the OSPF multicast group; I was thinking about it from the other direction, assuming that my OSPF “network” statement had to include the list of IPs I was assigning to the remote end of the PPPoE tunnels.
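To illustrate the behavior described above, here is a minimal Python sketch of the matching rule as observed: an interface gets OSPF enabled when its local address falls inside any "network" statement. This is a simplified model, not Quagga's actual code (the real matching logic in ospfd is more involved and may differ between versions). The function name, interface names, and all addresses are hypothetical and chosen only for illustration.

```python
import ipaddress

def ospf_enabled_interfaces(interfaces, network_statements):
    """Return the names of interfaces whose local address falls inside
    any OSPF "network" statement (simplified model, not Quagga code)."""
    nets = [ipaddress.ip_network(n) for n in network_statements]
    enabled = []
    for name, local_ip in interfaces:
        addr = ipaddress.ip_address(local_ip)
        if any(addr in net for net in nets):
            enabled.append(name)
    return enabled

# Hypothetical setup: a dummy loopback /32 reused as the PPPoE local IP.
LOOPBACK = "10.255.0.1"
networks = ["192.0.2.0/24", "10.255.0.1/32"]

interfaces = [("lo-dummy", LOOPBACK), ("ether1", "192.0.2.1")]
interfaces += [(f"ppp{i}", LOOPBACK) for i in range(500)]

# Every PPP interface shares the loopback address, so all of them match.
shared = ospf_enabled_interfaces(interfaces, networks)

# Same tunnels, but with a local IP outside every "network" statement.
separate = [("lo-dummy", LOOPBACK), ("ether1", "192.0.2.1")]
separate += [(f"ppp{i}", "10.254.0.1") for i in range(500)]
isolated = ospf_enabled_interfaces(separate, networks)

print(len(shared), len(isolated))
```

In this model the only thing that changes between the two runs is the PPPoE local address, yet the number of interfaces OSPF tries to run on drops from every tunnel plus two down to just the two intended interfaces.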
Now, on all of our MikroTiks, we have a /32 assigned to a dummy “loopback” interface that gets announced out via OSPF. However, we have discovered that for some reason, OSPF on the MikroTiks will not announce it (even as an external connected route) unless we tell it that it is an “internal” network by specifying a “network” line for it. (I can get announcement of these /32s without a “network” line for them to work with Quagga no problem, but for some reason it didn’t work on most of our RouterOS boxes.) Since the “loopback” interface is just a dummy interface, OSPF traffic over it doesn’t actually go anywhere, but the OSPF software will try to run OSPF over the loopback interface nonetheless (shouldn’t hurt anything). Since we wanted that loopback host route to propagate out to the rest of the OSPF network, we added “network” lines for the loopback addresses on each OSPF core MikroTik.
The problem is that we apparently also decided at one point that when setting up the PPPoE server function on MikroTik RouterOS, we should standardize on setting the “local” IP for PPPoE tunnels to that same loopback /32 address. What we didn’t think about was that this means the OSPF software would try to speak OSPF over every single customer PPPoE interface. Now, this can’t really hurt anything since we have authentication set up (nobody should be able to inject routes into our network that they shouldn’t), but there is no point to trying to run OSPF over a customer connection, not to mention that that’s a LOT of interfaces that OSPF is going to try to run on.
In fact, it is this sheer number of OSPF-enabled interfaces that caused so many problems for us and gave us so many headaches. While speaking with a couple of the Quagga developers a bit on the Quagga-Users mailing list, it was brought to my attention that the version of Quagga that MikroTik has been using in their software has a very poorly-implemented communications mechanism between the core router daemon (zebra) and the OSPF daemon (or other daemons as well) which can “deadlock” under certain circumstances (they’ll stop talking to each other, probably because they’re both waiting on the other for something). This architecture isn’t really Quagga’s fault; they inherited it from the old GNU Zebra project which they based their software on when development on Zebra by the original developers sort of came to a halt. One of the circumstances that will cause the deadlock is…[drumroll]…trying to do OSPF over too many interfaces (100 is considered “a lot”).
The latest version of Quagga does not have this problem since the inter-daemon communication code has been completely rewritten using non-blocking I/O and a new queueing system. But it’s still considered “beta” (this rewrite is part of the current development tree, 0.99.x), so I’m guessing that it will be a while before we see this incorporated into RouterOS. (There are also a couple of new regressions in the 0.99.x code that I’m working with the Quagga developers on, and these bugs would cause me to recommend that you hold off from switching to the new codebase until these problems are resolved.)
Anyway, as I was saying, it turns out that every MikroTik that we have out in the field doing both OSPF and PPPoE has a /32 loopback IP that was also being re-used as the local IP for PPPoE tunnels, and also has an instruction to speak OSPF on that network (a “network” line for that /32). Once I realized this after observing “real Quagga” in action, I wondered if perhaps the sheer number of PPP interfaces it was trying to speak OSPF over might have something to do with why the 0.96.x and 0.98.x versions of Quagga kept locking up. I decided to test out this theory by going back to the old version of Quagga on my test box (which I managed to crash in a way similar to what we were seeing on the MikroTiks, and which was fixed by the latest version of Quagga/the new code), changed the PPP local IP to something OTHER than the loopback address, and then threw 500 PPPoE connections at the box all at one time (NO delay between tunnel connection attempts). It seemed to handle the load almost effortlessly, and didn’t crash this time! I then proceeded to simply take the “network” line out of ospfd.conf for the loopback /32 and then tried to use that IP again for the local PPPoE address, and again had no problems (whereas before this I could get it to crash 2 out of 3 times).
After this, I tried setting the PPPoE local IP to something other than the loopback address on both a 2.9.x MikroTik box and then on a 2.8.x box, and both boxes stayed up after connecting and disconnecting hundreds of PPPoE clients on those MikroTiks. Hooray!
These RouterOS boxes seem to be operating stably now after implementing this fix.
In conclusion, the problem was partly a bug in Quagga (inter-daemon blocking I/O communication, complete with deadlocks triggered by conditions such as too many interfaces running OSPF) as well as partly in how we had it set up. It is disheartening to see that MikroTik support was never able to catch this even though they had access to our running configuration through all the SUPOUTs I have sent to you over the last few months, though in your defense I will admit that it was a rather obscure issue. After learning how the Quagga software works and setting up a testbed of my own, I was finally able to solve the mystery.
In order to inform everyone else that this problem has now been solved, and in order to help other people who might be suffering from similar OSPF problems, I will post this information to the MikroTik forum. You might also want to inform people yourself from this point on that it is not a good idea to use an IP that is included in an OSPF network subnet as the local PPPoE IP address on PPPoE concentrators; in fact, not only is it not a good idea, but it is a recipe for disaster.
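The recommendation above can be turned into a quick sanity check before deploying a concentrator. The following is a small Python sketch, assuming you can list your OSPF "network" statements as CIDR strings; the function name and the example addresses are hypothetical and not part of RouterOS or Quagga.

```python
import ipaddress

def conflicting_networks(pppoe_local_ip, network_statements):
    """Return the OSPF "network" statements that contain the PPPoE local IP.
    An empty list means the local IP will not pull PPP interfaces into OSPF
    (under the simplified matching model described in this post)."""
    addr = ipaddress.ip_address(pppoe_local_ip)
    return [n for n in network_statements
            if addr in ipaddress.ip_network(n)]

# Hypothetical configuration values.
networks = ["192.0.2.0/24", "10.255.0.1/32"]
bad = conflicting_networks("10.255.0.1", networks)   # reuses the loopback /32
good = conflicting_networks("10.254.0.1", networks)  # dedicated local IP
print(bad, good)
```

A non-empty result flags the exact "network" lines that would need to be removed, or tells you to pick a different PPPoE local address.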
I still do not consider OSPF on RouterOS to be “stable” or “fixed.” There are a few circumstances in which we have needed to run OSPF over a fair number of interfaces (25+), and were prevented from doing so because of the deadlocking problems in Quagga 0.98.x and lower. RouterOS continued to crash under these circumstances when we tried to use OSPF in that way. On our PPPoE concentrators, we don’t have that many interfaces that we are actually speaking OSPF on, so we don’t run into the problem now that we’ve changed their configuration NOT to try to speak OSPF over every PPP interface that is created.
– Nathan