Hey, everybody,
For quite a while now, we’ve been having problems with random crashes of MikroTik routers which are doing PPPoE termination for us. We’ve tried all sorts of troubleshooting to try and narrow down the cause of the problem, but up until recently we were not able to discern any clear patterns.
However, it now appears that it has something to do with the IP Pool feature of RouterOS. This also seems to be reproducible across all versions of 2.8.x (we’ve upgraded and downgraded several routers with no change).
We find that, with the following setup, we can recreate the crash nearly 100% of the time. Could someone else out there (preferably a MikroTik employee) try these steps to see if you get the same result?
By the way, we’re using 1U rackmount Supermicro machines (2.8GHz P4/Celeron, 1GB RAM, RouterOS loaded on IDE flash module).
- Configure a RouterOS machine (we’ll call this router #1) on one end to have an IP pool with a range of, say, 192.168.1.0-192.168.1.255.
- Create a PPPoE Profile on router #1 and set the Remote Address to the pool you created in step 1.
- Create and start a PPPoE Server instance on one of the ethernet interfaces on router #1 using the PPPoE profile you created in step 2 as the Default Profile for this server.
- Create a single PPP Secret on router #1 with any username and password of your choice. Set its profile to the one you created in step 2 and the Service type to ‘pppoe’.
- On a second RouterOS machine (router #2) which you should wire up to router #1, create several hundred PPPoE client interfaces (just make one and copy it several times; easiest with a script), all with the username and password you came up with for step 4, but leave them all disabled for now. You may set the Service name to match router #1’s PPPoE service name if you need/desire to.
/interface pppoe-client
add interface=ether1 user=username password=password add-default-route=no disabled=yes name=0
print
:for x from=1 to=511 do={add disabled=yes copy-from=0 name=($x)}
- Now – again, best done with a script – enable each PPPoE client on router #2 one-by-one with a second or two of delay in between each. Watch to make sure the tunnels are coming up on router #1, and watch the system resources on router #1 carefully during this.
:for x from=0 to=511 do={:delay 1; enable $x}
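For anyone who wants to set up router #1 from the console rather than Winbox, steps 1 through 4 might look roughly like the sketch below. The names pppoe-pool, pppoe-test and test-service, the interface ether1, and the Local Address of 10.0.0.1 are placeholders I made up for illustration (the steps above don't specify them), and exact parameter names may vary between RouterOS versions.

```
# Step 1: IP pool covering the range from the description
/ip pool add name=pppoe-pool ranges=192.168.1.0-192.168.1.255

# Step 2: PPPoE profile whose Remote Address points at the pool
# (local-address is an assumed value, not something given above)
/ppp profile add name=pppoe-test local-address=10.0.0.1 remote-address=pppoe-pool

# Step 3: PPPoE server on an ethernet interface, using that profile as the default
/interface pppoe-server server add service-name=test-service interface=ether1 default-profile=pppoe-test disabled=no

# Step 4: a single secret shared by every test client
/ppp secret add name=username password=password service=pppoe profile=pppoe-test
```

While the clients on router #2 are coming up, running /system resource print on router #1 every few seconds is one way to watch for the CPU and memory spike.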
Results: The tunnels establish successfully for a while, and IPs from the pool are handed out to the incoming tunnels sequentially, as expected. However, once we reach somewhere around 120 tunnels (at least on our test machines), router #1’s resource usage spikes dramatically: CPU load goes to 100% and nearly all of the 1GB of physical RAM gets used up. What happens next seems left up to chance: you might see things normalize and continue after a minute or so; you might see resource usage drop back to normal but router #1 stop accepting new PPPoE tunnels; you might see CPU load remain at 100%; you might see PPPoE tunnels that were established before the spike stop working completely even though they are still listed in the Interfaces list; or you might see router #1 drop off the network and never return (in this last case, the console on router #1 might show a kernel panic, an unresponsive login prompt, or one of several other possibilities).
After you have confirmed the above behavior, try this:
- Go to router #2 and disable all of the PPPoE interfaces.
/interface pppoe-client
disable [find]
- Go back to router #1, reboot it, then go to the properties of the PPPoE Profile you created earlier and set its Remote Address to a static value (like 192.168.1.1 or something) rather than pointing it to an IP Pool.
- Go back to router #2, and re-enable the PPPoE client interfaces again one-by-one, like last time.
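From the console, the profile change on router #1 might look something like this. It assumes the profile is named pppoe-test, which is a placeholder name of mine and not anything RouterOS creates for you; substitute whatever you called yours.

```
# On router #1, after the reboot: hand every tunnel the same static remote address
# (pppoe-test is a hypothetical profile name)
/ppp profile set pppoe-test remote-address=192.168.1.1
```

Setting remote-address back to the pool's name afterwards should restore the original failing setup for comparison.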
Results: All 512 PPPoE tunnels come up successfully on router #1 (albeit they all have the same IP assigned to them) with no resource spikes and no crashes.
This would seem to indicate that something in the IP Pool code is leaking memory or getting stuck in a loop. The kernel then probably starts killing off processes once things spiral out of control. That may or may not succeed, which is probably why we see such a diverse array of unpredictable symptoms at the end.
If it’s an IP Pool problem, I suppose it is possible that DHCP could also be affected by this, though I’ve never tried to reproduce this issue using DHCP in place of PPPoE.
– Nathan