RB2011UAS-2HnD stops responding spontaneously

Hi,

I have been suffering from a quite annoying problem with my RB2011UAS-2HnD (ROS 6.10) for the last few weeks. It happened for the first time on the same day I upgraded from ROS 6.9 to 6.10, so I’m not sure whether it is related to the ROS version, the RB, or something else. I didn’t change my configuration at that time (and it had been working almost untouched for over a year).

Anyway, my problem is that my RB occasionally stops responding to any network traffic (including pings to the localhost address from the device itself). This happens maybe once per week, and a reboot resolves the issue until the next time.

When the device is jammed I can connect to it via serial and it seems to be working as expected, but a ping to even its own localhost address only responds with “132 (No buffer space…”.

Because the device is in an unmanned location, I set up the watchdog timer to watch address 127.0.0.1, and it works as a kind of workaround: it reboots the device and soon everything is working again.
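For reference, the watchdog setup I mean is roughly this (a minimal sketch; check the exact parameter names on your ROS version):

# Reboot automatically if pings to 127.0.0.1 stop getting answered
/system watchdog set watch-address=127.0.0.1 watchdog-timer=yes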

Am I the only one suffering from this? I found some quite old threads with the same symptoms but no actual resolutions.

I have a RB2011UiAS-RM that seems to lock up regularly, with 6.10 and an older version before that. Without the watchdog, it ends up not routing or allowing login and the touch-screen becomes unresponsive, but the ethernet ports still flash. With the watchdog it duly resets. This seems to happen at least once a day, sometimes more often. Monitoring health shows OK temperatures and voltage, monitoring the OS shows free memory and a partly idle CPU, and the logs show nothing particularly dubious. I’m currently experimenting with excluding functionality (I was using Traffic Flow and Web Proxy) to see if I can make it more stable, as sketched below. I’m not set up for serial right this minute, but that’s a good idea.
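In case it helps anyone running the same elimination test, those two features can be switched off roughly like this (a sketch; adapt it to your own setup):

# Temporarily disable Traffic Flow and the Web Proxy while testing stability
/ip traffic-flow set enabled=no
/ip proxy set enabled=no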

Any other suggestions appreciated.

Hello,

I have experienced the same issue as Fraction described. The unit was in a remote location and was rebooted by a helper using the LCD display and PIN number. Another time I believe the same thing happened on one of our RB450G units. All systems run the latest firmware and v6.10.

Edit: The first unit was RB2011UAS-RM.

JF.

This problem with “No buffer space available” and stalled IPv4 communication is not related only to ROS 6.10. I see it on some of my routers (RB800, RB450G) across the whole ROS 6 line. If you look at “/ip route cache print” you will see that the cache is full, so IPv4 communication stops (IPv6 keeps working). In my configuration this problem is related to the SSTP server; if it is disabled, I do not have this problem. The cache fills up completely after a few days, but since updating to ROS 6.10 it is full after 12~24 hours. There is a firewall configured that limits connections to the SSTP server to allowed addresses only. Only three SSTP clients connect to the affected routers, and during the routers’ whole uptime only around 20~50 connection attempts are made to the SSTP server (TCP port 443).
I have other systems with a similar configuration where I do not see this problem (RB1100AHx2, ROS 6.5); on those systems the SSTP server is not protected by the firewall and there are about 100 clients connected full time.
On the affected systems I use this scheduled script to work around the problem by rebooting:

# Read the current and maximum route cache size
:local act [/ip route cache get cache-size]
:local max [/ip route cache get max-cache-size]

# Reboot when fewer than 2048 free entries remain in the cache
:if (($max - $act) <= 2048) do={
  /system reboot
}
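To run this unattended, the script can be saved under /system script and called from the scheduler, roughly like this (a sketch; the script name and five-minute interval are just placeholders I made up):

# Assumes the script above has been saved as "route-cache-check" under /system script
/system scheduler add name=route-cache-check interval=5m on-event=route-cache-check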

TNX for sharing info and your script with us Majklik. I don’t use SSTP on any of those devices that suffered from this issue.

JF.

This problem with the route cache has been around for a long time, with different ROS versions and different configurations too.
It is a pity that ROS does not allow showing the contents of this cache or flushing it. On Linux this can be done with “ip route show cache” and “ip route flush cache”.
This problem will probably be solved definitively when ROS switches to Linux kernel 3.6, because in that version the route cache was removed from the kernel with these arguments:
“The ipv4 routing cache is non-deterministic, performance wise, and is subject to reasonably easy to launch denial of service attacks.”

Make sure that you all upgrade RouterOS to v6.11 and also upgrade the RouterBoard firmware to the latest available (at least v3.11 or newer) - that will guarantee that the gigabit ethernets don’t have temp hangs.
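For anyone unsure how to check and apply the RouterBoard firmware, the usual sequence is roughly this (a sketch; the new firmware only takes effect after the reboot):

# Compare current-firmware with upgrade-firmware
/system routerboard print
# Stage the newer firmware, then reboot to apply it
/system routerboard upgrade
/system reboot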

Hello Uldis

I have already upgraded my devices to the latest software/firmware. This “temp” hang lasted for about 12 hours before I could get someone to reboot my device (remote location)… Until this bug is confirmed fixed, I prefer using the watchdog (pinging the gateway or localhost) or Majklik’s script, just to be on the safe side.

JF.

I have also hit this problem on an RB2011 running L2TP/IPSec VPN server as described here:
http://forum.mikrotik.com/t/ip-route-cache-bug/70827/1

I obtained this info via the serial console; all other networking was unavailable:

uptime: 2d20h16m
version: 6.10

cache-size: 16384
max-cache-size: 16384

Rolling back to 5.25 fixed the issue. I haven’t had a chance to test 6.11, but MikroTik support said they could re-create my issue and would try to have the fix included in it.

Cheers

Yes, I still see this situation with ROS 6.11 on the RB800 and RB450G. No more than one day of uptime. Disabling SSTP eliminates the problem. But many other services are in use, so this bug is a combination with something else (VRRP on all interfaces, bonding, bridges, VLANs, GRE/IPsec, SIT tunnels, OSPFv2/v3, BGP).

So, I have run into this bug recently, too. This post should probably be re-titled and moved to a different sub-forum, because this is not an RB2011 problem or even a RouterBoard problem. We are experiencing this bug with x86 RouterOS, too.

Ever since upgrading one of our x86 boxes to 6.11 from 5.26, the router has stopped responding once a day on account of this bug, and has needed to be rebooted. We are not running SSTP on this box, although we do run an L2TP server on it. I am not 100% convinced that L2TP is what is causing the Linux route cache to balloon in our case, though: usually after a reboot, the route cache size stays pretty low and doesn’t change much, even if I rapidly make several L2TP connections/disconnections to it in the span of a few seconds.

I once was able to catch it in the act of failing, though, before it had completely done so: I logged into the box, peeked at the route cache, and it was increasing at the rate of about 20-30 per second, and sometimes faster (100+). In the span of a few minutes, it had reached 24K entries out of a maximum (on this box) of 32K. Nothing that I attempted to do managed to stop the rate of growth, and this is on a box that has a pretty simple configuration, has a very small routing table, and doesn’t participate in any dynamic route exchange/forwarding protocols. So it was very strange to see this behavior. After a reboot, the route cache got up to a little > 100, and then stayed there. Part of me wonders if it is a bot mounting some kind of DoS/intentional route cache poisoning attack on vulnerable Linux boxes.

I don’t know yet what causes the route cache to go into a tailspin, but since this router’s utility is so small and isn’t configured to do much, I’m hopeful that if I try to replicate in a lab environment, I will eventually be able to find the trigger. I’ll file a report with MikroTik if I am able to do so. Hopefully they’ll be able to do something about it, even if the problem ends up being in Linux itself rather than anything RouterOS-specific.

Very much looking forward to RouterOS 7, which will surely use a version of the Linux kernel >= 3.6…

– Nathan

P.S. – Not sure whether it is wise to mention this or not, but I did at least run across a workaround. I’m not sure what the possible negative effects and implications of this workaround might be, but if you can gain access to the ‘devel’ account on your specific router (…that’s as much as I will say about that…), you can both manually flush the Linux IPv4 route cache as well as tweak the Linux route cache settings to auto-flush old entries at a much faster rate, which should prevent the cache from reaching max-cache-size once it starts going crazy.

This shell command will flush the cache:

echo 1 > /proc/sys/net/ipv4/route/flush

These two commands will dramatically increase the rate at which the route cache garbage collector expires entries, which should help it keep up when the growth rate decides to spontaneously explode:

echo 5 > /proc/sys/net/ipv4/route/gc_interval
echo 5 > /proc/sys/net/ipv4/route/gc_timeout

These changes are not permanent, and will revert to default settings (60 for gc_interval, 300 for gc_timeout) when you reboot the router.

EDIT: This workaround turns out not to always be effective; see my next post in this thread for details.

Nice discovery Nathan!

Have you tried the ‘-C’ option to the Linux route command within the devel login? It would be interesting to see what all the entries are!

Cheers

Neither the ‘ip’ nor ‘route’ commands appear to be part of the busybox binary that MikroTik ships with RouterOS, which is why I am interacting with the ‘proc’ virtual filesystem directly. I have plans to build and try a more complete, statically-linked busybox binary that includes the ‘ip’ command, though I have not checked yet to see how complete busybox’s version of ‘ip’ is.

– Nathan

EDIT: Update:

I now have a statically-linked busybox binary that includes both ‘ip’ and ‘route’. The busybox version of ‘route’ doesn’t support the -C parameter, but ‘ip route show cache’ does work, and if I run that on the box in question while the route cache size is going berserk, it doesn’t show anything abnormal: the route cache size says 10000+ entries, but ‘ip route show cache’ only shows the 5 or so routes that I would expect to see on this particular router. So that’s interesting.

The problem (well, at least, my problem) is definitely related to MikroTik’s new PPP code, though. I’ll be attempting to put together a step-by-step method of reproducing the bug once I have it 100% nailed down, but right now it appears to be triggered if you are running a PPP-based server (PPTP, L2TP, SSTP, maybe even PPPoE, etc.) and you have several connections to it go up and down. Eventually, after one of the PPPs disconnects, it seems like there might be some kind of race condition that occurs when it tries to tear the PPP interface down. ‘dmesg’ output shows a huge number of messages like this on my router shortly after the route cache starts spinning out of control:

unregister_netdevice: waiting for ppp1 to become free. Usage count = 1
unregister_netdevice: waiting for ppp1 to become free. Usage count = 1
unregister_netdevice: waiting for ppp1 to become free. Usage count = 1
unregister_netdevice: waiting for ppp1 to become free. Usage count = 1
[...]

…and something just keeps repeating that message over and over again. This is despite the fact that ‘ppp1’ as an interface no longer exists:

# busybox ifconfig ppp1
ifconfig: ppp1: error fetching interface information: Device not found

…so it’s trying to unregister a device that doesn’t exist?

More bad news: once it gets to this state, it appears that it is impossible to flush the cache of the entries being added to it (I assume by PPP?). It would seem that something in the PPP subsystem has a lock on those entries and they can’t be freed. Tweaking the route cache garbage collector values doesn’t make a difference, either…the number doesn’t go down. I know that those proc/sysctl values actually work because I tested them on a MikroTik with a fairly large route cache, but one that wasn’t spiraling out of control, and was able to successfully flush the cache and visibly see the garbage collector behavior change. Once this particular bug is triggered, however, the only thing that can cure it is a reboot. I even tried killing the ppp and ppp-worker processes, but although they cleanly exited after being sent SIGTERM, the route cache remained bloated and the “unregister_netdevice” console errors continued apace.

Finally, it may have something to do with MPPE. I notice that after the problem starts, the ‘ppp_mppe’ kernel module shows that it is in-use by something and cannot be unloaded, even after I have terminated all PPP tunnels and shutdown PPP services:

# lsmod | busybox grep mppe
ppp_mppe 5585 6 - Live 0x90d5d000

This might just be another symptom, though, rather than a cause, especially if people are also running into this problem with SSTP, which should have no use for MPPE.

EDIT 2: Well, now this is interesting. I think I have managed to find a way to reproduce a version of this bug, but now that I’ve done so, I tried manually flushing the route cache again, and this time doing so has an effect. It will continue to grow and grow on its own even after a flush, but executing the flush actually works this time and clears the cache (temporarily). Weird.

I’m also thinking that the problem is related to the new PPP package. The route cache problem has been much worse since ROS 6.10.
There is one other test, which I reported yesterday ( [Ticket#2014032566001708] ). I have two metarouters: one runs an SSTP server with one dead connection (in some configurations the keepalive timeout does not work on the SSTP server side and the server does not close the dead connection; this problem was the primary reason for the test), and the second metarouter is a client which keeps trying to connect to the server, but the connection fails because the SSTP server allows only one connection. I see that the route cache slowly fills up until the server stops responding completely after a few hours (if the SSTP server is disabled, or if there is only one live connection, the metarouter lives for days). If I leave it in this state, the metarouter reboots after a few hours. If the SSTP server is disabled and the connection closed (before the metarouter hangs), the cache gets flushed after some time.
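For anyone wanting to approximate that lab setup, one way to restrict the server side to a single connection is roughly this (a sketch with made-up profile/certificate names and addresses; the dead-connection part of my test obviously cannot be scripted):

# Server metarouter: allow only one simultaneous session per user
/ppp profile add name=sstp-one only-one=yes
/interface sstp-server server set enabled=yes default-profile=sstp-one certificate=server-cert
# Client metarouter: keeps retrying against the server (hypothetical address/credentials)
/interface sstp-client add name=sstp-test connect-to=10.0.0.1 user=test password=test disabled=no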

I came up with a similar test, but one that I run with L2TP instead, which I am preparing a description of for MikroTik at this moment. Rather than limiting the connection to 1, however, I purposefully mismatched the encryption requirement between the server and the client: the server requires encryption but the client refuses it. The client tries to rapidly connect to the server over and over again and this quickly causes the scenario that I described in my last post, where something gets “stuck” trying to tear down one of the old pppX interfaces. It also generates several holds on the ppp_mppe kernel module as well. Interestingly, every time the L2TP client tries to connect again, it actually causes the route cache to be flushed. But if you let the L2TP client repeatedly try and fail to connect for 2-3 minutes, and then disable it, the route cache on the server will have a mind of its own and just grow and grow and grow after this.
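In configuration terms, the encryption mismatch looks roughly like this (a sketch with hypothetical profile names and addresses, not my exact lab setup):

# Server side: require MPPE encryption for L2TP sessions
/ppp profile add name=l2tp-enc-required use-encryption=required
/interface l2tp-server server set enabled=yes default-profile=l2tp-enc-required
# Client side: refuse encryption, so the connection repeatedly fails and retries
/ppp profile add name=l2tp-enc-refused use-encryption=no
/interface l2tp-client add name=l2tp-test connect-to=10.0.0.1 user=test password=test profile=l2tp-enc-refused disabled=no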

– Nathan

EDIT: Actually, I’m beginning to think there are 2 issues: 1) the explosive growth of the route cache when something in the PPP subsystem gets stuck, and 2) the route cache getting into a state where it cannot be flushed any longer. I know how to make #1 happen…that’s easy. However, when #1 is happening, the route cache garbage collector seems to be able to keep up with it, so if #1 is happening but #2 is not, you probably still won’t see a crash. The real problem happens when #1 is combined with #2, and that’s what I experienced when I tried to flush the cache after it started growing and found that I couldn’t…the cache size would not go down when I tried a manual flush, and the garbage collector was not doing anything. I don’t yet know how to reproduce that state of things, but I have observed it once.

We are seeing this issue with router x86 machines that have a lot of inbound VPN connections. These are a combination of on-demand L2TP and site-to-site IPSec tunnels. Failure occurs in less than 24 hours.

We’ve implemented the check for the route cache to automatically reboot.

This issue has been around a long time but clearly was made MUCH worse with the 6.11 release. The 6.11 release is not really usable at this time.

I know it isn’t nearly as useful as Nathan’s detailed information… but here is an odd scenario that happened.

  • The MikroTik running RouterOS x86 crashed about 16 hours after the upgrade
  • Rebooted, and one L2TP connection was made
  • Crashed again within 45 minutes
  • Two L2TP connections made (I saw the route cache flush happen when the second connection was made). One connection was the same user as the previous session; the second connection was from a different IP/user.
  • Route cache memory looks stable at this point (I see it increase and decrease).

I don’t have all the detailed tools you are using to capture the data, Nathan. Thanks for your work!

Uptime was only 5 hours the last go-around. What version is the suggested downgrade? I’d prefer not to have to go all the way back to 5.24, since the queues were redone for version 6 and all the IPsec configuration scripts had to be updated with “aes-256-cbc” instead of “aes-256”. I don’t know why MikroTik changes simple things like that, which break backwards compatibility with configuration scripts.
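For anyone else hitting the same script breakage, the change is just the algorithm name in the IPsec proposal, roughly like this (a sketch assuming the default proposal is the one in use):

# ROS 5.x scripts used enc-algorithms=aes-256; ROS 6.x expects the -cbc suffix
/ip ipsec proposal set [find name="default"] enc-algorithms=aes-256-cbc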

Same problem here using an RB2011UAS-2HnD and the latest firmware/software.

It has happened a few times already (I have had this router for 3 weeks), always while L2TP/IPSec clients were connected (last crash - 5 clients connected). I can’t ping it, I can’t log into it (ssh, telnet, winbox, web). Only a reboot fixes this. MikroTik - please solve this bug.

For the time being I’ll just use Majklik’s script.

I upgraded to 6.12 thirteen days ago and have not seen a reboot since.