5.x routing cache bug (?) - dropped packets, lost network

glucz · April 22, 2011, 6:51am

I reported a problem to MikroTik earlier in which RouterOS 5.x loses network connectivity every day or so - depending on load.

Support told me that the reason for this was that my routing cache filled up. They suggested that my users were running p2p or that the router was being DDoS-ed. However, my suspicion is that this is an actual bug, and I would like to see if anyone else is experiencing this, so that a fix could be seriously considered.

The visible effects are that memory usage on the router constantly increases and the following command:
[admin@MikroTik] > ip route cache print
cache-size: 29621
max-cache-size: 32768

shows a gradually increasing routing cache usage.

Once the max size is reached, the router starts dropping packets and eventually goes completely silent. It never recovers by itself, as I would expect after a p2p or DoS event once the cache entries expire, but requires a reboot instead. Additionally, I don’t see the many thousands of connections that I would expect from DoS or p2p. This is mostly 30-40 users generating 2000-5000 connections. When I downgrade to 4.x, the problem goes away.
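
For anyone who wants to keep an eye on this, here is a minimal Python sketch that takes pasted output of the cache print command and reports how full the cache is. The function names (parse_route_cache, fill_percent) are made up for illustration, and it assumes the output has the two "key: value" lines shown above.

```python
def parse_route_cache(text):
    """Pull cache-size and max-cache-size out of pasted CLI output."""
    values = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        key = key.strip()
        # Only keep the two counters; prompts and other lines are ignored.
        if sep and key in ("cache-size", "max-cache-size"):
            values[key] = int(value.strip())
    return values


def fill_percent(values):
    """Return cache usage as a percentage of the maximum."""
    return 100.0 * values["cache-size"] / values["max-cache-size"]


sample = """[admin@MikroTik] > ip route cache print
cache-size: 29621
max-cache-size: 32768"""

stats = parse_route_cache(sample)
print(f"{fill_percent(stats):.1f}% of the route cache is in use")
```

With the numbers above this comes out at about 90%, so that router was already close to the point where it starts dropping packets.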

seany · April 27, 2011, 8:49pm

Hello,

This sounds exactly like what is happening to a friend. We’ll try to reproduce on 5.2!

glucz · April 28, 2011, 7:24am

There is a route fix in 5.2. I hope that it was in response to this bug report. I upgraded 4 routers yesterday.

So far I’m up to here:

[admin@MikroTik] > ip route cache print
cache-size: 2596
max-cache-size: 65536

I’ll just wait and see what happens.

glucz · April 28, 2011, 9:39am

[admin@MikroTik] > ip route cache print
cache-size: 15239
max-cache-size: 65536

krakenant · May 5, 2011, 1:52pm

I am starting to have issues with routes now. I have (at least) two routes that multiple x86 units, two on 5.2 and two on 5.0, won’t find. The route is there, and it shows up when I export or print, but if I do a print or a find where I specify a dst-address, the query returns nothing.

It broke two of my scripts. They were working until yesterday when I added some mangle rules and a couple of routes based on the routing marks of those mangle rules.

Chupaka · May 5, 2011, 2:33pm

So it was working, and then it broke without an upgrade/reboot/etc.?

krakenant · May 5, 2011, 3:10pm

They were all working with 5.0. Then I added some mangle rules and a couple of routes that used those mangle rules. Since then, my scripts that were working no longer work because they are unable to find an active route with a dst-address of 0.0.0.0/0, despite there being one, with nothing about that route changing and no new routes being added.

These are x86 boxes, I decided to update two of them to 5.2 and the issue still remains.
Two of them have not been upgraded/rebooted etc.

Here is one of the queries that is failing and the default route as well as the two routes that I added.
/ip route find dst-address=0.0.0.0/0 active=yes

[admin@00:60:E0:4C:A5:28] > ip route export
# may/05/2011 10:08:25 by RouterOS 5.0
# software id = C2PW-3RZV
#
/ip route
add disabled=no distance=1 dst-address=172.27.0.0/16 gateway=USER_VRRP2_SECONDARY pref-src=172.27.0.2 routing-mark=VRRP2 scope=10 target-scope=10
add disabled=no distance=1 dst-address=172.27.0.0/16 gateway=USER_VRRP1_PRIMARY pref-src=172.27.0.1 routing-mark=VRRP1 scope=10 target-scope=10
add comment="Default Route" disabled=no distance=10 dst-address=0.0.0.0/0 gateway=x.x.x.x scope=30 target-scope=10

babbage · May 7, 2011, 7:04am

Glucz,
I can confirm I have the same issue. I am afraid to upgrade to 5.2 because of possible new bugs! Let me know if your issue is fixed. I have been in touch with MikroTik for a month and they are still unable to update me on the result.

I have different x86 servers and 3 of them have this issue. All different configs!

glucz · May 7, 2011, 6:13pm

The problem is still present in 5.2

I have been sending supouts to support to help their work; maybe things will improve in 5.3? Unfortunately I need SSTP, so I must keep 5.2 on a few servers.

This also gave me the opportunity to test a few scenarios, and I found the following:
I have a demo PPTP/L2TP profile that is time and bandwidth limited, so RouterOS cuts the user off after some minutes. These dynamic server interfaces also create fifo interface queues to manage the individual bandwidth limits. I believe that one of these causes the stale entries in the cache.

I have disabled this demo profile on 2 servers, so all disconnections are “clean” and the regular user accounts don’t create interface queues. On these routers the cache numbers stay within reasonable values.

Here are the actual numbers:
SERVER 1 with ROS 5.2 and active demo profile (no p2p):
uptime: 2d10h33m55s
cache-size: 16499
max-cache-size: 32768

SERVER 2 with ROS 5.2 active demo profile (no p2p):
uptime: 5h41m1s
cache-size: 5059
max-cache-size: 16384

SERVER 3 with ROS 5.2 active demo profile (no p2p):
uptime: 19h5m5s
cache-size: 7834
max-cache-size: 65536

SERVER 4 with ROS 5.2 DEACTIVATED demo profile (with p2p!!!):
uptime: 2d22h3m9s
cache-size: 1215
max-cache-size: 32768

SERVER 5 with ROS 5.2 DEACTIVATED demo profile (no p2p):
uptime: 2d10h32m52s
cache-size: 200
max-cache-size: 65536

As you can see, my uptimes are rather low, so even where the cache problem is not present, other lockup problems and lingering OpenVPN problems require reboots.
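
To make the comparison easier, here is a rough Python sketch that turns the uptime and cache-size figures above into an average number of cache entries added per hour. It assumes the cache starts near empty at boot and grows roughly linearly, which is a simplification. The helper name uptime_hours and the short server labels are mine, not anything from RouterOS.

```python
import re

SECONDS_PER_UNIT = {"w": 604800, "d": 86400, "h": 3600, "m": 60, "s": 1}


def uptime_hours(uptime):
    """Convert an uptime string such as '2d10h33m55s' to hours."""
    total = 0
    for amount, unit in re.findall(r"(\d+)([wdhms])", uptime):
        total += int(amount) * SECONDS_PER_UNIT[unit]
    return total / 3600.0


# (label, uptime, cache-size) copied from the figures above.
readings = [
    ("SERVER 1 (demo active)", "2d10h33m55s", 16499),
    ("SERVER 2 (demo active)", "5h41m1s", 5059),
    ("SERVER 3 (demo active)", "19h5m5s", 7834),
    ("SERVER 4 (demo off, p2p)", "2d22h3m9s", 1215),
    ("SERVER 5 (demo off)", "2d10h32m52s", 200),
]

# Average growth over the whole uptime, assuming an empty cache at boot.
rates = {label: size / uptime_hours(up) for label, up, size in readings}

for label, rate in rates.items():
    print(f"{label}: about {rate:.0f} entries/hour")
```

On these figures the servers with the demo profile active average roughly 280 to 890 entries per hour, while the two with it deactivated stay below 20 per hour.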

As a comparison here is another server running ROS 3.X
[admin@MikroTik] > system resource print
uptime: 31w13h50m28s
version: "3.27"

I have no routing cache information, but uptime is much better.

Same with 4.x. I did have to reboot this one recently, but usually it is very stable:
[admin@MikroTik] > system resource print
uptime: 1w2d15m14s
version: "4.17"
cache-size: 196
max-cache-size: 16384

I don’t know if it’s important or not, but I also run a script that removes invalid server interfaces and the associated IP addresses and queues. I don’t know where they come from, but from time to time I get a whole bunch of them, then nothing for weeks. This is a RouterOS problem that has been present since the late 2.x and early 3.x versions, ever since I started working with RouterOS. Maybe these removals are not clean on 5.x?

GL

babbage · May 8, 2011, 4:30am

I am doing the same, sending different supout files to help them resolve the issue. I have had 2 tickets open for a long time.
I think it’s not related to interface or queue issues, because I have a core router with only ethernet interfaces and the problem happens there too.
I am impatiently waiting for the fix… I love the new 5.x’s PCQ burst feature…

OndrejSkipala · May 13, 2011, 2:19pm

Hi, I have the same problem on RouterOS 4.x (most recently on 4.16). My RB433 stops responding to IP traffic; only MAC works (but sometimes not very well). No ping, no routing. When I reboot it, it works fine again. This happens after 112 or 113 days on all routers. Also, one of my RB600s collapsed in this way after just 72 days. When I connect to the device via MAC and try to ping, it says “timeout No buffer space available”.

I don’t think this is a DoS attack, because it happens on all routers after 112 days, but not at the same moment. Also, no attack is present according to the monitoring. I wrote to MikroTik support several times, and after “upgrade to new version”, “increase queue sizes”, “it is a DOS attack”… Maris from support found out that the /ip/route/cache seems to be full, although there was very little load. Memory is not full (a lot of space), the disk likewise, CPU 10%.

So I looked at a router that had been running for 112 days, which I was quite certain was going to fail today like the others. The cache was almost full (just a few entries left). After 30 minutes, it happened: full cache, lost connection, no response to IP traffic.

So it must be something with this cache. The only things that run regularly on that router are 5 ICMP echo messages every 15 seconds from our server, and Traffic Flow, which is enabled and sends data to the same server. Not much traffic, but it is something that is common to all our routers.

Do you also use something that regularly creates some kind of traffic?

reddrinker · May 22, 2011, 10:47pm

Hi,

We are experiencing a similar issue with RB1000s on RouterOS 5.2; however, when I look at the route cache it looks OK.

[admin@MikroTik] > ip route cache print
cache-size: 2423
max-cache-size: 65536

When I reboot the router it all works OK, but only for a few hours in our case; then we see the symptoms below creep back in, and another reboot is required.
HOST SIZE TTL TIME STATUS
192.168.1.8 timeout
192.168.1.8 timeout
192.168.1.8 timeout
192.168.1.8 timeout
192.168.1.8 timeout
192.168.1.8 timeout
192.168.1.8 timeout
192.168.1.8 56 128 0ms
192.168.1.8 timeout
192.168.1.8 56 128 0ms
192.168.1.8 timeout
192.168.1.8 timeout
192.168.1.8 timeout
192.168.1.8 timeout
192.168.1.8 timeout
192.168.1.8 timeout
192.168.1.8 timeout
192.168.1.8 timeout
192.168.1.8 56 128 0ms
192.168.1.8 timeout
sent=20 received=3 packet-loss=85% min-rtt=0ms avg-rtt=0ms max-rtt=0ms
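
As a quick cross-check of the summary line, here is a small Python sketch that counts replies and timeouts in the pasted rows and recomputes the loss figure. The function name summarize_ping is hypothetical, and it assumes every row ends either in "timeout" or in a round-trip time, as in the output above.

```python
def summarize_ping(rows):
    """Return (sent, received, loss percent) for pasted ping rows."""
    rows = [row for row in rows if row.strip()]  # ignore blank lines
    timeouts = sum(1 for row in rows if row.split()[-1] == "timeout")
    sent = len(rows)
    received = sent - timeouts
    loss = 100.0 * timeouts / sent if sent else 0.0
    return sent, received, loss


timeout = "192.168.1.8 timeout"
reply = "192.168.1.8 56 128 0ms"

# Same order as the rows above: 20 probes, 3 of them answered.
rows = [timeout] * 7 + [reply, timeout, reply] + [timeout] * 8 + [reply, timeout]

sent, received, loss = summarize_ping(rows)
print(f"sent={sent} received={received} packet-loss={loss:.0f}%")
```

The counts agree with the router's own summary: 17 of the 20 probes timed out.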

I have downgraded another RB1000 to 4.17 and also got an RB110, which came with 4.17, to see if the problem follows.

I am very new to RouterOS, so any guidance in trying to narrow down this issue would be appreciated.

Thanks in advance.

mm690 · May 23, 2011, 5:54am

SHAME ON MikroTik for removing my post about this issue!!!

I was going to paste the link to a previous thread I created about this problem, but MT has deleted it, I would imagine due to its title.

It was titled "Petition to remove 5.xx as stable release!"

I had some good info in there on how to recreate the problem etc.

I’m frigging mad!

normis · May 23, 2011, 6:56am

If you want a problem to be solved, follow this advice:
http://forum.mikrotik.com/t/getting-the-most-out-of-this-forum/40983/1

and also this:
http://forum.mikrotik.com/faq.php#f0r0

glucz · May 23, 2011, 2:10pm

In case the original thread was removed and others experience the route cache problems, lost pings, etc.: change your tarpit actions to drop.

The problem is possibly due to a different route cache / tarpit implementation in the new Linux kernel used by ROS 5, so this may still be present in 5.3 or other future versions.

GL

goshawk · May 25, 2011, 6:50pm

Hi,

I have a similar problem with my freshly installed 5.2 on a PC.

I have multiple VLANs on the same ethernet interface, and sometimes when
I delete an IP address on a VLAN, the route generated automatically
for that IP address is not deleted. I can’t delete it either, because it’s not a static route.

After that, not only that network becomes unreachable but others too.
Reboot always solves the problem.

brainy · May 25, 2011, 6:56pm

5.3 was released today.

Maybe someone who is affected can check whether the problem is fixed in 5.3?

Would be nice to know.

Regards,
Joerg

OndrejSkipala · September 26, 2011, 5:36am

Hi guys, I tried to solve this problem with MikroTik support a hundred times, and finally a fix was made. Version 5.6 of RouterOS really seems to handle the route cache properly. It no longer gets unreasonably big (it used to keep growing and growing), so I don’t need to restart my routers every 110 days like I had to (otherwise they would stop responding to IP traffic). Good job MikroTik.

jrecabeitia · September 26, 2011, 4:36pm

Dear all, I have an RB1100 which I upgraded to v5.7.
The result was very bad.
The RB1100 stopped working. The memory is consumed until it stops and does not respond. (single ping)
This occurred within a few hours.
I went back to version 5.6.
Note that changing the PCQ values stabilized it somewhat, but not enough for it to function normally.
Apparently there are obsolete records in memory that should be flushed. Honestly, I cannot give more information about this bug.

I have other routers (RB433, RB493) with identical OSPF configurations; they differ only in that they do not use queues of any kind. These work well with v5.7.

patmcq · November 28, 2011, 5:10am

I have a similar issue with a 1200 (Release 5.8). I am trying to replace an HP router (dl7012?!) with RouterOS. I have an 80mb connection with 100mb on the way. I service about 60 customers/connections, all 100mb Ethernet, and many of them NAT their connection.

I replaced the HP with the 1200 and everything was fine… for about 2 minutes. Then the router would not respond to pings, either to or through the device. CPU was about 2-7%.

I am not doing anything fancy at the moment. No Masquerade, no Mangle, no VPN, nothing. Just routing between two Ethernet ports. I tried this late on a Sunday night with literally no load and no traffic.

Any ideas?

Patrick