Memory Leak with BGP?

I recently turned up my first BGP sessions using ROS, with two ISPs. I am taking full routes from each ISP on an RB1000. My memory usage graphs are showing a disturbing trend that looks like a memory leak.

Monthly:

Yearly:

As you can see, before turning up BGP my memory usage was consistent. Since turning up BGP (with the big initial spike in memory usage, as expected), it has been rising steadily. At this rate it may be only a few more days before I run out of memory. Is there a memory leak in ROS? Has anyone else using BGP seen something similar? I'm using ROS version 4.9.
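For anyone who wants a record of the trend independent of the graphs, here is a minimal sketch of a scheduler entry that logs free memory once an hour. The scheduler name and interval are my own choices, not from this thread, so adjust to taste:

```
# log free memory hourly so the trend can be reviewed later
/system scheduler add name=mem-log interval=1h \
    on-event=":log info (\"free-memory: \" . [/system resource get free-memory])"
```

With remote logging enabled (as one poster below has), these entries also survive a crash or power cycle.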

I wonder if route flaps are the cause of the memory usage. Say you get 300,000 routes and that takes some chunk of memory; each time there are updates, possibly it is not releasing memory because the memory is now fragmented. I have been running BGP for years and years on MT (3.30 is the latest in prod) and haven't seen this before (that I can remember). Maybe you can take a supout in the beginning and another a few days later, and then open a ticket.

I’ve been experiencing the same thing for a while now. I first noticed it in the 3.20’s when I started taking 2 full bgp feeds. I’ve upgraded through 4.10 and am still experiencing the problem. I’ve tried different RB1000’s but haven’t tried x86 hardware yet. If I don’t reboot before it runs out of memory it will sometimes reboot on its own or lock until I power cycle.

I opened a ticket back in March about this. “[Ticket#2010030766000161] Memory Leak?”

changeip, are you taking full bgp? x86 or rb1000?

Thanks,
Gerard

x86 - but only taking about 20k - 40k routes.

I had to reboot my router the other day for upgrade to 4.10, but until that point, a sawtooth pattern was beginning to emerge on my memory usage. No instability encountered yet.

So far on 4.10 and upgraded RB1000 firmware to 2.27 the climb has begun again.

Gerard, please post here if you get any feedback on your ticket. Too bad it has been open for 3 months. If you don't hear back in a couple of days, let me know and I'll open a second ticket for this issue.

Here are my graphs from my current router. The previous RB1000 had 1 GB of RAM and did the same thing.
bgpmemory2.png
The time from March to June is when I was filtering most of the incoming routes with:
add action=discard chain=in-att comment="" disabled=no invert-match=no prefix-length=17-32
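For context, a discard rule like that only takes effect once its chain is attached to the BGP peer. A sketch of how that attachment looks; the peer name "att-peer" is hypothetical:

```
/routing filter
add chain=in-att action=discard prefix-length=17-32
# attach the chain as the inbound filter on the peer
/routing bgp peer
set [find name=att-peer] in-filter=in-att
```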

Gerard

Here is an upgrade from 2.9.51 to 3.30 on x86.

You can see that in the first half memory never decreased, only increased (it was never released). The second half is after the upgrade to 3.30, and you can see it actually seemed to release memory and vary depending on how many routes there were. I have not dared to try 4.x on those routers yet.

I'm only taking about 20k-30k routes. Rather than filtering by prefix length, I filter on what's directly connected to that peer (customer routes), and then just use Level3 for everything else. There's not really any reason to take full tables from every peer unless you pay exactly the same for all of them.
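A rough sketch of that approach, using hypothetical chain names and an example customer prefix: accept only the customer routes from the direct peer and discard the rest, while taking everything from the Level3 session:

```
/routing filter
# peer chain: accept only the directly connected customer's prefix
add chain=peer-in prefix=203.0.113.0/24 action=accept
add chain=peer-in action=discard
# upstream chain: take everything else from Level3
add chain=level3-in action=accept
```

Each chain would then be set as the in-filter on the corresponding BGP peer.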

We have an RB1000 carrying 320k BGP routes on ROS 4.9 and we didn't see this behavior. We are graphing using "/tool graphing" and it shows stable, constant usage over the last 15 days…
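For anyone who hasn't set that up, resource graphing (which includes memory) can be enabled with something like the following. The allow-address here is wide open, which you would normally tighten to your management network:

```
/tool graphing resource
add allow-address=0.0.0.0/0 store-on-disk=yes
```

The graphs are then viewable from the router's web interface under /graphs.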

We too are seeing a constant rise in memory consumption on an x86 dual-core router taking full ipv4 and ipv6 routes from one peer in another AS.

It is also the sawtooth pattern with a general rising trend. It seems to be losing around 3 or so megabytes per day.

This router is running BGP with one ipv4 peer and one ipv6 peer (to the same remote side), plus OSPFv2 and OSPFv3; there are some firewall rules, and basically nothing else (it sends log messages to a remote host).

It's currently running 4.10; we also saw the memory problem in 4.5.

We have been seeing this exact problem with Mikrotik since the mid 3.xx releases (I never kept notes, so I don't know which version). Basically, RAM is stable until BGP full routes are installed (naturally, RAM should climb as more prefixes are announced), but at one point I was seeing a few megabytes lost per day. The router would get low on resources and simply fail, requiring a hard power cycle to recover. This happens often. I've been rebooting the Mikrotik RB1000 once a month to avoid it.

Reliability is not so good :(

Chaps, have any of you seen any movement on this issue yet? A customer's RB1000 running 4.10 did exactly the same thing this morning. If we clear the BGP peer at perhaps 450 MB and the table reloads, it drops back to around 250 MB. This is a single peer with 350k routes at present.

Anyone's updates/comments welcome.

Cheers,

Pete

There have been no changes or updates that I have applied since my previous posting that have corrected this error. I really love Mikrotik, but I have come to the conclusion that the BGP implementation is just no good. I have sent countless supout.rif files to Mikrotik and I get no meaningful help. I recently made the switch to another routing platform to handle BGP, and I can actually get a good night's sleep now. Mikrotik used to seem somewhat BGP-stable in the 3.xx releases, but the 4.xx releases must be flawed somehow. A real shame considering the price. I do not believe I had any 'crazy' BGP configuration or anything like that. I have also noticed some crashing with Metarouters, but that is another story.

I primarily use RB1000s, RB1100s, and 750s. I used to do PC-based Mikrotik, but I liked the compact 1U design of their routers (and the price, and the ease of install; it fits in a backpack).

If any 'official' Mikrotik folks are reading this, all I can tell you is I played by the rules for over a year waiting for a fix and nothing ever happened. I was quickly told to switch off SNMP (which was a real problem for me); I did that too, and it didn't solve the problem. Despite all this, I still love Mikrotik and wish there were more features for IPv6 and more SFP (fiber) connectivity options. It would be great if they made an ADSL module too (or dual modules to do bonding).

I can also confirm that there is a memory leak in the BGP implementation of Mikrotik v4.16.

The graph was made on i386 architecture.

Also, CPU utilization starts going crazy when a peer is lost, or when full routes are being sent to a peer. I also noticed that routes come in pretty slowly; I'm not certain whether the slow route installation is due to the RB1000/RB1100 hardware. Quagga on x86 runs far better, even on slower computers.

Indeed, I thought the routes were coming in pretty slowly, but that depends on the CPU quite a lot. If you have a quad-core CPU and a lot of bandwidth between the peers, you'll get them pretty fast, at about 70-80 Mbps.

What is quite annoying is that if the router receives more packets than it can handle, it won't simply start dropping packets; it will break the peering sessions on which the packets arrive, and possibly some unrelated ones, with a "hold time expired" error message. Better internal interrupt management is needed.

I am also curious whether using Intel server boards with offloading would produce better results. In principle the card should collect a buffer of packets before raising a single interrupt request, rather than one interrupt per packet as usual ethernet boards do. Does anyone have experience with such ethernet cards?

Does the memory usage keep growing until there is no free memory and the router hangs? From the graphs it doesn't look that way.

In my RB1100 system with 1 GB of RAM, when the RAM gets to the 300-400 MB level it starts degrading rapidly, CPU utilization climbs to 100%, and then the router dies; I need to do a power cycle to get it back up and running.

Yes it does, but it takes weeks. I've had two lock-ups at about 200-250 MB of free RAM and had to reboot.

Edit: There you go, it just keeps rising.

From 700 KB to 1 MB? That graph is strange.

Please generate one supout right after the router is rebooted. Then wait until memory climbs to the point where "it starts degrading rapidly" and generate another supout. Send both files to support.
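If doing this from the console rather than Winbox, something like the following should produce the two files (the file names are my own examples; on some versions the supout may only be available from Winbox's "Make Supout.rif" button):

```
# right after reboot
/system sup-output name=supout-after-reboot
# ...days later, once memory has climbed...
/system sup-output name=supout-degraded
```

The resulting .rif files appear under /file and can be attached to the ticket.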