Loss of BGP function after 3-4 weeks

Hi everyone

I have a deployment of Routerboard kit around Europe and have a problem with one CCR1036 which seems to lose its BGP service from time to time.

The router is in Frankfurt, Germany and has a full-table transit. It also has an iBGP peer with another 1036 in London via a pseudowire service provided by a third party.

Approximately every 3-4 weeks the BGP service on the Frankfurt router just seems to stop. On the console, if I run ‘routing bgp export’ I get the comment lines and then the session hangs (until Ctrl-C):

[admin@FRARTR01] > routing bgp export
# sep/07/2014 12:25:36 by RouterOS 6.18
# software id = KYWQ-UYW1
#

If I look in the web interface, both BGP instances and BGP peers are blank:

The only way to recover (that I have found) is to reboot the entire router. Incidentally, this takes a long time - approx 2 minutes from issuing the reboot command on the console until the router stops pinging. When it reboots BGP is fully operational again:

It is almost as if the BGP daemon ‘crashes’ and the router cannot restart it?

The router was deployed in June with RouterOS v6.15. The issue occurred 3 times. I upgraded to v6.18 around 3 weeks ago, and it failed with the same issue yesterday.

Has anyone else seen a similar issue with a service just stopping working? Is there any way to debug or find out why this is happening? I had hoped that a software change would have fixed it but having used two versions of the operating system, unless someone else can confirm the same bug, I am starting to think this is a hardware issue. Obviously that is a huge difficulty as the router is in Frankfurt and I am in London!

Thanks for any pointers offered.

Chris

Hello

No problem with BGP on the CCR1036 here, but you need to post more details about your setup and load, and audit your L2 pseudowire.

You are talking about a crash every 2 weeks, but in your picture we can see that the sessions have been up for 13 minutes!

regards
Thierry

Yes, I took the screenshot after I had rebooted the router and the sessions had loaded fully, hence why they have only been up for a few minutes.

Can you post your configuration, or send it to me off-list?

Can you make a supout.rif while you encounter the BGP problem?
It’s a long shot, but if you succeed in getting a supout, maybe Mikrotik support can find something out.

I have 2x 1036s running 3 full v4 and v6 feeds. One is at 47 days uptime on 6.17 right now; the other was up for about 60 days before I rebooted it to test something (6.15).

Plus some queuing and simple policy routing.

Yes, this is what MT support asked me to do too. Next time it happens I will do that. I was really wondering whether anyone else had seen the same problem and found a solution for it.

Chris

That’s disturbing.
I have two CCR1036s in one of the centers and they are exchanging full BGP tables with two upstream ISPs.
I’ve just checked the uptime - 7 weeks.
I’ve had some flapping BGP sessions during these 7 weeks, but it was always an L2 device in between that was breaking the connection.
What you describe is pretty similar to a problem I noticed on the CCR, it only happened once though. The problem was that almost all of the bridge interfaces were gone! A restart of the device solved it. But again, as you describe - in my case the bridge config was shown as empty.

Why do you use WebFig and not WinBox?

Just a question; I prefer WinBox over WebFig.

I usually use SSH. It was just easier to get the screenshots from WebFig.

Ok this just happened again, 21 days from last time. Have sent a supout to Mikrotik for analysis, will report back when they respond to me.

Chris

We had the same thing. It is not about BGP. Most probably the latest configuration changes are somehow not saved and get lost.

We had a few routes and IP addresses added to a CCR1036 lately, and a few days later all were gone.

Could you please confirm that BGP was one of the last config changes made to this router?

Have a look at the memory usage. Maybe a memory leak?
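
A rough way to test the leak idea, as a sketch only: note the free memory at regular intervals (for example the free-memory value shown by `/system resource print`) and check whether it trends steadily downward. The function names, the threshold and the sample numbers below are made up for illustration; none of them come from the router in this thread.

```python
from typing import Sequence


def free_memory_slope(hours: Sequence[float], free_mib: Sequence[float]) -> float:
    """Least-squares slope of free memory (MiB) against time (hours)."""
    if len(hours) != len(free_mib) or len(hours) < 2:
        raise ValueError("need at least two paired samples")
    n = len(hours)
    mean_t = sum(hours) / n
    mean_m = sum(free_mib) / n
    var_t = sum((t - mean_t) ** 2 for t in hours)
    if var_t == 0:
        raise ValueError("samples must be taken at different times")
    cov = sum((t - mean_t) * (m - mean_m) for t, m in zip(hours, free_mib))
    return cov / var_t


def looks_like_leak(hours: Sequence[float], free_mib: Sequence[float],
                    threshold_mib_per_hour: float = -1.0) -> bool:
    """True if free memory falls faster than the (assumed) threshold."""
    return free_memory_slope(hours, free_mib) < threshold_mib_per_hour


# Hypothetical samples: one reading per day over five days.
hours = [0, 24, 48, 72, 96]
free_mib = [15000, 14400, 13800, 13200, 12600]

slope = free_memory_slope(hours, free_mib)
print(f"free memory trend: {slope:.1f} MiB/hour")
print("possible leak" if looks_like_leak(hours, free_mib) else "no clear downward trend")
```

A steady negative slope over several days would point towards a leak; a flat line would make it less likely and shift suspicion elsewhere. The threshold is arbitrary and would need tuning for a router holding full tables.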

I don't think it is a memory leak. I think it is something to do with the disk, because whenever something like this happened to us, the router couldn't do file operations.

Had this happen to a CCR 1009 of mine yesterday also on 6.18, upgraded to 6.19 so we’ll see what happens.

I have seen this effect on one of our CCRs.

No file operations were possible, and it wiped the admin password!!!
Thank god filtering kept the world from logging in to this OSPF router and killing our network.

Yes, you are right. The admin password is set to blank as well. It is one of the symptoms of this bug.