Loss of BGP function after 3-4 weeks

Posted: Mon Sep 08, 2014 3:00 pm
by bigcw
Hi everyone

I have a deployment of Routerboard kit around Europe and have a problem with one CCR1036 which seems to lose its BGP service from time to time.

The router is in Frankfurt, Germany and has a full-table transit. It also has an iBGP peer with another 1036 in London via a pseudowire service provided by a third party.

Approximately every 3-4 weeks the BGP service on the Frankfurt router just seems to stop. On the console, if I run 'routing bgp export' I get the comment lines and then the interface hangs (until Ctrl-C):
[admin@FRARTR01] > routing bgp export
# sep/07/2014 12:25:36 by RouterOS 6.18
# software id = KYWQ-UYW1
#
If I look in the web interface, both BGP instances and BGP peers are blank:

Image
Image

The only way to recover (that I have found) is to reboot the entire router. Incidentally, this takes a long time - approx 2 minutes from issuing the reboot command on the console until the router stops pinging. When it reboots BGP is fully operational again:

Image

It is almost as if the BGP daemon 'crashes' and the router cannot restart it?

The router was deployed in June with RouterOS v6.15, and the issue occurred 3 times on that version. I upgraded to v6.18 around 3 weeks ago, and it failed with the same issue yesterday.

Has anyone else seen a similar issue with a service just stopping working? Is there any way to debug or find out why this is happening? I had hoped that a software change would have fixed it but having used two versions of the operating system, unless someone else can confirm the same bug, I am starting to think this is a hardware issue. Obviously that is a huge difficulty as the router is in Frankfurt and I am in London!

Thanks for any pointers offered.

Chris

Re: Loss of BGP function after 3-4 weeks

Posted: Tue Sep 09, 2014 11:35 am
by wulfgard
Hello

I have no problems with BGP on the CCR1036, but you need to post more data about your setup and load, and audit your L2 pseudowire.

You are talking about a crash every 2 weeks, but in your picture we can see that the sessions have been up for only 13 minutes!

regards
Thierry

Re: Loss of BGP function after 3-4 weeks

Posted: Tue Sep 09, 2014 12:53 pm
by bigcw
You are talking about a crash every 2 weeks, but in your picture we can see that the sessions have been up for only 13 minutes!
Yes, I took the screenshot after I had rebooted the router and the sessions had loaded fully, hence why they have only been up for a few minutes.

Re: Loss of BGP function after 3-4 weeks

Posted: Tue Sep 09, 2014 5:03 pm
by wulfgard
Can you post your configuration, or send it to me off list?

Re: Loss of BGP function after 3-4 weeks

Posted: Wed Sep 10, 2014 5:58 pm
by hedele
Can you make a supout.rif while you encounter the BGP problem?
It's a long shot, but if you succeed in getting a supout, maybe Mikrotik support can find something out.
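Since the symptom is a console command that hangs, one way to catch the fault early enough to collect a supout is a small external watchdog. The following is only a minimal Python sketch, assuming the router accepts a single command over SSH from a monitoring host; the `probe` helper and its timeout are illustrative and not part of any MikroTik tooling.

```python
import subprocess


def probe(cmd, timeout_s=30.0):
    """Run cmd and classify the result as 'ok', 'failed' or 'hung'.

    'hung' means the command did not finish within timeout_s seconds,
    which is the behaviour described for 'routing bgp export'.
    """
    try:
        done = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising, so nothing lingers.
        return "hung", ""
    status = "ok" if done.returncode == 0 else "failed"
    return status, done.stdout
```

A call might look like `probe(["ssh", "admin@FRARTR01", "/routing bgp export"], 30)`, where the SSH target is an assumption based on the prompt shown in the first post. A "hung" result would be the cue to generate the supout (RouterOS offers `/system sup-output` on the console; check its arguments on your version) before rebooting.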

Re: Loss of BGP function after 3-4 weeks

Posted: Wed Sep 10, 2014 9:33 pm
by roadracer96
I have 2x 1036s running 3 full v4 and v6 feeds. One is at 47 days uptime on 6.17 right now; the other was up to about 60 days before I rebooted it to test something (6.15).

Plus some queuing and simple policy routing.

Re: Loss of BGP function after 3-4 weeks

Posted: Thu Sep 11, 2014 5:35 pm
by bigcw
Can you make a supout.rif while you encounter the BGP problem?
It's a long shot, but if you succeed in getting a supout, maybe Mikrotik support can find something out.
Yes, this is what MT support asked me to do too. Next time it happens I will do that. I was really wondering whether anyone else had seen the same problem and found a solution for it.

Chris

Re: Loss of BGP function after 3-4 weeks

Posted: Mon Sep 15, 2014 6:42 pm
by lz1dsb
That's disturbing.
I have two CCR1036s in one of the centers and they are exchanging full BGP table with two upstream ISPs.
I've just checked the uptime - 7 weeks.
I've had some flapping BGP sessions during these 7 weeks, but it was always an L2 device in between that was breaking the connection.
What you describe is pretty similar to a problem I noticed on the CCR, it only happened once though. The problem was that almost all of the bridge interfaces were gone! A restart of the device solved it. But again, as you describe - in my case the bridge config was shown as empty.

Re: Loss of BGP function after 3-4 weeks

Posted: Sat Sep 20, 2014 5:39 pm
by Petzl
Why do you use WebFig and not WinBox?

Just a question; I prefer WinBox over WebFig.

Re: Loss of BGP function after 3-4 weeks

Posted: Wed Sep 24, 2014 6:58 pm
by bigcw
I usually use SSH. It was just easier to get the screenshots from webfig.

Re: Loss of BGP function after 3-4 weeks

Posted: Mon Sep 29, 2014 1:19 pm
by bigcw
Ok this just happened again, 21 days from last time. Have sent a supout to Mikrotik for analysis, will report back when they respond to me.

Chris

Re: Loss of BGP function after 3-4 weeks

Posted: Mon Sep 29, 2014 4:30 pm
by doush
We had the same thing. It is not about BGP. Most probably the latest set of configuration changes was somehow not saved and was lost.

We had a few routes and IP addresses added to a CCR1036 lately, and a few days later all of them were gone.

Could you please confirm that BGP was one of the last config changes made to this router?
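
One way to test the "lost configuration" theory is to keep a known-good export and compare later exports against it. This is a rough Python sketch, assuming both exports are available as plain text; `missing_lines` is a hypothetical helper, and it skips the `#` header lines that RouterOS puts at the top of an export (as seen in the first post). It does not handle lines continued with a trailing backslash.

```python
def missing_lines(baseline, current):
    """Return config lines found in baseline but absent from current.

    Blank lines and '#' comment lines are ignored, because the export
    header (timestamp, software id) differs on every run.
    """
    def body(text):
        return [
            line.strip()
            for line in text.splitlines()
            if line.strip() and not line.lstrip().startswith("#")
        ]

    present_now = set(body(current))
    return [line for line in body(baseline) if line not in present_now]
```

An empty result means nothing from the baseline has gone missing; any returned lines are candidates for configuration that was silently dropped.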

Re: Loss of BGP function after 3-4 weeks

Posted: Mon Sep 29, 2014 5:25 pm
by ste
Have a look at the memory usage. Maybe it is a memory leak?
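
To make the memory check concrete: if free memory is sampled regularly over the 3-4 weeks between failures, a steady downward slope would point to a leak. Below is a minimal Python sketch, assuming the samples have already been collected as (hours since boot, free bytes) pairs; how they are collected, and the function names, are assumptions for illustration.

```python
def leak_rate(samples):
    """Least-squares slope of free memory, in bytes per hour.

    samples is a list of (hours_since_boot, free_bytes) pairs.
    A clearly negative slope suggests memory is leaking.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_m = sum(m for _, m in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("samples must span more than one point in time")
    cov = sum((t - mean_t) * (m - mean_m) for t, m in samples)
    return cov / var_t


def hours_until_empty(samples):
    """Estimate hours until free memory hits zero, or None if not shrinking."""
    slope = leak_rate(samples)
    if slope >= 0:
        return None
    _, last_free = samples[-1]
    return last_free / -slope
```

If the estimate lands near the observed 3-4 week failure interval, that supports the leak theory; a flat slope makes it less likely.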

Re: Loss of BGP function after 3-4 weeks

Posted: Mon Sep 29, 2014 10:22 pm
by doush
I don't think it is a memory leak. I think it is something to do with the disks, because whenever something like this happened to us, the router could not do file operations.

Re: Loss of BGP function after 3-4 weeks

Posted: Tue Sep 30, 2014 2:15 am
by FutileNetworks
Had this happen to a CCR 1009 of mine yesterday also on 6.18, upgraded to 6.19 so we'll see what happens.

Re: Loss of BGP function after 3-4 weeks

Posted: Tue Sep 30, 2014 12:01 pm
by ste
I don't think it is a memory leak. I think it is something to do with the disks, because whenever something like this happened to us, the router could not do file operations.
I have seen this effect on one of our CCRs.

No file operations were possible, and it wiped the admin password!
Thank god filtering kept the world from logging in to this OSPF router and killing our network.

Re: Loss of BGP function after 3-4 weeks

Posted: Wed Oct 01, 2014 3:10 pm
by doush
I don't think it is a memory leak. I think it is something to do with the disks, because whenever something like this happened to us, the router could not do file operations.
I have seen this effect on one of our CCRs.

No file operations were possible, and it wiped the admin password!
Thank god filtering kept the world from logging in to this OSPF router and killing our network.
Yes, you are right. The admin password is also set to blank. It is one of the symptoms of this bug.