Disaster ... all routers die at the same time...

bennyng · Sun Jun 14, 2015 8:47 am

Last Friday night was a disaster, and I would like to share it in case anyone can give me a clue.

I am running 2 x CCR1036 as the core routers for an ISP service. They ran for more than 2 years without problems on version 6.7. Since it had been a while, and I wanted FastPath support because my customers are mostly VoIP, I decided to upgrade to 6.28 with firmware 3.22. I'm running only the SYSTEM and ROUTING packages; the other unnecessary packages were removed to save resources.

After the upgrade, I simplified the configuration and turned off some features: I disabled IP traffic flow, removed all filters, turned off connection tracking, and moved my major customers from VLAN ports to physical ports so that FastPath could work.

Finally, everything was up and running on Jun 1, 2015, with 4 BGP full feeds, 1 exchange connection and a dozen bilateral peers over the exchange connection. Each peer has around 4-5 filter rules, and the two CCRs use 2 x iBGP sessions for interconnection. The customer end uses VRRP for redundancy, and total traffic is around 300Mb. Everything was OK on Jun 1. CPU at peak was only a couple of percent, compared to ~20-30% on 6.7. Everything looked good, except that the BGP instance always had 100% CPU utilization: at any point in time, one of the 32 cores was running at 100%, and that was the BGP instance.

After running for a week, I found that BGP prefix aggregation was not working on one of the two routers, and the aggregated prefix announcement flapped every 2 minutes. So I turned off aggregation on the problem router, so that it would re-advertise the prefix from the good router to the upstream via the iBGP session. I still have no clue what was going on.

The disaster hit last Friday (Jun 12, ~23:20). Both routers stopped responding at the same time, on exactly the same second. All the designed redundancy was lost at once. The symptom was that none of the network interfaces were reachable, but the system was accessible from the console port. Worse, no error appeared in the log.

After a couple of reboots (cold boot: disconnect power and reconnect it), the interfaces came back up for a couple of minutes and then went bad again. I thought it was a configuration issue, so I did the following:

1) /system reset-configuration <= the configuration resets, but the same problem happens
2) downgrade to 6.7 <= same thing happens

In the midst of trial and error, I checked whether something was wrong in the route table. When the problem hit, the route table was filled with all the BGP routes (ADb status), even when the BGP instance was off or the configuration had been reset to factory default.

Weirdest of all, when I used my backup CCR (yes, the 3rd CCR from the warehouse) and was in the middle of configuring it for my 3 upstreams, the exact same problem hit. Something in the BGP routes makes the CCR totally unusable.

After all the problems and trial and error, I had no choice but to bring back my 2 old c7200s to recover the service, and cross my fingers that everything is fine.

I hope someone can give me some hints for my frustration.

pukkita · Sun Jun 14, 2015 12:18 pm

I think FastTrack is too young to put on a "complex" production setup like yours. If the routers had 20% load, where's the problem?

I'd try returning to the previous setup by netinstalling the routers with either 6.19 or 6.27.

bennyng · Sun Jun 14, 2015 6:23 pm

I was trying to use FastPath, not FastTrack. I agree with you that FastTrack is new, and I don't want to take the risk. FastPath has been there since the early 6.xx releases.

I suspect the incident is related to BGP: when I reboot the CCR and hit the issue, the old routing table is still there. Maybe the default memory allocated for the route table or the BGP process ran out, or the BGP process is too slow to cope with a large routing table. Each of my CCRs takes 4 full feeds (2 upstreams, plus 2 iBGP sessions from the others sending full routes), which might fill up all the memory allocated to the BGP program (~500k routes per BGP session).

I saw lots of comments in different threads about the CCR using only ~1G or so of memory. That might be the cause.
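As a rough back-of-envelope check of that suspicion, here is a small Python sketch. The session count and routes per session come from the figures above; the per-path memory costs are purely hypothetical placeholders, since the real RouterOS per-route overhead is not known here.

```python
# Back-of-envelope estimate of BGP path memory.
# SESSIONS and ROUTES_PER_SESSION are the figures quoted in the post.
# The bytes-per-path values are HYPOTHETICAL; the actual RouterOS
# per-route memory cost is not documented in this thread.

def total_paths(sessions: int, routes_per_session: int) -> int:
    """Number of BGP paths held when every session sends a full table."""
    return sessions * routes_per_session


def estimated_mib(paths: int, bytes_per_path: int) -> float:
    """Memory in MiB for the given number of paths at an assumed cost."""
    return paths * bytes_per_path / (1024 * 1024)


SESSIONS = 4                  # 2 upstream + 2 iBGP full feeds
ROUTES_PER_SESSION = 500_000  # ~500k routes per session

paths = total_paths(SESSIONS, ROUTES_PER_SESSION)
print(f"paths held: {paths}")

for bytes_per_path in (100, 250, 500):  # hypothetical costs per path
    mib = estimated_mib(paths, bytes_per_path)
    print(f"{bytes_per_path} B/path -> about {mib:.0f} MiB")
```

Under these assumptions the total only approaches 1 GiB at the highest assumed per-path cost, so whether memory is really the limit depends on a number that would have to be measured on the router.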

I don't mind the BGP process being slow to calculate routes, as long as the BGP keepalives and the other critical parts keep running.

Perhaps a MikroTik insider can check how many MB of memory the BGP process is limited to and, if possible, make this figure adjustable at boot or make it as large as possible.

Also, the BGP process needs multi-thread support. Without it, I cannot risk putting the CCR back in production. I like MikroTik a lot, as I run the same software at home to play around with, but this problem has been a big hit to my loyalty.

pukkita · Sun Jun 14, 2015 6:34 pm

Some CCRs get a corrupted filesystem when upgrading, which leads to all sorts of misbehaviour (I cannot remember which ROS version it was). The only way to recover is netinstall (rebooting or upgrading simply doesn't fix it); that's why I suggested it.

StubArea51 · Sun Jun 14, 2015 6:40 pm

Sorry to hear about all your troubles....as others have said, it is usually a best practice to test code for critical environments before deploying.

VMs are a great way to do this if you don't have the budget to get a hardware test environment. You can replicate a topology and let it run for a few days before implementing.

We use GNS and VirtualBox in addition to our hardware lab to validate configurations for our customers to minimize risk.

http://www.gns3.net/

https://www.virtualbox.org/

Sun Jun 14, 2015 7:23 pm

Was the 3rd router also upgraded to 6.28 and 3.22?

bennyng · Mon Jun 15, 2015 3:06 am

Frankly speaking, I did test. But as an end user you will not run a load test under such conditions.

It is the vendor's job to test different scenarios.

bennyng · Mon Jun 15, 2015 3:10 am

The 3rd CCR was initially on 3.11+6.7 and died during configuration. After upgrading to 3.22+6.28 it still did not work. It shows the same behavior as the other CCRs.

Mon Jun 15, 2015 3:54 am

So it is hard to believe that the 3rd spare MT has the same problem on the old ROS & FW.

Just asking: are you sure that your infrastructure or your clients' infrastructure and configurations have not changed?
Just thinking whether there is a loop, VLAN ID mismatch, BGP ASN mismatch, VRRP flapping, etc.

bennyng · Mon Jun 15, 2015 4:00 am

I don't think there was an infrastructure change:

1) It was Friday night, and everyone had gone to happy hour.
2) The switch ports have broadcast control. If there is a loop, spanning tree kicks in, and if spanning tree cannot detect the loop, broadcast control will errdisable the switch port.
3) None of the network switches detected any MAC flap during the incident, and no alarm was triggered, so there was no VRRP changeover during the incident.

From the information I have gathered so far, everything points to the BGP process or the routing table. Maybe the ADC routes were somehow removed or not detected after the route table corruption. That seems likely, and it matches the symptoms I had.

scotthellewell · Wed Aug 05, 2015 4:20 am

I also have full BGP feeds from several providers. I have 3 locations, and each location has 2 routers with full feeds from 2 peers. Each router at each location also has iBGP feeds with the others (5 iBGP feeds). I am on CCR1016-12S-1S+ routers. I found that the BGP routes never really converge and 1 of the CPUs gets stuck at 100%. For now I have added a rule to my BGP in filters:

/routing filter
add action=discard bgp-as-path-length=3-4294967295 chain=eBGP-in
add action=discard bgp-as-path-length=3-4294967295 chain=iBGP-in

That has allowed me to take the best close paths, but obviously ends up using the default for further paths. Not ideal, but better than not having my routes converge. You might see if something similar helps you.
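To make the effect of those two rules concrete, here is a minimal Python sketch (not RouterOS code) of the same idea: keep only routes whose AS path has at most 2 entries and drop the rest, which then fall back to the default route. The Route type, the function name, and the sample prefixes and AS numbers are made up for illustration, and exactly how RouterOS counts AS-path length is assumed here to be a simple count of entries.

```python
# Sketch of an "AS-path length" inbound filter, modelled on
# bgp-as-path-length=3-4294967295 with action=discard.
# Route, apply_in_filter and the sample data are hypothetical.
from dataclasses import dataclass
from typing import List, Tuple

MAX_KEPT_AS_PATH_LEN = 2  # lengths 3 and above are discarded


@dataclass(frozen=True)
class Route:
    prefix: str
    as_path: Tuple[int, ...]


def apply_in_filter(routes: List[Route]) -> List[Route]:
    """Return only the routes that the discard rule would let through."""
    return [r for r in routes if len(r.as_path) <= MAX_KEPT_AS_PATH_LEN]


# Documentation prefixes and private-use AS numbers as sample input.
received = [
    Route("192.0.2.0/24", (64500,)),
    Route("198.51.100.0/24", (64500, 64501)),
    Route("203.0.113.0/24", (64500, 64501, 64502)),
]

kept = apply_in_filter(received)
for route in kept:
    print(route.prefix, "kept, AS path length", len(route.as_path))
```

In this sample the third prefix is dropped because its path has 3 entries, so traffic to it would follow the default route instead of a specific BGP route.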

bennyng · Mon Aug 24, 2015 5:49 am

I think I have figured out what was going on.

One of my customers has a messy network that bridges different VLANs with a cable. There might have been a loop in their network, with STP being sent out to the router.

I just don't understand why STP would affect the router, since 1) I'm not using any bridge, and 2) a router port should ignore all STP.

Mon Aug 24, 2015 9:15 am

Maybe try the latest firmware and RouterOS (6.30.2)?

Firmware on the CCR1009 is at 3.27.
