Last Friday night was a disaster, and I would like to share it in case anyone can give me some clues.
I am running 2 x CCR1036 as the core routers for an ISP service. They ran for more than 2 years without problems on version 6.7. Since it had been a while, and I wanted FastPath support because my customers are mostly VoIP, I decided to upgrade to 6.28 with firmware 3.22. I am running only the SYSTEM and ROUTING packages; the other unnecessary packages have been removed to save resources.
After the upgrade, I simplified the configuration and turned off some features (IP traffic flow disabled, all filters removed, connection tracking off), and I moved my major customers from VLAN ports to physical ports so that FastPath can work.
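For anyone comparing setups, changes like these are usually made on RouterOS 6.x with commands along the following lines. This is an illustrative sketch only, not an export from the routers in question, and each command should be checked against your own version before use.

```
# Disable traffic flow accounting
/ip traffic-flow set enabled=no
# Remove every firewall filter rule
/ip firewall filter remove [find]
# Turn off connection tracking
/ip firewall connection tracking set enabled=no
# Make sure IPv4 fast path is allowed
/ip settings set allow-fast-path=yes
```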
Finally, everything was up and running on Jun 1, 2015, with 4 BGP full feeds, 1 exchange connection and a dozen bilateral peers over the exchange connection. Each peer has around 4-5 filter rules, and the two CCRs are interconnected with 2 x iBGP sessions. The customer side uses VRRP for redundancy, and total traffic is around 300 Mbps. Everything was OK on Jun 1. CPU at peak was only a couple of percent, compared to ~20-30% on 6.7. Everything looked good, except that the BGP instance was always at 100% CPU utilization: at any point in time, one of the 32 cores was running at 100%, and it was the BGP instance.
After running for a week, I found that BGP prefix aggregation was not working on one of the two routers, and its aggregated prefix announcement flapped every 2 minutes. So I turned off aggregation on the problem router, so that the prefixes would be re-announced from the good router to the upstream via the iBGP session. I still have no clue what was going on.
The disaster hit last Friday (Jun 12, ~23:20). Both routers stopped responding at the same time, on exactly the same second. All the designed redundancy was lost at once. The symptom was that none of the network interfaces were reachable, but the systems were still accessible from the console port. The worst thing is that no errors appeared in the log.
After a couple of reboots (cold boot: disconnect power and reconnect it), the interfaces came back up for a couple of minutes and then went bad again. I thought it was a configuration issue, so I did the following:
1) /system reset-configuration <= the configuration resets, but the same problem happens
2) downgrade to 6.7 <= same thing happens
In the midst of trial and error, I checked whether something was wrong in the route table. When the problem hit, the route table was filled with all the BGP routes (ADb status), even when the BGP instance was OFF or the configuration had been reset to factory default.
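For anyone who wants to look for the same symptom, checks of this kind can be sketched on RouterOS 6.x roughly as follows. The commands are illustrative, assume the standard routing package, and are not a record of exactly what was typed on these routers.

```
# Count BGP-learned routes in the main routing table
/ip route print count-only where bgp
# Confirm whether any BGP instance or peer is still enabled
/routing bgp instance print
/routing bgp peer print status
# See which process is consuming CPU
/tool profile
```

If the route count stays at full-table size while no instance or peer is enabled, that matches the behaviour described above.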
The weirdest of all is that when I used my backup CCR (yes, the 3rd CCR from the warehouse), the exact same problem hit while I was in the middle of configuring it for my 3 upstreams. Something in the BGP routes makes the CCR totally unusable...
After all the problems and trial and error, I had no choice but to bring back my 2 old c7200s to recover the service, and cross my fingers that everything stays fine.
I hope someone can give me some hints to ease my frustration.