We thought we were doing a good thing for network reliability by implementing VRRP. Instead, it made our network LESS reliable than if we had put everything on one router. (Grrr...)
Early one morning (so early that no one was up making changes), everything served by VRRP behind these routers started going down intermittently for a few minutes at a time. Our monitoring system would show a service down, then back up, and we had hundreds of checks flapping. On the primary VRRP router, everything looked fine. On the backup VRRP router, the logs showed something else:
Mar 20 01:01:36 VRRP-OSPF-B vrrpbinfo: combo1.vlan0173.vrrp2 now MASTER, master down timer
Mar 20 01:01:36 VRRP-OSPF-B vrrpbinfo: combo1.vlan0336.vrrp2 now MASTER, master down timer
Mar 20 01:01:36 VRRP-OSPF-B vrrpbinfo: combo1.vlan0173.vrrp1 now MASTER, master down timer
Mar 20 01:01:36 VRRP-OSPF-B vrrpbinfo: combo1.vlan0174.vrrp4 now MASTER, master down timer
Mar 20 01:01:36 VRRP-OSPF-B vrrpbinfo: combo1.vlan0174.vrrp0 now MASTER, master down timer
Mar 20 01:01:36 VRRP-OSPF-B vrrpbinfo: combo1.vlan0174.vrrp2 now MASTER, master down timer
Mar 20 01:01:36 VRRP-OSPF-B vrrpbinfo: combo1.vlan0339.vrrp1 now MASTER, master down timer
Mar 20 01:01:37 VRRP-OSPF-B vrrpbinfo: combo1.vlan0337.vrrp3 now MASTER, master down timer
Mar 20 01:01:37 VRRP-OSPF-B vrrpbinfo: combo1.vlan0411.vrrp1 now MASTER, master down timer
Mar 20 01:01:37 VRRP-OSPF-B vrrpbinfo: combo1.vlan0337.vrrp0 now MASTER, master down timer
Mar 20 01:01:37 VRRP-OSPF-B vrrpbinfo: combo1.vlan0337.vrrp1 now MASTER, master down timer
Mar 20 01:01:37 VRRP-OSPF-B vrrpbinfo: ether5.vlan0009.vrrp0 now MASTER, master down timer
Mar 20 01:01:37 VRRP-OSPF-B vrrpbinfo: ether1.vlan0315.vrrp0 now MASTER, master down timer
Mar 20 01:01:38 VRRP-OSPF-B vrrpbinfo: combo1.vlan0175.vrrp0 now BACKUP, got higher priority 100 from 10.20.10.241
Mar 20 01:01:38 VRRP-OSPF-B vrrpbinfo: combo1.vlan0173.vrrp2 now BACKUP, got higher priority 100 from 10.20.10.209
Mar 20 01:01:38 VRRP-OSPF-B vrrpbinfo: combo1.vlan0174.vrrp1 now BACKUP, got higher priority 100 from 10.20.10.225
Mar 20 01:01:38 VRRP-OSPF-B vrrpbinfo: combo1.vlan0339.vrrp1 now BACKUP, got higher priority 100 from 10.20.21.49
Mar 20 01:01:38 VRRP-OSPF-B vrrpbinfo: combo1.vlan0337.vrrp3 now BACKUP, got higher priority 100 from 10.20.21.17
Mar 20 01:01:38 VRRP-OSPF-B vrrpbinfo: combo1.vlan0336.vrrp0 now BACKUP, got higher priority 100 from 10.20.21.1
Mar 20 01:01:38 VRRP-OSPF-B vrrpbinfo: combo1.vlan0335.vrrp1 now BACKUP, got higher priority 100 from 10.20.20.241
Mar 20 01:01:38 VRRP-OSPF-B vrrpbinfo: combo1.vlan0411.vrrp1 now BACKUP, got higher priority 100 from 10.20.25.177
Mar 20 01:01:38 VRRP-OSPF-B vrrpbinfo: combo1.vlan0401.vrrp0 now BACKUP, got higher priority 100 from 10.20.25.17
Mar 20 01:01:38 VRRP-OSPF-B vrrpbinfo: combo1.vlan0219.vrrp1 now BACKUP, got higher priority 100 from 10.20.13.177
Mar 20 01:01:38 VRRP-OSPF-B vrrpbinfo: combo1.vlan0030.vrrp8 now BACKUP, got higher priority 100 from 10.20.1.225
Mar 20 01:01:38 VRRP-OSPF-B vrrpbinfo: ether5.vlan0007.vrrp0 now MASTER, master down timer
Mar 20 01:01:39 VRRP-OSPF-B vrrpbinfo: ether3.vlan0081.vrrp0 now MASTER, master down timer
Mar 20 01:02:45 10.3.19.2 25579: Mar 20 01:02:44.696: %SW_MATM-4-MACFLAP_NOTIF: Host 0000.5e00.0101 in vlan 341 is flapping between port Gi0/2 and port Gi0/1
Mar 20 01:02:45 10.3.19.2 25580: Mar 20 01:02:44.998: %SW_MATM-4-MACFLAP_NOTIF: Host 0000.5e00.0102 in vlan 303 is flapping between port Gi0/2 and port Gi0/1
Mar 20 01:02:45 10.3.19.2 25581: Mar 20 01:02:45.535: %SW_MATM-4-MACFLAP_NOTIF: Host 0000.5e00.0101 in vlan 334 is flapping between port Gi0/2 and port Gi0/1
Mar 20 01:02:45 10.3.19.2 25582: Mar 20 01:02:45.669: %SW_MATM-4-MACFLAP_NOTIF: Host 0000.5e00.0100 in vlan 303 is flapping between port Gi0/2 and port Gi0/1
Mar 20 01:02:46 10.3.19.2 25583: Mar 20 01:02:46.038: %SW_MATM-4-MACFLAP_NOTIF: Host 0000.5e00.0101 in vlan 436 is flapping between port Gi0/2 and port Gi0/1
Mar 20 01:02:46 10.3.19.2 25584: Mar 20 01:02:46.072: %SW_MATM-4-MACFLAP_NOTIF: Host 0000.5e00.0102 in vlan 325 is flapping between port Gi0/2 and port Gi0/1
Mar 20 01:02:46 10.3.19.2 25585: Mar 20 01:02:46.072: %SW_MATM-4-MACFLAP_NOTIF: Host 0000.5e00.0100 in vlan 382 is flapping between port Gi0/2 and port Gi0/1
Mar 20 01:02:46 10.3.19.2 25586: Mar 20 01:02:46.340: %SW_MATM-4-MACFLAP_NOTIF: Host 0000.5e00.0100 in vlan 436 is flapping between port Gi0/2 and port Gi0/1
Mar 20 01:02:46 10.3.19.2 25587: Mar 20 01:02:46.575: %SW_MATM-4-MACFLAP_NOTIF: Host 0000.5e00.0102 in vlan 342 is flapping between port Gi0/2 and port Gi0/1
Mar 20 01:02:47 10.3.19.2 25588: Mar 20 01:02:46.877: %SW_MATM-4-MACFLAP_NOTIF: Host 0000.5e00.0103 in vlan 304 is flapping between port Gi0/2 and port Gi0/1
Mar 20 01:02:47 10.3.19.2 25589: Mar 20 01:02:47.011: %SW_MATM-4-MACFLAP_NOTIF: Host 0000.5e00.0104 in vlan 304 is flapping between port Gi0/2 and port Gi0/1
Mar 20 01:02:47 10.3.19.2 25590: Mar 20 01:02:47.078: %SW_MATM-4-MACFLAP_NOTIF: Host 0000.5e00.0100 in vlan 315 is flapping between port Gi0/2 and port Gi0/1
Mar 20 01:02:47 10.3.19.2 25591: Mar 20 01:02:47.078: %SW_MATM-4-MACFLAP_NOTIF: Host 0000.5e00.0100 in vlan 415 is flapping between port Gi0/2 and port Gi0/1
Mar 20 01:02:47 10.3.19.2 25592: Mar 20 01:02:47.649: %SW_MATM-4-MACFLAP_NOTIF: Host 0000.5e00.0101 in vlan 304 is flapping between port Gi0/2 and port Gi0/1
Mar 20 01:02:48 10.3.19.2 25593: Mar 20 01:02:47.716: %SW_MATM-4-MACFLAP_NOTIF: Host 0000.5e00.0101 in vlan 342 is flapping between port Gi0/2 and port Gi0/1
Mar 20 01:02:50 10.3.19.2 25594: Mar 20 01:02:49.066: %SW_MATM-4-MACFLAP_NOTIF: Host 0000.5e00.0103 in vlan 303 is flapping between port Gi0/2 and port Gi0/1
Mar 20 01:01:45 VRRP-OSPF-B vrrpbinfo: combo1.vlan0175.vrrp1 now MASTER, master down timer
Mar 20 01:01:45 VRRP-OSPF-B vrrpbinfo: combo1.vlan0173.vrrp3 now MASTER, master down timer
Mar 20 01:02:52 10.3.19.2 25595: Mar 20 01:02:51.684: %SW_MATM-4-MACFLAP_NOTIF: Host 0000.5e00.0101 in vlan 325 is flapping between port Gi0/2 and port Gi0/1
Mar 20 01:02:52 10.3.19.2 25596: Mar 20 01:02:51.751: %SW_MATM-4-MACFLAP_NOTIF: Host 0000.5e00.0100 in vlan 325 is flapping between port Gi0/2 and port Gi0/1
Mar 20 01:02:52 10.3.19.2 25597: Mar 20 01:02:51.784: %SW_MATM-4-MACFLAP_NOTIF: Host 0000.5e00.0103 in vlan 321 is flapping between port Gi0/2 and port Gi0/1
Mar 20 01:02:52 10.3.19.2 25598: Mar 20 01:02:51.885: %SW_MATM-4-MACFLAP_NOTIF: Host 0000.5e00.0100 in vlan 321 is flapping between port Gi0/2 and port Gi0/1
Mar 20 01:02:52 10.3.19.2 25599: Mar 20 01:02:52.053: %SW_MATM-4-MACFLAP_NOTIF: Host 0000.5e00.0102 in vlan 304 is flapping between port Gi0/2 and port Gi0/1
Mar 20 01:01:46 VRRP-OSPF-B vrrpbinfo: combo1.vlan0339.vrrp0 now MASTER, master down timer
Mar 20 01:01:46 VRRP-OSPF-B vrrpbinfo: combo1.vlan0174.vrrp3 now MASTER, master down timer
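In case it helps anyone looking at a similar log, here is a rough Python sketch for counting the VRRP state changes per second. The regular expression is only fitted to the line format shown above, and the names `VRRP_RE`, `summarize` and `SAMPLE` are made up for this example; nothing here comes from RouterOS itself.

```python
import re
from collections import Counter

# Fitted to the "vrrpbinfo" lines in the log excerpt above (an assumption:
# other RouterOS versions or syslog relays may format these differently).
VRRP_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+[\d:]{8})\s+\S+\s+vrrpbinfo:\s+"
    r"(?P<iface>\S+)\s+now\s+(?P<state>MASTER|BACKUP),\s+(?P<reason>.+)$"
)


def summarize(lines):
    """Count VRRP transitions per (timestamp, new state)."""
    counts = Counter()
    for line in lines:
        match = VRRP_RE.match(line.strip())
        if match:  # switch MACFLAP lines and anything else are skipped
            counts[(match.group("ts"), match.group("state"))] += 1
    return counts


# Three lines copied from the log above: two VRRP transitions, one MACFLAP.
SAMPLE = [
    "Mar 20 01:01:36 VRRP-OSPF-B vrrpbinfo: combo1.vlan0173.vrrp2 now MASTER, master down timer",
    "Mar 20 01:01:38 VRRP-OSPF-B vrrpbinfo: combo1.vlan0173.vrrp2 now BACKUP, got higher priority 100 from 10.20.10.209",
    "Mar 20 01:02:45 10.3.19.2 25579: Mar 20 01:02:44.696: %SW_MATM-4-MACFLAP_NOTIF: Host 0000.5e00.0101 in vlan 341 is flapping between port Gi0/2 and port Gi0/1",
]

for (ts, state), n in sorted(summarize(SAMPLE).items()):
    print(ts, state, n)
```

Fed the full log, this gives a quick view of how many interfaces flip to MASTER and back to BACKUP within the same couple of seconds.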
Rebooting doesn't help. Disabling the VRRP interfaces or the VLANs and then rebooting doesn't help either: the router comes back with the interfaces up, as if our change was never applied or saved, and the flapping continues. The secondary VRRP router logs that it is becoming master, which means it must not have seen heartbeats from the primary VRRP router for a while, even though the primary IS up and the network between the VRRP routers IS working. We are getting packet loss to everything served by these VRRP IPs, and I suspect it is because we have multiple OSPF routes to the subnets, but only one router can own the VRRP MAC address from the switch's point of view. Even though CPU load appeared to be about 15%, we could not get a prompt to appear in the Winbox terminal. The router seemed somewhat responsive to reads of status/statistics, but unable to act on any control commands we sent.
Eventually, we found the only way to get this situation under control was to go to the downstream switches, shut down the ports to the secondary VRRP router, wait a few minutes, reboot, and then enable the VRRP interfaces again. Additionally, we found that the logs (including the log view in Winbox, not just syslog) kept showing transitions to "backup" state for up to minutes (!) after the ports that would deliver the heartbeat packets causing those transitions had been shut down! Maybe there is some queue of heartbeat packets somewhere that gets very backed up?
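On the MAC address point: VRRP uses a well-known virtual MAC of the form 00:00:5E:00:01:{VRID} for IPv4 and 00:00:5E:00:02:{VRID} for IPv6, so the `Host 0000.5e00.01xx` entries in the switch's MACFLAP messages can be mapped back to a VRID. When the same virtual MAC shows up on both Gi0/1 and Gi0/2, that is consistent with both routers sending as master for that group. A minimal sketch of the decoding follows; the function name `decode_vrrp_mac` is just for illustration.

```python
def decode_vrrp_mac(mac: str):
    """Return (family, vrid) for a VRRP virtual MAC, or None if it is not one.

    Accepts Cisco dotted (0000.5e00.0101), colon, or dash notation.
    """
    digits = mac.replace(".", "").replace(":", "").replace("-", "").lower()
    if len(digits) != 12 or not digits.startswith("00005e00"):
        return None
    # Fifth octet selects the address family, sixth octet is the VRID.
    family = {"01": "ipv4", "02": "ipv6"}.get(digits[8:10])
    if family is None:
        return None
    return family, int(digits[10:12], 16)


# MACs taken from the MACFLAP lines in the log above.
for mac in ("0000.5e00.0100", "0000.5e00.0101", "0000.5e00.0104"):
    print(mac, decode_vrrp_mac(mac))
```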
We found that rebooting the backup VRRP router induces this VRRP meltdown. We also found that flapping a switch port to the VRRP backup router for 20 seconds or so also induces the same "meltdown", causing all VRRP to begin flapping (and never quit as far as we can tell) on all interfaces (not just the port that flapped).
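For context on the 20 second figure, the VRRPv3 spec (RFC 5798) defines the master-down timer as three advertisement intervals plus a priority-based skew. The sketch below assumes a 1 second advertisement interval, since our config does not set one explicitly, and assumes RouterOS follows the RFC formula; I have not verified either against the RouterOS implementation.

```python
def master_down_interval(priority: int, adver_interval_s: float = 1.0) -> float:
    """Seconds a backup waits without advertisements before becoming master.

    Per RFC 5798:
      Skew_Time            = (256 - priority) * Master_Adver_Interval / 256
      Master_Down_Interval = 3 * Master_Adver_Interval + Skew_Time
    """
    skew = (256 - priority) * adver_interval_s / 256
    return 3 * adver_interval_s + skew


# Our backup router uses priority=50; the 1 s interval is an assumption.
print(master_down_interval(50))
```

Under those assumptions the backup should take over after roughly 3.8 seconds of silence, so becoming master during a 20 second port flap is expected behaviour. What is not expected is that the flapping never settles once the port is back.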
Aggregate traffic between both routers is ~330-400 Mbps. CPU usage is single digits normally. We have 218 VRRP interfaces right now. Switch ports to these routers show no errors or excessive load.
We do OSPF upstream of the routers to the rest of the network, and have two different switches upstream to interconnect these routers with the rest of the network so there would not be a single point of failure. (Switches are a CRS317-1G-16S+ with lower OSPF cost, and a backup Cisco Catalyst 2960G with normal OSPF interface cost.)
On the downstream (towards customer) interfaces, we have two main switches these connect to for the VLANs that they do VRRP on, a Cisco Catalyst 2960G and a CRS326-24G-2S+. Most of the VLANs are running 1-3 VRRP interfaces (one for each Virtual IP), but some of the VLANs go to routers at remote sites, and thus run OSPF instead of VRRP (and those work fine -- just trying to give the full context that these problems are happening in).
CPU load seemed to be only about 15%, but this is a 9 core CPU, so maybe something single threaded is CPU bound? Maybe interrupts are being missed and packets dropped in some part of the router architecture? Or maybe some single threaded VRRP process is just overwhelmed? Or overwhelmed as it competes for CPU with OSPF as all these routes flap around?
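To put numbers on the single-thread suspicion, here is the back-of-the-envelope arithmetic as a small Python sketch. It assumes the 15% figure is an average across all 9 cores, which is my reading of what the router reports, not something I have confirmed.

```python
def single_core_share(cores: int) -> float:
    """Aggregate CPU percentage that equals exactly one fully busy core."""
    return 100.0 / cores


def busy_core_equivalents(aggregate_pct: float, cores: int) -> float:
    """How many fully busy cores an aggregate percentage corresponds to."""
    return aggregate_pct / 100.0 * cores


CORES = 9          # core count from the post
AGGREGATE = 15.0   # observed load, assumed to be averaged over all cores

print(round(single_core_share(CORES), 2))              # one pegged core, in %
print(round(busy_core_equivalents(AGGREGATE, CORES), 2))  # cores' worth of work
```

One core pinned at 100% shows up as only about 11% aggregate on 9 cores, so a 15% reading leaves room for one saturated core plus light load elsewhere. Per-core figures would be needed to confirm or rule that out.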
A standard config for our primary router:
/interface vlan add interface=combo1 name=combo1.vlan0404 vlan-id=404
/ip address add address=10.20.25.65/28 interface=combo1.vlan0404
/interface vrrp add interface=combo1.vlan0404 name=combo1.vlan0404.vrrp0 priority=100 vrid=0 v3-protocol=ipv4
/ip address add address=72.35.197.1/24 interface=combo1.vlan0404.vrrp0
/interface vrrp add interface=combo1.vlan0404 name=combo1.vlan0404.vrrp1 priority=100 vrid=1 v3-protocol=ipv4
/ip address add address=10.3.3.1/24 interface=combo1.vlan0404.vrrp1
/interface vrrp add interface=combo1.vlan0404 name=combo1.vlan0404.vrrp2 priority=100 vrid=2 v3-protocol=ipv6
/ipv6 address add address=2607:F248:330:6::1/64 interface=combo1.vlan0404.vrrp2 advertise=yes
And the matching config for our backup router:
/interface vlan add interface=combo1 name=combo1.vlan0404 vlan-id=404
/ip address add address=10.20.25.66/28 interface=combo1.vlan0404
/interface vrrp add interface=combo1.vlan0404 name=combo1.vlan0404.vrrp0 priority=50 vrid=0 v3-protocol=ipv4
/ip address add address=72.35.197.1/24 interface=combo1.vlan0404.vrrp0
/interface vrrp add interface=combo1.vlan0404 name=combo1.vlan0404.vrrp1 priority=50 vrid=1 v3-protocol=ipv4
/ip address add address=10.3.3.1/24 interface=combo1.vlan0404.vrrp1
/interface vrrp add interface=combo1.vlan0404 name=combo1.vlan0404.vrrp2 priority=50 vrid=2 v3-protocol=ipv6
/ipv6 address add address=2607:F248:330:6::1/64 interface=combo1.vlan0404.vrrp2 advertise=yes
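Since the two configs differ only in the real interface address and the VRRP priority, a small generator can help keep the pair in sync across 218 interfaces. This is a hypothetical helper, not how we actually produce our configs; `vrrp_config` and its parameters are invented for this sketch, and it only emits the same RouterOS commands shown above, with VRIDs numbered from 0 to mirror our interface naming.

```python
def vrrp_config(parent, vlan_id, real_addr, priority, groups):
    """Build RouterOS commands for one VLAN and its VRRP groups.

    groups is a list of (protocol, virtual_address) tuples; the list index
    is used as the VRID, matching the vrrp0/vrrp1/... naming in the post.
    """
    vlan_if = f"{parent}.vlan{vlan_id:04d}"
    lines = [
        f"/interface vlan add interface={parent} name={vlan_if} vlan-id={vlan_id}",
        f"/ip address add address={real_addr} interface={vlan_if}",
    ]
    for vrid, (proto, vip) in enumerate(groups):
        vrrp_if = f"{vlan_if}.vrrp{vrid}"
        lines.append(
            f"/interface vrrp add interface={vlan_if} name={vrrp_if} "
            f"priority={priority} vrid={vrid} v3-protocol={proto}"
        )
        if proto == "ipv6":
            lines.append(
                f"/ipv6 address add address={vip} interface={vrrp_if} advertise=yes"
            )
        else:
            lines.append(f"/ip address add address={vip} interface={vrrp_if}")
    return lines


# Values copied from the VLAN 404 example above.
GROUPS = [
    ("ipv4", "72.35.197.1/24"),
    ("ipv4", "10.3.3.1/24"),
    ("ipv6", "2607:F248:330:6::1/64"),
]
primary = vrrp_config("combo1", 404, "10.20.25.65/28", 100, GROUPS)
backup = vrrp_config("combo1", 404, "10.20.25.66/28", 50, GROUPS)

print("\n".join(primary))
print()
print("\n".join(backup))
```

Generating both sides from one definition at least rules out a typo in a virtual address or VRID as the cause of a mismatch between the routers.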