
 
Azendale
newbie
Topic Author
Posts: 43
Joined: Thu Feb 06, 2014 8:49 pm

VRRP instability, flapping

Wed Mar 20, 2019 10:45 pm

We recently consolidated a couple of MikroTik routers into a pair of CCR1009-7G-1C-1S+ routers running VRRP.
We thought we were doing a good thing for network reliability by implementing VRRP. Instead, it made our network LESS reliable than if we had put it all on one router. (Grrr...)

Early one morning (so early that no one was up to make changes), everything served by VRRP behind these routers started going down intermittently for a few minutes at a time. Our monitoring system would show a service down, and then back up. We had hundreds of checks in our monitoring system flapping. On the primary VRRP router, everything looked fine. On the backup VRRP router, the logs showed something else:
Mar 20 01:01:36 VRRP-OSPF-B vrrpbinfo: combo1.vlan0173.vrrp2 now MASTER, master down timer
Mar 20 01:01:36 VRRP-OSPF-B vrrpbinfo: combo1.vlan0336.vrrp2 now MASTER, master down timer
Mar 20 01:01:36 VRRP-OSPF-B vrrpbinfo: combo1.vlan0173.vrrp1 now MASTER, master down timer
Mar 20 01:01:36 VRRP-OSPF-B vrrpbinfo: combo1.vlan0174.vrrp4 now MASTER, master down timer
Mar 20 01:01:36 VRRP-OSPF-B vrrpbinfo: combo1.vlan0174.vrrp0 now MASTER, master down timer
Mar 20 01:01:36 VRRP-OSPF-B vrrpbinfo: combo1.vlan0174.vrrp2 now MASTER, master down timer
Mar 20 01:01:36 VRRP-OSPF-B vrrpbinfo: combo1.vlan0339.vrrp1 now MASTER, master down timer
Mar 20 01:01:37 VRRP-OSPF-B vrrpbinfo: combo1.vlan0337.vrrp3 now MASTER, master down timer
Mar 20 01:01:37 VRRP-OSPF-B vrrpbinfo: combo1.vlan0411.vrrp1 now MASTER, master down timer
Mar 20 01:01:37 VRRP-OSPF-B vrrpbinfo: combo1.vlan0337.vrrp0 now MASTER, master down timer
Mar 20 01:01:37 VRRP-OSPF-B vrrpbinfo: combo1.vlan0337.vrrp1 now MASTER, master down timer
Mar 20 01:01:37 VRRP-OSPF-B vrrpbinfo: ether5.vlan0009.vrrp0 now MASTER, master down timer
Mar 20 01:01:37 VRRP-OSPF-B vrrpbinfo: ether1.vlan0315.vrrp0 now MASTER, master down timer
Mar 20 01:01:38 VRRP-OSPF-B vrrpbinfo: combo1.vlan0175.vrrp0 now BACKUP, got higher priority 100 from 10.20.10.241
Mar 20 01:01:38 VRRP-OSPF-B vrrpbinfo: combo1.vlan0173.vrrp2 now BACKUP, got higher priority 100 from 10.20.10.209
Mar 20 01:01:38 VRRP-OSPF-B vrrpbinfo: combo1.vlan0174.vrrp1 now BACKUP, got higher priority 100 from 10.20.10.225
Mar 20 01:01:38 VRRP-OSPF-B vrrpbinfo: combo1.vlan0339.vrrp1 now BACKUP, got higher priority 100 from 10.20.21.49
Mar 20 01:01:38 VRRP-OSPF-B vrrpbinfo: combo1.vlan0337.vrrp3 now BACKUP, got higher priority 100 from 10.20.21.17
Mar 20 01:01:38 VRRP-OSPF-B vrrpbinfo: combo1.vlan0336.vrrp0 now BACKUP, got higher priority 100 from 10.20.21.1
Mar 20 01:01:38 VRRP-OSPF-B vrrpbinfo: combo1.vlan0335.vrrp1 now BACKUP, got higher priority 100 from 10.20.20.241
Mar 20 01:01:38 VRRP-OSPF-B vrrpbinfo: combo1.vlan0411.vrrp1 now BACKUP, got higher priority 100 from 10.20.25.177
Mar 20 01:01:38 VRRP-OSPF-B vrrpbinfo: combo1.vlan0401.vrrp0 now BACKUP, got higher priority 100 from 10.20.25.17
Mar 20 01:01:38 VRRP-OSPF-B vrrpbinfo: combo1.vlan0219.vrrp1 now BACKUP, got higher priority 100 from 10.20.13.177
Mar 20 01:01:38 VRRP-OSPF-B vrrpbinfo: combo1.vlan0030.vrrp8 now BACKUP, got higher priority 100 from 10.20.1.225
Mar 20 01:01:38 VRRP-OSPF-B vrrpbinfo: ether5.vlan0007.vrrp0 now MASTER, master down timer
Mar 20 01:01:39 VRRP-OSPF-B vrrpbinfo: ether3.vlan0081.vrrp0 now MASTER, master down timer
Mar 20 01:02:45 10.3.19.2 25579: Mar 20 01:02:44.696: %SW_MATM-4-MACFLAP_NOTIF: Host 0000.5e00.0101 in vlan 341 is flapping between port Gi0/2 and port Gi0/1
Mar 20 01:02:45 10.3.19.2 25580: Mar 20 01:02:44.998: %SW_MATM-4-MACFLAP_NOTIF: Host 0000.5e00.0102 in vlan 303 is flapping between port Gi0/2 and port Gi0/1
Mar 20 01:02:45 10.3.19.2 25581: Mar 20 01:02:45.535: %SW_MATM-4-MACFLAP_NOTIF: Host 0000.5e00.0101 in vlan 334 is flapping between port Gi0/2 and port Gi0/1
Mar 20 01:02:45 10.3.19.2 25582: Mar 20 01:02:45.669: %SW_MATM-4-MACFLAP_NOTIF: Host 0000.5e00.0100 in vlan 303 is flapping between port Gi0/2 and port Gi0/1
Mar 20 01:02:46 10.3.19.2 25583: Mar 20 01:02:46.038: %SW_MATM-4-MACFLAP_NOTIF: Host 0000.5e00.0101 in vlan 436 is flapping between port Gi0/2 and port Gi0/1
Mar 20 01:02:46 10.3.19.2 25584: Mar 20 01:02:46.072: %SW_MATM-4-MACFLAP_NOTIF: Host 0000.5e00.0102 in vlan 325 is flapping between port Gi0/2 and port Gi0/1
Mar 20 01:02:46 10.3.19.2 25585: Mar 20 01:02:46.072: %SW_MATM-4-MACFLAP_NOTIF: Host 0000.5e00.0100 in vlan 382 is flapping between port Gi0/2 and port Gi0/1
Mar 20 01:02:46 10.3.19.2 25586: Mar 20 01:02:46.340: %SW_MATM-4-MACFLAP_NOTIF: Host 0000.5e00.0100 in vlan 436 is flapping between port Gi0/2 and port Gi0/1
Mar 20 01:02:46 10.3.19.2 25587: Mar 20 01:02:46.575: %SW_MATM-4-MACFLAP_NOTIF: Host 0000.5e00.0102 in vlan 342 is flapping between port Gi0/2 and port Gi0/1
Mar 20 01:02:47 10.3.19.2 25588: Mar 20 01:02:46.877: %SW_MATM-4-MACFLAP_NOTIF: Host 0000.5e00.0103 in vlan 304 is flapping between port Gi0/2 and port Gi0/1
Mar 20 01:02:47 10.3.19.2 25589: Mar 20 01:02:47.011: %SW_MATM-4-MACFLAP_NOTIF: Host 0000.5e00.0104 in vlan 304 is flapping between port Gi0/2 and port Gi0/1
Mar 20 01:02:47 10.3.19.2 25590: Mar 20 01:02:47.078: %SW_MATM-4-MACFLAP_NOTIF: Host 0000.5e00.0100 in vlan 315 is flapping between port Gi0/2 and port Gi0/1
Mar 20 01:02:47 10.3.19.2 25591: Mar 20 01:02:47.078: %SW_MATM-4-MACFLAP_NOTIF: Host 0000.5e00.0100 in vlan 415 is flapping between port Gi0/2 and port Gi0/1
Mar 20 01:02:47 10.3.19.2 25592: Mar 20 01:02:47.649: %SW_MATM-4-MACFLAP_NOTIF: Host 0000.5e00.0101 in vlan 304 is flapping between port Gi0/2 and port Gi0/1
Mar 20 01:02:48 10.3.19.2 25593: Mar 20 01:02:47.716: %SW_MATM-4-MACFLAP_NOTIF: Host 0000.5e00.0101 in vlan 342 is flapping between port Gi0/2 and port Gi0/1
Mar 20 01:02:50 10.3.19.2 25594: Mar 20 01:02:49.066: %SW_MATM-4-MACFLAP_NOTIF: Host 0000.5e00.0103 in vlan 303 is flapping between port Gi0/2 and port Gi0/1
Mar 20 01:01:45 VRRP-OSPF-B vrrpbinfo: combo1.vlan0175.vrrp1 now MASTER, master down timer
Mar 20 01:01:45 VRRP-OSPF-B vrrpbinfo: combo1.vlan0173.vrrp3 now MASTER, master down timer
Mar 20 01:02:52 10.3.19.2 25595: Mar 20 01:02:51.684: %SW_MATM-4-MACFLAP_NOTIF: Host 0000.5e00.0101 in vlan 325 is flapping between port Gi0/2 and port Gi0/1
Mar 20 01:02:52 10.3.19.2 25596: Mar 20 01:02:51.751: %SW_MATM-4-MACFLAP_NOTIF: Host 0000.5e00.0100 in vlan 325 is flapping between port Gi0/2 and port Gi0/1
Mar 20 01:02:52 10.3.19.2 25597: Mar 20 01:02:51.784: %SW_MATM-4-MACFLAP_NOTIF: Host 0000.5e00.0103 in vlan 321 is flapping between port Gi0/2 and port Gi0/1
Mar 20 01:02:52 10.3.19.2 25598: Mar 20 01:02:51.885: %SW_MATM-4-MACFLAP_NOTIF: Host 0000.5e00.0100 in vlan 321 is flapping between port Gi0/2 and port Gi0/1
Mar 20 01:02:52 10.3.19.2 25599: Mar 20 01:02:52.053: %SW_MATM-4-MACFLAP_NOTIF: Host 0000.5e00.0102 in vlan 304 is flapping between port Gi0/2 and port Gi0/1
Mar 20 01:01:46 VRRP-OSPF-B vrrpbinfo: combo1.vlan0339.vrrp0 now MASTER, master down timer
Mar 20 01:01:46 VRRP-OSPF-B vrrpbinfo: combo1.vlan0174.vrrp3 now MASTER, master down timer
These logs were collected with a syslog server. The MAC address flap notifications come from the Cisco 2960G switch downstream of these routers, which sees the VRRP MAC address flip between switch ports. When we logged into the secondary VRRP router, the profile tool showed CPU at ~15 percent per core. So it doesn't seem like the CPU is hammered, BUT... trying to disable VRRP interfaces does nothing (the entry turns grey if you select a different interface and then select the original one again, but the flags keep flapping between XRM and B). Eventually, Winbox reports that changing the interface timed out. The same happens for the VLANs these VRRP interfaces sit on top of, and for the IP addresses they use. Every VRRP interface on this router is flapping, regardless of which underlying physical Ethernet interface it is on.
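
As a rough aid for reading switch logs like the ones above, here is a minimal Python sketch that pulls the MAC and VLAN out of each MACFLAP_NOTIF line and decodes the VRRP virtual MAC into an address family and VRID. It relies on the standard VRRP virtual MAC layout (00-00-5E-00-01-{VRID} for IPv4, 00-00-5E-00-02-{VRID} for IPv6). The function names and the regex are hypothetical, written only for this illustration.

```python
import re
from collections import Counter

# Illustrative regex for the Cisco MACFLAP_NOTIF lines quoted in this post.
MACFLAP_RE = re.compile(
    r"MACFLAP_NOTIF: Host (?P<mac>[0-9a-f]{4}\.[0-9a-f]{4}\.[0-9a-f]{4}) "
    r"in vlan (?P<vlan>\d+) is flapping"
)

def vrrp_vrid(mac):
    """Return (family, vrid) if mac is a VRRP virtual MAC, else None."""
    digits = mac.replace(".", "").lower()
    if digits.startswith("00005e0001"):   # IPv4 virtual router MAC prefix
        return ("ipv4", int(digits[10:], 16))
    if digits.startswith("00005e0002"):   # IPv6 virtual router MAC prefix
        return ("ipv6", int(digits[10:], 16))
    return None

def summarize_flaps(lines):
    """Count flap notifications per (vlan, family, vrid)."""
    counts = Counter()
    for line in lines:
        match = MACFLAP_RE.search(line)
        if not match:
            continue  # not a MAC flap line (e.g. a router VRRP log line)
        decoded = vrrp_vrid(match.group("mac"))
        if decoded is None:
            continue  # flapping MAC is not a VRRP virtual MAC
        family, vrid = decoded
        counts[(int(match.group("vlan")), family, vrid)] += 1
    return counts

# Two lines copied from the log above.
sample = [
    "Mar 20 01:02:45 10.3.19.2 25579: Mar 20 01:02:44.696: %SW_MATM-4-MACFLAP_NOTIF: Host 0000.5e00.0101 in vlan 341 is flapping between port Gi0/2 and port Gi0/1",
    "Mar 20 01:02:45 10.3.19.2 25580: Mar 20 01:02:44.998: %SW_MATM-4-MACFLAP_NOTIF: Host 0000.5e00.0102 in vlan 303 is flapping between port Gi0/2 and port Gi0/1",
]

for key, count in sorted(summarize_flaps(sample).items()):
    print(key, count)
```

Grouping by VLAN and VRID makes it easier to see whether the flapping is confined to a few virtual routers or spread across all of them.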

Rebooting doesn't help. Disabling the VRRP interfaces, or the VLANs, and then rebooting doesn't help either: the router comes back with the interfaces up, as if our changes did nothing and were never saved, and the flapping continues. The secondary VRRP router is logging that it is becoming master, which means it must not have seen heartbeats from the primary VRRP router for a while, even though the primary IS up and the network between the VRRP routers IS working. We are getting packet loss to the things served by these VRRP IPs, and I suspect it is because we have multiple OSPF routes to the subnets, but only one router can hold the VRRP MAC address from the switch's point of view. Even though CPU load appeared to be about 15%, we could not get a prompt to appear in the Winbox terminal. The router seemed somewhat responsive when reading status/statistics, but unable to carry out any control commands we sent.
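
For reference, the "master down timer" in the log lines is the interval a backup waits without hearing an advertisement before it takes over. A minimal sketch of the VRRPv3 calculation from RFC 5798 follows. It assumes a 1 second advertisement interval (the posted config does not set one, so that value is an assumption) and integer centisecond arithmetic; the function name is hypothetical.

```python
def master_down_interval_cs(priority, master_adver_interval_cs=100):
    """VRRPv3 Master_Down_Interval in centiseconds (RFC 5798 formulas).

    Skew_Time            = ((256 - priority) * Master_Adver_Interval) / 256
    Master_Down_Interval = 3 * Master_Adver_Interval + Skew_Time

    master_adver_interval_cs defaults to 100 cs (1 s), which is an
    assumption here; integer division is used for the skew.
    """
    skew_cs = ((256 - priority) * master_adver_interval_cs) // 256
    return 3 * master_adver_interval_cs + skew_cs

# Backup router in this thread uses priority=50.
print(master_down_interval_cs(50) / 100, "seconds")
```

Under those assumptions a priority 50 backup declares the master down after roughly 3.8 seconds of silence, so it only needs to miss a few consecutive advertisements to log "now MASTER, master down timer".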

Eventually, we found that the only way to get this situation under control was to go to the downstream switches, shut down the ports to the secondary VRRP router, wait a few minutes, reboot, and then enable the VRRP interfaces again. Additionally, we found that the logs (in Winbox as well as over syslog) kept showing transitions to "backup" state for up to minutes (!) after the ports that would deliver the heartbeat packets causing those transitions had been shut down! Maybe there is some queue of heartbeat packets somewhere that gets very backed up?

We found that rebooting the backup VRRP router induces this VRRP meltdown. We also found that flapping a switch port to the VRRP backup router for 20 seconds or so also induces the same "meltdown", causing all VRRP to begin flapping (and never quit as far as we can tell) on all interfaces (not just the port that flapped).

Aggregate traffic between both routers is ~330-400 Mbps. CPU usage is single digits normally. We have 218 VRRP interfaces right now. Switch ports to these routers show no errors or excessive load.

We do OSPF upstream of the routers to the rest of the network, and have two different switches upstream to interconnect these routers with the rest of the network so there would not be a single point of failure. (Switches are a CRS317-1G-16S+ with lower OSPF cost, and a backup Cisco Catalyst 2960G with normal OSPF interface cost.)

On the downstream (towards customer) interfaces, we have two main switches these connect to for the VLANs that they do VRRP on, a Cisco Catalyst 2960G and a CRS326-24G-2S+. Most of the VLANs are running 1-3 VRRP interfaces (one for each Virtual IP), but some of the VLANs go to routers at remote sites, and thus run OSPF instead of VRRP (and those work fine -- just trying to give the full context that these problems are happening in).

CPU load seemed to be about 15%, but this is a 9-core CPU, so maybe something single-threaded is CPU bound? Maybe interrupts are being missed and packets dropped in some part of the router architecture? Or maybe some single-threaded VRRP process is just overwhelmed? Or overwhelmed as it competes for CPU with OSPF while all these routes flap around?

A standard config for our primary router:
/interface vlan add interface=combo1 name=combo1.vlan0404 vlan-id=404
/ip address add address=10.20.25.65/28 interface=combo1.vlan0404
/interface vrrp add interface=combo1.vlan0404 name=combo1.vlan0404.vrrp0 priority=100 vrid=0 v3-protocol=ipv4
/ip address add address=72.35.197.1/24 interface=combo1.vlan0404.vrrp0
/interface vrrp add interface=combo1.vlan0404 name=combo1.vlan0404.vrrp1 priority=100 vrid=1 v3-protocol=ipv4
/ip address add address=10.3.3.1/24 interface=combo1.vlan0404.vrrp1
/interface vrrp add interface=combo1.vlan0404 name=combo1.vlan0404.vrrp2 priority=100 vrid=2 v3-protocol=ipv6
/ipv6 address add address=2607:F248:330:6::1/64 interface=combo1.vlan0404.vrrp2 advertise=yes
A standard config for our backup router:
/interface vlan add interface=combo1 name=combo1.vlan0404 vlan-id=404
/ip address add address=10.20.25.66/28 interface=combo1.vlan0404
/interface vrrp add interface=combo1.vlan0404 name=combo1.vlan0404.vrrp0 priority=50 vrid=0 v3-protocol=ipv4
/ip address add address=72.35.197.1/24 interface=combo1.vlan0404.vrrp0
/interface vrrp add interface=combo1.vlan0404 name=combo1.vlan0404.vrrp1 priority=50 vrid=1 v3-protocol=ipv4
/ip address add address=10.3.3.1/24 interface=combo1.vlan0404.vrrp1
/interface vrrp add interface=combo1.vlan0404 name=combo1.vlan0404.vrrp2 priority=50 vrid=2 v3-protocol=ipv6
/ipv6 address add address=2607:F248:330:6::1/64 interface=combo1.vlan0404.vrrp2 advertise=yes
How do we fix this? Is something overloaded? (If so, how would we see usage for that resource so we know next time?)
 
Kindis
Member Candidate
Posts: 251
Joined: Tue Nov 01, 2011 6:54 pm

Re: VRRP instability, flapping

Thu Mar 21, 2019 5:39 pm

Removed
Read this on a phone and misread your config, so my advice is useless 😁
 
Azendale
newbie
Topic Author
Posts: 43
Joined: Thu Feb 06, 2014 8:49 pm

Re: VRRP instability, flapping

Tue Mar 26, 2019 9:56 pm

I'm still interested in finding a solution for this. (Thanks Kindis for at least trying!)
 
tdw
Member Candidate
Posts: 196
Joined: Sat May 05, 2018 11:55 am

Re: VRRP instability, flapping

Fri Mar 29, 2019 4:38 am

MikroTik routers do not support VRRP owner (the virtual IP cannot be the same as the real IP), and, unlike other manufacturers' implementations, the mask on an IPv4 VRRP interface address should be /32.

The /32 mask caught me out when I first set up VRRP. With /24 (matching the real IP mask) it worked most of the time, but there would be random short periods of lost connectivity.
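
To check an exported config against that /32 advice, something like the following Python sketch could flag IPv4 addresses on VRRP interfaces that carry a wider mask. It only understands the one-line `/interface vrrp add` and `/ip address add` forms shown in the first post; the helper names are hypothetical, and it ignores `/ipv6 address` lines because the advice above is about IPv4.

```python
def params(line):
    """Split a one-line RouterOS command into a dict of key=value pairs."""
    return dict(tok.split("=", 1) for tok in line.split() if "=" in tok)

def non_host_vrrp_addresses(config_lines):
    """Return (interface, address) pairs on VRRP interfaces not using /32."""
    vrrp_names = set()
    for line in config_lines:
        if line.startswith("/interface vrrp add"):
            name = params(line).get("name")
            if name:
                vrrp_names.add(name)

    findings = []
    for line in config_lines:
        if not line.startswith("/ip address add"):
            continue
        p = params(line)
        address = p.get("address", "")
        on_vrrp = p.get("interface") in vrrp_names
        # An address with an explicit mask other than /32 is reported.
        if on_vrrp and "/" in address and not address.endswith("/32"):
            findings.append((p["interface"], address))
    return findings

# IPv4 lines from the primary router config in the first post.
config = [
    "/interface vlan add interface=combo1 name=combo1.vlan0404 vlan-id=404",
    "/ip address add address=10.20.25.65/28 interface=combo1.vlan0404",
    "/interface vrrp add interface=combo1.vlan0404 name=combo1.vlan0404.vrrp0 priority=100 vrid=0 v3-protocol=ipv4",
    "/ip address add address=72.35.197.1/24 interface=combo1.vlan0404.vrrp0",
    "/interface vrrp add interface=combo1.vlan0404 name=combo1.vlan0404.vrrp1 priority=100 vrid=1 v3-protocol=ipv4",
    "/ip address add address=10.3.3.1/24 interface=combo1.vlan0404.vrrp1",
]

for interface, address in non_host_vrrp_addresses(config):
    print(f"{interface}: {address} (expected a /32 on a VRRP interface)")
```

The real address on the VLAN interface (10.20.25.65/28 here) is left alone; only the virtual addresses bound to the VRRP interfaces are reported.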
 
tangram
Member Candidate
Posts: 133
Joined: Wed Nov 16, 2016 9:55 pm

Re: VRRP instability, flapping

Fri Mar 29, 2019 10:13 am

tdw is right: you need three addresses, and the one on the VRRP interface takes a /32. Here's one of mine:

 #   ADDRESS          NETWORK      INTERFACE
 4   10.250.0.3/29    10.250.0.0   ether5-backbone
 5 I 10.250.0.2/32    10.250.0.2   backbone-vrrp-backup

 #   NAME                   INTERFACE         MAC-ADDRESS   VRID  PRIORITY  INTERVAL  VERSION  V3-PROTOCOL
 0   backbone-vrrp-backup   ether5-backbone   XXXXXXXXXX    3     100       1s        3        ipv4
