I’ve got a bit of an odd one here and I’m hoping someone has some idea of what may be happening.
I have a bridge set up with VLANs. On four ports, I have PVID set to 227. For some reason, the host (node3) behind one of those ports became unreachable to the others on the VLAN (eg node1 and node2), and vice-versa. node3 is still reachable from outside the VLAN.
Unlike the other three ports, node3’s port also has a couple of tagged VLANs: vlan1 and vlan2. These VLANs do not have an IP address on the node proper; rather, they are used by a virtual machine running on the box with macvtap interfaces. I tried removing the tagged VLANs from the port (and the VLAN interfaces from the host), thinking that something was going on there. This did not fix things, and I restored them.
Now for the weird part… as a troubleshooting measure, I went to sniff ARP on the MikroTik to see if the ARP frames were being received/sent correctly. As soon as I turned on the packet sniffer, communication was restored between node3 and the other nodes on the VLAN. Unfortunately, as soon as I turned the sniffer off, communication ceased.
Anybody seen anything like this or have any ideas what might be going on here?
RB4011
ROS 7.11
Relevant (I think anyway) config portions (the failing node is on ether4):
IPv4 FastTrack is active if the following conditions are met:
no mesh, metarouter interface configuration;
sniffer, torch, and traffic generator are not running;
“/tool mac-scan” is not actively used;
“/tool ip-scan” is not actively used;
FastPath and Route cache is enabled under IP/Settings;
In the OP’s case, it’s more likely a bug in L2 HW offload from the bridge to the switch chips … the sniffer disables HW offload for the sniffed port. The same can be achieved by setting hw=no on the “problematic” bridge port …
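As a rough sketch of that workaround, assuming the affected bridge port is ether4 as in the original post (adjust the interface name for your own setup), hardware offload can be turned off for a single port from the RouterOS terminal:

```
# Disable L2 hardware offload on one bridge port
# (ether4 is taken from the original post; substitute your own port)
/interface bridge port set [find interface=ether4] hw=no

# Verify: ports that are still offloaded carry the H flag in the output
/interface bridge port print
```

Traffic on that port is then handled by the CPU rather than the switch chip, so throughput may drop. Setting hw=yes on the same port reverts the change.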
RB4011 is a bit special when it comes to L2 HW offload if the bridge spans ports from both switch chips. ROS versions prior to 7.11 had a bug (ports on different switch chips could not communicate if the bridge port was not a tagged member of all relevant VLANs). MT was working on it and supposedly fixed that bug, but something might still be lurking in that hole. So I suggest taking a supout.rif file (while things don’t work the way they should) and opening a ticket with support.
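For reference, a minimal sketch of generating that file from the terminal while the fault is present (the file name here is only an example):

```
# Create a support output file while the problem is occurring
/system sup-output name=supout.rif
```

The file should then appear in the router's file list, from where it can be downloaded and attached to the support ticket.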
Interesting. I tried downgrading back to 7.10.2 to see if that improved anything, and it did not. But then I swapped ether4 and ether6 (and their configs)… and node3 came back.
Now I’m seeing a similar symptom on VLAN 1001, except now that I can dig in a bit more, it appears to be a split-brain situation: despite all of the bridge ports being tagged for the VLAN, devices will only communicate with other devices on the same switch chip.
Thankfully the split brain on 1001 isn’t a huge deal right now, so I’ll get a case opened and the support file sent over. In the meantime I can keep the VLAN227 hardware together.
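A hedged sketch of how this kind of split can be inspected from the terminal, assuming bridge VLAN filtering is in use; the `where vid=1001` filter is an assumption about the host table's property name and may need adjusting on your version:

```
# Show which ports are tagged/untagged members of each bridge VLAN
/interface bridge vlan print

# Show MAC addresses learned on VLAN 1001 and the port each was learned on
/interface bridge host print where vid=1001
```

If hosts on the other switch chip never show up in the host table for that VLAN, that would be consistent with frames not crossing between the chips.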
I just wanted to follow up and note that ROS 7.11.2 and the accompanying firmware upgrade resolved this issue. Thanks to all that took the time to respond.