Destination Host Unreachable on local network unless packet sniffer is running

I’ve got a bit of an odd one here and I’m hoping someone has some idea of what may be happening.

I have a bridge set up with VLANs. On four ports, I have PVID set to 227. For some reason, the host (node3) behind one of those ports became unreachable to the others on the VLAN (e.g. node1 and node2), and vice versa. node3 is still reachable from outside the VLAN.

Unlike the other three ports, node3’s port also has a couple of tagged VLANs: vlan1 and vlan2. These VLANs do not have an IP address on the node proper; rather, they are used by a virtual machine running on the box with macvtap interfaces. I tried removing the tagged VLANs from the port (and the VLAN interfaces from the host), thinking that something was going on there. This did not fix things, so I restored them.

Now for the weird part… as a troubleshooting measure, I went to sniff ARP on the MikroTik to see if the ARP frames were being received/sent correctly. As soon as I turned on the packet sniffer, communication was restored between node3 and the other nodes on the VLAN. Unfortunately, as soon as I turned the sniffer off, communication ceased.
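
For reference, a capture like the one described could be set up roughly as follows. This is only a sketch: the filter values (ether4, ARP) are assumptions based on the description above, not the exact commands that were run.

```
# Assumed filter: ARP frames on the port facing node3 (ether4)
/tool sniffer set filter-interface=ether4 filter-mac-protocol=arp
/tool sniffer start
# ... test connectivity between node3 and node1/node2 here ...
/tool sniffer stop
# Show what was captured
/tool sniffer packet print
```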

Anybody seen anything like this or have any ideas what might be going on here?
RB4011
ROS 7.11

Relevant (I think anyway) config portions (the failing node is on ether4):

/interface bridge
add name=br-e3b0 protocol-mode=mstp vlan-filtering=yes

/interface vlan
add interface=br-e3b0 name=vlan1 vlan-id=1
add interface=br-e3b0 name=vlan2 vlan-id=2
add interface=br-e3b0 name=vlan227 vlan-id=227

/interface bridge port
add bridge=br-e3b0 interface=ether4 pvid=227
add bridge=br-e3b0 interface=ether7 pvid=227
add bridge=br-e3b0 interface=ether8 pvid=227
add bridge=br-e3b0 interface=ether10 pvid=227

/interface bridge vlan
add bridge=br-e3b0 tagged=br-e3b0 vlan-ids=227
add bridge=br-e3b0 tagged=br-e3b0,ether4 vlan-ids=1
add bridge=br-e3b0 tagged=br-e3b0,ether4 vlan-ids=2

Do you see ARP records when the sniffer is not running?
Maybe there is a problem with ARP learning or ageing on the bridge and/or VLAN interfaces.
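
A quick way to check both tables with the sniffer stopped might look like this. It is a sketch that assumes the interface and bridge names from the posted config (vlan227, br-e3b0), and that the bridge host table exposes the VLAN ID as `vid`.

```
# ARP entries the router has learned on the VLAN 227 interface
/ip arp print where interface=vlan227
# MAC addresses the bridge has learned in VLAN 227, and on which port
/interface bridge host print where bridge=br-e3b0 vid=227
```

If node3's MAC address shows up on ether4 in the host table but traffic still fails, learning itself is probably not the problem.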

I would not use vlan1 for any data traffic; it's already in use by the bridge behind the scenes.

Just to be clear, you intended ether4 to be a hybrid port?
(carrying vlan1 and vlan2 tagged and vlan227 untagged)

Yes, ARP records exist on the router's ARP table for all expected addresses.

Noted, but this problem doesn't appear to be related to this and I've had zero issues using vlan1 in the past.

Correct, ether4 should be carrying VLAN227 traffic untagged and VLAN 1-2 tagged.

Just to make sure I moved all of my vlan1 traffic onto vlan1001, no apparent change with my current issue.

IPv4 FastTrack is active if the following conditions are met:

no mesh, metarouter interface configuration;
sniffer, torch, and traffic generator are not running;
“/tool mac-scan” is not actively used;
“/tool ip-scan” is not actively used;
FastPath and Route cache is enabled under IP/Settings;

In the OP’s case, it’s more likely a bug in HW offload from the bridge to the switch chips … the sniffer disables HW offload for the sniffed port. The same can be achieved by setting hw=no on the “problematic” bridge port …
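
As a sketch of that workaround, assuming ether4 is the affected port as in the posted config:

```
# Disable hardware offloading on the port facing node3
/interface bridge port set [find interface=ether4] hw=no
# The H flag in this output marks ports that are still hardware offloaded
/interface bridge port print
# To revert once a fixed release is installed
/interface bridge port set [find interface=ether4] hw=yes
```

With hw=no the port's traffic is forwarded by the CPU, so expect lower throughput on that port while the workaround is in place.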

RB4011 is a bit special when it comes to L2 HW offload if the bridge spans ports from both switch chips. ROS versions prior to 7.11 had a bug (ports on different switch chips could not communicate if the bridge port was not a tagged member of all relevant VLANs). MT was working on it and supposedly fixed that bug, but something might still be lurking in that hole. So I suggest taking a supout.rif file (while things don’t work the way they should) and opening a ticket with support.
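
To see which ports sit on which switch chip, and to generate the support file, something like the following should work. The file name is only an example, and the port-to-chip mapping should be read from the output rather than assumed.

```
# List the switch chips and the ports attached to each one
/interface ethernet switch print
/interface ethernet switch port print
# Generate the support file while the problem is occurring
/system sup-output name=supout.rif
```

If ether4 turns out to be on a different chip than ether7, ether8 and ether10, that would fit the cross-chip offload theory.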

I already don’t have IPv4 fast path because I have firewall rules…

Interesting. I tried downgrading back to 7.10.2 to see if that improved anything, and it did not. But then I swapped ether4 and ether6 (and their configs)… and node3 came back.

Now I’m seeing a similar symptom on VLAN1001, except now that I can dig in a bit more, it appears to be a split-brain situation where, despite all of the bridge ports being tagged for the VLAN, devices will only communicate with other devices on the same switch chip.

Thankfully the split brain on 1001 isn’t a huge deal right now. I’ll get a case opened and the support file sent over. In the meantime I can keep the VLAN227 hardware together.

Thanks for the insight!

I just wanted to follow up and note that ROS 7.11.2 and the accompanying firmware upgrade resolved this issue. Thanks to all that took the time to respond.

Indeed … 7.11.1 changelog says:

*) bridge - fixed fast-path forwarding with HW offloaded vlan-filtering (introduced in v7.11);
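
For anyone hitting the same symptom, the upgrade path mentioned above (RouterOS package plus RouterBOARD firmware) can be sketched as follows. It assumes the router has internet access and that the configured update channel offers a release containing the fix.

```
# Check for and install the newer RouterOS release (the router reboots)
/system package update check-for-updates
/system package update install
# After the reboot, compare current-firmware with upgrade-firmware
/system routerboard print
# Upgrade the RouterBOARD firmware; it takes effect after another reboot
/system routerboard upgrade
/system reboot
```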