Undefined behavior & lost traffic on devices with switch without bridge VLAN offloading

TL;DR: see questions at the end; attached full config as file

Background
I run many hEX PoE “routers” as managed switches: they are powered by PoE and pass PoE on to devices such as security cameras. It’s an old device, but it’s currently the only one in MT’s lineup that can do that, and it still contains a quite capable switch chip. It doesn’t support the now-preferred wire-speed bridge VLAN filtering and instead requires manual switch chip config. No complaints here, it’s an old device.

Recently, while playing with one of the switches, I managed to make it stop passing traffic, and it took me a while to isolate the issue. I prepared a reproducer.


Test setup
test_setup.png
To configure this simple setup I reset the device to factory defaults without the default config. Next I replicated the setup from my production switch. This setup works and passes traffic as expected: port 2 is an access port for TRUSTED VLAN 10, ports 3 & 4 are access ports for CAMS VLAN 30, and port 5 is a trunk carrying all uplink VLANs. This setup does not follow the documentation!

/interface/ethernet/switch/vlan
    add switch=switch1 vlan-id=10 ports=ether1,ether2,ether5         comment="TRUSTED (10)"
    add switch=switch1 vlan-id=20 ports=ether1,ether5,switch1-cpu    comment="MGMT (20)"
    add switch=switch1 vlan-id=30 ports=ether1,ether3,ether4,ether5  comment="CAMS (30)"

/interface/ethernet/switch/port
    # leave ether1 as is (trunk uplink)
    set default-vlan-id=10 [find name=ether2]
    set default-vlan-id=30 [find name=ether3]
    set default-vlan-id=30 [find name=ether4]
    # leave ether5 as is (trunk downlink)

    # per docs (https://help.mikrotik.com/docs/spaces/ROS/pages/15302988/Switch+Chip+Features#SwitchChipFeatures-PortSettings)
    set vlan-header=leave-as-is [find]
    set vlan-mode=secure [find]

What is a MikroTik RouterOS bridge?
Let me be clear again: the config above doesn’t follow MT’s examples, as it does not include a bridge with all ports added to it as described in the manual switch chip config docs. I never did that for these devices and it worked, so I didn’t think twice about it. Why would I add ports whose traffic never goes to the CPU to a software bridge? Well, except for one thing: management access. I added the following config:

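# create the bridge first (default settings assumed here; see the attached export for the exact ones)
/interface/bridge/add name=bridge1-vlan-master
/interface/bridge/port/add bridge=bridge1-vlan-master interface=ether1
/interface/vlan/add name=vlan20-mgmt interface=bridge1-vlan-master vlan-id=20
/ip/dhcp-client/add interface=vlan20-mgmt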
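
For contrast, the layout the docs describe would look roughly like the sketch below, with every switched port added to the same bridge. This is an untested illustration and an alternative to the config above, not something to apply on top of it. The bridge and VLAN interface names are reused from the config above, and protocol-mode=none and hw=yes are taken from the 2018 docs example quoted further down in the thread.

```
# Sketch only: every switched port is a member of one hw-offloaded bridge.
/interface/bridge
    add name=bridge1-vlan-master protocol-mode=none

/interface/bridge/port
    add bridge=bridge1-vlan-master interface=ether1 hw=yes
    add bridge=bridge1-vlan-master interface=ether2 hw=yes
    add bridge=bridge1-vlan-master interface=ether3 hw=yes
    add bridge=bridge1-vlan-master interface=ether4 hw=yes
    add bridge=bridge1-vlan-master interface=ether5 hw=yes

# Management VLAN interface on top of the bridge, as in the config above.
/interface/vlan
    add name=vlan20-mgmt interface=bridge1-vlan-master vlan-id=20
```

The switch chip VLAN table and port settings would stay as they are; only the bridge membership differs.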

A bug in bridge implementation?
This config works as well. However, something that looks like either magic or a bug happens. If you add to the bridge a port that has no business sending any traffic to the CPU, and that is not part of any VLAN that touches the CPU, the port will break once its bridge port entry is disabled:

/interface/bridge/port/add bridge=bridge1-vlan-master interface=ether2
/interface/bridge/port/set disabled=yes [find interface=ether2]
/interface/bridge/port/set disabled=no [find interface=ether2]

Adding the port to the bridge makes traffic on that port stop for a few seconds, which is normal. However, disabling or deleting the port’s bridge entry at this point makes the port stop passing traffic altogether. This isn’t something I would expect, as ether2 doesn’t have any config suggesting that its traffic touches the CPU. Re-enabling the port on the bridge fixes it. However, leaving the bridge port entry disabled and then rebooting also fixes the issue (huh?!). Disabling hw offloading on this bridge port entry has the same effect as disabling the entry.
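
The hw offloading variant mentioned above can be sketched as follows. It assumes ether2 is already an enabled member of bridge1-vlan-master, as in the reproducer; hw is the standard bridge port property that controls hardware offloading. The sketch only shows how to toggle and inspect the setting and makes no claim about what the switch chip does internally.

```
# Keep the bridge port entry enabled but turn off hardware offloading.
/interface/bridge/port/set hw=no [find interface=ether2]

# Inspect the result; hardware-offloaded ports carry the H flag in this listing.
/interface/bridge/port/print

# Turn hardware offloading back on afterwards.
/interface/bridge/port/set hw=yes [find interface=ether2]
```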



Questions

  1. Am I doing/understanding something incorrectly here? The config is as simple as it gets in my book, but maybe I have some fundamental misunderstanding that I have held for years :slight_smile:

  2. What internally is the point of adding switch-only ports to a bridge on devices that do not support bridge VLAN filtering?
    It appears that it does something undefined. The only trace of that is MikroTik’s examples putting all ports under a “software” bridge. However, it “appears” that the switch is functioning properly without that under most circumstances? :wink:

  3. The documentation in multiple places contains a cryptic note for QCA8337:

On QCA8337 and Atheros8327 switch chips, a default vlan-header=leave-as-is property should be used. The switch chip will determine which ports are access ports by using the default-vlan-id property. The default-vlan-id should only be used on access/hybrid ports to specify which VLAN the untagged ingress traffic is assigned to.

This isn’t a rule I normally follow; instead I always set “vlan-header” to “always-strip” on access ports and to “add-if-missing” on trunk ports. This seems incorrect for these 2 chipsets. Can someone explain the reasoning behind this note in the docs: why is it there, what happens if you actually set it to something other than “leave-as-is”, and are there any real-world consequences/incompatibilities with these chips vs. others that don’t have this strange requirement?
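
The habit described in question 3 would look roughly like this on the test setup. The port roles (ether2 to ether4 as access ports, ether1 and ether5 as trunks) come from the switch config earlier in the post. Per the docs note quoted above this is not the recommended form on QCA8337/AR8327, so treat it as an illustration of the question rather than as advice.

```
/interface/ethernet/switch/port
    # access ports: strip the VLAN tag on egress
    set vlan-header=always-strip   [find name=ether2]
    set vlan-header=always-strip   [find name=ether3]
    set vlan-header=always-strip   [find name=ether4]

    # trunk ports: add a tag on egress if the frame has none
    set vlan-header=add-if-missing [find name=ether1]
    set vlan-header=add-if-missing [find name=ether5]
```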
full-config-export.rsc (1.08 KB)

The basic idea, mentioned everywhere in the docs, is that switched ports are members of a bridge. The fact that it works for you when not all ports are bridge members is some kind of gray area, and hence the behaviour might change from version to version.

And yes, it is normal that bridge config affects switch ports even if advanced port config is done on the switch chip.

As to question #2: when bridge hardware offloading got introduced (around 6.41), it replaced “master switch port” as the means for the CPU to communicate with bridged ports (the CPU-facing bridge port). And, as I already mentioned, the idea is to have the bridge as complete as it gets. If certain L2 functions have to be configured elsewhere, this doesn’t mean that the bridge can be left only partially built.
I’m wondering (perhaps you could test it) whether the switch’s CPU can communicate with a device connected to a port which is not a bridge member (in your case ether5).

  3. My understanding is that QCA8337 and AR8327 simply ignore that setting for ports with default-vlan-id set and go with tagging/untagging regardless. I tend to set it to “always-strip” on access ports. I guess it wouldn’t be the right setting on hybrid ports, so it’s just good that the switch chip knows better.

Indeed, this changed somewhere around the time of the master-port removal. I was second-guessing myself about whether the bridge used to be required, and I was right: it was not. See the old copy of the wiki, where the flow is essentially:

  1. Set random port as “master-port”
  2. Use “/interface/ethernet/switch/port” to set trunk/access and PVID
  3. Configure VLAN table in “/interface/ethernet/switch/vlan”

What is also curious here is that MT specifically says that AR8327 supports the leave-as-is/always-strip/add-if-missing options.


You seem to be 100% right. I opened the documentation from 2018, and there they specifically call out this change and present both constructions as equivalent, even if normally a “bridge” isn’t related to the hardware switch:

# pre-v6.41 master-port configuration

/interface ethernet
set ether3 master-port=ether2
#...

# post-v6.41 bridge hw-offload configuration

/interface bridge
add name=bridge1 igmp-snooping=no protocol-mode=none
/interface bridge port
add bridge=bridge1 interface=ether2 hw=yes
#...



This is exactly my point here: the bridge makes sense if you want a bridge on the CPU side. However, it appears that the “bridge” silently fiddles with the hardware configuration even if the CPU is deliberately cut off from the traffic. As for the CPU communicating with a non-bridge member:


It doesn’t seem to be the case. “switch1-cpu” must be added explicitly, as I did for VLAN 20, for the CPU to be able to get the traffic. ROS appears to be completely cut off from the network if “switch1-cpu” isn’t added somewhere. I poked around using the serial console in such a state, and removal of “switch1-cpu” is a death sentence for any network traffic between the CPU and the switch chip :wink:


Curiously, the 2014 docs don’t have this note for the Atheros one (QCA wasn’t a thing yet). In the 2016 docs the QCA8337 is there, but the note still isn’t. BUT you may be spot on with the ignore part! In the 2019 documentation the note reads:

In QCA8337 and Atheros8327 chips when vlan-mode=secure is used, it ignores switch port vlan-header options. VLAN table entries handle all the egress tagging/untagging and works as vlan-header=leave-as-is on all ports. It means what comes in tagged, goes out tagged as well, only default-vlan-id frames are untagged at the egress of port.

The documentation from 2021 changed it to “The vlan-header is set to leave-as-is and cannot be changed” but still mentioned the setting being ignored. The old documentation on the wiki as of today still mentions ignoring the setting. The new documentation repository changed the wording to this confusing “should be used”…
I wish MikroTik had left that bit in, as “it is being ignored” makes me sleep better than “don’t do it”, since the latter implies that some undefined behavior may be caused by changing “vlan-header” to a value other than “leave-as-is”.

I think that MikroTik’s point since 6.41 is to use the bridge when it comes to L2 communication between two ports … and to use the switch chip config submenus in addition to that. I’m willing to bet that in this context MT’s point wins over your point :laughing: