Unreliable connection between switches with VLANs

I am in the process of replacing the UniFi switches in my home network with MikroTik switches. I use a number of VLANs for things like clusters, CCTV, IoT, and guest networks. The motivating factor for switching from UniFi to MikroTik is that specifying ACLs for inter-VLAN routing on UniFi is currently painful, unstable, or even impossible. After upgrading most of my devices to 10G, inter-VLAN routing going through my OPNsense router became a severe bottleneck, so I went with MikroTik to get closer to line speed.

Obviously the UniFi management interface is simple to a fault, hiding most of the complexity of my configuration behind some checkboxes. Moving to RouterOS has made that clear. I am struggling to replicate my setup, even without inter-VLAN routing, in RouterOS. After reading through the relevant sections in the RouterOS manual, dozens of forum posts, and some tutorials, I have arrived at a configuration that I expect to work, but it fails in unexpected ways.

My network infrastructure looks like this:
network-diagram.png
I am not using any redundant links between devices at the moment, so each line represents exactly one (Q)SFP+ or CAT6a cable.

My VLANs are:

Number | Name       | Gateway (CIDR)
------ | ---------- | --------------
     2 | Manage     | 10.24.8.1/22
    16 | Service    | 10.24.64.1/22
    24 | Proxmox    | 10.24.96.1/22
    25 | Kubernetes | 10.24.100.1/22
    32 | Client     | 10.24.128.1/22
    40 | Security   | 10.24.160.1/22
    48 | Smart      | 10.24.192.1/22
    63 | Guest      | 10.24.252.1/22

OPNsense runs a DHCP server on each VLAN, at the IP address given in the above table. I prefer to use DHCP whenever possible, so my goal is to have each switch and access point get its IP address on VLAN 2 via DHCP. With static reservations, I expect something like this:

Host       | IP Address
---------- | ----------
opnsense1  | 10.24.8.1
netr1-csw1 | 10.24.8.10
appr1-dsw1 | 10.24.8.11
appr1-asw1 | 10.24.8.12
living-ap  | 10.24.8.50
theater-ap | 10.24.8.51
garage-ap  | 10.24.8.52

With my current configuration, this much works: everything gets its reserved IP from OPNsense. That’s about as far as it goes, though. I can’t ping the switches from the router, I can’t ping the router from the switches, and I can’t reach the Internet from the switches. If I plug a laptop into an access port on VLAN 2 on any of the switches, I can connect to that switch through Winbox, but not to any of the others, and the connection is occasionally interrupted, which doesn’t happen when I’m connected to the MGMT port.

Running a packet sniffer on netr1-csw1 while pinging the switch from the router shows no incoming packets. After a reboot, ping works, but if I cancel the ping and start it again, it starts timing out and intermittently reporting that the host is unreachable. I suspected that STP was the cause, so I tried RSTP, MSTP, and even disabling STP entirely (there are no loops in my network), but the behavior persists. OPNsense doesn’t seem to be the issue; to be sure, I have disabled all firewall rules.

I have attached the exports for each RouterOS device. In them, you can also see that I am designating the MGMT/BOOT port on each switch as a separate bridge that runs a DHCP server, as a sort of anti-lockout port in case I don’t have a serial cable to use the CONSOLE port. I don’t think this affects the rest of the configuration, though.

I am at a loss on how to proceed. Am I missing something extremely obvious? Is this type of setup not practical or achievable?
netr1-csw1-export.txt (3.62 KB)
appr1-asw1-export.txt (4.77 KB)
appr1-dsw1-export.txt (4.77 KB)

I just had a look at the config of appr1-dsw1; I’ll assume the rest suffer from the same errors. Here’s a brief list of things done wrong:

  • no need for multiple bridges (MGMT is on a different bridge, which doesn’t have any access to the rest of the network)
  • no PVID setting on access ports
  • there’s no need for VLAN interfaces for VLANs that the switch handles only for switching purposes
  • IP config is incomplete: the default gateway is missing. Even if the L2 setup were correct, this alone would prevent the switch from accessing the internet (access to the other switches and the router should be fine, though). Also beware of running a DHCP server on the management VLAN: once all devices can see each other, there will be multiple DHCP servers serving the same L2/L3 subnet. Beware of static IP addresses too: if all switches are left on the same IP address, they will conflict once the mgmt VLAN gets sorted out.
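
To make the list above more concrete, here is a rough sketch of what a single-bridge setup could look like on one switch. It is not taken from the attached exports. The names bridge1, sfp-sfpplus1 (trunk toward the router), ether1 (an access port on the Client VLAN) and vlan2-mgmt are assumptions for illustration, and only VLANs 2 and 32 from the table are shown. It also assumes the switch should get its management address and default route via DHCP, as the original post describes.

```routeros
# Assumed names (not from the exports): bridge1, sfp-sfpplus1, ether1, vlan2-mgmt.

# One bridge for all ports; keep VLAN filtering off until the VLAN table is complete.
/interface bridge
add name=bridge1 vlan-filtering=no

# The trunk port accepts tagged frames only; the access port gets a PVID.
/interface bridge port
add bridge=bridge1 interface=sfp-sfpplus1 frame-types=admit-only-vlan-tagged
add bridge=bridge1 interface=ether1 pvid=32 frame-types=admit-only-untagged-and-priority-tagged

# The bridge itself is a tagged member of the management VLAN only,
# because that is the only VLAN the switch needs an IP address on.
/interface bridge vlan
add bridge=bridge1 vlan-ids=2 tagged=bridge1,sfp-sfpplus1
add bridge=bridge1 vlan-ids=32 tagged=sfp-sfpplus1 untagged=ether1

# A single VLAN interface, for management only.
/interface vlan
add name=vlan2-mgmt interface=bridge1 vlan-id=2

# DHCP client on the management VLAN; add-default-route=yes installs
# the gateway offered by the DHCP server as the default route.
/ip dhcp-client
add interface=vlan2-mgmt add-default-route=yes disabled=no

# Enable VLAN filtering last, once everything above is in place.
/interface bridge
set bridge1 vlan-filtering=yes
```

Turning on vlan-filtering is the step most likely to cut off a management session if the VLAN table is wrong, which is why it comes last in the sketch.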

I recommend going through this fine tutorial on how to do VLANs in ROS properly. Concentrate on the post titled “Switch with a separate router (RoaS)”.

Just wanted to follow up on this. Your advice worked, and clarified a few things I was confused about with VLANs (i.e. interface VLAN vs bridge VLAN). It seems the critical things I was missing were the neighbor discovery and MAC server settings. Thanks!
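
For anyone landing here later: the follow-up above does not spell out the exact values, but those settings live under /ip neighbor discovery-settings and /tool mac-server. A minimal sketch, assuming an interface list named MGMT and a management VLAN interface named vlan2-mgmt (both names are hypothetical), might look like this:

```routeros
# Assumed names (hypothetical): interface list MGMT, VLAN interface vlan2-mgmt.

# Group the management interface(s) in one list.
/interface list
add name=MGMT
/interface list member
add list=MGMT interface=vlan2-mgmt

# Run neighbor discovery on the management interfaces only.
/ip neighbor discovery-settings
set discover-interface-list=MGMT

# Allow MAC-Telnet and MAC-Winbox on the management interfaces only.
/tool mac-server
set allowed-interface-list=MGMT
/tool mac-server mac-winbox
set allowed-interface-list=MGMT
```

Pointing all three settings at the same list keeps discovery and MAC-level access in step with wherever management traffic is actually expected.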