I’m the IT guy for a small company and recently moved us from a legacy flat network to the first stages of a more segregated setup. I have attached a network overview.
- SW01 is a CRS309-1G-8S+IN, SwOS 2.13
- SW02 is a CRS309-1G-8S+IN, SwOS 2.13
- SW03 is a CSS610-8G-2S+IN, SwOSlite 2.17
- (fault also occurred on earlier versions, unfortunately not documented)
- SW04 is a CSS610-8G-2S+IN, SwOSlite 2.17
- (fault also occurred on earlier versions, unfortunately not documented)
- ROUTER01 is a RB5009UG+S+IN, RouterOS 7.12
Faults
The problem manifested as random dropouts. Since I wasn't on site and couldn't confirm reports quickly, I could not reproduce the issue. I therefore built a small tool that periodically pings a set of IPs and logs the results to CSV.
I deployed this tool on the OpenVZ container that already provides a Prometheus SNMP exporter for the network hardware. I subsequently compiled the tool for armv7 and aarch64 so it could also run on the NAS and even on the Ubiquiti access points.
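Roughly, the tool boils down to something like the sketch below (the target IPs and file name here are placeholders, not my real values). It shells out to the system ping, so it needs no raw-socket privileges and the same source cross-compiles for armv7/aarch64 with plain GOOS/GOARCH settings:

```go
// pinglog: every second, ping a set of IPs once and append the result to a CSV.
// Shelling out to the system ping keeps it unprivileged (no raw sockets).
package main

import (
	"encoding/csv"
	"fmt"
	"os"
	"os/exec"
	"time"
)

// placeholder targets; the real list covers hosts on both sides of the fault
var targets = []string{"192.168.1.1", "192.168.1.10", "192.168.1.20"}

func main() {
	f, err := os.OpenFile("pinglog.csv", os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0o644)
	if err != nil {
		panic(err)
	}
	defer f.Close()
	w := csv.NewWriter(f)

	for {
		for _, ip := range targets {
			start := time.Now()
			// -c 1: one echo request, -W 1: wait at most 1 s for the reply
			err := exec.Command("ping", "-c", "1", "-W", "1", ip).Run()
			elapsed := time.Since(start) // rough RTT, includes process startup
			ok := "1"
			if err != nil {
				ok = "0"
			}
			w.Write([]string{
				time.Now().UTC().Format(time.RFC3339),
				ip,
				ok,
				fmt.Sprintf("%.1f", float64(elapsed.Microseconds())/1000.0),
			})
		}
		w.Flush()
		time.Sleep(1 * time.Second)
	}
}
```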

From every monitor (infra, NAS, APs), I ping various other systems on both sides of the fault. This allowed us to identify SW03 as the culprit: all systems on the same side as the monitor (nearest switch SW01 or SW02) stay reachable with acceptable RTTs.
I periodically import the CSV data into InfluxDB to make it accessible via Grafana. I now have many days of data and have made a few peculiar observations.
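The import is nothing fancy either; roughly, it converts each CSV row into InfluxDB line protocol and POSTs the batch to the v2 write API (the URL, org, bucket and token below are placeholders for my setup):

```go
// csv2influx: convert the ping CSV into InfluxDB line protocol and push it
// via the v2 write API. URL, org, bucket and token are placeholders here.
package main

import (
	"bytes"
	"encoding/csv"
	"fmt"
	"net/http"
	"os"
	"time"
)

func main() {
	f, err := os.Open("pinglog.csv")
	if err != nil {
		panic(err)
	}
	defer f.Close()
	rows, err := csv.NewReader(f).ReadAll()
	if err != nil {
		panic(err)
	}

	var buf bytes.Buffer
	for _, r := range rows { // r = [timestamp, target, ok, rtt_ms]
		if len(r) < 4 {
			continue
		}
		ts, err := time.Parse(time.RFC3339, r[0])
		if err != nil {
			continue
		}
		// line protocol: measurement,tag=... field=...,field=... unix_timestamp
		fmt.Fprintf(&buf, "ping,target=%s ok=%si,rtt_ms=%s %d\n", r[1], r[2], r[3], ts.Unix())
	}

	req, err := http.NewRequest(http.MethodPost,
		"http://influxdb.example:8086/api/v2/write?org=myorg&bucket=netmon&precision=s", &buf)
	if err != nil {
		panic(err)
	}
	req.Header.Set("Authorization", "Token "+os.Getenv("INFLUX_TOKEN"))
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println("write status:", resp.Status)
}
```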
Observations/Conclusions
- SW03 randomly stops forwarding traffic
- SW04 does the same, though this went undetected at first because it sits in a lab with no devices running 24/7 to ping
- the outages are not connected to any observable event
- it seems as if the switch chip stops forwarding, BUT
- the two access points on the same switch usually can ping each other for a few minutes when nothing else can get past the switch
- the switch management interface still replies to pings without interruption
- the outages last, within the accuracy of the 1 s ping interval, exactly 5 minutes! (see the gap analysis after this list)
- after these 5 minutes, traffic resumes and everything works as intended
- the error counters on SW03 only ever record RX pauses, which, by the way, are invisible via SNMP! Mikrotik?!
- SNMP error counters don’t indicate anything around the events.
- I did configure a port mirror on SW02, and the packet capture only shows everything beyond the fault going silent until it comes back after 5 minutes
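For what it's worth, the "exactly 5 minutes" figure comes from a small gap analysis over the same CSV, roughly like the sketch below (the file name and the 3 s gap threshold are arbitrary choices on my part):

```go
// pinggaps: find stretches in the ping CSV where a target stopped answering
// and print how long each outage lasted. The 3 s threshold just means
// "more than a couple of missed 1 s pings".
package main

import (
	"encoding/csv"
	"fmt"
	"os"
	"time"
)

func main() {
	f, err := os.Open("pinglog.csv")
	if err != nil {
		panic(err)
	}
	defer f.Close()
	rows, err := csv.NewReader(f).ReadAll()
	if err != nil {
		panic(err)
	}

	lastOK := map[string]time.Time{} // per target: time of last successful ping
	for _, r := range rows {         // r = [timestamp, target, ok, rtt_ms]
		if len(r) < 3 {
			continue
		}
		ts, err := time.Parse(time.RFC3339, r[0])
		if err != nil || r[2] != "1" {
			continue
		}
		if prev, seen := lastOK[r[1]]; seen {
			if gap := ts.Sub(prev); gap > 3*time.Second {
				fmt.Printf("%s unreachable for %s (from %s)\n",
					r[1], gap.Round(time.Second), prev.Format(time.RFC3339))
			}
		}
		lastOK[r[1]] = ts
	}
}
```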
I came from a much larger infrastructure based on other switches (fully meshed redundant 10G links over multiple buildings with ~80 hosts, 5 firewalls and various internet uplinks). I don't think I'm trying to do anything more than the bare minimum here, and I don't suspect a misconfiguration.
Exactly five minutes seems suspiciously like some watchdog running into a timeout and resetting some part of the hardware.
Experiments
I wanted to rule out an overheating S+RJ10, as they can get toasty. This is why there is a separate uplink next to the multi-gig one, via a bog-standard S-RJ01 and a regular port on the CSS610. The uplink itself can also hardly be the problem, since SW03 keeps replying on its admin interface all the way down to the SNMP exporter in the basement even while the access points become unreachable. If the APs were the problem, the NAS should still be able to continuously ping the router and should remain reachable from the SNMP exporter. Unless I'm mistaken, there is no constellation that allows the observed faults to occur without implicating SW03.
I initially assumed faulty HW at SW03, so I swapped the configurations between SW03 and SW04 (identical models), and the error occurred on the other HW as well! This leads me to conclude that the fault lies with the OS: if it were the HW, the entire fleet (or a substantial portion of it) should be affected, and I would have found posts about this already.
I also had an unused CRS326-24G-2S+RM (running its current SwOS) lying about that I frankensteined into the setup to take over from SW03. I had only one spare PoE injector, so only one AP could be moved to this switch while the other stayed on SW03. During that time we observed at least one outage where the AP still on SW03 dropped off the network while the AP on the CRS326 continued replying perfectly.
This further points towards a problem with SwOSlite or with both of our CSS610 devices, although they have non-neighbouring serial numbers and even different vendor bits in their MAC addresses, so they do not seem to be from the same batch.
--
I'm about at my wit's end, and the only sure way out would be to replace the switches and avoid CSS from now on. I am very annoyed by this, as I introduced Mikrotik to my company with "they pack a lot of punch for the money and are still 'proper' devices", and now "I" have caused all these problems. People are understandably upset, as a rare but random failure is the worst sort of gambling and dropouts during meetings are catastrophic.
Does anyone have an idea about the cause, ways to diagnose further or questions about missing information?
Best Regards
Moe