moenadic
just joined
Topic Author
Posts: 4
Joined: Sat Oct 21, 2023 2:38 pm

CSS610-8P-2S+ randomly stops forwarding for exactly five minutes

Sat Nov 18, 2023 5:27 am

Setup

I’m the IT guy for a small company and recently moved us from a legacy flat network to the first stages of a more segregated setup. I have attached a network overview.
  • SW01 is a CRS309-1G-8S+IN, SwOS 2.13
  • SW02 is a CRS309-1G-8S+IN, SwOS 2.13
  • SW03 is a CSS610-8G-2S+IN, SwOSlite 2.17
    • (fault also occurred on earlier versions, unfortunately not documented)
  • SW04 is a CSS610-8G-2S+IN, SwOSlite 2.17
    • (fault also occurred on earlier versions, unfortunately not documented)
  • ROUTER01 is a RB5009UG+S+IN, RouterOS 7.12
There is a VLAN overlay that I did not illustrate. Basically, every connection between switches and to the two Ubiquiti APs is a trunk (mode strict, tagged only, port set to VID 4090 and not a VLAN member). The switches and access points have a management VLAN and accept connections only from there. The SSIDs are mapped to VLAN tags. ROUTER01 is where access between VLANs is routed and firewalled. The infra server hosts a few VMs, among them an OpenVZ container for monitoring.

Faults

The problem manifested in random dropouts. When I wasn’t on site and could confirm quickly, I could not reproduce the issue. I then built a small tool that would periodically ping a set of IPs and log to CSV.

I deployed this tool on the OpenVZ container already providing a Prometheus SNMP exporter for the network hardware. I subsequently compiled the tool for armv7 and aarch64, so it could run on the NAS and even on the Ubiquiti access points :)
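For readers who want to reproduce this kind of monitoring, here is a minimal sketch of a periodic ping-to-CSV logger. It is not the author's tool (which was compiled for armv7 and aarch64, so it was presumably not written in Python). The function names, the CSV column order, and the `-c`/`-W` flags (Linux iputils `ping`) are assumptions made for illustration.

```python
import csv
import subprocess
import time
from datetime import datetime, timezone


def ping_once(host, timeout_s=1):
    """Return True if one ICMP echo is answered (assumes Linux iputils ping flags)."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def log_round(writer, hosts, probe=ping_once, now=None):
    """Probe every host once and write one CSV row per host: time, host, 1/0."""
    stamp = (now or datetime.now(timezone.utc)).isoformat()
    for host in hosts:
        writer.writerow([stamp, host, int(probe(host))])


def run(path, hosts, interval_s=1.0, rounds=None, probe=ping_once):
    """Append results to path every interval_s seconds; rounds=None runs forever."""
    done = 0
    with open(path, "a", newline="") as fh:
        writer = csv.writer(fh)
        while rounds is None or done < rounds:
            started = time.monotonic()
            log_round(writer, hosts, probe)
            fh.flush()  # keep the file current in case the monitor itself dies
            done += 1
            # Sleep only for what is left of the interval after probing.
            time.sleep(max(0.0, interval_s - (time.monotonic() - started)))
```

The hosts are probed one after another, so several unreachable targets will stretch a round beyond the interval. A real deployment would likely probe concurrently to hold the 1 s cadence.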

From every monitor (infra, NAS, APs), I ping various other systems, on both sides of the fault. This allowed us to identify SW03 as the culprit. All systems on the same side (nearest switch, SW01 or SW02) stay reachable with acceptable RTTs.
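The elimination argument can be sketched as a set computation: a device is suspect if it sits on every failing path and on no path that still works. The topology in the usage example is hypothetical, since the attached network overview is not visible here. `suspects`, the path table, and the host names are illustrative assumptions.

```python
def suspects(paths, results):
    """Return devices that are on every failing path and on no working path.

    paths:   {(src, dst): [device, ...]} switches a frame is forwarded through
    results: {(src, dst): True if the ping succeeded, False otherwise}
    """
    failing = [set(paths[pair]) for pair, ok in results.items() if not ok]
    working = [set(paths[pair]) for pair, ok in results.items() if ok]
    if not failing:
        return set()
    common = set.intersection(*failing)
    return common - set().union(*working)


# Hypothetical layout: infra on SW01, NAS on SW02, an AP on SW03.
paths = {
    ("infra", "nas"): ["SW01", "SW02"],
    ("infra", "ap1"): ["SW01", "SW02", "SW03"],
    ("nas", "ap1"): ["SW02", "SW03"],
    ("nas", "router"): ["SW02", "SW01"],
}
results = {
    ("infra", "nas"): True,
    ("infra", "ap1"): False,
    ("nas", "ap1"): False,
    ("nas", "router"): True,
}
print(suspects(paths, results))
```

Only paths that are forwarded through a switch belong in the table. A ping that terminates at a switch's own management address exercises its CPU port rather than its forwarding path, so it should be evaluated separately.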

I periodically import the CSV data into influxdb to make it accessible via Grafana. I now have many days of data and made a few peculiar observations.

Observations/Conclusions

  • SW03 randomly stops forwarding traffic
  • SW04 also does so, though it was not detected at first because it is used in a lab with no devices running 24/7 to ping
  • the outages are not connected to any observable event
  • it seems as if the switch chip stops forwarding, BUT
    • the two access points on the same switch usually can ping each other for a few minutes when nothing else can get past the switch
    • the switch management interface still replies to pings, without interruption
  • the outages last, within the accuracy provided by 1s ping interval, exactly 5 minutes!
  • after these 5 minutes, traffic resumes and everything works as intended
  • The error counters on SW03 only ever record RX pauses. Which, btw, are invisible via SNMP! Mikrotik?!
  • SNMP error counters don’t indicate anything around the events.
  • I did configure a port mirror on SW02 and the packet capture only shows everything beyond the fault stop responding until it comes back after 5 minutes
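The "exactly 5 minutes" observation can be checked mechanically from the ping log. The following is a sketch that assumes the log has been reduced to time-sorted `(timestamp, ok)` pairs for a single target. The function name and the `min_len` filter for ignoring single lost pings are assumptions, not part of the original tool.

```python
from datetime import datetime, timedelta


def outages(samples, min_len=timedelta(seconds=5)):
    """Return (start, end, duration) for each run of consecutive failed probes.

    samples: iterable of (timestamp, ok) for ONE target, sorted by time.
    start is the first failed probe, end is the first successful probe after it.
    A run that is still open when the data ends is not reported.
    """
    runs, start = [], None
    for stamp, ok in samples:
        if not ok and start is None:
            start = stamp                      # outage begins
        elif ok and start is not None:
            if stamp - start >= min_len:       # skip isolated lost pings
                runs.append((start, stamp, stamp - start))
            start = None                       # outage over
    return runs
```

With a 1 s probe interval, each reported duration carries an uncertainty of about one second at either edge, which matches the accuracy stated in the list above.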

I came from a much larger infrastructure based on other switches (fully meshed redundant 10G links over multiple buildings with ~80 hosts, 5 firewalls and various internet uplinks). I don’t think I’m trying to do anything more than the bare minimum here and don’t suspect a misconfiguration.

Exactly five minutes seems suspiciously like some watchdog running into a timeout and resetting some part of the hardware.

Experiments

I wanted to rule out overheating S+RJ10, as they can get toasty. This is why there is a separate uplink next to the multi-gig via bog-standard S-RJ01 and a regular port on the CSS610. Also, the uplink itself can hardly be the problem since for some reason SW03 replies on its admin interface down to the SNMP exporter in the basement, despite the access points becoming unreachable. If the APs were the problem, the NAS should be able to continuously ping the router and should be reachable from the SNMP exporter. Unless I'm mistaken, there is no constellation that allows the observed faults to occur without implicating SW03.

I initially assumed faulty HW at SW03, so I swapped configurations between SW03 and SW04 (identical models) and the error occurred on the other HW as well! This leads me to conclude that the fault lies with the OS: if it were the HW, the entire fleet (or a substantial portion) should be affected and then I would have found posts about this already.

I also had an unused CRS326-24G-2S+RM (running its current SwOS) lying about that I frankensteined into the setup to replace SW03. I had only one spare PoE injector, so only one AP could be connected to this switch. During that time, we observed at least one outage, where the AP on SW03 dropped off the network while the other AP on the CRS326 continued replying perfectly.

This further points towards a problem with SwOSlite or both of our CSS610 devices, although they have non-neighbour serial numbers and even different vendor bits in the MAC address, so it does not seem they are of the same batch.

--

I'm about at my wit's end and the only sure way out would be to replace the switches. And avoid CSS from now on. I am very annoyed by this, as I introduced Mikrotik to my company with "they pack a lot of punch for the money and are still 'proper' devices" and now "I" caused all those problems. People are understandably upset as a rare but random failure is the worst sort of gambling and dropouts during meetings are catastrophic.

Does anyone have an idea about the cause, ways to diagnose further or questions about missing information?

Best Regards
Moe
 
mkx
Forum Guru
Posts: 10713
Joined: Thu Mar 03, 2016 10:23 pm

Re: CSS610-8P-2S+ randomly stops forwarding for exactly five minutes

Sat Nov 18, 2023 10:36 am

A clarification question: how are the ports used for the two connections between SW03 and SW02 configured? Any special config (such as bonding) or nothing?
 
moenadic
just joined
Topic Author
Posts: 4
Joined: Sat Oct 21, 2023 2:38 pm

Re: CSS610-8P-2S+ randomly stops forwarding for exactly five minutes

Sat Nov 18, 2023 1:54 pm

The ports were initially configured as LAG but when SW03 consistently showed them being part of two LAGs despite the same neighbor, I investigated and found the byline that the CSS610 only does L2 hashing while the CRS309 will use more layers, so I chalked it up to that. @mikrotik: This should be highlighted, maybe build an overview table over all SwOS devices showing their differences?

I have not yet had the time to change SW02 to RouterOS where this behaviour is configurable (why not SwOS? It’s either a chip feature or the underpinning Linux kernel can do it, anyway?).

This means it is now only in the hands of RSTP to prevent loops and do the failover. Experimentally disconnecting either link showed basically no impact on running pings, so I assume this works, just not with load balancing.
Last edited by moenadic on Sat Nov 18, 2023 1:55 pm, edited 1 time in total.
 
tdw
Forum Guru
Posts: 1799
Joined: Sat May 05, 2018 11:55 am

Re: CSS610-8P-2S+ randomly stops forwarding for exactly five minutes

Sun Nov 19, 2023 11:59 pm

Five minutes suggests an issue with switch FDB entries ageing out. Do you have any duplicate MAC addresses on different VLANs? SwOS lite does not support IVL which would be required if that is the case.
 
moenadic
just joined
Topic Author
Posts: 4
Joined: Sat Oct 21, 2023 2:38 pm

Re: CSS610-8P-2S+ randomly stops forwarding for exactly five minutes

Tue Nov 21, 2023 3:05 pm

Of course, by default Linux (=the UniFi APs) will use the same MAC for all VLANs. But the APs are connected over one port apiece, so, unless I'm missing the obvious, no matter what VLAN, as long as the packet is forwarded to the right port and the tag is left intact, everything should work.

There is also continuous traffic during the day, when we had outages in the past, so the switch should never go one expiration interval without being reminded that $mac is coming in via $port.

I am not aware of having configured a single device to use different MACs for different VLANs. And if it was a general problem, why does it work 99%+ of the day?
 
tdw
Forum Guru
Posts: 1799
Joined: Sat May 05, 2018 11:55 am

Re: CSS610-8P-2S+ randomly stops forwarding for exactly five minutes

Wed Nov 22, 2023 10:24 pm

Switches do not work in that manner. When a packet is destined for a unicast MAC address which does not exist in the forwarding database, the packet is transmitted out of all the other switch ports; if the destination MAC address does exist in the database, the packet is only transmitted out of the port associated with that address.

The database is populated by recording the source MAC address and switch port as packets arrive. The timer for each entry is reset each time the source MAC address is seen; otherwise the entry ages out. If packets with a duplicate source MAC address arrive on different ports, then return traffic will only be sent to the port on which it was last seen.
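The learning and ageing behaviour described above can be modelled in a few lines. This is a toy sketch of a single shared (VLAN-unaware) MAC table, not a description of how SwOS Lite is implemented. The class name is made up, and the 300 s ageing time is the commonly used IEEE 802.1D default, assumed here rather than confirmed for the CSS610.

```python
class Fdb:
    """Toy learning bridge: one shared MAC table with ageing (no per-VLAN tables)."""

    def __init__(self, ports, ageing_s=300):
        self.ports = set(ports)
        self.ageing_s = ageing_s
        self.table = {}  # mac -> (port, time last seen as a source)

    def forward(self, now, in_port, src, dst):
        """Learn src on in_port, then return the set of egress ports for dst."""
        self.table[src] = (in_port, now)       # learn, refresh, or move the entry
        entry = self.table.get(dst)
        if entry is not None and now - entry[1] > self.ageing_s:
            del self.table[dst]                # entry aged out
            entry = None
        if entry is None:
            return self.ports - {in_port}      # unknown unicast: flood
        port = entry[0]
        # Known destination: one port only, and never back out of the ingress port.
        return set() if port == in_port else {port}


fdb = Fdb(ports=[1, 2, 3])
print(fdb.forward(0, 1, "aa", "bb"))  # bb unknown: flooded to ports 2 and 3
print(fdb.forward(1, 2, "bb", "aa"))  # aa was learned on port 1
fdb.forward(2, 3, "aa", "cc")         # the same source MAC now shows up on port 3
print(fdb.forward(3, 2, "bb", "aa"))  # traffic for aa follows the move to port 3
```

In this model, the last call shows the failure mode raised in the thread: once the duplicate source address is seen on another port, return traffic goes only there until the entry is relearned or ages out.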
 
moenadic
just joined
Topic Author
Posts: 4
Joined: Sat Oct 21, 2023 2:38 pm

Re: CSS610-8P-2S+ randomly stops forwarding for exactly five minutes

Thu Nov 23, 2023 11:07 am

I concur that this timeout looks like it might fit the behaviour. Even so, there is, at least as long as the ping processes run, a constant stream of ICMP packets arriving at least every second, that should update the FDB in both directions. The connection to the access points is unambiguous, there exists only one port each from which they possibly can send packets.

And the other direction is where the ICMP echo requests come in, so how could the switch ever get confused? I also tried disconnecting one link, so no packets ever arrive on the second connection between SW03 and SW02, the network is absolutely unambiguous for SW03, and the outages still happen.
