We have a customer who recently started experiencing a network issue that we haven’t been able to pinpoint yet. The customer is running an MLAG stack with two CRS354-48P units as core switches, with three or four CRS328-24P switches connected to them and deployed throughout the factory. This setup has been running without issues for several years.
Earlier this spring, we upgraded to the latest RouterOS version available at the time (7.16). The next day, we saw severe packet loss within the MLAG stack and on the 802.3ad bond towards a server. We rolled back to a previous version, which resolved that particular issue.
Around the same time (possibly coincidentally), some clients began to intermittently lose “internet” connectivity once or twice a day. All clients are Windows machines. Windows drops the link entirely and reports no network access; after 1–2 minutes, connectivity is restored without any user intervention.
We’ve tried moving affected clients to different switches, but the problem persists regardless of switch port or switch model.
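To check whether several clients drop at the same moment (which would point more towards the switching side than towards a single NIC or driver), we are considering running a small probe on a few affected machines and comparing the timestamps. The following is only a rough sketch, not something we have deployed: the function names, the one-second interval, and the three-failure threshold are our own assumptions, and the `ping` flags are the Windows ones.

```python
import subprocess
import time
from datetime import datetime


def outage_windows(samples, min_failures=3):
    """Group consecutive failed probes into (start, end, count) windows.

    `samples` is a list of (timestamp, ok) pairs in chronological order.
    Runs shorter than `min_failures` are ignored as ordinary packet loss.
    """
    windows = []
    run = []
    for ts, ok in samples:
        if ok:
            if len(run) >= min_failures:
                windows.append((run[0], run[-1], len(run)))
            run = []
        else:
            run.append(ts)
    # Close a run that is still open when the samples end.
    if len(run) >= min_failures:
        windows.append((run[0], run[-1], len(run)))
    return windows


def probe(host):
    """Send one ICMP echo using the Windows ping flags (-n count, -w timeout in ms).

    Caveat: Windows ping may exit with 0 on a "Destination host unreachable"
    reply, so treat the result as a hint rather than proof.
    """
    result = subprocess.run(
        ["ping", "-n", "1", "-w", "1000", host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def collect(host, duration_s, interval_s=1.0):
    """Probe `host` for `duration_s` seconds and return (timestamp, ok) samples."""
    samples = []
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        samples.append((datetime.now(), probe(host)))
        time.sleep(interval_s)
    return samples
```

Running something like `outage_windows(collect("192.168.88.1", 8 * 3600))` on two or three clients (the address is a placeholder for the default gateway) and lining up the resulting windows should show whether the 1–2 minute drops are simultaneous across switches or isolated per machine.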
Does anyone have an idea what could be causing this? Could it be client-related or something in the switching infrastructure?