CRS32824P4SRM
Surely you mean the CRS328-24P-4S+RM? If you want to abbreviate, say "CRS328" when it doesn't matter whether you mean the 20S or 24P version, or "CRS328-24P" when it does, as when PoE is involved. That is, use the shortest meaningful prefix. That can go down to "CRS3xx" or "CRS" in some contexts.
Spelling it out fully can be helpful. Those dashes and pluses are meaningful to those of us who know how to read MikroTik's product naming scheme. In this particular case…
TWO 1G RTSP UPLINKS -> CRS3171G16S+
…an informed reading of the product naming scheme calls into question why you aren't using 10G fiber links between those switches. You've got four SFP+ ports in the 328 and 16 in the 317. Why copper, in that situation?
Do you really mean "RTSP" here, presumably part of AES67, to provide media stream negotiation?
I ask for two reasons.
One, you use "RSTP" elsewhere in your message. They're very much not the same thing!
Two, when networking geeks like me see "two 1G uplinks" between two switches, we immediately think of redundant links and the need for a technology like RSTP to prevent loops. If that's what's going on, RSTP election fights could explain your symptom. I'm normally a fan of enabling RSTP everywhere, but in a hard-real-time application like this one, I think you can't afford even "rapid" spanning-tree negotiations. For such a simple topology, loop-protect should suffice.
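As a rough sketch of what that might look like on each switch, assuming the two uplinks live on ether23 and ether24 and the bridge is named "bridge" (all three names are placeholders of mine, not taken from your config):

```
# Hypothetical names: substitute your own bridge and uplink ports.
/interface/bridge
# Turn off spanning tree on the bridge so there is no root election at all.
set bridge protocol-mode=none
/interface/ethernet
# Let RouterOS send loop-protect probes on each uplink; a port that
# sees its own probe come back gets disabled for loop-protect-disable-time.
set ether23 loop-protect=on
set ether24 loop-protect=on
```

The send interval and disable time are left at their defaults here. Check them against how long a loop you can tolerate before trusting this in production.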
If the reason for two links is some hope of redundancy, what real-world conditions do you achieve that goal under? How often do you get a single port dying, and not an 8-port cluster, or the whole switch? How often does a cable cut affect one cable and not the others that run in the same cable chase?
The only case I can see where it might help is the highly unlikely one where someone manually unplugs one of the two cables while the system is "live." If that's a substantial risk in your situation, are you employing untrained cable monkeys, or what? "Don't do that" should go without saying.
If I can't talk you out of redundant 1G links and RSTP, say because you're using the SFP+ interfaces for something else and you must be able to interoperate with non-MikroTik switches that don't understand RouterOS's loop protection scheme, at least configure the bridge priorities to force the root election to go the same way every time, as long as both links are present.
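Something along these lines, run on whichever switch you want to win the election. I'm assuming that's the CRS317 and that its bridge is named "bridge"; both are guesses on my part:

```
# On the switch that should always be root (assumed here: the CRS317).
# "bridge" is a placeholder name; lower priority values win the election.
/interface/bridge
set bridge protocol-mode=rstp priority=0x1000
```

Leave the other switch at the default priority of 0x8000 so it never has a claim on root while the preferred switch is up.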
If the problem isn't RSTP, then you might be running into the SFP port flapping issue in the CRS328, fixed in 7.4beta2. Alas, that fix isn't in any stable release yet, but I've been running the beta on my CRS328 for 3+ weeks now, and it seems fine. My uptime is back down to less than a day due to the beta5 release earlier today, but the logs have been quiet since.
see attached device error log
Try again…
I have ben looking into DSCP QOC
What's "QOC"? Do you mean QoS?
in all honesty it looks like a ball of wax on the current Mikrotik firmware.
Really? One switch rule should suffice:
/interface/ethernet/switch/rule
add dscp=0 ports=ether1,ether2,ether3… rate=100M switch=switch1
That is, everything on the potentially-conflicting ports with a DSCP value of zero, which is to say untagged traffic, is limited to 100 Mbit/s: 10% of the maximum throughput of the presumed 1G links on those ports, so that it'd take ten of them in concert to swamp a 1G uplink. In the presumably more common case of just one bad actor, 90% of the uplink rate remains available to traffic carrying DSCP tags.
Obviously there is plenty of room for adjustment here. Apply with care and sensibility for your local needs.
Incidentally, I can't help but point out that the scheme as-presented is what you'd get for free by using 10G fiber uplinks between the switches: it'd take ten bad actors on 1G ports to swamp the uplink in that case. Even if you need no more than 1G between any two endpoints, flow aggregation may end up being a substantial benefit in your situation.