The network is very boring and very flat – no VLANs, no routing, no jumbo frames, etc. Also, no IGMP.
One of the libraries we’re using is based on Lightweight Communications and Marshalling (LCM), which uses UDP multicast to broadcast data. After figuring out some minor details we have good communication between all nodes.
The majority of LCM packets are small (less than 1500 bytes). However, a few message types produce larger messages. These are IP-fragmented and generally work.
But one message type in particular has one of its packets dropped consistently – the third of four fragments. Moreover, if we run Wireshark on other clients on the network … including both clients attached to the CSS610 and the embedded PC on the other end of the fiber … we see the same behavior – the third packet is dropped consistently on any of those machines.
The message contents are constant, and I get repeatable results, though I haven’t tested if/how the behavior changes if I make minor changes to the packet contents. What I think I’m seeing:
Of all of the fragmented LCM UDP multicast messages, only this one message is “damaged” in transit
This message is always damaged in transit and the third packet is always the one dropped
This behavior of the third packet disappearing can be observed on Linux and Mac computers attached to the CSS610 and on the embedded Linux PC
After all that, I reduced the MTU on the transmitting computer from 1500 to 1400 and the issue appeared to resolve itself. ???
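For reference, the effect of the MTU change on fragment boundaries can be worked out with a few lines of Python. This is only a sketch: the 5000-byte UDP payload is a made-up size that happens to give four fragments at MTU 1500 (the real message size isn't stated here), and it assumes a 20-byte IPv4 header with no options.

```python
from typing import List, Tuple

IP_HEADER = 20   # bytes; assumes IPv4 without options
UDP_HEADER = 8   # bytes


def fragment_layout(udp_payload_len: int, mtu: int) -> List[Tuple[int, int]]:
    """Return (offset, length) of each fragment's data within the IP payload."""
    total = UDP_HEADER + udp_payload_len
    if total <= mtu - IP_HEADER:
        return [(0, total)]  # fits in one packet, no fragmentation
    # Every fragment except the last must carry a multiple of 8 bytes.
    per_fragment = (mtu - IP_HEADER) // 8 * 8
    layout = []
    offset = 0
    while offset < total:
        length = min(per_fragment, total - offset)
        layout.append((offset, length))
        offset += length
    return layout


# Hypothetical 5000-byte payload, not the real message size.
for mtu in (1500, 1400):
    print(mtu, fragment_layout(5000, mtu))
```

With these assumed numbers the third fragment carries IP payload bytes 2960 to 4439 at MTU 1500, but bytes 2752 to 4127 at MTU 1400. So the MTU change alters both the frame lengths and which bytes travel in which frame. Either could matter, and this sketch can't tell which.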
As I said, it’s a weird issue. I don’t have a clue where to start looking for a real solution… help!?!
So to double check, when you talk about Wireshark captures, is the fragment missing already in the sending direction, or only as late as in the receiving one?
The packet is present in a capture on the sending (Linux) machine, but is not present on any of the three available receiving machines (one each Linux and Mac “client machines” and the embedded PC on the end of the fiber link).
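A comparison like this can be made mechanical once each capture is reduced to a list of (IP identification, fragment offset) pairs, for example by exporting those two fields from Wireshark. The sketch below assumes that export has already been done; the sample values are invented for illustration and are not taken from the real captures.

```python
from typing import Iterable, List, Tuple

# (IP identification, fragment offset in bytes)
Fragment = Tuple[int, int]


def missing_fragments(sent: Iterable[Fragment],
                      received: Iterable[Fragment]) -> List[Fragment]:
    """Return fragments seen in the sender's capture but not the receiver's."""
    seen = set(received)
    return sorted(f for f in set(sent) if f not in seen)


# Invented example: one datagram with four fragments, the third never arrives.
sender_capture = [(0x1A2B, 0), (0x1A2B, 1480), (0x1A2B, 2960), (0x1A2B, 4440)]
receiver_capture = [(0x1A2B, 0), (0x1A2B, 1480), (0x1A2B, 4440)]

print(missing_fragments(sender_capture, receiver_capture))
```

Running the same comparison against each receiving machine's capture shows whether the same offset goes missing everywhere.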
Okay. The only “real solution” I can think of is to identify the network element that drops the frame, and then send the Wireshark capture of all 4 fragments to the manufacturer of that device.
So connect one of the clients directly to the output of the sending machine; if the fragment is missing on reception there, take a different type of client to double-check that it is not a problem of the recipient before blaming the sender. If the fragment is there on the recipient, it must be one of the switches that drops it. In that case, connect the source back to its switch and connect the client to another port of that same switch. If that works, move the sender to the other switch and try there. If both switches forward it on gigabit ports, it must be the 10 G link.
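The elimination steps above can be written down as a small decision function, which may help keep track of which test comes next. This is just an illustrative sketch: the function name, the argument names and the returned strings are all made up here, and `None` means "not tested yet".

```python
from typing import Optional


def locate_fault(direct_ok: bool,
                 other_client_direct_ok: Optional[bool] = None,
                 first_switch_ok: Optional[bool] = None,
                 second_switch_ok: Optional[bool] = None) -> str:
    """Map test outcomes (True = all fragments received) to a suspect or next step."""
    if not direct_ok:
        # Fragment lost even on a direct cable: sender or recipient.
        if other_client_direct_ok is None:
            return "repeat the direct test with a different type of client"
        return "first recipient" if other_client_direct_ok else "sender"
    # Direct cable works, so something in between drops the fragment.
    if first_switch_ok is None:
        return "test between two ports of the sender's own switch"
    if not first_switch_ok:
        return "sender's own switch"
    if second_switch_ok is None:
        return "move the sender to the other switch and test there"
    if not second_switch_ok:
        return "other switch"
    return "10G link between the switches"


print(locate_fault(direct_ok=True, first_switch_ok=False))
```

For example, a clean direct-cable test followed by a loss between two ports of one switch points at that switch.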
I don’t use anything with SwOS so I have no idea whether there are some error counters; if there are, they might indicate CRC errors.
To check which switch is guilty, you’d need to use VLANs so that you could interconnect two ports on the same switch, but the tagging/untagging could affect the behaviour. So maybe try that first: configure the link between the two switches as a trunk (that is, the VLAN for which all other ports are access ports is sent tagged on the ports linking the switches together).
As an initial test, I’ve done packet captures of the relevant UDP packet, attached as .pcapng / Wireshark files.
In all cases:
transmitting computer is Ubuntu 18.04 / Intel i210 NUC
receiving computer is an older MacBook Pro OS X 11.6.5 w/ Thunderbolt ethernet adaptor
MTU 1500
In the “direct connection” scenario, the two computers are connected by a Cat 6 cable
In the “CSS610” scenario, the two computers are connected via 1G ports on a CSS610-8G-2S+; no 10G was used for this testing
As shown, the receiver gets 4 out of 4 packets in the direct connection scenario, and 3 out of 4 when going through the switch. This effect has been reproduced with other Linux clients (though not captured).
I’d say this is enough to open a ticket with MikroTik. Don’t get discouraged by the statement that you are only entitled to support within the first two weeks from purchase; they just don’t want to deal with questions like “how do I change the IP address on LAN port”. Real issues do get handled.