A programmer in my office says that sometimes when he copies a big file over our network, the resulting file is the same size but has a different checksum than the source. I’m trying to find out what could go wrong. We have two MikroTik switches (CRS328-24P-4S+ and CRS312-4C+8XG) connected via an optical 10G link. The file (say 20+ GB) was copied from a Windows Server (5 Gbit link) to a share on another Windows Server (5 Gbit link) using Far Manager, then CRC64 checksums of the files were computed on both sides. When the sums didn’t match, the file was copied again. He has run into this problem at least twice in the last few months. MTU on both switches is set to 1500. I don’t see unusual errors in the statistics of the involved interfaces.
I couldn’t reproduce the problem, and I’m rather inclined to blame drivers, Far, Windows caches, the HDDs, etc. than the switches, but still: is it possible that files could somehow be altered in the process of copying?
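For reference, the verification workflow described above can be sketched in a few lines of Python. This is only an illustration of the idea, not what Far Manager does: it stream-hashes both copies and compares the digests (SHA-256 here in place of CRC64; the UNC paths are hypothetical).

```python
import hashlib

def file_digest(path: str, chunk: int = 1 << 20) -> str:
    """Stream-hash a file so that a 20+ GB file never has to fit in RAM."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while block := f.read(chunk):
            h.update(block)
    return h.hexdigest()

# Hypothetical usage on both ends of the copy:
# if file_digest(r"\\server1\share\big.bin") != file_digest(r"\\server2\share\big.bin"):
#     print("copy is corrupt - recompute the source hash to rule out modification")
```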
This is not going to be a network problem. Assuming they are not using their own proprietary network stack, the most likely source is going to be in layer 7.
Many L7 protocols rely on the network to properly verify checksums while sending and receiving individual packets. So if this fails somewhere because the corrupted data happens to have the same checksum as the correct data, a corrupt packet does not get dropped by the first receiving device, and the upper layers have no means to detect the corruption. That’s the very reason for checksumming the file at the source and at the destination after copying.
Having said that, I can see several directions for further investigation:
- calculate the checksum of the source file before copying; if the checksum of the copy does not match, calculate it again at the source. This will filter out the case where the file got modified during copying. It does happen - I’m not saying it is your case, but you haven’t said anything about this.
- if individual packets get corrupted, it would be a miracle if all of the corrupt ones had a correct checksum. So check the counters on all interfaces along the way; if they show nothing, run Wireshark or tcpdump on the source or destination machine while copying, and look for retransmissions. Retransmissions would indicate that packets are being dropped due to a wrong checksum even though the interface counters do not show it (which would be a bug in the counters; it can happen).
- if you see no retransmissions, either some device on the path ignores wrong checksums and lets the packets through anyway (which would be the worst case to analyze), or the application feeds wrong data to the network. Checksum verification in Wireshark may be misleading when the capture is taken on the endpoint machine, so don’t rely on that. To distinguish the two possibilities, you would have to capture the same file transfer at both the source and the destination, make a binary comparison of the files to identify the offsets of the error(s), and then compare the contents of the packet carrying the corrupt bytes in the two captures. If the contents of that packet in the capture from the source match the one from the destination, it is an application layer problem; otherwise it is a network one.
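The binary-comparison step above is easy to script. A minimal sketch (paths are hypothetical) that reports the byte offsets where two same-size copies of a file differ, so you know which packet(s) to look at in the captures:

```python
def diff_offsets(path_a: str, path_b: str, chunk: int = 1 << 20):
    """Yield the absolute offsets of differing bytes in two same-size files."""
    offset = 0
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            a, b = fa.read(chunk), fb.read(chunk)
            if not a and not b:
                break
            for i, (x, y) in enumerate(zip(a, b)):
                if x != y:
                    yield offset + i
            offset += len(a)

# Hypothetical usage:
# bad = list(diff_offsets("source.bin", "copy.bin"))
# print(f"{len(bad)} differing bytes, first at offset {bad[0]:#x}")
```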
Assuming that you have no network analyzer capable of intentionally sending Ethernet frames with wrong checksums, the only way to find the guilty box is the good old method of splitting the path into halves. First, repeatedly copy the file over the complete network path until the error shows up for the third time, in order to get a rough idea of how frequently the error happens. Then move the destination machine to the middle of the network path and copy the same number of times again. If all of the attempts succeed this time, this half of the path is working properly, so connect some source machine to the middle point of the network, put the destination one back in its original place, and test again. In the end, you’ll either have the source machine and the destination machine connected to the same middlebox, or you’ll find that the destination machine itself is guilty.
Thanks, that makes sense.
There is a slim chance that this is a network issue (yes, it is possible, but not probable). There are multiple redundant checks throughout the layers: on layer 2 you have a cyclic redundancy check, on layer 3 a checksum, and on layer 4 a checksum. So the chance that your switches are corrupting frames that have to be verified three times before they hit the application is, again, slim.
Actually, the L3 checksum is unusable in this scenario. IPv4 has a checksum, but it covers only the IP header, not the payload. In IPv6 the checksum is omitted completely; L4 is expected to cover it… and indeed TCP and UDP include checksums (the UDP checksum is optional in IPv4 but mandatory in IPv6), which cover a pseudo-header (including the L3 addresses) and the payload (excluding the checksum field itself, for obvious reasons). But there’s still the Ethernet FCS (CRC) and the TCP checksum, which would both have to miss the same error at the same time, which is quite unlikely - especially since the two use different methods to calculate the checksum, so it’s even less likely for an error to produce the same checksum as the unaltered frame/packet.
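To illustrate why the two checks have different blind spots: the TCP/UDP checksum is a 16-bit ones’-complement sum (RFC 1071), while the Ethernet FCS is a CRC-32. A minimal Python sketch (illustration only; real hardware computes these on the wire) - note that reordering 16-bit words fools the ones’-complement sum, because addition commutes, but not the CRC:

```python
import zlib

def internet_checksum(data: bytes) -> int:
    """16-bit ones'-complement sum as used by IPv4/TCP/UDP (RFC 1071)."""
    if len(data) % 2:
        data += b"\x00"                              # pad odd-length data
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
        total = (total & 0xFFFF) + (total >> 16)     # fold the carry back in
    return ~total & 0xFFFF

payload = b"\x12\x34\x56\x78"
swapped = b"\x56\x78\x12\x34"                        # same words, reordered
print(internet_checksum(payload) == internet_checksum(swapped))  # True
print(zlib.crc32(payload) == zlib.crc32(swapped))                # False
```

So an error that happens to preserve one checksum is still extremely likely to break the other, which is the point made above.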
The entire IP header is encapsulated in the frame and therefore still corruptible, but your point is noted. I did not know IPv6 eliminated that checksum, so thanks for the lesson.
Hello everyone, I see that the topic is old, but I was overtaken by a similar incident. Forgive me in advance for the long story, but it is worth it (and it is not over yet, by the way). Over the course of a year there were three cases where users reported file corruption, and in all cases we could not trace a pattern, since the damaged files did not have modified timestamps (even when a file had lain untouched for two years, for example, it turned out to be damaged).
The paradox was that even with a fairly good backup system (3 levels: shadow copies, operational copies to network storage, long-term copies to external media), it turned out that the copies in the archive were damaged too (a 7-Zip multi-volume solid archive, which should generally rule out damage occurring inside it). For example, on Monday, December 16, 2024, a user reported that a file he had successfully opened on Friday, December 13, was now damaged, yet all of its timestamps were earlier than the incident date (the file was created in 2023 and modified in November 2024), and when we took the file from an archive made two weeks earlier (from December 6, 2024), IT TURNED OUT that the file there was also damaged.
At first we suspected the Microsoft DFS system, then the SAN and storage, but we found no problems there, and then we noticed that these problems occurred only with files of a certain format (Autodesk ArtCAM).
Sometimes, when I unpacked the archives on another machine, outside the server room and its network, the files from the archive turned out to work, so we restored them selectively into the shared folders. Then I started comparing the damaged files with the files from the archive (the restored ones, so to speak) in a HEX editor and found differences between them (in the attached screenshot, the damaged file is on the left, the correct one on the right). What caught my eye was that it looked like simple bit inversion, but not always.
The strangest thing was that it fixed itself: the files miraculously started opening and working as expected (from the archives, from the shadow copies, and from the shared folder at the same time, i.e. these files were definitely not copied back from the archive). We found experimentally that restarting the network equipment helps (the main 10-gigabit router in the server room is a CCR2004, and the two downstream switches are a CRS354 and a CRS328, connected to each other by optics through WDM fiber transceivers). We also noticed that at such moments the network responsiveness and the transfer speed between the switches drop, which has already been localized to a problem with the optical connection.
At the moment we plan to replace the optical transceivers, but the root of the problem has not yet been found. Unfortunately, it occurs rarely enough that it is hard to identify an objective cause.
But I can say for sure that the problem is specifically in the network transmission path, and it only manifests itself with certain data patterns; maybe someone will be able to discern a pattern in the screenshot in the attachment.
Several screenshots for analysis




I love invisible screenshots, they blend very well with the board theme …
EDIT: Ah, ok, now they show.
The original pattern seems to be 4 repeating sets of 12 triplets, with 11 triplets of 00FFFF and one of 00FF17.
In a hex view with 16 columns, the global pattern repeats every 9 lines: 16x9 = 144 bytes, and 144/3/12 = 4 sets.
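A small sketch reconstructing this pattern (assumed layout: the 00 FF 17 triplet sits at the same position in every 12-triplet set). If the sets are identical, the byte-level period is only 36, but a 16-column hex view only lines up again after lcm(36, 16) = 144 bytes, which is exactly the observed 9 lines:

```python
from math import lcm

# one set: 11 triplets of 00 FF FF plus one 00 FF 17 (assumed at the end)
one_set = b"\x00\xff\xff" * 11 + b"\x00\xff\x17"   # 12 triplets = 36 bytes
print(len(one_set))        # 36

# the visual period in a 16-byte-wide hex dump:
print(lcm(36, 16))         # 144 = 16 * 9, i.e. every 9 lines
```

So the “every 9 lines” observation is consistent with a plain 36-byte period; the hex view’s 16-byte width is what stretches it to 144.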
The corruption on the file is strange (see image).

Yes, but based on this data alone, without knowing how the internal algorithms work when transmitting a packet, it is difficult to find the cause of the data corruption.
PS: once, in the early 2000s, when I was a student, I came across an article about a file that was impossible to write to a CD: a certain sequence of bytes (bits) which, after being processed by the drive’s “brains”, looked like the signature of the start of a data sector on a CD, thereby hanging the drive while it recorded such a sequence. Here is the link (in Russian); if you are curious, read it - I think something similar happens here under certain circumstances. (https://www.ixbt.com/optical/magia-chisel.shtml)
Yep, but what I mean is that I am failing to see a “corruption pattern”.
on line 03C540 there is a missing 00 byte (thus the data is shifted by one byte): the 4 consecutive FF bytes at positions 8-11 suggest that.
on line 03C560 a whole triplet is missing
everything seems shifted up by 4 bytes until line 03C5C0, where 4 bytes magically appear out of nowhere and the “17” pattern synchronizes again.
A 00 vs. FF or vice versa could be an XOR or a bit flip, but 17 remains the same.
The next difference looks like a completely different pattern: bytes seem to be inserted and everything is shifted 16 bytes down.
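The kind of drop/insert shifts described above can be localized automatically instead of by eye. A sketch using Python’s difflib on a good and a (simulated) corrupted copy; the data here is made up purely to demonstrate the idea:

```python
from difflib import SequenceMatcher

good = b"\x00\xff\xff" * 8                 # a clean periodic sample
bad = good[:10] + good[11:] + b"\xaa"      # simulate one dropped byte plus junk

# Non-"equal" opcodes mark the regions where bytes were dropped, inserted,
# or replaced - i.e. exactly where the two hex dumps go out of sync.
for tag, i1, i2, j1, j2 in SequenceMatcher(None, good, bad).get_opcodes():
    if tag != "equal":
        print(f"{tag}: good[{i1:#06x}:{i2:#06x}] vs bad[{j1:#06x}:{j2:#06x}]")
```

On real files this pinpoints the shift boundaries (like 03C540 and 03C5C0 above) without scanning the dump manually.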

Thanks for such a deep dive into this issue! I tried to train an AI to search for these patterns in several files, but so far without success.
Well, for now, judging by the topic starter’s message, we have a similar problem on similar equipment (I have ONLY MikroTik switches and routers in my network), and I found it interesting that googling my problem led me to the MikroTik forum.