Files copied have different control sums

zhenb · Wed Jan 19, 2022 12:35 pm

A programmer in my office says that sometimes, when he copies a big file over our network, the resulting file has the same size but a different checksum than the source. I'm trying to find out what could go wrong. We have two MikroTik switches (CRS328-24P-4S+ and CRS312-4C+8XG) connected via an optical 10G link. The file (say 20+ GB) was copied from a Windows Server (5 Gbit link) to a share on another Windows Server (5 Gbit link) using Far Manager, then CRC64 checksums of the files were calculated on both sides. When the sums did not match, the file was copied again. He has run into this problem at least twice in the last few months. The MTU on both switches is set to 1500. I don't see unusual errors in the statistics of the involved interfaces.
I couldn't reproduce the problem, and I'm more inclined to blame drivers, Far, Windows caches, the HDD, etc. than the switches. Still, is it possible that files could somehow be altered in the process of copying?

smyers119 · Thu Jan 20, 2022 3:54 am

This is not going to be a network problem. Assuming they are not using their own proprietary network stack, the most likely source is going to be in layer 7.

sindy · Thu Jan 20, 2022 1:04 pm

Many L7 protocols rely on the network to properly calculate checksums while sending and receiving individual packets. So if this fails somewhere because the corrupt data happen to have the same checksum as the correct data, and the corrupt packet therefore does not get dropped by the first receiving device, the upper layers have no means to detect the corruption. That's the very reason why the checksum of the file is calculated at the source and at the destination after copying.

Having said that, I can see several directions for further investigation:
1) Calculate the checksum of the source file before copying; if the checksum of the copy does not match, calculate it again at the source. This will filter out the case where the file got modified during copying. It does happen. I'm not saying it is your case, but you haven't made any statement about this.

2) If individual packets get corrupted, it would be a miracle if all of the corrupt ones had a correct checksum. So check the counters on all interfaces along the way; if they show nothing, run Wireshark or tcpdump at the source or destination machine while copying, and look for retransmissions. Retransmissions will indicate that packets have been dropped due to a wrong checksum even though the interface counters do not show it (which would be a bug in the counters, but that can happen).

3) If you can see no retransmissions, either some device on the path ignores wrong checksums and lets the packets through anyway, which would be the worst case to analyze, or the application feeds wrong data to the network. Checksum verification in Wireshark may be misleading when the capture is taken at the endpoint machine, so don't rely on that. To tell the two possibilities apart, you would have to capture the same file transfer at both the source and the destination, make a binary comparison of the files to identify the offsets of the error(s), and then compare the contents of the packet carrying the corrupt bytes in the two captures. If the contents of this packet in the capture from the source match those in the capture from the destination, it is an application layer problem; otherwise it is a network one.
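
Steps 1 and 3 can be sketched in Python, assuming Python is available on both servers. This is only an illustration: the function names are made up for this example, and SHA-256 is used instead of the CRC64 mentioned above because Python's standard library has no CRC64. Any checksum works as long as both sides use the same one.

```python
import hashlib

CHUNK = 1024 * 1024  # read in 1 MiB blocks so a 20+ GB file never sits in memory


def file_digest(path):
    """Return the SHA-256 hex digest of a file, read block by block."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(CHUNK), b""):
            h.update(block)
    return h.hexdigest()


def diff_offsets(path_a, path_b, limit=10):
    """Return up to `limit` byte offsets at which the two files differ."""
    offsets = []
    pos = 0
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while len(offsets) < limit:
            a = fa.read(CHUNK)
            b = fb.read(CHUNK)
            if not a and not b:
                break  # both files fully read
            if a != b:
                common = min(len(a), len(b))
                for i in range(common):
                    if a[i] != b[i]:
                        offsets.append(pos + i)
                        if len(offsets) >= limit:
                            break
                if len(a) != len(b):
                    # One file ends early: report where, then stop.
                    if len(offsets) < limit:
                        offsets.append(pos + common)
                    break
            pos += len(a)
    return offsets
```

Run `file_digest` on the source before the copy, on the destination after it, and on the source again if the two differ. If the source digest changed in between, the file was modified during copying. Otherwise `diff_offsets` gives the positions to look up in the two packet captures.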

Assuming that you have no network analyzer capable of intentionally sending Ethernet frames with wrong checksums, the only way to find the guilty box is the good old method of splitting the path into halves. First, repeatedly copy the file via the complete network path until the error shows up for the third time, in order to get a rough idea of how frequently the error happens. Then move the destination machine to the middle of the network path and copy the same number of times again. If all of the attempts succeed this time, this half of the path is working properly, so connect some source machine to the middle point of the network, put the destination one back at its original place, and test again. In the end, you'll end up with the source machine and the destination machine connected to the same middlebox, or you'll find that the destination machine itself is guilty.
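
How many clean copies are enough to call one half of the path good is a matter of probability. A rough sketch, assuming every copy fails independently with the same probability (an assumption; nothing in this thread gives a real failure rate, and the figures below are hypothetical):

```python
def prob_all_clean(p_fail, n_copies):
    """Chance that n_copies transfers all succeed on a path that is still faulty."""
    return (1.0 - p_fail) ** n_copies


def copies_needed(p_fail, max_false_clean=0.05):
    """Smallest number of clean copies after which a faulty path
    would have passed by pure luck with probability <= max_false_clean."""
    if not 0.0 < p_fail <= 1.0:
        raise ValueError("p_fail must be in (0, 1]")
    n = 0
    while prob_all_clean(p_fail, n) > max_false_clean:
        n += 1
    return n


# Hypothetical figures: 3 bad copies seen in 100 attempts over the full path.
p = 3 / 100
print(prob_all_clean(p, 100))  # chance that 100 repeats look clean by luck
print(copies_needed(p))        # repeats needed to push that chance to 5% or less
```

With these made-up figures, repeating the same number of copies on a half that is still faulty would come out clean by chance roughly 5% of the time, so a single clean run is a strong hint rather than proof.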

zhenb · Thu Jan 20, 2022 2:54 pm

Thanks, that makes sense.

smyers119 · Thu Jan 20, 2022 11:31 pm

Many L7 protocols rely on the network to properly calculate checksums while sending and receiving individual packets. .......

There is a slim chance that this is a network issue (yes, it is possible, but not probable). There are multiple redundant checks throughout the layers. On layer 2 you have the cyclic redundancy check, on layer 3 a checksum, and on layer 4 another checksum. So the chance that your switches are corrupting a frame that has to be verified three times before it hits the application is, again, slim.

mkx · Fri Jan 21, 2022 12:04 am

Actually, the L3 checksum is unusable in this scenario. In IPv4 there is a checksum, but it covers only the IP header, not the payload. In IPv6 the checksum is omitted completely, and L4 is expected to cover it. TCP and UDP do include checksums (the UDP checksum is optional in IPv4 but mandatory in IPv6), which cover a pseudo-header containing the L3 addresses, the L4 header and the payload (excluding the checksum field itself, for obvious reasons). So that leaves the Ethernet FCS (CRC) and the TCP checksum, which both have to miss the error at the same time. That is quite unlikely, especially as the two use different methods to calculate the checksum, which makes it even less likely for an error to produce the same checksum as the unaltered frame / packet.
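
The point about the different methods can be illustrated with a small Python sketch. It assumes the 16-bit ones' complement sum described in RFC 1071 for the TCP checksum and uses `zlib.crc32` as a stand-in for the Ethernet FCS (both are CRC-32, but this is not a byte-exact FCS computation). The sample bytes are made up.

```python
import zlib


def internet_checksum(data: bytes) -> int:
    """16-bit ones' complement checksum in the style of RFC 1071."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]  # big-endian 16-bit word
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)  # fold the carries back in
    return ~total & 0xFFFF


original = b"\x12\x34\x56\x78"
swapped = b"\x56\x78\x12\x34"  # same 16-bit words in a different order

# The additive checksum does not care about word order, so it misses this error.
print(internet_checksum(original) == internet_checksum(swapped))

# The CRC depends on bit positions, so the same error changes its value.
print(zlib.crc32(original) == zlib.crc32(swapped))
```

An error that slips past one of the two checks is therefore not automatically invisible to the other.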

smyers119 · Fri Jan 21, 2022 12:19 am

The entire IP header is encapsulated in the frame and is therefore still corruptible, but your point is noted. I did not know IPv6 eliminated that checksum, so thanks for the lesson.
