Thu Jan 20, 2022 1:04 pm
Many L7 protocols rely on the network to properly calculate and verify checksums while sending and receiving individual packets. So if this fails somewhere because the corrupt data happen to have the same checksum as the correct data, so that the corrupt packet is not dropped by the first device that receives it, the upper layers have no means to detect the corruption. That's the very reason why people calculate the checksum of the file at the source and at the destination once the copying is done.
Having said that, I can see several directions for further investigation:
1) calculate the checksum of the source file before copying; if the checksum of the copy does not match, calculate it again at the source. This will filter out the case where the file was modified during copying. It does happen; I'm not saying it is your case, but you haven't said anything about this.
2) if individual packets get corrupted, it would be a miracle if every corrupted one still had a correct checksum. So check the counters on all interfaces along the way; if they show nothing, run Wireshark or tcpdump at the source or destination machine while copying, and look for retransmissions. Retransmissions will indicate that packets have been dropped due to a wrong checksum even though the interface counters do not show it (which would be a bug in the counters; it can happen).
3) if you can see no retransmissions, either some device on the path ignores wrong checksums and lets the packets through anyway, which would be the worst case to analyze, or the application feeds wrong data to the network. Checksum verification in Wireshark may be misleading when the capture is taken at the endpoint machine, so don't rely on that. To tell the two possibilities apart, you would have to capture the same file transfer at both the source and the destination, make a binary comparison of the files to identify the offsets of the error(s), and then compare the contents of the packet carrying the corrupt bytes in the two captures. If the contents of this packet in the capture from the source match those in the capture from the destination, it is an application layer problem; otherwise it is a network one.
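The file-level parts of steps 1) and 3) are easy to script. Below is a minimal Python sketch, assuming the source file and the copy can both be read from one machine (for example after fetching the copy back over a path you trust). The function names file_sha256 and differing_ranges are my own, and SHA-256 is just one reasonable choice of hash, not something your setup requires.

```python
import hashlib


def file_sha256(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def differing_ranges(path_a, path_b, chunk_size=1 << 20):
    """Return (start, end) byte ranges, end exclusive, where two files differ.

    If one file is longer, the extra tail is reported as a differing range.
    """
    ranges = []
    offset = 0      # file offset of the current chunk
    start = None    # start of the differing run currently open, if any
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            a = fa.read(chunk_size)
            b = fb.read(chunk_size)
            if not a and not b:
                break
            n = max(len(a), len(b))
            if a != b:
                # Only walk byte by byte through chunks that actually differ.
                for i in range(n):
                    same = i < len(a) and i < len(b) and a[i] == b[i]
                    if not same and start is None:
                        start = offset + i
                    elif same and start is not None:
                        ranges.append((start, offset + i))
                        start = None
            elif start is not None:
                # A run that reached the previous chunk's end stops here.
                ranges.append((start, offset))
                start = None
            offset += n
    if start is not None:
        ranges.append((start, offset))
    return ranges
```

Run file_sha256 on the source before and after the copy to rule out a file that changed mid-transfer. The ranges returned by differing_ranges tell you where in the byte stream to look for the packet carrying the damaged bytes in both captures.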
Assuming that you have no network analyzer capable of intentionally sending Ethernet frames with wrong checksums, the only way to find the guilty box is the good old method of splitting the path into halves. First, repeatedly copy the file via the complete network path until the error shows up for the third time, in order to get a rough idea of how frequently the error happens. Then move the destination machine to the middle of the network path and copy the same number of times again. If all of the attempts succeed this time, this half of the path is working properly, so connect some source machine to the middle point of the network, put the destination one back at its original place, and test again. In the end, you'll end up with the source machine and the destination machine connected to the same middlebox, or you'll find that the destination machine itself is guilty.
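As a rough sanity check on "copy the same number of times again": the sketch below estimates how likely it is that a half of the path passes every copy purely by luck while still containing the fault. It assumes each copy fails independently with the rate seen on the full path, which is my simplification and may not hold for bursty faults; the function name chance_of_false_clean is hypothetical.

```python
def chance_of_false_clean(failures_seen, copies_made, copies_in_retest=None):
    """Probability that every retest copy succeeds although the fault is present.

    Assumes independent failures at the rate failures_seen / copies_made
    (an assumption, not a measured property of any particular network).
    """
    if copies_made <= 0 or not 0 <= failures_seen <= copies_made:
        raise ValueError("need 0 <= failures_seen <= copies_made, copies_made > 0")
    failure_rate = failures_seen / copies_made
    # By default the retest repeats the same number of copies as the baseline.
    n = copies_made if copies_in_retest is None else copies_in_retest
    return (1.0 - failure_rate) ** n


if __name__ == "__main__":
    # Example numbers: the third error showed up on the 60th copy.
    print(f"{chance_of_false_clean(3, 60):.3f}")
```

With three failures in the baseline, repeating the same number of copies leaves at most about a 5% chance of wrongly clearing a faulty half under this assumption; doubling the number of copies in the retest lowers that considerably.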