So all in all you were right and it is an MTU-related issue, but a weirdly intermittent one, as otherwise your downloads would have to break much sooner. As said, when TCP is used to send a file, it tries to slice it into pieces that max out the smallest available MTU on the path. There could be an exception to this rule if the receiving side constantly reported a small receive window, i.e. the remaining space in the receive buffer of the TCP stack which the payload of incoming packets can fill. If the receiving application using that TCP socket is slow to fetch the data from the buffer, the window may stay smaller than the MSS all the time. If the sender then decides to send smaller packets rather than wait for the window to grow, the available MTU would not be maxed out most of the time, and once it was, the download would fail because the discovered MTU wasn't correct. But for this to happen, several things have to go wrong: some element on the path must be letting through frames larger than the actual MTU somewhere further along the path, and only smaller frames must be flowing for long enough for this to remain unnoticed.
Top to bottom (L4 to L1):
- the SYN and SYN+ACK packets inform the server and client, respectively, about the MSS they can accept, calculating it based on their local MTU. However, TCP packets are sent with the "don't fragment" flag set, which means that any router on the path which finds the packet too big to fit the outbound interface's MTU drops it and sends back an ICMP notification saying that it would have to be fragmented and what the maximum size is that will fit. The sender then sends the same data again using a shorter packet according to this information. So within the first few attempts, the packet finally reaches the destination and gets acknowledged, and the sender keeps its MSS in the sending direction at this reduced value. This process may not happen immediately, as the payload of the first few packets of a session may be far less than the actual MSS of the path, and it may also happen repeatedly during an ongoing session if there are multiple paths through the network, each with a different MTU.
- the PPPoE gets information about the MTU supported by the Ethernet layer from the EoIP tunnel or from the bridge to which it is connected; when you use the bridge itself as an interface, the smallest of the MTUs of all ports connected to the bridge is reported as the bridge MTU. You can see it in the bridge parameters. The PPPoE itself adjusts the MTU of the underlying L2 interface by subtracting its own 8 bytes of headers, making it 1492 bytes if the L2 interface below has 1500.
- if you set the EoIP MTU to a higher value than the MTU of the physical path minus the size of the (ethernet (+ vlan) + ip + gre) headers together, the EoIP will send IP packets larger than 1500 bytes, so the IP stack below will split each packet carrying an EoIP-encapsulated PPPoE packet carrying a TCP packet into two fragments: one which fits within the MTU of the underlying physical interface, and a much shorter one carrying what didn't fit into the first one plus an additional 20 bytes of IP header. This way, it makes the physical MTU invisible to the PPPoE, but causes the rate of transport IP packets to be almost double the rate of the PPPoE packets (almost, because not all packets max out the PPPoE's MTU).
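The discovery loop from the first bullet can be sketched as a toy simulation. This is only an illustration under simplifying assumptions: the function name and the example hop MTUs are made up, every hop is assumed to send the ICMP notification correctly, and only a single path is modelled.

```python
def discover_path_mtu(local_mtu, hop_mtus):
    """Simulate a sender probing with the don't-fragment flag set.

    hop_mtus lists the outbound MTU of each router on the path, in order.
    Returns (packet size that finally gets through, list of sizes tried).
    """
    size = local_mtu
    attempts = []
    while True:
        attempts.append(size)
        # The first hop whose outbound MTU is too small drops the packet
        # and reports its own MTU back in the ICMP notification.
        blocking = next((mtu for mtu in hop_mtus if mtu < size), None)
        if blocking is None:
            return size, attempts
        # The sender retries the same data in a shorter packet.
        size = blocking


# Hypothetical path: the sender starts from its local MTU of 1500.
final_size, tried = discover_path_mtu(1500, [1500, 1492, 1458, 1480])
print(final_size, tried)
```

If any hop on such a path drops oversized packets without reporting back, the loop never learns the smaller value, which is the kind of failure the rest of this post is about.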
PPPoE doesn't have any mechanism to detect the actual MTU of the path. It blindly relies on the MTU reported by its underlying interface, and if a packet doesn't fit, it cannot do anything about it: the upper layer (IP) doesn't track the interface MTU constantly, so starting to report a different MTU would have no effect as nobody would ask again.
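To put numbers on the layering above, here is a small arithmetic sketch. The header sizes are the usual ones and are assumptions on my part (14-byte Ethernet header, 4 bytes per VLAN tag, 20-byte IP header without options, 8-byte GRE header as used by EoIP, 20-byte TCP header without options); the function names are made up for the illustration.

```python
ETH_HEADER = 14    # inner Ethernet header carried inside the tunnel
VLAN_TAG = 4       # per VLAN tag on the inner frame
IP_HEADER = 20     # IPv4 header without options
GRE_HEADER = 8     # GRE header as used by EoIP (assumed)
PPPOE_HEADER = 8   # PPPoE + PPP headers, as stated in the post
TCP_HEADER = 20    # TCP header without options


def max_unfragmented_eoip_mtu(physical_mtu, vlan_tags=0):
    """Largest EoIP MTU whose transport packets still fit the physical MTU."""
    overhead = ETH_HEADER + vlan_tags * VLAN_TAG + IP_HEADER + GRE_HEADER
    return physical_mtu - overhead


def pppoe_mtu(l2_mtu):
    """MTU the PPPoE interface derives from the L2 interface below it."""
    return l2_mtu - PPPOE_HEADER


def tcp_mss(ip_mtu):
    """MSS a TCP endpoint would announce for a given IP MTU."""
    return ip_mtu - IP_HEADER - TCP_HEADER


def outer_ip_packet_size(inner_l2_payload, vlan_tags=0):
    """Size of the GRE transport IP packet for a given inner Ethernet payload."""
    return inner_l2_payload + ETH_HEADER + vlan_tags * VLAN_TAG + GRE_HEADER + IP_HEADER


print(max_unfragmented_eoip_mtu(1500))   # EoIP MTU that avoids fragmentation
print(pppoe_mtu(1500))                   # PPPoE MTU if EoIP reports 1500
print(tcp_mss(pppoe_mtu(1500)))          # MSS announced over that PPPoE
print(outer_ip_packet_size(1500))        # transport packet for a full inner frame
```

With an EoIP MTU of 1500 over a 1500-byte physical path, a full-size inner frame becomes a transport packet larger than 1500 bytes, so it has to be fragmented, whereas the PPPoE on top still believes it has 1492 bytes available.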
So the first thing I keep wondering about is how it is possible that it works most of the time. Watching the window size reported in the ACK packets sent by the download recipient should confirm or deny my theory that most of the time the receive buffer is almost full, so the recipient asks for smaller-than-MSS packets.
Another possible explanation is that something is wrong with how the IP stack fragments or re-assembles the GRE packets, so that when the window size becomes lower than usual and the EoIP needs to deliver the frame using two packets and the difference is just one byte in either direction, the splitting or re-assembly fails, while for bigger differences it works properly. To find this out, you would have to capture on the physical interface and on the EoIP one simultaneously, and ping using various packet sizes to find the critical size which causes the issue:
/ping x.x.x.x do-not-fragment size=y, using values between, say, 1300 and 1500 one by one, maybe using a script:
local restbl {"0"=>3} ;for i from=1300 to=1520 step=1 do={set ($restbl->$i) [ping re.mo.te.ip count=1 do-not-fragment size=$i]} ; foreach size,rslt in=$restbl do={if ($rslt<3 && $rslt>-1) do={put ($size.": ".$rslt)}}
My problem is that I haven't found any value which would fail in my lab setup. Until the path MTU is reached, the GRE transport packets go without fragmentation; beyond it, second fragments are added, and the reassembly works fine too:
157 time=6.102 num=315 direction=tx src-mac=CC:2D:E0:C9:B7:E6 dst-mac=64:D1:54:87:38:54 interface=ether1 src-address=192.168.5.1
dst-address=192.168.5.100 protocol=ip ip-protocol=gre size=1512 cpu=1 fp=no ip-packet-size=1498 ip-header-size=20 dscp=0
identification=37708 fragment-offset=0 ttl=255
158 time=6.108 num=317 direction=tx src-mac=CC:2D:E0:C9:B7:E6 dst-mac=64:D1:54:87:38:54 interface=ether1 src-address=192.168.5.1
dst-address=192.168.5.100 protocol=ip ip-protocol=gre size=1513 cpu=1 fp=no ip-packet-size=1499 ip-header-size=20 dscp=0
identification=37964 fragment-offset=0 ttl=255
159 time=6.114 num=319 direction=tx src-mac=CC:2D:E0:C9:B7:E6 dst-mac=64:D1:54:87:38:54 interface=ether1 src-address=192.168.5.1
dst-address=192.168.5.100 protocol=ip ip-protocol=gre size=1514 cpu=1 fp=no ip-packet-size=1500 ip-header-size=20 dscp=0
identification=38220 fragment-offset=0 ttl=255
160 time=6.119 num=321 direction=tx src-mac=CC:2D:E0:C9:B7:E6 dst-mac=64:D1:54:87:38:54 interface=ether1 src-address=192.168.5.1
dst-address=192.168.5.100 protocol=ip ip-protocol=gre size=1514 cpu=1 fp=no ip-packet-size=1500 ip-header-size=20 dscp=0
identification=38476 fragment-offset=0 ttl=255
161 time=6.119 num=322 direction=tx src-mac=CC:2D:E0:C9:B7:E6 dst-mac=64:D1:54:87:38:54 interface=ether1 src-address=192.168.5.1
dst-address=192.168.5.100 protocol=ip ip-protocol=gre size=35 cpu=1 fp=no ip-packet-size=21 ip-header-size=20 dscp=0
identification=38476 fragment-offset=1480 ttl=255
162 time=6.126 num=325 direction=tx src-mac=CC:2D:E0:C9:B7:E6 dst-mac=64:D1:54:87:38:54 interface=ether1 src-address=192.168.5.1
dst-address=192.168.5.100 protocol=ip ip-protocol=gre size=1514 cpu=1 fp=no ip-packet-size=1500 ip-header-size=20 dscp=0
identification=38732 fragment-offset=0 ttl=255
163 time=6.126 num=326 direction=tx src-mac=CC:2D:E0:C9:B7:E6 dst-mac=64:D1:54:87:38:54 interface=ether1 src-address=192.168.5.1
dst-address=192.168.5.100 protocol=ip ip-protocol=gre size=36 cpu=1 fp=no ip-packet-size=22 ip-header-size=20 dscp=0
identification=38732 fragment-offset=1480 ttl=255
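The fragment sizes in the capture above follow directly from how IPv4 fragmentation works. A minimal sketch, assuming a 20-byte IP header without options and a 1500-byte link MTU (the function name is made up for the illustration):

```python
def ip_fragments(total_size, link_mtu=1500, header=20):
    """Return (fragment-offset in bytes, ip-packet-size) for each fragment."""
    if total_size <= link_mtu:
        return [(0, total_size)]
    payload = total_size - header
    # Every fragment except the last must carry a multiple of 8 payload bytes.
    step = (link_mtu - header) // 8 * 8
    fragments = []
    offset = 0
    while offset < payload:
        chunk = min(step, payload - offset)
        fragments.append((offset, chunk + header))
        offset += chunk
    return fragments


for size in (1500, 1501, 1502):
    print(size, ip_fragments(size))
```

A 1501-byte transport packet yields a 1500-byte first fragment and a 21-byte second one at offset 1480, and a 1502-byte packet yields 1500 plus 22, which is what packets 160 to 163 show.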
If you end up with the same result as I did, I'd recommend pinging for a long time (minutes) with the do-not-fragment option and the maximum packet size which fits without fragmentation; if you can see time intervals where such packets do pass and other time intervals where they don't, there must be something on the path that changes the MTU value.