NetWorker
Frequent Visitor
Topic Author
Posts: 98
Joined: Sun Jan 31, 2010 6:55 pm

Slow VPN speed with single TCP stream in one direction

Sat Apr 03, 2021 2:49 am

Hey everyone,
I’ve got a VPN issue that I can’t figure out. You can skip the background description if you wish.

Background:
- Main office: MikroTik RB3011, 30/10 mbps (down/up) on VDSL, Win2008R2 file server
- Remote office: MikroTik RB2011UiAS, 10/10 mbps on 3-hop wireless uplink to fiber, Win2016 file server

At some point we’ll upgrade past our physical Windows boxes but for the moment we still rely on them, with an SMB (robocopied) backup performed from the remote to the main office.
Ever since I set the VPN up between both sites (4-5 years ago) I’ve experienced slow upload speeds from the remote to the main office. At the time I thought it was due to known issues with SMB (version issues between the 2k8 and 2016 or the protocol being too chatty over WAN). I tinkered around a bit but gave up as the time it took to upload the backups was something I could live with. Last week the size of our backups doubled due to new software and it now takes north of 72 hours to upload a full image. We generally do incrementals but every second week we do full sized backups. Transfer speed hangs around 2 to 3 mbps. So I decided to try out FTP and Windows 2016 to Windows 10 but I didn’t see any improvement.
About a year ago we upgraded the remote office’s connection to a symmetric 10/10 mbps and we now have a solid 25 ms round trip time from one office to the other.

Long story short, I found out it’s a VPN issue with single stream TCP connections and has nothing to do with the application running on top of it (SMB, FTP, other).

The VPN itself is L2TP over IPsec with IKEv2 and certificates. I chose L2TP because I needed an interface to assign OSPF to, and GRE or IPIP need static IPs, which we don’t have. And I chose IPsec because it supports hardware encryption on the 3011 and is generally more secure using certificates rather than PSKs. Authentication is SHA256 and encryption is AES-256-CBC.


Symptoms:
- Single-stream TCP transfers from the 3011 to the 2011 max out at 10 mbps.
- Single-stream TCP transfers from the 2011 to the 3011 crawl along at 2 to 3 mbps.
- Multi-stream TCP transfers from the 2011 to the 3011 are able to max out the 10 mbps.
- CPU load at the 2011 is around 30% when encrypting at 3 mbps.
- CPU load at the 3011 is about 2 to 3% when pushing 10 mbps, as expected, since it only has to route a bit of traffic and the encryption is offloaded to hardware.
- CPU load at the 2011 is about 60-70% when decrypting at 10 mbps.
- A single TCP stream uploading to the internet (outside the VPN) maxes out the 10 mbps.
- A second office, also with an RB2011 and the same setup but a 4 mbps upload, also crawls along at ¼ speed (1 mbps ish) with a single stream, and is also able to max out the 4 mbps when using multiple streams.

Things I’ve tried:
- AES256 CTR, AES128 CTR, AES128 CBC
- GRE, IPIP and PPTP instead of L2TP
- MTU and MRU dialed down to adequate values
- MSS clamp to PMTU rule

The issue:
Single TCP stream through the VPN goes fine one way but crawls along at ¼ speed in the other. Multiple streams are able to max out the connection. It doesn’t seem to be an encryption or hardware problem but I’m at a loss as to what could be causing this. Any pointers greatly appreciated!!
 
bpwl
Forum Guru
Posts: 2978
Joined: Mon Apr 08, 2019 1:16 am

Re: Slow VPN speed with single TCP stream in one direction

Sun Apr 04, 2021 11:04 pm

Any pointers? OK. Slow transfer over a long distance?
This sounds like "TCP congestion avoidance" kicking in. Windows has moved on from the initial Reno, Tahoe and other algorithms to the default "Compound".
But now you can also set CUBIC. https://msandbu.org/windows-10-and-serv ... ancements/
 
NetWorker
Frequent Visitor
Topic Author
Posts: 98
Joined: Sun Jan 31, 2010 6:55 pm

Re: Slow VPN speed with single TCP stream in one direction

Wed Apr 07, 2021 3:37 am

Thanks for replying bpwl!

I've been reading up on TCP congestion control. Cool stuff! It never occurred to me how much engineering goes into a protocol we take for granted.

I do have a couple of remarks and a couple of questions though.

- The issue persists when only two devices are involved: a RouterBOARD on each side, with traffic generated by /tool btest on either device (i.e. no Windows or other hosts involved).
- While the physical distance is still a couple hundred miles, RTT is a solid 25 ms through the tunnel. That's better than some wireless LANs.
- We have multiple ISPs across multiple locations. At our main office we also have two. The issue persists across different connections. So while it's possible that it's due to an issue somewhere upstream, it's highly unlikely.

- Our central router is on 6.46.2 stable. Our remote offices' routers are on either 6.43.2 stable or 6.43.16 long term. Is it possible that Mikrotik updated the TCP congestion control algorithm (I doubt it)?
- I haven't wrapped my head around congestion control's dependency on MSS yet. How would you recommend to tune MSS in RouterOS to best adapt it for tunnel performance?
- What do you make of the fact that only the tunnel (being UDP encapsulated) is affected while uploads at either end are unaffected?
- What about the fact that it only goes in one direction?

Finally, I'll see if I can wireshark the tunnel and look for retransmissions. Thanks again!
 
bpwl
Forum Guru
Posts: 2978
Joined: Mon Apr 08, 2019 1:16 am

Re: Slow VPN speed with single TCP stream in one direction

Wed Apr 07, 2021 12:07 pm

NetWorker wrote: "I'll see if I can wireshark the tunnel and look for retransmissions."
... also look at the timing of the packets. Which side is introducing the delay?

Be aware of the fact that some TCP implementations only ACK every other packet. (For a burst with an odd number of packets, there is a delay of 120 ms before the last ACK.)
https://serverfault.com/questions/34866 ... end-an-ack
Last edited by bpwl on Wed Apr 07, 2021 12:37 pm, edited 1 time in total.
 
bpwl
Forum Guru
Posts: 2978
Joined: Mon Apr 08, 2019 1:16 am

Re: Slow VPN speed with single TCP stream in one direction

Wed Apr 07, 2021 12:17 pm

NetWorker wrote: "Is it possible that Mikrotik updated the TCP congestion control algorithm (I doubt it)?"
No idea. TCP congestion control operates at the endpoints. If two devices communicate over the MT, is the TCP session terminated at the MT or not? With a web proxy it is; with normally routed traffic it is not, and even with NAT it is not. In the case of local encryption on the MT, I wonder whether the TCP session (and its congestion logic) is terminated at the router or not. If the VPN is end-to-end, it's not.

If the TCP session is not terminated and restarted at the tunnel entrance, the MT congestion algorithm does not play a role for client-to-client TCP sessions.
 
bpwl
Forum Guru
Posts: 2978
Joined: Mon Apr 08, 2019 1:16 am

Re: Slow VPN speed with single TCP stream in one direction

Wed Apr 07, 2021 12:31 pm

NetWorker wrote: "How would you recommend to tune MSS in RouterOS to best adapt it for tunnel performance?"
MSS should be set so that there is no fragmentation (introduced by the tunnel encapsulation and UDP overhead). It is typically lower than the Ethernet MSS. TCP should discover the maximum MSS, but does not always. It is safe to set it to 1400 bytes. The optimum may be somewhat larger, but a somewhat smaller packet has little performance impact. I have sometimes used 1300 bytes to be sure.
UDP is not aware of MSS, and that can be a major problem (e.g. the Microsoft AD logon handshake typically switched from UDP to TCP at 2008 bytes, and that's too late; with many OUs and security groups, some people cannot log on through a tunnel until you set AD to always use TCP). UDP is always a potential problem with satellite links, as small and large UDP packets are handled with different priority.

A steady stream of large TCP packets going through the tunnel should give a steady stream of large UDP packets of tunnel traffic.

The correct lower MSS should be set at the end-points, or you will create fragmentation at the routers.
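
To put rough numbers on the MSS advice above, here is a minimal Python sketch. It assumes plain IPv4 and TCP headers of 20 bytes each (no options) and a 1500-byte link MTU; the function names are made up for illustration, and the real encapsulation overhead depends on the tunnel type and cipher, so treat the overhead as something to measure rather than a known value.

```python
# Sketch only: relate an MSS value to the packet size it produces and to the
# room it leaves for tunnel encapsulation. Header sizes assume IPv4 and TCP
# without options; the helper names are hypothetical.

IPV4_HEADER = 20  # bytes
TCP_HEADER = 20   # bytes


def mss_for_mtu(path_mtu: int) -> int:
    """Largest TCP payload that fits in one unfragmented packet of path_mtu bytes."""
    return path_mtu - IPV4_HEADER - TCP_HEADER


def overhead_budget(link_mtu: int, mss: int) -> int:
    """Bytes left for tunnel headers before the outer packet exceeds link_mtu."""
    return link_mtu - (mss + IPV4_HEADER + TCP_HEADER)


if __name__ == "__main__":
    print("MSS on plain Ethernet:", mss_for_mtu(1500))
    for mss in (1400, 1300):
        print(f"MSS {mss} leaves {overhead_budget(1500, mss)} bytes for encapsulation")
```

If the measured overhead of the tunnel is larger than the budget printed for a given MSS, packets of that size would still be fragmented on a 1500-byte link.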
 
NetWorker
Frequent Visitor
Topic Author
Posts: 98
Joined: Sun Jan 31, 2010 6:55 pm

Re: Slow VPN speed with single TCP stream in one direction

Wed Apr 14, 2021 8:57 pm

Ok, so I'm back.

First of all, I thought setting MSS to a fixed value was a big no-no, and that one should set MTU and MRU to fixed values so as not to fragment and leave MSS be.

Further, I didn't quite grasp what you meant by terminating the TCP session at the endpoints. We're talking traffic generated, encrypted and sent from router A to router B, with the reverse process occurring at B. The traffic doesn't leave either router except for the trip down the tunnel. So the TCP session necessarily starts at router A (remote office) and terminates at router B (central office).

Anyway, I wiresharked (100k packet captures) the tunnel during bandwidth tests. Results do look interesting.
- Sending RB3011 -> RB2011 @ 9.2 mbps tunnel (10.2 mbps give or take with overhead) I get about 1.7% duplicate ACKs in largish groups (like it gets 50 duplicates at one time) every few seconds, a handful of retransmissions (28) and out-of-orders (56). Otherwise smooth sailing.
- Receiving RB2011 -> RB3011 @ 2.4 mbps tunnel I get 8.1% duplicate ACKs in somewhat tighter groups (about 20), somewhat more frequently (every 200 TCP frames give or take), close to 1% retransmissions (966) but again only a handful of out-of-orders (111).

Any suggestion as to where to start troubleshooting? Thanks again for the help!

edit: the difference in performance does surprise me, but I can't picture that less than 10% of duplicates and retransmissions and negligible amounts of out-of-orders would cause a 75% drop in throughput. Right?

edit2: if anything I rather suspect the duplicates and retransmissions to be a product of the limited tunnel performance and not the other way around. Though I've nothing to substantiate that.
 
bpwl
Forum Guru
Posts: 2978
Joined: Mon Apr 08, 2019 1:16 am

Re: Slow VPN speed with single TCP stream in one direction

Wed Apr 14, 2021 10:02 pm

MTU is OK, if MSS/MTU discovery works well. But what happens if the Ethernet MTU is 1500, the tunnel MTU is 1400, and the MTU discovery did not work?
In other equipment (Juniper, Netscape, Fortinet, ...) there was the option to rewrite the MSS in the TCP handshake, so the sender would learn the smallest MTU in the path.
(https://kb.fortinet.com/kb/documentLink ... ID=FD40793)

Well, I wonder about your TCP session termination at the routers (or not). I fully agree that if you test from router to router, then the endpoints can only be the routers. Your real file transfer is different. It starts from server1, is routed and encapsulated/decapsulated in the routers, and delivered to server2. The TCP session state (counters, packet numbers, buffer window size, ACK handling, retransmission ...) is only handled at the servers. The routers only route, fragment if needed, and encode/decode (they normally do not reassemble fragments); they do not handle the TCP session, at least that's what I expect.

Missing ACKs, resulting in retransmissions and duplicates, can have an enormous effect on TCP throughput, as the TCP congestion avoidance algorithm reacts to these missing ACKs by reducing the pace at which it sends the next packets. It will gradually (fast or slow, depending on the algorithm used) increase that pace until it starts failing again. This can be an unstable process.
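
As a back-of-the-envelope illustration of how much a small loss rate can cap a single stream, here is a Python sketch of the well-known Mathis et al. approximation, throughput ≈ (MSS / RTT) × C / sqrt(p) with C ≈ 1.22. This model is not something anyone in this thread measured; it assumes Reno-style congestion avoidance and random, independent loss. The 25 ms RTT is from the thread, the 1400-byte MSS is the value suggested earlier, and the 1% loss rate is only an assumption loosely based on the "close to 1% retransmissions" capture (retransmissions are a rough proxy for loss at best).

```python
import math

# Constant from the Mathis et al. approximation for Reno-style TCP.
MATHIS_C = math.sqrt(3 / 2)


def mathis_ceiling_bps(mss_bytes: float, rtt_s: float, loss_rate: float) -> float:
    """Approximate upper bound (bits/s) for ONE loss-limited TCP stream."""
    if not 0 < loss_rate <= 1:
        raise ValueError("loss_rate must be in (0, 1]")
    return (mss_bytes * 8 / rtt_s) * MATHIS_C / math.sqrt(loss_rate)


def aggregate_bps(streams: int, per_stream_bps: float, link_bps: float) -> float:
    """N independent streams add up until the link itself becomes the limit."""
    return min(streams * per_stream_bps, link_bps)


if __name__ == "__main__":
    one = mathis_ceiling_bps(mss_bytes=1400, rtt_s=0.025, loss_rate=0.01)
    print(f"single stream ceiling: {one / 1e6:.1f} Mbit/s")
    for n in (1, 2, 4):
        total = aggregate_bps(n, one, link_bps=10e6)
        print(f"{n} stream(s): {total / 1e6:.1f} Mbit/s on a 10 Mbit/s link")
```

Under these assumptions the per-stream ceiling falls with the square root of the loss rate, and several parallel streams can fill a link that one stream cannot, which is at least consistent with the single-stream versus multi-stream symptom described in this topic.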

The problem at the routers is that a large buffer (bufferbloat) is growing, because a fast network connects to a slower network. (Like on the highway: going from 3 lanes to 2 lanes is usually a problem.) Maybe throttling would help: viewtopic.php?t=162533

Using Wi-Fi as the medium is yet another mismatch with TCP. The TCP congestion algorithms are not fit for that varying medium. For satellite transmission, very aggressive TCP congestion protocols are used, mostly executed in dedicated traffic concentrators (called accelerators). Routers do not end and restart TCP sessions like those concentrators do, AFAIK. I don't know in detail how your tunnel is processed in RouterOS. The tunnel is often UDP (router-router), to avoid having two competing TCP congestion protocols in the connection. So even if the UDP tunnel is router-router end-to-end, the TCP session is server-server end-to-end.

I had very good experience with the "Expand Networks accelerator"; it does much more, such as local ACKs, to increase the traffic over the link and reduce delays. (Steelhead from Riverbed was too expensive.)

And more practical: what performance do you get when you use a multi-stream FTP client (like FileZilla)? Not sure it does segmented transfers: https://whatbox.ca/wiki/Multi-threaded_ ... mented_FTP
 
NetWorker
Frequent Visitor
Topic Author
Posts: 98
Joined: Sun Jan 31, 2010 6:55 pm

Re: Slow VPN speed with single TCP stream in one direction

Fri Apr 23, 2021 6:30 pm

Well, this week I disabled the IPsec and L2TP tunnel. I didn't suspect it to be the culprit, but I wanted to rule it out. So I went with an IPIP tunnel with no encryption (MSS clamp to PMTU=yes) and had the same results. UDP 10 mbps, multistream TCP 10 mbps, single stream TCP 2.4 mbps, single stream TCP the other way 10 mbps.

At this point I'm close to giving up on this. If it is a TCP issue then it's probably beyond my control. At some point I'll drive out there with another router and give it a try, just to see if it's not due to anything in my config, even though I've gone through it forwards, backwards and sideways more than 100 times. It'll be a while since Corona has us pinned down again.

Edit: bpwl, I just reread your post and realized I forgot to mention that I had already tried FTP to an Ubuntu LTS server (see original post). And I *think* I used FileZilla as the client. But I wasn't aware there was a multistream option in FileZilla. I assume this'll only work with a FileZilla server since afaik that's not part of the original FTP spec, is it? I'll look into this later this week. I also looked into Expand Networks and Riverbed. The former is no more, with its remains having been taken over by the latter. But as far as I can see that line of products is history. In any case, we're not based in the States so specialty hardware like this is hard or impossible to come by. =(

Thanks again for all the tips and suggestions!
 
mikegleasonjr
Frequent Visitor
Posts: 55
Joined: Tue Aug 07, 2018 3:14 am

Re: Slow VPN speed with single TCP stream in one direction

Sat Aug 28, 2021 2:25 am

I have pretty much the same problem, did you resolve it?
 
sindy
Forum Guru
Posts: 10205
Joined: Mon Dec 04, 2017 9:19 pm

Re: Slow VPN speed with single TCP stream in one direction

Sat Aug 28, 2021 10:06 am

I was looking into a similar problem (a single-connection TCP test using /tool bandwidth-test between two CHR routers running at the same provider), and the root cause of the throughput being lowered from 200 Mbps to less than 0.5 Mbps was that 25 % of the tiny second fragments of the transport packets did not arrive at the destination (nor did 0.2 % of the large first fragments, to be fair). All that on a network path with just 1.6 ms round-trip delay on ping, and the CHRs being just 5 to 7 routing hops away from each other.

@bpwl has already explained it all, but I'll do that once again using other words to emphasize the key aspect: since the TCP throughput is properly calculated from the payload data successfully delivered, and since retransmissions are sent with an exponentially growing delay, if not only the initial packet carrying a given chunk of the payload stream but also multiple retransmissions of this packet get lost (in my case, due to loss of a fragment), the resulting gap in payload delivery may reach multiple seconds. Such gaps affect the result much more than would seem proportional to the packet loss ratio. I even had a case where RouterOS terminated the test on its own because even the last retransmission attempt had failed. (I didn't test what happens if you run the test with connection-count other than 1 and one of the connections fails this way, i.e. whether it gets replaced by a new one or not.)
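
To make the "gap grows much faster than the loss ratio suggests" point concrete, here is a small Python sketch of retransmission backoff. It is a simplification with assumptions that are not from this thread: the retransmission timeout simply doubles after every failed attempt with no upper cap, the initial timeout of 0.2 s is a hypothetical value, and losses are treated as independent.

```python
# Sketch only: simplified exponential backoff, not any specific TCP stack.

def stall_seconds(initial_rto_s: float, losses_in_a_row: int) -> float:
    """Gap in payload delivery when the same segment is lost this many times
    in a row (the original transmission counts as the first loss)."""
    if losses_in_a_row < 0:
        raise ValueError("losses_in_a_row must not be negative")
    # RTO + 2*RTO + 4*RTO + ... is a geometric series.
    return initial_rto_s * (2 ** losses_in_a_row - 1)


def chance_of_run(loss_rate: float, losses_in_a_row: int) -> float:
    """Probability that one segment is lost that many times in a row,
    assuming independent losses."""
    return loss_rate ** losses_in_a_row


if __name__ == "__main__":
    for n in range(1, 6):
        gap = stall_seconds(0.2, n)
        p = chance_of_run(0.25, n)  # 25 % is the fragment loss quoted above
        print(f"{n} loss(es) in a row: stall {gap:.1f} s, chance {p:.4f}")
```

With a 25 % loss rate, runs of three or four consecutive losses are not rare, and under these assumptions each such run stalls a single connection for seconds rather than milliseconds.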

I'm currently waiting for feedback from a fellow forum user on the same test on two CCRs about 2000 miles / 3000 km apart, where the choice of the local ISP at one end makes the difference between "business as usual" and "unusable".

Of course, your test results are welcome too - just run /tool sniffer on both routers simultaneously, saving the result into a file and filtering on the ip-address of the peer router alone (no filtering on ports because fragments do not contain port numbers) while running /tool bandwidth-test protocol=tcp connection-count=1.

In your case, it may not be fragments in particular that get lost, but some packet loss between the routers is by far the most likely root cause.
 
NetWorker
Frequent Visitor
Topic Author
Posts: 98
Joined: Sun Jan 31, 2010 6:55 pm

Re: Slow VPN speed with single TCP stream in one direction

Sat Aug 28, 2021 11:51 pm

Hey, mikegleasonjr! No, I haven't found a solution to it. I managed to reduce the full backup size so that full backups take about 30 hours. Given that I only do full backups every other week, that daily backups are incrementals taking just short of an hour to upload, and that I've already sunk several workdays into this, I decided again that I can live with the limitation for the time being.

Hi sindy! I hear what you guys are saying. Basically that missing ACKs and especially retransmits of fragmented packets will make the TCP congestion avoidance kick the transmit delays to exponentially high orders of magnitude which essentially leads to very low actual transmission rates, regardless of actual link saturation.

And while I think you guys are on to something I'm thinking it should affect multiple TCP streams the same way it affects a single one. The loss ratio in percent would be the same so you'd have the same retransmit to delivered packets ratio. And that's not the behavior I'm seeing.
OTOH, I'm thinking that a stream using less bandwidth that is hit by an exponential delay after a frame loss would have a quicker chance to recover, given that the total amount of bytes to retransmit from a single event is smaller. However, this assumes that the loss events are evenly distributed over time. If they weren't, as they're unlikely to be, this would lead to the simultaneous collapse of several streams at once, resulting in heavy delivery fluctuation, which I'm not seeing either. When using TCP connection counts higher than one, that is.

Just ran a real quick btest.
1 stream = 2.4 mbps
2 streams = 4.0 mbps
4 streams = 6.2 mbps
8 streams = 8.3 mbps
It takes 10 connections to max out the available bandwidth (10 mbps, 9.5 throughput). And again a single connection is able to max out the 10 mbps in the other direction.
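
For what it's worth, the quick btest numbers above can be broken down per stream with a few lines of Python. This is just arithmetic on the posted totals; the dictionary and function names are made up for illustration.

```python
# Totals in mbps as posted above, keyed by TCP connection count.
results_mbps = {1: 2.4, 2: 4.0, 4: 6.2, 8: 8.3}


def per_stream(results: dict) -> dict:
    """Average throughput of each stream for every connection count."""
    return {n: total / n for n, total in results.items()}


if __name__ == "__main__":
    for n, avg in per_stream(results_mbps).items():
        print(f"{n} stream(s): {results_mbps[n]} mbps total, {avg:.2f} mbps per stream")
```

The total keeps growing with the connection count while the average per stream shrinks, so the streams do not simply add up linearly.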

Would love to test this out further but unfortunately I'm tied down on another project, had my second kid two weeks ago and don't really have time for a thorough gremlin hunt in the coming weeks. But do keep posting please!
 
sindy
Forum Guru
Posts: 10205
Joined: Mon Dec 04, 2017 9:19 pm

Re: Slow VPN speed with single TCP stream in one direction

Sun Aug 29, 2021 1:44 pm

NetWorker wrote: "Would love to test this out further but unfortunately I'm tied down on another project, had my second kid two weeks ago and don't really have time for a thorough gremlin hunt in the coming weeks. But do keep posting please!"
There's little to post without any input data (from you and/or from anyone else affected by this "gremlin") to analyse :)

I've checked what the difference is on the bandwidth-test side between a single-connection TCP test and a multi-connection one. If the single-connection test cannot get any data through for about 30 seconds (maybe 32), it terminates. By contrast, if one session within a multi-connection test cannot get any data through for an extended period of time (minutes), not only does the test as a whole not terminate, but both the bandwidth-test client and the bandwidth-test server keep trying on the broken session and resume its use if it recovers. And while that session is struggling, other sessions occupy the available bandwidth.

So in a multi-connection test with enough sessions in total, there seems to always be enough unaffected sessions to compensate for the loss of throughput on the other ones caused by whatever root cause. The rest is statistics.

As for the packet loss itself, there's a practical difference: if the loss only (or at least by and large) affects fragments, taking measures to prevent fragmentation solves the issue; if the loss also affects complete (non-fragmented) packets, fragmentation prevention has almost no effect.

You don't need to assign too much time: assuming you use something-over-IPsec, and as the file transfer takes 30 hours, just sniff into a file, simultaneously at both ends, at any time the transfer is ongoing spontaneously, filtering by ip-address=ip.of.the.remote.peer. A minute of the recording from both ends is sufficient to show what's going on (start the sniffing at router A, then at router B, wait 1 minute, stop sniffing at router B, then stop it at router A). And since the actual data are encrypted, you can outsource the really time-consuming part, which is the analysis of the recordings, here. If you want to obfuscate even the actual IP addresses in the pcap files, you can use tracewrangler for that.
 
NetWorker
Frequent Visitor
Topic Author
Posts: 98
Joined: Sun Jan 31, 2010 6:55 pm

Re: Slow VPN speed with single TCP stream in one direction

Tue May 03, 2022 4:34 pm

Hey y'all!!

Yesterday we went online with a new ISP over a symmetric GPON line at our main office (finally catching up to 2015 era tech rofl).
Anyway, our VPN issues are now solved. Our DSL line ISP was the culprit. I still can't pinpoint the problem, but issues with their PPPoE server or the router coming off the PPPoE lines would be my guess. For you guys out there having this issue, make sure to erm... "annoy" the hell out of your ISP's staff so that they work with you on this issue.

In the meantime, I'll mark this one as solved (even though no real solution is provided).
Big shout out to bpwl and sindy for the help!!

Scratch all that. This issue is not behind us at all. This has gotten so old that I btested forgetting to set the TCP stream count to 1. So I was like f..me it works! But last night when I started transferring a backup it still crawled along, albeit at 4.5ish mbps now. That's somewhat better than it used to be. Latency is down to a solid 8 ms now. As per bpwl's suggestion I decided to just go ahead and set up an FTP server and client, and using three simultaneous transfers I'm able to max out the 10 mbps at the remote office and upload the backups, which again have increased in size, overnight (9 hours, give or take). When all else fails, go back to basics lol.

At this time I suspect there's more than one factor at play. TCP congestion control might be part of it. With the additional throughput, the RB2011's CPU is barely coping with encryption; the wireless link at the remote office is only half duplex by its nature; and possibly other factors are all conspiring against my networking-fu. My next step will probably involve replacing the remote router with one that has a hardware encryption chip, like the 3011. If I ever track this down I'll let you guys know, but for the moment this works for me.
