Approximately 5s delay in TCP connections when using a static route via an address on bridge

Posted: Mon Dec 28, 2020 8:59 pm
by arclength
My home router is a hEX RB750Gr3 running RouterOS v6.48. The LAN is 192.168.20.0/24. I'm running WireGuard on a Linux server (eth0: 192.168.20.10, wg0: 192.168.21.1) for "roadwarrior" access to my home network as well as linking up with off-site backup hosts. Clients access this Linux server from the internet through a port forward on the RB750Gr3.

If it's helpful, here's a simplified diagram of my network.
network-diagram.png
The WireGuard network is 192.168.21.0/24. Since I need to access WireGuard hosts from the LAN, I've dispensed with the usual masquerade ifup/ifdown iptables rules on the Linux host and added a static route on the RB750Gr3 for 192.168.21.0/24 via 192.168.20.10. This setup mostly works, but connections from 192.168.20.0/24 to 192.168.21.0/24 take about 5 seconds to start moving data after the initial TCP handshake. This problem does not arise with traffic to and from the internet.

Consider the following example:
$ time curl http://192.168.21.1 > /dev/null
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   612  100   612    0     0     91      0  0:00:06  0:00:06 --:--:--   175

real	0m6.742s
user	0m0.020s
sys	0m0.038s
$ sudo ip route add 192.168.21.0/24 via 192.168.20.10
$ time curl http://192.168.21.1 > /dev/null
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   612  100   612    0     0   199k      0 --:--:-- --:--:-- --:--:--  298k

real	0m0.046s
user	0m0.022s
sys	0m0.021s
The WireGuard server is also running an HTTP server for internal use. To take WireGuard out of the picture, we're sending an HTTP GET request to the WireGuard server's wg0 interface (192.168.21.1) rather than something like the offsite backup server (192.168.21.10). We identify the routing on the RB750Gr3 as the culprit by adding a static route on our testing host which bypasses the RB750Gr3. When we do this, the delay goes away.

Here's a screenshot of a representative packet capture. The TCP handshake goes fine, but things go downhill after the testing host sends the HTTP GET.
pcap-problem.png
A lightly redacted copy of my configuration follows:
# dec/28/2020 12:47:30 by RouterOS 6.48
# software id = REDACTED
#
# model = RouterBOARD 750G r3
# serial number = REDACTED
/ip pool
add name=dhcp ranges=192.168.20.10-192.168.20.254
/ip dhcp-server
add address-pool=dhcp disabled=no interface=bridge name=defconf
/ip address
add address=192.168.20.1/24 comment=defconf interface=ether2 network=192.168.20.0
/ip dhcp-client
add comment=defconf disabled=no interface=ether1
/ip dhcp-server lease
====REDACTED====
/ip dhcp-server network
add address=192.168.20.0/24 comment=defconf gateway=192.168.20.1 netmask=24
/ip dns
set allow-remote-requests=yes
/ip dns static
add address=192.168.20.1 comment=defconf name=router.lan
/ip neighbor discovery-settings
set discover-interface-list=LAN
/ip firewall filter
add action=accept chain=input comment="defconf: accept established,related,untracked" connection-state=established,related,untracked
add action=drop chain=input comment="defconf: drop invalid" connection-state=invalid
add action=accept chain=input comment="defconf: accept ICMP" protocol=icmp
add action=accept chain=input comment="defconf: accept to local loopback (for CAPsMAN)" dst-address=127.0.0.1
add action=drop chain=input comment="defconf: drop all not coming from LAN" in-interface-list=!LAN
add action=fasttrack-connection chain=forward comment="defconf: fasttrack" connection-state=established,related
add action=accept chain=forward comment="defconf: accept established,related, untracked" connection-state=established,related,untracked
add action=drop chain=forward comment="defconf: drop invalid" connection-state=invalid
add action=drop chain=forward comment="defconf: drop all from WAN not DSTNATed" connection-nat-state=!dstnat connection-state=new in-interface-list=WAN
/ip firewall nat
add action=masquerade chain=srcnat comment="defconf: masquerade" ipsec-policy=out,none out-interface-list=WAN
add action=dst-nat chain=dstnat comment=wireguard dst-port=51820 protocol=udp to-addresses=192.168.20.10 to-ports=51820
/ip route
add distance=1 dst-address=192.168.21.0/24 gateway=192.168.20.10 pref-src=192.168.20.1
/ip service
====REDACTED====

Re: Approximately 5s delay in TCP connections when using a static route

Posted: Wed Dec 30, 2020 9:33 pm
by erkexzcx
It seems the target of your static route is part of an existing bridge. I once had a similar issue, and it was fixed when I enabled the bridge firewall:

/interface bridge settings set use-ip-firewall=yes

It just fixed it for me. Maybe someone has better ways to fix this kind of issue.

Re: Approximately 5s delay in TCP connections when using a static route

Posted: Wed Dec 30, 2020 10:45 pm
by arclength
Thanks for that!

Yes, the target destination is part of an existing bridge.

Enabling the IP firewall for the bridge does resolve the issue (progress!) in that the latency in traffic between 192.168.20.0/24 and 192.168.21.0/24 goes away. Unfortunately, there are the following side effects:
  • I can consistently get iperf3 results of 920-940 Mbps between hosts on 192.168.20.0/24. When I turn on the IP firewall for the bridge and the traffic traverses the hEX's switch, speeds go down to 440 Mbps. If I take the hEX's onboard switch out of the picture by connecting a separate switch to a bridge port and connecting the rest of the network to that switch, speeds go back to what they should be. If anything, they get more consistent, with all results being 941±1 Mbps.
  • I've got a residential gigabit service. Speedtest.net results to a particular server consistently go from about 800 Mbps up/850 Mbps down to 800 Mbps up/440 Mbps down when I enable the IP firewall on the bridge. Replacing the hEX's role as a switch as previously described seems to improve upload speed by 50 Mbps.
Other approaches are welcome, but I think the IP firewall needs to be enabled on the bridge for this to work. I wonder why it works at all with it disabled. But hey, I understand just a bit more about how RouterOS does things now.

As I understand it, I've got the following options:
  1. Unassign a port from the hEX's bridge and do a dedicated run from the server to the hEX on that port on a new point-to-point network. I worry that this will produce the same kinds of slowdowns on the WAN link because it's still an interface that the IP firewall has to run on, exacerbated by the fact that the WAN port and the point-to-point link will share a link to the CPU.
  2. VLAN nonsense to accomplish the same as #1.
  3. Get a more powerful router than the hEX that can handle running the IP firewall on another interface. Was already on my radar as I'd like to get IPv6 running, but the hEX's performance has been so poor when I've tried it in the past that it's disabled for now.
Or maybe there is another way to accomplish this without enabling ip firewall on the bridge?

Re: Approximately 5s delay in TCP connections when using a static route  [SOLVED]

Posted: Thu Dec 31, 2020 12:37 am
by mkx
I'm pretty sure you're a victim of a "routing triangle": when a host such as 192.168.20.9/24 initiates a connection towards 192.168.21.0/24, it sends the packet to its default gateway (192.168.20.1). The MikroTik records the connection in its connection tracking table and forwards the packet to the next-hop router (the WG concentrator at 192.168.20.10). Then the packet proceeds to the destination. The destination replies, and the reply arrives at the WG concentrator, which notices that the destination address is in a directly connected subnet and delivers it directly. The reply thus bypasses the main router, so its connection tracking machine can't update the connection state properly. The next packet sent from the 192.168.20.0/24 host is then outside the perceived connection state and is dropped as invalid.
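
One way to check this diagnosis is to watch the counters on the forward chain while repeating the curl test from a LAN host. This is only a sketch: it assumes the default rule comments from the configuration posted above are unchanged, and that nothing else on the network is generating invalid packets at the same time.

```routeros
# Print packet/byte counters for the forward chain, run the curl test from a
# LAN host, then print again. If the routing-triangle explanation holds, the
# "defconf: drop invalid" counter should rise with each attempt.
/ip firewall filter print stats where chain=forward
```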

The solution is to disable connection tracking for connections between the two subnets (the firewall filter rule that accepts untracked packets will then match). Or introduce a new subnet used solely for the link between the main router and the WG concentrator, which means the WG concentrator will have to pass replies to the main router, keeping the main router's connection tracking machine happy. Or stop dropping invalid packets. Or something else that bypasses the invalid connection state.
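
The first option, disabling connection tracking between the two subnets, might look roughly like the raw rule below. Treat it as a sketch rather than a tested configuration: the comment text is made up, and it assumes no NAT is needed between the two subnets, since untracked packets skip NAT as well.

```routeros
# Mark LAN -> WireGuard packets as untracked before connection tracking sees
# them. They then match the existing "accept established,related,untracked"
# rule in the forward chain instead of being classified as invalid.
/ip firewall raw
add action=notrack chain=prerouting \
    comment="LAN to WireGuard: skip conntrack (routing triangle)" \
    src-address=192.168.20.0/24 dst-address=192.168.21.0/24
```

Only this one direction should be needed on the router, because packets travelling from 192.168.21.0/24 back to LAN hosts are delivered directly by the WG concentrator and never reach the router's IP firewall.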

Re: Approximately 5s delay in TCP connections when using a static route

Posted: Thu Dec 31, 2020 2:16 am
by arclength
Thank you, mkx!

I added rule 3, so my forward chain now looks like this:
Flags: X - disabled, I - invalid, D - dynamic 
 0  D ;;; special dummy rule to show fasttrack counters
      chain=forward action=passthrough 

 1    ;;; defconf: fasttrack
      chain=forward action=fasttrack-connection connection-state=established,related 

 2    ;;; defconf: accept established,related, untracked
      chain=forward action=accept connection-state=established,related,untracked log=no log-prefix="" 

 3    chain=forward action=accept connection-state=invalid,new src-address=192.168.20.0/24 dst-address=192.168.21.0/24 in-interface=bridge log=no 
      log-prefix="" 

 4    ;;; defconf: drop invalid
      chain=forward action=drop connection-state=invalid 

 5    ;;; defconf: drop all from WAN not DSTNATed
      chain=forward action=drop connection-state=new connection-nat-state=!dstnat in-interface-list=WAN 
With this in place, I can set use-ip-firewall=no for my bridge, I don't see the huge latency problems I saw before, and the throughput problems I've seen with use-ip-firewall=yes are gone. I'm also using the hEX solely as a bridge because I was going to put a switch in after the hEX anyway (to add some 10G links). I think my problem would be solved if I went back to using the built-in hEX switch, but I'm not going to bother testing.
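
For anyone who wants to reproduce rule 3 from the listing above, it could be entered from the terminal along these lines. This is a sketch: the comment text is made up, and the place-before lookup assumes the default "defconf: drop invalid" comment is still on the forward-chain drop rule. Dragging the rule into position in WinBox or WebFig works just as well.

```routeros
# Accept LAN -> WireGuard packets even when conntrack considers them invalid,
# and place the rule ahead of the forward-chain "drop invalid" rule.
/ip firewall filter
add action=accept chain=forward \
    comment="LAN to WireGuard: accept despite routing triangle" \
    connection-state=invalid,new in-interface=bridge \
    src-address=192.168.20.0/24 dst-address=192.168.21.0/24 \
    place-before=[find where chain=forward comment="defconf: drop invalid"]
```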

Re: Approximately 5s delay in TCP connections when using a static route

Posted: Thu Dec 31, 2020 2:31 am
by erkexzcx
That's something I learnt too. :)