Hi,
TL;DR version: A TCP connection routed by my hAP ac lite behaves as if the TCP window can only hold a single packet at a time and every frame has to be ACKed individually. I blame RouterOS but I don’t know what the exact cause might be. tcpdump on one side shows that the server’s response contains 8 frames before it waits for an ACK (I suppose that’s the TCP slow start I’m looking at). On the other end the client only sees the first frame and acknowledges it. The interesting part is that if I run
/tool/sniffer/quick interface=wg0
the issue goes away, as if everything is fine. The behavior also only appears if I establish a connection whose client IP belongs to a network that RouterOS knows nothing about, so it has to forward the frame to the next hop. If the network is attached to the router (in my case that’s the network inside the VPN tunnel), then everything is grand and there is no packet loss anywhere.
I’m not sure where to start, so I’ll start with my use case.
What I’m trying to do
I have a VM in the cloud with a public IP, and I want to use it as an L3 proxy/NAT to reach services running inside a local network that is behind NAT and can’t easily be reached from the outside. For that I created a WireGuard tunnel between my routerboard and that VM.
My Network Topology
My routerboard in local network:
[admin@MikroTik] > /system/routerboard/print
routerboard: yes
board-name: hAP ac lite
model: RB952Ui-5ac2nD
revision: r2
serial-number: *
firmware-type: qca9531L
factory-firmware: 6.44.5
current-firmware: 7.9.2
upgrade-firmware: 7.9.2
Routerboard interfaces:
/interface wireguard
add listen-port=7676 mtu=1420 name=wg0
/interface wireguard peers
add allowed-address=0.0.0.0/0 endpoint-address=$vm_public_ip endpoint-port=51820 interface=wg0 persistent-keepalive=10s
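For reference, the VM side of the tunnel (which I don’t have an export of here) is set up roughly like this; treat it as a sketch, with placeholder key paths and peer key, where 10.10.20.2 is the VM’s tunnel address:

```shell
# VM side of the WireGuard tunnel (sketch; key path and peer pubkey are placeholders)
ip link add dev wg0 type wireguard
ip address add 10.10.20.2/24 dev wg0
wg set wg0 listen-port 51820 private-key /etc/wireguard/vm.key \
    peer "$routerboard_pubkey" allowed-ips 10.10.20.0/24,192.168.88.0/24
ip link set up dev wg0
```

The allowed-ips on the VM has to include 192.168.88.0/24, otherwise WireGuard’s cryptokey routing would drop the DNATed packets destined for 192.168.88.16.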
networks:
[admin@MikroTik] /ip/address> print
Flags: D - DYNAMIC
Columns: ADDRESS, NETWORK, INTERFACE
# ADDRESS NETWORK INTERFACE
;;; defconf
0 192.168.88.1/24 192.168.88.0 bridge
1 10.10.20.1/24 10.10.20.0 wg0
2 D 10.10.6.30/24 10.10.6.0 wan1
firewall rules:
192.168.88.16 is the IP of the service I want to expose.
[admin@MikroTik] /ip/firewall> export
/ip firewall filter
add action=accept chain=input connection-state=established,related,untracked
add action=accept chain=input protocol=icmp
add action=accept chain=input in-interface-list=LAN
add action=drop chain=input
add action=fasttrack-connection chain=forward connection-state=established,related hw-offload=yes
add action=accept chain=forward connection-state=established,related
add action=accept chain=forward connection-state=new dst-address=192.168.88.16
add action=accept chain=forward in-interface-list=LAN
add action=drop chain=forward
/ip firewall mangle
add action=mark-connection chain=prerouting connection-mark=no-mark connection-state=new in-interface=wg0 new-connection-mark=vpn_conn
add action=mark-routing chain=prerouting connection-mark=vpn_conn new-routing-mark=conns_from_vpn
/ip firewall nat
add action=masquerade chain=srcnat comment="defconf: masquerade" ipsec-policy=out,none out-interface-list=WAN
The router also has 2 routing tables, so that the WireGuard tunnel endpoint is used as the default next hop for connections established through that tunnel.
[admin@MikroTik] /routing/table> print
Flags: D - dynamic; X - disabled, I - invalid; U - used
0 D name="main" fib
1 name="conns_from_vpn" fib
routing table:
[admin@MikroTik] /ip/route> print detail
Flags: D - dynamic; X - disabled, I - inactive, A - active;
c - connect, s - static, r - rip, b - bgp, o - ospf, d - dhcp, v - vpn, m - modem, y - bgp-mpls-vpn; H - hw-offloaded; + - ecmp
DAd dst-address=0.0.0.0/0 routing-table=main pref-src="" gateway=10.10.6.1 immediate-gw=10.10.6.1%wlan1 distance=1 scope=30 target-scope=10
vrf-interface=wlan1 suppress-hw-offload=no
DAc dst-address=10.10.6.0/24 routing-table=main gateway=wan1 immediate-gw=wan1 distance=0 scope=10 suppress-hw-offload=no
local-address=10.10.6.30%wan1
DAc dst-address=10.10.20.0/24 routing-table=main gateway=wg0 immediate-gw=wg0 distance=0 scope=10 suppress-hw-offload=no
local-address=10.10.20.1%wg0
DAc dst-address=192.168.88.0/24 routing-table=main gateway=bridge immediate-gw=bridge distance=0 scope=10 suppress-hw-offload=no
local-address=192.168.88.1%bridge
0 As dst-address=0.0.0.0/0 routing-table=conns_from_vpn pref-src="" gateway=wg0 immediate-gw=wg0 distance=1 scope=30 target-scope=10
suppress-hw-offload=no
1 As dst-address=192.168.88.0/24 routing-table=conns_from_vpn pref-src="" gateway=bridge immediate-gw=bridge distance=1 scope=30 target-scope=10
suppress-hw-offload=no
The VM works as a simple DNAT box and has a single NAT rule like this:
iptables -t nat -A PREROUTING -p tcp -d $vm_public_ip --dport 443 -j DNAT --to-destination 192.168.88.16:443
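Besides the DNAT rule, the VM only needs forwarding enabled; roughly this (a sketch, assuming the FORWARD chain isn’t already open by default):

```shell
# Sketch of the rest of the VM setup around the DNAT rule
sysctl -w net.ipv4.ip_forward=1                      # let the VM route packets
iptables -A FORWARD -p tcp -d 192.168.88.16 --dport 443 -j ACCEPT
iptables -A FORWARD -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT
```

No SNAT is needed on the VM: replies come back through the tunnel thanks to the conns_from_vpn routing mark on the routerboard, and conntrack on the VM reverses the DNAT.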
How it’s all supposed to work
The client reaches $vm_public_ip and establishes a connection. The VM applies DNAT and forwards all traffic to 192.168.88.16, which is reachable through the WireGuard tunnel established between the VM and the routerboard. The routerboard decapsulates the traffic, marks the whole connection as ‘vpn_conn’ and forwards it to the destination.
On the way back the routerboard sees that it’s a vpn_conn frame and forwards it into the tunnel using the default route from the conns_from_vpn routing table. On the other end of the tunnel the VM does its job as a NAT box and sends the packet on to the client.
Traffic Sniffing and debugging part
I simply run wget http://$vm_public_ip/10gb.bin and capture pcaps on every interface I can.
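The captures themselves are nothing fancy; on the VM I record both sides of the tunnel at once (a sketch; eth0 stands in for whatever the VM’s public interface is actually called):

```shell
# Capture the encrypted tunnel and the decrypted traffic in parallel on the VM
tcpdump -ni eth0 -w vm_wan.pcap udp port 51820 &
tcpdump -ni wg0  -w vm_wg0.pcap tcp port 443 &
wait   # Ctrl-C stops both captures
```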
On the server side I see the usual TCP handshake. Then 1 ingress packet with the HTTP request, 8 egress packets containing the first part of the response, and a 1-second delay. Then 1 ingress ACK for the first packet of the response, again a response of 8 packets, again a 1-second delay, again an ACK only for the second packet of the response, and so on.
On the client side I see a normal TCP handshake, 1 packet of HTTP request, a 1-second delay, 1 packet of HTTP response, an ACK, a 1-second delay, the second packet of the HTTP response, etc. As if the client’s TCP window can only hold 1 packet at a time (which is not true: it’s 64k and the window scale TCP option is present).
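(For reference, with window scaling the effective receive window is the raw 16-bit window field shifted left by the scale negotiated in the handshake, so e.g. a raw value of 512 with scale 7 already gives the 64k I see advertised:)

```shell
# Effective window = raw 16-bit window field << negotiated window-scale option
raw_window=512
scale=7
echo $(( raw_window << scale ))   # prints 65536, i.e. 64 KiB
```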
That makes me think that RouterOS loses all but the first frame in each batch, so I decided to run the sniffer from RouterOS, and the problem completely disappeared! The routerboard’s CPU is now at 100% crunching through the packets at 60 Mbit/s and everyone is happy. That is really strange to me, and I can only guess how turning the sniffer on can help.
Another interesting observation: if I run wget on the VM itself against the 10.10.20.2 IP, there is also no packet loss. Packets are only lost when the traffic is NATed on the VM. I still believe it’s RouterOS causing the trouble, though, because packets are lost somewhere inside the WireGuard tunnel, before being decrypted and before reaching the Linux network stack.
Maybe WireGuard doesn’t know what to do with some of the packets when the dst IP is not in any table nor is it any tunnel endpoint IP, but then I’d expect all traffic to be lost.
The problem is most certainly not in conntrack, as I can turn RouterOS’s sniffer on and off multiple times during the same HTTP GET request and see the packet loss going away and reappearing.
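For reproducibility, the toggling was just the stock sniffer started and stopped from the CLI, roughly like this (RouterOS commands, so treat the fence as a transcript rather than a shell script):

```shell
# RouterOS CLI: toggle the sniffer mid-transfer and watch the loss come and go
/tool/sniffer/set filter-interface=wg0
/tool/sniffer/start
:delay 10s
/tool/sniffer/stop
```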
I’m hoping it’s just some misconfiguration on my side. If anyone has any ideas about what could cause this behavior, or what else I can do to diagnose the problem, I would greatly appreciate them!