This is my current setup:
And this is the "command window":
Explained: Winbox on the left shows the NAT router: connection tracking (with broadcast crud filtered out), the interfaces, and the CPU Profile tool. Top right is the BRAS, with 8 PPPoE clients rate limited to 4Mbps (with 5Mbps burst). These 8 clients are the RB411s, remotely controlled via a simple app I've written, which can run a variety of load tests on individual clients or on all of them at once. The load tests revolve around downloading a 5MB .zip file from the web server in the diagram above.
Steps to trigger the issue:
1. Start a parallel download load test, where each client downloads the .zip file multiple times in parallel via the fetch tool. This is what we see while running:
We can see that connection tracking captures a number of simultaneous connections from the clients on 172.16.0.X:Y to 10.0.0.124:80. So far so good. We can also see CPU load is quite high, which is expected.
2. Disable the PPPoE interface on the BRAS:
The downloads stop, but almost immediately thousands of tracked connections build up on the NAT router, all of them with the maximum expiry time of 1 day (the default MikroTik setting). These connections all come from the web server on 10.0.0.124, but with increasing source ports starting at 1, and all of them are aimed at the addresses of the PPPoE clients, which are now "dead".
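For anyone reproducing this, the build-up can also be checked from a terminal on the NAT router rather than from Winbox. This is only a rough sketch: it assumes the bogus entries are the only tracked connections whose source address is the web server (10.0.0.124), and that the connection table lists src-address as address:port text.

```
# Count tracked connections whose source is the web server.
# The dots in the pattern are unescaped, so this is a loose match.
/ip firewall connection print count-only where src-address~"^10.0.0.124:"

# List the same entries; the TIMEOUT column shows the remaining lifetime.
/ip firewall connection print where src-address~"^10.0.0.124:"

# Show the configured timeouts, including tcp-established-timeout.
/ip firewall connection tracking print
```

The legitimate downloads should not match, since they are tracked with the client (172.16.0.X:Y) as source and 10.0.0.124:80 as destination.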
I captured a trace of this event, and it shows a flood of ACK packets from 10.0.0.124 with increasing source ports, starting at 1. Wireshark decodes the ACKs from port 123 as NTP, from port 139 as NetBIOS, etc. Of interest is that each ACK appears three times and the source and destination MAC addresses change between the copies: the first packet goes from the NAT router's MAC to the BRAS MAC, and the other two go from the BRAS MAC to the NAT router's MAC:
Source      Source Port  Destination   Dest Port  Protocol  Length  Src MAC            Dst MAC
10.0.0.124  1            172.16.0.253  56633      TCP       1494    6c:3b:6b:9b:82:c3  b8:69:f4:12:69:53
10.0.0.124  1            172.16.0.253  56633      TCP       1494    b8:69:f4:12:69:53  6c:3b:6b:9b:82:c3
10.0.0.124  1            172.16.0.253  56633      TCP       1494    b8:69:f4:12:69:53  6c:3b:6b:9b:82:c3
I can share the full pcap dump if someone wants to dig into it, but it's basically thousands of the above, with the source port increasing by 1 on each try.
Relevant configuration for the NAT router:
/interface bridge add admin-mac=6C:3B:6B:9B:82:C3 auto-mac=no fast-forward=no name=BR_LAN protocol-mode=none
/ip address add address=10.40.40.10/24 interface=ETH2_V_50_WAN network=10.40.40.0
/ip address add address=192.168.100.1/24 comment="Internal IP for PPPoE BRAS" interface=BR_LAN network=192.168.100.0
/ip dns set allow-remote-requests=yes
/ip firewall filter add action=accept chain=input comment="defconf: accept established,related,untracked" connection-state=established,related,untracked
/ip firewall filter add action=accept chain=forward comment="defconf: accept established,related, untracked" connection-state=established,related,untracked
/ip firewall filter add action=drop chain=forward connection-state=invalid
/ip firewall filter add action=drop chain=input connection-state=invalid
/ip firewall filter add action=drop chain=forward connection-nat-state=!dstnat connection-state=new in-interface=ETH1_MGMT
/ip firewall filter add action=accept chain=input comment="defconf: accept ICMP" protocol=icmp
/ip firewall filter add action=accept chain=forward comment="defconf: accept in ipsec policy" ipsec-policy=in,ipsec
/ip firewall filter add action=accept chain=forward comment="defconf: accept out ipsec policy" ipsec-policy=out,ipsec
/ip firewall mangle add action=mark-routing chain=prerouting new-routing-mark=nat_routing passthrough=yes src-address=172.16.0.2-172.16.0.254
/ip firewall nat add action=src-nat chain=srcnat comment="NAT PPPoE BRAS traffic" out-interface=ETH1_MGMT src-address=192.168.100.2 to-addresses=10.0.0.44
/ip firewall nat add action=src-nat chain=srcnat comment="NAT CPE PPPoE traffic" out-interface=ETH1_MGMT src-address=172.16.0.2-172.16.0.254 to-addresses=10.0.0.44
/ip route add check-gateway=ping distance=1 dst-address=172.16.0.0/24 gateway=192.168.100.2
As far as I can tell the BRAS plays no role in this, so I will not post its config unless someone needs it.
Fundamental questions:
1. Why would the firewall connection tracker interpret ACK packets from unrelated ports as new connections and start tracking them? They are NOT related to any existing tracked connection from a PPPoE client to the web server, all of which use port 80.
2. Why would it track each retry of the ACKs as a new connection?
3. Why does it apply the maximum expiry time, causing the RB to lock up resources for a whole day? As far as I have tested, these connections are NOT cleared unless they are manually removed (or I guess I could wait 24 hours and see if they expire...).
4. Why do the ACKs go first from NAT->BRAS and then twice from BRAS->NAT in terms of MAC addresses?
5. Am I doing something totally stupid here that's causing this? This is totally possible!
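On the manual cleanup mentioned in question 3: for anyone who wants to reproduce this, the stale entries can be removed in one go from a terminal on the NAT router. This is a sketch under the same assumption as the symptoms described above, namely that every tracked connection with the web server (10.0.0.124) as its source is one of the bogus ones; check the match with a print first if your setup differs.

```
# Remove all tracked connections whose source is the web server.
# The dots in the pattern are unescaped, so this is a loose match.
/ip firewall connection remove [find where src-address~"^10.0.0.124:"]
```

This only clears the table after the fact; it does not explain or prevent the build-up.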