Terrible connection tracking bug... or terrible stupidity! [SOLVED] (it was me!)

flameproof · Thu Sep 17, 2020 1:10 pm

I've been following the suggestions from Syed on his blog post, and have managed to replicate the setup in my lab (this is to fix the random PPPoE disconnection floods & CCR lockups reported on various threads previously).

This is my current setup:

PPPoE Test Setup.png

And this is the "command window":

Screen Shot 2020-09-17 at 07.56.58.png

Explained: Winbox on the left is the NAT router, showing connection tracking with broadcast crud filtered out, interfaces, and CPU Profile tool. Top right is the BRAS, with 8 PPPoE clients at 4Mbps rate limit (with 5Mbps burst). These 8 clients are the RB411s, remotely controlled via a simple app I've written, which can run a variety of load tests on individual clients or all at once. The load tests revolve around downloading a 5MB .zip file from the web server in the diagram above.

Steps to trigger the issue:

1. Start a parallel download load test, where each client downloads the .zip file multiple times in parallel via fetch tool. This is what we see while running:

Screen Shot 2020-09-17 at 07.57.59.png

We can see that connection tracking captures a number of simultaneous connections from the clients on 172.16.0.X:Y to 10.0.0.124:80. So far so good. We can also see CPU load is quite high, which is expected.

2. Disable the PPPoE interface on the BRAS:

Screen Shot 2020-09-17 at 07.58.19.png

The downloads stop, but almost immediately, thousands of tracked connections are built up on the NAT router, all of them with the maximum expiry time of 1 day (the default Mikrotik setting). These connections all come from the web server on 10.0.0.124, but with increasing source ports starting at 1, all aimed at the addresses of the PPPoE clients which are now "dead".

I captured a trace of this event, and it shows a flood of ACK packets from 10.0.0.124 sourced on increasing ports, starting at 1. Wireshark decodes the ACKs on port 123 as NTP, port 139 as NetBIOS, etc. Of interest is that the source and destination MAC addresses change across the three retries per ACK: the first packet goes from the NAT router's MAC to the BRAS MAC, and the other two go from the BRAS MAC to the NAT router's MAC:

Source	Source Port	Destination	Dest Port	Protocol	Length	Src MAC		Dst MAC
10.0.0.124	1	172.16.0.253	56633	TCP	1494	6c:3b:6b:9b:82:c3	b8:69:f4:12:69:53
10.0.0.124	1	172.16.0.253	56633	TCP	1494	b8:69:f4:12:69:53	6c:3b:6b:9b:82:c3
10.0.0.124	1	172.16.0.253	56633	TCP	1494	b8:69:f4:12:69:53	6c:3b:6b:9b:82:c3

I can share the full pcap dump if someone wants to dig into it, but it's basically 1000s of the above, with source port increasing by 1 on each try.

Relevant configuration for the NAT router:

/interface bridge add admin-mac=6C:3B:6B:9B:82:C3 auto-mac=no fast-forward=no name=BR_LAN protocol-mode=none
/ip address add address=10.40.40.10/24 interface=ETH2_V_50_WAN network=10.40.40.0
/ip address add address=192.168.100.1/24 comment="Internal IP for PPPoE BRAS" interface=BR_LAN network=192.168.100.0
/ip dns set allow-remote-requests=yes
/ip firewall filter add action=accept chain=input comment="defconf: accept established,related,untracked" connection-state=established,related,untracked
/ip firewall filter add action=accept chain=forward comment="defconf: accept established,related, untracked" connection-state=established,related,untracked
/ip firewall filter add action=drop chain=forward connection-state=invalid
/ip firewall filter add action=drop chain=input connection-state=invalid
/ip firewall filter add action=drop chain=forward connection-nat-state=!dstnat connection-state=new in-interface=ETH1_MGMT
/ip firewall filter add action=accept chain=input comment="defconf: accept ICMP" protocol=icmp
/ip firewall filter add action=accept chain=forward comment="defconf: accept in ipsec policy" ipsec-policy=in,ipsec
/ip firewall filter add action=accept chain=forward comment="defconf: accept out ipsec policy" ipsec-policy=out,ipsec
/ip firewall mangle add action=mark-routing chain=prerouting new-routing-mark=nat_routing passthrough=yes src-address=172.16.0.2-172.16.0.254
/ip firewall nat add action=src-nat chain=srcnat comment="NAT PPPoE BRAS traffic" out-interface=ETH1_MGMT src-address=192.168.100.2 to-addresses=10.0.0.44
/ip firewall nat add action=src-nat chain=srcnat comment="NAT CPE PPPoE traffic" out-interface=ETH1_MGMT src-address=172.16.0.2-172.16.0.254 to-addresses=10.0.0.44
/ip route add check-gateway=ping distance=1 dst-address=172.16.0.0/24 gateway=192.168.100.2

The BRAS has no role in this that I can tell, so will not post the config unless someone requires it.

Fundamental questions:

1. Why would the firewall connection tracker interpret ACK packets on ports not related as new connections, and start tracking them? These are NOT related to an existing tracked connection from the PPPoE client to the web server, which uses port 80.

2. Why would it track each re-try of the ACKs as new connections?

3. Why does it apply the maximum expiry time, causing the RB to lock up resources for a whole day? As far as I have tested, these connections are NOT cleared unless they are manually removed (or I guess I could wait 24 hours and see if they expire...).

4. Why do the ACKs come first from NAT->BRAS then twice from BRAS->NAT in terms of MAC addresses?

5. Am I doing something totally stupid here that's causing this? This is totally possible!

sindy · Thu Sep 17, 2020 2:02 pm

Fundamental questions:

1. Why would the firewall connection tracker interpret ACK packets on ports not related as new connections, and start tracking them? These are NOT related to an existing tracked connection from the PPPoE client to the web server, which uses port 80.

Because the source ip:port and destination ip:port do not match any existing tracked connection, and because you have loose-tcp-tracking under /ip firewall connection tracking set to the default yes, which means that the conntrack module doesn't look at presence of the SYN flag in the TCP packet in order to treat it as connection-state=new.
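For reference, a minimal sketch of checking and changing that setting on the NAT router, assuming RouterOS 6 syntax:

```
# Show the current connection tracking settings, including loose-tcp-tracking
/ip firewall connection tracking print
# Require a SYN before a TCP packet can create a new tracked connection
/ip firewall connection tracking set loose-tcp-tracking=no
```

With this set to no, a stray ACK that matches no tracked connection should be classified as connection-state=invalid, so your existing drop rules for invalid packets in the forward and input chains would discard it.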

2. Why would it track each re-try of the ACKs as new connections?

If I read you right, these "re-tries" are not retransmissions as such, because each of these ACKs comes from a different source port. So it is no surprise that each spawns a new tracked connection (plus there wouldn't be so many of them if they were actual retransmissions by a well-behaving TCP stack).

3. Why does it apply the maximum expiry time, causing the RB to lock up resources for a whole day? As far as I have tested, these connections are NOT cleared unless they are manually removed (or I guess I could wait 24 hours and see if they expire...).

Or you could reduce the tcp-established-timeout to something much less than 1d, but basically yes, this is it. Whenever a packet bearing an ACK flag is the last one to be seen in a tracked TCP connection, the lifetime of that connection is reset to tcp-established-timeout; whenever a packet without the ACK flag is seen, the lifetime is reset to tcp-unacked-timeout.
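A rough sketch of both options on the NAT router, assuming RouterOS 6 syntax. The 1h value is only an example, not a recommendation, and the match on 10.0.0.124 assumes the phantom entries are the only ones that have the web server as their original source:

```
# Shorten the lifetime of established TCP entries (example value only)
/ip firewall connection tracking set tcp-established-timeout=1h
# Manually remove the entries already created by the flood;
# the pattern is a plain substring match on the "address:port" field
/ip firewall connection remove [find where src-address~"10.0.0.124:"]
```

Keep in mind that a shorter tcp-established-timeout applies to every tracked TCP connection, so long-lived idle sessions would expire sooner as well.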

4. Why do the ACKs come first from NAT->BRAS then twice from BRAS->NAT in terms of MAC addresses?

I'd say it is due to routing - when the PPPoE client is up, the BRAS has a connected route to it, so it sends the packet down the PPPoE pipe; while it is dead, its address is unassigned, and the BRAS sends the packet down the default route, which is the NAT router.

It might also explain the mystery why the ACKs come with increasing source ports, but the fact that the source address of those ACKs is the one of the server spoils this theory.

I would normally expect to see two packets in each macX -> macY direction (bridge + its member port), but I don't have enough information about the internal setup of the devices nor the exact point you are sniffing at.

5. Am I doing something totally stupid here that's causing this? This is totally possible!

I'd say the only omission is the routing at the BRAS. You should add a type=blackhole route towards the pool from which the PPPoE clients get their address assignments. Whenever a PPPoE client is active, the blackhole route will be shadowed by the dynamically added connected one (with distance=0). Setting loose-tcp-tracking to no should also get rid of those thousands of tracked connections, even if the increasing source port on the ACK packets is not caused by the routing loop combined with NAT but has a different cause.
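A minimal sketch of such a route on the BRAS, assuming RouterOS 6 syntax and that the pool sits inside 172.16.0.0/24 (going by the 172.16.0.X client addresses in your capture); the comment text is just an example:

```
# On the BRAS: discard traffic for pool addresses that have no active PPPoE session
/ip route add dst-address=172.16.0.0/24 type=blackhole comment="Drop traffic to inactive PPPoE clients"
```

Packets for a disconnected client then die at the BRAS instead of following the default route back to the NAT router.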

flameproof · Thu Sep 17, 2020 3:37 pm

Hi sindy, thanks so much for your thoughtful reply! I also came across a post from April where you mentioned loose-tcp-tracking. How does this setting impact the box in terms of resources CPU/memory? I don't want to jump out of the frying pan into the fire, so-to-speak...

Rather than go blow-by-blow, I'll respond in one block -- the original traffic capture was made on the NAT router. I have now made captures on the web server itself via tcpdump, and on the BRAS server's upstream port, the one that connects to the NAT router's port that's part of a bridge. The first conclusion is that the web server is NOT sending the "fake" ACKs at all. You can also see the rate limiter on the BRAS in effect, with the regular blocks of retransmissions:

Screen Shot 2020-09-17 at 14.16.02.png

When the link is cut off, the web server tries a few retransmissions before giving up.

On the NAT router, this is what's captured on the upstream interface, the one facing 10.0.0.124 which is the webserver, and which has 10.0.0.44 as its own IP address:

Screen Shot 2020-09-17 at 14.20.47.png

Again, all normal, matches what's seen at the web server. If I sniff on the bridge interface of the NAT router, to which the BRAS is connected, we see this:

Screen Shot 2020-09-17 at 14.24.57.png

followed by a boatload of retransmissions of the same ACKs:

Screen Shot 2020-09-17 at 14.25.24.png

If I filter by e.g. port 317, one of the "random" ports supposedly chosen by 10.0.0.124, we see one ACK followed by 13 retransmission attempts. This is what the BRAS sees, as captured on the upstream WAN interface that's facing the NAT router's bridge:

Screen Shot 2020-09-17 at 14.28.19.png

It seems quite evident that the ACK packets are actually hitting the BRAS, coming from the NAT router.

I can confirm that if I disable loose TCP tracking, the "phantom" connections do not take place, there are no ACKs from random ports, etc. It all looks "normal", and only the tracked connection from 172.16.0.247 (the PPPoE client) to 10.0.0.124:80 remains on the connections table, with an expiry timeout of 5 minutes:

Screen Shot 2020-09-17 at 14.32.11.png

The hard question is how can anything be generating "fake" packets with increasing port numbers in such massive numbers, and in response to what? I am proper stumped!

sindy · Thu Sep 17, 2020 4:23 pm

Hi sindy, thanks so much for your thoughtful reply! I also came across a post from April where you mentioned loose-tcp-tracking. How does this setting impact the box in terms of resources CPU/memory? I don't want to jump out of the frying pan into the fire, so-to-speak...

I don't expect the load coming from inspecting one more bit (bear in mind that the ACK flag value is also inspected in every single TCP packet) to be significantly higher. So no memory impact at all while handling normal traffic, and a small bit of CPU.

The hard question is how can anything be generating "fake" packets with increasing port numbers in such massive numbers, and in response to what? I am proper stumped!

I'm afraid it must be the routing loop as I've described in the previous post, combined with the src-nat setting on the NAT router. The only thing which is not logical is that the source address remains the actual one of the server. All the rest makes sense: the NAT router sends the packet with source port X and destination port Y. When the BRAS returns it (because the only route to the destination address is the default one via the NAT router), the source port remains X and the destination port remains Y, but this time it is a received packet from the NAT router's perspective, so the router cannot match it to any existing tracked connection and thus creates a new one. And as it treats the packet as the first one of a new connection, it also consults the NAT table and finds out that the source port is already used by another connection, so it has to be substituted. Hence the packet leaves with the next source port not used by any connection with the same source and destination address and destination port.

So now please add the blackhole route I've suggested to the BRAS, which should be sufficient to address the situation even if you set loose-tcp-tracking back to yes. If it really is sufficient, I'd like to see the export of the configuration of the NAT router (not a screenshot from Winbox/WebFig), to understand how it is possible that only the port is NATed but the address remains unchanged during the process above.

Also, unless you are setting a fixed TTL value during forward on either of the two routers, you should see it decreasing in those ACK packets as they circulate. That's another illustration of why screenshots are useless: you cannot click on the packet list row to see the whole dissection of that packet...
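If it helps, a capture file can be written on the router itself and attached instead of a screenshot. A rough sketch, assuming the RouterOS 6 sniffer parameters, the BR_LAN bridge from your NAT router config, and a made-up file name (loop.pcap):

```
# Capture only traffic involving the web server on the bridge facing the BRAS
/tool sniffer set filter-interface=BR_LAN filter-ip-address=10.0.0.124/32 file-name=loop.pcap
/tool sniffer start
# ... reproduce the disconnect, then stop the capture and download the file
/tool sniffer stop
```

Opening the resulting file in Wireshark shows the full dissection of each packet, including the IP TTL field.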

flameproof · Thu Sep 17, 2020 5:35 pm

You are on the right track: it looks like a routing loop (not had one before!). Looking at the packet details in Wireshark, "normal" traffic has a TTL of 63. When the link is cut off, there is a ping-pong of ACKs between the NAT and BRAS boxes, with the TTL decreasing by one each time. When the TTL gets to 1, the retransmissions start, again with a TTL of 63 and decreasing...

I have attached the pcap file with the relevant portion:

NAT_Bridge.pcap.zip

The relevant BRAS configuration is thus, without adding your blackhole suggestion, which I'm going to do next:

/interface bridge add admin-mac=B8:69:F4:12:69:53 auto-mac=no fast-forward=no name=BR_LAN protocol-mode=none
/interface vlan add disabled=yes interface=BR_LAN name=BR_V_200_PPPOE vlan-id=200
/ip pool add name=VALID_ACCOUNTS ranges=172.16.0.2-172.16.0.254
/ppp profile add change-tcp-mss=no dns-server=172.16.0.1 local-address=172.16.0.1 name=PPPoE_CLIENTS only-one=yes remote-address=VALID_ACCOUNTS use-compression=yes use-encryption=yes
/ip firewall connection tracking set enabled=no loose-tcp-tracking=no
/interface pppoe-server server add authentication=mschap2 default-profile=PPPoE_CLIENTS disabled=no interface=BR_V_200_PPPOE keepalive-timeout=90 one-session-per-host=yes service-name=PPPoE_SERVER
/ip address add address=172.18.64.10/24 comment="PPPoE routed traffic" disabled=yes interface=ETH2_WAN network=172.18.64.0
/ip address add address=192.168.100.2/24 comment="BRAS internal traffic" interface=ETH2_WAN network=192.168.100.0
/ip firewall filter add action=accept chain=input comment="defconf: accept ICMP" protocol=icmp
/ip firewall filter add action=accept chain=input comment="defconf: accept to local loopback (for CAPsMAN)" dst-address=127.0.0.1
/ip firewall filter add action=accept chain=forward comment="defconf: accept in ipsec policy" ipsec-policy=in,ipsec
/ip firewall filter add action=accept chain=forward comment="defconf: accept out ipsec policy" ipsec-policy=out,ipsec
/ip firewall mangle add action=mark-routing chain=prerouting comment="Mark traffic for upstream routing" dst-address=!172.16.0.1 new-routing-mark=nat_routing passthrough=yes src-address=172.16.0.2-172.16.0.254
/ip route add check-gateway=ping comment="Route PPPoE traffic upstream" distance=1 gateway=192.168.100.1 routing-mark=nat_routing
/ip route add distance=1 gateway=192.168.100.1

Thanks again!

flameproof · Thu Sep 17, 2020 5:43 pm

I have now added a type=blackhole route with destination address 172.16.0.0/24, and the flood no longer happens! So, routing loop confirmed, with the mystery of the source address being the web server, as you mentioned.

sindy · Thu Sep 17, 2020 5:52 pm

The relevant BRAS configuration is this

Yeah, but I was interested in the configuration of the NAT router, where the source port changes without the source address changing.

flameproof · Thu Sep 17, 2020 6:12 pm

I posted the relevant config of the NAT router in the first post of the thread, are there other areas of config you’re missing? I can post the whole thing but there are no other NAT/mangle or routing rules.

sindy · Thu Sep 17, 2020 8:56 pm

I posted the relevant config of the NAT router in the first post of the thread, are there other areas of config you’re missing? I can post the whole thing but there are no other NAT/mangle or routing rules.

The OP was so packed with information that I've missed that, sorry :)

Normally, I'd say that the word "relevant" is dangerous here, because the issue is typically in the part which you deem unrelated. But as you state here that there are no other NAT rules, I'm afraid this behaviour is caused by some "undocumented feature" of the firewall, as neither of these two src-nat rules should match that traffic. So I'll try to replicate the routing loop on CHR 6.46.7 (the fresh long-term as of now) and see whether it behaves the same way.

sindy · Thu Sep 17, 2020 9:33 pm

OK, I've tried the same on CHR 6.46.7, including the disabling of the address on the "BRAS" during an ongoing TCP session (rather than initiating a new session while the loop has already been in place). As that behaviour didn't show up, I've even added a NAT rule which translates the src-address used for the test to the same to-addresses - still nothing. So it may be a specialty of the CPU architecture or of the RouterOS release you run on the NAT router, no idea.

flameproof · Thu Sep 17, 2020 11:25 pm

You are right that sometimes the relevant stuff is hidden in what you consider not relevant... been there, done that! In this case, the configuration is rather simple.

I'll raise this as a bug with Mikrotik, given the "weird" nature of the PPPoE "flaps" we keep seeing with no way to properly debug them, it's not far-fetched to think this behavior may have something to do with it. The "flaps" are triggered by the physical disconnect of a large number of active PPPoE clients, however, we have no static routing or split duties setup yet (that's the intent of my lab test that started this thread...).

These tests are all run on RB750 variants, so fairly limited resources.

Thanks a lot, again, for your input, really appreciate it!

sindy · Fri Sep 18, 2020 9:11 am

So far, Mikrotik's response to these symptoms (everything down after simultaneous disconnection of a group of PPPoE clients due to network failure) has been consistently "too many clients given the CPU capabilities, use a more powerful hardware for your network and/or split the tasks" - except the suggestion not to use masquerade for src-nat, because the masquerade's bulk removal of tracked connections from the conntrack table when the uplink flaps is a known source of problems. One may agree or not with that (why should so much more CPU be spent on handling the few control packets related to disconnection/reconnection when it handles encapsulation/decapsulation of hundreds of thousands of packets to/from the PPP transport packets until the disconnection happens), but that's all one can do.

I intentionally write "symptoms" because the actual cause definitely is CPU overload, but the tasks which cause this overload, in addition to the inevitable handling of client unavailability (when PPP keepalives stop being answered), may differ depending on the setup.

Again, your clear configuration mistake was the absence of the blackhole route. Whilst the relative impact of handling the same packet about 50 times (until the TTL expires) on the total load is yet to be determined, even if small, it may be the last drop crossing the edge between "business as usual" and the catastrophic evolution of events (where the fact that the CPU lags behind while handling a disconnection of a group of clients "by natural reasons" causes more clients to disconnect, thus even more disconnections to handle - what I have seen was that the CPU was so busy that it was sending the PPP keepalives late, and probably also evaluating the received responses late, declaring the connections dead as a consequence).

The fact that the "spurious NAT" is happening is not nice but it only surfaces if the routing loop is there, so as such it doesn't seem like a high priority issue to me (of course any unexpected behaviour may be a symptom of something more critical but that's another can of worms). Can you state the exact model of the 750xy which you've used as the NAT router in your lab setup, and the RouterOS release it is running? I've recently found some issues to be CPU architecture dependent, so if I've got a device with the same CPU architecture in my zoo, I'd like to re-test the "spurious NAT" with this setup on it.

flameproof · Fri Sep 18, 2020 9:24 am

Yes, the response I got from Mikrotik was about the same ("Our hardware is not up to your particular needs"), however no suggestion to split duties was made, I found that one by searching for options.

My biggest peeve at this time is the lack of proper visibility into the innards and performance in certain areas, such as DNS or PPP, via higher-resolution tracing (e.g. sniffing runs in the background dumping all traffic to file, why not the same for CPU profiling faster than 1Hz?). This makes it very hard to determine when your hardware is at the end of the line and you should upgrade. There is circumstantial evidence, but no way to verify it.

My setup is the NAT router on a RB 750 r2 / hEX lite, running 6.47.3, and the BRAS on a RB 951Ui-2HnD (revision r2) also running 6.47.3.

I agree the loop itself was self-induced, so I'm updating the thread title :-)
