NAT connection tracking

Right, so the way the Linux kernel deals with NAT has one flaw that I'm aware of: if no route is available the first time a packet for a flow is seen, MASQUERADE won't get applied, and when the route does come up your un-NAT'ed IP becomes visible (obviously assuming the flow hasn't timed out yet).

For TCP this isn't generally a problem, and DNS runs with different source ports, so also not a problem. Enter VoIP, which typically uses port 5060 on both sides (or 4569 if you're using IAX2). Huge problems. So in my experience, the way I solve this with Linux routers is to simply flush all udp flows from the connection tracking cache whenever a ppp connection comes up (tcp flows can survive a connection restart provided the device comes back quickly enough with the same IP, and retries will probably use different source ports). Flushing a few extra udp flows isn't usually serious; typically, if you're unlucky, you'll drop a few return DNS/NTP replies, and those will be retried quickly enough.

So the following gets rid of UDP connection tracking entries on RouterOS:

{ :local cons [/ip firewall connection find protocol="udp"]; :if ([:len $cons] > 0) do={ /ip firewall connection remove numbers=$cons;}}

It looks to be possible to do this much smarter if you know what your "internal" network is: you can probably play with dst-address and reply-dst-address to only catch the broken entries (entries where the dst-address is NOT on your network and the reply-dst-address IS on your network - unless you have a public IP range, the interwebs will not be able to route back to you).

The part where I get stuck is to actually run this whenever my pppoe-inet connection comes up. Any assistance in this regard is appreciated.

a) You can simplify that to

/ip firewall connection remove [/ip firewall connection find protocol="udp"];

b) You can use netwatch (http://wiki.mikrotik.com/wiki/Manual:Tools/Netwatch) to watch an IP address that can only be reached through that connection, and fire the above in the script run on an 'up' event. You can enforce that an IP address be reachable only through one connection by implementing 'output' chain firewall rules that block all traffic to/from that IP through all other interfaces. Say you're watching 10.0.0.1, you have interfaces named 'LAN', 'WAN', 'WAN2' and 'PPP', and you want to run a script when the 'PPP' interface comes up while guaranteeing that you will never reach 10.0.0.1 through any other interface (so you don't get false positives); you could implement

/ip firewall filter add chain=output dst-address=10.0.0.1 protocol=icmp src-address-type=local out-interface=!PPP action=drop
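
The netwatch entry itself would be something along these lines (the 10s/500ms values are just examples, and 'flush-udp-conntrack' is a made-up name for a /system script holding the flush command from above):

/tool netwatch add host=10.0.0.1 interval=10s timeout=500ms up-script="/system script run flush-udp-conntrack"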

So while ideally you can find an IP that is only reachable through the PPP interface natively (by merit of being homed on their network and being reachable through their edge only), you can watch pretty much anything as long as you’re sure you don’t need to ping that resource from the router otherwise.
Hope that helps.

Possibly, but my experience has been that if a find returns an empty set on ROS the script will die, sometimes causing other problems, so I try to make a point of having scripts finish with "success" as far as possible.

> b) You can use netwatch (http://wiki.mikrotik.com/wiki/Manual:Tools/Netwatch) to watch an IP address that can only be reached through that connection, and fire the above in the script run on an 'up' event. You can enforce that an IP address be reachable only through one connection by implementing 'output' chain firewall rules that block all traffic to/from that IP through all other interfaces. Say you're watching 10.0.0.1, you have interfaces named 'LAN', 'WAN', 'WAN2' and 'PPP', and you want to run a script when the 'PPP' interface comes up while guaranteeing that you will never reach 10.0.0.1 through any other interface (so you don't get false positives); you could implement

Not good enough. Say netwatch tests 10.0.0.1 every 10 seconds with a timeout of 500ms; now consider this sequence:

  • @ 0ms → icmp echo request goes out
  • @ 200ms → icmp echo response comes back
  • @ 500ms → link drops
  • @ 7000ms → link comes back with a different IP
  • @ 10000ms → echo request goes out …

Anyway, you get the picture: netwatch never sees a down/up transition, so the up-script never fires even though the IP changed and the existing conntrack entries are now stale.

Anyway, correctly set up routing already prevents the case you're referring to, so this is pointless (and it doesn't prevent the conntrack entry from being created on the NEW packet):

/ip firewall filter add chain=output dst-address=10.0.0.1 protocol=icmp src-address-type=local out-interface=!PPP action=drop

>
> So while ideally you can find an IP that is only reachable through the PPP interface natively (by merit of being homed on their network and being reachable through their edge only), you can watch pretty much anything as long as you're sure you don't need to ping that resource from the router otherwise.
> Hope that helps.

I have a whole internet of such IPs, so getting those IPs is not difficult, but I'd rather not waste a crapload of bandwidth (in SA we have this really retarded concept they call a CAP ... so instead of paying only for capacity you end up paying for both capacity - with 4Mbps being the fastest link generally available at sane prices - and the actual bytes uploaded + downloaded). It's all quite messed up.

No, I think what I rather need to do is write the code to find the invalid entries and just trash them every minute. Unfortunately that will (on ROS at least, as far as I can see) be the only reliable mechanism, but alas, it's also the most CPU intensive one if the connection count becomes high.
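
The 'every minute' part at least is easy with the scheduler (the script name below is a placeholder; the script that actually finds the invalid entries still needs to be written):

/system scheduler add name=flush-stale-udp interval=1m on-event="/system script run flush-stale-udp"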

I think I realized where you misunderstood. You're thinking I'm using the RB to provide services to periodically connecting people, whereas in fact I'm stuck with an RB between my LAN (ether1) and my PPPoE WAN link (pppoe-client1). I'm not even 100% sure how they configured it; they're just not believing me that this is a NAT connection tracking issue (which, as I have already illustrated, affects pretty much every DSL router out there - Linux is affected, and now MT). I seriously do prefer having a full-blown Linux box on my gateways, where troubleshooting these kinds of problems is not only easier and faster, it's more accurate, with less guessing, and I typically solve issues there in a tenth of the time it takes me on MT.

Ok, so after fooling around for a while longer I can now (assuming my internal LAN attached to ether1 is in the 192.168.0.0/24 range) use the following to find all connections that have a reply-dst-address in that range. This is most likely all the connections that I want, but I still need to attach a negated check on the dst-address, i.e. only match if the dst-address isn't also on the LAN:

/ip firewall connection find reply-dst-address~"^192\\.168\\.0\\.[0-9]+:[0-9]+\$"
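
Wrapped in the same empty-set guard as before, removing what that finds would then be:

{ :local cons [/ip firewall connection find reply-dst-address~"^192\\.168\\.0\\.[0-9]+:[0-9]+\$"]; :if ([:len $cons] > 0) do={ /ip firewall connection remove numbers=$cons;}}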

Imho that is fugly. However, it'll work. But it'll also trap all internal connections to the RB, so additional filtering is required to ensure that internal connections are not removed, and that is where I'm getting stuck at the moment. The ~ comparison gives us POSIX extended regexes, which don't do a "doesn't match", so the checks that are required simply can't be expressed:

  1. Check that dst-address is NOT on the LAN.
  2. Check that reply-src-address is NOT on the LAN.

I've tried various placements of the !, e.g.:

! dst-address~"nasty regex"
dst-address !~ "nasty regex"
dst-address ~! "nasty regex"
dst-address ~ "!(nasty regex)"

None of these worked.
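
A possible workaround (a sketch only, untested): instead of trying to express the negation inside find, loop over the candidates and apply the negated checks with :if, where ! does work on the result of a ~ comparison:

:local lanregex "^192\\.168\\.0\\."
:foreach c in=[/ip firewall connection find reply-dst-address~"^192\\.168\\.0\\.[0-9]+:[0-9]+\$"] do={
	:local dst [:tostr [/ip firewall connection get $c dst-address]]
	:local rsrc [:tostr [/ip firewall connection get $c reply-src-address]]
	# only purge entries whose far end is NOT on the LAN
	:if ((!($dst ~ $lanregex)) && (!($rsrc ~ $lanregex))) do={
		/ip firewall connection remove $c
	}
}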

The idea can be expanded to cover the typical "internal" IP ranges quite easily, so a single script can cover things nicely (if you do this tens of times a year it's often easier to write a single script that you just use everywhere):

/ip firewall connection find reply-dst-address~"^10\\.[0-9]+\\.[0-9]+\\.[0-9]+:[0-9]+\$"
/ip firewall connection find reply-dst-address~"^192\\.168\\.[0-9]+\\.[0-9]+:[0-9]+\$"
/ip firewall connection find reply-dst-address~"^172\\.(1[6-9]|2[0-9]|3[01])\\.[0-9]+\\.[0-9]+:[0-9]+\$"

It's just the negated check that's missing to make that function properly. And please don't be naïve enough to think that the above will never have false positives: even with the negations added, if you use the generic script you will need to attach all three negations to each of the finds above to prevent false positives. Even then I'm going to say your mileage may vary.

The above may (possibly) be good enough if your INPUT and OUTPUT filter chains are all on ACCEPT and you don't have any connection-marks in the mix on those connections. Also, once you start doing load-balancing and other smart routing tricks it gets way, way more complicated than this, and the above will be extremely inadequate. Fortunately it's also very difficult to set up routing on MT complex enough for those more complicated cases to trigger.

Just to try and express the issue even better, take the SIP example where an internal IP 192.168.0.10 tries to establish SIP to 1.2.3.4 while the link is down. A DROP (which has been suggested as a way to prevent the problem; I prefer using unreachable routes for the private ranges, plus OSPF between routers internally, to make sure everything routes where it should without accidentally sending private IPs out to the interwebs) doesn't prevent the connection tracking entry from being created. So if the pppoe is down, the system only has routes to the internal networks. The packet comes in on (eg) ether1, it goes through the mangle and dst-nat chains, and then, at the routing decision, it gets dropped. At this point conntrack will still create a connection tracking entry with the following:

dst-address 1.2.3.4:5060
src-address 192.168.0.10:5060
reply-dst-address 192.168.0.10:5060 <-- problem.
reply-src-address 1.2.3.4:5060

Note that my client in this particular case just shot this down with "we have a static IP, this doesn't apply". I wish that were true. Even when you have a static IP on the pppoe, if the link is down the routing won't happen and the packet never gets to src-nat, triggering the problem.
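
(As an aside, the unreachable routes for the private ranges that I mentioned are nothing fancier than something along these lines, with OSPF then supplying the more specific internal routes:)

/ip route add dst-address=10.0.0.0/8 type=unreachable
/ip route add dst-address=172.16.0.0/12 type=unreachable
/ip route add dst-address=192.168.0.0/16 type=unreachable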

The other potential fix may be to somehow wangle the NOTRACK (is this even available on MT?) into the mix - but I can’t possibly think where, because until the routing decision has been made we don’t know that we want NOTRACK, and once the routing decision has been made … well, it’s too late, we will never see the packet in netfilter again.
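
(For the record, newer ROS versions - 6.36 and up, if memory serves - do expose NOTRACK via the raw table, so a rule like the one below is at least expressible; but as said, it doesn't help here because the decision would have to be made before routing:)

/ip firewall raw add chain=prerouting protocol=udp dst-port=5060 action=notrack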

Come to think of it (yes, I'm ranting here ...), isn't this a kernel bug? Shouldn't the kernel just be modified to not create a conntrack entry if the packet cannot be forwarded, so that future packets for the flow are also treated as NEW, resulting in new conntrack entries when they eventually do go out? (Keep in mind that according to the iptables man page, MASQUERADE will drop conntrack entries for connections on an interface should that interface go down; I suspect this is done by purging all entries where reply-dst-address matches any of the addresses that were assigned to the downed interface, not by actually tracking flows to interfaces.) Then again ... even that won't solve the more complex cases I had in mind above, where routing changes from one interface to another (both of which are NAT'ed, with the peer doing return-path filtering).

Hi,

Right, so we ran into this with a slightly different variation, exact sample:

0.0.0.0/0 gw ppp-out1 distance 1
0.0.0.0/0 gw ppp-out2 distance 2
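
(In ROS command form that's roughly:)

/ip route add dst-address=0.0.0.0/0 gateway=ppp-out1 distance=1
/ip route add dst-address=0.0.0.0/0 gateway=ppp-out2 distance=2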

Ok, so obviously the intent is to route via ppp-out1 unless it's not available, in which case route via ppp-out2. The issue is once more the exact same bug, and the script below covers this particular case (the more general bug from above can still occur if both uplinks go down simultaneously). Consider a udp stream going from an internal IP of 192.168.0.5 to an external server at 1.2.3.4, udp, port 5060 on both sides (SIP). Initially this will route over ppp-out1, which is fine, and NAT will masquerade correctly. Now ppp-out1 goes down; still no problem, as the way MASQUERADE works will purge the conntrack entry, effectively removing the now incorrect connection tracking entry. The next packet to go from the LAN side will route out over ppp-out2, creating a new connection tracking entry, and full communication is re-established. Now there are two possible scenarios:

  1. ppp-out2 goes down too (or was already down). Next packet will create an incorrect tracking entry, and we’re stuck again. This is a really bad situation to begin with, and for the moment we’re ignoring this.

  2. ppp-out1 comes back up. In this case, without connection marking, all routing will go back to ppp-out1, but established conntrack entries will still use the IP from ppp-out2. The script below is specifically designed to purge those entries from the cache (and to be selective about it).

Please note that using proper marked routing with full load balancing nearly eliminates the effects of this bug, but if ALL your uplinks go down you still have a problem. We at ULS also do split routing, where we route local (within ZA) traffic differently from international. When combining that with load balancing, bad things can still happen, but we deal with those by other mechanisms. The load balancing at IP level also doesn't apply to our VoIP setups, so it's not a concern there either. In fact, VoIP is the ONLY situation we have at the moment where the conntrack bug is a major concern for us, and the script below nails it beautifully, even though it CAN result in calls being dropped during switch-overs (a risk we're willing to accept; with some additional parameters we can further reduce the effect to simply not having a clean call shutdown, in that the BYE packet will go missing - both ends will simply time out the call, the handset being less of a problem, and the remote switch will likely kill the call on BYE from the other leg).

# This script is designed for use cases where we want to purge connection
# tracking entries in a weird state.  It's similar to conntrack -F on linux,
# but much more selective in that it only flushes entries that are "wrong".
#
# We assume that we only have "default" routes here, so it won't function in
# combination with the split-routing scripts.
#
# Also not designed to deal with load balancing situations, this _purely_
# covers the broken fail-over routing case (multiple default routes at
# different distances)

:local route
:local gateway
:local gwip
:local realdefip ""

# Find the current default IP and store it in realdefip.
:foreach route in=[/ip route find dst-address=0.0.0.0/0 active=yes] do={
	:foreach gateway in=[/ip route get $route gateway] do={
		:if ([:len $realdefip] < 1) do={
			:foreach gwip in=[/ip route find where gateway=$gateway active=yes] do={
				:if ([:len $realdefip] < 1) do={
					:set realdefip "$[/ip route get $gwip pref-src]"
				}
			}
		}
	}
}

:put "Default IP: $realdefip"

:if ([:len $realdefip] > 1) do={
#	only if we could actually locate a default route ...
#	for all other possible default routes.
	:foreach route in=[/ip route find dst-address=0.0.0.0/0 active=no disabled=no] do={
		:foreach gateway in=[/ip route get $route gateway] do={
			:foreach gwip in=[/ip route find where gateway=$gateway active=no disabled=no] do={
				:local altdefip [/ip route get $gwip pref-src]
				:if (([:len $altdefip] > 0) && ($altdefip != $realdefip)) do={
#					If you need additional criteria (eg protocol=udp), add them into the find here.
					/ip firewall connection remove [/ip firewall connection find reply-dst-address~"^$altdefip:"]
				}
			}
		}
	}
}
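
As for triggering this whenever the PPPoE link comes back up (the earlier sticking point): assuming the above is saved as a /system script (the name 'purge-stale-conntrack' below is just an example), one option is the on-up hook of the PPP profile that the pppoe-client uses:

/ppp profile set [find name="default"] on-up="/system script run purge-stale-conntrack"

A scheduler entry running it every minute or so would also work, at the cost of some extra CPU.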