Multiple Road Warrior L2TP/IPsec clients behind NAT - solved

WARNING! Risk of brain overheat. Prepare some coolware before reading.

The well-known problem

L2TP/IPsec clients connecting to the server through NAT do work, but only one at a time per public address. A new client connection from behind the same public address ruins the existing client session.

The root cause

To understand what goes wrong, one must stop looking at L2TP/IPsec as a magic black box and closely inspect its strictly layered structure. L2TP/IPsec is literally L2TP transported using IPsec, both protocol-wise and (at least in the case of RouterOS) application-wise: when you tick the “use IPsec” check-box in the L2TP settings, RouterOS automatically creates an IPsec peer and, on the client side, an appropriate policy necessary to transport the L2TP connection. The same IPsec setup can be done manually instead of ticking the check-box in the L2TP configuration, and the resulting processing of the packets is exactly the same in both cases.

According to the standard, L2TP is transported over UDP, and both the server and the client use port 1701 to both send and receive. So all standards-compliant client implementations use port 1701 locally as well.

The L2TP/IPsec standard requires ESP transport mode. ESP transport mode is designed for efficient encapsulation of encrypted traffic directly between two machines, so it does not carry the source and destination IP addresses in the encapsulated payload, as they would be redundant. It is assumed that these addresses are the same for the original transported packet and for the ESP transport packet. If one of the addresses changes on the way from the sender to the recipient, the decrypted and de-encapsulated packet inherits the changed address. The port information is part of the transported payload; ESP itself has no notion of ports at all.

Because ESP uses no ports, it cannot be handled by a NAT; to work around this, the NAT-T extension to IPsec introduced encapsulation of ESP into UDP. However, this only allows the UDP-encapsulated ESP to be forwarded through the NAT properly. When the ESP payload is de-encapsulated from the UDP transport, the source port of the UDP packet (which the NAT has changed to distinguish one flow from another) is lost. The source address of the ESP packet is inherited from the UDP packet in which the ESP arrived.

When, in the next step, the UDP packet carrying the L2TP is decrypted and de-encapsulated from the ESP, its source address is inherited from the ESP, so it is again the public IP of the NAT, and its source port is the one from which the client sent the UDP packet before encapsulating it into ESP. So if both clients behind the same NAT send from the same port (1701), the L2TP server sees both as coming from the same socket address, consisting of the public address of the NAT and the port shared by both clients. Thus not only are the flows indistinguishable from each other on receipt, but there is also no way to address the response packets selectively to one of them: the IPsec policy matches only on addresses and ports, not on anything in the payload.
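The address/port inheritance described above can be modelled in a few lines. This is a minimal Python sketch, not RouterOS code; the `Socket` type, the helper name and the public address 203.0.113.7 are hypothetical, chosen only to show why two clients using port 1701 collapse into one socket on the server.

```python
from typing import NamedTuple


class Socket(NamedTuple):
    addr: str
    port: int


def inner_source_after_decap(outer_src: Socket, inner_src_port: int) -> Socket:
    """Source socket of the inner (L2TP) UDP packet as seen by the server.

    ESP transport mode carries no IP addresses, so the inner packet inherits
    the source address of the outer UDP packet; the outer UDP source port,
    which the NAT may have rewritten, is dropped during de-encapsulation.
    """
    return Socket(outer_src.addr, inner_src_port)


PUBLIC = "203.0.113.7"  # hypothetical public address of the client-side NAT

# Two clients that both use port 1701 locally; the NAT has given their
# UDP-encapsulated ESP flows different outer source ports (4500 and 1024).
client_a = inner_source_after_decap(Socket(PUBLIC, 4500), 1701)
client_b = inner_source_after_decap(Socket(PUBLIC, 1024), 1701)
print(client_a == client_b)  # True: the L2TP server cannot tell them apart
```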

Some client implementations are aware of this and use random ports. The RouterOS server implementation is not strict about this and accepts connections from such clients, so this solves the problem for these implementations but not for others, such as the Microsoft Windows one. Nor is it a solution for the Android built-in client, which uses a random port for L2TP but doesn’t restrict its IPsec policy to that port.

The solution on the RouterOS server side

As there is no out-of-the-box way to link the source port of the “inner” UDP packet (the one which carries the L2TP) to the source port of the “outer” UDP packet (the one which carries the ESP), we need to convey the information distinguishing the two clients’ flows from one another across two encapsulating layers by some other means. Without coding, the only way is to change the transport packets’ source IP addresses to unique ones depending on their source ports. So what we need is source NAT on received packets. Problem solved: from the point of view of the L2TP server and IPsec policy matching, there is no ambiguity or conflict, as each flow comes from a unique IP address.
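A minimal Python sketch of why source NAT on the outer packets fixes the collision, assuming each outer flow (identified by its full source socket) is given its own address before de-encapsulation. The `UniqueSrcNat` class and the addresses are hypothetical and only model the idea; how RouterOS actually achieves this is the subject of the following sections.

```python
import ipaddress
from typing import NamedTuple


class Socket(NamedTuple):
    addr: str
    port: int


def decap(outer_src: Socket, inner_src_port: int) -> Socket:
    # The inner packet inherits the outer address; the outer port is lost.
    return Socket(outer_src.addr, inner_src_port)


class UniqueSrcNat:
    """Hypothetical model: every new outer flow gets its own source address."""

    def __init__(self, first: str) -> None:
        self._next = ipaddress.IPv4Address(first)
        self._map: dict[Socket, str] = {}

    def translate(self, outer_src: Socket) -> Socket:
        if outer_src not in self._map:
            self._map[outer_src] = str(self._next)
            self._next += 1
        return Socket(self._map[outer_src], outer_src.port)


nat = UniqueSrcNat("10.0.0.1")
PUBLIC = "203.0.113.7"  # hypothetical client-side NAT address
a = decap(nat.translate(Socket(PUBLIC, 4500)), 1701)
b = decap(nat.translate(Socket(PUBLIC, 1024)), 1701)
print(a, b)  # distinct inner sockets: 10.0.0.1:1701 and 10.0.0.2:1701
```

The outer port is what the NAT varies per client, so keying the translation on the full outer socket is enough to carry that distinction through the ESP layer as an address.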

Implementation Details

The above sounds easy but the first cooldown comes when you realize that src-nat does not work for received packets as it is deemed useless for any sane application scenario. We could use two routers in a chain to do that but that would mean extra hardware costs and also some inconvenience. So the first task is to make RouterOS src-nat a received packet and then continue handling it as a received one. The solution is to take any two local addresses and establish an IP-IP tunnel between them, so that we could run the same packet through the whole firewall machinery twice on a single machine. We receive the packet from a physical interface, route it “out” using one of the IP-IP tunnel’s ends as the output interface, and we “receive” it the second time from the other end of that tunnel.

Another task is assigning a unique source address depending on the port. While we could theoretically map the port number 1:1 into the lower two bytes of the IP address, the issue is that flows from two clients behind different NAT boxes very often come from the same port (like 1024), so that method would not be 100% safe. So this is not the way, even if we neglect some other associated complications. The solution is to use a stack of addresses and give each new connection its own address regardless of its actual source address and port.
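The rejected 1:1 port-to-address mapping can be sketched in Python to show the collision. The base address and the two client sockets are hypothetical; the sketch assumes a whole /16 (10.0.0.0/16) would be reserved for the mapping.

```python
import ipaddress

BASE = ipaddress.IPv4Address("10.0.0.0")  # hypothetical /16 reserved for the mapping


def address_from_port(port: int) -> ipaddress.IPv4Address:
    """Rejected idea: encode the NAT-assigned source port in the low 16 bits."""
    if not 0 <= port <= 0xFFFF:
        raise ValueError("not a UDP port")
    return BASE + port


# Two clients behind *different* NATs that happen to use the same outer port.
client_x = ("198.51.100.10", 1024)
client_y = ("203.0.113.7", 1024)
print(address_from_port(client_x[1]), address_from_port(client_y[1]))
```

Both print 10.0.4.0, so the two unrelated clients would again be indistinguishable after de-encapsulation.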

The address assignment policy of the src-nat action of the RouterOS firewall is not helpful here, as it does not prefer diversity: if two connections come from different ports of the same IP address, the src-nat rule is likely to assign them the same address no matter how large a pool it has been given as to-addresses. A per-connection-classifier can use a hash of source address and port, but even with an annoying number of rules, the uniqueness of address assignment would not be guaranteed.

The solution is to use a single address in the src-nat rule and increment it each time it gets assigned to a new connection. As it is impossible to do this fast enough with the available means, there is a big chance that two connections get src-nat’ed to the same source address, so we need to resolve that systematically. To do so, the firewall filter in the second pass checks a src-address-list for the first packet of each new connection; if the source address is already there, the firewall drops the packet, preventing the connection from being established; otherwise it adds the source address to the address list with a 1-minute lifetime and accepts the packet. So the first connection with a particular source address always succeeds and all the following ones always fail. Every couple of seconds, a cleaner script checks whether the source address currently configured in the src-nat rule is used by a properly established connection. If it is, the script updates the rule with an unused address and removes all connection attempts still waiting for a response because their establishing packets have been dropped by the second pass through the firewall. This makes the firewall treat a retransmission of the same packet as a brand new one and thus apply the src-nat rule to it again.

The client retransmits the IPsec connection-establishing packet for tens of seconds before it gives up; therefore, many clients would have to attempt to connect within the same time window of several seconds for the unluckiest ones to fail. This can happen, e.g., after a network outage if the clients are autonomous devices which keep on trying; effectively, for some of them the network outage will seem to last longer than for others.
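The interplay between the udp-4500-in guard chain and the cleaner script can be simulated. This is a toy Python model, not the RouterOS implementation: the `Router` class and its method names are hypothetical, the 1-minute address-list timeout is not modelled, and a connection counts as "established" as soon as its first packet is accepted. The wrap-around from 10.0.15.253 to 10.0.0.1 follows the l2tp-helper script.

```python
import ipaddress

# Pool bounds from the example config (10.0.15.254 is reserved as the aux dst).
FIRST = ipaddress.IPv4Address("10.0.0.1")
LAST = ipaddress.IPv4Address("10.0.15.253")


class Router:
    """Toy model of the udp-4500-in guard chain plus the cleaner script."""

    def __init__(self) -> None:
        self.snat_addr = FIRST   # current to-addresses of the srcnat rule
        self.in_use = set()      # src-addresses-in-use (1m timeout not modelled)
        self.established = {}    # client socket -> assigned address
        self.pending = {}        # client socket -> address of a dropped attempt

    def first_packet(self, client) -> bool:
        """A (re)transmitted first packet of a NAT-T flow; True if accepted."""
        if client in self.established:
            return True
        # An unreplied, still-tracked connection keeps its old src-nat address.
        addr = self.pending.get(client, self.snat_addr)
        if addr in self.in_use:          # second pass: drop rule matches
            self.pending[client] = addr
            return False
        self.in_use.add(addr)            # add-src-to-address-list + accept
        self.established[client] = addr
        self.pending.pop(client, None)
        return True

    def cleaner(self) -> None:
        """Mimics l2tp-helper: skip addresses owned by established flows."""
        changed = False
        while self.snat_addr in self.established.values():
            self.snat_addr = FIRST if self.snat_addr >= LAST else self.snat_addr + 1
            changed = True
        if changed:
            self.pending.clear()         # remove connections without seen-reply


r = Router()
a = ("203.0.113.7", 4500)  # hypothetical clients behind one NAT
b = ("203.0.113.7", 1024)
print(r.first_packet(a), r.first_packet(b))  # True False: b collided on 10.0.0.1
print(r.first_packet(b))                     # False: retransmission, still blocked
r.cleaner()
print(r.first_packet(b), r.established[b])   # True 10.0.0.2
```

The second client only gets through once the cleaner has both moved the rule to a free address and flushed the stale unreplied connection; either step alone would not be enough.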

It is also worth noting that we must prevent the incoming UDP transport packets from being delivered to the IPsec stack already in the first pass. To do that, we dst-nat them to an address that is not local to the router, using a dst-nat rule which matches in the first pass, and dst-nat them back to the original address using a dst-nat rule which matches in the second pass. Routes must be added to send the internal traffic through the tunnel.
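The two passes can be traced with a small Python model of the NAT rules from the configuration below. It is only an illustration: `Packet`, `first_pass`, `second_pass` and the interface name "wan" are hypothetical, and 1.2.3.4 is the same placeholder public address the rules use.

```python
from dataclasses import dataclass, replace

PUBLIC = "1.2.3.4"        # placeholder public address used in the example rules
AUX_DST = "10.0.15.254"   # non-local address routed out via ipip-outer


@dataclass(frozen=True)
class Packet:
    src: str
    sport: int
    dst: str
    dport: int
    in_iface: str


def first_pass(p: Packet, snat_to: str) -> Packet:
    """Received from the WAN: dst-nat away from the local address, route into
    the tunnel (route 10.0.15.254/32 via ipip-outer), src-nat on the way out."""
    if not (p.dst == PUBLIC and p.dport == 4500 and p.sport != 4500):
        raise ValueError("packet does not get the special treatment")
    p = replace(p, dst=AUX_DST)        # dst-nat rule matching dst-port=4500
    # src-nat rule on out-interface=ipip-outer, then "received" on ipip-inner
    return replace(p, src=snat_to, in_iface="ipip-inner")


def second_pass(p: Packet) -> Packet:
    """Received again from ipip-inner: restore the public destination so the
    packet is local now and reaches the IPsec (NAT-T) stack."""
    if not (p.in_iface == "ipip-inner" and p.dst == AUX_DST):
        raise ValueError("not a packet coming out of the tunnel")
    return replace(p, dst=PUBLIC)      # dst-nat rule with in-interface=ipip-inner


wan_pkt = Packet("203.0.113.7", 1024, PUBLIC, 4500, "wan")  # hypothetical iface
print(second_pass(first_pass(wan_pkt, "10.0.0.1")))
```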

What makes all this possible at all is that it is enough to give this special treatment to packets coming to UDP port 4500. The NAT-T mechanism of IPsec does not require the UDP-encapsulated communication coming to port 4500 to come from the same IP address as the initial IKE communication coming to port 500. If it did, this approach could not work, as there would be no way to pair connections to port 500 with connections to port 4500 coming from the same client.

The limitations

The pool of IP addresses used for src-nat must be significantly larger than the maximum expected number of simultaneously connected clients, because if an established connection breaks, the connection tracker remembers it for minutes, so its src-nat address cannot be reused during that time. So if some of the clients have unstable connections and reconnect frequently, they could exhaust the pool. If your network is so large and complex that you cannot find a free pool of thousands of addresses, the way out is to mark the response IPsec packets from the server and policy-route them back to the tunnel, while the traffic to the actual owners of these addresses is routed normally.
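A back-of-the-envelope sizing can make "significantly larger" concrete. This Python helper is a rough estimate under stated assumptions, not a rule from RouterOS: `min_pool_size` is hypothetical, and the client count, reconnect rate, hold time and doubling headroom are invented example figures you would replace with your own.

```python
import ipaddress
import math


def min_pool_size(max_clients: int, reconnects_per_min: float,
                  hold_minutes: float, headroom: float = 2.0) -> int:
    """Rough estimate of how many src-nat addresses are needed.

    Each connected client holds one address; each broken connection keeps
    its address blocked for as long as the connection tracker remembers it
    (hold_minutes).  `headroom` is an arbitrary safety factor.
    """
    blocked = reconnects_per_min * hold_minutes
    return math.ceil((max_clients + blocked) * headroom)


# Usable addresses in the example pool: 10.0.0.1 .. 10.0.15.253
first = int(ipaddress.IPv4Address("10.0.0.1"))
last = int(ipaddress.IPv4Address("10.0.15.253"))
pool = last - first + 1
need = min_pool_size(max_clients=200, reconnects_per_min=5, hold_minutes=10)
print(pool, need, need <= pool)  # 4093 500 True
```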

The risks

Each modification of the src-nat rule causes a configuration save, which means extra wear on the flash chip. I have no idea what effect this may have in real-life deployments on RouterBOARDs. To reduce the impact and also to save some resources, I recommend excluding packets with source port 4500 from the special treatment. The reason is that NATs usually only change the source port if the original one is already used for another connection to the same remote socket. So in many cases the connection from the first client behind each NAT comes from port 4500 and thus does not cause a modification of the src-nat rule, and in many of these cases that client will be the only one behind that NAT.

The configuration

# Create a bridge without any member ports so that we'd have something to attach the additional local IP address to.
# Actually the address could be added to an existing interface, but a member-less bridge never fails.
/interface bridge
add name=aux-lo protocol-mode=none

# Add another local address - just to have this part independent from the rest of the configuration.
/ip address
add address=127.0.1.1 interface=aux-lo network=127.0.1.1

# Add a firewall rule permitting local traffic - currently, default firewall rules drop traffic from in-interface-list=!LAN which
# includes local traffic
/ip firewall filter
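# Place it right after the "accept established,related" rule; <n> is a placeholder for the number (as shown by print) of the rule currently following that rule
add chain=input src-address=127.0.0.0/8 dst-address=127.0.0.0/8 action=accept place-before=<n>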

# Create the two ends of the local tunnel
/interface ipip
add local-address=127.0.0.1 mtu=1500 name=ipip-inner remote-address=127.0.1.1
add local-address=127.0.1.1 mtu=1500 name=ipip-outer remote-address=127.0.0.1

# Add routes for the addresses used for the solution
/ip route
add distance=1 dst-address=10.0.0.0/20 gateway=ipip-inner
add distance=1 dst-address=10.0.15.254/32 gateway=ipip-outer

# Add the chain of firewall rules preventing newer connections from killing an older one before the cleaner script changes the src-nat address
/ip firewall filter
add chain=udp-4500-in src-address-list=src-addresses-in-use action=drop
add chain=udp-4500-in action=add-src-to-address-list address-list=src-addresses-in-use address-list-timeout=1m
add chain=udp-4500-in action=accept

# Add the firewall rule sending new packets to UDP 4500 coming from the tunnel to the chain above
/ip firewall filter
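# Place it right after the "accept established,related" rule; <n> is a placeholder for the number (as shown by print) of the rule currently following that rule
add action=jump chain=input connection-state=new dst-port=4500 in-interface=ipip-inner jump-target=udp-4500-in protocol=udp place-before=<n>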

# The usual IPsec- and L2TP-related firewall rules must be there as well, usually they already exist
add action=accept chain=input connection-state=new dst-port=500,4500 protocol=udp
add action=accept chain=input connection-state=new dst-port=1701 ipsec-policy=in,ipsec protocol=udp

# Add the firewall rule permitting forwarding of dst-nated packets in the first pass
/ip firewall filter
add action=accept chain=forward connection-state=new dst-address=10.0.15.254

# Add the NAT rules; replace 1.2.3.4 with your router's actual public IP address wherever it appears
/ip firewall nat
# Restore our public IP address on packets after they've passed through the tunnel
add action=dst-nat chain=dstnat dst-address=10.0.15.254 in-interface=ipip-inner to-addresses=1.2.3.4
# src-nat the packets before sending them to the tunnel
add action=src-nat chain=srcnat out-interface=ipip-outer protocol=udp to-addresses=10.0.0.1
# Redirect packets to port 4500 to the auxiliary destination address to give them the special treatment;
# for testing that it works with only two client devices, remove the "src-port=!4500"
add action=dst-nat chain=dstnat dst-port=4500 src-port=!4500 dst-address=1.2.3.4 protocol=udp to-addresses=10.0.15.254

# Add the cleaner script
/system script
add name=l2tp-helper owner=admin policy=ftp,reboot,read,write,policy,test,password,sniff,sensitive,romon source=\
":local cntr 0; \\\
    \n:local auxip [/ip firewall nat get [find chain=\"srcnat\" && out-interface=\"ipip-outer\"] to-addresses]; \\\
    \n:while ([/ip firewall connection print count-only where src-address~\"^\$auxip\" && dst-address~\":4500\" && seen-reply]=1) \
    do={\
    \n  :set auxip (\$auxip+1); \\\
    \n  :if (\$auxip>10.0.15.253) do={:set auxip 10.0.0.1};:set cntr (\$cntr+1)\
    \n}\
    \n:if (\$cntr>0) do={\
    \n  /ip firewall nat set [find chain=\"srcnat\" && out-interface=\"ipip-outer\"] to-addresses=\"\$auxip\"; \\\
    \n  /ip firewall connection remove [find dst-address~\":4500\" && !seen-reply]\
    \n}\
    \n"

# Schedule the cleaner script to run every 3 seconds right from the restart
/system scheduler
add interval=3s name=l2tp-scheduler on-event=l2tp-helper policy=ftp,reboot,read,write,policy,test,password,sniff,sensitive,romon \
    start-time=startup

Wow … I’ve had issues with this on Peplink devices.

I think MikroTik should bake your solution into the firmware somehow.

@sindy - In the following, is the 10.0.0.0/20 routed via ipip-inner meant to be what you’re using as your LAN, i.e., the same as 192.168.2.0/24 (for example)?

Also, is 10.0.15.254/32 the same as a WAN IP, or is it an internal address used only for the loop created on the bridge? Or is it the end client’s IP range, given the /32? If the latter, and we’re using a DHCP pool for our external VPNs, how would we bind that part dynamically?

add distance=1 dst-address=10.0.0.0/20 gateway=ipip-inner
add distance=1 dst-address=10.0.15.254/32 gateway=ipip-outer

Sorry for the late answer, I haven’t received a notification e-mail.
Yes, 10.0.15.254 is just the last individual address from 10.0.0.0/20. The whole range you choose instead of the 10.0.0.0/20 in the example must not collide with any private range you use, including the pools you use to assign addresses to the L2TP clients. There is no relationship between the pool from which you assign addresses to L2TP clients and this range (except that they must not collide). These addresses are not assigned to any interface anywhere.
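Checking a chosen range against the subnets and pools already in use is easy to script. This Python sketch uses only the standard `ipaddress` module; the helper names are hypothetical, and the example inputs mix the LAN and pool from the export later in this thread with one deliberately clashing subnet. Pools written as "a-b" ranges (as in /ip pool) are converted to networks first.

```python
import ipaddress

AUX_RANGE = ipaddress.ip_network("10.0.0.0/20")  # range used by the solution


def to_networks(spec: str) -> list:
    """Accept either CIDR ("10.200.30.0/24") or a pool range ("a-b")."""
    if "-" in spec:
        lo, hi = (ipaddress.IPv4Address(x.strip()) for x in spec.split("-"))
        return list(ipaddress.summarize_address_range(lo, hi))
    return [ipaddress.ip_network(spec)]


def conflicts(specs: list) -> list:
    """Return the entries that overlap AUX_RANGE and must be changed."""
    return [s for s in specs
            if any(AUX_RANGE.overlaps(n) for n in to_networks(s))]


# LAN, L2TP pool and one deliberately clashing subnet
print(conflicts(["10.200.30.0/24", "10.200.31.20-10.200.31.120", "10.0.8.0/24"]))
```

Only 10.0.8.0/24 is reported, as it lies inside 10.0.0.0–10.0.15.255.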

Hi Sindy!

I’ve been trying to implement your solution for a couple of days now, but with minimal progress unfortunately and am hoping for some help at this point…

At first glance, your tutorial screams complex (it took me a good couple of hours to digest everything lol).

My setup involves a CHR v6.42.7 acting as the L2TP/IPsec server and two Windows 10 clients dialling from behind the same public IP. As expected, the last dialled client will “keep alive”, while the first one will hang around for anything between 1 and 15 minutes before disconnecting. Pinging the router while both are connected produces alternating responses (5-10 s worth of responses for one client, then it times out and the other one begins to receive them; to be expected, I guess).

My Jump rule (and consequently the udp-4500-in chain) is not being hit at all, no matter where I position it in the firewall and am beginning to suspect an IP config problem somewhere. Also, I have no local ranges conflicting with your proposed config values.

Would it be too much to ask for a server-side config sample for reference?

Much appreciated!

Both the lab device on which I’ve developed the configuration and the sole machine on which I’m currently running it in production have much more complex firewall rules and many unrelated settings. Therefore, both would have to be edited before posting, so there is a risk of losing some important bit during this process. On the other hand, the configuration already posted in the OP has been cut to bare bone of what is necessary for it to run.

So I propose the reverse: export your config following the instructions in my automatic signature (taking special care about the “systematical” part while obfuscating) and I’ll have a look. This will also allow me to identify the missing or misleading part of the description above.

Thanks for your swift reply & totally understand.

Here is my export:

# sep/03/2018 15:31:25 by RouterOS 6.42.7
#
#
#
/interface bridge
add name=aux-lo protocol-mode=none
/interface ethernet
set [ find default-name=ether1 ] name=mgmt-106
set [ find default-name=ether1 ] name=technative-LAN-2002
set [ find default-name=ether1 ] name=wan-704
/interface ipip
add local-address=127.0.0.1 mtu=1500 name=ipip-inner remote-address=127.0.1.1
add local-address=127.0.1.1 mtu=1500 name=ipip-outer remote-address=127.0.0.1
/interface wireless security-profiles
set [ find default=yes ] supplicant-identity=MikroTik
/ip pool
add name=VPN-pool ranges=10.200.31.20-10.200.31.120
add name=DHCP-pool ranges=10.200.30.121-10.200.30.200
/ip dhcp-server
add address-pool=DHCP-pool disabled=no interface=technative-LAN-2002 name=\
    dhcp-technative
/ppp profile
add dns-server=10.200.30.254 local-address=10.200.31.254 name=l2tp-in \
    remote-address=VPN-pool
/interface l2tp-server server
set authentication=mschap2 default-profile=l2tp-in enabled=yes ipsec-secret=\
    [Secret] use-ipsec=yes
/ip address
add address=127.0.1.1 interface=aux-lo network=127.0.1.1
add address=10.200.30.254/24 interface=technative-LAN-2002 network=\
    10.200.30.0
add address=[public_IP]/27 interface=wan-704 network=[public_network]
/ip dhcp-client
add add-default-route=no disabled=no interface=mgmt-106
/ip dhcp-server network
add address=10.200.30.0/24 dns-server=10.200.30.254 gateway=10.200.30.254
/ip dns
set allow-remote-requests=yes
/ip firewall address-list
add address=10.222.203.0/24 list=bliss-mgmt
add address=10.220.220.0/24 list=bliss-mgmt
add address=10.200.30.0/24 list=internal-LAN
add address=10.200.31.0/24 list=internal-LAN
/ip firewall filter
add action=accept chain=input comment="Accept mgmt-in" src-address-list=\
    bliss-mgmt
add action=accept chain=input disabled=yes src-address-list=internal-LAN
add action=accept chain=input comment="Allows Ping" protocol=icmp
add action=drop chain=udp-4500-in comment=\
    "Prevent new VPN-conn to kill old one" src-address-list=\
    src-addresses-in-use
add action=add-src-to-address-list address-list=src-addresses-in-use \
    address-list-timeout=1m chain=udp-4500-in
add action=accept chain=udp-4500-in
add action=accept chain=input comment="Allow E&R In" connection-state=\
    established,related
add action=jump chain=input connection-state=new dst-port=4500 in-interface=\
    ipip-inner jump-target=udp-4500-in protocol=udp
add action=accept chain=input comment="Allow local loop" dst-address=\
    127.0.0.0/8 log-prefix=local127 src-address=127.0.0.0/8
add action=accept chain=input comment="Allow VPN protocols" connection-state=\
    new dst-port=500 protocol=udp
add action=accept chain=input protocol=ipsec-esp
add action=accept chain=input connection-state=new dst-port=1701 \
    ipsec-policy=in,ipsec protocol=udp
add action=accept chain=input connection-state=new dst-port=4500 protocol=udp
add action=accept chain=input in-interface=technative-LAN-2002
add action=drop chain=input comment="Drop all other" in-interface=wan-704 \
    log-prefix="drop pool"
add action=accept chain=forward comment=\
    "Allow forwarding of dst-nated packets in the first pass" \
    connection-state=new dst-address=10.0.15.254
add action=accept chain=forward comment="Allow E&R Fw" connection-state=\
    established,related
add action=accept chain=forward comment="Allow Internal FW" dst-address-list=\
    internal-LAN src-address-list=internal-LAN
add action=drop chain=forward comment="Drop all other" connection-state=\
    invalid
/ip firewall nat
add action=dst-nat chain=dstnat comment="Restore our public IP address on pack\
    ets after they've passed through the tunnel" dst-address=10.0.15.254 \
    in-interface=ipip-inner to-addresses=1.2.3.4
add action=src-nat chain=srcnat comment=\
    "src-nat the packets before sending them to the tunnel" out-interface=\
    ipip-outer protocol=udp to-addresses=10.0.0.1
add action=dst-nat chain=dstnat comment="Redirect packets to port 4500 to the \
    auxiliary destination address to give them the special treatment" \
    dst-address=1.2.3.4 dst-port=4500 protocol=udp src-port=!4500 \
    to-addresses=10.0.15.254
add action=masquerade chain=srcnat comment="Masq everything else" \
    dst-address=!10.0.0.0/8 out-interface=wan-704
/ip ipsec peer
add address=0.0.0.0/0 dh-group=modp1024 enc-algorithm=aes-256,aes-128,3des \
    exchange-mode=main-l2tp generate-policy=port-override secret=[Secret]
/ip route
add distance=2 gateway=[next_hop]
add distance=1 dst-address=10.0.0.0/20 gateway=ipip-inner
add distance=1 dst-address=10.0.15.254/32 gateway=ipip-outer
add distance=1 dst-address=10.220.220.0/24 gateway=10.223.203.254
/ppp aaa
set use-radius=yes
/ppp secret
add name=[test_acc1] password=[pass] profile=l2tp-in service=l2tp
add name=[test_acc2] password=[pass] profile=l2tp-in service=l2tp
/radius
add address=10.200.30.251 secret=[radius_secret] service=ppp src-address=\
    10.200.30.254
/system clock
set time-zone-name=Europe/London
/system identity
set name=Technative-vr2
/system logging
add disabled=yes prefix=VPNLog topics=ipsec
/system scheduler
add interval=3s name=l2tp-scheduler on-event=l2tp-helper policy=\
    ftp,reboot,read,write,policy,test,password,sniff,sensitive,romon \
    start-time=startup
/system script
add name=l2tp-helper owner=dionita policy=\
    ftp,reboot,read,write,policy,test,password,sniff,sensitive,romon source="\
    \":local cntr 0; \\\\\\\r\
    \n    \\n:local auxip [/ip firewall nat get [find chain=\\\"srcnat\\\" && \
    out-interface=\\\"ipip-outer\\\"] to-addresses]; \\\\\\\r\
    \n    \\n:while ([/ip firewall connection print count-only where src-addre\
    ss~\\\"^\\\$auxip\\\" && dst-address~\\\":4500\\\" && seen-reply]=1) \\\r\
    \n    do={\\\r\
    \n    \\n  :set auxip (\\\$auxip+1); \\\\\\\r\
    \n    \\n  :if (\\\$auxip>10.0.15.253) do={:set auxip 10.0.0.1};:set cntr \
    (\\\$cntr+1)\\\r\
    \n    \\n}\\\r\
    \n    \\n:if (\\\$cntr>0) do={\\\r\
    \n    \\n  /ip firewall nat set [find chain=\\\"srcnat\\\" && out-interfac\
    e=\\\"ipip-outer\\\"] to-addresses=\\\"\\\$auxip\\\"; \\\\\\\r\
    \n    \\n  /ip firewall connection remove [find dst-address~\\\":4500\\\" \
    && !seen-reply]\\\r\
    \n    \\n}\\\r\
    \n    \\n\""

As you can see, besides a RADIUS instance and some mgmt rules, the setup is pretty much bare. The CHR is VLAN-agnostic, so the IDs are set per interface outside the VM. It must be something small that I’m missing :)

Really appreciate taking the time to examine this!

Could it be as simple as that you haven’t realized that you have to replace 1.2.3.4 in my chain=dstnat rules by your actual public IP address?

Other than that, you have configured an IPsec peer with exchange-mode=main-l2tp manually, and you have also configured use-ipsec=yes and secret on /interface l2tp-server server. It works but you end up with two peers running and only one of them actually used.

Yes, those did slip through. Thanks for pointing that out!
I’ve changed the L2TP Server settings to “Use IPsec=no” and left only one IPsec Peer on. Also, did the public IP swap in the NAT dst rules.

Now I’m getting some hits on rules, which definitely counts as progress… The drop rule meant to prevent killing the old connection, as well as the 127 loop rule, are never hit though :(. Secondary connections drop randomly within minutes, and any attempt to communicate simultaneously from test1 / test2 (while established) fails. Sometimes the disconnection occurs immediately after actively sending packets through a connection (pinging, for example).

Noticed that, upon dialling, the generated IPsec policies differ:

1st established connection generates a policy with src-addr=router_public_Ip and dst-addr=10.0.0.1
2nd established connection generates a policy with src-addr=router_public_Ip and dst-addr=remote_public_ip

Would that be normal?

Is this overcomplicated thing really necessary when you can switch to IKE2 and forget about NAT problems?

I’m new to IKE1/2 really.. So, let me get this straight: Implementing a cert-based VPN would assign uniqueness to two or more clients dialling behind same NAT device?

IKE2 does require a PKI, doesn’t it? That is just as overcomplicated… it introduces the new problem of certificate generation, installation, backup and renewal. Using a PSK is so much simpler.

Not really, can be done with EAP authentication.
And if clients are only ROS devices or IOS you can use PSK as well.

…and User Manager isn’t sufficient for this (unless I’ve missed something during past half a year) so you need an external RADIUS server.

I decided to experiment a little with this.
It appears to be possible to have an IKE2 IPsec peer alongside a main peer on the server side (where I do not use the auto-generated peer config), good.
However, on the client side I would like to use the auto-generated peer, but it is not possible to specify IKE2 anywhere.
With a manually configured peer it works. However, my L2TP client connects to a DNS name that resolves to 2 different IP addresses (2 different ISPs at the main office),
so the auto-generated peer is more convenient, as it generates the correct peer for the remote IP actually used for the connection.

I’m a bit lost - what does an auto-generated peer have to do with IKE2 mode? If you use an IKE2-controlled SA to transport L2TP, you fall into the same rabbit hole as with an IKE1-controlled SA regarding NAT and transport mode; if you use plain IKE2 mode, which is what @mrz suggests, I cannot see how the client-side peer could be a dynamic one.

I did not know that, I understood I only needed to change main exchange mode into IKE2 exchange mode at both ends.
However, that is not possible with auto-generated peer (setting IPsec checkmark and entering PSK).
How do you configure an L2TP client with “plain IKE2 mode”?

You don’t; there is no need for L2TP when IKE2 is used.

@voll, sorry, I need to concentrate on analysing the behaviour you describe, which is currently impossible due to other factors, so we’ll come back to it later :)

@pe1chl, as you’ve anticipated you cannot, you need to configure the peer manually. The auto-generated peer always uses main-l2tp, and the auto-generated peers for ipip, gre, eoip… tunnels always use main. If you want anything special (non-default exchange modes, non-default proposals used to auto-generate policies), you have to configure the peer manually.

Then what kind of tunnel do you use? I don’t want to use plain IPsec because I want to route traffic between several subnets, not only a single endpoint address.