IKEv2 SA killed after 5 seconds due to short DNS TTL (Surfshark)

shogunx · August 28, 2020, 1:06am

I have run into a weird problem with my IKEv2 IPSec VPN. About a week ago I set up my hEX router as an IKEv2 client using Surfshark. There were some teething issues regarding PMTU and fasttrack which I figured out eventually, but once I got that sorted the tunnel came up and was stable and fast for about a week.

A couple of days ago the tunnel started playing up - I could see in Webfig that bi-directional SAs were being established, then disappearing a few seconds later. In the logs I could see errors that showed EAP connection being authorised, then a “Killing ike2 SA” message with no other info as to why. I solved the problem temporarily by changing to a different VPN server, and figured it must have been something to do with the server I was on.

Yesterday the same thing started happening again, and this time changing servers didn’t help. After turning up the logging for IPSec I discovered the cause of the problem seems to be the very short TTL (5 seconds) for DNS records that surfshark gives to its server VIPs, and the fact that these VIPs resolve to a pool of separate IPs in a round-robin configuration. End result is that every 5 seconds you get a different IP address when querying the DNS name.
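A rough way to confirm this from the router itself is sketched below. It assumes the hostname au-syd.prod.surfshark.com that is used later in this thread; the loop count and the 6 second delay are arbitrary values picked to land just past the 5 second TTL. The logging rule is the commonly used one for getting more IPsec detail into the log.

```
# Log IPsec negotiation detail, leaving out the very noisy packet dumps
/system logging add topics=ipsec,!packet

# Resolve the same name three times, waiting just past the 5 s TTL each time.
# With a round-robin pool the printed address would be expected to differ between runs.
:for i from=1 to=3 do={
    :put [:resolve "au-syd.prod.surfshark.com"]
    :delay 6s
}
```

If the printed addresses differ from run to run, the peer's hostname is rotating faster than a tunnel can usefully stay up.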

The IKE process doesn’t like this. After the initial SA is established, as soon as it notices the DNS name has a new IP address, it tears down the existing SA as invalid and establishes a new one, which also only lasts 5 seconds. This window is so short that the tunnel is basically unusable.

I have no idea why this issue only popped up after a week of working fine - the only thing I can think of is that Surfshark only reduced the TTL for their DNS records in the last week, but that seems unlikely. If anyone has any ideas, I’d love to hear them. I don’t think this is a bug per se; it seems to be working as it should and the TTL is the issue, but I have no control over the TTL on a domain I don’t own.

Has anyone else run into a similar issue, and if so how did you solve/work around it? I have tried setting a static DNS entry as a CNAME for the surfshark VIPs with a TTL of 5 minutes, but for some reason it doesn’t work - the CNAME won’t resolve giving a weird “invalid value for argument address: dns name exists but no appropriate record” error (seems like it won’t recurse the cname?). I have also tried creating a CNAME on a different external domain of my own with a longer TTL, but it seems like the TTL of the source record is still being respected.

At this point the only “work around” I can see is to set the VPN peer using a static IP, but that’s less than ideal obviously, no guarantee surfshark would even keep the same server IP for any reasonable period of time…

randomdude · August 28, 2020, 11:59am

Hi,

Just joined to say yes, you are not alone. Same issue for me, on Surfshark. Thought I was going nuts, glad I found your post.

You got further than I did in figuring out the root cause. Perhaps it’s time to reach out to Surfshark support? They seem pretty approachable. But then again, maybe it’s more of a MikroTik issue.

msatter · August 28, 2020, 12:46pm

A CNAME can only point to another entry in the static DNS of your router. An external target gives that error: “failure: dns name exists, but no appropriate record”.

I have long been requesting a minimum TTL setting in the DNS options, but it has been a long and fruitless quest.

In the meantime I run my own DNS server (Unbound), which allows setting a minimum TTL for these kinds of domains.

Update:

You can run a scheduled script at your own interval that resolves the name and updates a static DNS entry which only exists locally:

/ip dns static set [find name=ikev2.wireshark.lan] address=[:resolve xxxxxxx.surfshark.com];

You first have to create the name ikev2.wireshark.lan with an IP address. The script looks up that name and sets it to the current IP.
To connect, you use the ikev2.wireshark.lan domain name.

shogunx · August 29, 2020, 1:17am

I have opened a support ticket with Surfshark; they are slow but they do respond. So far they have not been very helpful though, just asking me to try different DNS servers and send them screenshots of ipleak.net.

Funnily enough, I worked out a similar workaround to the one msatter suggested regarding the scripted static DNS entry, except I am using made-up vip.surfshark.local static DNS entries with a 5 min TTL and running the schedule every 8 hours. Works well enough, but it pretty much guarantees that sessions will drop if they are active when the rotation script runs and changes the IP address - so streams will temporarily die and I get logged out of FFXIV. I can live with the minimal disruption for now, but hopefully Surfshark will fix their DNS… or at least create some additional VIPs with a longer TTL.

msatter · August 29, 2020, 7:42am

If you set a TTL, the cached address could end up pointing to a server that is not listening at that moment. Even 5 minutes seems to be long for Surfshark.

I set it without a TTL so that multiple DNS entries are not created, and the interval of the schedule determines the actual TTL. If you ADD, you can end up with multiple entries for the same domain name; the DNS server will then do a round-robin and use the matching DNS entries in a looped sequence.

Complain to Surfshark that the 5 second TTL is just too short; if they get enough complaints they might go back to a more sensible TTL.

ADD with a high schedule frequency and a set TTL will give you multiple entries that are used in a round-robin way.
SET [find] with a set schedule frequency gives you one entry that is overwritten on the next scheduled run. The IP address is always overwritten.
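The two variants could look roughly like this. The names are the placeholders from the earlier example in this thread (ikev2.wireshark.lan and the elided xxxxxxx.surfshark.com); the 5m TTL is only an illustrative value.

```
# ADD: every scheduled run appends another A record for the same name.
# Static entries are not cleaned up automatically, so they accumulate,
# and the router answers with them in round-robin order.
/ip dns static add name=ikev2.wireshark.lan address=[:resolve xxxxxxx.surfshark.com] ttl=5m

# SET: the entry must already exist; every run overwrites its address,
# so there is only ever one record for the name.
/ip dns static set [find name=ikev2.wireshark.lan] address=[:resolve xxxxxxx.surfshark.com]
```

With SET, the schedule interval is effectively the lifetime of the address, which is why no TTL is given on that line.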

I am using fully static IP addresses because I use the local domain names/IP addresses in lists and rules, so they have to stay valid for as long as the connection is active… in effect permanently.

muhlpaul · August 29, 2020, 8:36am

Hello,

Setting a local DNS name with the manually found static IP address of the Surfshark VPN server is working…
but I didn’t find any information on how to set up the scheduled script to renew this static DNS entry at RB start and when the line drops, for example…
Any idea?
shogunx, can you maybe share your script for comparison?
What about only running that script at RB start and after the line drops, and not every 8 hours?

Sob · August 29, 2020, 8:36pm

I’d question whether what RouterOS does is correct. I’m not saying that it definitely isn’t, but it seems wrong, or at least not expected or desired.

If you have other kinds of VPN (SSTP, …), they don’t care about address changes. They take the hostname when they connect, resolve it to an IP address and connect to it. As long as the tunnel is up and running, the hostname can change every second and they don’t care about it. Only when the tunnel fails is the hostname resolved again and the new address used.

IPSec is not as simple as SSTP’s one TCP connection. With this kind of client VPN it won’t happen, but generally the tunnel can be initiated from both sides. So if remote peer is configured with hostname, RouterOS must periodically resolve it, to be prepared to accept requests for new tunnel from peer’s new address, in case it changes.

But it seems wrong to tear down a perfectly fine and working tunnel. With site to site tunnels, maybe. But not with client to server configs like this, IMHO it should behave the same way as other VPN types, i.e. use the original address and switch to new one only when tunnel breaks.

AlexS · August 29, 2020, 9:54pm

These are my thoughts exactly - why break a working connection? If it was authenticated, then it should remain up.

Then again surfshark are doing silly things as well

Sob · August 29, 2020, 10:13pm

If they have many servers for load balancing and want to quickly get rid of failed ones, short TTL makes sense, I don’t think it’s wrong. Try to connect from Windows (or whatever OS you have) instead of from router and see what it does. I’m too lazy to configure testing server to try it myself, but my guess is that it will work fine with same short-lived hostnames.

shogunx · August 31, 2020, 1:22am

My script config:

/ip dns static add address=180.149.228.117 name=syd-vip.surfshark.local ttl=5m type=A
/system script add dont-require-permissions=no name=update-sshark-vips owner=admin policy=read,write source="/ip dns static set [find where name=syd-vip.surfshark.local] address=[:resolve au-syd.prod.surfshark.com]"
/system scheduler add interval=8h name=rotate-sshark-dns on-event="/system script run update-sshark-vips" policy=read,write start-date=aug/29/2020 start-time=07:00:00
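A possible refinement, sketched under assumptions rather than tested against Surfshark: only refresh the entry while no IPsec peer is established, so a working tunnel is not disturbed, and let the scheduler also fire at boot. The script and scheduler names below are hypothetical, and the 1m interval is an arbitrary choice.

```
# Hypothetical variant of update-sshark-vips: touch the static entry only when
# no IPsec peer is in the "established" state, leaving a working tunnel alone.
/system script add name=update-sshark-vips-if-down source={
    :if ([:len [/ip ipsec active-peers find where state="established"]] = 0) do={
        /ip dns static set [find where name="syd-vip.surfshark.local"] address=[:resolve "au-syd.prod.surfshark.com"]
    }
}

# start-time=startup runs the script shortly after boot; interval=1m repeats it,
# which also covers the case where DNS is not reachable yet on the first run.
/system scheduler add name=refresh-sshark-dns start-time=startup interval=1m on-event="/system script run update-sshark-vips-if-down"
```

While the tunnel is up the script does nothing; once it drops, the next run picks a fresh address for the peer to try.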

shogunx · August 31, 2020, 1:47am

The difference is that IPSec is an IP (layer 3) protocol; SSTP, OpenVPN, and other SSL based VPNs are TCP/UDP protocols, they operate at layer 4 (and higher). SSL VPNs usually have application clients that run in user land to establish and manage the connection. When an SSL VPN initiates a connection, the client looks up the IP once and caches it for itself for the life of the connection.

IPSec is different - because it operates at a lower layer, most of the work happens in the kernel and is managed by the operating system, so an IPSec VPN uses the operating system’s DNS cache, which generally respects the TTL of the DNS records it retrieves. If the VPN is configured to use a DNS hostname, every time it has to send packets to that host, it has to do a DNS lookup. If the result isn’t cached due to a low TTL, it will query the authoritative server and use whatever result is returned.

IPSec also doesn’t understand “connections” the way TCP does. An IPSec tunnel is really just a pair of matching security associations (SAs), tied to entries in the security policy database, which say “encrypt traffic from A.A.A.A to B.B.B.B with key X” and “decrypt traffic from B.B.B.B to A.A.A.A with key Y”. If the IP address for one end changes to C.C.C.C, then the existing SAs are invalid and tearing them down is the correct response. Even if the OS cached the IP address for the lifetime of the SA, the next time the tunnel re-keyed when the SA expired (usually every hour), it would look up the hostname again and get a different IP. In this scenario, you would probably find things mostly worked, but any higher layer protocol using a long-lived connection would die (e.g. a large file transfer).
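On RouterOS v6 this address pairing can be looked at directly. The commands below only print state; the comments describe what to look for, not output captured from a real router.

```
# Phase 1: peers that are currently negotiated, with their remote address and state
/ip ipsec active-peers print

# Phase 2: the installed SAs, one per direction, each bound to a src/dst address pair
/ip ipsec installed-sa print

# The policies that decide which traffic is sent through those SAs
/ip ipsec policy print
```

Running the installed-sa print a few times while the tunnel is flapping should show the destination address of the outbound SA changing along with the DNS answer.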

Fundamentally, IPSec isn’t designed to work in situations where IP addresses change frequently. The design principles of IPSec initially assumed that VPNs would be established using static IP addresses, not domain names (because DNS is a higher layer protocol). In a site-to-site configuration this is pretty easy to do. In a remote access situation, it would be assumed that at least the gateway has a relatively stable IP address. If you need to load balance, a single VIP across multiple routers works fine. Using a DNS round robin to return one of a dozen IPs every 5 seconds is not something that the protocol was designed to support. That is why, generally speaking, vendors prefer SSL VPNs over IPSec - they are more fault tolerant because they support more dynamic configurations.

Sob · August 31, 2020, 3:35am

Then how do other non-RouterOS clients deal with it? Do they also disconnect every five seconds? I’m guessing no, otherwise VPN provider could hardly use this config.

shogunx · August 31, 2020, 4:11am

Well for starters, this issue is specific to IPSec, so any client that uses the OpenVPN protocol by default wouldn’t be affected.

Aside from that, when it comes to Surfshark’s native apps they have the luxury of knowing exactly how the back end works, so they can design their apps accordingly. If you look at most of the FAQs on their site, they seem to favor OpenVPN over IKEv2, so I’m guessing they are probably using OpenVPN by default in their clients. If not, they could be doing some other application layer magic, like dynamically creating IPSec profiles after they do DNS lookups (so the IP is static in the profile but they can still use a DNS name to do lookups when initially connecting). Any application client could do the same really, but its developers would have to know that it needs doing and program it accordingly.

In most cases, IPSec clients will just rely on the low level functions provided by the OS rather than implement their own over the top functionality. RouterOS is a pretty bare bones OS, running on MIPS or ARM CPUs instead of x86, with minimal CPU speed and RAM compared to a desktop computer (or even a high end smartphone) - it just doesn’t make sense to implement complex application layer functionality when you can get the job done with low level, memory efficient system calls.

As far as other non-native apps, I haven’t got any info as to whether or not people have run into similar issues with other devices or IPSec clients like StrongSwan. My assumption would be that any IPSec client that is RFC compliant should probably experience the same issue.

Sob · August 31, 2020, 12:33pm

I don’t know the exact config and I don’t have access to their servers (and I’m not going to sign up just to test it), but it would be interesting if someone who does would test it with some other client.

Sob · September 2, 2020, 1:13pm

Look what’s in latest beta:

What’s new in 6.48beta35 (2020-Sep-02 07:50):
…
*) ipsec - refresh peer’s DNS only when phase 1 is down;

Rock · September 4, 2020, 7:10am

My version of the solution. I hope it will be useful.
Mikrotik HEX 6.47.2.
Configure the MikroTik for your Internet provider with the standard firewall rules, and follow all the instructions on the site support.surfshark.com.

Then make the following changes to those instructions:

Remove the local network list and add the “VPN” connection mark to the VPN packets:

/ip ipsec mode-config
add connection-mark=VPN name=Surfshark_VPN responder=no

Get the list of IP addresses of the desired Surfshark server from the command line:

nslookup server_of_surfshark.prod.surfshark.com

(Replace it with the one you need)

Add them as A records to the MikroTik’s local static DNS.

In IP-IPsec-Peer, put the server in the Address field.

Once the tunnel shows as alive in Active Peers, a dynamic rule for the marked packets appears in NAT.

Next, you need to specify what to do with these packets. Go to IP-Firewall-Mangle:

/ip firewall mangle
add action=mark-connection chain=prerouting new-connection-mark=VPN passthrough=yes

Now Internet access works well on a PC, fully routed through the VPN, but it does not work well on phones over Wi-Fi. To fix this, we need to change the MSS.

Fixing MSS for forwarded packets:

/ip firewall mangle
add action=change-mss chain=forward new-mss=1360 protocol=tcp tcp-flags=syn tcp-mss=1453-65535

That’s all.

msatter · September 4, 2020, 9:29am

Fixing MSS for forwarded packets:

/ip firewall mangle
add action=change-mss chain=forward new-mss=1360 protocol=tcp tcp-flags=syn tcp-mss=1453-65535

There is a better way than just limiting the MSS to 1360.

The problem is that RouterOS does not send ICMP type 3 code 4 (fragmentation needed) to the client when using an IKEv2 connection.

http://forum.mikrotik.com/t/mtu-troubles-using-ikev2-providers-like-nordvpn-work-around/135154/34

m1tschen · September 5, 2020, 2:43pm

Many thanks, guys, for your solution. This problem has driven me nuts.
Cheers
Michael

singh33 · September 6, 2020, 6:20am

I have the same problem with Surfshark IKEv2: every few seconds it kills the IKEv2 SA, then gets a new one.

I hope MikroTik fixes it.

msatter · September 6, 2020, 9:03am

Did you read the thread?