IPSec Phase 1 fails on restart, multiple IPs

joelwhrs · November 4, 2015, 2:22am

I am having an issue with Phase 1 failing on two IPSec connections after a router restart. It shows up as a Phase 1 timeout error.
As soon as I disable all external IP addresses (there are 4, all in the same subnet) except for the one used by the IPSec connection, it works. I can then re-enable these IP addresses and it continues working. I can even disable the IPSec connection, flush the remote peers and kill all the IPSec connections, and it comes back up as soon as I re-enable the IPSec connection. Phase 1 is configured with a Src. Address.
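
A minimal sketch of that workaround on the RouterOS CLI. All addresses are hypothetical placeholders: 203.0.113.2 stands in for the IPSec source address, and .3 to .5 for the other three WAN IPs.

```
# Disable the WAN addresses that the IPSec connection does not use
/ip address disable [find address="203.0.113.3/28"]
/ip address disable [find address="203.0.113.4/28"]
/ip address disable [find address="203.0.113.5/28"]

# ...wait until Phase 1 has established, then bring them back
/ip address enable [find address="203.0.113.3/28"]
/ip address enable [find address="203.0.113.4/28"]
/ip address enable [find address="203.0.113.5/28"]
```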

Any ideas?

joelwhrs · November 7, 2015, 1:53am

I tried adding some routes to the remote IP address with a preferred source of the IP that I want it to use. This didn’t make any difference, except that when I disabled my other IP addresses, the IPSec connection didn’t come up again. I had to remove the routes, disable the IP addresses and then restart for it to work. Is this a configuration issue or a bug? It’s quite inconvenient to have to go through this after every restart.
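
For reference, a route of the kind described would look roughly like this. The peer address 198.51.100.10, the gateway 203.0.113.1 and the source 203.0.113.2 are made-up example values, and as noted above this did not fix the problem:

```
# Host route to the remote peer with an explicit preferred source address
/ip route add dst-address=198.51.100.10/32 gateway=203.0.113.1 pref-src=203.0.113.2
```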

pe1chl · November 7, 2015, 9:20am

Maybe you forgot to allow UDP port 500 and/or protocol ESP/AH for input?
It will work ok when a router makes the outgoing connection and traffic keeps flowing, due to the ESTABLISHED rule, but when one side is rebooted the link may be dead.
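
A sketch of the input rules meant here, assuming a default-drop input chain. Including UDP 4500 (NAT-T) is an addition that was not mentioned above, and the rules have to sit above any drop rule to take effect:

```
/ip firewall filter
# IKE (UDP 500); 4500 is only needed when NAT traversal is in use
add chain=input protocol=udp dst-port=500,4500 action=accept comment="IKE / NAT-T"
# ESP is IP protocol 50, AH is IP protocol 51
add chain=input protocol=ipsec-esp action=accept comment="ESP"
add chain=input protocol=ipsec-ah action=accept comment="AH"
```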

joelwhrs · November 9, 2015, 1:02pm

There is a rule for this. I suspected this as well, but it only reconnects when I disable all IPs except for the one that’s used by the IPSec connection. If the firewall were the cause, it would also block the connection with those IPs disabled.

joelwhrs · November 12, 2015, 12:56pm

Should I assume this is a bug and file a bug report?

royalpublishing · November 17, 2015, 5:40pm

I am on 6.33 and have been dealing with this ongoing issue for over a year now, and I am getting extremely frustrated by it. The problem seems to pop up randomly once every few months or so, and this morning was one of those times. It’s not a firewall issue here: all VPN routers have input rules to allow all traffic. Even after a reboot of all routers involved, they still won’t connect, which just blows my mind. Disabling the non-tunnel-related IP addresses and IPsec rules, flushing the SAs, and re-enabling the rules didn’t seem to work for me; however, I was not onsite and had to connect remotely via WinBox, so I had to be extremely careful not to lock myself out. The outages during which these messages are displayed in the event logs seem to last around 30 minutes at a time, and then traffic starts passing again.

joelwhrs · November 17, 2015, 7:20pm

My issue is that the IPsec trunk doesn’t connect at all. So far it has worked to disable all the IP addresses except for the one that IPsec uses. As soon as they are disabled it connects, and then I can re-enable everything and it stays up. I can even terminate the IPsec connection and, upon re-enabling, it reconnects immediately. It’s extremely frustrating and definitely seems to be a bug.

ALDISBEHMANIS · November 19, 2015, 1:37pm

I have more or less the same problem, and at the moment I cannot solve it.

The problem is that MT tries to reach the gateway from the lowest IP address.
For example, if you have .3, .2 and .1 on the WAN and the IPsec tunnel is built from .2, MT tries to push all traffic to the gateway through the .1 address. It NATs your outgoing .2 traffic to .1 and sends it from .1 to the remote end of the IPsec tunnel.
I have no idea how to solve this.
I have tried everything I could come up with, including setting the IPsec peer local IP to .2, adding AS routes with a preferred source, etc. Nothing seems to help in the normal way.
To get it running, disable and re-enable the .1 IP on the WAN (.2 then becomes the preferred output IP); it works until a reboot, when .1 becomes the preferred output IP again.
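
One way to check which source address the IKE traffic really leaves from is to look at the connection tracking table. This is only a sketch; ether1 and the peer address 198.51.100.10 are hypothetical:

```
# List the WAN addresses and the order in which they are configured
/ip address print where interface=ether1

# Show tracked connections towards the remote peer; the reply-dst-address
# field reveals the address the traffic was NATed to on the way out
/ip firewall connection print detail where dst-address~"198.51.100.10"
```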

joelwhrs · November 19, 2015, 1:57pm

Sounds exactly like my issue. What seems really strange is the fact that there is an option to select the SA Source Address within Phase 1 of the IPSec rule. I would have imagined that this is the IP that it would use as the source address.

ALDISBEHMANIS · November 19, 2015, 2:34pm

It actually is used as the reply IP, but the router NATs your source IP to its default IP for reaching the gateway.
I have no idea how to change this situation.
Using the “smallest” IP for IPsec tunnels is simply a stupid idea, even if it would work …

ALDISBEHMANIS · November 19, 2015, 5:33pm

ok. i got the solution!

1) Probably all your IPs on the WAN have the same mask … that is wrong. All except one must have /32 (assuming all of them use the same gateway IP).
2.0) Firewall - NAT: add a rule on top (before your masquerade): chain src-nat, dest-addr, protocol 50 (ESP), action=accept
2.1) Firewall - NAT: add a rule on top (before your masquerade): chain src-nat, dest-addr, protocol 17 (UDP), port 500, action=accept
Of course, don’t forget to add one more accept rule before the masquerade:
source-addr local-subnet remote-addr remote-subnet action=accept ← lets your packets enter the tunnel
Check that you have filter rules that accept the IPsec protocols and ports.
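
The steps above could be sketched on the CLI roughly as follows. Every address and subnet here is a made-up example, and two details are assumptions rather than something stated in the post: that the address keeping the real mask is the IPsec source, and that port 500 is matched as destination port.

```
# Secondary WAN addresses become /32, with network equal to the address itself
/ip address set [find address="203.0.113.3/28"] address=203.0.113.3/32 network=203.0.113.3
/ip address set [find address="203.0.113.4/28"] address=203.0.113.4/32 network=203.0.113.4
# 203.0.113.2/28 (assumed to be the IPsec source) keeps its original mask

/ip firewall nat
# Exempt ESP and IKE towards the remote peer from source NAT
add chain=srcnat dst-address=198.51.100.10 protocol=ipsec-esp action=accept comment="no NAT for ESP"
add chain=srcnat dst-address=198.51.100.10 protocol=udp dst-port=500 action=accept comment="no NAT for IKE"
# Exempt the tunnelled subnets so the traffic still matches the IPsec policy
add chain=srcnat src-address=192.168.1.0/24 dst-address=192.168.2.0/24 action=accept comment="no NAT into tunnel"
# New rules are appended at the end: move all three above the masquerade rule
```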

Don’t forget to reboot both routers so all wrong connections (from wrong IPs) get killed. Or kill them manually on both ends.
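
Killing the stale entries by hand, instead of rebooting, might look like this (198.51.100.10 is again a placeholder for the remote peer; run the equivalent on the other router with the addresses swapped):

```
# Drop tracked connections towards the peer that were created from the wrong source IP
/ip firewall connection remove [find dst-address~"198.51.100.10"]
```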

…fuck … it took me 2 days to get it running

joelwhrs · November 21, 2015, 6:41pm

Perfect! I had a value for network set on the Address list as well. I had to remove that when I took the /28 subnet off or it wouldn’t communicate to my gateway.

Thanks!!

royalpublishing · December 11, 2015, 2:36pm

Crap, I still seem to have this problem even though all of my additional static NAT IP addresses are already using a /32 and the only one that actually uses the correct subnet mask is the WAN IP of that interface. Also, ALDISBEHMANIS, just to clear things up: in your 2.1) statement, you didn’t specify whether port 500 in the NAT rule is the src or dst port.

joelwhrs · December 11, 2015, 2:46pm

I set mine up for any port. Also check your addresses and make sure that the address entered in the “Network” field is the same as the address entered in the “Address” field, minus the subnet mask. This was causing most of my problems.

royalpublishing · December 11, 2015, 4:18pm

Just out of curiosity, are you using Dead Peer Detection on your IPSec Peers? I have it disabled on my end and I’m wondering if any time there is a drop out on the network it could have anything to do with my issue.
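
For anyone who wants to test this, enabling DPD on a peer could look something like the sketch below. It assumes a 6.3x-era release where the DPD settings live on the peer (later releases moved them elsewhere), and the peer address and timer values are arbitrary examples:

```
# Probe the peer every 30 seconds and declare it dead after 5 missed replies
/ip ipsec peer set [find address="198.51.100.10/32"] dpd-interval=30s dpd-maximum-failures=5
```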

joelwhrs · December 12, 2015, 4:51pm

Dead peer detection is disabled on mine.

What exactly is happening with your connection?

royalpublishing · December 14, 2015, 3:41pm

Every once in a while, I’ll get errors like this in the log, and the VPN doesn’t seem to want to re-establish the connection for long periods of time.

phase1 negotiation failed due to send error. 11.22.33.44[500]<=>44.33.22.11 053e1ceacf95ca3b:3c9b14518f30b19c

I tried adding these additional NAT rules at the top of my list, will have to wait and see if the problem comes back.

royalpublishing · March 24, 2016, 2:08pm

I am still having this sporadic problem after adding all of the aforementioned NAT rules etc. As I mentioned before, whenever this happens, all VPN traffic stops flowing between the two sites. I’m not sure how to troubleshoot this issue any further, does anybody have any suggestions for me to try?

mattstephenson · March 31, 2017, 12:32am

I have had this on previous versions too, and still have it on 6.38.5, at multiple different sites with RB3011 routers.

This is usually evident at router startup, but does seem to happen sporadically as well, perhaps when there is a drop in the connection at either end.

I have left it for hours and it still just fills up the log with errors.

By emptying the ‘remote peers’, the connections instantly rebuild and change to ‘established’.
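
From the terminal, the same thing would be roughly the following. The menu name is the one used by releases of this era; later releases are believed to call it active-peers:

```
# Inspect the current Phase 1 sessions, then drop them so they renegotiate
/ip ipsec remote-peers print
/ip ipsec remote-peers kill-connections
```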

sjoram · August 24, 2019, 4:06pm

Hi all,

I just came across this after a software upgrade to ROS, so it must be a change in behaviour between versions.

I had a srcnat rule at the top of my NAT rules

chain=srcnat
src-address=10.0.0.0/8
dst-address=10.0.0.0/8
action=accept

It would appear this was masquerading the lowest IP on the WAN interface. Like others, I disabled the other IPs and it worked.

I changed the rule to

chain=srcnat
src-address=10.0.0.0/8
dst-address=10.0.0.0/8
out-interface=
action=src-nat
to-addresses=

Then re-enabled the other public IPs, killed the active peers and session re-established OK. This appears to be causing some side effects on the LAN side, which I’m investigating.
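
As a single CLI command, the changed rule would look roughly like this. The interface and the public address were left blank above, so ether1 and 203.0.113.2 are purely hypothetical stand-ins:

```
# Pin the source address of matching traffic instead of letting the router pick one
/ip firewall nat add chain=srcnat src-address=10.0.0.0/8 dst-address=10.0.0.0/8 out-interface=ether1 action=src-nat to-addresses=203.0.113.2
```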