Two days ago I configured 4 x GRE / IPSec tunnels on my CCR running 6.42.12. I use this exact same configuration elsewhere successfully on a CCR running 6.42.6. All 4 tunnels were up and stable and BGP neighbours connected and exchanging routes as expected.
Yesterday morning I noticed that one of the tunnels was down. The log indicates that ph2 cannot establish, and it is flooded with “ipsec failed to pre-process ph2 packet”. The policy for the tunnel was marked in red (I recall this was usually an indication that the policy was invalid).
Anyways I went through the process of clearing all SAs, enabling and disabling the peer, the policy, the associated addresses etc. multiple times without the ph2 re-establishing. Resetting the tunnel from the far end also had no effect. I deleted the peer and policy and recreated with the exact same result. I double checked all configs making 100% sure there were no overlapping subnets but I could not find any issue. None of the other 3 tunnels showed the same behavior.
At some stage I left the policy disabled for quite a while (guessing > 30 mins). After enabling it, to my surprise, the ph2 established. I was just wondering if this is perhaps known behavior that was introduced somewhere between 6.42.6 and 6.42.12? Or any other thoughts?
Thanks for the reply. I have read the release notes, and through the versions there is a substantial number of IPSec and IKE changes. I guess I will just test version by version until I get it fixed. Part of the issue here is that when the tunnel gets interrupted, it does not correctly re-establish ph2. Enabling and disabling the peer or policy does not recreate the issue, but if you actually simulate an interruption (dropping traffic with firewalling), ph2 does not re-establish. Anyway, I will update my findings here.
Does anyone here perhaps have any specific information on why the ph2 policies would go into an invalid state out of the blue? This happens randomly and I cannot reproduce it on demand, so trying to roll forward version by version is going to take forever.
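For anyone wanting to reproduce this, the simulated interruption mentioned above could look roughly like the sketch below. It is only an illustration: 203.0.113.10 is a placeholder for the far-end public address and the comment string is made up. If accept rules precede these in the input chain, the test rules would need to be moved above them (for example by dragging them in WinBox).

```
# Temporarily drop IKE and ESP arriving from the far-end peer
# (203.0.113.10 is a placeholder, not an address from this thread)
/ip firewall filter
add chain=input action=drop protocol=udp dst-port=500,4500 src-address=203.0.113.10 comment="ipsec-interruption-test"
add chain=input action=drop protocol=ipsec-esp src-address=203.0.113.10 comment="ipsec-interruption-test"

# ...leave the rules in place for a while, then remove them again
/ip firewall filter remove [find comment="ipsec-interruption-test"]
```

After the rules are removed, `/ip ipsec policy print` shows whether ph2 came back for the affected policy.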
Like I mentioned before, disabling the policy for an extended amount of time and then enabling it seems to resolve the problem. The problem is not on the device at the far end as I have other routers terminating tunnels with the same configuration on the same device that never show this behavior.
As removing and re-creating the peers and policies didn’t help while “letting everything cool down for a while” did, I’d suspect some connection tracking issue somewhere, possibly in the network between the two devices, where an existing connection had to time out and disappear from the connection tracking tables in order to allow the peers to establish communication again. However, “respond new phase 1 (Identity Protection)” followed by “no suitable proposal found” suggests something more complex, like multiple local side peers being configured with different Phase 1 proposals (aka peer profiles), and the wrong one being hit by the incoming packet.
If you want a more detailed response, follow the hint in my automatic signature, configurations from both ends are necessary.
The phase 2 issue existed only for IKEv2 and it only happened once in many renegotiations while both ends showed everything to be just fine except that the data did not get through because the encryption keys differed between the ends. So I think your case is completely unrelated to that issue (and I don’t remember in which version it has been fixed a year ago or so although it was me who has reported it).
I have another Mikrotik router in another location that terminates GRE/IPSec tunnels on the exact same device on the far end (VyOS running StrongSwan) and I have never seen this behavior there. As a test I downgraded the ROS version of the “problematic” router to 6.42.6. Three days in, the issue occurred again: one policy was marked as invalid. Without any intervention, it recovered 6 hours later. I think you are correct that it might be due to connection tracking somewhere. The strange thing is that the far end indicates ph1 and ph2 up. Resetting the tunnels from the far side has no effect. The only difference I can think of is that the connection over which the policies change to an invalid state is a PPPoE internet connection. In the other locations where this configuration is used successfully, the internet connections are direct fibre connections. What are your thoughts on this? One thing I have not tried yet is actually disabling and re-enabling the PPPoE client session to see if that makes a difference (previously I just tried to remove the relevant UDP 500 and ESP connections from connection tracking). Btw, connection tracking on my Mikrotik routers is configured as “auto”.
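Removing the tracked connections for a single peer, as described above, might be done along these lines. This is a sketch only: 203.0.113.10 is a placeholder for the far-end public address, and since the match is a regular expression (where a dot matches any character) the printed list should be reviewed before anything is removed.

```
# List tracked connections to or from the far-end peer
# (203.0.113.10 is a placeholder address)
/ip firewall connection print where dst-address~"^203.0.113.10"
/ip firewall connection print where src-address~"^203.0.113.10"

# Remove them so that fresh IKE/ESP traffic creates new entries
/ip firewall connection remove [find where dst-address~"^203.0.113.10"]
/ip firewall connection remove [find where src-address~"^203.0.113.10"]
```

Matching on the peer address rather than on port or protocol catches both the UDP 500 and the ESP entries in one go.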
One other thing that you have mentioned is the proposals. Even though all the IPSec tunnels on the router in question are configured to use the same proposal (add enc-algorithms=aes-128-cbc lifetime=1h name=vpn-core), there is another proposal (the default one) used for inbound L2TP over IPSec connections from remote users (set [ find default=yes ] enc-algorithms=aes-256-cbc pfs-group=modp2048). If this proposal is somehow used, then I can understand that there will be an issue, but I cannot see how that would happen.
When I checked this morning, one policy was marked as invalid once more.
Enabling and disabling the PPPoE client had no effect.
Changing the crypto to the same settings as the ones used in the default policy had no effect.
The device on the far side indicated both ph1 and ph2 up. Bizarrely, even though the policy on the Mikrotik is marked as invalid and clearly indicated as not established, the GRE interface is in a running state and the BGP neighbor is connected. So it seems to me that even though the policy is marked as invalid in the UI as well as the terminal, somewhere, somehow a valid policy is still installed. When I disable and re-enable the peer, ph1 re-establishes immediately but ph2 does not, and obviously the BGP neighbor now loses comms as the GRE is no longer in a running state. No matter what I do, I cannot force the policy into a valid state again. After x hours, it just magically recovers. Diagnostic logs indicate: ipsec,debug no policy found for id:11473.
I quite urgently need to get this resolved - again, any assistance would be greatly appreciated.
Below the invalid policy with traffic flowing correctly as if the tunnel is working 100% (prior to resetting the peer):
What makes my brain slide a bit is that you use the IP address attached to the GRE interface also as a local address to send the GRE transport packets from. Could you, just for testing, create an /interface bridge without any member ports and attach to it the IP address which is used as local-address of /interface gre (and thus the src-address of the /ip ipsec policy) and use a different address for the GRE interface itself (I don’t mind which one of the two will remain the current one and for which purpose you’ll use a new one)? I have no idea how this would be done at VyOS side but it should be possible as well.
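On the Mikrotik side, the split suggested above could be sketched as follows. The bridge name is hypothetical, the address is only an example, and the sketch assumes the /32 transport address currently sits on the GRE interface while the link address for the GRE interface is configured separately.

```
# Empty bridge acting as a loopback interface (name is hypothetical)
/interface bridge add name=lo-gre-transport

# Move the /32 transport address (the one used as GRE local-address and
# as src-address of the IPsec policy) from the GRE interface to the bridge
/ip address set [find address="169.254.200.49/32"] interface=lo-gre-transport

# Verify which interface each address is attached to afterwards
/ip address print
```

The GRE interface keeps its local-address setting; only the interface that owns the address changes.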
Also, I’m not sure why you’ve assigned a /32 address to one GRE tunnel and a /30 address to another one.
As you seem not to mind publishing your public IP addresses, can you post the log which says “no policy found”? There should be more information in the log than just this.
Please show me also the output of /ip ipsec policy print and /ip ipsec installed-sa print when it runs but the policy is marked as inactive. The encryption and authentication keys shown in the installed-sa printouts are temporary ones so no need to edit them out, it is enough to wait up to 30 minutes before posting them.
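Collecting the requested information might look like the sketch below. The file names are arbitrary, and the logging rule is only one possible way to get more detail than the single “no policy found” line.

```
# Log IPsec messages without the per-packet dumps
/system logging add topics=ipsec,!packet action=memory

# Save the complete policy and SA listings to files for download
/ip ipsec policy print detail file=ipsec-policy
/ip ipsec installed-sa print file=ipsec-installed-sa
```

The resulting files appear under /file and can be downloaded and edited before posting.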
Once more thank you for the reply. Below the output of the peer and sa status. I have also attached the log, obfuscated the IPs as per the config provided and marked non-relevant IPs as x.x.x.x.
The reason for the “odd” GRE configuration stems from the VyOS documentation, where it is recommended to use these /32 loopback IPs to match the IPSec policies on. The /30 network is used as a link network and also provides the BGP peer IPs on both sides of the tunnel. I agree that my implementation on the Mikrotik side can probably be simplified or improved, which I will try as per your suggestion.
SyslogCatchAll-2019-04-17 obfuscated.txt (512 KB)
First of all, all occurrences of ipsec policy not found in the log you’ve sent are preceded by a message ipsec searching for policy for selector: 169.254.200.49 ip-proto:47 <=> 169.254.200.50 ip-proto:47 which doesn’t match the parameters of the /ip ipsec policy you’ve shown in the quotation from your config in post #6 (dst-address=169.254.200.66/32 src-address=169.254.200.65/32). Post #9 shows the policy matching the log, not the configuration. Is this because you’ve split the IPs as I’ve asked you before?
Second, showing just parts of the configuration and status outputs breaks the purpose, my intention was to check the existing policies (including dynamically created ones) for shadowing each other, which you’ve ruined by showing just the single policy and SA. So please post the complete printout as requested - as text, not as screenshot. You can always direct the output of a print to a file by adding file=some-name, download the file and edit it before posting to hide the IPs, but such editing has to preserve consistency - the same IP address must be substituted by the same replacement string in the whole file, and the traffic selector addresses of the policies (src-address and dst-address) must remain completely unchanged to show the eventual shadowing of policies by preceding ones.
Hi, on your first question - no, as you can see in post #7, (dst-address=169.254.200.66/32 src-address=169.254.200.65/32) is another policy used by another tunnel. That tunnel also shows the same sporadic behaviour. Its config was supplied in this thread because I use whichever tunnel is problematic as an example when the policies go into an invalid state - I cannot predict when this happens and I supplied the information that was available to me at the time.
I have subsequently moved the source address of the GRE tunnel to a bridge interface as per your suggestion.
My apologies that the configs were not supplied to your satisfaction. I have added it below as per your request. I have also substituted all the public addresses consistently.
Policies:
# apr/17/2019 14:55: 1 by RouterOS 6.42.6
# software id = PFRD-4CSN
#
Flags: T - template, X - disabled, D - dynamic, I - invalid, A - active,
       * - default
 0 T * group=default src-address=::/0 dst-address=::/0 protocol=all
       proposal=default template=yes
 1 DA  src-address=x.x.x.x/32 src-port=any dst-address=x.x.x.x/32
       dst-port=any protocol=udp action=encrypt level=unique
       ipsec-protocols=esp tunnel=no proposal=default ph2-count=1
 2 DA  src-address=x.x.x.x/32 src-port=any dst-address=x.x.x.x/32
       dst-port=any protocol=udp action=encrypt level=unique
       ipsec-protocols=esp tunnel=no proposal=default ph2-count=1
 3 DA  src-address=x.x.x.x/32 src-port=any dst-address=x.x.x.x/32
       dst-port=any protocol=udp action=encrypt level=unique
       ipsec-protocols=esp tunnel=no proposal=default ph2-count=1
 4 DA  src-address=x.x.x.x/32 src-port=any
       dst-address=x.x.x.x/32 dst-port=any protocol=udp
       action=encrypt level=unique ipsec-protocols=esp tunnel=no
       proposal=default ph2-count=1
 5 DA  src-address=x.x.x.x/32 src-port=any dst-address=x.x.x.x/32
       dst-port=any protocol=udp action=encrypt level=unique
       ipsec-protocols=esp tunnel=no proposal=default ph2-count=1
 6 DA  src-address=x.x.x.x/32 src-port=any dst-address=x.x.x.x/32
       dst-port=any protocol=udp action=encrypt level=unique
       ipsec-protocols=esp tunnel=no proposal=default ph2-count=1
 7 DA  src-address=x.x.x.x/32 src-port=any dst-address=169.1.101.47/32
       dst-port=any protocol=udp action=encrypt level=unique
       ipsec-protocols=esp tunnel=no proposal=default ph2-count=1
 8 XI  ;;; vr1.ue1a via ISP1
       src-address=169.254.200.37/32 src-port=any
       dst-address=169.254.200.38/32 dst-port=any protocol=gre action=encrypt
       level=require ipsec-protocols=esp tunnel=yes
       sa-src-address=x.x.x.x sa-dst-address=x.x.x.x
       proposal=vpn-core ph2-count=0
 9 XI  ;;; vr1.ew1a via ISP1
       src-address=169.254.200.53/32 src-port=any
       dst-address=169.254.200.54/32 dst-port=any protocol=gre action=encrypt
       level=require ipsec-protocols=esp tunnel=yes
       sa-src-address=x.x.x.x sa-dst-address=x.x.x.x
       proposal=vpn-core ph2-count=0
10 A   ;;; vr1.ue1a via ISP2
       src-address=169.254.200.41/32 src-port=any
       dst-address=169.254.200.42/32 dst-port=any protocol=gre action=encrypt
       level=require ipsec-protocols=esp tunnel=yes
       sa-src-address=11.0.0.11 sa-dst-address=22.0.0.22
       proposal=vpn-core ph2-count=1
11 I   ;;; vr2.ue1b via ISP2
       src-address=169.254.200.49/32 src-port=any
       dst-address=169.254.200.50/32 dst-port=any protocol=gre action=encrypt
       level=require ipsec-protocols=esp tunnel=yes
       sa-src-address=11.0.0.11 sa-dst-address=33.0.0.33
       proposal=vpn-core-2 ph2-count=0
12 A   ;;; vr1.ew1a via ISP2
       src-address=169.254.200.57/32 src-port=any
       dst-address=169.254.200.58/32 dst-port=any protocol=gre action=encrypt
       level=require ipsec-protocols=esp tunnel=yes
       sa-src-address=11.0.0.11 sa-dst-address=44.0.0.44
       proposal=vpn-core ph2-count=1
13 XI  ;;; vr2.ue1b via ISP1
       src-address=169.254.200.45/32 src-port=any
       dst-address=169.254.200.46/32 dst-port=any protocol=gre action=encrypt
       level=require ipsec-protocols=esp tunnel=yes
       sa-src-address=x.x.x.x sa-dst-address=x.x.x.x
       proposal=vpn-core ph2-count=0
14 A   ;;; vr2.ew1b via ISP2
       src-address=169.254.200.65/32 src-port=any
       dst-address=169.254.200.66/32 dst-port=any protocol=gre action=encrypt
       level=require ipsec-protocols=esp tunnel=yes
       sa-src-address=11.0.0.11 sa-dst-address=55.0.0.55
       proposal=vpn-core ph2-count=2
15 XI  ;;; vr2.ew1b via ISP1
       src-address=169.254.200.61/32 src-port=any
       dst-address=169.254.200.62/32 dst-port=any protocol=gre action=encrypt
       level=require ipsec-protocols=esp tunnel=yes
       sa-src-address=x.x.x.x sa-dst-address=x.x.x.x
       proposal=vpn-core ph2-count=0
16 DA  src-address=x.x.x.x/32 src-port=1701 dst-address=x.x.x.x/32
       dst-port=1701 protocol=udp action=encrypt level=require
       ipsec-protocols=esp tunnel=no proposal=default ph2-count=1
My personal opinion is that this has nothing to do with configuration. None of your other policies can shadow the one mysteriously going inactive, even if we’d admit that the order of the policies wouldn’t matter. It normally does but when you add a new policy this used to fail in older ROS releases, e.g. on the 6.43.8 where I’ve just tested it.
So it is either a bug or, if the other Routerboard/CHR/x86 on which you’ve never seen the issue happen is the same model as the problematic one, there can be a hardware (RAM) issue on the problematic one.
So you may try the following steps and hope that one of them works around the bug (the last one may “fix” rather than “workaround” it although none of the changelog items listed below is guaranteed to be related):
separate the addresses used to send the GRE transport packets (to be matched by the IPsec policy) from the addresses attached to the GRE interface, at least on the Mikrotik side, the way I’ve suggested above; doing so will allow you to remove the protocol=gre from the traffic selector of the /ip ipsec policy
set the sa-src-address of the policy to 0.0.0.0. It may make things better (the policy won’t jump inactive) or worse (it won’t ever come up); the idea is just to force a different branch of the algorithm, not something precisely targeted at a known issue
set the level to unique instead of the current require - same motivation as above
upgrade to 6.43.12 (the last available one before 6.44 - to keep the structure of IPsec configuration you are used to)
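The two policy changes in the list above could be tried roughly as follows, one at a time. This is a sketch: the policy is selected by its comment as it appears in the listing earlier in the thread, so the comment string has to be adapted to the affected tunnel.

```
# Step 1: let the router pick the SA source address itself
/ip ipsec policy set [find comment="vr2.ue1b via ISP2"] sa-src-address=0.0.0.0

# Step 2 (if step 1 does not help): switch the level from require to unique
/ip ipsec policy set [find comment="vr2.ue1b via ISP2"] level=unique

# Check the flags afterwards: A = active, I = invalid
/ip ipsec policy print
```

Applying the changes separately makes it easier to tell which of them, if any, affects the behavior.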
The following items in the changelog may be related:
6.42.7: *) ipsec - improved invalid policy handling when a valid policy is uninstalled;
6.43:
*) ike2 - fixed initiator first policy selection;
*) ipsec - improved invalid policy handling when a valid policy is uninstalled;
*) ipsec - improved reliability on generated policy addition when IKEv1 or IKEv2 used;
The “improved invalid policy handling when a valid policy is uninstalled” is mentioned also in 6.44 changelog, so it seems to be a complex issue.
Changed the policy level to unique. This unfortunately did not resolve the issue.
Moved the IP addresses used to match the IPSec policy to a loopback adapter (bridge) and away from the GRE interface. This unfortunately did not resolve the issue.
Upgraded to v6.43.16. The issue, however, remains.
One thing that I have been able to achieve is forcing the policy out of the invalid state. This is achieved by changing the protocol to something else - e.g. all. Changing it back to protocol 47, immediately puts it back into an invalid state. Even if the protocol is set to all, not 47, it still randomly goes into an invalid state (and some hours later, fixes itself, and the process repeats).
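In CLI terms, the protocol change described above would be something like the sketch below. The policy is again selected by its comment from the earlier listing, which is an assumption about which tunnel is affected at the time.

```
# Forces the policy out of the invalid state (temporarily)
/ip ipsec policy set [find comment="vr2.ue1b via ISP2"] protocol=all

# Changing it back to GRE (protocol 47) reportedly makes it invalid again
/ip ipsec policy set [find comment="vr2.ue1b via ISP2"] protocol=gre
```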
Hi! Do you have any fresh findings on how to resolve this bug with GRE (transport) + IPSec?
We have the same situation now…
By the way, for some tunnels I changed from GRE to IPIP, and they work without problems. Then I returned to GRE, and one or more times a day we get ipsec,error … dead ph2.
Hello, same here. After upgrading an RB3011UiAS from v6.44.1 to v6.45.1, L2TP stopped working and I am getting the error: “src IP address failed to pre-process ph2 packet.”
Downgraded back to v6.44.1 still getting the error.
Upgraded to v 6.44.5 (Long term) and still having the issue with L2TP ph2 fail…
PPTP, SSTP working fine…
The configuration is the same as before the first upgrade… I did an “export compact file=config.rsc” and re-checked everything…
Okay, I found that my default IPsec policy was disabled…
I enabled it…
Now I don’t receive the error, but I still can’t connect. The MT log is attached: L2TP log.txt (5.45 KB)
After trying many different approaches and new versions, I upgraded my CCR1036 to v6.45.6 on the 18th of September and so far the IPSec policies have remained stable. I have noticed that dynamic policies used for L2TP over IPSec sometimes go into an invalid state, but this seems to be isolated to L2TP and I will dig deeper into that at a later stage. For now it looks good. I believe that “*) ipsec - fixed policies becoming invalid after changing priority;” in the 6.45.1 changelog addressed this. Holding thumbs it stays stable.
@Valdis, the 6.45.6 upgrade also broke my L2TP server. It seems to have removed the IPSec secret which I had to add back and there was also something wrong or mismatched with the policy. I just reconfigured the default policy and cleaned up the auto generated stuff and that solved the issue.