 
loca995
just joined
Topic Author
Posts: 16
Joined: Wed Sep 01, 2021 10:31 am
Location: Italy

GRE over IPSec stops working when PPPoE interface flaps.

Wed Sep 01, 2021 8:37 pm

Hello everyone,
The situation is the following (if you need a network diagram, I will provide one).
HQ: two WANs
BO: two or more WANs.
To provide VPN connectivity with failover, every WAN connection in the BO has two GRE tunnels to the HQ. In this specific case we have 3 WANs in the BO, so 6 GRE tunnels. Every tunnel has a static route with a different distance.
Everything works fine until the PPPoE interface of a WAN in the BO flaps. From that moment on, the GRE tunnel is down. I cannot even ping the remote address used for the GRE tunnel, even when forcing the correct source address.
Gre Interfaces - BO side
BO-Gre-List.png
Tunnel configuration - BO side
BO-Gre-Int.png

Gre interfaces on HQ (interfaces names are the same on both sides)
HW-Gre-List.png
Same Tunnel configuration - HQ side
HQ-Gre-Int.png
I started the tests from the IPSec phase one: the state is established, but the RX value is 0. Phase 2 is established too. IPSec is configured with tunnel mode.
IPSEC-Status.png
IPSEC-PHASE.png
If I try to ping the remote address with a source address, as defined in the Phase 2 policy, the ping doesn't work. This makes me think of an issue with IPSec, but the next steps confuse me even more.
Ping from BO -> HQ on the loopback interfaces
Debug1.png

In the screenshot above, connection tracking shows a GRE connection. The timeout on the GRE connection is 3 minutes.
Now if I keep the GRE interfaces disabled for three minutes on both sites, the timeout expires. From that moment on, I am able to ping the loopback interfaces on the other side.
Debug2.png
So, at this point, I bring up the GRE interfaces on both sides, and everything starts working well again.
Debug3.png
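
The manual recovery described above can be sketched on the RouterOS CLI roughly as follows. This is only an illustration of the steps in this post: the interface name gre-hq-wan1 is a placeholder, and the 200-second wait is an assumption chosen to exceed the 3-minute GRE connection-tracking timeout mentioned above.

```
# List tracked GRE connections and their remaining timeout
/ip firewall connection print where protocol=gre

# Disable the tunnel (placeholder name), wait longer than the GRE
# connection-tracking timeout, then re-enable it.
# The same has to be done on the router at the other end.
/interface gre disable [find name="gre-hq-wan1"]
:delay 200s
/interface gre enable [find name="gre-hq-wan1"]
```

Running the print command again before re-enabling shows whether the stale GRE entry has really expired.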
The manual workaround works every time, but I would like to understand what causes this issue. I have a large number of GRE tunnels, and every morning I have to check whether the problem is present and fix it.
I am very confused. At first I thought it could be an IPSec problem triggered by the PPPoE flap. Now I would say that during the flap something happens in the routing table, and somehow the GRE keepalive packets go out via another route and keep the connection entry active.

I would really appreciate help resolving this problem. I can provide any information or run any test if needed.
Thank you
Lorenzo
 
loca995
just joined
Topic Author
Posts: 16
Joined: Wed Sep 01, 2021 10:31 am
Location: Italy

Re: GRE over IPSec stops working when PPPoE interface flaps.

Mon Sep 27, 2021 5:48 pm

Hello,
does anyone have any suggestions?

I am able to reproduce the problem whenever I want, but I need a solution that prevents GRE from breaking.
Thank you
 
sindy
Forum Guru
Posts: 7925
Joined: Mon Dec 04, 2017 9:19 pm

Re: GRE over IPSec stops working when PPPoE interface flaps.

Mon Sep 27, 2021 6:25 pm

Would it be too painful for you to change the tunnels from GRE to IPIP?

The thing is that at least since a fix of some GRE-related vulnerability somewhere in 6.45.x, the issue you describe exists, plus only on some CPU architectures to make it even more entertaining. I've migrated all my affected tunnels to IPIP (enable ipencap, not ipip in the firewall rules!) and it's been working fine ever since. As a bonus point, you'll win a few bytes of MTU - the GRE headers are larger than the IPIP ones, but Mikrotik cannot make use of them, so you cannot set up multiple GRE tunnels between two peers, differentiated by Tunnel-ID.
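
As a rough sketch of the migration suggested above, an IPIP tunnel and the matching firewall rule might look like this. The tunnel name and the addresses (documentation ranges) are placeholders, not taken from the thread, and the rule is an assumption about a typical input chain.

```
# IPIP tunnel in place of the GRE one (placeholder name and addresses)
/interface ipip add name=ipip-hq-wan1 local-address=192.0.2.10 remote-address=198.51.100.20

# Accept the tunnel's transport packets. IP-in-IP is IP protocol 4, which
# RouterOS calls "ipencap"; the "ipip" keyword is a different protocol number.
/ip firewall filter add chain=input protocol=ipencap src-address=198.51.100.20 action=accept comment="IPIP from HQ"
```

In an existing ruleset the accept rule has to sit above any rule that would drop the traffic; add appends at the end unless place-before is given.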

I was never able to collect enough evidence to open a support case with Mikrotik because I could never reproduce it on my set of lab machines, and I'm not going to send supout.rif from production machines with PSK authentication on IPsec even though Mikrotik states the PSKs are not saved into the supout.
Don't write novels, post /export hide-sensitive file=x. Use find&replace in your favourite text editor to systematically replace all occurrences of each public IP address potentially identifying you by a distinctive pattern such as my.public.ip.1.
 
loca995
just joined
Topic Author
Posts: 16
Joined: Wed Sep 01, 2021 10:31 am
Location: Italy

Re: GRE over IPSec stops working when PPPoE interface flaps.

Tue Sep 28, 2021 12:06 pm

Hello Sindy!
Thanks for the answer.
Well, at least I know that migrating from GRE to IPIP should resolve this problem! It certainly won't be easy, because we are talking about more than 200 tunnels... But I can try on one site and see...

Anyway, it would still be interesting to resolve this issue with GRE.
sindy wrote:
"I was never able to collect enough evidence to open a support case with Mikrotik because it could never reproduce it on my set of lab machines, and I'm not going to send supout.rif from production machines with PSK authentication on IPsec even though Mikrotik states the PSKs are not saved into the supout."
I wasn't able to reproduce it in the lab either (which made me think I could have an issue with some firewall rules on the HQ?), but I have a production environment where I can reproduce it whenever I want.
It would be interesting if support could connect with a remote session and see what is going on...
Meanwhile I opened a case with Mikrotik and sent them this thread... let's see what happens.
 
sindy
Forum Guru
Posts: 7925
Joined: Mon Dec 04, 2017 9:19 pm

Re: GRE over IPSec stops working when PPPoE interface flaps.

Tue Sep 28, 2021 12:52 pm

loca995 wrote:
"Meanwhile I opened a case with Mikrotik and I sent this thread... let's see what happen"
So to contribute - if I remember right, I had this problem when CHR was at one end and RB1000AHx4 at the other one.
 
pe1chl
Forum Guru
Posts: 7796
Joined: Mon Jun 08, 2015 12:09 pm

Re: GRE over IPSec stops working when PPPoE interface flaps.

Tue Sep 28, 2021 2:01 pm

I am using such a config and I do not see any issues. Watch out for:
- Firewall errors: as sindy mentioned, there has been a bug in RouterOS for the past couple of versions where incoming GRE traffic is marked "invalid" instead of "established" or "new". When you drop invalid traffic before accepting GRE, the tunnel fails.
- NAT issues in other routers: when IPsec is sent via NAT and the session is interrupted, it can get into an unrecoverable state. This happens when your router is behind some other product like an AVM Fritzbox. It often helps to "forward" the traffic, in this case UDP ports 500 and 4500, to the router.
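
For the first point, a minimal sketch of the rule ordering, assuming a conventional input chain; the peer address is a placeholder and the exact ruleset on the affected routers is not known from the thread:

```
/ip firewall filter
# Accept GRE from the tunnel peer BEFORE the "drop invalid" rule,
# so GRE packets misclassified as invalid are not dropped
add chain=input action=accept protocol=gre src-address=198.51.100.20 comment="GRE from peer"
add chain=input action=accept connection-state=established,related
add chain=input action=drop connection-state=invalid
```

What matters is the order: if these rules are added to an existing ruleset, the GRE accept has to end up above the existing drop rule (for example with place-before).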
 
sindy
Forum Guru
Posts: 7925
Joined: Mon Dec 04, 2017 9:19 pm

Re: GRE over IPSec stops working when PPPoE interface flaps.

Tue Sep 28, 2021 2:46 pm

@pe1chl, unfortunately the PITA the OP has described exists in addition to the two you've mentioned. I've done all my homework to work around these: exemption of GRE from "drop invalid"; measures to make sure that IPsec recovers properly from an interruption/restart of a mid-path router; and a filter rule allowing forwarding of GRE packets emerging from the GRE tunnel (your favourite keepalive theme), which is only necessary with paranoid firewalls. Still, there's sometimes a problem where, after an outage or after initial configuration of the GRE tunnel, without touching anything else, you have to disable the GRE for 10 minutes at both ends and then re-enable it to make it work again. If you don't, it stays broken forever, so it's not even related to SA rekeying.
 
pe1chl
Forum Guru
Posts: 7796
Joined: Mon Jun 08, 2015 12:09 pm

Re: GRE over IPSec stops working when PPPoE interface flaps.

Tue Sep 28, 2021 3:02 pm

I only recognize that when there is another NAT router in between. The NAT router has translated the original session to a different address but the same port numbers (500 and 4500). After the outage, the NAT router thinks there is a new connection while it has not yet deleted the old one, and it decides to translate the port number because it already sees a session to the same IP/port (500). So the NAT rule translates to some random port number and the two sides never meet again.
Disabling for 10 minutes deletes this entry from the NAT table and sessions are possible again.
When I do a "port forwarding" in the NAT router this usually does not happen because the router knows to translate only the address and not the port number.

On routers that are directly on the internet, without an extra NAT router in between, I never see this problem, not even when one of them uses PPPoE.
I *do* see another issue with PPPoE sometimes: when there is a reset in the backbone network and the PPPoE is not cleanly closed and reopened, it sometimes occurs that the PPPoE interface does not get an IPv4 address. It does get the IPv6 address. As I make 2 tunnels, one GRE and one GRE6, between each location, I then see that the GRE6 works normally and the GRE does not.
I have a script to detect this problem and disable/enable the PPPoE interface, which recovers both tunnels without requiring further action on the tunnel interfaces.
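
The script itself was not posted. A hypothetical sketch of such a check, assuming a PPPoE client named pppoe-out1 and an arbitrary 5-second pause, could look like this:

```
# Hypothetical check: the PPPoE link is running but has no IPv4 address
:local iface "pppoe-out1"
:if ([/interface get [find name=$iface] running]) do={
    :if ([:len [/ip address find interface=$iface]] = 0) do={
        :log warning "$iface is up without an IPv4 address, restarting it"
        /interface disable [find name=$iface]
        :delay 5s
        /interface enable [find name=$iface]
    }
}
```

Run from the scheduler at a modest interval, this only bounces the interface when the link is up but no IPv4 address is attached to it.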
 
loca995
just joined
Topic Author
Posts: 16
Joined: Wed Sep 01, 2021 10:31 am
Location: Italy

Re: GRE over IPSec stops working when PPPoE interface flaps.

Wed Sep 29, 2021 10:53 am

I have some news:
sindy wrote:
"Would it be too painful for you to change the tunnels from GRE to IPIP?"
I tried on one site but I have the same issue! Could it mean that I have some issue on the HQ firewall?
sindy wrote:
"So to contribute - if I remember right, I had this problem when CHR was at one end and RB1000AHx4 at the other one."
I have a CCR1036 in the HQ and an RB3011 in the BO...

I read your helpful comments in the thread, and what I can say is that I always have a public IP on the PPPoE interfaces.
I also have other environments with other Mikrotik models, and there the issue never happens.
The only difference is that in the other environments the GRE/IPIP tunnel is not the default gateway in the BO: could that cause the issue?
But still, I am not able to reproduce the issue in the lab environment.

What @pe1chl said is quite interesting, but I will focus on one thing:
IPsec is in tunnel mode, with phase 2 established.
The phase 2 policy is:
- src: x.x.x.x
- dst: y.y.y.y
These two are the local and remote addresses used to set up the GRE interface.
I leave a CLI session open with a ping from x.x.x.x to y.y.y.y

CASE 1: GRE tunnel is running and enabled
1) I deliberately disable and enable the PPPoE interface
2) The ping stops working and won't work again until I disable the two GRE interfaces (one in BO and one in HQ)

CASE 2: GRE tunnel is disabled
1) I deliberately disable and enable the PPPoE interface
2) The ping stops working, but when the PPPoE comes back up the ping works correctly.

It seems to me that the invalid GRE session somehow breaks IPSec...
But still, could it be a firewall issue on the HQ?
Last edited by loca995 on Wed Sep 29, 2021 11:13 am, edited 1 time in total.
 
sindy
Forum Guru
Posts: 7925
Joined: Mon Dec 04, 2017 9:19 pm

Re: GRE over IPSec stops working when PPPoE interface flaps.

Wed Sep 29, 2021 11:11 am

loca995 wrote:
"I tried on one site but I have the same issue! Could it mean that I have some issue on the HQ firewall?
...
The only difference is that in the other enviroments the GRE-IPIP tunnel is not the default gateway in the BO: is it possible that cause the issue?"
Hopefully one of these is the reason, otherwise it would mean that the IPIP handling would be flawed too.

So please post the anonymized configuration of both devices (the BO one currently running IPIP and the HQ one), see my automatic signature below for a hint.
Don't write novels, post /export hide-sensitive file=x. Use find&replace in your favourite text editor to systematically replace all occurrences of each public IP address potentially identifying you by a distinctive pattern such as my.public.ip.1.
 
pe1chl
Forum Guru
Posts: 7796
Joined: Mon Jun 08, 2015 12:09 pm

Re: GRE over IPSec stops working when PPPoE interface flaps.

Wed Sep 29, 2021 11:29 am

loca995 wrote:
"What is seems to me, is that the GRE invalid session break the IPSec somehow..
But still, I should say it's maybe a firewall issue on the HQ?"
When it is the same issue as what I see, it is not a GRE or other tunnel issue but an IPsec issue.
Any stateful firewall (including the NAT example I gave but also the firewall you have) can cause IPsec issues that go away when the session is silenced for a while.

I even wrote a script for this I run on a router that sometimes has those issues:
/system script
add dont-require-permissions=no name=ipsecerrorhandler owner=admin policy=\
    ftp,read,write,policy,test source="# scan log buffer for ipsec error messa\
    ges\r\
    \n# when error is \"phase1 negotiation failed due to time up\", temporaril\
    y block the sender\r\
    \n# this is done to work around problems with some NAT routers\r\
    \n\r\
    \n:global lastTime;\r\
    \n\r\
    \n:local currentBuf [ :toarray [ /log find topics=ipsec,error and message~\
    \"phase1 negotiation failed due to\" ] ] ;\r\
    \n:local currentLineCount [ :len \$currentBuf ] ;\r\
    \n\r\
    \n:if (\$currentLineCount > 0) do={\r\
    \n    :local currentTime [/log get [ :pick \$currentBuf (\$currentLineCoun\
    t -1) ] time ] ;\r\
    \n    if (\$currentTime != \$lastTime) do={\r\
    \n        :set lastTime \$currentTime ;\r\
    \n        :local currentMessage [/log get [ :pick \$currentBuf (\$currentL\
    ineCount -1) ] message ] ;\r\
    \n        :if (\$currentMessage~\"due to time up\") do={\r\
    \n            :local ipaddress [:pick \$currentMessage ([:find \$currentMe\
    ssage \"<=>\" ]+3) 99 ] ;\r\
    \n            :set ipaddress [:pick \$ipaddress 0 [:find \$ipaddress \"[\"\
    \_] ];\r\
    \n            :local activepeers [ /ip ipsec active-peers find where remot\
    e-address=\$ipaddress and state=established ] ;\r\
    \n            :if ( [ :len \$activepeers ] = 0 ) do={\r\
    \n                :log info \"Temporarily blocking \$ipaddress due to erro\
    rs\" ;\r\
    \n                /ip firewall address-list add list=blocked address=\$ipa\
    ddress timeout=\"00:05:00\" ;\r\
    \n            }\r\
    \n        }\r\
    \n    }\r\
    \n}"
The script is scheduled to run every 2 minutes, and when it finds a remote that is
having problems establishing the connection it puts its address in a list and in the
firewall you should block the packets from source address in that list. After 5
minutes (you can use 10 minutes when required) it expires and the connection is
established again.
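
The scheduling and the blocking rule described here are not part of the export above. A minimal sketch, assuming the script name ipsecerrorhandler and the address list blocked from that export (the scheduler name and the rule comment are made up):

```
# Run the script every 2 minutes
/system scheduler add name=run-ipsecerrorhandler interval=2m on-event=ipsecerrorhandler

# Drop packets from peers the script has put on the "blocked" list
/ip firewall filter add chain=input src-address-list=blocked action=drop comment="temporarily blocked IPsec peers"
```

The drop rule only has an effect if it sits above the rules that accept IKE and ESP traffic, so it may need place-before in an existing ruleset.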
 
loca995
just joined
Topic Author
Posts: 16
Joined: Wed Sep 01, 2021 10:31 am
Location: Italy

Re: GRE over IPSec stops working when PPPoE interface flaps.

Wed Sep 29, 2021 2:06 pm

sindy wrote:
"So please post the anonymized configuration of both devices (the BO one currently running IPIP and the HQ one), see my automatic signature below for a hint."
I really appreciate the help, but at the moment the HQ configuration is a disaster. We are fixing it, but I think we need some weeks before we have a clean configuration. If the issue still happens after the configuration has been corrected, I will send it to you.
Anyway, I have other news:

@pe1chl
I ran some more tests with colleagues, and you are probably on the right track.
What we did was establish a new IPIP tunnel from the same BO to another Mikrotik (let's call it HQ-TEST), which is not the HQ.
That should rule out any problem caused by the default route in the BO (i.e. the IPIP tunnel itself towards HQ).
The issue still happens.

So I kept the new IPIP tunnel to HQ-TEST disabled, disabled and re-enabled the PPPoE interface, and phase 2 doesn't work.
On both the BO and HQ-TEST Mikrotiks the state of the policy is "established", but a ping from the source to the destination doesn't work.
My colleague tried decreasing the DPD timeout on phase 1 (something very strict, like 1-1), and with that setting phase 2 always comes back up after the PPPoE flap.

So I am starting to think it's something more related to IPsec... but I still have no idea what...
Which DPD configuration would you suggest?
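
For reference, this is where the DPD settings live on RouterOS versions that have the /ip ipsec profile menu. The values below are purely illustrative assumptions; they are neither the "1-1" setting mentioned above nor a recommendation from this thread.

```
# Dead peer detection on the phase 1 profile used by the peer
# (here the built-in "default" profile; values are illustrative only)
/ip ipsec profile set [find name=default] dpd-interval=10s dpd-maximum-failures=3

# Check what is currently configured
/ip ipsec profile print
```

Shorter intervals detect a dead peer sooner, at the price of more keepalive traffic and a higher chance of tearing down a tunnel during a short glitch.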

EDIT
It seems that if there is no traffic between BO and HQ-TEST when the PPPoE flaps, IPsec phase 2 keeps working.
If there is traffic between BO and HQ-TEST, like IPIP keepalives or just a ping, when the PPPoE flaps, IPsec phase 2 remains established but stops working.
 
loca995
just joined
Topic Author
Posts: 16
Joined: Wed Sep 01, 2021 10:31 am
Location: Italy

Re: GRE over IPSec stops working when PPPoE interface flaps.

Sat Oct 02, 2021 10:08 pm

sindy wrote:
"loca995 wrote: I tried on one site but I have the same issue! Could it mean that I have some issue on the HQ firewall?
...
The only difference is that in the other enviroments the GRE-IPIP tunnel is not the default gateway in the BO: is it possible that cause the issue?
Hopefully one of these is the reason, otherwise it would mean that the IPIP handling would be flawed too.

So please post the anonymized configuration of both devices (the BO one currently running IPIP and the HQ one), see my automatic signature below for a hint."
@sindy: after further tests, the behavior is exactly the one described in this topic: viewtopic.php?t=122014
I saw that you already visited it. Any suggestions about it?
 
sindy
Forum Guru
Posts: 7925
Joined: Mon Dec 04, 2017 9:19 pm

Re: GRE over IPSec stops working when PPPoE interface flaps.

Sun Oct 03, 2021 2:21 pm

As @pe1chl wrote, there may be firewall/NAT issues associated with the PPPoE flap. If there is NAT somewhere between the peers, both IKE (or IKEv2) and the transport packets use the same UDP stream, and either Mikrotik's own NATs or those on the ISP's devices may behave in an unexpected way when the address and port at one end changes whilst the other end keeps remembering the old ones.

With multiple WANs, there's one more complication - I had cases where the SAs chose a wrong peer locally after restart or path glitch, so the remote peer was ignoring the transport packets as they came in via a wrong SA.

To suggest some analysis steps, the anonymized configuration is necessary. Maybe start from posting the one from the BO side running the IPIP tunnel (which is not a "disaster" like the HQ one), and stating whether the WANs get the same IP each time the PPPoE connects and whether that IP is a public one or not.
 
loca995
just joined
Topic Author
Posts: 16
Joined: Wed Sep 01, 2021 10:31 am
Location: Italy

Re: GRE over IPSec stops working when PPPoE interface flaps.

Mon Oct 04, 2021 6:43 pm

sindy wrote:
"As @pe1chl wrote, there may be firewall/NAT issues associated with the PPPoE flap. If there is NAT somewhere between the peers, both IKE (or IKEv2) and the transport packets use the same UDP stream, and either Mikrotik's own NATs or those on the ISP's devices may behave in an unexpected way when the address and port at one end changes whilst the other end keeps remembering the old ones.

With multiple WANs, there's one more complication - I had cases where the SAs chose a wrong peer locally after restart or path glitch, so the remote peer was ignoring the transport packets as they came in via a wrong SA.

To suggest some analysis steps, the anonymized configuration is necessary. Maybe start from posting the one from the BO side running the IPIP tunnel (which is not a "disaster" like the HQ one), and stating whether the WANs get the same IP each time the PPPoE connects and whether that IP is a public one or not."
Hello Sindy,
at the moment I have two test WANs in the datacenter, so I took an RB2011 for tests, put a simple configuration on it, and used it to simulate the HQ (TEST-HQ).
Then I connected it to the production BO with a single IPsec tunnel.
If I disable the PPPoE on the TEST-HQ site without any tunnel (IPIP, GRE or anything else) configured, the flap doesn't cause any problem with the IPsec phase 2 status.
If I configure an IPIP or even a GRE tunnel between the TEST-HQ and the BO and then disable and enable the PPPoE on the TEST-HQ site, IPsec phase 2 stops working.
I attached the configuration of the TEST-HQ, which is pretty clear and simple compared to the production HQ.
Could you please take a look and tell me if there is something wrong?

I am working on the BO configuration, but it won't be ready for a few days.
 
loca995
just joined
Topic Author
Posts: 16
Joined: Wed Sep 01, 2021 10:31 am
Location: Italy

Re: GRE over IPSec stops working when PPPoE interface flaps.

Tue Oct 12, 2021 3:43 pm

Hello,
any news?

Meanwhile I built a test environment with one RB2011 and one RB3011, each with its own internet connection from the same provider.
One WAN connection on the RB2011, made with a PPPoE interface.
One WAN connection on the RB3011, made with a PPPoE interface.

If I disable and re-enable the PPPoE interface on the RB2011, IPSec works properly.
If I disable and re-enable the PPPoE interface on the RB3011, IPSec stops and I need to reset it manually.

I attached the two configurations. They are very simple; could you please take a look?

The only difference between the two setups is:
on the RB2011 the internet connection is made over a wireless link, while on the RB3011 it is made with an FTTC modem in bridge mode. But for the routerboards there should not be any difference; it should be transparent...
Last edited by loca995 on Fri Oct 15, 2021 2:00 pm, edited 2 times in total.
 
markmcn
Member Candidate
Posts: 103
Joined: Wed Mar 03, 2010 2:15 am

Re: GRE over IPSec stops working when PPPoE interface flaps.

Thu Oct 14, 2021 1:11 am

Not helpful, but you're not alone in having IPSec issues with PPPoE flaps.
I'm seeing IPSec break on a RB1100AH4 every time PPPoE flaps and the installed SAs clear. After this, IPsec is broken until a reboot, or until you hold the PPPoE down for like 3~5 min. I've a case with Tik Support going on about it. It might be worth reaching out to support as well.
If I make any progress I'll share here
Cheers
Mark
 
loca995
just joined
Topic Author
Posts: 16
Joined: Wed Sep 01, 2021 10:31 am
Location: Italy

Re: GRE over IPSec stops working when PPPoE interface flaps.

Thu Oct 14, 2021 2:01 pm

markmcn wrote:
"If I make any progress I'll share here"
I'll wait for some news, I am quite desperate at the moment
markmcn wrote:
"installed SA's clear following this ipsec is broken until reboot, or if you hold the PPPoE down for like 3~5 min"
And if you try to disable and enable the phase2 policy on the mikrotik, what happens?
 
markmcn
Member Candidate
Posts: 103
Joined: Wed Mar 03, 2010 2:15 am

Re: GRE over IPSec stops working when PPPoE interface flaps.

Thu Oct 14, 2021 3:52 pm

I've had mixed success with going through and disabling all elements of the IPsec config (peer, policy etc.) and leaving them disabled for a while, but this doesn't always work.
The only sure-fire ways that work for me are holding PPPoE down for 3-5 min once the issue presents, or just rebooting the device.
This is a real pain and I hope they can get it fixed soon. What's really strange to me is that it only impacts certain devices: I'm seeing it on a RB1100AH4, but I've a pair of these side by side and only one is having the issue, even after a full factory reset with only a basic config applied.
 
pe1chl
Forum Guru
Posts: 7796
Joined: Mon Jun 08, 2015 12:09 pm

Re: GRE over IPSec stops working when PPPoE interface flaps.

Thu Oct 14, 2021 4:26 pm

Are these devices all connected to internet the same way? I.e. is it always a plain PPPoE connection to an ISP that offers a real and static external IP, not using cg-nat?
I ask that because the behavior may very well depend on NAT occurring external to the MikroTik routers, or depend on the IP changing after the PPPoE interface flap.
That would explain why certain people see it and others don't, or why you see it on some router and not on another.
 
markmcn
Member Candidate
Posts: 103
Joined: Wed Mar 03, 2010 2:15 am

Re: GRE over IPSec stops working when PPPoE interface flaps.

Thu Oct 14, 2021 6:56 pm

Hey pe1chl,
In my case both devices have public IPs landing directly on the device and there is no CG-NAT in between them, I'm 100% sure of this.
The two 1100AH4s (including the one that's doing it) actually both plug into the same VDSL modem in bridge mode and just have different PPPoE credentials to get different IPs, but always the same IPs are assigned.
I had a similar case with Mikrotik maybe a year or more back and it got resolved, but that was on a 2011 device. It's really like the IPSec daemon just gets stuck in some sort of bad state, either holding onto a stale connection or something. Given we can't see the innards of RouterOS it's hard to tell. I've uploaded support outputs and logs from before and after the PPPoE bounce, so I'm hoping Tik support can find something.
Cheers
Mark
 
sindy
Forum Guru
Posts: 7925
Joined: Mon Dec 04, 2017 9:19 pm

Re: GRE over IPSec stops working when PPPoE interface flaps.

Thu Oct 14, 2021 9:42 pm

You forgot to obfuscate HQ.rsc, maybe you want to withdraw it and post it again once anonymized?
 
loca995
just joined
Topic Author
Posts: 16
Joined: Wed Sep 01, 2021 10:31 am
Location: Italy

Re: GRE over IPSec stops working when PPPoE interface flaps.

Fri Oct 15, 2021 2:06 pm

markmcn wrote:
"The only sure fire ways that work for me are holding PPPoE down for 3-5 min once the issue presents or just rebooting the device."
Did you check in the /ip firewall connection if the ipsec-esp(50) has a flag for dst-nat? If yes, you should avoid it
pe1chl wrote:
"Are these devices all connected to internet the same way? I.e. is it always a plain PPPoE connection to an ISP that offers a real and static external IP, not using cg-nat?"
Both devices are connected to internet the same way, with a real and static external IP.
sindy wrote:
"You forgot to obfuscate HQ.rsc, maybe you want to withdraw it and post it again once anonymized?"
Omg, I am so dumb, sorry I posted the wrong file. I re-posted it anonymized
 
sindy
Forum Guru
Posts: 7925
Joined: Mon Dec 04, 2017 9:19 pm

Re: GRE over IPSec stops working when PPPoE interface flaps.

Fri Oct 15, 2021 11:16 pm

Looking at your minimalistic configurations, I can currently imagine only the following things:
  • something weird regarding handling packets that cannot be routed anywhere in RouterOS kernel - the only routes the packets between 10.199.199.0 and 10.199.199.1 can take are the default ones, and the only default routes have got the PPPoE client interfaces as gateways, so if a PPPoE goes down, the default route becomes inactive. To get intercepted by an IPsec policy, a packet must first get routed somewhere by the regular routing, hence while the PPPoE is down, the GRE or IPENCAP transport packets cannot be routed anywhere. I have no clear idea what exactly it should break, though - normally, the packets should get routed as soon as the PPPoE interface goes up.
  • the IPsec policy goes mad as it loses the sa-src-address while the PPPoE interface is down - but again, since the issue only strikes when the PPPoE glitches on one of the devices, it doesn't seem likely.
  • some strange behaviour of your ISP while getting packets for an IP address which is currently not reachable via the PPPoE tunnel. The fact that both public IPs involved in this test setup come from the same prefix/range may or may not be relevant - is it the same case for the "real" tunnels?
In any case, it must be related to the fact that a payload packet that would normally get intercepted by an IPsec policy and sent encrypted and ESP-encapsulated to the destination causes something to go wrong if it is processed while the PPPoE is down.

What I would suggest to do now as you've put together this pleasantly simple testbed would be to sniff on the physical interfaces carrying the PPPoE client ones at both routers while the issue exists, pinging inside the tunnel from both ends, to see whether PPPoE frames carrying the ESP packets carrying the ICMP ones are coming and leaving. The less other traffic you'll have there the better as you don't want the other traffic squeeze the interesting one out of the rolling buffer; if this is a problem, you'll have to connect PCs with Wireshark or tcpdump and use the "streaming" capability of the /tool sniffer, allowing to send copies of the sniffed frames to an IP address, encapsulated into TZSP.

By comparing the results, you should be able to find out quite quickly whether the issue is inside or outside the Mikrotiks.
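
A minimal sketch of the streaming capture suggested above, assuming the PPPoE client runs over ether1 and a PC with Wireshark sits at 192.0.2.50 (both are placeholders):

```
# Sniff the physical interface under the PPPoE client and stream
# copies of the frames to the PC, encapsulated in TZSP
/tool sniffer set filter-interface=ether1 streaming-enabled=yes streaming-server=192.0.2.50 filter-stream=yes
/tool sniffer start

# ... reproduce the issue and ping inside the tunnel from both ends ...

/tool sniffer stop
```

On the PC, TZSP normally arrives on UDP port 37008, so a capture filter such as udp port 37008 keeps the capture limited to the streamed frames.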
 
loca995
just joined
Topic Author
Posts: 16
Joined: Wed Sep 01, 2021 10:31 am
Location: Italy

Re: GRE over IPSec stops working when PPPoE interface flaps.

Sun Oct 17, 2021 7:24 pm

sindy wrote:
"The fact that both public IPs involved in this test setup come from the same prefix/range may or may not be relevant - is it the same case for the "real" tunnels?"
In the test case we have both IPs from the same provider, but for the real tunnels where I noticed the issue, other providers are involved, so other prefixes.

I understood your whole explanation, but honestly I am less skilled on the subject than you, so I am going to follow your suggestion.
Just one question, forgive my ignorance on the topic/tools:
If I start a tool sniffer session on the physical interface, I get a .pcap file containing the PPP frames. Using Wireshark I am able to see something like this:
pppoe-pcap.PNG
What should I do to see which PPPoE frames contain ESP packets, so that I can tell whether the packets are coming and leaving?
Edit-Solved
I am dumb, part 2: it was enough to change the profile associated with the PPPoE client from default-encryption to default.
I'll let you know asap.
 
sindy
Forum Guru
Posts: 7925
Joined: Mon Dec 04, 2017 9:19 pm

Re: GRE over IPSec stops working when PPPoE interface flaps.

Sun Oct 17, 2021 8:06 pm

Oops. I haven't encountered an ISP that uses compression on PPPoE yet, so I've never noticed that Wireshark doesn't inflate the payload. The only information I could find is that this was already the case 8 years ago.

Try to disable compression on the /ppp profile row used by the PPPoE client (by setting use-compression to no) - if the ISP accepts that, Wireshark will show you the payload. If it doesn't, you'll have to test using bandwidth-test with a manually specified test packet size and random data as payload instead of ping, so that the ESP packets can be easily identified by size, as it should be impossible to actually compress the random data. But it will be a trial-and-error process.
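
A sketch of the profile change, assuming a PPPoE client named pppoe-out1; the profile name pppoe-plain is made up for the example:

```
# A PPP profile with compression and encryption switched off, so that
# the payload stays readable in a capture
/ppp profile add name=pppoe-plain use-compression=no use-encryption=no

# Point the PPPoE client at it
/interface pppoe-client set [find name="pppoe-out1"] profile=pppoe-plain
```

Whether the session still comes up with these settings depends on what the ISP's access concentrator accepts.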
Don't write novels, post /export hide-sensitive file=x. Use find&replace in your favourite text editor to systematically replace all occurrences of each public IP address potentially identifying you by a distinctive pattern such as my.public.ip.1.
 
loca995
just joined
Topic Author
Posts: 16
Joined: Wed Sep 01, 2021 10:31 am
Location: Italy

Re: GRE over IPSec stops working when PPPoE interface flaps.

Tue Oct 19, 2021 4:05 pm

Oops. I haven't encountered an ISP that uses compression on PPPoE yet, so I've never noticed that Wireshark doesn't inflate the payload. The only information I could find is that this was already the case 8 years ago.

Try to disable compression on the /ppp profile row used by the PPPoE client (by setting use-compression to no) - if the ISP accepts that, Wireshark will show you the payload. If it doesn't, you'll have to test using bandwidth-test with a manually specified test packet size and random data as payload instead of ping, so that the ESP packets can be easily identified by size, as it should be impossible to actually compress random data. But it will be a trial-and-error process.
Ok, so here I am with the results of the test.
Here is a small summary of the configuration.
HQ - RB2011 - public IP: hq.test.it
BO - RB3011 - public IP: bo.test.it

Here are the SPIs:
13767fb: src.address: hq.test.it; dst.address: bo.test.it
e1ded53: src.address: bo.test.it; dst.address: hq.test.it
SPI-IPSEC.png
CASE 1: ping from HQ to BO. I disable and re-enable the PPPoE interface on BO to see whether the ESP packets arrive at the BO MikroTik. The packet capture is running on BO.
Result: the ESP packets arrive but are not processed; there's no reply.
BO-PACKET-CAPTURE.png
CASE 2: ping from BO to HQ. I disable and re-enable the PPPoE interface on BO to see whether the ESP packets leave the BO MikroTik. The packet capture is running on HQ.
Result: the ESP packets don't arrive at all.
HQ-Packet-Capture.png
It seems that the 13767fb SPI still works, while the e1ded53 one doesn't work anymore.

I also noticed another thing.
If I leave the IPSec tunnel without traffic for 10 minutes (which is the connection tracking timeout for IPsec ESP, protocol 50, on MikroTik), the tunnel starts working again.

Any idea?
 
sindy
Forum Guru
Posts: 7925
Joined: Mon Dec 04, 2017 9:19 pm

Re: GRE over IPSec stops working when PPPoE interface flaps.

Tue Oct 19, 2021 4:47 pm

Any idea?
At both ends, use /ip firewall connection print detail where protocol~"esp" before and after doing the pppoe cycle, look for differences between "before" and "after" state.
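One possible way to make the comparison easier is to save both states to files and diff them offline. This is only a sketch: the file names are arbitrary, and it assumes the print command's file= parameter behaves in this menu as it does elsewhere in RouterOS.

```routeros
# Before cycling the PPPoE interface
/ip firewall connection print detail file=esp-before where protocol~"esp"
# ... disable and re-enable the PPPoE client, then:
/ip firewall connection print detail file=esp-after where protocol~"esp"
```

The two text files can then be downloaded from Files and compared in any text editor.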
 
pe1chl
Forum Guru
Posts: 7796
Joined: Mon Jun 08, 2015 12:09 pm

Re: GRE over IPSec stops working when PPPoE interface flaps.

Tue Oct 19, 2021 5:00 pm

Also it is possible to set an "on up" script in the PPP profile used for the PPPoE connection (probably best to copy it from default and assign an appropriate name, then set that in the PPPoE connection).
In this script you can do things like removing all tracking entries related to the connection.
I have done that in the past to remove entries related to my SIP phone, which also caused trouble in some cases.
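A rough sketch of such a profile, with the assumptions called out: the profile name pppoe-flush and the interface name pppoe-out1 are placeholders, and the script removes every tracked ESP and GRE connection rather than only those belonging to one peer.

```routeros
# Copy the default profile and attach an on-up script that flushes
# tracked ESP and GRE connections (the profile name is a placeholder)
/ppp profile add copy-from=default name=pppoe-flush on-up="/ip firewall connection remove [find where protocol~\"esp\"]; /ip firewall connection remove [find where protocol~\"gre\"]"
# Use the new profile on the PPPoE client (the interface name is a placeholder)
/interface pppoe-client set [find name=pppoe-out1] profile=pppoe-flush
```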
 
sindy
Forum Guru
Posts: 7925
Joined: Mon Dec 04, 2017 9:19 pm

Re: GRE over IPSec stops working when PPPoE interface flaps.

Tue Oct 19, 2021 5:10 pm

I have done that in the past to remove entries related to my SIP phone, which also caused trouble in some cases.
Well... my personal preference is analysis first, workarounds next :) So far we only suspect that the issue has something to do with connection tracking, although no NAT is involved. If there is no difference between the "before" and "after" states, it may be an internal issue of the firewall, and removing the tracked ESP connection may therefore help. But it may just as well be a coincidence that the SA rekeyed after 10 minutes, and the actual issue may have been too large a gap in ESP sequence numbers. /ip ipsec statistics print should help here - if, after the glitch, some counter (in-state-sequence-errors?) keeps growing there, it's an IPsec behaviour (not necessarily a wrong one), otherwise it's more likely a firewall issue. I lean towards the firewall explanation, as a few lost ESP packets should be silently ignored, but who knows.
 
loca995
just joined
Topic Author
Posts: 16
Joined: Wed Sep 01, 2021 10:31 am
Location: Italy

Re: GRE over IPSec stops working when PPPoE interface flaps.

Tue Oct 19, 2021 6:10 pm

At both ends, use /ip firewall connection print detail where protocol~"esp" before and after doing the pppoe cycle, look for differences between "before" and "after" state.
Here are the screenshots. Apparently there are no differences.
HQ-IP-FIREWALL-PRINT.png
BO-IP-FIREWALL-PRINT.png
Also it is possible to set an "on up" script in the PPP profile used for the PPPoE connection (probably best to copy it from default and assign an appropriate name, then set that in the PPPoE connection).
In this script you can do things like removing all tracking entries related to the connection.
I have done that in the past to remove entries related to my SIP phone, which also caused trouble in some cases.
I used to do that too, but I want to keep it as a last resort because scripts sometimes don't work as expected.
/ip ipsec statistics print should help here - if, after the glitch, some counter (in-state-sequence-errors?) keeps growing there, it's an IPsec behaviour (not necessarily a wrong one), otherwise it's more likely a firewall issue.
After the cycle, the IPSec counters don't increase at all, they remain the same.
 
loca995
just joined
Topic Author
Posts: 16
Joined: Wed Sep 01, 2021 10:31 am
Location: Italy

Re: GRE over IPSec stops working when PPPoE interface flaps.

Mon Oct 25, 2021 9:48 am

Hello @sindy, @pe1chl. Any other suggestions?
 
sindy
Forum Guru
Posts: 7925
Joined: Mon Dec 04, 2017 9:19 pm

Re: GRE over IPSec stops working when PPPoE interface flaps.

Mon Oct 25, 2021 11:18 pm

The only thing that comes to mind is to avoid tracking of ESP connections, in the hope that the behaviour is caused by a bug in connection tracking. In particular:

/ip firewall raw
add chain=prerouting protocol=ipsec-esp src-address-list=allowed-ipsec-peers action=notrack
add chain=output protocol=ipsec-esp action=notrack


Add untracked to the connection-state list in the first rule of the input chain in filter (so that it says action=accept connection-state=established,related,untracked), and either add the addresses of the peers to the address list allowed-ipsec-peers, or omit the src-address-list match from the rule in raw completely.
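As a sketch of those two supporting steps: the address 192.0.2.1 is a documentation placeholder rather than a real peer, and the filter line only shows the shape the accept rule should end up with.

```routeros
# Placeholder peer address (documentation range); replace with the real peer IPs
/ip firewall address-list add list=allowed-ipsec-peers address=192.0.2.1
# Shape of the accept rule once "untracked" has been added.
# If an equivalent rule already exists, edit it instead of adding a new one:
# "add" appends at the end of the chain, possibly after a drop rule.
/ip firewall filter add chain=input action=accept connection-state=established,related,untracked
```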

If that helps, it may be possible to slightly improve security by populating that address-list dynamically.