So guys, I've been struggling for quite some time with my IPsec tunnels.
The status is established, but no traffic passes; I must disable/enable the peer to get it working again.
So, what is the first step, what should I do?
What have I tried?
I have several Mikrotik - Mikrotik tunnels, and I changed one to Mikrotik - pfSense, and it seems to behave the same.
HQ has a MikroTik and the remote locations do too, but those locations are connected through OpenVPN to the pfSense. Just so you have a picture in your head of what is connected where and what those tunnels are for.
I can also set up two MikroTiks with a tunnel between them, one in HQ and another (instead of the pfSense) in the other HQ, but the results are the same: at some random time there is a problem connecting to the remote shops - no ping, can't connect to the routers, etc.
So I just disable/enable the peer.
I can't set up ping scripts, because the local subnets in HQ are on the core switches.
A blind shot here would be a mismatch of PFS settings, causing the first rekeying of the SAs to fail (so the tunnel would work for just about 25 minutes after establishing if default lifetime=30m is set in /ip ipsec proposal).
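A minimal sketch of checking this on both peers (the proposal name "default" is just an assumption; use whatever proposal your policies actually reference):

```
# list the proposals with their PFS groups and lifetimes - compare on BOTH peers
/ip ipsec proposal print

# make the values match on both ends, e.g. explicitly disable PFS
# ("default" is an assumed proposal name)
/ip ipsec proposal set default pfs-group=none lifetime=30m
```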
Another blind shot would be that pinholes in some external firewall on the path between the two devices expire as no traffic passes through them for an extended period of time, but this would only be relevant if both IPsec peers are running on public IPs so the NAT traversal mechanism is not activated - keepalives are part of that mechanism.
For starters, can you post the configuration exports of the two Mikrotiks which suffer from this issue? Follow the hint in my automatic signature below to remove the sensitive information without breaking the internal logic of the configuration, i.e. substitute eventual public IPs and eventual domain names by the same strings in both exports.
peer84 is MikroTik hq1 - pfSense - most of the problems come from this peer,
and peer13 is MikroTik hq1 - MikroTik hq2. This peer is used just for some subnets, but originally there were tunnels only on this peer, and they didn't work either.
Hey Tomislav91,
If you successfully establish both VPN tunnels but still experience connectivity issues, then: Check for network ACLs in your VPC that prevent the attached VPN from establishing a connection. Verify that the security group rules assigned to the EC2 instances in your VPC allow appropriate access.
Regards,
Mika Hawkins
I’ve only mentioned the lifetime as a troubleshooting hint - if the connections break at 80-100 % of the configured SA lifetime after the connection establishes, it makes sense to look at the PFS settings, as the first rekeying takes place at this time, and at that point a mismatch of PFS settings causes the SA to become unusable.
If the pfSense can also trigger an SA rekey on the number of bytes transported (which RouterOS cannot), it may happen even sooner.
But as you’ve got pfs-group=none on all /ip ipsec proposal rows in the configuration of the single Mikrotik you’ve chosen to post, I assume that it’s the same at the other Mikrotik(s) and therefore at least at the links between two Mikrotiks, this is not the reason why they fail.
OK. That means that the NAT traversal mechanism is not in use (you have even forbidden its use in the /ip ipsec profile settings), and this practically means the following:
whereas the control packets (IKE) are sent as UDP packets on port 500, the transport packets of the SAs are bare ESP packets
as no pinholes (tracked connections) are expected to need to be forcefully kept open, no keepalive packets are sent
As a consequence, if no traffic is sent using the SA, no ESP packets are sent. So if there is a firewall somewhere between the devices or on one of them which closes the pinhole if no traffic passes through it for some time, the pinhole closes. Since the traffic works for a while after connection re-establishment, ESP traffic in one direction must be capable of opening the pinholes on all the path, otherwise it would not work at all. But if the pinhole is closed and the request in the payload goes in the “wrong” direction, the ESP packet carrying it never makes it to the destination, so no response to that request is ever generated, and thus the pinhole never can be re-created (until eventually a request in the payload would be sent from the “proper” side).
The firewall at the router whose configuration you have posted is complex and leaky; it seems you have missed the fact that the default handling in all chains is “accept”, i.e. packets which do not match any rule are accepted. It’s best seen in chain=output, where you have two action=accept rules but no action=drop one, which just burns CPU on every packet sent by the router and has no actual effect - everything gets through anyway.
In the context of the above, this means that SA traffic initiated from this router is allowed to be sent (it is accepted in chain=output), creates a tracked connection (pinhole) as there is a couple of rules which refer to connection-state so connection tracking is activated, and therefore the packets in the opposite direction of the SA are accepted by the “accept established,related” rule in chain=input, so even if you added a “drop the rest” rule to the end of chain=input, the firewall at this router would still accept them. As there is no “drop the rest” rule in chain=input, nor any rule selectively dropping the ESP packets from the remote peer, this router’s firewall doesn’t explain why the IPsec connections break.
But it may not be the case on the other routers. That’s why I’ve asked you for a configuration of both Mikrotiks between which you encounter the issue, not just a single one.
The firewall of the other router is the same; it just has more src-nat or dst-nat rules there, so the rules which are interesting for us are the same. But at this point, the tunnel is established through the pfSense, and there are not many rules on the WAN side except one that we are using for internal purposes.
How can I improve the firewall on the MikroTik side to tell me more about this problem? I am quite annoyed with this issue, and whether the tunnel is MikroTik - MikroTik or MikroTik - pfSense, there is still an issue connecting to those peers. We are speaking about 350+ peers. Two subnets: 10.10 is full and 10.11 is at half, so you understand the math there. There is also a problem with 192.168.200, and with all subnets from the HQ side: 192.168.50, 192.168.23, 192.168.30 and so on.
Just to understand better: under the MikroTik is an L3 core switch, and lower are L2 switches in the branches.
What tunnel are you talking about above? You’ve said that the issue exists on multiple ones or even on all, so why do you now concentrate on one which uses a pfSense?
action=log firewall rules won’t help you much in diagnosing an issue which occurs only randomly. I’ve mentioned the firewall because I started looking at it just to see whether it might cause the outages after a long period of bi-directional silence, and found that the one you’ve posted cannot be blamed for this. But I have also found that it is leaky and allows access to the router from outside (you have created a complex state automaton, which I haven’t understood completely, to protect ftp, ssh, telnet, tftp etc., but it doesn’t deal with the API service and maybe other ones, and none of the services is disabled), and it is also slightly inefficient, as you have several selective action=accept rules before the “accept established&related” one (so every single packet has to be checked against those rules).
If you can pinpoint a peer where the issue occurs regularly, I would sniff on the public address of that peer as well as the LAN addresses of it on the CCR1009, so that both the transport packets and the payload packets would be sniffed, and do the same at the peer Mikrotik itself. The purpose is to find out whether there is a gap in the payload traffic or some rekey failure or whether the SA just stops working without a visible reason, so due to a software bug or CPU overload on one of the devices.
You can either sniff into a file on the router’s flashdisk itself (but the volume of data until the issue occurs may be too high to fit on the disk, so an external flash drive connected via USB is a better option), or you can connect an external PC with tcpdump or Wireshark’s dumpcap and use mangle rules with action=sniff-tzsp to copy the packets of interest there.
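A sketch of such mangle rules, assuming the sniffer PC sits at 192.168.50.200 and the remote shop subnet is 10.10.5.0/24 (both example values); 37008 is the conventional TZSP port:

```
# copy packets to/from the remote shop subnet to the PC running
# Wireshark/dumpcap - both directions, so the .pcap shows the full exchange
/ip firewall mangle add chain=prerouting src-address=10.10.5.0/24 \
    action=sniff-tzsp sniff-target=192.168.50.200 sniff-target-port=37008
/ip firewall mangle add chain=postrouting dst-address=10.10.5.0/24 \
    action=sniff-tzsp sniff-target=192.168.50.200 sniff-target-port=37008
```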
Sending IPsec logs into a file on the peer in the branch office would also be helpful - or also to the external PC running tcpdump using action=syslog in /system logging. The IPsec logs only log the IKE/IKEv2 processing, not the actual transport of encrypted data, but they are still very verbose and there’s no way to restrict the logging to a particular peer. So on the CCR, it has to be done using syslog or to an external USB disk.
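As a sketch, assuming the USB drive shows up as disk1 and the syslog PC is at 192.168.50.200 (both assumptions):

```
# IPsec logs (without per-packet detail) to a file on the external disk
/system logging action add name=usb target=disk disk-file-name=disk1/ipsec
/system logging add topics=ipsec,!packet action=usb

# or send them to a remote syslog collector / tcpdump machine instead
/system logging action add name=rsyslog target=remote remote=192.168.50.200
/system logging add topics=ipsec,!packet action=rsyslog
```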
What’s the CPU load (/tool profile)? The CCR1009-8G-1S should support hardware-accelerated encryption according to the IPsec manual - does it really, i.e. is the H flag shown in /ip ipsec installed-sa print?
What do you mean by “lower are L2”? Are you using the CRS line at the branches, meaning their CPUs are of the SOHO grade, like mipsbe at 400 MHz? That should not matter much if there is just a single tunnel on each of these devices and the traffic volume through the tunnel is in the range of few Mbps. But sure, /tool profile on these should also be looked at.
On what hardware do you run the pfSense?
I would sniff on the public address of that peer as well as the LAN addresses of it on the CCR1009
So if the public IP of the MikroTik 1009 “lives” on ether1, should I add ether1 to the sniffer? You say LAN addresses, but those VLANs are not on the CCR, they are on the Cisco core switch. So maybe I can add the VRRP_LAN interface? And where is the option to write the sniff to USB? I tried to stream to a server following this tutorial https://www.wizzycom.net/traffic-capture-from-a-mikrotik-device-to-wireshark/ but nothing happened in Wireshark - maybe because when I try to ping my IP 192.168.50.13, there is no ping to this IP?
I noticed that the sniffer burns my CPU, so leaving it running for a couple of days until the problem appears is maybe not a good idea, I am not sure. Maybe the best idea is to put it on a 64 GB USB stick, let it run, and tell support to notify me when the problem appears - I just can't find the option to write to external media.
What do you mean by “lower are L2”
So this is our infrastructure:
Two HQs.
One has a MikroTik, and it is from our local IPs there that support “goes out” and solves problems at the remote shops. Below that MikroTik are switches: the main L3 one (more than one, but that's only for HSRP, not relevant to this problem) and L2 floor switches which have VLANs tagged to the offices. One office is the most important - it is the office where support works.
So this MikroTik has a tunnel to the pfSense in the other HQ, and on that pfSense OpenVPN servers are created to which the remote shops connect (they also use MikroTiks, but smaller ones, not CCRs). And this is how we communicate with those shops.
Can you help me with the commands to get those logs working? Is it OK to set it up like this?
I'm thinking a phone conversation might be much faster…
So if I get you right, there is not an IPsec tunnel to each remote shop, but one fat pipe to the pfSense machine which concentrates those 350+ OpenVPN tunnels from the remote shops? I was wondering where the 350 IPsec identities went missing from the configuration export. So when you mentioned “lower are L2”, I thought you were talking about lower-grade Mikrotiks in the remote shops.
I did mean using the IP addresses, not the interface names, as the sniffing filter. The thing is that you need both the payload packets (between the local LAN subnet and the remote LAN subnet) and the IPsec transport packets carrying them in the same .pcap to be able to draw conclusions. So on the HQ machine, you have to filter on the LAN subnet of the remote machine and on the public IP of the remote machine, and vice versa on the remote shop machine (where you need to sniff on any remote address to/from which traffic may go via the IPsec tunnel). But as there are not 350 small remote machines (one per remote shop) but just the single big one running towards the pfSense, it looks like a mission impossible, as you would have to sniff almost all the traffic of the CCR1009.
as for Wireshark not receiving the TZSP packets, it is more likely that a firewall/antivirus software blocks the TZSP packets as they are unexpected (there is no related connection initiated by the PC). I had a case where I had to disable AVG, allowing the TZSP destination port in the Windows firewall wasn’t sufficient.
Sniffing to a USB disk would be fine if it was fast enough - as the CCR1009-8G-1S uses the mini AB connector, it has only USB 2.0 and hence a 480 Mbit/s raw bitrate, which may not be enough depending on the traffic volume to and from the peer (and each packet appears at least twice in the sniff; if the packets pass through a bridge, then even more times: once from the physical interface and once from the bridge). As you say that the CPU is at 20 % during normal operation (no sniffing), I’m afraid the throughput of the USB won’t be sufficient unless the traffic to the remote shops is only a fraction of the total (i.e. unless most of the traffic is inter-VLAN routing at the HQ itself).
Yes, this is exactly what I had in mind - first that, and then /system logging add topics=ipsec,!packet action=usb.
But when defining the logging action, it is probably better to allow more lines per file and more files to rotate than the defaults, as we don’t know in advance how big the log will become.
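A sketch of raising those limits, assuming the logging action is named usb (the sizes are arbitrary):

```
# allow bigger files and more rotations than the defaults,
# since we don't know in advance how big the log will become
/system logging action set usb disk-lines-per-file=10000 disk-file-count=20
```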
I'm thinking a phone conversation might be much faster…
We can do it that way too, if it's easier, and then post the solution here afterwards.
So if I get you right, there is not an IPsec tunnel to each remote shop, but one fat pipe to the pfSense machine which concentrates those 350+ OpenVPN tunnels from the remote shops? I was wondering where the 350 IPsec identities went missing from the configuration export. So when you mentioned “lower are L2”, I thought you were talking about lower-grade Mikrotiks in the remote shops.
Firstly, we did do it like that, but this is a better solution for us: to have one IPsec tunnel to the pfSense and have that pfSense run the OpenVPN servers for those shops. This is not a big deal really, but it is annoying, because I can't figure out where the breaking point is.
Maybe the best start is to switch on logging of the IPsec and to run a netwatch pinging through the tunnel which will log failures (on-down={:log warning message=“ping through tunnel down”}) to see in the logs whether the issue is correlated with a rekey or not.
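A sketch of such a netwatch entry, with 10.10.5.1 standing in for some host reachable only through the tunnel (an assumed address):

```
# ping a host behind the tunnel every 30 s and log when it stops replying
/tool netwatch add host=10.10.5.1 interval=30s timeout=2s \
    down-script=":log warning message=\"ping through tunnel down\""
```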
If it turns out not to be correlated, I’d say the only way ahead is to make the Wireshark work (better using mangle rules with action=sniff-tzsp to send just a small amount of traffic to it first), and once that’s done, add more and more traffic until we either have it all or the CPU usage reaches, say, 60 % - after that I think it isn’t safe to continue.
Another way to determine the traffic volume in advance is to run /ip ipsec installed-sa print interval=10s where spi=0xsomething for a while and get bytes per second as a 1/10 of the average of the differences between the current-bytes figures - provided that the traffic is more or less even all the time. This is the transport traffic in one direction, and if sniffing also the payload using the mangle rules, we avoid multiplicated packets in case of bridging, so the bandwidth of the sniff will be just double the bandwidth of the IPsec transport packets.
In general, I know there was an issue in Mikrotik’s IPsec interworking with Strongswan (but does the pfSense use Strongswan?), I can see cases where the connection fails for some minutes and then starts working again, but I can’t tell you right now which Strongswan version I’m running there, and I’m using IKEv2 at that link. Until 6.43.something, even IKEv2 connections between two Mikrotiks were showing similar symptoms, caused by an error in the rekey procedure - the next automatic rekey was fixing it. BTW, the rekey procedure differs in IKE(v1) and IKEv2, so maybe in your case, switching from IKE/aggressive to IKEv2 could help.
How can you forward traffic using IPsec if the Mikrotik isn’t configured as a gateway, i.e. if it doesn’t have an IP address in the sender’s subnet? The 'Tik must first receive the packet in order to match it to a policy and send it via an SA… what am I missing here?
sure, you have to replace “something” with the actual SPI value - they are dynamic so you have to look for one first. You can even use spi~“0x(value1|value2)” to watch multiple SAs at a time (it’s a normal regexp applied on the parameter value converted to a string)
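For example (both SPI values below are placeholders you would first read from the SA list):

```
# find the mature SAs and note their SPIs, one per direction
/ip ipsec installed-sa print

# then watch both directions of the tunnel at once
/ip ipsec installed-sa print interval=10s where spi~"0x(0abc1234|0def5678)"
```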
So I will first get the SPI value where src is the pfSense and dst is the Mikrotik, and then vice versa - two SPIs?
Which SPIs should I choose? I have a bunch of SPIs with state dying and state mature.
I suppose I should use the mature ones?
Yes, the mature ones are in use. And yes, you need one per direction. The dying ones should not exist for more than a couple of seconds, so if they do, it is already weird (or the traffic volume is very low - a dying SA normally remains after a rekey until the first packet arrives through the new SA, and then the dying one can be safely dropped).
The /ip ipsec installed-sa print is not as easy as I expected: it overwrites the old results on the screen, and if you run it with file=somename append, the interval is ignored. So you have to use a script to run it every 10 seconds (or every minute), specifying the file and the append (so that the contents of the file are not overwritten).
The bad news is that RouterOS is not great when it comes to file manipulation, but the good news is that the timestamp is part of every print “job”. So just run /ip ipsec installed-sa print file=somefilename append periodically, and each list of SAs will begin with a comment block which includes the timestamp.
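A sketch of such a periodic job (the scheduler and file names are arbitrary):

```
# append a timestamped SA listing to the file every 10 seconds
/system scheduler add name=sa-dump interval=10s \
    on-event="/ip ipsec installed-sa print file=sadump append"
```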
But you cannot open the file in RouterOS once it exceeds a certain size (which is not really big), so you'll have to download it somewhere to read it.