SIP Through IPSEC VPN Site to Site drops calls randomly

Good afternoon

After successfully setting up an IPSEC site-to-site VPN that had some routing issues with a VOIP PBX on the HQ side:

http://forum.mikrotik.com/t/ipsec-vpn-fails-to-ping-to-2-assets-out-of-5/138292/1

I am now facing something new that is related to the above post.

The SIP extension that registers to the PBX network experiences dropped calls every now and then, at varying times.
Sometimes the call cuts off after 20-40 seconds, other times after as much as 6 minutes.
When it drops, the phone doesn’t lose registration to the HQ side… only the call drops.
Reading online I found about 1000 possible reasons, but mostly this seems to happen because of NAT errors.
I want to narrow it down a bit…
I have never used tools like Wireshark or the MikroTik ones before, so I don’t know what will tell me what is causing the drops.

Any pointers?

Thanks in advance

Attached Firewall
BRANCH.txt (8.73 KB)
HQ.txt (13.2 KB)

From the networking perspective, there are three independent packet exchanges when it comes to SIP phones:

  • SIP registration: the phone periodically updates the user account on the PBX with its current contact IP address and port. There is no other payload in the messages but the contact itself. The relevant messages are exchanged periodically as long as the phone is up, minutes or tens of minutes apart.
  • SIP call control (setup and termination) - the phone announces to the PBX, or vice versa, the beginning of a call, that it has been received, and that it has been terminated. An important part of the information payload is the IP address and port where each party expects to get the media from the remote one. Unless the call state changes, there are no mid-call messages, except for a periodic “session keepalive” exchange which, if used at all, is usually also sent just once every few minutes.
  • media exchange (RTP) - during the call, this flow is almost constant (except if “silence suppression” is in use), typically at about 50 packets per second per direction. Some phones and telephone exchanges consider an outage in media reception that is longer than a few (tens of) seconds a reason to terminate the call at signalling level.

So the fact that the phone doesn’t report a loss of registration indicates nothing at all regarding short-time outages.

NAT “errors” (actually, NAT operation as such), as well as a “normal” firewall, most often prevent the calls from even setting up properly, but they do not cause mid-call drops. Various techniques are used to overcome NAT-related issues; most public VoIP services deal with client-side NAT routinely without any issues; these measures are typically not necessary on PBXes operating in closed network environments.

If I got your setup right, the two subnets (at HQ and BO sites) are interconnected using the VPN, so there is no NAT on the path between the PBX and the phone (the NAT which the IPsec must overcome is invisible from the inside of the tunnel).

When the call “drops”, do you get one-directional or bi-directional silence while the phone doesn’t indicate an end of the call on its display or by an “end of call” tone, or does it look as if the remote party just hung up mid-word, including the indications?

Hello Sindy,

No, it doesn’t come in the form of silence.
It hangs up, including the indications.
Two interconnected subnets via IPSEC.
Since they are interconnected and my phone registers, do I have to port forward (dst-nat), or is it not necessary because of the IPSEC?

I have already checked that there is no SIP reinvite running on my PBX and on the phone the VAD is disabled.

It could be the SIP device or the PBX itself, as they are not of modern technology (more than 7 years old).

The next step is to install a softphone on my laptop and see whether it hangs up there as well.

But I want to make sure the MikroTik tunnel isn’t causing this issue first.

To add to what @sindy said, make sure that on both sides, under IP → Firewall → Service Ports, the SIP ALG is disabled…
Also, on your PBX, do you see the phone being registered with its local IP, and not the router’s?

SIP ALG is disabled on both sites
The IP of the SIP phone is 192.168.5.250
The PBX IP is 192.168.0.250
If I go to the PBX menu to see the registration info, it shows the phone registered as 192.168.5.250

dst-nat is not necessary. The phone and the PBX can see each other transparently thanks to the VPN. And yes, do disable the SIP ALG at both Tiks, although it usually doesn’t cause this kind of problem.

Other than that, set /tool sniffer at both ends to sniff into file named something.pcap and filter on the IP address of the phone at both ends. Then let the sniffer run at both ends and stop sniffing once the call fails. Then download the files, open them using Wireshark and start discovering the new world. I recommend the menu Telephony->VoIP calls. Or post the files (not screenshots), but be aware that all the conversations can be heard in them.

Does your PBX write logs? Is there anything interesting in the logs?
What is the indicated termination cause for the dropped calls in question?

Good pointer - I have enabled the PBX’s SysLog and I have it collecting info to see if it points to something on its side.

Excellent - I have set up Wireshark on my local network (the subnet that has the SIP phone) and I can capture the packets on the go while I initiate the call. Is that sufficient, or do I have to do exactly as you described?

If the issue is within SIP alone, capturing at one site is sufficient. If it is related to VPN outage, capturing at both sites allows you to compare them and find out what didn’t get through.

Running the capture into a file on the router itself allows you to just retrieve the result once the call fails; if every call fails sooner or later, there is no reason to sniff into file at your local site. On the remote one, you can redirect the sniff to your local Wireshark too, but if the VPN has an outage, you won’t see that as the tzsp packets are UDP ones, so they won’t get delivered after the outage ends.

I do not know what PBX you use, but in some cases you must add an entry to the PBX Whitelist, in your case the subnet 192.168.5.0/24 or just 192.168.5.250 …
A call that cuts after 6 minutes is not a NAT problem to me…
Also make sure that the PBX allows the extension to be used outside the LAN; that might cause problems as well in some cases…

Finally, when a call drops, what do you do to fix it? Do you reboot the phone? The PBX? Or nothing, and it just works again after a few minutes?

The PABX allows use outside the LAN.
The only whitelist option I see is the ACL WAN IP Whitelist. Not sure if that is necessary.
No need to reboot the phone, as I don’t lose registration. It just hangs up.
I just initiate the call again.
Anyway, I will do a bit of logging and capturing and will get back to you guys when I have some news.

Thank you for your time once again

You could start a ping to the PBX and, when the call hangs up, check whether you lost any ICMP packets…
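To make the ping check easier to read afterwards, the results can be reduced to a list of outage windows and compared with the time the call dropped. The following is only a minimal sketch: the `(timestamp, reply_received)` sample format, the function names and the example numbers are all hypothetical, and collecting the samples from an actual ping tool is left out.

```python
from typing import List, Tuple

Sample = Tuple[float, bool]   # (seconds since start, reply received?) - assumed format
Outage = Tuple[float, float]  # (time of first lost ping, time of last lost ping)


def find_outages(samples: List[Sample]) -> List[Outage]:
    """Group consecutive lost pings into outage windows."""
    outages: List[Outage] = []
    start = None
    last_lost = None
    for ts, ok in samples:
        if not ok:
            if start is None:
                start = ts          # first lost ping of a new run
            last_lost = ts
        elif start is not None:
            outages.append((start, last_lost))  # run ended with a reply
            start = None
    if start is not None:
        outages.append((start, last_lost))      # still lost at end of log
    return outages


def drop_follows_outage(outages: List[Outage], drop_time: float,
                        max_delay: float) -> bool:
    """True if some outage started at most max_delay seconds before the drop."""
    return any(0 <= drop_time - start <= max_delay for start, _ in outages)


# Made-up example: one ping per second, replies lost at t=2..3 and t=5.
samples = [(0.0, True), (1.0, True), (2.0, False),
           (3.0, False), (4.0, True), (5.0, False)]
print(find_outages(samples))
```

If `drop_follows_outage` returns `True` for the moment the call hung up, the drop coincides with packet loss towards the PBX; if it returns `False`, the pings alone do not explain it.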

Will try that as well
Thanks

@zacharias @sindy have a look at this screenshot from wireshark locally. This happened within the 7th minute

I tried sniffing, but the damn thing never dropped during the 15 minutes I was on the call. I will try again tomorrow when someone is available on the other side.

@andriys The PBX log shows nothing of use… even when the call drops, it shows it as a successful call…
export.png

I see that a different port is used, and the log could be looking for ports 5060 and 5061. In /ip firewall service-port you can add your port to the SIP line.

https://help.vantact.com/index.php?/Knowledgebase/Article/View/25/0/how-to-disable-sip-alg-on-a-mikrotik

I use port 6060 instead of the common 5060.

The only difference is that I have to use PBXIP:6060 for the phone to register.

Since SIP ALG is disabled, why should I add the port to the field?

FYI, I tried everything with port 5060 and the behaviour is identical to that with port 6060.

OK. So if you look higher into the capture, you should see that at the beginning of the call, both parties (the PBX and the phone) send RTP packets to each other. Then, at some moment, RTP stops coming to the phone because the VPN stops working for a while, and after some time since the last RTP packet received, the phone sends the first BYE because it tracks the RTP reception and terminates the call if the RTP is absent. Since the 200 OK response to the BYE doesn’t come, the phone retransmits the BYE a few times before giving up.
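The pattern described above (RTP towards the phone stops, then the phone sends BYE, then the BYE gets retransmitted because no 200 OK arrives) can be checked mechanically once the capture is reduced to a simple packet list. This is a rough sketch only: the `Packet` record, the function names and the timings are invented for illustration, it does not read pcap files, and it does not match responses to requests by CSeq the way a real SIP analysis would.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Packet:
    time: float     # seconds since start of capture
    kind: str       # "RTP", or a SIP method / status such as "BYE", "200"
    to_phone: bool  # True if the packet is addressed to the phone


def rtp_gap_before_bye(packets: List[Packet]) -> Optional[float]:
    """Seconds between the last RTP packet received by the phone
    and the first BYE the phone sent; None if either is missing."""
    byes = [p.time for p in packets if p.kind == "BYE" and not p.to_phone]
    if not byes:
        return None
    first_bye = min(byes)
    rtp_in = [p.time for p in packets
              if p.kind == "RTP" and p.to_phone and p.time <= first_bye]
    if not rtp_in:
        return None
    return first_bye - max(rtp_in)


def bye_attempts_without_answer(packets: List[Packet]) -> int:
    """Number of BYEs the phone sent if no 200 reached it after the
    first one; 0 if there was no BYE or a 200 did arrive."""
    byes = [p.time for p in packets if p.kind == "BYE" and not p.to_phone]
    if not byes:
        return 0
    answered = any(p.kind == "200" and p.to_phone and p.time >= min(byes)
                   for p in packets)
    return 0 if answered else len(byes)


# Invented capture: inbound RTP stops at t=10.0, outbound RTP continues,
# the phone sends BYE at t=40.0 and retransmits it twice.
capture = [
    Packet(9.5, "RTP", True), Packet(9.5, "RTP", False),
    Packet(10.0, "RTP", True), Packet(10.0, "RTP", False),
    Packet(20.0, "RTP", False), Packet(30.0, "RTP", False),
    Packet(40.0, "BYE", False), Packet(40.5, "BYE", False),
    Packet(41.5, "BYE", False),
]
print(rtp_gap_before_bye(capture), bye_attempts_without_answer(capture))
```

A gap of many seconds followed by several unanswered BYEs matches the "transport stopped working" picture; a BYE with no preceding RTP gap would point to something else.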

So as for me, it is time to stop sniffing at LAN side and to start searching why the transport fails. So in addition to pinging the address of the PBX, ping also the HQ Tik’s LAN IP and the public IP of the HQ’s edge router from the BO Tik (from three different windows). This should tell you whether the problem is the VPN itself or something on the HQ Tik or the connectivity between the two sites. It is possible that there needs to be some “heavy” traffic between the sites so that the issue would pop up, so it may be that all three pings will work for hours and you’ll need to set up a call to see the failure.
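One possible way to read the three ping results is sketched below. The mapping from results to conclusions is an assumption spelled out for illustration, not a rule: it presumes the public IP answers ICMP at all, and the function and parameter names are hypothetical.

```python
def locate_fault(public_ip_ok: bool, hq_tik_lan_ok: bool, pbx_ok: bool) -> str:
    """Guess where the outage sits from three pings sent from the BO Tik
    at the moment the call fails (assumed interpretation, see lead-in)."""
    if not public_ip_ok:
        # Even the outside of the HQ edge is unreachable.
        return "connectivity between the two sites"
    if not hq_tik_lan_ok:
        # Internet path is fine, but nothing gets through the tunnel.
        return "the VPN itself"
    if not pbx_ok:
        # Tunnel delivers to the HQ Tik, but not beyond it.
        return "something on the HQ Tik or behind it"
    return "no outage seen by the pings"


print(locate_fault(public_ip_ok=True, hq_tik_lan_ok=False, pbx_ok=False))
```

The checks are ordered from the outermost target inwards, so the first failing ping decides the answer.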

A few notes if they are relevant.
Ever since I enabled the DMZs on both ISP routers to point to the MikroTiks and removed the static routes for the VPN in /ip routes, as zacharias suggested, I can’t ping anything from within the Tiks. But the tunnel works: I can ping from the network devices across the tunnel in both directions, the web interfaces work, SIP registers fine, and so on. I had to add two static routes pointing to the gateways so that I can ping them from within the Tiks, as my scripts require /tool netwatch to see whether the tunnel is active.

2nd, I have fasttrack enabled in my firewall at the branch site, BUT I am marking the IPsec packets and using that mark in the fasttrack rule’s connection mark. See my attached BRANCH.txt above.

3rd, on the branch site I use mangle to change the MSS.

These notes have me wondering a bit whether they could be causing my issues, so that we can avoid further investigation.

Other than that, I will proceed with your request, but please clarify whether you want the ping from within the Tiks (not working) or from my PC (192.168.5.252), which pings and accesses everything on the 192.168.0.0/24 network.

Thank you