How to handle VOIP on multiple WANs/backup

ecastellani · Mon Mar 15, 2021 10:17 am

Hi,

We give our standard customers two internet connections managed by an RB4011, and they often use a cloud-hosted 3CX PBX with some phones inside the LAN.

VoIP routing (the connections to the SIP server) is simply left to the route distances.

When the distance 10 route fails (internet on WAN1 goes down) all the traffic goes to WAN2 (distance 20).

And this works fine for VOIP too.

When WAN1 comes back up, the VoIP sessions stay stuck on the old route until the 4011 is rebooted.

I read about a workaround using a "flush NAT" script on WAN change, but I would like to know if someone has a better suggestion.

Thanks in advance

sindy · Mon Mar 15, 2021 2:40 pm

I don't get how the connections can continue working as they move from WAN 1 to WAN 2 whereas they hang on WAN 2 when WAN 1 becomes available again.

There is a possibility to run a virtual router somewhere in a data center, create two VPN tunnels to it from the customer premises router, each via one of the WANs, and do the NAT for the connections to the 3CX on the virtual router. A failover of the DC normally doesn't change the IP addresses of the hosts running there, and the virtual router is normally migrated to a backup hardware if the current one fails.

anav · Mon Mar 15, 2021 4:05 pm

Interesting, I too have experienced similar issues. What I would do is manually unplug my obi2 modem and then plug it back in.
My case is where it would switch to the second router successfully but not switch back to the primary. I think you may be on to something about flushing the DNS or NAT cache. I tried to change settings on the Modem itself and it was successful to a point. For whatever reason the obi2 modem wants to connect to home via the WAN2 connection, or HOME thinks the Obi2 is still accessible through WAN2.........

ecastellani · Mon Mar 15, 2021 9:58 pm

Yes, it seems that some of the traffic stays on the backup path after the switchover caused by the main connection failure. The failover to the backup is flawless because there is "no choice": the sessions die when WAN1 goes down. But when WAN1 comes up again, WAN2 does not go down, so everything that is still alive on WAN2 becomes a zombie.

Especially the VoIP sessions: the cloud PBX (we mainly use 3CX) still has the sessions open from the public IP of WAN2 but starts receiving requests from WAN1, and VoIP starts to misbehave: you may be able to place new calls, but you cannot receive any.

It also depends on the device you are using inside the LAN: a Yealink phone will surely misbehave (and so will a FritzBox), but with a local SBC the issue disappears (it is probably smarter at handling sessions).

Oh yes, the second workaround is placing an SBC in the LAN and configuring it with the SIP account registrations for the local phones. It works, but I don't like it: you gain stability, but you also add one more point of failure and an extra expense.

sindy · Mon Mar 15, 2021 10:19 pm

The failover from primary to backup WAN can't be flawless either - some time will elapse until the phones re-register (and thus the registration arrives to the remote registrar from the new WAN IP), and until that happens, the incoming calls keep being sent to the dead WAN IP.

So what you have in mind is that whilst the tracked connections NATed to the primary WAN's IP are dropped when the primary WAN goes physically down (as the NAT is done using masquerade), the backup WAN doesn't go physically down when the primary one gets up again, so the tracked connections NATed to the backup WAN's address stay like that, and even though the phones eventually re-register, the re-registration reuses the tracked connection so the REGISTER packets still leave with the wrong public IP.

And if the primary WAN remains physically up but the path to internet is broken somewhere at your ISP, the situation is the same, as the existing tracked connections aren't removed either.

A virtual VPN router or virtual SBC located in a data center is still a SPOF, but it should be less likely to fail as redundancy mechanisms exist in the data centers. It is definitely an extra expense, yes.

At local level, there is no other way but the script removing the tracked connections NATed to secondary WAN's IP whenever the primary WAN becomes available again; as suggested above, in fact you should track the transparency of the primary WAN rather than just its physical state, and the script removing the tracked connections should act also when the traffic fails over to the backup WAN.

Since there were some woes that the removal of the tracked connections using /ip firewall connection remove occasionally fails, an /ip dhcp-client release on the backup WAN, or even /interface ethernet disable etherX ; delay 1s ; /interface ethernet enable etherX may be necessary as a "bigger hammer" to get rid of the stuck connections.

But no matter what you do, the new connections are not created until each phone decides to renew the registration.
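
A minimal sketch of the cleanup script described above, assuming the backup WAN's public address is 2.2.2.2 (a placeholder, not a value from this thread) and that tracked UDP/TCP connections show reply-dst-address as address:port. It would be run by whatever mechanism detects that the primary WAN is usable again:

```
# Placeholder: 2.2.2.2 stands for the public address of the backup WAN.
# Remove every tracked connection that was src-NATed to that address, so the
# next packet from each phone creates a new connection that follows the
# currently active route. The unescaped dots in the pattern match any
# character, which is looser than necessary but unlikely to matter here.
/ip firewall connection remove [find where reply-dst-address~"^2.2.2.2:"]
```

The phones still have to re-register before incoming calls reach the new address.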

jbl42 · Mon Mar 15, 2021 10:34 pm

ecastellani wrote:
Yes, it seems that some of the traffic stays on the backup path after the switchover caused by the main connection failure. The failover to the backup is flawless because there is "no choice": the sessions die when WAN1 goes down. But when WAN1 comes up again, WAN2 does not go down, so everything that is still alive on WAN2 becomes a zombie.

For conntracked NAT sessions this is expected behaviour: if WAN1 goes down, IP connections drop and are newly initiated by clients using the only available interface, WAN2. When this happens, a masquerade/NAT session between the internal IP and the WAN2 IP is created and is used to NAT ingress traffic belonging to this connection. If WAN1 comes back up in parallel to WAN2, existing NAT connections stay on WAN2 until they are closed or time out, while new outbound NAT connections use WAN1 because of its lower distance of 10.

What I did to solve this problem in one installation was to have a script run on WAN1 link-up that disables the WAN2 interface, waits 100 ms, and re-enables it. This drops all connections on WAN2, and when clients re-establish them they use WAN1 because of its lower distance of 10. An ugly solution, but it works reliably.

A general problem of any dual-WAN approach with different IPs is that every time your WAN IP changes, all ongoing SIP calls are dropped. If you want to avoid this, you will need an SBC in the internal LAN (or a dynamic routing solution on your WAN connections that allows changing the line between ISPs while keeping your public IP, but this is usually out of reach for most non-enterprise installations).

PS: In the case of SIP, another question would be how the RB4011 SIP ALG handles changes of WAN route and IP.

msatter · Mon Mar 15, 2021 10:57 pm

You could look at the STUN (UDP/7080-7090) connections before dropping WAN2 and switching back to WAN1. That way, ongoing conversations will not be interrupted.

The risk is that on busy lines the switch back may have to wait a very long time, so you could include a timeout after which the line is hard-dropped to force the switch back.

che · Mon Mar 15, 2021 11:07 pm

Masquerade is not suitable for multi wan VoIP setups, use action=src-nat instead of masquerade. Example:

/ip firewall nat
add chain=srcnat action=src-nat to-addresses=1.1.1.1 out-interface=WAN1
add chain=srcnat action=src-nat to-addresses=2.2.2.2 out-interface=WAN2

If you have a dynamic IP on any of the WANs, use a script to update the NAT rule.
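
A minimal sketch of such an update script, assuming WAN2 gets its address from a DHCP client on an interface named WAN2 (an assumption made for illustration). It relies on the bound and lease-address variables that RouterOS passes to DHCP client scripts; check how your RouterOS version formats lease-address before relying on it:

```
# Runs on every lease change of the DHCP client on WAN2. When a lease is
# bound, copy the leased address into the src-nat rule for WAN2.
/ip dhcp-client
set [find interface=WAN2] script=":if (\$bound=1) do={ /ip firewall nat set \
    [find chain=srcnat action=src-nat out-interface=WAN2] \
    to-addresses=\$\"lease-address\" }"
```

Connections created before the address change keep the old NAT address, so they would still need to be removed separately.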

millenium7 · Mon Mar 15, 2021 11:50 pm

ecastellani wrote:
Yes, it seems that some of the traffic stays on the backup path after the switchover caused by the main connection failure. The failover to the backup is flawless because there is "no choice": the sessions die when WAN1 goes down. But when WAN1 comes up again, WAN2 does not go down, so everything that is still alive on WAN2 becomes a zombie.

jbl42 wrote:
For conntracked NAT sessions this is expected behaviour: if WAN1 goes down, IP connections drop and are newly initiated by clients using the only available interface, WAN2. When this happens, a masquerade/NAT session between the internal IP and the WAN2 IP is created and is used to NAT ingress traffic belonging to this connection. If WAN1 comes back up in parallel to WAN2, existing NAT connections stay on WAN2 until they are closed or time out, while new outbound NAT connections use WAN1 because of its lower distance of 10.

Are you sure this is the case?

My understanding is that the routing table takes priority: if WAN1 comes back up, connections that were going out through WAN2 will go via WAN1 if it has a lower distance value.
But because of connection tracking, those connections will go out WAN1 with the source IP address of WAN2 (src-nat/masquerade combined with connection tracking makes this sticky). This silently causes all phone calls to fail, because you have asymmetrical routing (or the traffic is simply dropped in WAN1's network), and nothing works until the router is rebooted or the connection table is flushed.

If I'm wrong, please correct me.

More importantly, if there's a good way to handle all this properly, I'm sure all of us would like to know.
For instance, I think everyone would agree that the optimal approach to this type of failover would be a hard failover when WAN1 goes down but, when WAN1 comes back up, no hard transition back to WAN1 (which kills all phone calls). It would be best to keep all existing and ongoing calls routed out WAN2 and move each one over to WAN1 only once the call is over. Importantly, this would prevent a flapping WAN connection from continually interrupting everyone's calls.
I'm sure there's a way to do this with mangle and connection/packet marking? Maybe with speed thresholds: once a VoIP session is below, say, 10 kbit/s, the call is clearly over and the connection should be flushed.

jbl42 · Tue Mar 16, 2021 1:08 am

millenium7 wrote:
My understanding is that the routing table takes priority: if WAN1 comes back up, connections that were going out through WAN2 will go via WAN1 if it has a lower distance value.
But because of connection tracking, those connections will go out WAN1 with the source IP address of WAN2 (src-nat/masquerade combined with connection tracking makes this sticky). This silently causes all phone calls to fail, because you have asymmetrical routing (or the traffic is simply dropped in WAN1's network), and nothing works until the router is rebooted or the connection table is flushed.

Changing the outgoing WAN interface for an existing conntracked src-nat connection also changes the IP address on your end of the IP connection, immediately breaking any TCP connection. Outbound-only UDP might continue to work, depending on the L5 protocol. Even if packets with the wrong WAN source IP get to the destination without being dropped as spoofed, any response would arrive on the wrong RB4011 WAN interface, where it would be dropped in the firewall input chain as an invalid connection.
So if what you described is what the RB4011 does, it would be a quite strange thing to do for a router.

Anyway.
I think this is pretty close to what you need:
https://help.mikrotik.com/docs/display/ ... ll+Marking

millenium7 · Wed Mar 17, 2021 11:34 pm

jbl42 wrote:
So if what you described is what the RB4011 does, it would be a quite strange thing to do for a router.

I would like to be corrected if someone knows for sure, but I believe it is expected behavior when using nothing more than a masquerade rule. It doesn't happen with src-nat rules, but the problem is that src-nat rules are no good if the IP address is dynamic.

One reasoning behind it (besides not really being coded well for a user router) would be that if you have multiple WANs and don't drop invalid connections, you can have asymmetrical routing without dropping connections; they continue to work. That is a good thing for a core/distribution router.
But in most user cases it's a very bad thing.

It would be good if someone who knows for certain could chime in.

sindy · Wed Mar 17, 2021 11:45 pm

It doesn't matter whether action=masquerade or action=src-nat rule sets up the NAT behavior of the connection. Only the initial packet of each connection is matched against the NAT rule chains, and the behaviour imposed by the rules it matches is then remembered in the context of the connection and repeated for all subsequent downstream (orig) packets of that connection, and inverted for the upstream (reply) ones.

The difference between action=masquerade and action=src-nat is that masquerade uses the address attached to the out-interface as the new source address of the connection (referred to as reply-dst-address), but also that the connection is removed if the interface loses that address. This does not happen with connections whose reply-dst-address has been assigned by an action=src-nat rule.
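
To see this on a live router, the tracked connections and the address each one was NATed to can be listed. A minimal sketch, assuming 2.2.2.2 is the backup WAN address (a placeholder) and that reply-dst-address is shown as address:port for UDP and TCP entries:

```
# List tracked connections whose replies are expected at the backup WAN
# address, i.e. connections that were src-NATed to it, regardless of which
# action (masquerade or src-nat) set that up.
/ip firewall connection print detail where reply-dst-address~"^2.2.2.2:"
```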

che · Thu Mar 18, 2021 12:48 am

sindy wrote:
It doesn't matter whether action=masquerade or action=src-nat rule sets up the NAT behavior of the connection.

Please read the "masquerade" section of the MikroTik Confluence page to understand why this method of NAT is particularly bad for multi-WAN setups running critical services such as IP voice: https://help.mikrotik.com/docs/display/ROS/NAT

The reason why I mentioned masquerade in the first place is that I recognized the issue without the OP even mentioning it. I have seen it many times in production back when I was involved with VoIP systems, and switching the NAT method from masquerade to src-nat, together with properly setting up the NAT helper (IP/Firewall/Service Ports/sip in MikroTik's case), solved the issue every time. Now, maybe my comments look unprofessional without 20 lines of explanation, but that doesn't mean they're invalid.

gotsprings · Thu Mar 18, 2021 3:10 am

/ip firewall connection remove [find]

I have that tied to a netwatch. Flushes all the connections on a change.

anav · Thu Mar 18, 2021 1:41 pm

sindy wrote:
It doesn't matter whether action=masquerade or action=src-nat rule sets up the NAT behavior of the connection.

che wrote:
Please read the "masquerade" section of the MikroTik Confluence page to understand why this method of NAT is particularly bad for multi-WAN setups running critical services such as IP voice: https://help.mikrotik.com/docs/display/ROS/NAT
The reason why I mentioned masquerade in the first place is that I recognized the issue without the OP even mentioning it. I have seen it many times in production back when I was involved with VoIP systems, and switching the NAT method from masquerade to src-nat, together with properly setting up the NAT helper (IP/Firewall/Service Ports/sip in MikroTik's case), solved the issue every time. Now, maybe my comments look unprofessional without 20 lines of explanation, but that doesn't mean they're invalid.

Interesting there che.
I have two masquerade rules, one for each ISP connection. Are you saying that this is a problematic approach?? (both ISPs are dynamic, one fiber, one cable)
I could just connect the VOIP to my backup cable ISP which rarely has issues and be done with it I suppose.

anav · Thu Mar 18, 2021 1:42 pm

gotsprings wrote:
/ip firewall connection remove [find]

I have that tied to a netwatch. Flushes all the connections on a change.

Can you elaborate? What are you finding with [find]?
Is this a DHCP script? How is it tied to netwatch?

gotsprings · Thu Mar 18, 2021 2:08 pm

gotsprings wrote:
/ip firewall connection remove [find]
I have that tied to a netwatch. Flushes all the connections on a change.

anav wrote:
Can you elaborate? What are you finding with [find]?
Is this a DHCP script? How is it tied to netwatch?

Sure...

Keep it "simple stupid".

I use netwatch to monitor an external DNS server.
I set a firewall rule to ONLY ALLOW pings to that host over the primary ISP. Using OUTPUT CHAIN.

If it goes up or down...
the netwatch fires the line above.

Flushing all connections in the firewall. Forcing them to reconnect.

I should take some time to do a 3 out of 5 ping... But this works as a starter.
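
A minimal sketch of that setup, assuming 9.9.9.9 as the monitored DNS server and ether2 as the backup ISP interface (both placeholders, not values from this post):

```
# Drop probe pings that would leave through the backup ISP, so the probe only
# succeeds while the primary path really works.
/ip firewall filter
add chain=output protocol=icmp dst-address=9.9.9.9 out-interface=ether2 \
    action=drop comment="probe host only via primary ISP"

# Flush the whole connection table whenever the probe host changes state.
/tool netwatch
add host=9.9.9.9 interval=10s timeout=2s \
    up-script="/ip firewall connection remove [find]" \
    down-script="/ip firewall connection remove [find]"
```

Flushing everything also interrupts connections unrelated to VoIP, which is the price of keeping it simple.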

But I have one site that "gave me S--t about calls." I made them subscribe to "bigleaf", which means I no longer have to know which interface is the "hot" one.

All traffic is sent to bigleaf and on to its destination by default.

Bigleaf takes all the feeds into the router (3 ISPs) and then provides me 1 uplink from them.

Watching this... I have been on a phone call using VoIP and ripped out 2 ISP connections. The call never dropped.

sindy · Thu Mar 18, 2021 2:33 pm

che wrote:
The reason why I mentioned masquerade in the first place is that I recognized the issue without the OP even mentioning it. I have seen it many times in production back when I was involved with VoIP systems, and switching the NAT method from masquerade to src-nat, together with properly setting up the NAT helper (IP/Firewall/Service Ports/sip in MikroTik's case), solved the issue every time. Now, maybe my comments look unprofessional without 20 lines of explanation, but that doesn't mean they're invalid.

As you took the effort to quote the first half of my post before this statement of yours, would you mind explaining what is incorrect or missing in that post as a whole?

che · Thu Mar 18, 2021 5:03 pm

Sure.

You are missing the part that clearly distinguishes how masquerade leaves the connections hanging when the primary link is back if the backup link did not have any kind of interruption, while source NAT does not in the same scenario. There are some corner cases where all types of NAT would be problematic, but I don't think we need to go into explaining to people how to properly segment services and use private IP range routing through VPN, MPLS, VLANs and similar.

And the reason why I quoted your first sentence is that it's misleading to the reader (me).

anav wrote:
I have two masquerade rules, one for each ISP connection. Are you saying that this is a problematic approach?? (both ISPs are dynamic, one fiber, one cable)
I could just connect the VOIP to my backup cable ISP which rarely has issues and be done with it I suppose.

The answer depends on the client experience. Do the phones or the IP PBX have problems authenticating or connecting after your WAN(s) get recycled? If the answer is yes, this info might put you on the right track.

Just to make it clear, my idea is to empower MikroTik users with my own experience, not to sound arrogant. :)

anav · Thu Mar 18, 2021 7:24 pm

All sounds good che, I don't even use SIP, not recommended by most.

jbl42 · Tue Mar 30, 2021 8:08 pm

I found the topic interesting and could spare some time the last weekend to do some experiments.

@Che is right:
Masquerade is tied to a physical interface, not to an IP. Masquerade uses the IP of the specified out-interface as the NAT source. If the interface goes down, loses the IP, or the IP moves to another physical interface, masquerade drops all existing connections and there is nothing you can do about it. This makes sense for the intended use case of masquerade.

src-nat uses the specified IP as the NAT source and then decides which outgoing interface to use depending on this source IP. The actual physical interface status has no direct effect on existing connections; it is only about subnets and routing.
In my opinion, this clearly means that src-nat is the right thing to use for multi-WAN.

jkyawesome · Wed Mar 31, 2021 12:15 am

MikroTik Automatic Failover Two Gateways
Steve Discher, 5/15/2015
There’s a million ways to do this on the wiki and the web but none of them fit my particular application. Let me explain:

1. The weak point in my network was an AirFiber 24 upstream from the tower I am connected to wirelessly. This is the link that goes down in heavy rain causing an outage at our office to PROVIDER1. We have a backup connection through a second provider that is slower but being 5GHz doesn’t drop in the rain, PROVIDER2.

The network is like this:

[MikroTik CCR1036-12G-4S]
—[RBSXT]—[RBOmniTikU-5HnD]—[AF24]—[PROVIDER1]
—[RBSXT]—[PROVIDER2]

2. Simple floating static routes with check gateway doesn’t help because on PROVIDER1 we never drop our 5GHz connection to the tower, it’s the upstream link that fails.

3. I tried recursive routes and it works but the failover was still lacking and seemed sporadic at best.

4. When failover did occur, the VOIP PBX would hold the connection open through the dead provider and some phones in the office wouldn’t work at all, rebooting the phone was the only solution. We tried a ton of solutions and never got it to work consistently.

The solution that works the best is as follows. I am using a combination of static routes, firewall rules and Netwatch scripts. Here it is:

The Netwatch script watches 4.2.2.4 (a public DNS server). If it goes down:

It changes the distance on the default route to PROVIDER1 to 20, making it inactive. Now all traffic defaults through PROVIDER2.
It emails me that the gateway has changed. Please note that you must first set up your email server IP and any authentication in /tool e-mail.
It clears any connections to my VOIP gateway, thereby causing them to re-establish. Interestingly, calls do not drop!
When pings return, it sets the distance on the default route through PROVIDER1 back to 1, making it the active route, and then clears all connections to the VOIP gateway again.
/tool netwatch
add comment=CheckCon host=4.2.2.4 interval=5s timeout=2s \
    down-script="/ip route set [find comment=\"PROVIDER1\"] distance=20\r\
    \n/ip route set [find comment=\"PROVIDER2\"] disabled=no\r\
    \n/tool e-mail send to=\"YourEmailAddress\" \
    body=\"Connection with PROVIDER1 Lost, Switched to PROVIDER2\" \
    subject=\"Lost connection with PROVIDER1\"\r\
    \n/ip firewall connection remove [find dst-address=\"YourVoipGatewayIP\"]" \
    up-script="/ip route set [find comment=\"PROVIDER1\"] distance=1\r\
    \n/ip route set [find comment=\"PROVIDER2\"] disabled=no\r\
    \n/tool e-mail send to=\"YourEmailAddress\" \
    body=\"Connection with PROVIDER1 Regained, Switched back to PROVIDER1\" \
    subject=\"Regained connection with PROVIDER1\"\r\
    \n/ip firewall connection remove [find dst-address=\"YourVoipGatewayIP\"]"
Next we need to ensure we can only ping our test host through the PROVIDER1 connection. This is done with a static route through PROVIDER1:

/ip route add \
    comment="Force test pings through PROVIDER1" dst-address=4.2.2.4 \
    gateway=199.21.228.153
Next we need to comment our default routes.

/ip route
add comment=PROVIDER1 distance=1 gateway=199.21.228.137 scope=11
add comment=PROVIDER2 distance=10 gateway=209.112.225.65
Next we need to ensure that pings to our test IP go through PROVIDER1 only:

/ip firewall filter add chain=output comment=\
    "Drop pings to 4.2.2.4 if they go through PROVIDER2" \
    dst-address=4.2.2.4 out-interface=ether2 action=drop
As I write this it is pouring rain outside and I have observed it go down 3-4 times and even with people on the phone, calls continue and we haven’t lost the network. I am loving this!

Steve Discher
