How to handle VoIP on multiple WANs/backup

ecastellani · March 15, 2021, 8:17am

Hi,

We give our standard customers two internet connections managed by an RB4011, and often there is a 3CX cloud PBX service with some phones inside the LAN.

VoIP routing (the connections to the SIP server) is simply left to the route distances.

When the distance 10 route fails (internet on WAN1 goes down) all the traffic goes to WAN2 (distance 20).

And this works fine for VOIP too.
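A minimal sketch of the distance-based failover described above, for readers who want to reproduce it. The gateway addresses are documentation placeholders, not values from this setup, and check-gateway=ping is an assumption about how the primary route is detected as failed:

```
/ip route
# Primary default route via WAN1 (placeholder gateway), preferred because of the lower distance
add dst-address=0.0.0.0/0 gateway=192.0.2.1 distance=10 check-gateway=ping comment="WAN1 primary"
# Backup default route via WAN2 (placeholder gateway), used only while the distance 10 route is inactive
add dst-address=0.0.0.0/0 gateway=198.51.100.1 distance=20 comment="WAN2 backup"
```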

When WAN1 comes back up, the VoIP sessions remain stuck on the old route until the 4011 is rebooted.

I read about a workaround using a “flush NAT” script on WAN change, but I would like to know if someone has a better suggestion.

Thanks in advance

sindy · March 15, 2021, 12:40pm

I don’t get how the connections can continue working as they move from WAN 1 to WAN 2 whereas they hang on WAN 2 when WAN 1 becomes available again.

There is a possibility to run a virtual router somewhere in a data center, create two VPN tunnels to it from the customer premises router, each via one of the WANs, and do the NAT for the connections to the 3CX on the virtual router. A failover of the DC normally doesn’t change the IP addresses of the hosts running there, and the virtual router is normally migrated to a backup hardware if the current one fails.

anav · March 15, 2021, 2:05pm

Interesting, I too have experienced similar issues. What I would do is manually unplug my obi2 modem and then plug it back in.
My case is where it would switch to the second router successfully but not switch back to the primary. I think you may be on to something about flushing the DNS or NAT cache. I tried to change settings on the Modem itself and it was successful to a point. For whatever reason the obi2 modem wants to connect to home via the WAN2 connection, or HOME thinks the Obi2 is still accessible through WAN2…

ecastellani · March 15, 2021, 7:58pm

Yes, it seems that some of the traffic stays on the backup path once it has been moved there by the failure of the main connection. The failover to the backup is flawless because there is “no choice”: the sessions die when WAN1 goes down. But when WAN1 comes up again, WAN2 does not go down, so everything that is still alive on WAN2 becomes a zombie.

This applies especially to the VoIP sessions: the cloud PBX (we mainly use 3CX) still has the sessions open from the public IP of WAN2 but starts receiving requests from WAN1, and VoIP begins to misbehave: you may be able to place new calls but you are not able to receive any.

It also depends on the device you are using inside the LAN: a Yealink phone will surely misbehave (and so does a FritzBox), but with a local SBC the issue disappears (it is probably smarter at handling sessions).

Oh yes, the second workaround is placing an SBC in the LAN and registering the SIP accounts of the local phones on it. It works, but I don’t like it: you gain stability, but you also get one more point of failure and an extra expense.

sindy · March 15, 2021, 8:19pm

The failover from primary to backup WAN can’t be flawless either - some time will elapse until the phones re-register (and thus the registration arrives to the remote registrar from the new WAN IP), and until that happens, the incoming calls keep being sent to the dead WAN IP.

So what you have in mind is that whilst the tracked connections NATed to the primary WAN’s IP are dropped when the primary WAN goes physically down (as the NAT is done using masquerade), the backup WAN doesn’t go physically down when the primary one gets up again, so the tracked connections NATed to the backup WAN’s address stay like that, and even though the phones eventually re-register, the re-registration reuses the tracked connection so the REGISTER packets still leave with the wrong public IP.

And if the primary WAN remains physically up but the path to internet is broken somewhere at your ISP, the situation is the same, as the existing tracked connections aren’t removed either.

A virtual VPN router or virtual SBC located in a data center is still a SPOF, but it should be less likely to fail as redundancy mechanisms exist in the data centers. It is definitely an extra expense, yes.

At local level, there is no other way but the script removing the tracked connections NATed to secondary WAN’s IP whenever the primary WAN becomes available again; as suggested above, in fact you should track the transparency of the primary WAN rather than just its physical state, and the script removing the tracked connections should act also when the traffic fails over to the backup WAN.

Since there were some woes that the removal of the tracked connections using /ip firewall connection remove occasionally fails, an /ip dhcp-client release on the backup WAN, or even /interface ethernet disable etherX ; delay 1s ; /interface ethernet enable etherX may be necessary as a “bigger hammer” to get rid of the stuck connections.
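A minimal sketch of such a clean-up script, assuming WAN2's public address is 198.51.100.2 (a placeholder) and that the script is triggered by whatever monitors the primary WAN. It removes only the tracked connections whose reply-dst-address is the backup WAN's address, rather than flushing the whole table:

```
# Assumption: 198.51.100.2 is the public IP of the backup WAN; adjust the pattern to your address.
# reply-dst-address is shown as address:port for UDP/TCP, hence the trailing colon in the pattern.
/ip firewall connection remove [find where reply-dst-address~"^198\\.51\\.100\\.2:"]
```

For the opposite direction (failover to the backup WAN), the same command with the primary WAN's address in the pattern would apply.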

But no matter what you do, the new connections are not created until each phone decides to renew the registration.

jbl42 · March 15, 2021, 8:34pm

ecastellani wrote: “Yes, it seems that some of the traffic stays on the backup path once it has been moved there by the failure of the main connection. The failover to the backup is flawless because there is ‘no choice’: the sessions die when WAN1 goes down. But when WAN1 comes up again, WAN2 does not go down, so everything that is still alive on WAN2 becomes a zombie.”

For conntracked NAT sessions this is expected behaviour: if WAN1 goes down, IP connections drop and are newly initiated by the clients using the only available interface, WAN2. When this happens, a masquerade/NAT session between the internal IP and the WAN2 IP is created and is used to NAT ingress traffic belonging to this connection. If WAN1 comes back up in parallel to WAN2, existing NAT connections stay on WAN2 until they are closed or time out, while new outbound NAT connections use WAN1 because of its lower distance of 10.

What I did to solve this problem in one installation was to have a script run on WAN1 link up that disables the WAN2 interface, waits for 100 ms and re-enables it. This drops all connections on WAN2, and when the clients re-establish them they use WAN1 because of its lower distance of 10. An ugly solution, but it works reliably.
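The body of such a script could look roughly like this. The interface name ether2 is an assumption for the WAN2 port, and how the script is triggered on WAN1 link up (Netwatch, DHCP client script, PPP profile) is left open:

```
# Assumption: ether2 is the physical port of the backup WAN (WAN2).
/interface ethernet disable ether2
# Short pause so the link actually goes down before it is re-enabled
:delay 100ms
/interface ethernet enable ether2
```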

A general problem of any dual-WAN approach with different IPs is that every time your WAN IP changes, all ongoing SIP calls are dropped. If you want to avoid this, you will need an SBC in the internal LAN (or a dynamic routing solution on your WAN connections that lets you switch the line between ISPs while keeping your public IP, but this is usually out of reach for most non-enterprise installations).

PS: In the case of SIP, another question would be how the RB4011 SIP ALG handles changes of WAN route and IP.

msatter · March 15, 2021, 8:57pm

You could look at the STUN (UDP/7080-7090) connections before dropping WAN2 and switching back to WAN1. Any ongoing conversations will not be interrupted that way.

The risk is that on busy lines the switch back may have to wait a very long time, so you could include a timeout after which the line is hard dropped to force the switch back.
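A rough, untested sketch of that idea: wait while matching UDP connections NATed to WAN2 still exist, but give up after a timeout. The WAN2 address 198.51.100.2, the 10 minute limit and the port pattern (taken from the range mentioned above) are all assumptions:

```
# Assumptions: 198.51.100.2 is WAN2's public IP; ports 7080-7090 mark the media/STUN flows of interest.
:local waited 0
# Loop while at least one matching connection exists and fewer than 600 seconds have passed
:while (([:len [/ip firewall connection find where protocol=udp reply-dst-address~"^198\\.51\\.100\\.2:" dst-address~":(708[0-9]|7090)\$"]] > 0) && ($waited < 600)) do={
    :delay 10s
    :set waited ($waited + 10)
}
# Either the calls have ended or the timeout expired: drop whatever is still NATed to WAN2
/ip firewall connection remove [find where reply-dst-address~"^198\\.51\\.100\\.2:"]
```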

che · March 15, 2021, 9:07pm

Masquerade is not suitable for multi wan VoIP setups, use action=src-nat instead of masquerade. Example:

/ip firewall nat
add chain=srcnat action=src-nat to-addresses=1.1.1.1 out-interface=WAN1
add chain=srcnat action=src-nat to-addresses=2.2.2.2 out-interface=WAN2

If you have a dynamic IP on any of the WANs, use a script to update the NAT rule.
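One way to do that, sketched under assumptions: if the dynamic WAN gets its address by DHCP, the DHCP client's lease script can rewrite the rule. The comment "nat-wan2" is a hypothetical label that the src-nat rule for that WAN would need to carry; $bound and $"lease-address" are the variables RouterOS passes to DHCP client scripts:

```
# Body for the "script" field of the DHCP client on the dynamic WAN.
# Assumption: the matching src-nat rule was created with comment="nat-wan2".
:if ($bound = 1) do={
    # Point the src-nat rule at the address just leased on this WAN
    /ip firewall nat set [find comment="nat-wan2"] to-addresses=$"lease-address"
}
```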

millenium7 · March 15, 2021, 9:50pm

jbl42 wrote: “For conntracked NAT sessions this is expected behaviour: if WAN1 goes down, IP connections drop and are newly initiated by the clients using the only available interface, WAN2. When this happens, a masquerade/NAT session between the internal IP and the WAN2 IP is created and is used to NAT ingress traffic belonging to this connection. If WAN1 comes back up in parallel to WAN2, existing NAT connections stay on WAN2 until they are closed or time out, while new outbound NAT connections use WAN1 because of its lower distance of 10.”

Are you sure this is the case?

My understanding is that the routing table takes priority. If WAN1 comes back up, connections that were going out through WAN2 will then go via WAN1 if it has a lower distance value.
But because of connection tracking, those connections will go out WAN1 with the source IP address of WAN2 (src-nat/masquerade combined with connection tracking makes this sticky). This silently causes all phone calls to fail, because the routing is asymmetrical (or the packets are simply dropped in WAN1’s network), and nothing works until the router is rebooted or the connection table is flushed.

If I’m wrong, please correct me.

More importantly, if there’s a good way to handle all this properly, I’m sure all of us would like to know.
For instance, I think everyone would agree that the optimal approach to this type of failover is a hard failover when WAN1 goes down, but no hard transition back (killing all phone calls) when WAN1 comes back up. It would be best to keep all existing and ongoing calls routed out WAN2 and to move over to WAN1 only when each call is over. Importantly, this would prevent a flapping WAN connection from continually interrupting everyone’s calls.
I’m sure there’s a way to do this with mangle and connection/packet marking? Maybe with speed thresholds: once a VoIP session is below, say, 10 kbit/s, the call is clearly over and the connection should be flushed.
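A sketch of how the marking part might look, in RouterOS v6 syntax (v7 uses routing tables instead of routing-mark on routes). The interface names WAN2 and bridge-lan, the mark names and the gateway 198.51.100.1 are assumptions. Connections created while traffic leaves via WAN2 are tagged, and their later packets are pinned to WAN2 even after WAN1 returns:

```
/ip firewall mangle
# Tag connections whose first packet is forwarded out WAN2 (i.e. created while WAN1 was down)
add chain=forward out-interface=WAN2 connection-state=new action=mark-connection new-connection-mark=via-wan2 passthrough=yes
# Keep routing subsequent LAN-side packets of those connections through WAN2
add chain=prerouting in-interface=bridge-lan connection-mark=via-wan2 action=mark-routing new-routing-mark=to-wan2 passthrough=no
/ip route
# Default route used only by packets carrying the to-wan2 routing mark
add dst-address=0.0.0.0/0 gateway=198.51.100.1 routing-mark=to-wan2
```

One caveat with this approach: a SIP registration that is refreshed by keepalives never expires, so it would stay pinned to WAN2 indefinitely unless something else removes it. The speed-threshold part of the idea is not covered here.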

jbl42 · March 15, 2021, 11:08pm

Changing the outgoing WAN interface for an existing conntracked src-nat connection also changes the IP address on your end of the IP connection, immediately breaking any TCP connection. Outbound-only UDP might continue to work, depending on the L5 protocol. Even if packets with the wrong WAN source IP reach the destination without being dropped as spoofed, any response would arrive on the wrong RB4011 WAN interface, where it gets dropped in the firewall input chain as an invalid connection.
So if what you described is what the RB4011 does, it would be a quite strange thing to do for a router.

Anyway.
I think this is pretty close to what you need:
https://help.mikrotik.com/docs/display/ROS/Firewall+Marking

millenium7 · March 17, 2021, 9:34pm

I would like to be corrected if someone knows for sure, but I believe this is expected behavior when using nothing more than a masquerade rule, and that it doesn’t happen with src-nat rules. The problem is that src-nat rules are no good if the IP address is dynamic.

One reason for it (besides not really being coded well for an end-user router) would be that if you have multiple WANs and don’t drop invalid connections, you can have asymmetrical routing without dropping connections; they continue to work. That is a good thing for a core/distribution router.
But in most end-user cases it’s a very bad thing.

It would be good if someone who knows for certain could chime in.

sindy · March 17, 2021, 9:45pm

It doesn’t matter whether action=masquerade or action=src-nat rule sets up the NAT behavior of the connection. Only the initial packet of each connection is matched against the NAT rule chains, and the behaviour imposed by the rules it matches is then remembered in the context of the connection and repeated for all subsequent downstream (orig) packets of that connection, and inverted for the upstream (reply) ones.

The difference between action=masquerade and action=src-nat is that masquerade uses the address attached to the out-interface as the new source address of the connection (referred to as reply-dst-address), but also that the connection is removed if the interface loses that address. This does not happen with connections whose reply-dst-address has been assigned by an action=src-nat rule.

che · March 17, 2021, 10:48pm

Please read the “masquerade” section of the MikroTik Confluence page to understand why this method of NAT is particularly bad for multi-WAN setups running critical services such as IP voice: NAT - RouterOS - MikroTik Documentation

The reason why I mentioned masquerade in the first place is that I recognized the issue without the OP even mentioning it. I have seen it many times in production, back when I was involved with VoIP systems, and switching the NAT method from masquerade to src-nat together with properly setting up the NAT helper (IP/Firewall/Service Ports/sip in the MikroTik case) solved the issue every time. Now, maybe my comments look unprofessional without 20 lines of explanation, but that doesn’t mean they’re invalid.
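For reference, the SIP helper mentioned above lives under the service-port menu. This is only a sketch of where to look; the values shown are the usual defaults and are an assumption about what “properly set up” means here. Some setups reportedly behave better with the helper disabled, so it is worth testing both ways:

```
# Show the current state of all NAT helpers, including sip
/ip firewall service-port print
# Enable the SIP helper on the standard SIP ports (use disabled=yes to turn it off instead)
/ip firewall service-port set sip disabled=no ports=5060,5061
```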

gotsprings · March 18, 2021, 1:10am

/ip firewall connection remove [find]

I have that tied to a netwatch. Flushes all the connections on a change.

anav · March 18, 2021, 11:41am

Interesting there che.
I have two masquerade rules, one for each ISP connection. Are you saying that this is a problematic approach?? (both ISPs are dynamic, one fiber, one cable)
I could just connect the VOIP to my backup cable ISP which rarely has issues and be done with it I suppose.

anav · March 18, 2021, 11:42am

Can you elaborate: what are you finding with [find]?
Is this a dhcp script, how is it tied to netwatch?

gotsprings · March 18, 2021, 12:08pm

Sure…

Keep it “simple stupid”.

I use netwatch to monitor an external DNS server.
I set a firewall rule to ONLY ALLOW pings to that host over the primary ISP. Using OUTPUT CHAIN.

If it goes up or down…
The netwatch fires the line above.

Flushing all connections in the firewall. Forcing them to reconnect.

I should take some time to do a 3 out of 5 ping… But this works as a starter.
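Put together, the setup described above could look roughly like this. The probe host 9.9.9.9, the interface name WAN1 and the timing values are assumptions, not the values actually used:

```
/ip firewall filter
# Drop pings to the probe host unless they leave via the primary ISP,
# so Netwatch sees the host as down whenever WAN1 is not usable
add chain=output dst-address=9.9.9.9 protocol=icmp out-interface=!WAN1 action=drop comment="netwatch probe only via WAN1"
/tool netwatch
# Flush the whole connection table on every state change, forcing all sessions to reconnect
add host=9.9.9.9 interval=10s timeout=2s up-script="/ip firewall connection remove [find]" down-script="/ip firewall connection remove [find]"
```

Note that this flushes every tracked connection, not only the VoIP ones, so ongoing downloads and other sessions are interrupted as well.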

But I have one site that “gave me S–t about calls.” I made them subscribe to “bigleaf”. Which made me unaware of the “hot interface.”

All traffic is sent to bigleaf and on to its destination by default.

Bigleaf takes all the feeds into the router (3 ISPs), then provides me 1 up from them.

Watching this… I have been on a phone call using VoIP and ripped out 2 ISP connections. The call never dropped.

sindy · March 18, 2021, 12:33pm

As you took the effort to quote the first half of my post before this statement of yours, would you mind explaining what is incorrect or missing in that post as a whole?

che · March 18, 2021, 3:03pm

Sure.

You are missing the part that clearly distinguishes how masquerade leaves the connections hanging when the primary link is back and the backup link did not have any kind of interruption, while source NAT does not in the same scenario. There are some corner cases where all types of NAT would be problematic, but I don’t think we need to go into explaining to people how to properly segment services and use private IP range routing through VPN, MPLS, VLANs and similar.

And the reason why I quoted your first sentence is that it’s misleading to the reader (me).

The answer depends on the client experience. Do the phones or IP PBX have problems authenticating or connecting after your WAN(s) get recycled? This info might get you to the right track if the answer is yes.

Just to make it clear, my idea is to empower MikroTik users with my own experience, not to sound arrogant.

anav · March 18, 2021, 5:24pm

All sounds good che, I don’t even use SIP; it’s not recommended by most.