need help for resilient route from DUAL-WAN to CHR/VPS acting as default GW

Hi Folks,

I need your help in getting my head around the best practice for defining a stable, resilient route from my home DUAL-WAN setup to a CHR on a VPS, which I want to use as default GW for my VOIP server.

Here is the thing, I depicted in my head so far.
I need help defining the best way to go for the “things in red”, here:

This is my Use-Case:
While working from home-office I deployed a second WAN via 4G/LTE to cover as backup for internet services.
This is a DUAL-WAN setup that works well.
However, for Voice/Phone services, I also deployed a VOIP (asterisk) server with connections to some cloud based VOIP-Providers.
The Problem: Whenever one WAN switches over, the VOIP connection will not (re-)establish itself via the other, remaining WAN link (this is connecting to the remote service, active calls will terminate in the event of a switchover, of course).
My Goal: keeping the connection between VOIP Server and Provider as stable as possible, by taking the problem of switchover of a local WAN out of the equation.

My open questions:

  • what service to use in order to connect between RB4011 (via WAN1) and LHGG (via WAN2) to the CHR?
    I’d rather not open any ports on my local site on either WAN. The CHR is already secured, only accepting inbound traffic (input chain) from my WAN IPs.
    So what will it be (Wireguard, GRE, …) you recommend?
  • What feature/function to use in order to find and update the currently active route from the local VOIP server to its default gw on the CHR?
    Is this something that calls for OSPF on the RB4011, or are there other (better) means of doing that?

Some more parameters, some of which you might have seen in the PIC:

  • my VOIP providers do not accept IPv6, hence my idea of using the static ipv4 from CHR as gateway.
  • WAN1 has a public (dynamic) IPv4 and IPv6, while WAN2 only has one dynamic IPv4 (IPv6 not available)
  • The CHR has one fixed IPv4 and a /64 IPv6 net available for deployment - so one or two IPv6 addresses are possible on its ether1 (WAN) interface
  • LHGG and CHR are on ROSv7.1.1, while the RB4011 is still on v6.49.2 (but I am willing to upgrade, should need be - i.e. for using wireguard).

Thank you very much in advance for your input and recommendations!
Stay safe and a good transition into 2022!

regards,
hominidae

Wireguard is a way of encapsulation, encryption, and NAT traversal all in one, whilst GRE is just a way of encapsulation. You may not need encryption so much, as the traffic between the Asterisk and the remote server will go plaintext (or with whatever application layer security the Asterisk applies itself) from the CHR to the SIP provider anyway, but you do need a tunnel interface capable of dealing with the NAT between your home and the CHR, so bare GRE (or mere IPIP) cannot be used.


OSPF is definitely applicable but it may be an overkill, plus it requires a virtual tunnel interface, which excludes use of bare IPsec. And it’ll have to run at both the 4011 and the CHR.

If you decide to use a virtual tunnel interface (so IPIP or GRE over IPsec to resolve the NAT, bare L2TP (ridiculous encryption) or L2TP/IPsec, or Wireguard), you can just add static routes with different distance and check-gateway=ping at both ends, so that the tunnel via the primary uplink would be used if available and the LTE would be used if the primary uplink goes down.
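To make the distance idea concrete, here is a minimal Python sketch of the selection logic only (it is a model, not RouterOS configuration). The `Route` class, the gateway labels and the `alive` flag are made up for illustration; `alive` stands in for whatever check-gateway last reported.

```python
from dataclasses import dataclass


@dataclass
class Route:
    gateway: str    # hypothetical label for the next hop inside a tunnel
    distance: int   # lower distance is preferred
    alive: bool     # assumed input: result of the last check-gateway probe


def pick_route(routes):
    """Return the live route with the lowest distance, or None if all are down."""
    candidates = [r for r in routes if r.alive]
    return min(candidates, key=lambda r: r.distance) if candidates else None


primary = Route("tunnel-via-WAN1", distance=1, alive=True)
backup = Route("tunnel-via-LTE", distance=2, alive=True)

print(pick_route([primary, backup]).gateway)   # primary while it answers probes
primary.alive = False                          # probes over WAN1 stop being answered
print(pick_route([primary, backup]).gateway)   # traffic moves to the LTE tunnel
primary.alive = True                           # WAN1 recovers
print(pick_route([primary, backup]).gateway)   # and the primary is preferred again
```

The point of the two distances is that the fallback and the return to the primary both happen without any script, as long as the probe result is trustworthy.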

You can also use the bare IPsec redundancy solution implemented relatively recently, where a single policy is linked to two peers simultaneously, but the failure detection time may be too long here, plus you need a script to move the connection back to the primary uplink once it comes back to life after an outage.

If you want to avoid any problems with mishandled packet fragments, it must be L2TP with MLPPP support activated - some SIP packets may be pretty large.


All the above assumes that you can either tell the Asterisk that it is NATed to the IPv4 address of the CHR and that it will forward the RTP, or that the Asterisk registers with the SIP provider and the SIP helper on the CHR will take care of the SIP contents manipulation if in plaintext - but even in this case the Asterisk has to forward the RTP from phones on your LAN, as unless something has changed recently, the SIP helper is unable to deal with RTP coming from another address than the SIP.


There is no point in L2-tunneling a VLAN from the 4011 to the CHR as you show on your picture, unless you’d want to bridge the WAN of the CHR with the LAN at your home so that you could make the Asterisk have the public IPv4 address on itself. To do that, you’d have to be able to link the IPv6 addresses to a separate interface on the CHR.

thank you so much for your response!


Agreed. I did some tests and, mainly because my local WAN IPs are dynamic, I have come to the conclusion that using wireguard is the fastest and simplest solution.
However, I will only activate the endpoints on the local WAN sides, making them connect to the CHR as peers.
This is how it looks now (only tested on the LHGG so far, as the RB4011 is still in use and cannot be upgraded to v7 before the weekend).

I am using a single wg-interface on the CHR and use this as the default gateway for traffic to be sent out via the CHR.
I am unsure as of now, if the access-list is correct that way.



OSPF is definitely applicable but it may be an overkill, plus it requires a virtual tunnel interface, which excludes use of bare IPsec. And it’ll have to run at both the 4011 and the CHR.

If you decide to use a virtual tunnel interface (so IPIP or GRE over IPsec to resolve the NAT, bare L2TP (ridiculous encryption) or L2TP/IPsec, or Wireguard), you can just add static routes with different distance and check-gateway=ping at both ends, so that the tunnel via the primary uplink would be used if available and the LTE would be used if the primary uplink goes down.

Also agree, but from my experience, the handover for WAN is sometimes not instantaneous..I was under the assumption, that OSPF maybe can be faster?



If you want to avoid any problems with mishandled packet fragments, it must be L2TP with MLPPP support activated - some SIP packets may be pretty large.

Hmm, not quite sure what to make of that. Packets larger than the wg mtu will be fragmented, of course, which could call for trouble…is this what you’re saying?
I do not plan to use large media/codecs, only voice and codec alaw for outgoing calls to external providers, so no HD-codecs.



All the above assumes that you can either tell the Asterisk that it is NATed to the IPv4 address of the CHR and that it will forward the RTP, or that the Asterisk registers with the SIP provider and the SIP helper on the CHR will take care of the SIP contents manipulation if in plaintext - but even in this case the Asterisk has to forward the RTP from phones on your LAN, as unless something has changed recently, the SIP helper is unable to deal with RTP coming from another address than the SIP.

I assume Asterisk/freepbx will assume a NATed network in any case. I’ll be using the CHR ipv4 as the static external one. This should work, but I will definitely test.



There is no point in L2-tunneling a VLAN from the 4011 to the CHR as you show on your picture, unless you’d want to bridge the WAN of the CHR with the LAN at your home so that you could make the Asterisk have the public IPv4 address on itself. To do that, you’d have to be able to link the IPv6 addresses to a separate interface on the CHR.

Agree, to that as well.
When I first came up with that idea, I’d wanted all VOIP infrastructure (Server, phones and outbound GW) in the same network. But I see, that this is not feasible (don’t even know if this would work at all, maybe with a VXLAN?).

I also changed the local IP on the CHR-bridge to a single IP. I am going to use this for inbound connections on the CHR, like accessing webfig / ssh over the wg-tunnel, too.

Well… OSPF as such also just prepares a backup route, same as the static routes. What makes the difference in the switchover time is the rate of link transparency checking. check-gateway=ping sends the test pings (or ARP requests) every 10 seconds and this interval cannot be changed, whereas bfd sends 5 packets per second by default and that rate is configurable. But at the moment, bfd is not available in ROS 7 yet.
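As a back-of-the-envelope comparison, the difference can be put into numbers with a few lines of Python. The probe rates are the ones quoted above; the number of lost probes needed before a path is declared dead is an assumption in both cases, so treat the results as orders of magnitude only.

```python
def detection_time_s(probe_interval_s, missed_probes):
    """Rough time to declare a path dead: probe interval times lost probes."""
    return probe_interval_s * missed_probes


# check-gateway=ping: 10 s interval (from the post above);
# two lost probes before the gateway is declared down is an ASSUMPTION.
check_gateway = detection_time_s(10.0, 2)

# bfd: 5 packets per second (from the post above), i.e. a 0.2 s interval;
# a multiplier of 5 lost packets is an ASSUMPTION.
bfd = detection_time_s(1 / 5, 5)

print(f"check-gateway: ~{check_gateway:.0f} s, bfd: ~{bfd:.0f} s")
```

Whatever the exact multipliers are, the gap comes from the probe interval, not from OSPF versus static routing.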


Yes, that’s what I had in mind - fragments are sometimes dropped somewhere on the path between the endpoints. Voice codecs are not the issue, even HD ones, but the signaling packets may be large. You’ll see whether it is an issue or not.
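To get a feeling for when signaling starts to fragment, here is a small Python calculation. The header sizes are the standard IPv4/IPv6/UDP ones plus WireGuard's 32 bytes per data packet (16-byte header and 16-byte authentication tag); the 1500-byte path MTU and the SIP message sizes are assumptions, and an LTE path may well have a smaller MTU.

```python
IPV4_HEADER = 20
IPV6_HEADER = 40
UDP_HEADER = 8
WG_DATA_OVERHEAD = 32   # 16-byte data-message header + 16-byte auth tag


def tunnel_mtu(path_mtu, outer_ipv6=False):
    """Largest inner IP packet that fits in one WireGuard packet on this path."""
    outer_ip = IPV6_HEADER if outer_ipv6 else IPV4_HEADER
    return path_mtu - outer_ip - UDP_HEADER - WG_DATA_OVERHEAD


def needs_fragmentation(payload_bytes, mtu):
    """True if a UDP/IPv4 datagram with this payload exceeds the tunnel MTU."""
    return payload_bytes + UDP_HEADER + IPV4_HEADER > mtu


wg_mtu = tunnel_mtu(1500, outer_ipv6=True)   # conservative: allow for IPv6 outside
print("tunnel MTU:", wg_mtu)

# alaw at 20 ms packetization: 160 bytes of audio + 12 bytes RTP header
print("RTP alaw fragments:", needs_fragmentation(160 + 12, wg_mtu))

# hypothetical SIP INVITE with a long SDP and several Via/Record-Route headers
print("1600-byte INVITE fragments:", needs_fragmentation(1600, wg_mtu))
```

So the audio stays far below the limit regardless of codec, while a single bulky INVITE or 200 OK can cross it.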

Then i’ll just have to test and decide…
Looking at my 2nd, updated pic…I am not quite sure if the dual route scenario works, as the second endpoint, pointing to wg1 on the CHR, which is also the default gw of the VOIP server is not on the same routing table.
The RB4011 will see the wg1 of CHR directly via 10.255.255.2 but not the second route via 10.255.255.3 (which is only reachable via 192.168.881 (IP of LHGG)).
So check gateway of the second route on rb4011 will not work (as gw is always available).
The reason this might still work is, that the second route will always be preferred as second, so only when route 1 is down, it will switch over…so no checking on route 2 is really needed.

Yes, that’s what I had in mind - fragments are sometimes dropped somewhere on the path between the endpoints. Voice codecs are not the issue, even HD ones, but the signaling packets may be large. You’ll see whether it is an issue or not.

Hmm, I see, so no use of trying EOIP over wireguard either :laughing:

Anyway, many thanks for sharing your thoughts and feedback.
Really appreciate it!

If you use a static route on the RB4011, you could define it as a recursive route, so that check-gateway=ping checks the endpoint on the destination side instead of the internal address between the LTE router and the 4011, and so detects whether that route is still working or not.

…I do that for the WAN default routes, but here the endpoint of the destination (the CHR) will be the same for both routes…unsure if this is going to work.

Besides, I think there is a response thread for 7.1.1 here, saying that recursive routes are currently not working…so I am kind of goofed atm.

Another option could be to enable passthrough on the LTE modem interface-channel of the LHGG, making it point to the RB4011, and enable WAN2 (and subsequently both wg links) from there.
I just have the fear of breaking things, and ultimately seeing myself trying to hold a button, that is 2m above while simultaneously trying to not fall off the roof, when performing a netinstall on the LHGG :laughing:

Actually recursive route works in v7.1.1 (I’m using it on my router), but you need to make sure that target-scope is set with a value higher than scope.
This is not taken care of by the migration script, so initially the route is marked as invalid; then you can fix it by setting the target-scope (it just needs to be +1 of scope).
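The scope relation can be sketched as a tiny Python model. It reads "scope" as the scope of the route that actually reaches the gateway address, and it assumes the rule is "a route may resolve the gateway only if its scope does not exceed the recursive route's target-scope"; both points are assumptions, and the addresses, names and values are purely illustrative.

```python
from dataclasses import dataclass


@dataclass
class StaticRoute:
    dst: str
    gateway: str
    scope: int
    target_scope: int


def can_resolve_via(recursive, resolver):
    """Assumed rule: the resolver's scope must not exceed the target-scope."""
    return resolver.scope <= recursive.target_scope


# hypothetical host route that reaches the probe address through a tunnel
host = StaticRoute("10.255.255.1/32", "wg-wan1", scope=10, target_scope=10)

# recursive route whose gateway is that probe address, with target-scope = scope + 1
fixed = StaticRoute("0.0.0.0/0", "10.255.255.1", scope=30, target_scope=11)

# same route with a target-scope below the host route's scope
broken = StaticRoute("0.0.0.0/0", "10.255.255.1", scope=30, target_scope=9)

print(can_resolve_via(fixed, host))
print(can_resolve_via(broken, host))
```

Under that reading, "+1 of scope" is simply a safe way to satisfy the condition.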

Beside your project, I have a similar setup with 2 WANs (one is SXT LTE) and I manage switching between default routes by using recursive route on RB4011; in your case it could be your destination route for your voice system, but the principle is similar.
About the LTE passthrough, I had the same thought as you, but it would just remove one NAT level. And because of the same concern about having to reach the unit on top of the pole, I left this alone.
At the end I have 2 hops from RB4011 to SXTLTE, then modem then Internet, but that’s not an issue if you can live with it.

Ahhh! Very good, thanks for clarifying that to me!


Beside your project, I have a similar setup with 2 WANs (one is SXT LTE) and I manage switching between default routes by using recursive route on RB4011; in your case it could be your destination route for your voice system, but the principle is similar.

Yes I do have that enabled for my WAN setup as well.
Will look into it for the routes to the CHR when I find myself willing to shut the house down (with a bunch of teenagers in it) on the weekend.
Normally I’d offer to send them to cinema to get my hands on the infrastructure but not in these pandemic times.

About the LTE passthrough, I had the same thought as you, but it would just remove one NAT level. And because of the same concern about having to reach the unit on top of the pole, I left this alone. At the end I have 2 hops from RB4011 to SXTLTE, then modem then Internet, but that’s not an issue if you can live with it.

Agree, but why the extra NAT level? As the LHGG will handle that, I turned it off on the main router for that link.

About NAT I was referring to my case where I establish a VPN connection from SXT to CHR (on datacenter) to get around the CGNAT that I have on my LTE connection.
In this case when I need to reach some internal port, I need to open it on CHR first and then on LTE before reaching the RB4011; if I had used passthrough, I would have avoided the second DST NAT on LTE device.
But I decided to manage this double NAT and not touched this passthrough option for now.

Ah, yes..I now understand.
I actually decided to trust the local endpoints of my wg tunnels, so I’d only open ports on the CHR but I’ve decided that there is no actual case any more after wireguard is once enabled on the CHR and a zerotier (running a docker on my NAS in a dedicated VLAN) link is around for other means.

It will not work if you ping the public IP of the CHR, because you have no way to tell the router which WAN interface to use to send the ping requests, as the ping request packets cannot be distinguished from one another by destination address and you cannot do dst-nat in the output chain. But since your actual aim is to check the transparency of the VPN tunnel established via the particular WAN, you need to ping a remote address reachable through that tunnel anyway, not just the public IP of the CHR. So just assign a different private IP to each of the two tunnel interfaces at the CHR, and set up a /32 route to the remote internal address of the primary tunnel with the tunnel as the gateway and another /32 route to that address with a higher distance value via a port-less bridge (I don’t remember how else to mimic a blackhole route in RouterOS 7). So while the tunnel is up, the test IP will be contacted via that tunnel; if the tunnel goes down, there will still be a specific route to that destination which will supersede any less specific one but will effectively drop the packet.
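Why the second /32 matters can be shown with a small Python model of the route lookup (longest prefix first, then lowest distance). The address 10.255.1.1, the interface labels and the `active` flag are all made up for illustration; this models the forwarding decision only and is not RouterOS syntax.

```python
import ipaddress
from dataclasses import dataclass


@dataclass
class Route:
    dst: str
    gateway: str
    distance: int
    active: bool = True


def lookup(routes, dest):
    """Pick the active route with the longest prefix, then the lowest distance."""
    addr = ipaddress.ip_address(dest)
    matching = [r for r in routes
                if r.active and addr in ipaddress.ip_network(r.dst)]
    if not matching:
        return None
    return min(matching,
               key=lambda r: (-ipaddress.ip_network(r.dst).prefixlen, r.distance))


PROBE = "10.255.1.1"   # hypothetical inner address of the primary tunnel on the CHR

via_tunnel = Route(f"{PROBE}/32", "wg-wan1", distance=1)
blackhole = Route(f"{PROBE}/32", "port-less-bridge", distance=2)
default = Route("0.0.0.0/0", "wan2-lte", distance=1)
table = [via_tunnel, blackhole, default]

print(lookup(table, PROBE).gateway)   # tunnel up: probe goes through the tunnel
via_tunnel.active = False             # primary tunnel goes down
print(lookup(table, PROBE).gateway)   # probe is dropped, not leaked via the default
```

Without the blackhole entry the probe would follow the default route out of the other WAN while the tunnel is down, which is exactly the ambiguity the /32 pair is meant to remove.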