Problems with WAN Failover

Hello everyone,

At the company where I work, we’re starting to professionalize our network. One of the things we want to improve is reliability, so we’ve contracted two ISPs and are trying to configure failover following these instructions:

Brief summary of our setup:

We currently have two ISPs: one works via PPPoE and the other simply uses IPoE. We also have two routing tables, each with an outbound route through a different provider, and we use routing rules so that, depending on the VLAN, traffic goes out through one ISP or the other.

I added the routes to each table as indicated in the link, changing the gateway so that in the PPPoE case I specify the interface instead of an IP. Here are the routes I added to one of the tables:

/ip/route/
add dst-address=1.1.1.1 scope=10 gateway=pppoe-xxx routing-table=rtable-testing
add dst-address=208.67.222.222 scope=10 gateway=x.x.x.x routing-table=rtable-testing

add distance=1 gateway=1.1.1.1 target-scope=11 check-gateway=ping routing-table=rtable-testing
add distance=2 gateway=208.67.222.222 target-scope=11 check-gateway=ping routing-table=rtable-testing

As soon as I enable these routes, I immediately lose internet connectivity. My first thought was that specifying the interface as the gateway (pppoe-xxx) instead of an IP might not be supported, but traffic also doesn’t switch to the backup gateway. Has anyone experienced the same issue?

If you’re wondering why I specify the interface instead of the PPPoE IP, it’s because when I run /interface pppoe-client print detail, it doesn’t show an IP:

/interface pppoe-client print detail

Flags: X - disabled, I - invalid; R - running
0  R ;;; HGU-1 (SFP+2)
name="pppoe-xxx" max-mtu=auto max-mru=auto mrru=disabled interface=sfp-sfpplus2 user="user" password="pass" profile=default keepalive-timeout=10 service-name="" ac-name=""
add-default-route=no dial-on-demand=no use-peer-dns=no allow=pap,chap,mschap1,mschap2

If you have any idea what might be causing this (or any advice), you’d be doing me a huge favor.

Thanks a lot!

It's too difficult, for me at least, to be of assistance without seeing the config; a diagram also helps.
/export file=anynameyouwish (minus router serial number, any public WAN IP information, keys, DHCP lease lists)

Hi anav,

Thank you for your reply.

I made a quick diagram to explain what I’m trying to achieve:

We have different VLANs, and depending on which one it is, it will use either the main routing table (which goes out through ISP1) or the secondary routing table (which goes out through ISP2).

The idea is that each table should have routes that switch to the other ISP if its assigned one goes down. The commands I shared earlier correspond to what would be the secondary table. Once it works, I would do the same for the main table, but the other way around.

I still can’t upload files since I’m a new user, so I’m pasting the configuration, reduced as much as possible. I hope I did it right.

# model = CCR2004-1G-12S+2XS
# serial number = 
/interface bridge
add admin-mac= auto-mac=no comment=defconf name=bridge \
    port-cost-mode=short vlan-filtering=yes
add comment=OSPF name=lo port-cost-mode=short
/interface ethernet
set [ find default-name=sfp-sfpplus1 ] comment=WAN1
set [ find default-name=sfp-sfpplus2 ] comment="WAN2"
set [ find default-name=sfp-sfpplus4 ] auto-negotiation=no speed=\
    1G-baseT-full
set [ find default-name=sfp-sfpplus12 ] comment=
/interface vlan
add interface=bridge name=vlan-xxx vlan-id=80
... many vlans...
/interface bonding
add disabled=yes mode=802.3ad name=bond-jtr0ag1 slaves=\
    sfp-sfpplus10,sfp-sfpplus11
/interface pppoe-client
add comment="PPPoE-1 (SFP+2)" disabled=no interface=sfp-sfpplus2 name=\
    pppoe-xxx user=user
/interface list
add comment=defconf name=WAN
add comment=defconf name=LAN
add name=LAN-VLANs
/interface lte apn
set [ find default=yes ] ip-type=ipv4 use-network-apn=no
/interface wireless security-profiles
set [ find default=yes ] supplicant-identity=MikroTik
/ip dhcp-server option
add code=66 name=voip-option66-provision value=\
    "'privider data'"
add code=2 name=voip-option2-gmtoffset value=0x00000e10
/ip dhcp-server option sets
add name=-voip-dhcp-set options=\
    voip-option66-provision,voip-option2-gmtoffset
/ip pool
... many ip pools ...
/ip dhcp-server
...
/ip smb users
set [ find default=yes ] disabled=yes
/port
set 0 name=serial0
set 1 name=serial1
/routing ospf instance
...
/routing table
add disabled=no fib name=rtable-testing
add disabled=no fib name=rtable-secondary
/system logging action
set 0 memory-lines=3000
set 1 disk-lines-per-file=3000
/user group
add name=xxx policy=""
/interface bridge port
add bridge=bridge ingress-filtering=no interface=sfp-sfpplus3 \
    internal-path-cost=10 path-cost=10
add bridge=bridge ingress-filtering=no interface=sfp-sfpplus4 \
    internal-path-cost=10 path-cost=10
add bridge=bridge ingress-filtering=no interface=sfp-sfpplus5 \
    internal-path-cost=10 path-cost=10
add bridge=bridge ingress-filtering=no interface=sfp-sfpplus6 \
    internal-path-cost=10 path-cost=10
add bridge=bridge ingress-filtering=no interface=sfp-sfpplus7 \
    internal-path-cost=10 path-cost=10
add bridge=bridge ingress-filtering=no interface=sfp-sfpplus8 \
    internal-path-cost=10 path-cost=10
add bridge=bridge ingress-filtering=no interface=sfp-sfpplus9 \
    internal-path-cost=10 path-cost=10
add bridge=bridge interface=sfp-sfpplus10
add bridge=bridge interface=sfp-sfpplus11
add bridge=bridge ingress-filtering=no interface=sfp-sfpplus12 \
    internal-path-cost=10 path-cost=10
add bridge=bridge interface=sfp28-1 internal-path-cost=10 path-cost=10
add bridge=bridge interface=sfp28-2 internal-path-cost=10 path-cost=10
/ip firewall connection tracking
set udp-stream-timeout=11m udp-timeout=10s
/ip neighbor discovery-settings
set discover-interface-list=LAN
/ipv6 settings
set disable-ipv6=yes max-neighbor-entries=8192 soft-max-neighbor-entries=8191
/interface bridge vlan
add bridge=bridge comment=work tagged="bridge,sfp-sfpplus3,sfp-sfpplus4,sf\
    p-sfpplus5,sfp-sfpplus6,sfp-sfpplus7,sfp-sfpplus8,sfp-sfpplus9,sfp-sfpplus\
    10,sfp-sfpplus11,sfp-sfpplus12,sfp28-1,sfp28-2" vlan-ids=30
... many rules
/interface list member
add comment=defconf interface=bridge list=LAN
add comment=defconf interface=sfp-sfpplus1 list=WAN
add interface=pppoe-xxx list=WAN
add interface=ether1 list=LAN
add interface=sfp-sfpplus12 list=LAN

/ip address
add address=x.x.x.x/24 comment=defconf interface=ether1 network=x.x.x.x
add address=x.x.x.x/24 comment=defconf interface=bridge network=\
    192.168.88.0
... vlan interfaces addresses...
add address=x.x.x.x/29 comment=WAN1 interface=sfp-sfpplus1 network=\
    x.x.x.x
/ip dhcp-client
# Interface not active
add interface=ether1
/ip dhcp-server lease
... DHCP leases ...
/ip dhcp-server network
...
/ip dns
set servers=x.x.x.x,x.x.x.x
/ip dns static
add address=x.x.x.x comment=defconf name=router.lan type=A
/ip firewall address-list
add address=x.x.x.x/24 list=...
add address=x.x.x.x/24 list=...
/ip firewall nat
add action=masquerade chain=srcnat comment=\
    "SRC-NAT -> MASQUERADE PARA FAILOVER" out-interface=sfp-sfpplus1
add action=src-nat chain=srcnat disabled=yes out-interface=\
    pppoe-xxx to-addresses=x.x.x.x
add action=masquerade chain=srcnat comment=\
    "SCR-NAT -> MASQUERADE FOR FAILOVER" out-interface=\
    pppoe-xxx
/ip firewall service-port
set ftp disabled=yes
set sip disabled=yes
set pptp disabled=yes
/ip ipsec profile
set [ find default=yes ] dpd-interval=2m dpd-maximum-failures=5
/ip route
add comment="Disable when enable failover" disabled=no distance=1 \
    dst-address=0.0.0.0/0 gateway=pppoe-xxxx routing-table=\
    rtable-secondary scope=30 suppress-hw-offload=no target-scope=10
add comment="Disable when enable failover" disabled=no distance=1 \
    dst-address=0.0.0.0/0 gateway=IPISP1 routing-table=main scope=30 \
    suppress-hw-offload=no target-scope=10
add comment="TEST - PreFAILOVER" disabled=yes distance=1 dst-address=\
    0.0.0.0/0 gateway=pppoe-xxx routing-table=rtable-testing \
    scope=30 suppress-hw-offload=no target-scope=10
add comment="TEST TABLE: first gw via ISP2" disabled=no dst-address=1.1.1.1 \
    gateway=pppoe-hgu-xxx routing-table=rtable-testing scope=10
add comment="TEST TABLE second gw via ISP1" disabled=no dst-address=\
    208.67.222.222 gateway=IPISP1 routing-table=rtable-testing scope=10
add check-gateway=ping comment="TEST TABLE: Check ISP2" disabled=no distance=1 \
    gateway=1.1.1.1 routing-table=rtable-testing target-scope=11
add check-gateway=ping comment="TEST TABLE: Check ISP1" disabled=no \
    distance=2 gateway=208.67.222.222 routing-table=rtable-testing \
    target-scope=11
/ip service
set telnet disabled=yes
set ftp disabled=yes
/ip smb shares
set [ find default=yes ] directory=/pub
/routing ospf interface-template
... OSPF config ...
/routing rule
add action=lookup comment="Internal subnets traffic only look at main table" \
    disabled=no dst-address=x.x.x.x/8 table=main
add action=lookup comment="Management LAN" disabled=no dst-address=\
    192.168.x.x/24 table=main
add action=lookup comment="Test Aitor" disabled=no src-address=x.x.x.x/32 \
    table=rtable-testing
add action=lookup comment="Vlan 38 ->ISP2 PPPOE" \
    disabled=yes src-address=x.x.x.x/16 table=rtable-ISP2
add action=lookup comment="Vlan 90 ->ISP2 PPPOE" \
    disabled=yes src-address=x.x.x.x/16 table=rtable-secondary
add action=lookup comment="Vlans 80-83 -> ISP2 PPPOE" disabled=\
    no src-address=x.x.x.x/14 table=rtable-ISP2
add action=lookup comment="Test" disabled=yes src-address=\
    x.x.x.x/32 table=rtable-ISP2
/system clock
set time-zone-name=Europe/Madrid
/system identity
set name=router
/system note
set show-at-login=no
/system ntp client
set enabled=yes
/system ntp client servers
add address=es.pool.ntp.org
/system routerboard settings
set enter-setup-on=delete-key
/tool graphing interface
add allow-address=x.x.x.x/24
add disabled=yes interface=sfp-sfpplus1 store-on-disk=no
/tool graphing resource
add allow-address=192.168.88.0/24
/tool mac-server
set allowed-interface-list=LAN
/tool mac-server mac-winbox
set allowed-interface-list=LAN
/tool netwatch
add comment="[Test] Check ISP2" disabled=no down-script=\
    ":log warning \"ISP2 DOWN\";" host=8.8.8.8 name="Check ISP2" \
    src-address=x.x.x.x test-script="" type=icmp up-script=\
    ":log warning \"ISP2 UP\";"
add comment="[TEST] Check ISP1" disabled=no down-script=\
    ":log warning \"ISP1 DOWN\";" host=8.8.8.8 name="Check ISP1" \
    src-address=x.x.x.x test-script="" type=icmp up-script=\
    ":log warning \"ISP1 UP\";"

Thanks!

Just to be clear: ISP1 is the primary WAN, which all users and devices (assuming more than just two VLANs) go to. ISP2 is a secondary WAN (and a backup WAN in case ISP1 goes down), and only users on vlan20 should normally use it. If this WAN goes down, then users from vlan20 would be moved to WAN1.

Is there a difference in speed between ISP1 and ISP2, or a different cost to use them?
In other words, why not load balance users so that both links are available to everyone at the same time?
What is driving the setup?

There are many small config errors to work through even before tackling the bigger questions.
I appreciate you trying to shorten the config, but length is not a problem; accuracy is what matters most.

Observations so far:

  1. You added the interface list LAN-VLANs (why not just keep LAN, since it refers to all the LAN subnets?). This also cascades downwards, as many default rules assume the admin will use LAN.
    For example, the change causes an error in /ip neighbor discovery-settings, which now refers to a non-existent interface list! My advice: there is no need to get cute with the default nomenclature. This also affects /tool mac-server and /tool mac-server mac-winbox.
  2. Unless you have hybrid ports (one untagged VLAN and many tagged VLANs), all other ports are either trunk ports (carrying only tagged VLANs) or access ports (passing one untagged VLAN), and they should all have:
    A. ingress-filtering=yes
    B. frame-types=admit-only-vlan-tagged (for trunk ports)
    frame-types=admit-only-untagged-and-priority-tagged (for access ports)
  3. Interface list members: there is no need to list the bridge; only the VLANs need to be listed.
  4. Part of that discussion is that once you go to VLANs, it's much easier, less confusing, cleaner and better practice not to assign any subnets to the bridge, so that it only does bridging. Thus, if you have any subnet attached to the bridge, create another VLAN for it and detail any ports or VLANs appropriately in the /interface bridge settings. Some minor changes are of course required in addresses, DHCP server, etc.

In other words, get rid of the 88 network!
5. There seems to be an issue with your ether1. You gave it an IP address, so am I to assume WAN1 has a private WAN IP and not a public one? What about ISP2/WAN2, is that public? This affects what we do about the primary/secondary and WireGuard setup.

Note: For WAN connections it is either/or: use a static IP address OR the IP DHCP client (or PPPoE settings), not both, so decide for ether1!
6. You can get rid of the default static DNS setting.
7. Why do you put quotation marks around the /interface bridge vlan entries?
8. It is also hard to make progress when no firewall rules or mangle rules are shown.
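
As a rough sketch of points 2 and 3 above (not a drop-in config): the port roles here are assumptions, with sfp-sfpplus3 treated as a trunk and sfp-sfpplus5 as an access port on a hypothetical pvid=30. vlan-xxx is the VLAN interface name from the export.

```
/interface bridge port
# Trunk port (assumed role): accept only tagged frames
set [ find interface=sfp-sfpplus3 ] ingress-filtering=yes frame-types=admit-only-vlan-tagged
# Access port (assumed role, hypothetical pvid): accept only untagged or priority-tagged frames
set [ find interface=sfp-sfpplus5 ] ingress-filtering=yes frame-types=admit-only-untagged-and-priority-tagged pvid=30

/interface list member
# List the VLAN interfaces, rather than the bridge, as LAN members
add interface=vlan-xxx list=LAN
```

Changing frame types can cut off management access if a port role is guessed wrong, so it is probably worth applying this in Safe Mode and one port at a time.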

You need to put these two routes in the main table, not the rtable-testing table.

Sorry for the late reply, yesterday was crazy for me.

@anav

Yes, ISP1 is the primary WAN, which is contracted with a provider that offers us better service (better SLA, etc.). Most workstations and a few VLANs route their traffic through it.

ISP2 is a cheaper link through which the majority of user computers are routed. We achieve that using two routing tables, each pointing to a different ISP, and routing rules that determine which VLAN queries each table. That’s the idea behind the current configuration.

Load balancing can probably be improved quite a bit, and it’s something we’ll work on once I’ve managed to get failover working.

The idea behind the failover configuration I mentioned in the first post is to set it up in each routing table so that if the ISP assigned to that table fails, it automatically uses the other one. I’m not sure if I’m explaining it clearly — that’s what I tried to illustrate in the diagram.

Regarding all the observations you made: thank you very much for taking the time and for the feedback. I’ll keep refining the configuration using the advice you gave me.
Let’s see if I have time today to upload a less redacted version of the config while still removing sensitive information.

@CGGXANNX

First of all, thanks for your comment.

Can failover only be configured in the main table? In this case, I set it up in rtable-testing since that’s the table I’m currently using for testing — I don’t want to touch the table that’s in production right now. If it can only be configured in the main table, I’ll have to completely rethink how we handle load balancing.

Thanks to both of you!

No, of course it works in other tables too. Of these 4 routes:

The first two must be in the main table, while the other two, the default routes with dst-address=0.0.0.0/0, can be in the rtable-testing table.

You only have to put those last two routes in the rtable-testing table to achieve what you need. The first two routes, the ones whose dst-address points to the intermediate fake gateway addresses, MUST be in the main table.

The reason is that when you specify gateway=1.1.1.1 and gateway=208.67.222.222, the next-hop lookup that works out how 1.1.1.1 or 208.67.222.222 can be reached only uses routes in the main table.

Your problem comes from the fact that you put the routes that tell the router how to reach 208.67.222.222 and 1.1.1.1 in the rtable-testing table, where they are completely useless.


To recapitulate, this is how you should specify the 4 routes:

/ip route
add dst-address=1.1.1.1 scope=11 target-scope=10 gateway=pppoe-xxx routing-table=main
add dst-address=208.67.222.222 scope=11 target-scope=10 gateway=x.x.x.x routing-table=main

add distance=1 gateway=1.1.1.1 scope=30 target-scope=11 check-gateway=ping routing-table=rtable-testing
add distance=2 gateway=208.67.222.222 scope=30 target-scope=11 check-gateway=ping routing-table=rtable-testing
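
To check whether the next-hop lookup resolves once the host routes are in main, something like the following may help. It only reads state and uses the addresses and table name from this thread; the exact output will depend on the RouterOS version.

```
# Host routes to the two probe addresses: both should be active in the main table
/ip route print detail where dst-address=1.1.1.1/32
/ip route print detail where dst-address=208.67.222.222/32

# Default routes in the test table: with both ISPs up, the distance=1 route should be the active one
/ip route print detail where routing-table=rtable-testing
```

If the default routes in rtable-testing still show as inactive or unreachable, that would suggest the gateway address is not being resolved through main.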

I am confused in that you mention a few VLANs and most workstations for ISP1,
and then later the majority of user computers for ISP2.

It seems we need a better understanding of what you are trying to achieve, and you need a much clearer breakdown of your requirements. It should be as succinct as: "I have 10 VLANs; 5 need to use ISP1 as primary and 5 need to use ISP2 as primary, with failover to the other WAN if needed."
vlan1 - usergroup A -> ISPx
vlan2 - usergroup B -> ISPy
...
vlan10 - usergroup J -> ISP?

where the usergroups can be "CEO", "executives", "admin", "workstations-billing", "workstations-production", "computers-administration", "printers-all", "Work WIFI", "Guest WIFI", etc.
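
Once that breakdown exists, each line should map fairly directly onto one routing rule. A sketch, where the subnets are purely hypothetical placeholders and rtable-testing is the test table from the export; VLANs that should prefer ISP1 are assumed to fall through to the main table and need no rule:

```
/routing rule
# vlan2 - usergroup B -> ISP2 first, via the test table (hypothetical subnet)
add action=lookup src-address=10.0.2.0/24 table=rtable-testing comment="vlan2 -> ISP2"
# vlan3 - usergroup C -> ISP2 first (hypothetical subnet)
add action=lookup src-address=10.0.3.0/24 table=rtable-testing comment="vlan3 -> ISP2"
```

Rules are evaluated in order, so these would presumably need to sit below the existing rules that keep internal-subnet traffic in main.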
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

You failed to answer which of your ISPs provides a public IP address or at least a public IP to an upstream ISP router/modem where you can access and forward ports.