I have a Mikrotik RB5009UG at home connected to 2 ISPs: Claro and Nio. One of the ISPs (Claro) allows a bridge connection to the Internet, so on this interface I get valid IPv4 and IPv6 addresses, as well as a /64 PD to use internally for my devices. I am able to assign an IPv6 address from the pool to the internal LAN interface and configure the Mikrotik to advertise SLAAC on the internal LAN.
The other ISP (Nio) does not allow a bridge connection, forcing me to use its CPE as the router to the Internet. On this external interface I get a reserved IPv4 address and a valid IPv6 address, but no PD. The CPE has the /56 PD.
The problem is that when Claro ISP is out of order, all my internal devices lose IPv6 connectivity. Is there any way to configure the Mikrotik DHCPv6 Relay in order to allow internal devices to receive IPv6 addresses from BOTH ISPs?
Recent RouterOS versions have support for proxying of ND, which allows you to "proxy" SLAAC. However, the ND Proxy entries are all static, which means you need some heavy use of scripting to deal with dynamic prefixes and client interface identifiers. The interface identifier part can be dealt with by using DHCPv6 on the LAN side, but scripting is still required to update the entries on prefix changes, and Android devices, for example, do not support address assignment through DHCPv6.
So a simpler way is to accept that you'll need to use NAT for connections going out of "Nio". You can proceed like this:
Make sure the default route provided by "Nio" has higher distance, so it's used as failover only (you might need to specify the route manually, while excluding the "Nio" from the Accept Router Advertisements interface list).
Address assignment on the LAN interface still uses the prefix information from the pool provided by "Claro".
Additionally, announce a static ULA prefix on the LAN interface. You can use this site to get a random ULA prefix: Unique Local IPv6 Generator.
Add a SRCNAT masquerade rule for out-interface=Nio, similar to what we normally have on the IPv4 side. For the outgoing interface Claro, no NAT is applied.
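If you would rather not depend on a website for the static ULA prefix mentioned in the steps above, a few lines of Python can produce one. This is only a minimal sketch: the function names random_ula_prefix and lan_subnet are made up for illustration, and it draws the 40-bit Global ID from the secrets module instead of the hash-based procedure that RFC 4193 suggests.

```python
import ipaddress
import secrets


def random_ula_prefix() -> ipaddress.IPv6Network:
    """Return a random /48 ULA prefix: fd00::/8 followed by a 40-bit Global ID."""
    global_id = secrets.randbits(40)
    # Top 8 bits are 0xfd, the next 40 bits are the Global ID, the rest is zero.
    value = (0xFD << 120) | (global_id << 80)
    return ipaddress.IPv6Network((value, 48))


def lan_subnet(prefix48: ipaddress.IPv6Network, subnet_id: int = 0) -> ipaddress.IPv6Network:
    """Carve one /64 for a LAN out of the /48 using a 16-bit Subnet ID."""
    if not 0 <= subnet_id < 2**16:
        raise ValueError("subnet_id must fit in 16 bits")
    value = int(prefix48.network_address) | (subnet_id << 64)
    return ipaddress.IPv6Network((value, 64))


if __name__ == "__main__":
    ula = random_ula_prefix()
    print("ULA /48:", ula)
    print("LAN /64:", lan_subnet(ula, 1))
```

The /64 printed at the end is what you would put on the LAN interface with advertising enabled. Generate it once and keep it, since the point of the ULA prefix is that it never changes.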
As a result of the above configuration:
When "Claro" is working, devices get addresses from both the public GUA prefix of "Claro" and the static ULA prefix. Due to precedence, they will prefer the GUA address for outgoing connections.
When "Claro" is working, its default route has lower distance and is therefore the active default route, so normal IPv6 communication with the internet works for the LAN clients.
When "Claro" is down, client devices might lose the GUA addresses or not, it's not important. If "Claro" is only recently down, they might still be using the GUA addresses as the preferred ones. If it's down for longer, they'll only have the ULA addresses and use them.
When "Claro" is down, the default route uses the "Nio" gateway and interface. Due to the presence of the SRCNAT rule, regardless of the IP addresses used by the clients (old GUA or static ULA prefix), the source address will be properly translated to an address with a prefix accepted by "Nio". Client devices will still have IPv6 connectivity, albeit with the downsides of NAT.
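To illustrate why clients holding both a GUA and a ULA address pick the GUA for internet destinations, here is a rough sketch of the label-matching rule from RFC 6724 source address selection. It is a simplification: only two entries of the default policy table are included, the other rules are ignored, and the addresses and the function names label and pick_source are invented for the example.

```python
import ipaddress

# Two entries from the RFC 6724 default policy table: (prefix, label).
POLICY = [
    (ipaddress.ip_network("::/0"), 1),       # everything else, including GUA
    (ipaddress.ip_network("fc00::/7"), 13),  # ULA
]


def label(addr: str) -> int:
    """Return the label of the longest policy prefix that contains addr."""
    a = ipaddress.ip_address(addr)
    matches = [(net, lbl) for net, lbl in POLICY if a in net]
    return max(matches, key=lambda m: m[0].prefixlen)[1]


def pick_source(dest: str, candidates: list[str]) -> str:
    """Prefer a candidate whose label equals the destination's label."""
    wanted = label(dest)
    matching = [c for c in candidates if label(c) == wanted]
    # If nothing matches, fall back to whatever address is available.
    return (matching or candidates)[0]


if __name__ == "__main__":
    gua = "2001:db8:1::10"        # hypothetical address from the Claro prefix
    ula = "fd12:3456:789a::10"    # hypothetical address from the static ULA prefix
    internet_host = "2606:4700:4700::1111"
    print(pick_source(internet_host, [ula, gua]))  # GUA while Claro is up
    print(pick_source(internet_host, [ula]))       # ULA once the GUA is gone
```

In the second case the ULA source only works towards the internet because of the masquerade rule on the "Nio" interface.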
What happens when “Claro ISP is out of order”: does the interface go down? Clients need to learn that their address has become unroutable.
@CGGXANNX Do you think it would help to toggle a firewall rule to reject packets on new and established connections with old GUA with an appropriate ICMPv6 error?
For new connections (those that have no existing entries in the connection tracking table) it's not needed because:
When the default route changes to "Nio" because "Claro" went down (either the interface is down or OP implements recursive routing checks), and clients initiate new connections to the outside, then it doesn't matter. The connections will have a source address that is not valid for "Nio" anyway, and this new connection (connection-state=new) will go through the SRCNAT rule that masquerades the source address into the address on the "Nio" interface. So this works normally.
When the default route changes back to "Claro" being the active route, then:
If it was because the interface previously went down and is now up (interface state change), then normally there is a process that obtains the prefix from "Claro" (probably via DHCPv6). This prefix is then advertised to the LAN clients, which means that within a few seconds the clients will have IP addresses within this new valid prefix. Because the prefix is GUA and the addresses are marked as preferred, new connections will use them as source address right away. So there is no real issue (save for the special case of connections being initiated within the few seconds before the prefix is advertised), because the GUA addresses have the "Claro" prefix and will go out through the "Claro" link.
If it was because a recursive route check detects that the route is reachable again, and the interface never really went down, then everything will work too, because the client devices all still have the GUA addresses belonging to "Claro", which have not changed (remember, the route went down because some ping checks failed, maybe due to congestion, not because of a real loss of the network connection). So the new connections all still use the "Claro" prefix for the source addresses.
The really relevant issues are with established connections, those that already have an entry in the connection tracking table. As NAT (or no NAT) is already decided with the first packet of the connection, the NAT rules are not checked again for the rest of the packets (when connection-state is no longer new). When the default route changes and the clients keep sending packets for the same connection, the packets will go out of the new interface with source addresses that are wrong for that interface.
Those packets will most probably be dropped by the ISP (because of the bogus source addresses). If the ISP is sloppy and forwards the packets, then once they arrive at the destination, the destination hosts might send back responses (because the source addresses are the same as in the previous packets, so they are still valid from the viewpoint of the remote hosts), but those responses will have the no longer usable destination addresses and will not come back to the router. The connection will time out. After the timeout, the clients' applications will probably retry with new connections. These new connections should work normally.
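The behaviour described above for established connections can be mimicked with a toy model. This is not how RouterOS is implemented, just a sketch of the idea that the NAT decision is made once, on the first packet, and stored with the conntrack entry. The class ToyRouter and the connection names are invented for the example.

```python
class ToyRouter:
    """Toy model: the NAT decision is stored per connection on its first packet."""

    def __init__(self, nat_interfaces):
        self.nat_interfaces = set(nat_interfaces)  # interfaces with a masquerade rule
        self.active_wan = None                     # interface of the active default route
        self.conntrack = {}                        # connection id -> was source NAT applied?

    def send(self, conn_id):
        """Forward one packet; return (outgoing interface, kind of source address)."""
        if conn_id not in self.conntrack:
            # connection-state=new: the NAT rules are evaluated exactly once, here.
            self.conntrack[conn_id] = self.active_wan in self.nat_interfaces
        natted = self.conntrack[conn_id]
        return (self.active_wan, "masqueraded" if natted else "client address")


if __name__ == "__main__":
    router = ToyRouter(nat_interfaces=["Nio"])

    router.active_wan = "Claro"
    print(router.send("ssh-1"))    # started on Claro, no NAT

    router.active_wan = "Nio"      # failover happens
    print(router.send("ssh-1"))    # same connection: leaves via Nio with the Claro GUA as source
    print(router.send("https-2"))  # new connection: hits the masquerade rule
```

The second call is the problematic case: the packet leaves through "Nio" still carrying the untranslated client address, so the connection stalls until it times out, while the new connection in the third call works.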
If you write scripts that detect the default route change and turn on some firewall rules that catch these packets (the rules must be placed above the normal "accept established,related" rule, because they have to catch the "established" connections too, and of course fasttrack must not be active), then you can send back an ICMPv6 error as you said. That would close the connection immediately, instead of having to wait for the timeout.
Whether the added complexity outweighs the inconvenience of having things "hang" for a while before timing out is your decision. Personally, I'll probably implement nothing and suffer the temporary interruption caused by the timeouts of ongoing connections.
In fact, when you announce 2 different prefixes via SLAAC, a reasonably designed device will pick 2 IPv6 addresses from those (or more when you also have privacy extensions enabled), and when making an outgoing connection it will use the IPv6 address “closest” to the address it connects to.
That can work as a primitive kind of load balancing reasonably well, or not, depending on what IPv6 addresses you happen to get from the ISPs. E.g. at work we get a 2001: address from one ISP and a 2a02: address from the other, and it works.
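A sketch of what “closest” means here, assuming it refers to the longest-matching-prefix comparison used in source address selection: the client counts how many leading bits each of its addresses shares with the destination and picks the one with the most. The addresses below are made up (only the 2001: and 2a02: beginnings come from the post), and the real rule also caps the comparison at the source prefix length, which this simplification ignores.

```python
import ipaddress


def common_prefix_len(a: str, b: str) -> int:
    """Number of leading bits that two IPv6 addresses have in common."""
    diff = int(ipaddress.IPv6Address(a)) ^ int(ipaddress.IPv6Address(b))
    return 128 - diff.bit_length()


def closest_source(dest: str, sources: list[str]) -> str:
    """Pick the source address sharing the longest prefix with dest."""
    return max(sources, key=lambda s: common_prefix_len(s, dest))


if __name__ == "__main__":
    # Hypothetical addresses a client got from the two advertised prefixes.
    sources = ["2001:db8:1::10", "2a02:db8:1::10"]
    for dest in ("2001:4860:4860::8888", "2a00:1450:4001::200e"):
        print(dest, "->", closest_source(dest, sources))
```

Which provider a given destination ends up on is therefore just an accident of how the address bits fall, which is why it only works as a primitive kind of load balancing.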
Of course this is only for load balancing, not for failover. What happens when one of the links is down is entirely up to the client device. However, as we have static addresses, I was able to set up “NAT” (netmap) in such a way that when one connection is down and traffic with one provider’s address range is routed via the other, the source addresses are appropriately translated. With dynamic addresses that is a bit trickier.
In normal operation there is no NAT, only during failover. And you also avoid issues with some software deliberately avoiding the use of any special address, including ULA.
As Claro ISP allows me to connect to the Internet via bridging, when they have a problem, both IPV4 and IPV6 routes become “not active”.
In this situation, as the Nio connection is via its ONU in router mode, my internal devices have only IPv4 connectivity via “double NAT”: the internal Mikrotik LAN (192.168.50.0/24) and the “external LAN” (192.168.254.0/24, internal to Nio’s ONU).
Do you know if deleting connections with a “deprecated” address via /ipv6/firewall/connection can help? I never needed to use it, but my understanding is that it’s effectively a TCP RST, which is the next right thing to do.
add ::1/64 from static pool as static address for internal lan interface, with advertising
add SRCNAT rule at IPV6 Firewall, specifying the ULA prefix as src-address and out-interface=Nio.
So far, so good. Internal PCs now have 2 IPv6 addresses, and if I shut down the Claro interface, IPv6 starts to flow through the masquerade rule and the Nio interface.
I would not restrict the source address for this SRCNAT rule, and allow any src-address instead, keeping only the out-interface=Nio condition. That way it would also work for the situation where the client devices have not lost their GUA yet (not yet deprecated) and were still using the address with the public prefix (from Claro) as source address for their outgoing connections.
I just did a test for this scenario and no, TCP RST is not sent. The test setup was:
PC running Wireshark + SSH client (IPv6) <-> Test Router running 7.23rc1 <-> Main Router as SSH Server (IPv6)
On the PC I use SSH to connect to the Main Router and run /system resource monitor, so that there is constant traffic.
On Test Router, the IPv6 conntrack entry appears for the forwarded connection.
I delete the conntrack entry on the Test Router.
No TCP RST packet is sent or received, the SSH connection is kept open.
However, the output of /system resource monitor freezes. The reason is that the packets sent back by Main Router, when they arrive at Test Router, cannot be associated with an established connection and are dropped because they come from the WAN side.
If I type something in the SSH window, the PC sends packets over the connection.
This immediately creates a new conntrack entry on the Test Router, because of the default connection tracking setting loose-tcp-tracking=yes:
The ACK flag in the packet is enough for the new conntrack to be created, without the need for seeing a previous packet with SYN!
However, these packets obviously have the wrong sequence number (because in the meantime the server's side had sent multiple packets that were dropped, increasing the sequence number).
The PC is notified about this, the sequence number is resynchronized, and the connection resumes normally.
SSH connection was not interrupted.
The conclusion is that the client will not be notified of anything if we go to the conntrack table and remove entries belonging to the obsolete address, because no TCP RST is sent, and also no ICMPv6 message.
Edit: I did further tests with loose-tcp-tracking=no and, as expected, the conntrack entry is not recreated if I type something in the SSH window. As a result, the response packets are still dropped, and after a few tries the SSH client on the PC closes the connection.
This is however not relevant for OP's scenario, because there the remote side is cut off anyway (WAN down), so even with a recreated conntrack entry there would still be no responses, and the SSH client would still close the connection after a few tries. This means there is no need to set loose-tcp-tracking=no.