Hi,
A feature with ROS 6 is that in the event of failover or any other rerouting, ROS changes routing for established connections but does not reassess any NAT. That means that in the case of Internet failover, a connection formerly using ISP A but now failed over to ISP B still goes out with a reply-to address from A. Of course that doesn’t work, so we rely on things timing out so that a new connection is created with the correct NAT.
I’d be delighted if this was due to a configuration error, as it’s quite a problem for failover of UDP functions. When I first found and posted about the issue, the general view seemed to be that it’s just how it works.
Once an entry exists in the connection tracking table, along with any associated NAT flags, that’s it until the connection is closed. UDP connection tracking is tricky - as there is no equivalent of the TCP FIN flag to peek at, the “connection” is deemed to exist until no matching packets are seen for a certain period. For some services, typically SIP and NAT-T IPsec, the client keeps on sending packets so the connection tracking table entry is never cleared when the WAN changes.
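To make that concrete, here is a toy Python model of the behaviour described above. It is only a sketch, not how RouterOS is implemented: the class and field names are made up, a single idle timeout stands in for the real timers, and "wan-A"/"wan-B" are placeholder addresses.

```python
from dataclasses import dataclass


@dataclass
class UdpEntry:
    nat_src: str       # translated source address, chosen once at creation
    expires_at: float  # time at which the idle entry is dropped


class ToyConntrack:
    """Hypothetical model: one idle timeout, NAT fixed per entry."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.entries = {}

    def packet(self, flow, now, current_wan_ip):
        """Process one packet and return the source address it leaves with."""
        entry = self.entries.get(flow)
        if entry is not None and now >= entry.expires_at:
            entry = None  # idle too long: the old entry has gone
        if entry is None:
            # New "connection": NAT is evaluated against the current WAN.
            entry = UdpEntry(current_wan_ip, now + self.timeout)
            self.entries[flow] = entry
        else:
            # Existing entry: the timer is refreshed, NAT is not re-evaluated.
            entry.expires_at = now + self.timeout
        return entry.nat_src


ct = ToyConntrack(timeout=30)
flow = ("172.17.1.210", 5060, "203.0.113.10", 5060)  # example addresses

# Keep-alive every 20 s; the WAN changes from A to B at t=50.
for t in (0, 20, 40, 60, 80):
    wan = "wan-A" if t < 50 else "wan-B"
    print(t, ct.packet(flow, t, wan))
```

Every packet in the loop leaves with "wan-A", including those sent after the change, because each keep-alive arrives before the entry expires. Only a gap longer than the timeout lets the next packet create a fresh entry with the new address.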
Well, it sounds like you’re asking different things.
If a connection goes down, then that session effectively dies a slow death for sure…
The degree to which the new connections smoothly transition to the other WAN is always a work in progress.
However, I don’t agree with the assertion that the new connections are using the wrong WAN?
That sounds like a configuration issue.
Using masquerade in srcnat rules helps speed up the clearing of old connections, and some scripts, as noted by rextended, do that as well.
Both valid for ros6 or ros7. Check his scripts out for some ideas.
The behavior in 6.x is clear. If a connection re-routes because the original egress interface goes down, then the NAT is cleared and the connection immediately takes up the new NAT (if any) that applies to the new route.
If it re-routes for any other reason, then the existing NAT or lack of NAT is retained even if incorrect for the new route. This could happen if the preferred interface comes back up, through recursive failover, or if a better route becomes available for any other reason.
The question is whether this is the same in ROS 7.
Regarding configuration, I observe this behaviour with only a single masquerade rule active.
Just had a play with packet capture and it looks like it’s not quite how I thought. In the scenario where the NAT is not revised, the connection is not actually re-routed, but still uses the former egress interface. That gives slight complications as to what breaks and what doesn’t.
In normal NAT scenarios (i.e. the ISP gives you an IP address which otherwise belongs to their address space) an ongoing connection cannot be “re-routed” … because all traffic passing through a certain ISP has to use the corresponding public IP address (public to the router itself; it can be a CGNAT address if the ISP uses them). And the connection peer sees the public IP address as your end. If NAT changed the public IP address of an ongoing connection (due to switching over to another ISP link), then packets sent to the remote end would appear as a new connection (or, in the case of TCP, an invalid connection) from a completely different remote IP address.
Hence if a link (with its associated IP address) goes down, all connections using it have to die … either slowly (by timing out) or quickly (by being actively dropped by clearing the connection tracking list on the egress router).
That is the heart of the problem, because some connections do not easily time out. That’s especially an issue with SIP over UDP and using default timings. Hours after the Internet route had changed, the connection remained established with the old NAT still in place.
Connection details don’t appear to show egress interface, so I had been assuming it was following the new route, which obviously wouldn’t work with the wrong NAT. I didn’t realise it would send to the old egress interface even though that route had been removed, but obviously that won’t work either.
It can be worked around with 6.x by careful tweaking of timers so that in a failover situation the connection is not kept alive by periodic unanswered keep-alives. Timeout needs to be such that it’s maintained by two way traffic but not once replies are no longer received.
The concern regarding 7.x is that I’ve seen it stated that a UDP connection is only treated as a stream if three packets are received at intervals less than udp-timeout. So simple poll/reply keep-alives won’t establish a stream. Which means udp-timeout needs to be increased to cover the interval between successive polls, rather than just between poll and reply. As evident in the thread about RDP, where after upgrading to 7.x people had to increase udp-timeout from the value they’d been using with 6.x.
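The arithmetic behind that can be sketched in a few lines of Python. This is only an illustration under the assumptions stated in this thread, not a description of the actual kernel logic: in the first case the entry is assumed to move onto the longer stream timer as soon as a reply is seen, in the second a plain poll/reply exchange is assumed never to become a stream. The function name and the example figures (30 s polls, 0.5 s reply delay) are made up.

```python
def min_udp_timeout(poll_interval, reply_delay, stream_after_first_reply):
    """Longest gap (seconds) that udp-timeout has to outlast.

    poll_interval: seconds between successive keep-alive polls
    reply_delay: seconds between a poll and its reply
    stream_after_first_reply: True if a single reply is assumed to move
        the entry onto the stream timer, False if it stays on udp-timeout
    """
    if stream_after_first_reply:
        # udp-timeout only has to bridge poll -> reply; after that the
        # stream timer is assumed to take over.
        return reply_delay
    # Otherwise udp-timeout also has to bridge reply -> next poll.
    return max(reply_delay, poll_interval - reply_delay)


print(min_udp_timeout(30.0, 0.5, True))   # 0.5
print(min_udp_timeout(30.0, 0.5, False))  # 29.5
```

So under these assumptions a udp-timeout of 10s would comfortably keep a 30-second keep-alive alive in the first case, but would let the entry expire between polls in the second.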
I had a bit of a struggle making sense of Packet Sniffer, but I have now found that it simply doesn’t capture packets sent through an interface that was down when the capture started. Hence I was seeing packets apparently not being forwarded on the new route. Knowing that, I have managed to catch it red-handed, sending on the new route but using NAT from the old route.
In this clip my host is 172.17.1.210 connected on ether2. Internet is via ether9 (which is CGNAT).
Bringing up the preferred interface and route before restarting capture we see packets following the new route, new egress interface, but still with the same NAT, and naturally not being answered …
As opposed to the correct NAT if a new connection is made. Note that’s my public IP so I’ve obscured it just leaving enough to show it’s not 192.168.1.55 ..
The behaviour you “discovered” is actually well known (among those who care). That’s why it’s sometimes better to use “chain=srcnat action=masquerade” instead of “chain=srcnat action=src-nat to-addresses=X.Y.Z.W” as SRC NAT rule.
Every time when interface disconnects and/or its IP address changes, the router will clear all masqueraded connection tracking entries related to the interface, this way improving system recovery time after public IP change. If srcnat is used instead of masquerade, connection tracking entries remain and connections can simply resume after a link failure.
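As a rough illustration of the difference, here is a toy Python model of that flush rule. It is only a sketch of the behaviour as documented above, not of the RouterOS internals: the `Entry` fields and flow names are invented, and it assumes each entry remembers the interface it was masqueraded on.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    flow: str
    out_interface: str
    nat_action: str  # "masquerade" or "src-nat"


def on_interface_event(entries, interface):
    """Entries that survive `interface` going down or changing its IP."""
    return [
        e for e in entries
        if not (e.nat_action == "masquerade" and e.out_interface == interface)
    ]


table = [
    Entry("sip-phone", "ether9", "masquerade"),
    Entry("vpn-client", "ether9", "src-nat"),
    Entry("web", "AAISP", "masquerade"),
]

survivors = on_interface_event(table, "ether9")
print([e.flow for e in survivors])  # ['vpn-client', 'web']
```

Only the masqueraded entry on the affected interface is dropped. The src-nat entry on the same interface stays, and so does everything on other interfaces.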
Your case (when WAN interface switches over from active to backup) actually falls into same category … as far as active connections are concerned: WAN IP address changes.
I feel I did discover it because in this very thread it was stated dogmatically that it would not behave this way if properly configured, hence having to do the traces to confirm for myself.
The scenario is one where the route changes but the old egress interface has not gone down. So depending on how you define it, it seems that the connection does get re-routed - it retains the same status and timeout, same NAT, but gets transmitted on a different egress interface.
I’m using masquerade, second rule disabled during testing but shouldn’t have any effect anyway.
Well, it’s hard to say whether it’s a bug in ROS (which then happens on some rare conditions) or a misconfiguration … because you didn’t post too many details about the particular use case you have. So it’s hard to know what kind of WAN connectivity you have, how you configured the failover, etc.
Fact is that ROS uses firewall features (including NAT), present in linux kernel … and linux kernel between v6 and v7 is indeed significantly different. So there may be some change in behaviour which you are exploiting/exploring. After all, linux kernel switched from iptables to nftables …
In this design the preferred route is via L2TP (interface “AAISP”) which inserts a default route with low distance. Fall back is to pre-existing 0.0.0.0/0 route with distance=5 via ether9.
If the L2TP drops, all connections clear and are recreated over ether9. That part all works fine.
When L2TP re-establishes, its route takes precedence again, and that is when I see the connection re-routed but retaining the old incorrect NAT. Sending out on AAISP but using the NAT (masquerade) address of ether9.