Yes, it seems that some of the traffic stays on the backup path after being switched over because of the main connection failure. The failover itself is flawless because there is "no choice": the sessions die when WAN1 goes down. But when WAN1 comes back up, WAN2 never goes down, so everything that is still alive on WAN2 becomes a zombie.
For connection-tracked NAT sessions this is expected behaviour: if WAN1 goes down, IP connections drop and are newly initiated by clients using the only available interface, WAN2. When this happens, a masquerade/NAT session between the internal IP and the WAN2 IP is created and is used to NAT ingress traffic belonging to this connection. If WAN1 then comes back up alongside WAN2, existing NAT connections stay on WAN2 until they are closed or time out, while new outbound NAT connections use WAN1 because of its lower distance of 10.
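To make the behaviour described above concrete, here is a minimal Python sketch of it. It is only a toy model, not router code: the names `Route`, `pick_route`, `open_connection` and `link_down` are made up for illustration, and the distance of 20 for WAN2 is an assumption (only WAN1's distance of 10 is given above).

```python
from dataclasses import dataclass


@dataclass
class Route:
    name: str
    distance: int
    up: bool = True


def pick_route(routes):
    """Return the active route with the lowest distance, or None if all are down."""
    active = [r for r in routes if r.up]
    return min(active, key=lambda r: r.distance) if active else None


wan1 = Route("WAN1", distance=10)
wan2 = Route("WAN2", distance=20)  # assumed value, only WAN1's distance is known
routes = [wan1, wan2]
nat_table = {}  # connection id -> WAN whose address the NAT session uses


def open_connection(conn_id):
    """A new connection gets its NAT session on the best route available right now."""
    route = pick_route(routes)
    nat_table[conn_id] = route.name
    return route.name


def link_down(route):
    """Mark a link as down and drop the NAT sessions that were bound to it."""
    route.up = False
    for conn_id in [c for c, wan in nat_table.items() if wan == route.name]:
        del nat_table[conn_id]


first = open_connection("call-a")           # both links up
link_down(wan1)                             # WAN1 fails, call-a dies
during_outage = open_connection("call-b")   # only WAN2 is available
wan1.up = True                              # WAN1 recovers, WAN2 stays up
after_recovery = open_connection("call-c")  # new connection prefers WAN1
print(first, during_outage, after_recovery, nat_table)
```

In this model "call-b" keeps its WAN2 session after WAN1 recovers, while "call-c" is created on WAN1, which is the split between existing and new connections described above.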
Are you sure this is the case?
My understanding is that the routing table takes priority. If WAN1 comes back up, connections that were going out through WAN2 will then go out via WAN1 if it has the lower distance value.
But because of connection tracking, those connections will go out WAN1 with the source IP address of WAN2 (src-nat/masquerade combined with connection tracking makes this sticky). This silently causes all phone calls to fail, because you end up with asymmetric routing (or the traffic is simply dropped in WAN1's network), and they will not work at all until the router is rebooted or the connection table is flushed.
If I'm wrong, please correct me.
More importantly, if there's a good way to handle all this properly, I'm sure all of us would like to know.
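Here is a small Python sketch of the failure mode I mean. It assumes that the route is looked up for every packet while the src-nat address is chosen only once, when the connection is created. The function names, the distance of 20 for WAN2 and the addresses (taken from the documentation ranges) are all made up for illustration.

```python
# Hypothetical public addresses of the two uplinks (documentation ranges).
WAN_ADDR = {"WAN1": "198.51.100.1", "WAN2": "203.0.113.1"}


def best_route(link_up, distance):
    """Pick the active uplink with the lowest distance (raises if none is up)."""
    candidates = [wan for wan, up in link_up.items() if up]
    return min(candidates, key=distance.get)


def send(conn, conntrack, link_up, distance):
    """Return (outgoing interface, source address, addresses match?) for one packet."""
    out_if = best_route(link_up, distance)   # routing: decided per packet
    if conn not in conntrack:
        conntrack[conn] = WAN_ADDR[out_if]   # src-nat: decided once per connection
    src = conntrack[conn]
    return out_if, src, src == WAN_ADDR[out_if]


distance = {"WAN1": 10, "WAN2": 20}          # 20 is an assumed value
link_up = {"WAN1": False, "WAN2": True}      # WAN1 is currently down
conntrack = {}

during_outage = send("call", conntrack, link_up, distance)
link_up["WAN1"] = True                       # WAN1 comes back, WAN2 stays up
after_recovery = send("call", conntrack, link_up, distance)
print(during_outage)
print(after_recovery)
```

Under these assumptions the second packet leaves via WAN1 while still carrying WAN2's source address, which is the mismatch I'm describing. Whether a real router behaves this way depends on whether the routing decision is pinned to the connection (for example with routing marks).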
For instance, I think everyone would agree that the optimal approach to this type of failover would be a hard failover if WAN1 goes down, but no hard transition back when WAN1 comes back up, since that would kill all phone calls. It would be best to keep all existing and ongoing calls routed out WAN2 and only move a caller over to WAN1 once the call is over. Importantly, this would also prevent a flapping WAN connection from continually interrupting everyone's calls.
I'm sure there's a way to do this with mangle and connection/packet marking? Maybe with speed thresholds: once a VoIP session drops below, say, 10 kbit/s, the call is clearly over and the connection should be flushed.
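As a rough sketch of the speed-threshold idea, here is how the check could look in Python. It assumes you can poll a per-connection byte counter at a fixed interval; the function names, the sample data and the 10-second interval are hypothetical, and the actual flushing of the connection on the router is not shown.

```python
def rate_kbit(bytes_before, bytes_after, interval_s):
    """Average throughput in kbit/s between two byte-counter readings."""
    return (bytes_after - bytes_before) * 8 / 1000 / interval_s


def stale_connections(samples, interval_s, threshold_kbit=10.0):
    """Return the ids of connections whose rate fell below the threshold.

    samples maps a connection id to (byte counter at previous poll, byte counter now).
    """
    return sorted(
        conn_id
        for conn_id, (before, after) in samples.items()
        if rate_kbit(before, after, interval_s) < threshold_kbit
    )


# Hypothetical counters taken 10 seconds apart.
samples = {
    "call-1": (1_000_000, 1_100_000),  # 100,000 bytes in 10 s = 80 kbit/s, still active
    "call-2": (500_000, 502_000),      # 2,000 bytes in 10 s = 1.6 kbit/s, likely over
}
to_flush = stale_connections(samples, interval_s=10)
print(to_flush)
```

Only the connections returned here would be candidates for flushing, so an active call on WAN2 is left alone until its traffic dies down.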