I have this customer with dual WAN, ISP1 and ISP2.
ISP1 is a fiber Internet link, with 60 Mbit download and low latency.
ISP2 is a regular DSL line, with 20 Mbit download and twice the latency.
ISP1 is primary, and ISP2 is used only for failover, in case ISP1 goes down.
Both ISP1 and ISP2 provide static public IPs, and the PCs behind the MT router are all NATed using masquerade.
I wrote a script that checks three external IPs every 5 seconds. When all three fail, ISP1 is declared DOWN and the default route is moved to ISP2.
(I have static routes for the three IPs I use for checks, so they are forced to go via ISP1 at all times)
As soon as ISP1 comes back online (any of the three IPs responds to ping), the default route changes back to ISP1.
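For reference, the decision my script makes boils down to something like the following. This is only a Python sketch of the logic, not the actual RouterOS script; the function name `next_route` and the "ISP1"/"ISP2" labels are made up for illustration.

```python
def next_route(active, ping_ok):
    """Return which ISP should carry the default route after one check.

    active  -- "ISP1" or "ISP2", the link currently holding the default route
    ping_ok -- one bool per check IP (three in my setup), True if it replied
    """
    isp1_up = any(ping_ok)          # a single reply is enough to call ISP1 up
    if active == "ISP1" and not isp1_up:
        return "ISP2"               # all checks failed: fail over
    if active == "ISP2" and isp1_up:
        return "ISP1"               # first reply: fail back immediately
    return active


# One check every 5 seconds on a flapping link: the route follows every blip.
active = "ISP1"
history = []
for result in ([False] * 3, [True, False, False], [False] * 3, [True] * 3):
    active = next_route(active, result)
    history.append(active)
print(history)
```

Because failback happens on the very first successful ping, there is nothing to dampen a link that goes up and down every few seconds.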
Last weekend I noticed that ISP1 was flapping, going offline and online several times within 5 minutes, so my script was failing over and failing back very quickly.
If that happens during business hours and ISP1 stays unstable for long, the office will be pretty much down, as the MT router will be switching back and forth all the time.
Any suggestions on how to address the issue?
I was thinking of adding some logic to the script to start a counter after failover: when ISP1 comes back online, it waits, let's say, 5 minutes before failing back. If ISP1 has issues within those 5 minutes, the counter resets and starts again, preventing failback until ISP1 is stable.
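To make the counter idea concrete, here is a rough Python sketch of it (again, not RouterOS script). The class name `HoldDownFailover` and its parameters are hypothetical; the 300-second hold-down and 5-second interval are just the numbers from my description.

```python
class HoldDownFailover:
    """Fail over immediately, but fail back only after ISP1 stays up a while."""

    def __init__(self, hold_down=300, interval=5):
        self.hold_down = hold_down  # seconds ISP1 must stay up before failback
        self.interval = interval    # seconds between checks
        self.active = "ISP1"
        self.stable_for = 0         # seconds ISP1 has been continuously up

    def tick(self, ping_ok):
        """Process one check (one bool per check IP); return the active ISP."""
        isp1_up = any(ping_ok)
        if self.active == "ISP1":
            if not isp1_up:
                self.active = "ISP2"    # all checks failed: fail over now
                self.stable_for = 0
        elif isp1_up:
            self.stable_for += self.interval
            if self.stable_for >= self.hold_down:
                self.active = "ISP1"    # stable long enough: fail back
                self.stable_for = 0
        else:
            self.stable_for = 0         # any failed check restarts the counter
        return self.active
```

With these defaults, ISP1 has to pass 60 consecutive checks before the default route moves back, and a single failed check in between starts the count over.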
Maybe I could also add latency checks; not sure.
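If I did add latency, one way I could imagine it is to count a check IP as "up" only when its reply is fast enough. A small Python sketch of that idea follows; the function name `isp1_healthy` and the 100 ms threshold are pure assumptions on my part, not something I have measured or tuned.

```python
def isp1_healthy(rtts_ms, max_rtt_ms=100.0):
    """Return True if at least one check IP replied within max_rtt_ms.

    rtts_ms -- one entry per check IP: round-trip time in milliseconds,
               or None if the ping timed out
    """
    replies = [rtt for rtt in rtts_ms if rtt is not None]
    return any(rtt <= max_rtt_ms for rtt in replies)


print(isp1_healthy([12.5, None, None]))    # one fast reply
print(isp1_healthy([400.0, 350.0, None]))  # replies, but all too slow
```

The result of this check could then be used in place of the plain "any IP pings" test when deciding whether ISP1 counts as up.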
Any ideas on how to prevent rapid failback on potentially unstable links?
Thanks!