I have a puzzling problem. I'm using a recursive route for WAN failover on rb1, and on a second RouterBOARD (rb2) I am using Netwatch to ping the host that validates this recursive route (8.8.8.8), so that a specific port is disabled when the primary WAN fails and re-enabled when the primary WAN recovers.
Topology:
<WAN1>--<ether1|rb1|ether2>--<ether1|rb2|wlan1>--<WAN2>
On rb1, recursive route using Google DNS for validation:
/ip route
add check-gateway=ping comment="primary route" distance=1 gateway=8.8.8.8
add comment="secondary route" distance=2 gateway=172.16.44.2
add comment="validate primary route" distance=1 dst-address=8.8.8.8/32 gateway=47.223.56.1 scope=10
IP 172.16.44.2 is rb2. On rb2:
/ip route
add distance=3 gateway=172.16.44.1
add distance=1 dst-address=8.8.8.8/32 gateway=172.16.44.1
rb2 also has a dynamic default route with distance=2, which is added automatically when WAN2 is manually activated during a failure.
The Netwatch entry on rb2 is pretty standard:
/tool netwatch
add down-script="/interface disable ether2;" host=8.8.8.8 interval=2s up-script="/interface enable ether2;"
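As a debugging aid, a variant of the same entry could also log each transition, so the exact time Netwatch changes state shows up in the log. This is only a sketch meant to replace the entry above rather than sit alongside it; the log messages are illustrative wording, not part of the running config:
/tool netwatch
# same host, interval and interface actions as above, plus a log line per transition
add host=8.8.8.8 interval=2s \
    down-script=":log warning \"netwatch: 8.8.8.8 down, disabling ether2\"; /interface disable ether2;" \
    up-script=":log info \"netwatch: 8.8.8.8 up, enabling ether2\"; /interface enable ether2;"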
So here is the problem:
- In the regular state with WAN1 working, rb2 pings to 8.8.8.8 are successful and netwatch works as expected (up)
- In the failure state with WAN1 failing, rb2 pings to 8.8.8.8 are unsuccessful and netwatch works as expected (down)
- In the recovery state with WAN1 working again, rb2 pings to 8.8.8.8 are still unsuccessful and netwatch is still down; traceroute is successful but ping is not!
- Only a reboot of rb2 will fix the ping/netwatch problem
I just cannot understand why traceroute is successful on recovery, but ping is not!
[admin@rb2] > ping 8.8.8.8
  SEQ HOST       SIZE TTL TIME  STATUS
    0 8.8.8.8                   timeout
    1 8.8.8.8                   timeout
    2 8.8.8.8                   timeout
    sent=3 received=0 packet-loss=100%
[admin@rb2] > /tool traceroute 8.8.8.8
 # ADDRESS          LOSS SENT    LAST   AVG  BEST WORST
 1 172.16.44.1        0%    2   0.3ms   0.4   0.3   0.4
 2                  100%    2 timeout
 3 173.219.227.28     0%    2   7.5ms   8.8   7.5    10
 4 173.219.152.234    0%    2  15.2ms  13.3  11.3  15.2
 5 173.219.196.47     0%    2  18.8ms  18.8  18.8  18.8
 6 209.85.245.179     0%    1  21.3ms  21.3  21.3  21.3
 7 72.14.234.61       0%    1  11.7ms  11.7  11.7  11.7
 8 8.8.8.8            0%    1  11.8ms  11.8  11.8  11.8
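One check that might help narrow this down, offered as a guess rather than a confirmed cause: since a fresh traceroute gets through while the repeated ping does not, it may be worth looking on rb2 for a connection-tracking entry for ICMP to 8.8.8.8 left over from the failure state. A sketch, assuming connection tracking is enabled on rb2:
# list ICMP connection-tracking entries towards 8.8.8.8
/ip firewall connection print detail where protocol=icmp and dst-address~"8.8.8.8"
# if a stale-looking entry shows up, removing it and re-testing ping would confirm or rule this out
/ip firewall connection remove [find protocol=icmp dst-address~"8.8.8.8"]
If ping recovers right after the entry is removed, without a reboot, that would point at connection tracking rather than routing.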
Any ideas of more things to check?