Scenario:
OP has a primary AdGuard DNS server and a backup Pi server.
He wants an approach that is as elegant (i.e. hands-off) as possible: detect when the AdGuard DNS server is not working and switch responsibility for DNS to the backup Pi server.
There are multiple private VLANs at play but only one WAN.
Facts:
AdguardServer 10.20.30.30
PiServer 10.10.10.5
Approach: Attempt to apply netwatch DNS probe to accomplish the goal
Obstacle: We use DSTNAT to force users to AdGuard or, if applicable, to Pi. Since there is no such thing as distance in dstnat rules, how do we select the correct server to force users through?
e.g. (shown for udp; the tcp rules are the same apart from the protocol)
add chain=dstnat action=dst-nat protocol=udp dst-port=53 in-interface-list=LAN src-address-list=!excluded to-addresses=10.20.30.30
add chain=dstnat action=dst-nat protocol=udp dst-port=53 in-interface-list=LAN src-address-list=!excluded to-addresses=10.10.10.5
Where the excluded list contains devices that should not be forced through AdGuard/Pi for DNS.
Attempt
- We modify the DSTNAT rules to include address lists that will be used to "activate" the rules later, as follows:
add chain=dstnat action=dst-nat protocol=udp dst-port=53 in-interface-list=LAN \
    src-address-list=!excluded dst-address-list=Ad-UP to-addresses=10.20.30.30
add chain=dstnat action=dst-nat protocol=udp dst-port=53 in-interface-list=LAN \
    src-address-list=!excluded dst-address-list=Pi-UP to-addresses=10.10.10.5
The idea is that system scripts, called by Netwatch (up or down), add entries to the EMPTY address lists shown above. The scripts are executed based on the Netwatch DNS probe results. With no entries in these dynamic lists, by default neither dstnat rule matches, so no traffic is captured by the router: the conditions for the rules are not met.
Note: Because these address lists are only named inside rules, they will not show up in the firewall address-list table, which shows the current lists. However, when you create a new entry and pull down the list drop-down, both lists are available as options. This works the same whether you make entries via WinBox or the CLI.
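To see what the router actually holds in those lists at any given moment, something like the following should work from the CLI (a minimal sketch; the list names are the ones used in the rules above):

```
# list whatever entries currently exist in each "activation" list
/ip firewall address-list print where list=Ad-UP
/ip firewall address-list print where list=Pi-UP
```

An empty result means the matching dstnat rule currently has nothing to match on.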
- Next we set up the DNS probe. Instead of relying on an external canary (like ICMP), the idea is to use a local, always-available interface, similar to the workaround we use for WireGuard:
a. We use the lo interface as an 'always available' target, so a DNS lookup for it will always get a positive response as long as the AdGuard DNS server is working. Thus we create a static DNS entry.
/ip dns static
add address=127.0.0.1 name=checkadguard.net ttl=0.1 type=A
b. We use the Netwatch DNS probe to check the host name that points to the lo interface.
/tool netwatch
add name=Verify-UP type=dns host=checkadguard.net record-type=A \
    dns-server=10.20.30.30 interval=4s timeout=1s \
    up-script="/system script run AllGood" \
    down-script="/system script run Switch2Pi"
Any positively resolved response is what we want; it tells us the AdGuard device is functional. If the router cannot resolve the domain name 'checkadguard.net' (which points to the lo interface), AdGuard can be considered 'down'.
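Before relying on Netwatch, the same lookup can be tried by hand. This is only a sketch, and it assumes the :resolve command on your RouterOS version accepts a server argument; the second command simply shows the status Netwatch currently reports:

```
# ask the AdGuard server directly for the test name (assumption: :resolve supports server=)
:put [:resolve checkadguard.net server=10.20.30.30]
# show the Netwatch entries, including the current status of Verify-UP
/tool netwatch print
```

If the manual lookup errors out, the Netwatch probe would be expected to report down as well.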
c. Finally, the scripts come into play. If the AdGuard server is operational (the DNS probe gets a response), the script adds 0.0.0.0/0 to the firewall address list Ad-UP, which is referenced by the destination NAT rule that forces users to the AdGuard server. The rule's conditions are then met and it becomes active. The system script that adds 0.0.0.0/0 to the address list gives the entry a timeout of 5 seconds. To keep a continuous, valid indication of AdGuard availability, we reduced the Netwatch interval from the default 10s to 4s.
The opposite occurs on the down side of the Netwatch rule. If the DNS probe gets no response from the AdGuard server, the other script is executed; it adds 0.0.0.0/0 to the Pi-UP firewall address list, again with a 5-second timeout (also covered by the DNS probe running every 4 seconds).
Thus every four seconds a firewall address-list entry is made for one of the two pairs of destination NAT rules (udp/tcp), ensuring the associated rules are active and in use. If AdGuard goes down, there is a one-second gap in availability, which seems acceptable, and the same gap should occur when AdGuard comes back online. We are not expecting the server to flap, just an infrequent equipment failure of some sort.
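One assumption here is worth checking: as far as I can tell from the Netwatch documentation, up-script and down-script run when the status changes, not on every probe, while RouterOS v7 also offers test-script, which runs at the end of each probe. If that holds on your version, a sketch like the following would refresh the lists every interval (Verify-UP, AllGood and Switch2Pi are the names used in this post; nothing else is assumed to exist):

```
/tool netwatch
# assumption: test-script is available (RouterOS v7) and fires after every probe
set [find name="Verify-UP"] test-script={
    :local state [/tool netwatch get [find name="Verify-UP"] status]
    :if ($state = "up") do={
        /system script run AllGood
    } else={
        /system script run Switch2Pi
    }
}
```

Note that any status other than "up" (including "unknown" right after boot) falls through to the Pi script in this sketch.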
/system script
add dont-require-permissions=yes name=AllGood owner=OP policy=read,write \
    source="/ip firewall address-list add list=Ad-UP address=0.0.0.0/0 timeout=5s"
And the other:
add dont-require-permissions=yes name=Switch2Pi owner=OP policy=read,write \
    source="/ip firewall address-list add list=Pi-UP address=0.0.0.0/0 timeout=5s"
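One thing to watch with the timing: adding an entry that is already in the list fails with an "already have such entry" error, so a run at 4s would fail while the 5s entry from the previous run still exists. A minimal sketch of a script body that clears any existing entry first and then re-adds it (shown for Ad-UP; the Pi-UP version would be analogous):

```
# remove any existing entry so the add below cannot collide with it
/ip firewall address-list remove [find where list="Ad-UP"]
# re-add the catch-all entry with a fresh 5s timeout
/ip firewall address-list add list=Ad-UP address=0.0.0.0/0 timeout=5s
```

This would take the place of the single add command in the source of AllGood.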
+++++++++++++++++++
Good idea, bad idea?
Where have I got it wrong?
Where can it be improved?
Recent EDITS:
A. We initially decided the TTL of the static DNS entry should be zero, because we want a fresh lookup each time Netwatch runs and did not want the router to cache results. However, the Netwatch status stayed hard down with this setting, so we switched it to 0.1. That works.
B. Added interval=4s and timeout=1s for Netwatch.
C. The system script didn't work at all, so next I am trying to remove all permissions other than read/write.