DNS failure when some of the dynamic servers do not answer

I have configured two static DNS servers, and two dynamic interfaces (dual WAN links with DHCP) each receive two more servers, for a total of six DNS servers. The setup normally works, and clients that query names from the MikroTik resolve them fine.

However, when some of the servers fail, the MikroTik does not reply at all, even though some of the servers still work. For example, right now two dynamic servers do not answer (192.19.223.230 and 192.19.123.231; addresses mangled) and queries fail, even though other DNS servers (like 8.8.8.8) work.

Is this a bug in RouterOS 6.39.1?

[admin@gw] > /ip dns print
servers: 8.8.8.8,8.8.4.4
dynamic-servers: 192.19.223.230,192.19.123.231,191.219.0.40,191.129.0.42
allow-remote-requests: yes
max-udp-packet-size: 4096
query-server-timeout: 1s
query-total-timeout: 10s
max-concurrent-queries: 120
max-concurrent-tcp-sessions: 20
cache-size: 2048KiB
cache-max-ttl: 1w
cache-used: 425KiB

$ dig google.fi @10.2.1.254

; <<>> DiG 9.10.3-P4-Ubuntu <<>> google.fi @10.2.1.254 <mikrotik
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 29629
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;google.fi. IN A

;; Query time: 0 msec
;; SERVER: 10.2.1.254#53(10.2.1.254)
;; WHEN: Wed Jun 07 11:00:45 EEST 2017
;; MSG SIZE rcvd: 27

$ dig google.fi @8.8.8.8

; <<>> DiG 9.10.3-P4-Ubuntu <<>> google.fi @8.8.8.8
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 26670
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 512
;; QUESTION SECTION:
;google.fi. IN A

;; ANSWER SECTION:
google.fi. 299 IN A 172.217.18.131

;; Query time: 28 msec
;; SERVER: 8.8.8.8#53(8.8.8.8)
;; WHEN: Wed Jun 07 11:00:29 EEST 2017
;; MSG SIZE rcvd: 54

You say “do not answer at all”, and then you show an example where there is an answer in 0 ms with status SERVFAIL.
What exactly are you trying to resolve? SERVFAIL does not mean “no answer”; it is an error return code.

OK, I described it inaccurately. When a name is queried from the MikroTik, dig returns after 10 s (query-total-timeout) with a SERVFAIL response and no A record, even though 4 of the 6 DNS servers are functional. So it looks like only one or two of the DNS servers are used.

The first two dynamic servers have now been down for a few hours (dig against them returns “no servers could be reached”), and all DNS queries to the MikroTik return SERVFAIL. So it looks like all queries are being sent to the first one or two dynamic servers, and the other dynamic servers and the static servers are not used at all.

After the MikroTik was rebooted and the dynamic interfaces got the same DNS servers (two still not answering), the MikroTik is now replying to DNS queries. So maybe there is a problem in DNS server selection, where it can sometimes get stuck on one or two servers?

Interesting… I will have to test at home to see if I can reproduce this.
(I cannot do such tests in a production environment, of course.)

While investigating the issue, I tried changing various parameters back and forth (query-server-timeout, query-total-timeout, allow-remote-requests, etc.), without effect.

The same thing has occurred before, a few months ago. I don’t know how frequently it would occur, as ISPs’ DNS servers typically work; only this time both of one ISP’s servers stopped answering.

If you have a dual WAN link and one of the links fails, then you are querying that ISP’s DNS servers over the other link. ISP DNS servers are usually not public: you can query them only from inside the ISP’s own network. What is happening in your case is that you can still ping the servers from outside the ISP’s network, so their status is up, but when you query them you get an invalid response. This is normal for a non-public DNS server. What you need to do is drop packets sent to these DNS servers on any interface other than the respective ISP’s WAN interface.

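/ip firewall filter add action=drop chain=output dst-address=192.19.223.230 out-interface=!WAN1
/ip firewall filter add action=drop chain=output dst-address=192.19.123.231 out-interface=!WAN1
/ip firewall filter add action=drop chain=output dst-address=191.219.0.40 out-interface=!WAN2
/ip firewall filter add action=drop chain=output dst-address=191.129.0.42 out-interface=!WAN2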

Yes, it’s true that ISPs don’t answer queries from the wrong network (my other ISP behaves that way).

But in any case, I suppose the MikroTik should rotate the queries across all DNS servers and get the answer from one of them? At least with query-server-timeout at 1s and query-total-timeout at 10s, it should have time to ask 10 DNS servers before timing out.
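The behaviour expected here can be sketched in Python. This is only a simulation of the reasoning (a 1 s per-server timeout inside a 10 s total budget), not how RouterOS actually selects servers; `resolve_with_failover` and the `ask` callback are hypothetical names made up for illustration, and a silent server is assumed to cost the full per-server timeout.

```python
from typing import Callable, List, Optional, Sequence, Tuple


def resolve_with_failover(
    servers: Sequence[str],
    ask: Callable[[str, float], Optional[str]],
    server_timeout: float = 1.0,
    total_timeout: float = 10.0,
) -> Tuple[Optional[str], List[str]]:
    """Try upstream servers in turn until one answers or the budget runs out.

    `ask(server, timeout)` is a hypothetical callback: it returns an answer,
    or None if the server stayed silent for `timeout` seconds.
    """
    tried: List[str] = []
    remaining = total_timeout
    i = 0
    while servers and remaining > 0:
        server = servers[i % len(servers)]  # wrap around the list if needed
        timeout = min(server_timeout, remaining)
        tried.append(server)
        answer = ask(server, timeout)
        if answer is not None:
            return answer, tried
        # Assumption: a silent server consumes its whole per-server timeout.
        remaining -= timeout
        i += 1
    return None, tried


# Simulated upstreams: the two servers that stopped answering come first.
DEAD = {"192.19.223.230", "192.19.123.231"}
SERVERS = [
    "192.19.223.230", "192.19.123.231",
    "191.219.0.40", "191.129.0.42",
    "8.8.8.8", "8.8.4.4",
]


def fake_ask(server: str, timeout: float) -> Optional[str]:
    """Stand-in for a real DNS query: dead servers never reply."""
    return None if server in DEAD else "172.217.18.131"


answer, tried = resolve_with_failover(SERVERS, fake_ask)
print(answer, tried)
```

In this simulation the third server answers after two timed-out attempts, so about 2 s of the 10 s budget is spent. With every server silent, the loop makes ten attempts before giving up.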

The MikroTik reaches the DNS servers based on the routes in its routing table. If your active default route goes via the primary ISP, it will query the second ISP’s DNS servers through that active route, i.e. via the primary ISP. Add routes to the DNS IPs manually to select the proper path.

/ip route add dst-address=192.19.223.230 gateway=WAN1
/ip route add dst-address=192.19.123.231 gateway=WAN1
/ip route add dst-address=191.219.0.40 gateway=WAN2
/ip route add dst-address=191.129.0.42 gateway=WAN2

Also include the firewall rules, to make sure the servers are not queried from the other interface if these routes become invalid.

There is no timeout, but instead an invalid response telling you that you cannot query the DNS server from this network. MikroTik will try the next DNS server only if there is a timeout. Since MikroTik has no mechanism to analyse the response, it just forwards the response to the requesting machine and adds it to the DNS cache. It would make no sense to discard the response, since the requesting machine might want it anyway, so why discard it if not told to?
Anyway, DNS servers are not queried in order from first to last. Instead they are queried in random order to balance the load on the DNS servers. Just saying.

No, there was no answer from the DNS servers (not a single packet was received, according to a Wireshark trace). Requests to the DNS servers timed out without any error response. So after the timeout, the MikroTik did not try the next random server. Also, randomization did not work, as all requests failed consistently (not only 2 out of 6).

As I said earlier, this occurs only occasionally, every few months, and after a MikroTik reboot it worked again.
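To tell the two cases discussed in this thread apart (a silent timeout versus an error reply such as REFUSED or SERVFAIL), each upstream server can be probed individually from a client. Below is a minimal Python sketch using only the standard library. The function names `build_query`, `rcode_of` and `probe` are made up for illustration, and the sketch does not verify the reply ID or retry, so treat it as a rough diagnostic rather than a replacement for dig or a packet trace.

```python
import socket
import struct

# A few common response codes from the DNS header (RFC 1035).
RCODES = {0: "NOERROR", 1: "FORMERR", 2: "SERVFAIL", 3: "NXDOMAIN", 5: "REFUSED"}


def build_query(name: str, query_id: int = 0x1234) -> bytes:
    """Build a minimal DNS query for an A record in wire format."""
    # Header: ID, flags (recursion desired), QDCOUNT=1, other counts 0.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # QNAME: each label prefixed by its length, terminated by a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    # QTYPE=1 (A), QCLASS=1 (IN).
    return header + qname + struct.pack("!HH", 1, 1)


def rcode_of(response: bytes) -> str:
    """Return the RCODE name stored in the low 4 bits of the flags field."""
    if len(response) < 4:
        return "malformed"
    (flags,) = struct.unpack("!H", response[2:4])
    code = flags & 0x000F
    return RCODES.get(code, f"RCODE {code}")


def probe(server: str, name: str = "google.fi", timeout: float = 1.0) -> str:
    """Send one UDP query and report 'timeout' or the RCODE of the reply."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(build_query(name), (server, 53))
            response, _ = sock.recvfrom(4096)
        except socket.timeout:
            return "timeout"
        except OSError as exc:
            return f"error: {exc}"
    return rcode_of(response)
```

Calling `probe("8.8.8.8")` and `probe("192.19.223.230")` in turn would then show, per server, whether the query timed out or came back with an error code.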