/ip dns servers= (cache) - how are multiple servers used?

This question does not have a clear answer from MikroTik (the manual has very little detail on /ip dns and does not specifically address this important question):

With this setting, how does RouterOS use the multiple DNS servers (equal weight? failover? some ratio?):
/ip dns set allow-remote-requests=yes servers=8.8.4.4,1.1.1.1,4.2.2.1

I.e. I'm using RouterOS DNS caching to speed up DNS queries from my clients (192.168.1.0/24). But how does RouterOS use those three DNS servers?

In this thread, the answer from an MT rep seemed to be failover: http://forum.mikrotik.com/t/dns-utilization/93252/1

However, when I look at my traffic-flow data for DNS queries from the MikroTik (RouterOS 6.42.7), I'm seeing equal weight (i.e. queries spread across all three). So how does RouterOS DNS handle a single DNS server being down? Do 33% of requests now fail? (See my image below, based on 16 hours of data. It could just be luck that it appears to be equal weight / 33% per server, but this is why we need a solid answer.)

We need more info on this, ideally added to the manual (and from MikroTik officially, so we can stop guessing about this pretty important service).
Thanks.
DNSCapture.JPG

I agree that MikroTik should put the answer to this topic in the Wiki; I have not found it there. However, I recall reading a forum post a few years ago in which a forum guru wrote that MikroTik uses the DNS entries round-robin, excluding DNS IPs that don’t reply. It was never made clear how the entries that do not reply are dealt with, nor how often they are queried. At the time, if memory serves, the latest version was 5.x, so I am not sure whether this still applies to 6.40+.

I can’t seem to find that post… but here is one that may be relevant to you:

http://forum.mikrotik.com/t/dns-servers-secondary-before-primary/107622/1

Why?

All you really need to know is that when some configured server doesn’t answer, another will be asked instead. As long as at least one works, you’ll get the answer (if one exists). Everything else falls into the “unimportant implementation details” category.

That is not really true. When the operation mode is “round robin” without any failure detection, the result is that one in three requests (in this case) gets no reply and has to be retried, which happens after 2 seconds by default (this can be changed in the setup). That slows down the DNS service.
However, I think it works as described above (and it is like this in most Linux resolvers): there is basically round-robin operation, but each server has a weight which is adjusted by its average response time and number of timeouts. So when a server responds quickly, it gets comparatively more of the queries, and when it does not respond at all it gets far fewer (probably no operational queries, only probes for the root zone, to see if it is back up).
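A rough sketch of that hypothesis (this is how several Linux resolvers behave; whether RouterOS does the same is exactly what is unconfirmed in this thread — the update rule, factors, and floor below are made up for illustration):

```python
import random

class WeightedResolverPool:
    """Round-robin with per-server weights, as hypothesized above:
    weights fall on timeouts and recover on replies, so a dead server
    still gets occasional probes but few real queries."""

    def __init__(self, servers):
        self.weights = {s: 1.0 for s in servers}

    def pick(self):
        servers = list(self.weights)
        return random.choices(servers, [self.weights[s] for s in servers])[0]

    def report(self, server, ok):
        if ok:
            # A reply pulls the weight back up toward the full share.
            self.weights[server] = min(1.0, self.weights[server] * 1.5 + 0.01)
        else:
            # A timeout halves the weight, but a floor keeps the server
            # in rotation so it can be detected as back up.
            self.weights[server] = max(0.01, self.weights[server] / 2)

random.seed(0)
pool = WeightedResolverPool(["8.8.4.4", "1.1.1.1", "4.2.2.1"])
for _ in range(1000):            # simulate 1.1.1.1 being down
    s = pool.pick()
    pool.report(s, ok=(s != "1.1.1.1"))
# The dead server's weight ends up near the floor; live ones stay at 1.0.
```

Under this model the query distribution stays nearly even while all servers answer, and collapses away from a dead server within a handful of timeouts.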

Ok, let me try again.

The important thing to know is that it isn’t dumb round-robin, where one dead server would mean that every x-th query fails and needs to be retried. That would be really stupid. It didn’t work like this before, and a quick test says it doesn’t work like this now. I see simple failover: the first server is used initially, and when it fails, RouterOS moves to the next one, and so on.
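The failover behavior described here can be sketched like this (a toy model of the observed behavior, not RouterOS source; `query_fn` is a stand-in for an actual DNS query):

```python
class StickyFailover:
    """Stick with the current server until a query to it times out,
    then move to the next one in the list (wrapping around)."""

    def __init__(self, servers):
        self.servers = servers
        self.current = 0

    def resolve(self, query_fn):
        # query_fn(server) returns True on a reply, False on timeout.
        for _ in range(len(self.servers)):
            server = self.servers[self.current]
            if query_fn(server):
                return server            # keep using this server
            # No reply: assume the server is dead, advance to the next.
            self.current = (self.current + 1) % len(self.servers)
        return None                      # nobody answered

pool = StickyFailover(["8.8.8.8", "8.8.4.4", "1.1.1.1"])
# While 8.8.8.8 answers, every query goes there:
assert pool.resolve(lambda s: True) == "8.8.8.8"
assert pool.resolve(lambda s: True) == "8.8.8.8"
# When it stops answering, the pool moves on and sticks to 8.8.4.4:
assert pool.resolve(lambda s: s != "8.8.8.8") == "8.8.4.4"
assert pool.resolve(lambda s: True) == "8.8.4.4"
```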

If it works like this, then the even distribution does look suspicious. But hey, it’s UDP; packets can get lost. Or some query could take longer to resolve, and when the reply doesn’t come back in time, RouterOS thinks the server is dead and moves on. I admit that it does look too even, but it’s not impossible.

So I stand corrected, it should be clearly documented. It’s just that almost every time there’s a question about DNS server priority, it turns out that the person wants to mix public and internal resolvers (for some non-public domain), and that is simply wrong.

When it works like that, the distribution will also be quite even, because a certain number of lookups will not return a reply (especially for reverse DNS), so the resolver will regularly move on to the next server under normal use (it is not possible to tell the difference between a timeout because nobody answered and a timeout because a packet was lost).
However, that would be quite sub-optimal with respect to servers being down, as each “round” through the round-robin would hit the non-working server again and incur another 2-second delay.
I know “bind9” does not work like that, but maybe the MikroTik resolver does.

Asking for a non-existent record will still give you an answer:

> dig 1.88.168.192.in-addr.arpa PTR @8.8.8.8

; <<>> DiG 9.9.2-P2 <<>> 1.88.168.192.in-addr.arpa PTR @8.8.8.8
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 42555
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1

...

So the server will look perfectly alive.

This is only a problem for people mixing public and private domains, because the answer is clear: “it does not exist”. It isn’t “I don’t know” or silence, so there’s no point asking another server. When they add their private resolver first, it initially works, because RouterOS starts from the first one. But as soon as a query fails (no reply comes from the server), RouterOS switches to another server, and if it’s a public one, private domains stop resolving. And it’s hard to debug, because it’s random, plus records are cached by both RouterOS and clients based on TTL, so it doesn’t show up immediately.
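A toy model of why NXDOMAIN is different from a timeout (the addresses, domain, and fallback rule below are illustrative, not RouterOS internals; the point is only that a definitive negative answer stops the search):

```python
# Possible outcomes of one query to one server:
NXDOMAIN = "nxdomain"   # authoritative "this name does not exist"
TIMEOUT = "timeout"     # no reply at all
ANSWER = "answer"

def resolve(name, servers, responses):
    """Try servers in order; only a TIMEOUT causes fallback."""
    for server in servers:
        result = responses[server](name)
        if result != TIMEOUT:
            return (server, result)  # NXDOMAIN counts as an answer!
        # No reply at all: try the next server.
    return (None, TIMEOUT)

# Hypothetical setup: 'intranet.local' exists only on the private
# resolver 10.0.0.1, which (in this toy model) ignores public names.
responses = {
    "10.0.0.1": lambda n: ANSWER if n.endswith(".local") else TIMEOUT,
    "8.8.8.8": lambda n: NXDOMAIN if n.endswith(".local") else ANSWER,
}

# While the private resolver is first, internal names resolve:
assert resolve("intranet.local", ["10.0.0.1", "8.8.8.8"], responses) == ("10.0.0.1", ANSWER)
# But once the resolver has switched to the public server (e.g. after
# one timeout), it gets a definitive NXDOMAIN and never falls back:
assert resolve("intranet.local", ["8.8.8.8", "10.0.0.1"], responses) == ("8.8.8.8", NXDOMAIN)
```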

Only for this specific case, because Google DNS has the recommended empty zones for RFC1918 addresses.
But when you feed addresses from traffic you receive from the internet into reverse DNS, you will find many that do not return a response, at least not within 2 seconds.
Of course a resolver can eventually return an error (e.g. MikroTik does that by default after 10 seconds), but by then it will have tried different servers, each returning no answer, and the “active server” will have changed to a random one.

Even if Google didn’t do anything (they probably do; I think I’ve seen blocking of RFC1918 addresses in other resolvers too), 168.192.in-addr.arpa has proper delegation with working nameservers (they return NXDOMAIN for all PTR queries), so it’s like any other domain. With all domains, there should always be a response, as long as the authoritative servers are not dead or too slow. Which I’m sure does happen, but I can’t say how often.

I’d say in at least about 25% of reverse-DNS lookups (for addresses that are doing port scanning etc., so probably not representative of the entire address space).

sob, this thread exactly shows why we NEED a formal answer to this. At some locations, where 100 or 1000 users are at risk, I need a formal / official answer about something that can cause everyone’s internet to “stop” (or appear dead), i.e. DNS. The most fundamental / critical part of the services we provide with RouterOS / RouterBOARD is for the internet to function, and DNS issues can stop that. (I know: don’t use caching then, but the benefits of this feature are very clear.)

I can say that before posting this question, I did some tests of my own (and have over the years as well): when I have two DNS entries set on my RB and I kill one of those two DNS servers, I’m not seeing failover DNS behavior on RouterOS (i.e. to my test clients, DNS appears to stop working, or a major part of it does).

I have even written (and posted to the forums here) scripts to test DNS and fail over the RouterOS entries in /ip dns. So an answer to this question is needed and is very important.

But more importantly: why do I need to spend all this time testing and writing scripts, and even then still not know 100% for sure how the RouterOS DNS system works, when a simple one- or two-line entry in the manual or wiki (or in this thread) from official MT would answer all of this and would have taken maybe 60 seconds for all parties involved?

If you don’t have anything valuable to add, please stay out of the discussion. Calling the post into question is not valuable and will help cause this thread to be ignored / not answered by MT officials (or others who may know the answer for sure). This thread could have been answered in two posts (my post and an answer); now it’s filled with all this extra noise.

Ok, I admit that I didn’t choose the best start. At the same time, it’s not like I turned it into complete off-topic. And I definitely don’t think that MikroTik employees hunt for unanswered threads and would be deterred by too many posts. Anyway, if I annoyed you, I’m sorry. Shutting up now…

To update everyone, I received this reply from MT support (by email):


Yes, once DNS servers are responding properly, the same weight is applied.
In case one DNS server is not responding, its weight is decreased, then router check again if server is responding and weight is decreased or increased (whether query is replied or not).

So it appears (and please correct me if I’m interpreting this wrong) that if all DNS servers are replying, equal weight is applied (i.e. 50/50 if two /ip dns servers= are used). If one of the two stops replying, the weighting becomes 100/0 until the down DNS server starts replying again (at which point it goes back to 50/50).

(I’m going to test this locally and will report back to confirm.) Thanks.
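This interpretation (an assumption based on the support reply, not official documentation) amounts to a simple expected-share calculation; the addresses and the near-zero weight are illustrative:

```python
def query_share(weights):
    """Expected fraction of queries each server gets, given its weight."""
    total = sum(weights.values())
    return {server: w / total for server, w in weights.items()}

# Both servers replying -> equal weight -> 50/50:
assert query_share({"8.8.8.8": 1.0, "1.1.1.1": 1.0}) == {"8.8.8.8": 0.5, "1.1.1.1": 0.5}

# One server stops replying and its weight is driven toward zero,
# i.e. effectively 100/0 until it starts replying again:
share = query_share({"8.8.8.8": 1.0, "1.1.1.1": 0.01})
assert share["8.8.8.8"] > 0.99
```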

I tried RouterOS 6.44beta20 and /ip dns with 8.8.8.8, 8.8.4.4, 1.1.1.1 and 1.0.0.1, with a counter for each resolver (using firewall rules), and:

1) Queries for 100% working records
My own domain, where the authoritative servers have “*.<my_domain> A ”. A script sends a stream of queries for .<my_domain>.

Result: All queries are handled by the first resolver. Not a single packet is sent to the others.

2) Queries that are likely to fail
PTR queries to ....in-addr.arpa.

Result: Queries see many random failures and timeouts, and RouterOS switches between resolvers all the time.

3) Simulated failures
Same records as in 1), but I send queries one by one and try to block outgoing udp/53 to selected resolvers.

Result: RouterOS switches to the next resolver if (and only if) the current one fails to respond.

But now the most interesting test case: you have 4 DNS servers configured, 3 of them working, and you regularly query for non-responding records (those .in-addr.arpa ones).
The DNS resolver queries 8.8.8.8 and gets no response, so it has to assume that 8.8.8.8 is dead and switch to the next one. There (8.8.4.4) it gets a reply, and so on.

Now what happens when 1 server is really dead and the other 3 aren’t? Will the result be a 1/4 failure rate on correct records?
When it is working correctly, such a failure will result in a slight reduction of the weight of a working server, but the weight of the dead server (which never replied in the last hour) is much lower, and the rotation skips it most of the time, so the average failure rate remains much lower.
(Of course “failure” here means a 2-second delay, because the resolver tries the next server after 2 seconds, so the record is eventually resolved by a server that responds properly.)
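A quick simulation of this argument (the weighting rule, halving factor, and floor are guesses for illustration, not RouterOS internals):

```python
import random

def simulate(n_queries, dead, servers, floor=0.02):
    """Weighted pick with retry: a pick that times out halves that
    server's weight (down to a floor) and costs one 2-second delay;
    the query is then retried with another weighted pick."""
    weights = {s: 1.0 for s in servers}
    delayed = 0
    for _ in range(n_queries):
        hit_timeout = False
        while True:
            pick = random.choices(list(weights), list(weights.values()))[0]
            if pick != dead:
                weights[pick] = 1.0      # got a reply
                break
            hit_timeout = True           # 2-second timeout on the dead server
            weights[pick] = max(floor, weights[pick] / 2)
        if hit_timeout:
            delayed += 1
    return delayed / n_queries

random.seed(1)
rate = simulate(10000, "1.0.0.1", ["8.8.8.8", "8.8.4.4", "1.1.1.1", "1.0.0.1"])
# Far below the naive 1/4: once the dead server's weight reaches the
# floor, only a small fraction of queries pay the 2-second penalty.
assert rate < 0.05
```

With dumb round-robin instead, roughly one query in four would pay the penalty, which is the contrast the paragraph above is making.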

So far I wasn’t able to find when/if any kind of weighting kicks in. If I make one server permanently dead, it’s still asked when the previous one in line fails (and that adds delay). But I would need a different method of controlling failures to properly test it. Random in-addr.arpa queries are not good for this; you’d need some external monitoring to evaluate what really failed. Let’s see what jo2jo comes up with.

I gave up and use only 1 (virtual) IP in my DNS config. That does the trick.

My understanding was that DNS servers were always used in preference order: the first one until it is unavailable, at which point queries go to the second.

If this is not the case it is both good and bad news I guess.

That is usually the case with resolver libraries and their config (e.g. /etc/resolv.conf). The big drawback is that the system becomes extremely sluggish when the first DNS server is down.
Resolver programs (including the DNS resolver used in RouterOS) normally do not work that way; they use some form of round-robin.
That not only improves performance, it also distributes the load over DNS servers that are usually configured everywhere in the same order.
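To put rough numbers on “extremely sluggish” (back-of-the-envelope, with illustrative values; 2 seconds is the default retry timeout mentioned earlier in the thread, and the 1% probing share is a guess):

```python
TIMEOUT = 2.0   # default retry timeout, in seconds

def strict_order_extra_delay(first_dead):
    # Strict preference order: every query tries the dead primary
    # first and waits out the full timeout before asking the secondary.
    return TIMEOUT if first_dead else 0.0

def weighted_extra_delay(first_dead, dead_share=0.01):
    # Weighted round-robin: only the small probing fraction of
    # queries still hits the dead server.
    return TIMEOUT * dead_share if first_dead else 0.0

assert strict_order_extra_delay(True) == 2.0   # every query is 2 s slower
assert weighted_extra_delay(True) < 0.05       # barely noticeable on average
```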

Bump. Would love to see some documentation from MT on this.