DoH corrupting DNS cache? DNS cache full with invalid data?

So a rather strange problem occurred today. I woke up to complaints that some sites weren’t accessible from my network, especially www.youtube.com. For some reason DNS wasn’t resolving, but only for a few domains.


 dig @192.168.1.1 www.google.com

; <<>> DiG 9.11.3-1ubuntu1-Ubuntu <<>> @192.168.1.1 www.google.com
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 56004
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;www.google.com.                        IN      A

;; ANSWER SECTION:
www.google.com.         231     IN      A       172.217.167.132

;; Query time: 1 msec
;; SERVER: 192.168.1.1#53(192.168.1.1)
;; WHEN: Mon Jun 22 19:11:50 +06 2020
;; MSG SIZE  rcvd: 48

 dig @192.168.1.1 www.youtube.com

; <<>> DiG 9.11.3-1ubuntu1-Ubuntu <<>> @192.168.1.1 www.youtube.com
; (1 server found)
;; global options: +cmd
;; connection timed out; no servers could be reached

 dig @8.8.8.8 www.youtube.com

; <<>> DiG 9.11.3-1ubuntu1-Ubuntu <<>> @8.8.8.8 www.youtube.com
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 61986
;; flags: qr rd ra; QUERY: 1, ANSWER: 8, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 512
;; QUESTION SECTION:
;www.youtube.com.               IN      A

;; ANSWER SECTION:
www.youtube.com.        21388   IN      CNAME   youtube-ui.l.google.com.
youtube-ui.l.google.com. 88     IN      A       216.58.200.142
youtube-ui.l.google.com. 88     IN      A       172.217.167.142
youtube-ui.l.google.com. 88     IN      A       172.217.160.142
youtube-ui.l.google.com. 88     IN      A       216.58.196.174
youtube-ui.l.google.com. 88     IN      A       172.217.163.46
youtube-ui.l.google.com. 88     IN      A       172.217.163.78
youtube-ui.l.google.com. 88     IN      A       172.217.163.142

;; Query time: 72 msec
;; SERVER: 8.8.8.8#53(8.8.8.8)
;; WHEN: Mon Jun 22 19:12:13 +06 2020
;; MSG SIZE  rcvd: 190
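
The dig comparison above (same name, router vs. a public resolver) can be scripted so it is easy to repeat for a list of domains. This is only a minimal sketch, assuming Python 3.9+ and plain UDP on port 53; the function names `build_query`, `parse_header`, `probe` and `compare` are made up for illustration, and it reads only the response header instead of decoding the answer records the way dig does.

```python
import socket
import struct


def build_query(name: str, query_id: int = 0x1234) -> bytes:
    """Build a minimal DNS query for an A record (RFC 1035 wire format)."""
    # Header: ID, flags (only RD set), QDCOUNT=1, ANCOUNT=0, NSCOUNT=0, ARCOUNT=0
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # QNAME: each label prefixed by its length, terminated by a zero byte
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    # QTYPE=1 (A), QCLASS=1 (IN)
    return header + qname + struct.pack("!HH", 1, 1)


def parse_header(response: bytes) -> tuple[int, int]:
    """Return (rcode, answer_count) taken from a DNS response header."""
    _id, flags, _qd, ancount, _ns, _ar = struct.unpack("!HHHHHH", response[:12])
    return flags & 0x000F, ancount


def probe(server: str, name: str, timeout: float = 2.0) -> str:
    """Send one query to `server` and summarise what came back."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(build_query(name), (server, 53))
            data, _ = sock.recvfrom(4096)
        except socket.timeout:
            return "timeout"
        except OSError as exc:
            return f"error: {exc}"
    rcode, ancount = parse_header(data)
    return f"rcode={rcode} answers={ancount}"


def compare(servers: list[str], names: list[str]) -> dict[tuple[str, str], str]:
    """Probe every name against every server and collect the results."""
    return {(server, name): probe(server, name) for server in servers for name in names}
```

Calling `compare(["192.168.1.1", "8.8.8.8"], ["www.google.com", "www.youtube.com"])` would cover the dig checks above. A "timeout" for the router next to a normal answer from 8.8.8.8 for the same name points at the router's resolver and not at the upstream.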

192.168.1.1 is my router, an RB750Gr3. What is really interesting is that during this time both cache-size and cache-used in /ip dns were 2048KiB, but /ip dns cache had only 3-4 entries and reverted to empty at 1-second intervals. I tried flushing the cache multiple times, to no avail: cache-used was still 2048KiB and no meaningful entries were being added to the cache. Then I doubled cache-size to 4096KiB, and cache-used instantly became 4096KiB too. The DNS cache was still broken.

Then I restarted the router and suddenly everything was fixed. Every domain is resolving correctly. cache-used isn’t full anymore. Right now the cache has 302 items and cache-used is only 183KiB.
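
To put the two cache readings side by side, here is a small arithmetic sketch using only the numbers reported above (2048KiB with 3-4 entries while broken, 183KiB with 302 entries after the reboot). The helper name `bytes_per_entry` is made up for illustration, and treating cache-used divided by entry count as a health indicator is an assumption, not something RouterOS documents.

```python
KIB = 1024


def bytes_per_entry(cache_used_kib: int, entries: int) -> float:
    """Average cache memory attributed to each visible cache entry."""
    if entries <= 0:
        raise ValueError("entries must be positive")
    return cache_used_kib * KIB / entries


# After the reboot: 183KiB spread over 302 entries, about 620 bytes each.
healthy = bytes_per_entry(183, 302)

# While broken: 2048KiB with at most 4 visible entries, 512KiB each.
broken = bytes_per_entry(2048, 4)

# The broken reading is several hundred times larger per entry, so
# almost all of cache-used was held by something other than the
# entries that /ip dns cache was showing.
ratio = broken / healthy
```

A few hundred bytes per entry is what you would expect for a name plus a handful of records; half a megabyte per entry is not, which fits the idea that the memory was leaked or corrupted and not really in use as cache.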

Now my question is: what caused this problem? It seems clear the DNS service was somehow broken or corrupted, as cache-used was at 100% of cache-size with only 3-4 entries. Is there any way to clear or restart the DNS service other than rebooting the router? I’ve been running this router for years and never had this problem. I upgraded to 6.47 recently (I always keep up to date with the stable channel), and I configured DoH.

My current DNS setting

[admin@GittuTik] /ip dns> print
                      servers: 8.8.8.8,8.8.4.4
              dynamic-servers: 103.86.96.100,103.86.99.100
               use-doh-server: https://dns.google/dns-query
              verify-doh-cert: yes
        allow-remote-requests: yes
          max-udp-packet-size: 4096
         query-server-timeout: 10s
          query-total-timeout: 15s
       max-concurrent-queries: 100
  max-concurrent-tcp-sessions: 20
                   cache-size: 4096KiB
                cache-max-ttl: 1w
                   cache-used: 183KiB

I can confirm that with the recent release of 6.47, I am experiencing the same issues, but only when DoH is enabled.

I would say DoH is still in development / wide beta. It was introduced in this version.
When you want reliability, just turn it off until things are completely sorted out and stable.

(in fact I am surprised that DoH resolved entries even show up in the cache at all)

Out of curiosity: why shouldn’t DoH resolved entries be cached?

If that were the case then 6.47 shouldn’t have been pushed to the stable channel. Nevertheless, that’s not the point.

Are the MikroTik devs aware of this problem? Is posting here enough, or do I have to report the bug through some other channel?

Interesting. I also think I can reproduce it. Right now, even if I clear the cache, cache-used is ~600KiB. So I guess this 600KiB is the garbage/corrupt data, which is increasing over time, and as soon as it hits cache-size the DNS service will break again.
Another thing I noticed: even if I disable DoH and flush, cache-used is still ~600KiB. So without rebooting the router it’s impossible to clear the DNS cache?

Yes, I recommend sending a supout report to MT; data points help when fixing code.

They should be, but in this release there was some other functionality added to the DNS resolver (added record types and forwards) and the way these features are handled when DoH is enabled suggests that DoH was “hooked” into the DNS resolver at the wrong place: not as an external DNS server, but as a handler for internal DNS requests that replaces the existing resolver.

It happened again. And this time DNS cache-used wasn’t even 100% full, but suddenly DNS queries stopped working. Sigh. I disabled DoH for now. In situations like this I wish MetaROUTER was available for more models …

Did you set up static DNS entries for dns.google? Like this (for dual-stack IPv4/IPv6):

/ip dns
set allow-remote-requests=yes cache-max-ttl=1d \
    use-doh-server=https://dns.google/dns-query verify-doh-cert=yes
/ip dns static
add address=8.8.4.4 name=dns.google ttl=5m type=A
add address=8.8.8.8 name=dns.google ttl=5m type=A
add address=2001:4860:4860::8844 name=dns.google ttl=5m type=AAAA
add address=2001:4860:4860::8888 name=dns.google ttl=5m type=AAAA


Or does the error also occur if you use Cloudflare public DoH DNS?

/ip dns
set allow-remote-requests=yes cache-max-ttl=1d \
    use-doh-server=https://1.1.1.1/dns-query verify-doh-cert=yes

No, I didn’t set static entries for dns.google, as I had 8.8.8.8 and 8.8.4.4 for resolving it the first time. I don’t see how it would be relevant … If DoH can’t resolve its own name (dns.google) then it posts an error in the log. But that wasn’t the issue, as there were no errors in the log when this happened.
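
One way to separate "the DoH provider is misbehaving" from "the router's resolver is misbehaving" is to query the DoH endpoint directly from a PC, bypassing the router's DNS cache. Below is a rough sketch, assuming Python 3 and that the endpoint accepts RFC 8484 GET requests with a base64url `dns` parameter; the function names are made up for illustration, and the only endpoint URLs meant to be used with it are the two quoted in this thread.

```python
import base64
import struct
import urllib.request


def build_query(name: str) -> bytes:
    """Minimal DNS A query in wire format; ID 0 as RFC 8484 suggests."""
    header = struct.pack("!HHHHHH", 0, 0x0100, 1, 0, 0, 0)
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    return header + qname + struct.pack("!HH", 1, 1)  # QTYPE=A, QCLASS=IN


def doh_url(endpoint: str, wire_query: bytes) -> str:
    """Build the GET URL: base64url-encoded query with padding stripped."""
    encoded = base64.urlsafe_b64encode(wire_query).rstrip(b"=").decode("ascii")
    return f"{endpoint}?dns={encoded}"


def doh_answer_count(endpoint: str, name: str, timeout: float = 5.0) -> int:
    """Ask the DoH endpoint for `name` and return ANCOUNT from the reply."""
    request = urllib.request.Request(
        doh_url(endpoint, build_query(name)),
        headers={"Accept": "application/dns-message"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = response.read()
    # ANCOUNT is the 16-bit field at bytes 6..7 of the DNS header
    return int.from_bytes(body[6:8], "big")
```

For example, `doh_answer_count("https://dns.google/dns-query", "www.youtube.com")` returning a non-zero count while the router times out on the same name would suggest the provider is fine. Note that the PC still has to resolve dns.google itself, possibly through the broken router, so the https://1.1.1.1/dns-query form may be the easier one to test with.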

Yes, this issue is still present in 6.47.1.

I stumbled on this this morning in 6.47.1. Once the cache is full you can see it constantly refreshing the current entries and reloading the static entries. All DNS requests time out.

pbs.twimg.com
Server: [192.168.1.1]
Address: 192.168.1.1

DNS request timed out.
timeout was 2 seconds.

Is there a way to restart just the DNS service?

The beta version has the same issue here.

No…

It is a legit issue. It’d be nice to have someone from MikroTik confirm that they are aware of it and working on it.

Hi All,
I reported this problem to MikroTik Support and have just had this response:

Hello,

We are seeing similar reports, currently we are trying to reproduce the issue. We are looking forward to fixing it as soon as possible.

Best regards,

Seeing the same thing, DoH enabled and DNS cache constantly evicted every second. 6.47.3

Trying 6.48beta58 since I saw some recent DNS and DoH things in the ChangeLog (though none seemed directly addressed at this).

About 4 days running and the cache size is staying at around 400k/4096k and there are plenty of entries. I would say 6.48beta58 fixed this for me!

-m

I know this is a year-old post, but I have this problem with DoH off and have tried both of the newest RouterOS releases. I can’t get rid of this problem. I end up rebooting my router multiple times a day.

Try changing to another DoH provider.