I recently deployed an RB5009UG+S+IN as both a primary gateway and local DNS server. For about three months after deployment it was leaking memory, eventually reaching ~95% memory usage, at which point the router would reboot. This was happening quite quickly too - the last such cycle took a little less than 6 days from startup (about 15% usage) to reboot.
I have been monitoring the router’s memory usage via SNMP since the initial deployment, so I was able to watch memory usage drifting upwards in real time. The upwards trend would occasionally pause for an hour or two, but outside of such periods it was a consistent upwards drift. Interestingly the rate at which memory was used actually increased several times, coinciding with changes being deployed to the running config - I will return to this later.
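For anyone who wants to put numbers on a drift like this, here is a rough sketch of how the growth rate and the time left before a reboot could be estimated from polled samples. It assumes the SNMP readings have already been exported as (hours since boot, memory used in %) pairs; the function names, the 95% threshold default and the sample values are mine for illustration, not something taken from RouterOS or my monitoring setup.

```python
def growth_rate(samples):
    """Least-squares slope of (hours, percent) samples, in percentage points per hour.

    Needs at least two samples taken at different times.
    """
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_m = sum(m for _, m in samples) / n
    num = sum((t - mean_t) * (m - mean_m) for t, m in samples)
    den = sum((t - mean_t) ** 2 for t, _ in samples)
    return num / den


def hours_until(samples, threshold=95.0):
    """Hours from the last sample until usage reaches threshold, or None if not growing."""
    rate = growth_rate(samples)
    if rate <= 0:
        return None
    _, last_used = samples[-1]
    return (threshold - last_used) / rate


# Illustrative values only: one reading per day, starting at 15% after boot.
samples = [(0, 15.0), (24, 28.0), (48, 41.0)]
print(growth_rate(samples) * 24, "percentage points per day")
print(hours_until(samples), "hours until 95%")
```

A straight-line fit is obviously a simplification, since the rate changed whenever the config changed, so it is better applied to the samples taken since the most recent change.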
Eventually a colleague of mine stumbled across supout.rif, so I generated one right as the memory usage was at its peak and they analysed it. From this we learned that the resolver process (which we assumed to be the local DNS server) had allocated itself 843MB of memory! In other words, it was using over 80% of the RB5009’s 1GB of memory on its own!
Edit: I forgot to mention here that I had increased the DNS cache size to fit the configured adlist. However it was only increased to 20MB or so, so the conclusions don’t change.
Earlier I had temporarily disabled the DNS adlist feature, thinking that might be causing the issue. However, this hadn’t made any difference, so I did some more research. During my research I turned up the following threads:
- DNS cache and memory usage, without adlist
- Log flooded with cache full, not storting (dns)
- DNS Cahe Full/adlist read: max cache size reached
- cache full, not storing since 7.14
I think all these threads are related, but the first and second are the most similar to my issue - particularly the second, which actually describes the router rebooting due to an out of memory condition, which I haven’t seen mentioned in any other threads.
I saw several users report that converting static CNAMEs to A records resolved issues with the DNS cache being filled, so I converted all the CNAME records on my router to A records. Upon making this change the router’s memory usage immediately stopped increasing. However, memory usage didn’t drop until I rebooted the router a few days later; since then it has been stable at about 15% usage.
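If you have a lot of static entries, the conversion itself can be worked out offline before touching the router. Below is a minimal sketch, assuming the static entries have been copied by hand into a Python dict; it does not talk to RouterOS at all, and the record layout, the function name and the hostnames and addresses are all made up for illustration. It follows each CNAME chain to its A record and lists any names it could not flatten (for example a CNAME pointing at a name that is not in the local table).

```python
def flatten_cnames(records, max_depth=8):
    """Resolve CNAME chains in a local record table down to addresses.

    records maps name -> ("A", ip) or ("CNAME", target_name).
    Returns (flattened, unresolved): flattened maps name -> ip, and
    unresolved lists names whose chain leaves the table, loops, or
    is longer than max_depth.
    """
    flattened = {}
    unresolved = []
    for name in records:
        current = name
        for _ in range(max_depth):
            entry = records.get(current)
            if entry is None:
                break  # target is not a local static entry
            rtype, value = entry
            if rtype == "A":
                flattened[name] = value
                break
            current = value  # follow the CNAME to its target
        if name not in flattened:
            unresolved.append(name)
    return flattened, unresolved


# Hypothetical entries; addresses are from the documentation range.
records = {
    "nas.lan": ("A", "192.0.2.10"),
    "files.lan": ("CNAME", "nas.lan"),
    "backup.lan": ("CNAME", "files.lan"),
    "cdn.lan": ("CNAME", "example.net"),
}
flattened, unresolved = flatten_cnames(records)
print(flattened)
print(unresolved)
```

Anything that ends up in the unresolved list needs a decision by hand, since a CNAME to an external name has no fixed address to replace it with.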
Returning to the increased memory consumption rates I mentioned earlier: I compared the deployed changes with the inflection points, and found that each of the coinciding changes had created one or more new CNAMEs.
Edit2: Note about RouterOS version(s): when I first noticed the leak the router was running v7.14.3, and it was running v7.17.2 when I managed to resolve the issue. I have since upgraded to v7.18.2 - I have not tried CNAMEs since upgrading, but the changelog doesn’t indicate any relevant fixes to the DNS resolver, so I assume the problem persists.
To me this seems like fairly strong evidence that RouterOS is mishandling static CNAME records, but I have one more spanner to throw in the works: I have an RB4011 running at another site which, despite having several static CNAMEs, does not exhibit a memory leak. However, I am aware that the CNAMEs configured on the RB4011 are for uncommonly queried names (in fact those records may not be queried at all), whereas the CNAMEs set on the RB5009 were queried extremely often.