Hi all,
We're currently peaking at around 8.5 Gbps in our network, all going through a single CCR1072. We use this CCR as a DNS cache, which feeds our downstream networks, each served by its own CCR (1016 to 1072, depending on customer volume). Basically, the customer CPE has the network-level CCR as its DNS server, and the network-level CCR has the main 1072 as its own DNS server. The main 1072 uses a local DNS resolver which has no issues whatsoever.
A few weeks ago, noise started bubbling up from the customer base about "DNS issues", where certain sites or services would not be accessible, or would only half work. After some lengthy testing and digging, I've concluded that the main 1072 is the bottleneck, yet I have no way to really measure what's going on inside it.
Setting up a bind9 server on a DO droplet (160 ms away from our network, so slow), and setting this server up as the resolver for a couple of network-level CCRs, has made the issues go away. The bind server is getting some 40-60 queries per second, which are cached by the network-level CCRs (as well as bind doing its own share of caching within TTL limits). We've also seen an uptick in traffic of 15-20% in both networks.
We already had issues with ~1200 PPPoE sessions mysteriously dropping, with some really strange behavior, on the CCR1072 in one of the networks, and MikroTik's reply was literally "the CCR has insufficient resources for your particular setup", yet we keep reading about people doing 3000+ PPPoE sessions on smaller boxes.
I'm really hoping we can get some clarity on DNS performance, how to measure what the CCRs' resolvers are doing, or whether we should completely walk away from the CCR doing any form of DNS functionality.
We have already tried (in case you're wondering):
- Increasing cache size (usage never hits the limit)
- Increasing timeout values (clients time out even if 10-15 seconds are set)
- Increasing the number of concurrent queries (30,000 on the main CCR; is there an upper limit? How does this get handled?)
- Changing the maximum TTL value
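
On the measurement question: since I have no way to see inside the CCR's resolver, one option is to probe it from the outside and record latency, timeouts and SERVFAIL responses over time. Below is a minimal sketch of that idea in Python, not something taken from our production setup. The function names `build_query`, `probe` and `summarize` are made up for illustration, it only sends plain A-record queries over UDP, and it says nothing about why the resolver is slow, only whether it is.

```python
import socket
import statistics
import struct
import time


def build_query(name: str, qid: int = 0x1234) -> bytes:
    """Build a minimal DNS query for an A record with recursion desired."""
    # Header: ID, flags (RD=1), QDCOUNT=1, ANCOUNT=0, NSCOUNT=0, ARCOUNT=0
    header = struct.pack(">HHHHHH", qid, 0x0100, 1, 0, 0, 0)
    # QNAME: each label prefixed by its length, terminated by a zero byte
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    return header + qname + struct.pack(">HH", 1, 1)  # QTYPE=A, QCLASS=IN


def probe(server: str, name: str, timeout: float = 2.0):
    """Send one query; return (latency_seconds, rcode) or (None, None) on failure."""
    query = build_query(name)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        start = time.monotonic()
        try:
            sock.sendto(query, (server, 53))
            data, _ = sock.recvfrom(512)
        except OSError:  # includes socket timeouts
            return None, None
        elapsed = time.monotonic() - start
    # Reject truncated replies or replies whose ID does not match our query
    if len(data) < 12 or data[:2] != query[:2]:
        return None, None
    return elapsed, data[3] & 0x0F  # RCODE is the low 4 bits of the flags


def summarize(results):
    """Summarize a list of (latency, rcode) tuples returned by probe()."""
    latencies = [lat * 1000 for lat, _ in results if lat is not None]
    return {
        "sent": len(results),
        "timeouts": sum(1 for lat, _ in results if lat is None),
        "servfail": sum(1 for _, rcode in results if rcode == 2),
        "median_ms": statistics.median(latencies) if latencies else None,
    }
```

Something like `summarize([probe("192.0.2.1", "example.com") for _ in range(50)])` would give a rough picture, where 192.0.2.1 is a placeholder for the resolver address. Running it against the main 1072 and against the bind server at the same time, with a mix of cached and uncached names, should show whether timeouts cluster on the CCR.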