Part of this is indeed our architecture / configuration, but the deeper and more worrying issue is that in some cases one has almost zero visibility into when the architecture or configuration is causing a particular problem. Had we such tools at our disposal (or were the wiki more useful in some areas), the noise on this forum attributed to "architecture" or "configuration" would die down considerably.
Let's take DNS as our next "mystery headache" example. I'm going to stop going on about all this in this thread and open a more general one on "Lack of tools and visibility into performance issues", but it serves as a useful illustration:
Screen Shot 2020-08-17 at 07.46.46.png
Notice anything odd? Not really, right? Tool -> Profile didn't either, with DNS sitting at less than 1% CPU usage. Yet we had hundreds of customer tickets along the lines of "some sites don't load" and "DNS doesn't resolve", i.e. genuinely poor service. We had been trying to pin down the issue for quite some time. Packet traces only showed that DNS responses sometimes never arrive at all and sometimes come back as SERVFAIL, but nothing that signalled a performance problem on the CCRs or in the network topology. This is our architecture (in terms of configuration, we tried all sorts of changes to the DNS settings on all the boxes involved):
dns.png
The arrows show the upstream resolver for each device. Each customer hAP also acts as a caching DNS server for the devices the customer connects to it. Technically this is quite an efficient setup: the central CCR only needs to query upstream once (within TTL limits, of course) and caches the result for all the network-level CCRs, and each network-level CCR does the same for the up to 1,500 CPEs connected to it.
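For illustration, one way to spot-check where in this chain answers go missing is to query each tier directly for the same name from outside the routers and compare return codes and response times. A minimal sketch (assuming the Python dnspython package; the tier addresses and test name below are placeholders, not our real ones):

[code]
import time

import dns.exception
import dns.message
import dns.query
import dns.rcode

NAME = "example.com"   # placeholder test name

# Placeholder addresses for each tier of the caching chain
TIERS = {
    "customer hAP":      "192.168.88.1",
    "network-level CCR": "10.10.0.1",
    "central CCR":       "10.0.0.1",
    "upstream resolver": "8.8.8.8",
}

for label, server in TIERS.items():
    query = dns.message.make_query(NAME, "A")
    start = time.monotonic()
    try:
        resp = dns.query.udp(query, server, timeout=2)
        rtt_ms = (time.monotonic() - start) * 1000
        print(f"{label:18} {server:15} {dns.rcode.to_text(resp.rcode()):8} {rtt_ms:6.1f} ms")
    except dns.exception.Timeout:
        print(f"{label:18} {server:15} TIMEOUT (no response within 2 s)")
[/code]

Run from a box that can reach all four tiers, a SERVFAIL or timeout at one tier but not the next immediately narrows down which resolver is misbehaving.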
Fed up with coming up with nothing, the final proof was to set up a caching bind9 server on a DigitalOcean droplet as a test. We noticed an immediate 15-20% uptick in traffic on two network-level CCRs which had this DNS server set as their upstream source.
We then set up a bind9 server next to the 1072 at the DC and pointed all 20 network-level CCRs to it, noting a jump in aggregate peak-time traffic from 8.3Gbps to just over 9.5Gbps. Customers now report they are fine.
Back to my first graph: there is zero positive impact on CPU from removing DNS duties from the main CCR. The change was made mid-day on Friday.
My BIG question to you now is: can you predict, or failing that, observe directly, when the network-level CCRs will "give up" on DNS, just as they "give up" on PPPoE? Yes, we can then point the customer hAPs directly at the bind9 server, but do we need to wait for customers to scream and trash us on social media before we notice? And what will be the next feature the CCRs "give up" on?
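In the absence of any built-in counter or alert for this, about the best one can do today is probe the resolvers from outside and watch the failure rate. A rough sketch of such a monitor (again assuming dnspython; the resolver addresses, test names, threshold and interval are all placeholders):

[code]
import time

import dns.exception
import dns.message
import dns.query
import dns.rcode

# Placeholder addresses of the network-level CCRs' resolvers
CCR_RESOLVERS = ["10.10.1.1", "10.10.2.1", "10.10.3.1"]
# Placeholder names to resolve each cycle
TEST_NAMES = ["example.com", "example.net", "example.org", "wikipedia.org", "mikrotik.com"]
FAIL_THRESHOLD = 0.2   # alert if more than 20% of probes fail in one cycle
INTERVAL = 60          # seconds between cycles

def probe(server, name):
    """Return True if the resolver answers NOERROR within 2 seconds."""
    try:
        resp = dns.query.udp(dns.message.make_query(name, "A"), server, timeout=2)
        return resp.rcode() == dns.rcode.NOERROR
    except dns.exception.DNSException:
        return False   # timeout, network error, malformed reply, ...

while True:
    for server in CCR_RESOLVERS:
        failures = sum(1 for n in TEST_NAMES if not probe(server, n))
        rate = failures / len(TEST_NAMES)
        if rate > FAIL_THRESHOLD:
            # In practice: push to a graphing/alerting system instead of printing
            print(f"ALERT: resolver {server} failed {rate:.0%} of probes this cycle")
    time.sleep(INTERVAL)
[/code]

But that is exactly the point: it should not take an external script polling the routers to discover that a core service on them has quietly given up.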
We need far, far greater visibility into issues before certain problems can fairly be blamed on "architecture" or "configuration".