Hi all,
This is a somewhat philosophical post - it's not about the actual technical details, which I will only use as examples, but about the underlying issues we face when gauging the capability of Mikrotik devices to perform certain functions.
The issue at hand is that there are instances when Mikrotik devices stop functioning or suffer degraded performance. The problem manifests itself elsewhere (customer complaints, to start with) and is not easily caught by monitoring the devices in question. The lack of in-depth monitoring and/or logging makes these issues extremely hard to debug, and one ends up going around in circles with an ever-increasing amount of gray hair.
The first example is the recent problems we've been having with PPPoE sessions getting dropped - I won't reproduce the details here, but two threads can serve as reference:
viewtopic.php?p=811016
viewtopic.php?p=811132
And one blog post that addresses the issue head-on and provides an architectural solution:
https://aacable.wordpress.com/2018/03/2 ... -mikrotik/
All references to PPPoE capacity I can find are vague, ranging from "we are doing 200 sessions on box A" to "we are doing 5000 sessions on box B", with no clear indication of configuration, topology, or supporting infrastructure.
The second example is DNS. Our architecture in terms of DNS is "waterfall-like", where downstream devices use upstream devices as their respective resolvers:
We recently started experiencing increased volumes of customer complaints pointing to DNS issues. Tests across the network, including packet traces on all devices in the chain, showed that some requests went unanswered or timed out, while others got SERVFAIL. However, all CPU statistics on the CCRs showed sub-1% CPU utilization (using Profile). We tried all kinds of settings, from increasing timeouts all the way to stupid values like 15 seconds, to raising the number of concurrent requests to 30,000, etc. Nothing worked.
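For reference, the kind of tweaks we experimented with looked roughly like the below (values are illustrative of what we tried, not a recommendation; parameter names are from recent RouterOS 6.4x builds, so check yours):

```
/ip dns set allow-remote-requests=yes \
    query-server-timeout=15s query-total-timeout=15s \
    max-concurrent-queries=30000
/ip dns print
```

None of these made any measurable difference to the timeouts or SERVFAILs.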
Eventually, we moved the network-level CCRs (the colored ones in my drawing) to use a bind9 server we set up at the DC, and removed DNS duties from the main CCR1072. Here is the CPU graph of the main CCR1072 for last week, when the change was implemented. Can anyone guess what day it was done?
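For completeness, the bind9 box is a plain recursive resolver; a minimal named.conf.options sketch of that setup is below (the ACL prefixes are placeholders, not our real ranges):

```
acl internal {
    10.0.0.0/8;      // placeholder - substitute your internal prefixes
    192.168.0.0/16;
};

options {
    directory "/var/cache/bind";
    recursion yes;
    allow-recursion { internal; };
    allow-query { internal; };
};
```

The CCRs then simply point at this server as their upstream resolver instead of the 1072.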
It was mid-day on Friday. There was NO impact on CPU usage in graphing, and no impact in the Profile tool either. It's as if DNS duties are not captured by Profile. We simply had no way to see the 1072 was choking. The impact for customers, though, was very noticeable: once DNS was moved off the 1072, our daily traffic peaks jumped from 8.3 Gbps to over 9.5 Gbps.
I have just implemented a chart showing the DNS statistics on the bind9 server, and it's showing some 300 queries per second at peak times. Quite how the 1072 was not able to handle this, or where the failure point was, is a mystery:
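For anyone wanting to chart the same thing, the idea is simple: take two `rndc stats` dumps some interval apart and diff the cumulative QUERY counter. A rough Python sketch (the named.stats section layout is an assumption based on the BIND versions we run - verify against your own dump):

```python
import re


def parse_query_count(stats_text):
    """Extract the cumulative QUERY counter from a BIND named.stats dump.

    Assumes the usual section layout, e.g.:
        ++ Incoming Requests ++
              123456 QUERY
    Returns None if the counter is not found.
    """
    in_section = False
    for line in stats_text.splitlines():
        if line.startswith("++ Incoming Requests ++"):
            in_section = True
            continue
        if in_section:
            m = re.match(r"\s*(\d+)\s+QUERY\b", line)
            if m:
                return int(m.group(1))
            if line.startswith("++"):  # reached the next section
                break
    return None


def queries_per_second(count_then, count_now, seconds):
    """Average query rate between two cumulative counter snapshots."""
    return (count_now - count_then) / seconds
```

We run this from cron: snapshot the counter, sleep the polling interval, snapshot again, and feed the resulting rate into the graphing system.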
I would plead with Mikrotik to implement low-level debugging tools that would allow us to spot these kinds of trends and patterns. I'm sure there is a place in RouterOS that measures DNS queries, failures, etc. - just like we have a way to see RADIUS statistics (even though the WebFig UI for it has been broken for months...).
If there are hardware resource issues that affect critical functionality, there should be a way to trace them without the slowness of Profile (which doesn't let you see time-series changes at fine resolution, for example...). As it stands, we have no way to know when the next architectural or configuration problem will be spotted by our customers first, simply because we lack the tools to be pre-emptive.
Here is a post from 2014, asking specifically for more details on Profile tool:
viewtopic.php?t=90315
and the reply was "sorry, you need to watch really closely and see if you can spot something". To me, that amounts to being told to watch for "disturbances in the Matrix":
(That last one said with a dose of humor - I'm a big Mikrotik fan, but I get frustrated by some of their responses to customer problems and requests...)