Please accept my apologies if this has been asked and answered in the past.
We have a WiFi mesh made up of 11 single-radio and 3 dual-radio Mikrotik devices. Most of the access points, both single and dual radio, are running OpenWrt Backfire r31348. Two of the dual-radio devices are inexplicably running Mikrotik RouterOS 6.7. Most of the devices are alix2d2 boards:
http://www.pcengines.ch/alix2d2.htm
We use the Dude to monitor the mesh.
When you run a service discovery, you typically get cpu, disk, dns, memory, ping, router and virtual memory.
We have set polling to probe every 5 minutes with a 15-second timeout, and to consider a service down after a down count of 3.
1. To the best of our understanding, the above should give us an alert if a device cannot be reached after 15 minutes and 3 attempts to connect. Is that correct?
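To make our assumption concrete, here is the arithmetic we are working from, as a small Python sketch. It assumes probes fire at a fixed interval, that a failed probe is counted as failed once the timeout expires, and that the alert fires on the third consecutive failure. We do not know whether the Dude actually schedules probes this way, and the function name is our own.

```python
def detection_window(interval_s, timeout_s, down_count):
    """Return (best_case, worst_case) seconds from real failure to alert.

    Assumption (ours, not confirmed for the Dude): probes fire every
    interval_s seconds, a failed probe is counted after timeout_s seconds,
    and the alert fires on the down_count-th consecutive failure.
    """
    # Best case: the device fails just before a probe fires.
    best = (down_count - 1) * interval_s + timeout_s
    # Worst case: the device fails just after a probe succeeded,
    # so a full interval passes before the first failed probe.
    worst = down_count * interval_s + timeout_s
    return best, worst


# Our settings: 5-minute interval, 15-second timeout, down count of 3.
best, worst = detection_window(300, 15, 3)
print(f"alert after {best / 60:.2f} to {worst / 60:.2f} minutes")
```

Under those assumptions the alert would arrive between roughly 10 and 15 minutes after the device really went down, which is where our "15 minutes" figure comes from.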
2. With all the default probes listed above, one device or another constantly reports that one or two probes are down while the rest remain up. For example, the alert will say that Service disk and Service CPU are down while the rest are up, and within a couple of minutes they will all be up again. At times we have been connected to devices behind the radio while this happened and did not lose our connection to them.
We experimented with some of the percentage probes provided by Winkelman in the Probe thread. This confused us even more: the default disk probe will be down, but the probe “Warning when disk usage goes over 89%” remains up.
Past experience with other tools has shown that ping alone is not sufficient: a device may not be functioning but still respond to a ping.
All we want to know is when the radio is actually down. At this point we just keep increasing the Probe Interval, Probe Timeout and Probe down count until the devices stop hemorrhaging alerts. Our logic is that it is better to get a real alert even after 1 hour than to be inundated with false (?) alerts.
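To illustrate what we mean by "actually down", here is a rough Python sketch of the rule we would like to end up with: alert only when every probe on a device has failed for several polls in a row, and ignore a single flapping probe such as disk or cpu. This is not Dude configuration. The function name and the data layout are made up for the example.

```python
def is_really_down(history, down_count):
    """Decide whether a device should be reported as down.

    history: list of polls, oldest first. Each poll is a dict mapping a
    probe name to True (up) or False (down). Hypothetical layout.
    The device counts as down only if every probe failed in each of the
    last down_count polls.
    """
    if len(history) < down_count:
        return False  # not enough polls yet to decide
    recent = history[-down_count:]
    # A poll with no probe results is not treated as a failure.
    return all(bool(poll) and not any(poll.values()) for poll in recent)


# Disk and cpu flap while ping stays up: no alert.
flapping = [{"ping": True, "disk": False, "cpu": False}] * 3
# Everything fails three polls in a row: alert.
dead = [{"ping": False, "disk": False, "cpu": False}] * 3

print(is_really_down(flapping, 3), is_really_down(dead, 3))
```

If the Dude can express something like this directly, that is probably what we are looking for.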
Any and all advice will be greatly appreciated.