lansend
just joined
Topic Author
Posts: 12
Joined: Mon May 23, 2016 7:59 pm

False Alerts

Tue Aug 08, 2017 9:10 pm

Please accept my apologies if this has been asked and answered in the past.

We have a WiFi mesh made up of 11 single-radio and 3 dual-radio MikroTik devices. Most of the access points, single and dual radio, are running OpenWrt Backfire r31348. Two dual radios are inexplicably running MikroTik RouterOS 6.7. Most of the devices are alix2d2:
http://www.pcengines.ch/alix2d2.htm

We use The Dude to monitor the mesh.
When you run a discovery in services, you typically get cpu, disk, dns, memory, ping, router, and virtual memory.
We have set the polling to probe every 5 minutes with a 15-second timeout, and to consider a service down after a down count of 3.

1. To the best of our understanding, the above should give us an alert if a device cannot be reached after 15 minutes and 3 attempts to connect. Is that correct?

2. With all the default probes listed above, one device or another constantly reports that one or two probes are down while the rest remain up. For example, the alert will say that Service disk and Service CPU are down while the rest are up, and a couple of minutes later they will all be up again. At times we have been connected to devices behind the radio and did not lose the connection to the device.
We experimented with some of the percentage probes provided by Winkelman in the Probe thread. This confused us even more: the default disk probe will be down, but the probe “Warning when disk usage goes over 89%” remains up.
Past experience with other tools has shown that ping alone is not sufficient, because a device may not be functioning but still respond to a ping.
All we want to know is when the radio is actually down. At this point we just keep increasing Probe Interval, Probe Timeout and Probe down count until the devices stop hemorrhaging alerts. Our logic is that it is better to get a real alert even after 1 hour than to be inundated with false (?) alerts.
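The arithmetic behind question 1 can be sketched as follows. This is only a rough model, not documented behavior of The Dude: it assumes the service is declared down after `down_count` consecutive failed probes, one per polling interval, and that each failed probe waits out its full timeout. The function names are made up for illustration.

```python
def alert_delay_seconds(interval_s, timeout_s, down_count):
    """Time from the start of the first failed probe until the service is declared down.

    Assumption: `down_count` consecutive failures are required, one per
    polling interval, and the last failed probe waits out its full timeout.
    """
    return (down_count - 1) * interval_s + timeout_s


def worst_case_outage_to_alert_seconds(interval_s, timeout_s, down_count):
    """Worst case: the outage starts just after a successful probe."""
    return interval_s + alert_delay_seconds(interval_s, timeout_s, down_count)


# Settings from the post: 5-minute interval, 15-second timeout, down count 3.
first_failure_to_alert = alert_delay_seconds(300, 15, 3)
worst_case = worst_case_outage_to_alert_seconds(300, 15, 3)
print(first_failure_to_alert)  # 615
print(worst_case)              # 915
```

Under these assumptions the alert arrives roughly 10 to 15 minutes after the device actually stops responding, which is close to the 15 minutes expected in question 1.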

Any and all advice will be greatly appreciated.
 
lansend
just joined
Topic Author
Posts: 12
Joined: Mon May 23, 2016 7:59 pm

Re: False Alerts

Tue Aug 08, 2017 10:25 pm

Here is another example. The memory probe has been down for about 10 minutes now. I have re-probed multiple times to no avail. In all probability it will resolve itself tomorrow. If not, a reboot will probably clear the error.
http://help.lansend.com/probedown.jpg
There is an NVR behind the radio, and I can reach the NVR and manage and monitor the NVR and cameras.
Is something really down? Do I need to take any action?
 
lebowski
Forum Guru
Posts: 1619
Joined: Wed Aug 27, 2008 5:17 pm

Re: False Alerts

Tue Sep 05, 2017 5:45 pm

The oid function has a built-in 300-second negative cache timer. That means a failed result, once stored, stays failed for 5 minutes, which can cause probes to never respond to re-probes. I don't remember whether I fully tested this, but if the negative cache timer has not expired just before the re-probe interval, the probe will stay down even though the system is not down. I have changed everything I do into a function and a probe so that I can specify the negative cache timer. I have not had false positives since I started specifying the timer manually.
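The effect of a negative cache on re-probes can be illustrated with a toy model. This is a sketch of the general caching idea only, not The Dude's actual implementation. The `CachedOid` class, the `down_seconds` helper, the 60-second re-probe spacing, and the returned value 42 are all assumptions made for the example.

```python
class CachedOid:
    """Toy model of an SNMP read with separate positive and negative caches."""

    def __init__(self, fetch, cache_s, negative_cache_s):
        self.fetch = fetch                      # callable: returns a value, or None on failure
        self.cache_s = cache_s                  # how long a good answer is reused
        self.negative_cache_s = negative_cache_s  # how long a failure is reused
        self.value = None
        self.expires_at = float("-inf")

    def read(self, now):
        if now < self.expires_at:
            return self.value                   # served from cache, no new query is sent
        self.value = self.fetch()
        ttl = self.negative_cache_s if self.value is None else self.cache_s
        self.expires_at = now + ttl
        return self.value


def down_seconds(negative_cache_s, reprobe_every_s=60, duration_s=600):
    """Seconds the probe reports down after a single transient failure at t=0."""
    answers = iter([None])                      # first query fails, every later one succeeds
    oid = CachedOid(lambda: next(answers, 42), cache_s=10,
                    negative_cache_s=negative_cache_s)
    down = 0
    for now in range(0, duration_s, reprobe_every_s):
        if oid.read(now) is None:
            down += reprobe_every_s
    return down


print(down_seconds(300))  # 300: re-probes keep hitting the cached failure
print(down_seconds(5))    # 60: the next re-probe already sees the device as up
```

In this model a single dropped reply keeps the probe down for the full negative cache period no matter how often it is re-probed, which matches the behavior described above.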

The ,10,5 after the OID are the cache time and negative cache time in the following Cisco CPU function. (Note that the +1 is there because the CPU OID can return 0, which is treated as down.)
if(string_size(oid("1.3.6.1.4.1.9.2.1.57.0", 10, 5)), oid("1.3.6.1.4.1.9.2.1.57.0", 10, 5) + 1, "False")
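For readers who do not write Dude functions, the logic of that expression can be mirrored in Python. This is only an illustrative translation: `cisco_cpu_probe` and its `raw` argument are hypothetical names, and an empty string stands in for a failed SNMP read.

```python
def cisco_cpu_probe(raw):
    """Mirror of the Dude expression above.

    raw: the string returned by the SNMP read, or "" when the read failed
    (an assumption standing in for string_size(...) being 0).
    """
    if len(raw) > 0:
        # Shift the value by one so that an idle CPU (0) is not read as down.
        return int(raw) + 1
    return "False"


print(cisco_cpu_probe("0"))   # 1
print(cisco_cpu_probe("37"))  # 38
print(cisco_cpu_probe(""))    # False
```

The point of the +1 is that a legitimate reading of 0% CPU and a failed read no longer look the same to the probe.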
 
lansend
just joined
Topic Author
Posts: 12
Joined: Mon May 23, 2016 7:59 pm

Re: False Alerts

Tue Sep 05, 2017 11:35 pm

Lebowski, thank you for your reply. While your explanation is clear, the code change is way beyond us. Can something like this be incorporated into the basic probe, e.g. if(cpu_usage_available(), "", "down")?
 
lebowski
Forum Guru
Posts: 1619
Joined: Wed Aug 27, 2008 5:17 pm

Re: False Alerts

Wed Sep 06, 2017 7:05 pm

Yes, I modified the built-in functions, most of them...

cpu_usage_available
array_size(oid_column("iso.org.dod.internet.mgmt.mib-2.host.hrDevice.hrProcessorTable.hrProcessorEntry.hrProcessorLoad", 10, 5))

cpu_usage
average(oid_column("iso.org.dod.internet.mgmt.mib-2.host.hrDevice.hrProcessorTable.hrProcessorEntry.hrProcessorLoad", 10, 5))

It was a pain in the ass but I don't get many false positives.

Lebowski
