It seems I have a radio that drops a few times a day.
For some reason, I don’t see the radio going down in the log. The probes are set to a 5-minute interval with a 1-minute timeout. The probe is watching for ping, and the probe down count is 5.
If the probe doesn’t fail, it doesn’t write to the log, and I don’t know the radio is dropping unless I happen to be watching.
Is this because it will only log if the probe fails to ping 5 times (the probe down count)? In other words, every five minutes the probe tries, and if it can’t ping, it keeps trying five more times before writing to the log? So I might not know for 25 minutes in this scenario.
Thanks in advance.
Edit: The log I was referring to is the one in Winbox. I assume Dude probes and Winbox logs are two different things. I might need some real straightening out.
Yes, that is the problem: 5 x 5 = 25, so it takes a very long time to register an outage.
I had a router that kept rebooting, and I was using 30-second polling with 3 failures before sending outage notifications. The router would finish rebooting before I got a notification, so now I am using 20-second polling. It turned out the power supply in the router was bad.
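Here is a rough model of why a reboot can slip through. It assumes the probe runs exactly every `interval` seconds starting at t = 0, ignores the probe timeout, and assumes a notification fires only after `down_count` consecutive failed polls. Under those assumptions, an outage shorter than (down_count - 1) x interval can never trigger a notification, and one lasting at least down_count x interval always will. The function names are hypothetical and this is not how The Dude works internally; it is only a sketch of the arithmetic.

```python
import math

def polls_in_window(start: float, length: float, interval: float) -> int:
    """Count poll instants k * interval (k = 0, 1, 2, ...) inside [start, start + length]."""
    first = max(0, math.ceil(start / interval))
    last = math.floor((start + length) / interval)
    return max(0, last - first + 1)

def outage_detected(start: float, length: float, interval: float, down_count: int) -> bool:
    """Every poll inside the outage fails, and those failures are consecutive,
    so a notification fires once at least `down_count` polls land inside it."""
    return polls_in_window(start, length, interval) >= down_count

def detection_bounds(interval: float, down_count: int) -> tuple:
    """(never detected below, always detected from) outage length in seconds."""
    return (down_count - 1) * interval, down_count * interval

# 30-second polling with a down count of 3, as in the rebooting-router story.
print(detection_bounds(30, 3))          # (60, 90)
print(detection_bounds(20, 3))          # (40, 60)
# A 75-second reboot is caught or missed depending on where it falls between polls.
print(outage_detected(0, 75, 30, 3))    # True  (polls at 0, 30, 60 all fail)
print(outage_detected(1, 75, 30, 3))    # False (only polls at 30 and 60 fail)
```

With 30-second polling, any reboot under a minute is invisible in this model, which is consistent with the router coming back before a notification went out.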
Note: It takes a lot of fine-tuning of probes to eliminate false positives with such frequent polling.
Every configuration is different so what is best for you and best for me will be different.
As I said, I am currently running 20-second polling intervals with a Probe Down Count of 3. That takes 1 minute to detect and report a failure. You will notice that after the first failure the device turns yellow, and after the third failure it turns red.
In your case, you have 5-minute polling with 5 failed reads. It can take up to 5 minutes to register the first failure, so a device could be down for 5 minutes before it turns yellow, and it would not turn red until up to 25 minutes after the outage began (after the 5th failed read).
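The color behavior and the worst-case timings above can be sketched as follows. This is a simplified model, assuming the outage starts just after a successful poll, each later poll is exactly one interval apart, and the probe timeout adds nothing; the function names are hypothetical.

```python
def status_color(consecutive_failures: int, down_count: int) -> str:
    """Simplified model: green while polls succeed, yellow after the first
    failure, red once the failures reach the Probe Down Count."""
    if consecutive_failures <= 0:
        return "green"
    if consecutive_failures < down_count:
        return "yellow"
    return "red"

def worst_case_seconds(interval: int, down_count: int) -> tuple:
    """Worst case (seconds to yellow, seconds to red) when the outage starts
    just after a successful poll; the probe timeout is ignored."""
    return interval, down_count * interval

print([status_color(n, 5) for n in range(6)])
# ['green', 'yellow', 'yellow', 'yellow', 'yellow', 'red']

for label, interval, down_count in [
    ("5 min polling, down count 5", 300, 5),
    ("1 min polling, down count 3", 60, 3),
    ("20 s polling, down count 3", 20, 3),
]:
    to_yellow, to_red = worst_case_seconds(interval, down_count)
    print(f"{label}: yellow after {to_yellow} s, red after {to_red} s")
```

This gives 300 s / 1500 s (5 and 25 minutes), 60 s / 180 s, and 20 s / 60 s, matching the figures discussed in the thread.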
I would recommend at most 1-minute polling with 3 failed reads (3 minutes to show red). I have not tested enough to recommend a minimum. As I said, right now I am at 20 seconds and that seems to work just fine. My server’s RX and TX hover around 100 kbps, peak around 600 kbps, and drop as low as 3 kbps. That said, I doubt I will ever reduce the Probe Interval below 15 seconds.
There is more you can fine-tune. Inside the oid function (all oid functions work this way), you can specify the cache time and the negative cache time. The negative cache time is 300 seconds by default, so after an outage the device will not come back up for 5 minutes. Since I do not want devices to appear down longer than they need to, I specify a negative cache time of 29 seconds, but doing this means modifying every probe. Note that when you leave the default negative cache time of 300 seconds, no amount of reprobing will make the device appear up; the negative cache time has to expire before the probe re-activates. You can find other posts I have made about negative cache time, including some in the wiki.
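To see why the 300-second default keeps a recovered device looking down, here is a toy model of a result cache with separate lifetimes for successful and failed results. The class and function names, the zero positive cache time, the 20-second probe interval, and the rule that a cached failure is returned unchanged until it expires are assumptions based on the description above, not The Dude's actual code or the real signature of its oid function.

```python
from typing import Callable, Optional

class ProbeCache:
    """Hypothetical cache: successes live for cache_time seconds,
    failures live for negative_cache_time seconds."""

    def __init__(self, cache_time: float, negative_cache_time: float) -> None:
        self.cache_time = cache_time
        self.negative_cache_time = negative_cache_time
        self._value: Optional[bool] = None      # last result: True = device answered
        self._expires_at: float = float("-inf")

    def query(self, now: float, poll: Callable[[], bool]) -> bool:
        """Return the cached result while it is valid; otherwise poll and cache."""
        if self._value is not None and now < self._expires_at:
            return self._value                  # reprobing cannot bypass this
        ok = poll()
        ttl = self.cache_time if ok else self.negative_cache_time
        self._value, self._expires_at = ok, now + ttl
        return ok

def first_up_probe(recovery_at: float, interval: float, negative_cache_time: float,
                   cache_time: float = 0, horizon: float = 3600) -> Optional[float]:
    """Device is down from t = 0 until recovery_at; probe every `interval` seconds
    and return the first probe time at which it is reported up."""
    cache = ProbeCache(cache_time, negative_cache_time)
    t = 0
    while t <= horizon:
        if cache.query(t, lambda: t >= recovery_at):
            return t
        t += interval
    return None

# Device recovers at t = 30 s, probed every 20 s.
print(first_up_probe(30, 20, 300))   # 300: stuck "down" until the 300 s entry expires
print(first_up_probe(30, 20, 29))    # 40: the short negative cache lets it recover quickly
```

In this model the cached failure is served to every probe until it expires, which is the "no amount of reprobing" effect; shortening the negative cache time is what lets the device show as up soon after it actually recovers.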