 
kempoguy
just joined
Topic Author
Posts: 11
Joined: Wed Jul 27, 2016 6:05 am

Network Notifications

Mon Aug 29, 2016 12:09 am

Hi,

On our Dude install, we have the polling set for a Probe Interval of 1 hour, Probe Timeout set for 5 minutes and Probe Down Count set for 4 retries.

Our network is geographically large and made up of nearly every kind of comms technology including X25, satellite, fibre ethernet, wireless, microwave - you name it, we're probably using it somewhere.
Some of our sites are serviced by satellite only; some have multiple paths.

Every second day or so, we receive alerts from the Dude notifying us that some services on some devices across our network are down. The devices in question are never the same and the services vary with each event.

When we log onto the Dude to check the status, we click Reprobe on each device and they come back up as if nothing had happened.
We know there are no major issues, as our systems are still operating. For instance, I may receive a notification that the ping service has gone down on one of our servers, but our other software is still operating, so the PC is still communicating.

With the configuration we have in the Dude, I believe a service has to be down for over an hour before a notification is sent. If any of our devices are down for more than about 10 minutes, our operations staff are screaming, so the notifications are either false, or spurious network errors that are getting trapped somehow.


So, my questions:
1. Does anyone know what causes this? Is this a "known issue"?
2. Can we prevent it? It's getting a little bit "boy who cried wolf": our techs will stop paying attention to the errors and the system will become useless.


Cheers
 
lebowski
Forum Guru
Posts: 1619
Joined: Wed Aug 27, 2008 5:17 pm

Re: Network Notifications

Tue Sep 06, 2016 4:43 pm

There is a negative cache time that affects probes; you can find some examples in stuff I built by searching the forum.

The setup you have sounds odd. With 1 probe per hour and 4 retries, you have the Dude configured to take 4 hours to show an outage, yet you wait 5 minutes for each response. Even high-latency networks should respond within a few seconds.

The negative cache is a separate problem. The Dude remembers for 5 minutes that a response was not received, and it should not work that way. Manually specifying a negative cache time will take a bit of work, but you should also set your timeout to 20 or 30 seconds, with 5 minute probe intervals and 3 or 4 tries. That would still take 15 or 20 minutes to show a down device.

I run a 30 second probe interval, 3 retries, a 10 second timeout, and a 5 second negative cache time. This way a failed probe has a chance to actually retry. I hope they set negative cache to some low value in a future version. I am running v4b3.
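To make the arithmetic above concrete, here is a rough sketch in Python. It is not anything the Dude exposes: the function names are made up for illustration, and the model is a simplification that assumes a device is flagged down after the configured number of consecutive failed probes spaced one probe interval apart (ignoring the timeout), and that a retry only gets a fresh answer once the negative cache time has passed.

```python
def seconds_until_down(probe_interval_s, down_count):
    """Approximate time before a device is flagged down.

    Assumption: down_count consecutive failed probes, one per interval.
    """
    return probe_interval_s * down_count


def retry_is_fresh(probe_interval_s, negative_cache_s):
    """True if the next probe happens after the cached failure has expired.

    Assumption: a probe inside the negative cache window reuses the old failure.
    """
    return probe_interval_s > negative_cache_s


# (probe interval, down count, negative cache) in seconds, taken from the posts
configs = {
    "original setup": (3600, 4, 300),
    "suggested (3 tries)": (300, 3, 300),
    "suggested (4 tries)": (300, 4, 300),
    "lebowski's setup": (30, 3, 5),
}

for name, (interval, count, neg_cache) in configs.items():
    minutes = seconds_until_down(interval, count) / 60
    fresh = retry_is_fresh(interval, neg_cache)
    print(f"{name}: ~{minutes:g} min to show down, retry sees fresh result: {fresh}")
```

Under these assumptions, a 5 minute probe interval combined with a 5 minute negative cache leaves no margin for a real retry, which is why the negative cache time needs lowering as well as the interval.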
