Hi,
On our Dude install, polling is set to a Probe Interval of 1 hour, a Probe Timeout of 5 minutes and a Probe Down Count of 4 retries.
Our network is geographically large and made up of nearly every kind of comms technology including X.25, satellite, fibre ethernet, wireless, microwave - you name it, we're probably using it somewhere.
Some of our sites are serviced by satellite only, some have multiple paths.
Every second day or so, we receive alerts from the Dude notifying us that some services on some devices across our network are down. The devices in question are never the same and the services vary with each event.
When we log onto the Dude to check the status, we click Reprobe on each device and they come back up as if nothing had happened.
We know there are no major issues, as our systems are still operating. For instance, I may receive a notification that the ping service has gone down on one of our servers, but our other software on that server is still operating, so the machine is still communicating.
With the configuration we have in the Dude, I believe a service has to be down for over an hour before a notification is sent. If any of our devices are down for more than about 10 minutes, our operations staff are screaming, so the notifications are either false, or spurious network errors that are getting trapped somehow.
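To show how I'm reasoning about the timing, here is a rough Python sketch of the arithmetic. It is only my assumed model, not confirmed Dude behaviour: I'm assuming a service is flagged down after Probe Down Count consecutive failed probes, that probes are sent one Probe Interval apart, and that each failed probe takes the full Probe Timeout to be declared failed. The function name and parameters are mine, not anything from the Dude.

```python
from datetime import timedelta


def min_outage_before_alert(interval, timeout, down_count):
    """Shortest outage that could raise an alert under the ASSUMED model:
    down_count consecutive probes, sent one interval apart, must all fail,
    and the last one needs the full timeout before it counts as failed.
    This is a guess at the scheduling, not documented Dude behaviour."""
    if down_count < 1:
        raise ValueError("down_count must be at least 1")
    # Best case for an alert: the outage starts just before the first probe.
    # The remaining (down_count - 1) probes follow at interval spacing,
    # then the final probe has to time out.
    return (down_count - 1) * interval + timeout


# Our settings: 1 hour interval, 5 minute timeout, down count of 4.
ours = min_outage_before_alert(
    interval=timedelta(hours=1),
    timeout=timedelta(minutes=5),
    down_count=4,
)
print(ours)
```

If that model is anywhere near right, a genuine outage would have to last hours, not minutes, before we were notified, which is why I don't think these alerts match real outages.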
So, my questions:
1. Does anyone know what causes this? Is this a "known issue"?
2. Can we prevent it? It's getting a little bit "boy who cried wolf" - our techs will stop paying attention to the errors and the system will become useless.
Cheers