Hello,
I am having a problem with our Dude system, which is quite self-explanatory in the screenshots below! From time to time some servers start showing services like CPU Usage and Memory Size as being down. I can see that The Dude is unable to probe those services, so it obviously reports them as “down”. But I cannot figure out why this suddenly happens, and it normally affects a number of servers at the same time, as you can see below. SNMP is configured the same way on every server.


I searched for this problem but no one else seems to have had problems like this as I couldn’t find anything! I’m kind of a newbie to The Dude, so your help is much appreciated!
Please let me know if you need any other information.
This is an old problem with a lot of variations, so you might not have found others reporting it, but it is definitely there.
I believe it appeared in v3.5 or so. You can find people complaining that link tags show rx / tx instead of a value. There are other issues as well, such as yours (“many false positives”) and “gaps in graphs”. The only partial workaround I have found is to set the Negative cache time lower so that false positives don’t last 5 minutes. They have changed things over time, but 4.3 gives the most false positives. One version had a serious bug that they fixed quickly. No one has been able to describe the exact reason for this.
You can find people saying “no matter how many times I re-probe, the service will not come up even though I know for a fact that it is up”. That is because of the 5 minute negative cache time. The negative cache time is a good idea, but it should be lower than the re-probe time, or they should throw away a failed read instead of caching it.
You will find a lot of weirdness with ping times. My servers are all under 1 ms, but often the reported ping is anywhere from ~15 to ~80 ms. There are interesting artifacts with this: if you re-probe, the obviously incorrect response time is reset to a reasonable value until sometime later.
There are also people with an RB who can’t back up their Dude via export.
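To make the negative cache point concrete, here is a rough Python sketch. It is not how The Dude is actually implemented; `NegativeCache`, `simulate`, and all the numbers are made up for illustration. It only assumes that a failed read is cached for `negative_ttl` seconds and that probes arriving inside that window get the cached failure back without a real read.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class NegativeCache:
    """Hypothetical model of a probe with a negative cache (not Dude internals)."""
    negative_ttl: float                  # seconds a failed result is reused
    failed_at: Optional[float] = None    # time of the cached failure, if any

    def probe(self, now: float, read: Callable[[], bool]) -> bool:
        # While a failure is still cached, skip the real read and report "down".
        if self.failed_at is not None and now - self.failed_at < self.negative_ttl:
            return False
        ok = read()
        self.failed_at = None if ok else now
        return ok


def simulate(negative_ttl: float, probe_interval: float, duration: float) -> float:
    """Return how many seconds a service is reported down after one bad read."""
    cache = NegativeCache(negative_ttl)
    reads = iter([False])                # first read fails, every later read succeeds
    read = lambda: next(reads, True)
    down = 0.0
    t = 0.0
    while t < duration:
        if not cache.probe(t, read):
            down += probe_interval       # count this whole interval as "down"
        t += probe_interval
    return down
```

With a 30 second probe interval, `simulate(300, 30, 600)` reports 300 seconds of downtime from a single failed read, while `simulate(10, 30, 600)` reports only 30. That is the whole argument for keeping the negative cache time below the probe interval.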
This is pure speculation but I imagine there is a function that handles the large loop of executing services and inside it is a function that can send and receive data to some IO device like disk or network. That IO device might be executed serially so each IO is waiting on the previous IO to complete. But that is only speculation. I do wish someone would take a very serious look at how things are being executed and what changed back between v3.0 and v4.
Here is a DNS probe that was running about 80ms and clicking re-probe fixed it… for now.

So re-probe somehow bumps the current service to the front of the list.
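Here is a toy Python model of that speculation, nothing more. It assumes (and this is only my guess, not something confirmed about The Dude) that all I/O goes through one serial queue and that the clock for each request starts when it is queued. `measured_times` and `reprobe` are hypothetical names, and the costs are invented.

```python
from typing import List


def measured_times(costs_ms: List[float]) -> List[float]:
    """Measured time per request when one serial queue handles all I/O.

    Assumption: the clock starts when a request is queued, so each
    measurement includes the time spent waiting on earlier requests.
    """
    measured = []
    elapsed = 0.0
    for cost in costs_ms:
        elapsed += cost              # this request finishes after all earlier ones
        measured.append(elapsed)
    return measured


def reprobe(costs_ms: List[float], index: int) -> List[float]:
    """Return a new queue with the request at `index` moved to the front."""
    return [costs_ms[index]] + costs_ms[:index] + costs_ms[index + 1:]
```

For a queue of `[20.0, 30.0, 25.0, 1.0]`, the 1 ms ping at the back is measured at 76 ms. After `reprobe(queue, 3)` it sits at the front and is measured at 1 ms, which matches what clicking re-probe looks like from the outside.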
If you attempt an export of your database, you will find that many probes time out while the export is happening, and you will have many more false positives from then on until you restart the service. So no matter what the speculation is, there is a bug that needs to be fixed.
At the end of the day, for me The Dude is not able to tell me what is going on with enough accuracy to use it for alerts, so I only use it for the map and bandwidth graphs. I provide some support in the forums to help people get it running and enjoy a really neat piece of software. If they ever fix the issue you describe, it will be able to do alerts again and would be the best network monitoring program out there.
TL;DR: There is a bug that needs to be fixed.
Lebowski
Some ways to get fewer false positives: reduce the amount of logging. Stop using The Dude as a syslog server, and only send notifications rather than logging outages to syslog. Run the service at real-time priority if you’re using Windows.