This is a bug. There are many things that could be discussed here, but it would be so much simpler if they would just open the code and fix some bugs... My solution was to disable link labels on the "dashboard" map so I don't see them as often.
(continued reading)
It seems to me that any probe whose response is slightly slow is more prone to false positives. Most likely the dude does not wait correctly for packets to return.
I could never solve link labels going haywire, but I reduced false positives dramatically by doing two things: configuring every probe with a 5-second negative cache value and setting the polling interval to 1 minute. I know you are not calling these false positives, but here is why I am calling them that.
With Wireshark I have watched a specific request for an SNMP counter get a response, yet the packet somehow did not get processed: I saw the request, I saw the response, and the probe still showed one failure. With the default 300-second negative cache time, the dude will not correctly retry any re-probes. This is a very specific bug. The negative cache time should not be set until the configured retries are exhausted, but it is set on the first failed probe, and since a response is sometimes lost (it doesn't seem to get processed), a device that is certainly on the network will show as down. No matter how many times you reprobe it while the negative cache is in effect, it will not come back up, even though you are logged into the device and know it is not down. Reducing the negative cache time below the reprobe interval is how to solve this. You have to do this for every single function, since negative cache time is not a variable. I also can't think of any reason for a negative cache time: if a device is down, you want to know the exact moment it came back up, not 300 seconds after a failed probe.
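A minimal sketch of the behavior I'm describing (this is my own toy model, not the dude's actual code; all names are hypothetical):

```python
class Probe:
    """Toy model of how a negative cache can suppress retries.

    Hypothetical sketch, not the dude's real implementation: the point
    is that the cache is set on the FIRST failure, before retries run.
    """

    def __init__(self, negative_cache_secs):
        self.negative_cache_secs = negative_cache_secs
        self.cached_down_until = 0.0

    def poll(self, device_responds, now):
        # While the negative cache is in effect, the cached "down" result
        # is returned and no packet is ever sent -- retries are wasted.
        if now < self.cached_down_until:
            return "down (cached)"
        if device_responds:
            return "up"
        # One lost response marks the device down for the full cache window.
        self.cached_down_until = now + self.negative_cache_secs
        return "down"

# One lost response with the default 300 s cache and 60 s polling:
probe = Probe(negative_cache_secs=300)
print(probe.poll(device_responds=False, now=0))   # down (response lost)
print(probe.poll(device_responds=True, now=60))   # down (cached) -- device is fine
# With a 5 s cache (shorter than the 60 s poll interval), the next poll recovers:
probe = Probe(negative_cache_secs=5)
print(probe.poll(device_responds=False, now=0))   # down
print(probe.poll(device_responds=True, now=60))   # up
```

This is why dropping the negative cache below the reprobe interval works: the cache expires before the next poll, so one lost packet costs one cycle instead of five minutes.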
So now let's apply this to links, which are probably based on the label refresh interval. Do they also have a global negative cache time? (I don't know if it can be modified.) A failed poll of an interface causes the label to show the raw RX[interface.in.bitrate] text. The next successful poll shows a 0 on the label, since the previous value is no longer on record and the dude can't do any math to determine how much the link has been utilized. Interface utilization is calculated from octet counters, so you have to have a value to start with and then diff against it based on the time since the last successful read. I would bet that the probe of the interface was successful and the packet was never processed, which is basically a false positive.
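The counter math works roughly like this (my own sketch of the standard calculation, not the dude's code):

```python
def link_bitrate(prev_octets, curr_octets, elapsed_secs, counter_bits=32):
    """Rate from two SNMP octet-counter samples (e.g. ifInOctets).

    Sketch of the standard calculation, not the dude's code. If the
    previous sample was lost, there is nothing to diff against and the
    rate is unknowable -- which is why the label falls back to 0.
    """
    if prev_octets is None or elapsed_secs <= 0:
        return None  # no baseline: can't compute a rate
    diff = curr_octets - prev_octets
    if diff < 0:  # counter wrapped between samples
        diff += 2 ** counter_bits
    return diff * 8 / elapsed_secs  # octets -> bits per second

# Two good reads 60 s apart:
print(link_bitrate(750_000, 7_500_000, 60))  # 900000.0 bps
# Previous read failed or was dropped by the poller:
print(link_bitrate(None, 7_500_000, 60))     # None -- nothing to show
```

One dropped poll therefore poisons the *next* reading too: even a perfect read right after the failure has no baseline to diff from.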
Here are the effects of your attempts to resolve it:
1) increased polling intervals to 5 min - all outages now take 5 min to detect and false positives still last 5 min; doesn't affect labels.
2) increased polling Probe Timeout to 1 min - the packet probably made it back to the dude so the probe didn't actually need a longer time.
3) increased Label Refresh to 30 sec - Decreased the number of failed reads by reducing total reads. (labels might be correct more often)
4) increased Snmp (v1) Try Timeout to 10 sec (max); - snmp v1 or v2 should take well under 5 seconds to respond.
5) un-selected Resolve MAC Address Manufacturer; - shouldn't matter
6) increased Mac Mapping Refresh Interval to 1 hr; - mine is set to 24 hours; this just maps MAC addresses to switchports
7) increased the Database Commit Interval to 30 sec; - shouldn't matter
8) increased ROS Connection Timeout to 15 sec; - shouldn't matter
9) increased the ROS connection Interval to 1 min; - shouldn't matter
10) and finally, forced CPU core affinity to a single core for the dude - this was an early attempt to understand the issue. I am still forcing the dude to a single core and running it at real-time priority; this seems to cause slightly fewer false positives.
To see the dude fail to read some devices, click on Services, then sort by Problem and watch the list; occasionally devices will show "not available". When that happens and the negative cache has not been modified, those "not available" probes will go to down, since none of the retries ever get sent.
The HTTP request probe also shows signs of this troubled processing: many times one HTTP probe will show down, and clicking Retry will quite often show "connection closed". Watching this in a protocol analyzer, even though the server responded in 0.009 seconds, the dude reports "connection closed" before the three-way handshake has even finished.
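You can cross-check what the analyzer shows with an independent timed probe; here is a rough sketch (the host below is a placeholder, point it at the server in question):

```python
import socket
import time

def http_probe(host, port=80, timeout=5.0):
    """Independent, timed cross-check of an HTTP probe.

    Sketch only: if the three-way handshake and the first response bytes
    both arrive in milliseconds here, a "connection closed" from the
    monitor at the same moment points at its own packet processing.
    """
    start = time.monotonic()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        connected = time.monotonic() - start  # handshake complete
        sock.sendall(b"HEAD / HTTP/1.0\r\nHost: " + host.encode() + b"\r\n\r\n")
        first_bytes = sock.recv(1024)         # first response bytes
        responded = time.monotonic() - start
    return connected, responded, first_bytes.split(b"\r\n", 1)[0]

# Example usage (placeholder host):
# connect_s, respond_s, status = http_probe("example.com")
# print(f"handshake {connect_s:.3f}s, response {respond_s:.3f}s, {status!r}")
```

If this reports a sub-second handshake and response while the dude simultaneously logs "connection closed", the failure is in the monitor, not the server.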
Since they are not going to make it open source and they don't have anyone developing it, maybe they could "hire" someone, have them sign an NDA, and let them work on the code for free... (me). That way the code would get some much-needed attention.
the wall of text,
Lebowski