More issues with dependencies and delayed email notification

Hello all,

I posted about this issue a while ago and received some helpful advice regarding the notification delay option. After playing around with it for a while, though, the delay feature does not appear to do what I need it to.

Here’s an example of a typical network:

Cable Modem → Router → Switch → AP1 - AP24

My dependencies are set so that all the APs are children of the switch, the switch is a child of the router, and the router is a child of the cable modem. With default settings, if the cable modem goes down, I receive down notifications for nearly all devices (and corresponding up notifications). I assume this is because The Dude is polling child services at different times than parent services, thus finding an AP down before it realizes the cable modem or router is down. Other SNMP and network monitoring suites have a similar problem, and the solution that one in particular (Intermapper) implements, with much success, is a delay option.

In Intermapper, I give all child devices an email delay of 3 minutes. The probing interval is still 30 seconds or so. Commonly, Intermapper will find a child device down, update the device’s status in the Intermapper client software to down, and then wait 3 minutes. If Intermapper finds that no parent devices are down within those 3 minutes, it then sends the down message. If Intermapper finds that a parent device is down inside the 3-minute window, but that parent device then comes back up within the same window, then NO emails are sent for the child device.
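
To make that concrete, here is a rough Python sketch of the delay logic as I understand it from the outside. It is not Intermapper’s actual code: the function names, the parent_of map, and the outages table are all made up for illustration, and timestamps are plain seconds.

```python
DELAY = 180  # the 3-minute email delay, in seconds (my setting)

def ancestors(device, parent_of):
    """Yield every device above `device` in the dependency chain."""
    while device in parent_of:
        device = parent_of[device]
        yield device

def should_email_down(child, detected_at, parent_of, outages, delay=DELAY):
    """Decide, at the end of the delay window, whether to email for `child`.

    outages maps a device name to a list of (down_at, up_at) pairs,
    where up_at is None if the device is still down.
    """
    window_end = detected_at + delay
    for parent in ancestors(child, parent_of):
        for down_at, up_at in outages.get(parent, []):
            # Any parent outage overlapping the window suppresses the email,
            # even if the parent has already come back up.
            if down_at <= window_end and (up_at is None or up_at >= detected_at):
                return False
    return True

# Hypothetical topology matching the example network above.
parent_of = {"AP1": "Switch", "Switch": "Router", "Router": "Cable Modem"}
```

With this, an AP found down at t=10 while the cable modem was down from t=40 to t=100 produces no email, whereas the same AP with no parent outage in the window does.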

In contrast, the behavior I’m experiencing with The Dude is as follows:

I have the delay on child devices set to 3 minutes. If the parent device goes down and The Dude finds a child device down first, it starts the 3-minute timer. If the parent device is then detected as down and comes back up again, the last notification email for the child device is the “up” message. Even if both the child device and the parent device are up at the end of the 3-minute timer, the “up” notification is still sent. Therefore, when an entire network goes down (with 40 access points), I get flooded with hundreds of service up notifications for devices that were never even considered down (and for which down messages were never sent).
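
What I would expect instead is something like the following: an up email only goes out for a device that actually had a down email sent. This is only a sketch of the behavior I’m after, not how The Dude works internally; the Notifier class and its method names are hypothetical.

```python
class Notifier:
    """Tracks which devices had a down email sent, so that up emails
    are only sent for those devices (hypothetical helper)."""

    def __init__(self):
        self.down_emailed = set()  # devices with an outstanding down email
        self.sent = []             # (device, "down" | "up") in send order

    def down_email(self, device):
        """Record that a down email really went out for this device."""
        self.down_emailed.add(device)
        self.sent.append((device, "down"))

    def device_up(self, device):
        """Send an up email only if a matching down email was sent."""
        if device in self.down_emailed:
            self.down_emailed.discard(device)
            self.sent.append((device, "up"))
            return True
        return False  # down was suppressed, so stay quiet on recovery
```

In the cable modem scenario, only the modem would ever call down_email, so the recovering APs would generate no mail at all.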

Is this intended functionality? It really makes the product unusable for us if the basic dependency features don’t work well. It essentially makes email notification useless, since we’re sifting through hundreds of pointless emails when all we care about is that the cable modem went down. We’re only monitoring 6 of our 40 properties and I can’t imagine implementing the rest of them without finding a solution to this problem.

Unfortunately, this leaves us in a bind, since we love The Dude and find that it really is unmatched in many areas. It is especially difficult for us because we’ve already settled on deploying MikroTiks to all of our properties and love that we automatically get an onsite agent for monitoring. In fact, the only product that comes close to The Dude’s mapping features is Intermapper, which is a solid offering but would require port forwards (or an onsite hardware agent) for everything behind the router that we want to monitor.

Does anybody have any thoughts on this issue? Has anyone had this experience and come up with a solution? I’d love it if The Dude’s developers could chime in as well.

Many thanks!

Unfortunately, I thought you would run into some issue. As we discussed, The Dude would need to “clean up” the list of notifications, but it must not… One suggestion for you is to turn off notifications on every device that is not critical and only notify from the switch and the cable modem. Do you ever need to fix an AP after hours?

I get a lot of false positives and don’t rely on notifications from The Dude, although I still use them. I don’t understand why I get so many, and I have tried many things to resolve it. I have verified with Wireshark that the packet actually gets back to the server, so either there is some bug in the code or there is a flaw in my server. Do you see at least a few false positives an hour? Many times on my setup, a single device will generate a hundred or so false positives overnight while no others generate any. If I reprobe the device it comes back immediately, since I modified the Negative Cache time. That said, I installed The Dude and never looked back, meaning I never started from scratch a second time to see if I just broke something… I hardly understood the product, but I have run it for a few years without issue. Also, we run NMIS on a Linux server that monitors all the same devices, and I can guarantee that the devices generating false positives are not down.

Anyhow, I hope they find the trouble, or that it just goes away when I move to a new server. Just FYI… there was a definite change somewhere between 3.x and 4.x where labels on links would MORE OFTEN show Rx:[Interface.InBitRate] and Tx:[Interface.OutBitRate] instead of the actual bit rate. There were some posts complaining about it, but I never saw that it was actually looked at. I think that this change is why false positives occur more often.

I have asked support to look at a couple of outstanding issues that seem to be the worst offenders. I don’t know if you will get any traction on this issue, but you can email support.

It sounds like InterMapper was nice. Can you tell me why you quit using it?

Thanks,
Lebowski