Negative rather than positive monitoring

I’m trying to use The Dude to do something kind of…odd. I expect it can do it, but I can’t quite work out how.

I’m monitoring a bunch of DC-powered Windows PCs that are running surveillance and telemetry on school buses. These units are wifi-enabled. The problem with these units is that they are ignored by the people in charge until something happens. That’s great, and that’s what they’re for. However, because of this, if there’s a hardware failure nobody knows about it for months - and since they only check when there’s been an incident, it’s a very big deal when they find a dead machine.

The challenge with monitoring is that these units are offline most of the time. Either the bus is off or out of wifi range for pretty much the entire day. The way I see it, what we’re looking for will boil down to “Monday through Friday, if you don’t see this unit at SOME point between 5am and 7am I want you to send a notification”.

I could just have these things send an email when they get online, but that’ll create so much noise that the system administrator will miss the actual failures. All I need is for each system to successfully respond once a day during that time period for it to be considered good. If a system goes from 5am to 7am without a single successful response, then consider it bad and send a notification.
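To pin down the rule described above, here is a minimal Python sketch of the logic, independent of any monitoring tool. It is not something The Dude provides; the function names, the unit names, and the idea of having a list of successful-response timestamps per unit are all assumptions made for illustration.

```python
from datetime import datetime, date, time

# Assumed check window: 5am to 7am local time, Monday through Friday.
WINDOW_START = time(5, 0)
WINDOW_END = time(7, 0)


def seen_in_window(responses, day):
    """Return True if any successful response falls inside the window on `day`."""
    start = datetime.combine(day, WINDOW_START)
    end = datetime.combine(day, WINDOW_END)
    return any(start <= ts <= end for ts in responses)


def units_to_report(responses_by_unit, day):
    """Return the units with no successful response in the window on `day`.

    `responses_by_unit` maps a (hypothetical) unit name to a list of
    datetimes at which that unit answered a probe. Weekends are skipped.
    """
    if day.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
        return []
    return sorted(
        unit
        for unit, responses in responses_by_unit.items()
        if not seen_in_window(responses, day)
    )
```

A single good response anywhere in the window clears a unit for the day, so one notification per dead unit per weekday is the most this would generate.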

I thought I had it figured out in The Dude, using a combination of timeouts and probe intervals…but nope.

Anyone have any suggestions?

I’m not sure about Dude (not familiar enough with it yet) but you should be able to do this with PRTG network monitor. And if you have fewer than 20 machines to monitor, it won’t cost anything either. (You get 10 sensors with the freeware version, but can easily get another 10 sensors for free if you link to their website somewhere.)

You can set up a custom monitoring schedule that runs between 5am and 7am on weekdays (e.g. Ping for basic availability, or specific Windows services, since PRTG does WMI, if you want to make sure the software is working correctly as well as the hardware). Then set a notification condition that only sends a notification when the sensor’s ‘down’ states pass a certain threshold, for example only if there are as many ‘down’ states as there are sensor readings during that time interval.

There are multiple ways of setting notification conditions with PRTG - see here.

Do these buses all end up at the same place at the end of the day, where they could be polled while they are in the parking lot or wherever? If so, why not set up a pinging schedule that only runs during those hours and then an email alert on failed polls? This would give you a daily inventory.

I thought about this a little more. Couldn’t you set your ping to retry 24 times? If your retry interval was 3600000ms (which is 1 hour), then your ping would ping once an hour for 24 hours, and the unit would not be considered dead unless it failed them all. Am I thinking about this right?

I think sending email is not a bad idea for a number of reasons:

  1. if a unit is capable of sending email, it means that the OS is working
  2. if an email is sent every time the system is on, you’ll get a history of its usage
  3. enable SMTP auth and assign each unit a unique user login. Then have a script parse the SMTP logs (server side), and if a user (the unit) did not log in during the past week, generate a report.

(It just occurred to me that you could program each unit to log in to a POP3 box and have someone monitor each user’s last login, with a similar result.)
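The log-parsing idea in item 3 could look roughly like the Python sketch below. The log line format (`<ISO timestamp> auth ok user=<name>`) is invented for the example, as are the function and unit names; a real mail server writes its own format, so the regular expression would need to be adapted.

```python
import re
from datetime import datetime, timedelta

# Hypothetical log line: "2024-01-08T05:31:02 auth ok user=bus-01"
LOGIN_RE = re.compile(r"^(\S+) auth ok user=(\S+)$")


def last_logins(log_lines):
    """Map each user to the most recent successful login found in the log."""
    latest = {}
    for line in log_lines:
        match = LOGIN_RE.match(line.strip())
        if not match:
            continue  # ignore lines that are not successful logins
        ts = datetime.fromisoformat(match.group(1))
        user = match.group(2)
        if user not in latest or ts > latest[user]:
            latest[user] = ts
    return latest


def stale_units(known_units, log_lines, now, max_age=timedelta(days=7)):
    """Return units that never logged in, or not within `max_age` of `now`."""
    latest = last_logins(log_lines)
    return sorted(
        unit
        for unit in known_units
        if unit not in latest or now - latest[unit] > max_age
    )
```

Passing the full list of known units matters here: a unit that has never logged in at all does not appear in the log, so it can only be reported if the script already knows it should exist.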

In a similar fashion, instead of email you could also use any of a number of monitoring tools that have client-side agents, like Zabbix or HostMonitor.

RM