Netwatch triggers

I use Netwatch with an UP and DOWN script that emails me notifications when a new UP or DOWN condition occurs.

The non-default netwatch settings include:

packet-count=270
packet-interval=1s
thr-avg=400ms
thr-loss-percent=95%

This is the Netwatch entry:

/tool netwatch
add comment=Netwatch-192.168.0.11 disabled=no down-script=netwatch-5-2025 host=192.168.0.11 http-codes="" interval=5m name=Netwatch-192.168.0.11 packet-count=270 packet-interval=1s test-script="" thr-avg=400ms thr-loss-percent=95% timeout=4s type=icmp up-script=netwatch-5-2025

I am occasionally getting emails such as these:

2025-11-15 17:58:18 Netwatch-192.168.0.11 629hAPac3 down to 192.168.0.11 with rtt-avg of 76 and loss-percent of 1%. The thresholds are thr-avg of 400ms and loss-percent of 95%.

rtt-avg and loss-percent were clearly not the triggers of the DOWN condition.

So I added a logging topic for netwatch:

/system logging
add topics=netwatch

I waited until the next DOWN email arrived, checked the log, and this is what I found:

[FAIL] rtt-max: 3757.254 ms [ > 1000.000 ms ]
[FAIL] rtt-jitter: 3736.313 ms [ > 1000.000 ms ]
[ OK ] rtt-avg: 78.987 ms [ <= 400.000 ms ]
[FAIL] rtt-stdev: 303.947 ms [ > 250.000 ms ]
[ OK ] loss count: 3 [ <= 4294967295 ]
[ OK ] loss: 1.1% [ <= 95.0% ]

It seems that rtt-max, rtt-jitter, and rtt-stdev occasionally exceed their default thresholds.

What would you do?

Explicitly set those parameters to higher levels, or leave as is and accept what, for my purposes, are false-positives (at least insofar as actionable (i.e., worryable) events)? Or something else?

Explicitly set those parameters to higher levels.
Coincidentally the failed tests are the ones that you haven't explicitly set, which I see as a sign that the implied, default thresholds are meant for a better connection than you have.

Or - if you prefer - these thresholds need to be tailored - just like the others - to the characteristics of your connection and your preferences in the sensitivity of the Netwatch.

Yes indeed – the exceeded parameters were all left at default values.

I don’t know that I’d go so far as to conclude that those default values are meant for a better connection than I have. It might be a stretch to infer that some extensive study of both availability and expectations led to the choice of those default values.

The underlying main concept, I believe, is one that we have discussed in the context of Netwatch parameters multiple times, namely, what do we want to achieve with the use of Netwatch. I think for some of us, we want a much more sensitive Netwatch, so that it detects disruptions or degradations far more slight than for other people or environments.

I use mine for much less sensitive disruptions, and would be happy (in some circumstances) for a set of Netwatch parameters that would detect a total outage or substantial degradation that persists across several minutes (before getting an alert). This conclusion is a product of the non-urgent nature of the Internet access at my locations and the desire to have minimal non-actionable notices.

I might just raise those failed thresholds.
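
For reference, raising the three defaulted thresholds on the existing entry could look roughly like the sketch below. The values are placeholders chosen only to sit above the figures in the log excerpt (about 3.76 s max, 3.74 s jitter, 304 ms stdev); they are not recommendations. Selecting by name assumes that the name is unique on this router.

```routeros
/tool netwatch
# Illustrative values only; tune them to your own link and tolerance
set [find where name="Netwatch-192.168.0.11"] thr-max=5s thr-jitter=5s thr-stdev=500ms
```

After the change, the netwatch log topic should show the new limits in the bracketed part of each rtt-max, rtt-jitter, and rtt-stdev line.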

We don't actually know (at least I don't) exactly what those failed tests imply.

Common sense tells me that:
[FAIL] rtt-stdev: 303.947 ms [ > 250.000 ms ]
can be raised to, say, 350 and you won't lose much. Stdev is one of those things that engineers and statisticians love so much but that is, IMHO, not a valid parameter for triggering this kind of notification.

The other two:
[FAIL] rtt-max: 3757.254 ms [ > 1000.000 ms ]

[FAIL] rtt-jitter: 3736.313 ms [ > 1000.000 ms ]

the way I read them, they should mean:

  1. that you had a single ping that took 3757.254 ms (the 254 thousandths of a millisecond are IMHO ridiculous, I cannot find a better term to describe them)
  2. that your fastest ping was 3757.254-3736.313=20.941 ms (taking jitter as the difference between the slowest and the fastest reply)

If this is correct, the default settings are stupid, as jitter - by definition - will always be smaller than rtt-max, so using the same value of 1000 for both already makes little sense.

And - at the end of the day - they both measure in different ways "exceptionally slow" replies to packets.

As I see it, they are either meaningless metrics or we don't know how to manage them correctly.

Personally, I would rather such "exceptionally slow" replies not have a standing of their own, with only one of them contributing to thr-loss-percent. That is, get rid of one of them completely, let's say keep only thr-max, and when it is exceeded, increase the number of failed packets instead of triggering a fail on its own.

This way an "exceptionally slow" reply will become, to all intents and purposes, a failed one, and thus be "filtered" by the percentage value of thr-loss-percent.

Trying to update my UP/DOWN script and stumbling.

Any idea why this code shows no values for rtt-max, rtt-jitter, and rtt-stdev:

:local netwatchID   ([find where host=$lhost and type=icmp]->0)
:local nthravg      [:tonum ([get $netwatchID thr-avg         ] * 1000)]
:local nthrlosspcnt         ([get $netwatchID thr-loss-percent] /   10)

:local nthrmax     [:tonum ([get $netwatchID thr-max         ] * 1000)] 
:local nthrjitter  [:tonum ([get $netwatchID thr-jitter         ] * 1000)]
:local nthrstdev   [:tonum ([get $netwatchID thr-stdev         ] * 1000)]

:log info $netwatchID
:log info $nthravg
:log info $nthrmax
:log info $nthrjitter
:log info $nthrstdev

All your searched values use the thr- prefix?
Then how do you expect to get rtt- values?

Sherwood:

I should have included the entire script:


/tool netwatch
:local thisBox      [/system identity get name]
:local lhost        [:toip $host]
:local lstatus             $status
:local llosspercent      ($"loss-percent" /   10)
:local lcomment $comment

:local lrttavg           ($"rtt-avg"      / 1000)
:if (($"rtt-avg" % 1000) > 0) do={ :set lrttavg ($lrttavg + 1) }

:local lrttmax           ($"rtt-max"      / 1000)
:if (($"rtt-max" % 1000) > 0) do={ :set lrttmax ($lrttmax + 1) }

:local  lrttjitter        ($"rtt-jitter"  / 1000)
:if (($"rtt-jitter" % 1000) > 0) do={ :set lrttjitter ($lrttjitter + 1) }

:local lrttstdev          ($"rtt-stdev"   / 1000)
:if (($"rtt-stdev" % 1000) > 0) do={ :set lrttstdev ($lrttstdev + 1) }


:local netwatchID   ([find where host=$lhost and type=icmp]->0)
:local nthravg      [:tonum ([get $netwatchID thr-avg         ] * 1000)]
:local nthrlosspcnt         ([get $netwatchID thr-loss-percent] /   10)

:local nthrmax     [:tonum ([get $netwatchID thr-max         ] * 1000)] 
:local nthrjitter  [:tonum ([get $netwatchID thr-jitter         ] * 1000)]
:local nthrstdev   [:tonum ([get $netwatchID thr-stdev         ] * 1000)]

:log info $netwatchID
:log info $nthravg
:log info $nthrmax
:log info $nthrjitter
:log info $nthrstdev

This is so confusing:

The current value is “rtt-max” and the threshold is “thr-max,” correct?

nthravg works but nthrmax does not.

Should be, yes.

Not that I really know what I am saying, but your script seems to take values, multiply/divide/modify them, and then check the result, so maybe the issue is with the manipulation of the values.

Or, if you prefer, can you just :put the values without manipulating them and check whether the variable(s) actually exist and have values within the range you expect?
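
A minimal sketch of that check, meant to be pasted into a terminal rather than run from inside the Netwatch script: it prints the raw, unmodified comment and thresholds of every entry for the host, so you can see whether each value exists and which entry it belongs to. The host address is the one from this thread. An empty line may mean that the value is not explicitly set on that entry, but that is a guess on my part.

```routeros
# Print raw values for every Netwatch entry that targets this host
:foreach id in=[/tool netwatch find where host=192.168.0.11] do={
    :put [/tool netwatch get $id comment]
    :put [/tool netwatch get $id thr-avg]
    :put [/tool netwatch get $id thr-max]
    :put [/tool netwatch get $id thr-jitter]
    :put [/tool netwatch get $id thr-stdev]
}
```

If more than one block of values is printed, more than one entry matches the host, and a selector based on host alone is ambiguous.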

Thank you!

The only variables that are not working are:

nthrmax

nthrjitter

nthrstdev

I tried this and the variable is still empty:

:local test ([get $netwatchID thr-max])
:log info $test

You seem to specialize in posting snippets of code without context.

And instead of checking from the base, you wrap things into other things that you wrap into another layer.

Again, not that I understand much of the script syntax, but maybe the problem happens earlier, in:
:local netwatchID ([find where host=$lhost and type=icmp]->0)

I would start by not pinging every second, as replies might not come back in the expected time over a saturated connection. What is the idea behind pinging every second?
If you have lots of such tests, then the router could sweat doing checks, and checks of checks.
If the infrastructure is so critical, then you should install a professional monitoring tool that you could configure in a few steps.

Good question, but 1 s is actually 20x the default.

So a better question would be why (the heck) did the good Mikrotik guys set a default of a ping every 50 ms?

Your advice to start “from the base” was spot on.

I checked if netwatchID was getting populated and it was, but with the wrong host.

I went back to $lhost and discovered it was not correct.

It turns out that the host identified in the following snippet is the incorrect $host:

/tool netwatch
:local thisBox      [/system identity get name]
:local lhost        [:toip $host]

I have 2 netwatch entries with the same host (192.168.0.11) – just for testing this.

“[:toip $host]” was identifying the wrong one, which did not have non-default settings for thr-max, thr-jitter, and thr-stdev.

So, the lesson (I think) is that this approach does not work reliably with multiple netwatch entries to the same host.

Ugh!

Are you suggesting a longer interval, such as 2 or 3 or 4 seconds?

No, the lesson is that when you want to select one item among many similar ones you need to make the distinction on a SURELY UNIQUE field. Use comment as a selector or both comment and host or use different canary addresses for different netwatch instances.

I don’t know how to change the above into an lhost-populating command based on multiple conditions where $host is the same across multiple netwatch entries.

I was suggesting something loosely like:
:local netwatchID ([find where host=$lhost and type=icmp and comment=this]->0)

:local netwatchID ([find where host=$lhost and type=icmp and comment=that]->0)

No idea if it is possible, or whether a syntax other than the simple "and" is needed for multiple matches.

What was your first thought on reading this? Do you suspect that I am suggesting pinging 10 or 20 times per second?
Are these resources so valuable that you HAVE TO get a pile of emails every few minutes?
If yes, then you need a professional tool for monitoring and logging all ups/downs etc.
Do you have scripts that monitor services coming back from the darkness, as if it could have been only a small choking in connectivity? Do you immediately react to each alert you get?

I get it – to an extent.

The problem is that I use this UP/DOWN script in a generic way, across many Netwatch entries and across many MT devices, so hard coding the comment= won’t work.
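
One way around hard coding, sketched under two assumptions: that Netwatch populates $comment in the script's scope (the script posted earlier already reads it into lcomment), and that every entry carries a unique comment. Adding the comment to the find condition lets the generic script resolve to the entry that fired it. I believe that using the local copy lcomment, rather than $comment directly, also avoids a clash with the comment property name inside the where clause, but treat that as unverified.

```routeros
/tool netwatch
:local lhost    [:toip $host]
:local lcomment $comment
# Match on host AND comment so that two entries for the same host stay distinct
:local netwatchID ([find where host=$lhost and type=icmp and comment=$lcomment]->0)
:log info ("matched entry: " . [get $netwatchID comment] . " thr-max=" . [get $netwatchID thr-max])
```

If comments are not unique across entries, the same ambiguity returns and some other unique field would be needed.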

No, connectivity being down for seconds, even minutes, will not jeopardize safety or money.

The setting of 1 per second was already intended to reflect that, as it is so much higher than the default; and, in conjunction with 270 pings over 5 minutes, it is intended to reflect the non-essential-services status as well.

I did not consider the possibility of over saturation.

Perhaps one ping every 4 seconds, with 65 pings sent every 5 minutes to determine a DOWN condition, is more appropriate for me.
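
As a sketch, that change would be something like the following. The arithmetic is 65 packets x 4 s = 260 s, which still fits inside the 5-minute (300 s) interval, much as 270 x 1 s = 270 s does today. Selecting by name assumes that the name is unique on the router.

```routeros
/tool netwatch
# 65 probes spaced 4 s apart take about 260 s, inside the 5m test interval
set [find where name="Netwatch-192.168.0.11"] packet-interval=4s packet-count=65
```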