Netwatch UP threshold

Adding “my_” to the name of an ROS parameter/value/command is a great suggestion – thank you.

I “lifted” the script that I used as a basis for what I posted earlier from someone else’s work, where he capitalized the first letter to get around the problem of using the same variable name. I definitely like “my_” better – and, thanks to you, now I can have the confidence to implement it.

Ugh! Everything takes so much time, troubleshooting, effort, frustration.

I’ve got a whole big (for me) script, and continue to get an error.

Troubleshooting by commenting out lines, this line is the culprit:

:local my_host $host

With this being the only non-commented line, the script fails:

executing script Netwatch-details from netwatch failed, please check it manually

In the case of variable names, you cannot use an underscore without quotes AFAIK. So :local “my_variable”, not :local my_variable.
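
A minimal sketch of the quoting rule, assuming it is saved and run as a script (for example under /system script). The name my_host and the sample address are only illustrations; inside a real netwatch script the value would come from $host.

```
# the name is quoted because it contains an underscore
:local "my_host" "192.168.0.11"
# the same quoting is needed when the variable is read back
:log info ("host=" . $"my_host")
```

The quoted form has to be used everywhere the variable is referenced, which is why dropping the underscore altogether is often the easier route.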

Thanks, that fixed that line.

But the rest didn’t work.

I tried various combinations of quotes.

I removed the underscore.

Then I discovered that hyphens aren’t liked either.

There are too many unintuitive rules with this scripting!

This works:

:local myhost $host
/tool netwatch
:local mystatus [get [find where host=$myhost] status]
:local mycomment [get [find where host=$myhost] comment]
:local myinterval [get [find where host=$myhost] interval]
:local mysince [get [find where host=$myhost] since]
:local mytimeout [get [find where host=$myhost] timeout]
:local mypacketinterval [get [find where host=$myhost] packet-interval]
:local mypacketcount [get [find where host=$myhost] packet-count]
:local mypacketsize [get [find where host=$myhost] packet-size]
:local mythrmax [get [find where host=$myhost] thr-max]
:local mythravg [get [find where host=$myhost] thr-avg]
:local mythrlosspercent [get [find where host=$myhost] thr-loss-percent]
:local mythrlosscount [get [find where host=$myhost] thr-loss-count]

:log info "host=$myhost status=$mystatus comment=$mycomment interval=$myinterval timeout=$mytimeout packet-interval=$mypacketinterval packet-count=$mypacketcount thr-avg=$mythravg thr-loss-percent=$mythrlosspercent"

Well, that is a different approach; I guess it avoids the permissions needed for netwatch. The only esoteric issue with using script/scheduler outside of the “On Down”/“On Up” netwatch scripts would be if the netwatch polling happened while the script is running. If that happened, the values in your script could be a mix of the last test and the current one, although that would be pretty unlikely.


In fairness, the docs do say: “Valid characters in variable names are letters and digits. If the variable name contains any other character, then the variable name should be put in double quotes.”
And the docs do show that to use a quoted name you write $“my variable”, but I would never have suggested using _ or - … the syntax gets even trickier if you do.

Certainly a lot of rules … but I think whoever suggested using _ is more to blame. :wink:

Well, I can well take the blame, no problem, for omitting the need for double quotes if non-letter and non-number characters are used for variable names.

BTW coming from the (good?) ol’ times where variables could only be called A, B, C, D, etc., the RoS limitations (while completely absurd in these days and times) still don’t seem too bad to me.

Why, in my day …

LUXURY!

… kids today!

https://tinyapps.org/blog/200702250700_why_in_my_day.html

:laughing:

It gets even better:

Here I am thinking I will capture the actual values of these variables and write them to the log, instead of turning on logging for topic netwatch and having the log fill up, when, in reality, the only thing being logged is the netwatch settings (i.e., not the netwatch results).

Or maybe it is working as envisioned – I don’t even know any more.

So, while this has been informative and I am one step (out of 1,000,000,000,000) closer to competency (thanks to you guys!), I still need a way to capture the netwatch icmp details for when a host goes down.

And regarding being an old-timer, I am indeed one also, only without the years of design/programming/etc. And, to invoke a crossover to the docs:

I was around (and active on Bitnet, a predecessor of the current Internet; and thought Usenet groups were the next stage in human evolution) when RTFM was invented!

Still playing with netwatch and trying these settings:

/tool netwatch
add comment=Netwatch-192.168.0.11 disabled=no down-script=Netwatch-details host=192.168.0.11 http-codes="" interval=2m name=Netwatch-192.168.0.11 packet-count=300 packet-interval=200ms \
    test-script="" thr-avg=400ms thr-loss-percent=95% type=icmp up-script=Netwatch-details

This means (if I’m understanding correctly) that every 2 minutes a netwatch process will start and include 1 minute of pings consisting of 300 packets sent 200ms apart. And, a fail would be an average rtt of 400ms (or greater) and a loss percentage of 95% (or greater) for the group of 300 packets. (I’m still not clear on the “and” part, but I think it’s a safe assumption – as opposed to “or”.)
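
To double-check the “1 minute of pings” arithmetic, here is a minimal sketch using plain integers in milliseconds. The variable names are made up for the example, and it assumes the lines are run together as a script.

```
:local packetCount 300
:local packetIntervalMs 200
# 300 packets x 200 ms = 60000 ms; divide by 1000 to get seconds
:put (($packetCount * $packetIntervalMs) / 1000)
# prints 60
```

So each burst occupies roughly 60 seconds of the 2 minute interval, which matches the reading above.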

On a status change (up or down), this is the script that is run:

:local myhost $host
/tool netwatch
:local mystatus [get [find where host=$myhost] status]
:local mycomment [get [find where host=$myhost] comment]
:local myinterval [get [find where host=$myhost] interval]
:local mysince [get [find where host=$myhost] since]
:local mytimeout [get [find where host=$myhost] timeout]
:local mypacketinterval [get [find where host=$myhost] packet-interval]
:local mypacketcount [get [find where host=$myhost] packet-count]
:local mypacketsize [get [find where host=$myhost] packet-size]
:local mythrmax [get [find where host=$myhost] thr-max]
:local mythravg [get [find where host=$myhost] thr-avg]
:local mythrlosspercent [get [find where host=$myhost] thr-loss-percent]
:local mythrlosscount [get [find where host=$myhost] thr-loss-count]
:local myrttavg [get [find where host=$myhost] rtt-avg]

:log info "NETWATCH host=$myhost status=$mystatus comment=$mycomment interval=$myinterval timeout=$mytimeout packet-interval=$mypacketinterval packet-count=$mypacketcount thr-avg=$mythravg thr-loss-percent=$mythrlosspercent rttavg=$myrttavg"

Disabling and enabling this netwatch entry created the following log entry:

NETWATCH host=192.168.0.11 status=up comment=Netwatch-192.168.0.11 interval=00:02:00 timeout= packet-interval=00:00:00.200 packet-count=300 thr-avg=00:00:00.400 thr-loss-percent=950 rttavg=00:00:00.021539

It looks like all the properties that start with “thr” are threshold settings.

rtt-avg is a result or value derived by executing the process.

I looked at the docs (yes, really) and I don’t see the results or stats for loss-percentage.

The way I read the docs, it is an “or”, i.e. there are 6 different thresholds:
thr-max (Default: 1s) Fail threshold for round trip time-max (a value above thr-max is a probe fail)
thr-avg (Default: 100ms) Fail threshold for round trip time-avg
thr-stdev (Default: 250ms) Fail threshold for round trip time-stdev
thr-jitter (Default: 1s) Fail threshold for round trip time-jitter
thr-loss-percent (Default: 85.0%) Fail threshold for loss-percent
thr-loss-count (Default: 4294967295(max)) Fail threshold for loss-count

whichever fails first triggers the netwatch.

The tricky ones are (IMHO) the “statistic” ones, avg and stdev, particularly the latter.
I believe that the real world behaviour for these might be influenced by the sheer number of pings performed, i.e. by packet-count: a higher number of pings per run should “flatten” the statistics, making these thresholds “less sensitive”, whilst a lower number of pings may make them over-sensitive (provided that none of the other threshold settings trigger the netwatch earlier). :confused:

Fantastic analysis!

Makes perfect sense.

I wish there was a way of logging more stats, such as loss-percentage.

Still it is not at all clear (to me :blush: ) the difference between the previously listed ICMP probe options and the ICMP properties:
sent-count ICMP packets sent out
response-count Matching/valid ICMP packet responses received
thr-loss-count number of lost packets
thr-loss-percent number of lost packets in percent
thr-avg mean value of round trip time
thr-max max round trip time
thr-jitter jitter ( = max - min) of round trip time
thr-stdev standard deviation of round trip time
the last six of which share the same name (and are listed in the doc in a different order, to better confuse the reader).

In the ICMP probe, ALL values must be within spec at the end of the test (including those NOT defined, which assume the "Default: " above); the “and/or” may be confusing.

The UP / DOWN result happens after all of the packet-count packets have been sent. There is no “fail early”: it sends all the packet-count packets first, THEN it evaluates whether all 6 variables are within spec. Just ONE (any one) above the limit will cause a DOWN.

I re-read docs this AM.

The docs are incomplete (/wrong): the values available in a netwatch script AFTER a test are not thr-* but rather rtt-*. You can see these values in the “Status” section, and if you are logging the RESULTS of a test you want to use the rtt- ones, like rtt-max/min/stdev/etc. More specifically $“rtt-max”, since the variable name has a special character (-). The thr-* values are the spec/requirement that DEFINES what success is, not the result.

For example, if this is used in up-script= or down-script:

:log info "NETWATCH host=$host status=$status comment=$comment interval=$interval rtt-avg=$"rtt-avg" rtt-min=$"rtt-min" rtt-max=$"rtt-max" rtt-stdev=$"rtt-stdev" rtt-jitter=$"rtt-jitter" "

you get a log entry if status changes:

NETWATCH host=8.8.8.8 status=up comment= interval=1000 rtt-avg=12769 rtt-min=12406 rtt-max=12995 rtt-stdev=200 rtt-jitter=589

Filed a doc bug about this (SUP-187116)… since docs should mention the rtt- values.

Nice finding about the rtt- prefix instead of the thr- one, now it starts making sense (and confirms that proofreading is a lost art). But Amm0, with all due respect :slight_smile: , you need to use more linear English if you want to explain something (or maybe you also got the Latvian virus that makes affected people use excessive periphrasis or double negations? :open_mouth: ).
But you are right, by “earlier” I meant “directly”, without considering an average or standard deviation of all the results.

Let’s see if the following is accurate (and simple) enough:

There are 6 thresholds in ICMP probe options, they are:

  1. thr-max (Default: 1s) Fail threshold for round trip time-max (a value above thr-max is a probe fail)
  2. thr-avg (Default: 100ms) Fail threshold for round trip time-avg
  3. thr-stdev (Default: 250ms) Fail threshold for round trip time-stdev
  4. thr-jitter (Default: 1s) Fail threshold for round trip time-jitter
  5. thr-loss-percent (Default: 85.0%) Fail threshold for loss-percent
  6. thr-loss-count (Default: 4294967295(max)) Fail threshold for loss-count

If, in a single run of the netwatch, ANY of the above is exceeded, the netwatch will fail (trigger the “down” status).
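
The “any one threshold” rule can be sketched as a plain script. This is only an illustration of the logic as described in this thread, not how Netwatch is implemented internally: all names are hypothetical local variables, the sample results are made-up numbers in microseconds, and only three of the six thresholds are shown.

```
# sample results (microseconds and percent); not read from a real probe
:local rttMax 12995
:local rttAvg 12769
:local lossPercent 0
# thresholds set to the documented defaults: thr-max=1s, thr-avg=100ms, thr-loss-percent=85%
:local thrMax 1000000
:local thrAvg 100000
:local thrLossPercent 85
:local result "up"
# a single exceeded threshold is enough to fail the whole run
:if (($rttMax > $thrMax) || ($rttAvg > $thrAvg) || ($lossPercent > $thrLossPercent)) do={
    :set result "down"
}
:put $result
```

With these sample numbers every result is below its threshold, so the script prints up. Raising any one result above its limit (for example rttAvg to 150000) flips it to down.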

There are 9 properties (or results) of the ICMP probe (that in the help doc are incorrectly called with prefix thr- instead of rtt- or no prefix and given in a jumbled up order), here (hopefully) corrected and re-ordered:

  1. rtt-max max round trip time
    1 bis. rtt-min min round trip time
  2. rtt-avg mean value of round trip time
  3. rtt-stdev standard deviation of round trip time
  4. rtt-jitter jitter ( = max - min) of round trip time
  5. loss-percent number of lost packets in percent
  6. loss-count number of lost packets
  7. sent-count ICMP packets sent out
  8. response-count Matching/valid ICMP packet responses received

Edit: updated with the new info

LOL. Perhaps. I’m waiting for Apple Intelligence to do proofreading in an edit box, you’d think with all the AI talk it should be trivial to check grammar :wink: But you called me out… too lazy to proofread my posts and/or too lazy to cut-and-paste into a grammar checker.

And I do think the thr-* ones are available in the script, except they’d always return the same value since that’s what’s configured. Results are in rtt-* etc.

I can’t get rtt-loss-percent to work.

This in a script:

:local myrttlossper [get [find where host=$myhost] rtt-loss-percent]

kicks back an error

It’s just $“loss-percent”; $“thr-loss-percent” defines where the $“loss-percent” fails. These variables are already predefined in the down-script=/up-script=, so they do not have to be declared or fetched with “get”.

That works.

Is there some terminology somewhere that would inform me of which variables are already defined? And what is this type of predefined variable called, so that when I read “loss-percent” somewhere I know it is a predefined variable?

Thanks!

The variable names match winbox, except the name is all lowercase, and any spaces become a dash (-):
netwatch-icmp-variables.png
As noted, both here and docs, if it contains a - (or space as shown in winbox), then you need to use $“first-second” in any scripts.
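
As a sketch of how that looks for the loss statistics, the line below could go in the up-script= / down-script= of an ICMP netwatch entry. It follows the same pattern as the rtt- example earlier in the thread. Only $"loss-percent" was confirmed to work in this thread; the loss-count, sent-count and response-count names are taken from the property list above and are assumed to be predefined in the same way.

```
# predefined netwatch variables; the ones containing a dash need the $"..." form
:log info "NETWATCH host=$host status=$status loss-percent=$"loss-percent" loss-count=$"loss-count" sent-count=$"sent-count" response-count=$"response-count" "
```

The space before the closing quote mirrors the earlier rtt- example, so the last quoted variable name is not immediately followed by the string’s closing quote.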


It should be in the docs is the issue.

I do indeed see:
Screenshot 2025-05-04 155716.png
But I would never have figured out without your help that the system has predefined variables matching what is displayed (and that is not considering the inclusion of a “-” or the need for $“”).

How would one find this out? I suspect there are plenty of other areas of ROS where predefined variables exist.