ilium007
Member Candidate
Topic Author
Posts: 281
Joined: Sun Jan 31, 2010 9:58 am
Location: Brisbane, Australia

Netwatch UP threshold

Sat Apr 26, 2025 7:39 am

What are the default thresholds for a netwatch script to mark a check as UP? I have found forum responses where it seems there used to be netwatch attributes such as success-count but these do not seem to be available in rOS 7.

I have my netwatch "up-script" set to:

:local details ( "sent-count=" . ($"sent-count") . " response-count=" . ($"response-count"))
:log info "PRIMARY INET UP $details"
:if (($"response-count") = ($"sent-count")) do={
/ip route set distance=1 number=[find comment="primary_route"] 
} 

but realised that if response-count is less than sent-count (say, if the link is not yet reliable), the primary_route distance will never be set: netwatch still marks the host as UP, and it may then flap DOWN/UP. I need a way to set my primary_route only if the UP condition is reliable, not after just a single successful ICMP result.

I can't find documentation on how RouterOS determines the netwatch UP condition.

I thought it might just be the reverse of the DOWN threshold, i.e., in my case (I'm using the netwatch ICMP probe type):
packet-count=15 packet-interval=500ms thr-avg=600ms thr-jitter=2s thr-loss-percent=85% thr-max=2s thr-stdev=500ms
But it marks the connection as UP after a single ICMP packet is successfully sent.
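
One way to think about a "reliable UP" is hysteresis: only treat the link as up after several consecutive clean runs. The Python sketch below only models that idea. It is not how netwatch itself decides UP (nothing in this thread establishes that), and the class name and the required=3 value are made up for illustration.

```python
class UpHysteresis:
    """Report 'up' only after `required` consecutive fully answered runs."""

    def __init__(self, required: int = 3) -> None:
        self.required = required  # hypothetical value, tune to taste
        self.streak = 0
        self.up = False

    def feed(self, sent_count: int, response_count: int) -> bool:
        # A run is "clean" only if every probe that was sent got a reply.
        clean = sent_count > 0 and response_count == sent_count
        if clean:
            self.streak += 1
            if self.streak >= self.required:
                self.up = True
        else:
            # Any lossy run resets the streak and drops back to "not up".
            self.streak = 0
            self.up = False
        return self.up


h = UpHysteresis(required=3)
runs = [(15, 15), (15, 14), (15, 15), (15, 15), (15, 15)]
states = [h.feed(sent, resp) for sent, resp in runs]
print(states)  # [False, False, False, False, True]
```

The same counter could presumably be kept in a RouterOS global variable and updated from a scheduled script, but that part is left out here.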
 
jaclaz
Forum Guru
Posts: 2873
Joined: Tue Oct 03, 2023 4:21 pm

Re: Netwatch UP threshold

Sat Apr 26, 2025 11:49 am

The netwatch ICMP probe is (IMHO) complex and mis- or under-documented.

However (from what I understand) the logic seems to be like a double negation, UP is "NOT down".

But then I don't understand why you are comparing response-count against sent-count. :?
If they are equal, it should be the same as thr-loss-count=0. :?:

But then, with appropriate values set for the packet-count and thr-loss-count probe options, there should be no need for a comparison within the UP (or DOWN) script: the condition is already checked by netwatch (which, if needed, triggers the up or down state).

The problems seem to me to be caused by the other probe settings, which are seemingly very difficult to "tune" correctly.

Amm0 wrote a script to more easily set them, maybe useful:
viewtopic.php?t=205645
 
ilium007
Member Candidate
Topic Author
Posts: 281
Joined: Sun Jan 31, 2010 9:58 am
Location: Brisbane, Australia

Re: Netwatch UP threshold

Sat Apr 26, 2025 12:28 pm

Thanks - I did look at his script but I still can't work out how netwatch transitions to "up" state. I ended up writing my own failover script to schedule. Will post in the forums for discussion.
 
Amm0
Forum Guru
Posts: 4862
Joined: Sun May 01, 2016 7:12 pm
Location: California

Re: Netwatch UP threshold

Sat Apr 26, 2025 3:21 pm

I'm not sure my script is the best example of how the icmp check works; it assumes you understand the netwatch model.

Docs could be improved to explain the high level logic of netwatch. But they do describe all the parameters: https://help.mikrotik.com/docs/spaces/R ... obeoptions

For the ICMP check, what's likely going on is that all values have defaults (which the docs DO show), so if something is unset/empty/blank, a default is used, and that default will be used to decide UP/DOWN.
So for an "icmp" check, ALL of the measured values MUST be below the matching thr-* value (like thr-avg): either the one you set OR, importantly, the default thr-* value. Often it's the "hidden default" that causes an unexpected down.

The UP/DOWN for the icmp check is determined after <packet-count> packets have been sent, one every <packet-interval>. With the defaults, that means it runs for 500ms (10*50ms), and that's when the netwatch test compares ALL of the thr-* stuff, like thr-loss-percent, against the collection of pings. Another check is run at the next <interval>, which applies to all tests. Also <packet-count>*<packet-interval> should be below BOTH the "global" <interval> and <timeout> that apply to all netwatch types.

To troubleshoot, it may be easier to look at a failure and compare the thr-* values with the default values shown in docs. If any are higher than default, then set your netwatch to use a higher value for one of the thr-* in your config.

There is also the "simple" check, where you avoid all the thr-* values that determine the results. So the idea of the "icmp" check is that you do want to monitor (and fail) if ping was "slow", not just not responding. But what "slow" means does vary, thus all the options to configure the icmp check. That does mean you need to define them in a lot of cases. (And to reiterate: the default values are used to determine success/failure, so an empty/unset value is still part of the ICMP check.)
 
jaclaz
Forum Guru
Posts: 2873
Joined: Tue Oct 03, 2023 4:21 pm

Re: Netwatch UP threshold

Sat Apr 26, 2025 4:58 pm

The point about:
"<packet-count>*<packet-interval> should be below BOTH the "global" <interval> and <timeout> that apply to all netwatch types. "
is interesting, never actually thought about it.

The defaults are:
General properties:
interval=10s
timeout=3s
ICMP specific probe options:
packet-interval=50ms
packet-count=10

If we use a fictional "my-packet-time" defined as:
my-packet-time=packet-interval*packet-count
the default is
my-packet-time=0.05*10=0.50s

So BOTH:
my-packet-time<interval = 0.50<10
and
my-packet-time<timeout = 0.50<3
are true.

The default packet-interval of 50ms does at first sight sound a lot like hammering[1], so it is understandable that ilium007 increased it to 500ms.

But that, combined with the increase in the number of packets from 10 to 15, makes my-packet-time=0.5*15=7.5s, which will probably be less than interval, but larger than timeout.




[1] the default ping timeout in Windows is 4 seconds and in Linux the -i default parameter is 1 second
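
The arithmetic above can be sketched in a few lines of Python. This only reproduces the "should be below both interval and timeout" rule of thumb quoted in this post; it does not claim to be what netwatch enforces. The function name is invented, and the second call assumes ilium007 left interval and timeout at their defaults, which the thread does not say.

```python
def check_burst(packet_count, packet_interval_ms, interval_ms, timeout_ms):
    """Compare the ping burst length with the general interval and timeout."""
    burst_ms = packet_count * packet_interval_ms  # the "my-packet-time" above
    return {
        "burst_ms": burst_ms,
        "below_interval": burst_ms < interval_ms,
        "below_timeout": burst_ms < timeout_ms,
    }


# Documented defaults: 10 packets, 50 ms apart, interval=10s, timeout=3s.
defaults = check_burst(10, 50, 10_000, 3_000)
# ilium007's probe options, with interval/timeout ASSUMED to be the defaults.
custom = check_burst(15, 500, 10_000, 3_000)

print(defaults)  # {'burst_ms': 500, 'below_interval': True, 'below_timeout': True}
print(custom)    # {'burst_ms': 7500, 'below_interval': True, 'below_timeout': False}
```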
 
Amm0
Forum Guru
Posts: 4862
Joined: Sun May 01, 2016 7:12 pm
Location: California

Re: Netwatch UP threshold

Sat Apr 26, 2025 5:21 pm

jaclaz wrote:
The point about:
"<packet-count>*<packet-interval> should be below BOTH the "global" <interval> and <timeout> that apply to all netwatch types. "
is interesting, never actually thought about it.

I dunno actually, thus the advice ("should")... is to avoid having to know. ;) Again the docs aren't clear.

Anyway, the reason I mentioned it is that the ICMP packet-interval= is very different from interval= (and potentially confusing). And I guess the minor point was that there may be some value in aligning them, to "spread" the pings across the interval= by adjusting the packet-interval=. Multiple interval-ish things do require some understanding, though... i.e. it's nice that you can really customize everything, but it's not easy to know what you're checking ;).
 
jaclaz
Forum Guru
Posts: 2873
Joined: Tue Oct 03, 2023 4:21 pm

Re: Netwatch UP threshold

Sat Apr 26, 2025 6:08 pm

Yep, but until someone manages to "decrypt" :wink: the documentation, translating it from Mikrotikish to plain English and adding some commentaries, all this flexibility is counterproductive.

BTW (only as a side-side note) the
thr-loss-percent (Default: 85.0%)
is curious, in the sense that if one leaves the packet count at the default 10, it should mean that 9 is up, but 8 is down, so is the threshold actually 85% or is it 80%?
Are values between 81% and 89% all the same, or is it 80% to 89% (or 81% to 90%)?

I.e. is the comparison against threshold values made with > or with >=? :?
 
ilium007
Member Candidate
Topic Author
Posts: 281
Joined: Sun Jan 31, 2010 9:58 am
Location: Brisbane, Australia

Re: Netwatch UP threshold

Sat Apr 26, 2025 6:22 pm

RouterOS doesn't do floating point math, so I think 85% - 89% of 10 pings would all come out as 9 required. See my failover script and explanation of using scaled math to work out the percentages: https://gist.github.com/ilium007/5cbe63 ... 1dc55bb967
:put (((10*85)+99)/100)
9
:put (((10*86)+99)/100)
9
:put (((10*87)+99)/100)
9
:put (((10*88)+99)/100)
9
:put (((10*89)+99)/100)
9
:put (((10*90)+99)/100)
9
:put (((10*91)+99)/100)
10
 
Amm0
Forum Guru
Posts: 4862
Joined: Sun May 01, 2016 7:12 pm
Location: California

Re: Netwatch UP threshold

Sat Apr 26, 2025 6:32 pm

ilium007 wrote:
RouterOS doesn't do floating point math

One has to imagine netwatch is implemented in C, so internally netwatch can do floating point... Now it's an open question whether thr-loss-percent is inclusive or not (i.e. > or >=).

However, @ilium007 is correct that user scripting does not do floating point... so scripting thr-loss-percent= requires the above trick.
 
jaclaz
Forum Guru
Posts: 2873
Joined: Tue Oct 03, 2023 4:21 pm

Re: Netwatch UP threshold

Sat Apr 26, 2025 8:10 pm

Well, that seems to me another thing.

If in that formula 91% comes out as 10, it is not really-really rounding, it is *something else*; with that formula 81 also gives 9.

Mathematical rounding should be a formula where 80=8, 84=8 and 85=9, that formula instead is as if in a spreadsheet you use int:
=int(((10*91)+99)/100) -> 10
so the value when something changes is not 85%, it is 81% (and 81% up to 90% all come out as 9).

With integer limitations, the easier formula would be:
(80+5)/10=8
(81+5)/10=8
(82+5)/10=8
(83+5)/10=8
(84+5)/10=8
(85+5)/10=9
(86+5)/10=9
(87+5)/10=9
(88+5)/10=9
(89+5)/10=9
(90+5)/10=9
(91+5)/10=9
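
The two integer formulas discussed in this thread can be compared side by side in Python, where // is integer division like the / in the RouterOS examples above. This is just a sketch of the arithmetic; the function names are invented, and which rule (if either) netwatch applies internally is not known from this thread.

```python
def ceil_percent_of(count: int, percent: int) -> int:
    """ilium007's scaled formula: percent of count, rounded UP to a whole packet."""
    return (count * percent + 99) // 100


def round_half_up_tens(percent: int) -> int:
    """jaclaz's formula: percent divided by 10, rounded to nearest (half goes up)."""
    return (percent + 5) // 10


for p in (80, 81, 84, 85, 90, 91):
    print(p, ceil_percent_of(10, p), round_half_up_tens(p))
# 80 8 8
# 81 9 8
# 84 9 8
# 85 9 9
# 90 9 9
# 91 10 9
```

So the first formula changes value at 81% and 91% (a ceiling), while the second changes at 85% and 95% (conventional rounding).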
 
Josephny
Forum Guru
Posts: 1281
Joined: Tue Sep 20, 2022 12:11 am
Location: New York, USA

Re: Netwatch UP threshold

Wed Apr 30, 2025 1:54 pm

Do I understand this correctly that the following code will check the host once every 1 minute by sending out 100 packets with a 100ms interval between packets and return a 'HOST-IS-DOWN' result if the average response time of those 100 packets is greater than 400ms AND the total packet loss rate is 90% or above (i.e., 90 or more packets fail to elicit a response from the host)?
/tool netwatch
add disabled=no host=192.168.2.2 http-codes="" interval=1m name=Netwatch-192.168.2.2 packet-count=100 \
 packet-interval=100ms thr-avg=400ms thr-loss-percent=90% type=icmp
The goal for me here, BTW, is to not wake up to a screen full of notifications about down hosts when the ISP does some 1 minute flapping in the middle of the night.
 
jaclaz
Forum Guru
Posts: 2873
Joined: Tue Oct 03, 2023 4:21 pm

Re: Netwatch UP threshold

Wed Apr 30, 2025 2:57 pm

Josephny wrote:
The goal for me here, BTW, is to not wake up to a screen full of notifications about down hosts when the ISP does some 1 minute flapping in the middle of the night.

That is rather easy, suggested "down" script contents:
#DO NOTHING
:lol:
 
Amm0
Forum Guru
Posts: 4862
Joined: Sun May 01, 2016 7:12 pm
Location: California

Re: Netwatch UP threshold

Wed Apr 30, 2025 3:00 pm

That looks right to me. As I said, I'd compare the "Status" to make sure all the other RTT things are well within the defaults. If not, or if they are even close, specifically set the various thr-* higher.

If you temporarily enable topics=netwatch in /system/logging, it will log both the values measured and the "threshold" amounts, so you can see how close you are to the limits. You'd likely want to disable netwatch in logging after testing (or use a different log than the default "memory"), as that could fill your log. But the logging will show pretty clearly how "close" you are to the limits.

jaclaz wrote:
The point about:
"<packet-count>*<packet-interval> should be below BOTH the "global" <interval> and <timeout> that apply to all netwatch types. "
is interesting, never actually thought about it.

Amm0 wrote:
I dunno actually, thus the advice ("should")... is to avoid having to know. ;) Again the docs aren't clear.

WinBox4 with 7.19rc1 (perhaps earlier, didn't check) will give you a warning that it adjusts interval= to match. They give the formula for that.
netwatch-icmp-interval-warning.png
 
Josephny
Forum Guru
Posts: 1281
Joined: Tue Sep 20, 2022 12:11 am
Location: New York, USA

Re: Netwatch UP threshold

Wed Apr 30, 2025 3:10 pm

jaclaz wrote:
Josephny wrote: The goal for me here, BTW, is to not wake up to a screen full of notifications about down hosts when the ISP does some 1 minute flapping in the middle of the night.
That is rather easy, suggested "down" script contents:
#DO NOTHING
:lol:

LOL! That's just what they think you'd think they'd say you'd say.
 
wiseroute
Member
Posts: 427
Joined: Sun Feb 05, 2023 11:06 am

Re: Netwatch UP threshold

Wed Apr 30, 2025 4:31 pm

@josephny,

I think your script output problem lies in your understanding of how the script is processed by MikroTik.
packet-count=15 packet-interval=500ms thr-avg=600ms thr-jitter=2s thr-loss-percent=85% thr-max=2s thr-stdev=500ms
Try adding the parameters one at a time and see how netwatch processes them. See how the 'and' and 'or' act between those parameters' results.
 
Josephny
Forum Guru
Posts: 1281
Joined: Tue Sep 20, 2022 12:11 am
Location: New York, USA

Re: Netwatch UP threshold

Wed Apr 30, 2025 4:55 pm

Amm0 wrote:
That looks right to me. As I said, I'd compare the "Status" to make sure all the other RTT things are well within the defaults. If not, or even close, specifically set the various thr-* higher.

If you temporarily enable topics=netwatch in /system/logging, it will log both the values got, and the "threshold" amount so you can see how close you are to the limits. You'd likely want to disable netwatch in logging after testing (or use different log than default "memory"), as that could fill your log. But the logging will show pretty clearly how "close" you are to the limits.

I dunno actually, thus the advice ("should")... is to avoid having to know. ;) Again the docs aren't clear.
WinBox4 with 7.19rc1 (perhaps earlier, didn't check) will give you a warning that it adjusts interval= to match. They give the formula for that.
netwatch-icmp-interval-warning.png

Great suggestion to enable logging -- I didn't know the thresholds would be shown.
 
Josephny
Forum Guru
Posts: 1281
Joined: Tue Sep 20, 2022 12:11 am
Location: New York, USA

Re: Netwatch UP threshold

Wed Apr 30, 2025 5:10 pm

With netwatch:
/tool netwatch
add disabled=no host=192.168.2.2 http-codes="" interval=1m name=Netwatch-192.168.2.2 packet-count=100 packet-interval=100ms thr-avg=400ms \
thr-loss-percent=90% type=icmp
Enabled logging and seeing this every minute:
2025-04-30 09:59:32 netwatch,debug [ Netwatch-192.168.2.2 ] Stats:
2025-04-30 09:59:32 netwatch,debug [ OK ] rtt-max: 40.442 ms [ <= 1000.000 ms ]
2025-04-30 09:59:32 netwatch,debug [ OK ] rtt-jitter: 24.666 ms [ <= 1000.000 ms ]
2025-04-30 09:59:32 netwatch,debug [ OK ] rtt-avg: 22.661 ms [ <= 400.000 ms ]
2025-04-30 09:59:32 netwatch,debug [ OK ] rtt-stdev: 4.330 ms [ <= 250.000 ms ]
2025-04-30 09:59:32 netwatch,debug [ OK ] loss count: 0 [ <= 4294967295 ]
2025-04-30 09:59:32 netwatch,debug [ OK ] loss: 0% [ <= 90.0% ]
Any way to have it logged only when host is determined to be down?

Maybe a down-script:

:local thisBox [/system identity get name];
:local Host $host
/tool netwatch
:local Status [get [find where host="$Host"] status]
:local Comment [get [find where host="$Host"] comment]
:local Interval [get [find where host="$Host"] interval]
# unquoted variable names may only contain letters and digits, so no hyphen in the name
:local RttAvg [get [find where host="$Host"] rtt-avg]
# "loss" is kept from the original post; later posts in this thread name the result loss-percent
:local Loss [get [find where host="$Host"] loss]
:log info "watch_host=$Host comment=\"$Comment\" rtt-avg=$RttAvg interval=$Interval loss=$Loss"
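
If the debug log is exported off the router, the "how close am I to the limits" comparison can be done offline. The Python sketch below parses the exact lines quoted in this post. The regular expression is fitted to those sample lines only, so it is an assumption that other RouterOS versions (or failing checks) use the same layout.

```python
import re

LOG = """\
2025-04-30 09:59:32 netwatch,debug [ Netwatch-192.168.2.2 ] Stats:
2025-04-30 09:59:32 netwatch,debug [ OK ] rtt-max: 40.442 ms [ <= 1000.000 ms ]
2025-04-30 09:59:32 netwatch,debug [ OK ] rtt-jitter: 24.666 ms [ <= 1000.000 ms ]
2025-04-30 09:59:32 netwatch,debug [ OK ] rtt-avg: 22.661 ms [ <= 400.000 ms ]
2025-04-30 09:59:32 netwatch,debug [ OK ] rtt-stdev: 4.330 ms [ <= 250.000 ms ]
2025-04-30 09:59:32 netwatch,debug [ OK ] loss count: 0 [ <= 4294967295 ]
2025-04-30 09:59:32 netwatch,debug [ OK ] loss: 0% [ <= 90.0% ]
"""

# verdict, metric name, measured value, limit (units "ms" / "%" are optional)
LINE = re.compile(
    r"\[ (\w+) \] ([\w -]+): ([\d.]+)\s*(?:ms|%)? \[ <= ([\d.]+)\s*(?:ms|%)? \]"
)


def parse(log: str):
    """Return (name, value, limit, verdict) for every stats line that matches."""
    rows = []
    for line in log.splitlines():
        m = LINE.search(line)
        if m:
            verdict, name, value, limit = m.groups()
            rows.append((name, float(value), float(limit), verdict))
    return rows


rows = parse(LOG)
for name, value, limit, verdict in rows:
    print(f"{name:12s} {verdict:4s} {100 * value / limit:6.2f}% of limit")

closest = max(rows, key=lambda r: r[1] / r[2])
print("closest to its limit:", closest[0])  # rtt-avg
```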
 
jaclaz
Forum Guru
Posts: 2873
Joined: Tue Oct 03, 2023 4:21 pm

Re: Netwatch UP threshold

Wed Apr 30, 2025 5:23 pm

@Amm0
I guess that writing that formula in the documentation was just too much work? :shock:

Seriously, which parameters lead to 19s620ms?


@Josephny
Using variables in scripts with the same names as ROS parameters/values/commands/etc. is usually not a good idea.
See point #16 here:
viewtopic.php?p=1128345
Use (say) a my_ prefix, i.e. my_status, my_comment, etc.
 
Amm0
Forum Guru
Posts: 4862
Joined: Sun May 01, 2016 7:12 pm
Location: California

Re: Netwatch UP threshold

Wed Apr 30, 2025 5:45 pm

jaclaz wrote:
Seriously, which parameters lead to 19s620ms?

interval=10s packet-count=50 packet-interval=380ms
 
jaclaz
Forum Guru
Posts: 2873
Joined: Tue Oct 03, 2023 4:21 pm

Re: Netwatch UP threshold

Wed Apr 30, 2025 6:19 pm

jaclaz wrote:
Seriously, which parameters lead to 19s620ms?

Amm0 wrote:
interval=10s packet-count=50 packet-interval=380ms

And timeout=1s, OK. :) (interval is not used in the formula, it is just the reference).
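
The formula itself is only in the screenshot and is not spelled out in the thread. As a guess, one expression that reproduces 19s620ms from the numbers given here (packet-count=50, packet-interval=380ms, timeout=1s) is (packet-count - 1) * packet-interval + timeout. The Python sketch below checks that arithmetic; it is a hypothesis fitted to a single data point, not MikroTik's documented formula, and the function names are invented.

```python
def guessed_minimum_interval_ms(packet_count, packet_interval_ms, timeout_ms):
    # HYPOTHETICAL: time until the last packet goes out, plus one timeout
    # to wait for its reply. Fitted to one example only.
    return (packet_count - 1) * packet_interval_ms + timeout_ms


def fmt(ms: int) -> str:
    """Format milliseconds in the 19s620ms style used in the warning."""
    seconds, millis = divmod(ms, 1000)
    return f"{seconds}s{millis}ms"


value = guessed_minimum_interval_ms(50, 380, 1000)
print(fmt(value))  # 19s620ms
```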
 
Josephny
Forum Guru
Posts: 1281
Joined: Tue Sep 20, 2022 12:11 am
Location: New York, USA

Re: Netwatch UP threshold

Thu May 01, 2025 1:45 am


jaclaz wrote:
@Josephny
Using variables in scripts with the same names as ROS parameters/values/commands/etc. is usually not a good idea.
point #16 here:
viewtopic.php?p=1128345
use (say) my_ prefix, i.e. my_status, my_comment, etc.

Adding "my_" to the name of an ROS parameter/value/command is a great suggestion -- thank you.

I "lifted" the script that I used as a basis for what I posted earlier from someone else's work, where he capitalized the first letter to get around the problem of using the same variable name. I definitely like "my_" better -- and, thanks to you, now I can have the confidence to implement it.
 
Josephny
Forum Guru
Posts: 1281
Joined: Tue Sep 20, 2022 12:11 am
Location: New York, USA

Re: Netwatch UP threshold

Thu May 01, 2025 2:26 am

Ugh! Everything takes so much time, troubleshooting, effort, frustration.

I've got a whole big (for me) script, and continue to get an error.

Troubleshooting by commenting out lines, this line is the culprit:
:local my_host $host
With this being the only non-commented line, the script fails:
executing script Netwatch-details from netwatch failed, please check it manually
 
Amm0
Forum Guru
Posts: 4862
Joined: Sun May 01, 2016 7:12 pm
Location: California

Re: Netwatch UP threshold

Thu May 01, 2025 2:44 am

In the case of variables names, you cannot use underscore without quotes AFAIK. So :local "my_variable" not :local my_variable.
 
Josephny
Forum Guru
Posts: 1281
Joined: Tue Sep 20, 2022 12:11 am
Location: New York, USA

Re: Netwatch UP threshold

Thu May 01, 2025 4:56 am

Thanks, that fixed that line.

But the rest didn't work.

I tried various combinations of quotes.

I removed the underscore.

Then I discovered that hyphens aren't liked either.

There are too many unintuitive rules with this scripting!

This works:
:local myhost $host
/tool netwatch
:local mystatus [get [find where host=$myhost] status]
:local mycomment [get [find where host=$myhost] comment]
:local myinterval [get [find where host=$myhost] interval]
:local mysince [get [find where host=$myhost] since]
:local mytimeout [get [find where host=$myhost] timeout]
:local mypacketinterval [get [find where host=$myhost] packet-interval]
:local mypacketcount [get [find where host=$myhost] packet-count]
:local mypacketsize [get [find where host=$myhost] packet-size]
:local mythrmax [get [find where host=$myhost] thr-max]
:local mythravg [get [find where host=$myhost] thr-avg]
:local mythrlosspercent [get [find where host=$myhost] thr-loss-percent]
:local mythrlosscount [get [find where host=$myhost] thr-loss-count]

:log info "host=$myhost status=$mystatus comment=$mycomment interval=$myinterval timeout=$mytimeout packet-interval=$mypacketinterval packet-count=$mypacketcount thr-avg=$mythravg thr-loss-percent=$mythrlosspercent"
 
Amm0
Forum Guru
Posts: 4862
Joined: Sun May 01, 2016 7:12 pm
Location: California

Re: Netwatch UP threshold

Thu May 01, 2025 5:37 am

Well, that is a different approach; I guess it avoids the permissions needed for netwatch. The only esoteric issue with using a script/scheduler outside of the "On Down"/"On Up" netwatch scripts would be if the netwatch polling happened while it is running: in that case the values your script reads could be a mix of the last test and the current one, although that would be pretty unlikely.

Josephny wrote:
I tried various combinations of quotes.
I removed the underscore.
Then I discovered that hyphens aren't liked either.
There are too many unintuitive rules with this scripting!

In fairness, the docs do say: "Valid characters in variable names are letters and digits. If the variable name contains any other character, then the variable name should be put in double quotes."
And the docs do show that to use a quoted name you write $"my variable", but I would never have suggested using _ or - ... the syntax gets even trickier if you do.

Certainly a lot of rules ... but I think whoever suggested using _ is more to blame. ;)
 
jaclaz
Forum Guru
Posts: 2873
Joined: Tue Oct 03, 2023 4:21 pm

Re: Netwatch UP threshold

Thu May 01, 2025 12:37 pm

Well, I can well take the blame, no problem, for omitting the need for double quotes if non-letter and non-number characters are used for variable names.

BTW coming from the (good?) ol' times where variables could only be called A, B, C, D, etc., the RoS limitations (while completely absurd in these days and times) still don't seem too bad to me.

Why, in my day ....

LUXURY!

... kids today!

https://tinyapps.org/blog/200702250700_ ... y_day.html

:lol:
 
Josephny
Forum Guru
Posts: 1281
Joined: Tue Sep 20, 2022 12:11 am
Location: New York, USA

Re: Netwatch UP threshold

Thu May 01, 2025 1:24 pm

It gets even better:

Here I am thinking I will capture the actual values of these variables and write them to the log, instead of turning on logging for topic netwatch and having the log fill up, when, in reality, the only thing being logged is the netwatch settings (i.e., not the netwatch results).

Or maybe it is working as envisioned -- I don't even know any more.

So, while this has been informative and I am one step (out of 1,000,000,000,000) closer to competency (thanks to you guys!), I still need a way to capture the netwatch icmp details for when a host goes down.

And regarding being an old-timer, I am indeed one also, only without the years of design/programming/etc. And, to invoke a crossover to the docs:

I was around (and active on Bitnet, one of the predecessors of the current Internet; and thought Usenet groups were the next stage in human evolution) when RTFM was invented!
 
Josephny
Forum Guru
Posts: 1281
Joined: Tue Sep 20, 2022 12:11 am
Location: New York, USA

Re: Netwatch UP threshold

Sun May 04, 2025 1:08 pm

Still playing with netwatch and trying these settings:
/tool netwatch
add comment=Netwatch-192.168.0.11 disabled=no down-script=Netwatch-details host=192.168.0.11 http-codes="" interval=2m name=Netwatch-192.168.0.11 packet-count=300 packet-interval=200ms \
    test-script="" thr-avg=400ms thr-loss-percent=95% type=icmp up-script=Netwatch-details

This means (if I'm understanding correctly) that every 2 minutes a netwatch process will start and include 1 minute of pings consisting of 300 packets sent 200ms apart. And, a fail would be an average rtt of 400ms (or greater) and a loss percentage of 95% (or greater) for the group of 300 packets. (I'm still not clear on the "and" part, but I think it's a safe assumption -- as opposed to "or".)

On a status change (up or down), this is the script that is run:
:local myhost $host
/tool netwatch
:local mystatus [get [find where host=$myhost] status]
:local mycomment [get [find where host=$myhost] comment]
:local myinterval [get [find where host=$myhost] interval]
:local mysince [get [find where host=$myhost] since]
:local mytimeout [get [find where host=$myhost] timeout]
:local mypacketinterval [get [find where host=$myhost] packet-interval]
:local mypacketcount [get [find where host=$myhost] packet-count]
:local mypacketsize [get [find where host=$myhost] packet-size]
:local mythrmax [get [find where host=$myhost] thr-max]
:local mythravg [get [find where host=$myhost] thr-avg]
:local mythrlosspercent [get [find where host=$myhost] thr-loss-percent]
:local mythrlosscount [get [find where host=$myhost] thr-loss-count]
:local myrttavg [get [find where host=$myhost] rtt-avg]

:log info "NETWATCH host=$myhost status=$mystatus comment=$mycomment interval=$myinterval timeout=$mytimeout packet-interval=$mypacketinterval packet-count=$mypacketcount thr-avg=$mythravg thr-loss-percent=$mythrlosspercent rttavg=$myrttavg"
Disabling and enabling this netwatch entry created the following log entry:
NETWATCH host=192.168.0.11 status=up comment=Netwatch-192.168.0.11 interval=00:02:00 timeout= packet-interval=00:00:00.200 packet-count=300 thr-avg=00:00:00.400 thr-loss-percent=950 rttavg=00:00:00.021539
It looks like all the properties that start with "thr" are threshold settings.

rtt-avg is a result or value derived by executing the process.

I looked at the docs (yes, really) and I don't see the results or stats for loss-percentage.
 
jaclaz
Forum Guru
Posts: 2873
Joined: Tue Oct 03, 2023 4:21 pm

Re: Netwatch UP threshold

Sun May 04, 2025 1:42 pm

The way I read the docs, it is a "or", i.e. there are 6 different thresholds:
thr-max (Default: 1s) Fail threshold for round trip time-max (a value above thr-max is a probe fail)
thr-avg (Default: 100ms) Fail threshold for round trip time-avg
thr-stdev (Default: 250ms) Fail threshold for round trip time-stdev
thr-jitter (Default: 1s) Fail threshold for round trip time-jitter
thr-loss-percent (Default: 85.0%) Fail threshold for loss-percent
thr-loss-count (Default: 4294967295(max)) Fail threshold for loss-count

whichever fails first triggers the netwatch.

The tricky ones are (IMHO) the "statistic" ones, avg and stdev, particularly the latter.
I believe that the real world behaviour for these might be influenced by the sheer number of pings performed, i.e. by packet-count: a higher number of pings per run should "flatten" the statistics, making these thresholds "less sensitive", whilst a lower number of pings may make them over-sensitive (provided that none of the other threshold settings triggers the netwatch earlier). :?
 
Josephny
Forum Guru
Posts: 1281
Joined: Tue Sep 20, 2022 12:11 am
Location: New York, USA

Re: Netwatch UP threshold

Sun May 04, 2025 1:59 pm

Fantastic analysis!

Makes perfect sense.

I wish there was a way of logging more stats, such as loss-percentage.
 
jaclaz
Forum Guru
Posts: 2873
Joined: Tue Oct 03, 2023 4:21 pm

Re: Netwatch UP threshold

Sun May 04, 2025 4:20 pm

Still it is not at all clear (to me :oops: ) the difference between the previously listed ICMP probe options and the ICMP properties:
sent-count ICMP packets sent out
response-count Matching/valid ICMP packet responses received
thr-loss-count number of lost packets
thr-loss-percent number of lost packets in percent
thr-avg mean value of round trip time
thr-max max round trip time
thr-jitter jitter ( = max - min) of round trip time
thr-stdev standard deviation of round trip time
the last six of which share the same name (and are listed in the doc in a different order, to better confuse the reader).
 
Amm0
Forum Guru
Posts: 4862
Joined: Sun May 01, 2016 7:12 pm
Location: California

Re: Netwatch UP threshold

Sun May 04, 2025 5:24 pm

jaclaz wrote:
The way I read the docs, it is a "or", i.e. there are 6 different thresholds:
thr-max (Default: 1s) Fail threshold for round trip time-max (a value above thr-max is a probe fail)
thr-avg (Default: 100ms) Fail threshold for round trip time-avg
thr-stdev (Default: 250ms) Fail threshold for round trip time-stdev
thr-jitter (Default: 1s) Fail threshold for round trip time-jitter
thr-loss-percent (Default: 85.0%) Fail threshold for loss-percent
thr-loss-count (Default: 4294967295(max)) Fail threshold for loss-count

whichever fails first triggers the netwatch.

In the ICMP probe, ALL values must be within spec at the end of the test (including those NOT defined, which assume the "Default: " above) - the "and/or" may be confusing.

jaclaz wrote:
The tricky ones are (IMHO) the "statistic" ones, avg and stdev, particularly the latter.
I believe that the real world behaviour for these might be influenced by the sheer number of pings performed, i.e. by packet-count, a higher number of pings per run should "flatten" the statistics, making these threshold "less sensitive", whilst a lower number of pings may make it over-sensitive (provided that any of the other threshold settings don't trigger the netwatch earlier. :?

The UP / DOWN result happens after all of the packet-count packets have been sent. There is no "fail early": it sends all the packet-count first, THEN it evaluates whether all 6 variables are within spec. Just ONE (any one) above the limit will cause a DOWN.
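
The evaluation described above can be modelled in a few lines of Python. This is a sketch of the logic as explained in this thread, not netwatch's actual code: the defaults are the ones listed from the docs, the function and dictionary names are invented, and treating "value <= limit" as OK is inferred from the "[ <= ... ]" shown in the debug log earlier in the thread.

```python
# Documented defaults, in ms / percent / packets.
DEFAULT_THRESHOLDS = {
    "rtt-max": 1000.0,          # thr-max
    "rtt-avg": 100.0,           # thr-avg
    "rtt-stdev": 250.0,         # thr-stdev
    "rtt-jitter": 1000.0,       # thr-jitter
    "loss-percent": 85.0,       # thr-loss-percent
    "loss-count": 4294967295,   # thr-loss-count
}


def evaluate(results, overrides=None):
    """Check a finished run: any single value above its limit means DOWN."""
    thresholds = {**DEFAULT_THRESHOLDS, **(overrides or {})}
    failed = [name for name, limit in thresholds.items() if results[name] > limit]
    return ("down" if failed else "up"), failed


# A hypothetical slow-but-lossless link: only thr-avg was never configured.
slow = {"rtt-max": 300.0, "rtt-avg": 150.0, "rtt-stdev": 40.0,
        "rtt-jitter": 200.0, "loss-percent": 0.0, "loss-count": 0}

print(evaluate(slow))                      # ('down', ['rtt-avg'])
print(evaluate(slow, {"rtt-avg": 400.0}))  # ('up', [])
```

The first call shows the "hidden default" effect: nothing was lost, yet the run counts as down because the unset thr-avg falls back to 100ms.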
 
Amm0
Forum Guru
Posts: 4862
Joined: Sun May 01, 2016 7:12 pm
Location: California

Re: Netwatch UP threshold

Sun May 04, 2025 5:33 pm

jaclaz wrote:
The tricky ones are (IMHO) the "statistic" ones, avg and stdev, particularly the latter.
I believe that the real world behaviour for these might be influenced by the sheer number of pings performed, i.e. by packet-count, a higher number of pings per run should "flatten" the statistics, making these threshold "less sensitive", whilst a lower number of pings may make it over-sensitive (provided that any of the other threshold settings don't trigger the netwatch earlier. :?

I re-read the docs this AM.

The docs are incomplete(/wrong): the values available in a netwatch script AFTER a test are not thr-* but rather rtt-*. You can see these values in the "Status" section, and if you are logging the RESULTS of a test you want to use rtt-*, like rtt-max/min/stdev/etc. More specifically $"rtt-max", since the variable name has a special character (-). The thr-* are the spec/requirement that DEFINE what success is, not the result.

For example, if this is used in up-script= or down-script:
:log info "NETWATCH host=$host status=$status comment=$comment interval=$interval rtt-avg=$"rtt-avg" rtt-min=$"rtt-min" rtt-max=$"rtt-max" rtt-stdev=$"rtt-stdev" rtt-jitter=$"rtt-jitter" "
you get a log entry if status changes:
NETWATCH host=8.8.8.8 status=up comment= interval=1000 rtt-avg=12769 rtt-min=12406 rtt-max=12995 rtt-stdev=200 rtt-jitter=589 

Filed a doc bug about this (SUP-187116).... since docs should mention the rtt- values.
 
jaclaz
Forum Guru
Posts: 2873
Joined: Tue Oct 03, 2023 4:21 pm

Re: Netwatch UP threshold

Sun May 04, 2025 6:50 pm

Nice finding about the rtt- prefix instead of the thr- one, now it starts making sense (and it confirms that proofreading is a lost art). But Amm0, with all due respect :) , you need to use more linear English if you want to explain something (or maybe you also got the Latvian virus that makes affected people use periphrasis or double negations excessively? :shock: ).
But you are right, by "earlier" I was meaning "directly", without considering an average or standard deviation of all the results.

Let's see if the following is accurate (and simple) enough:

There are 6 thresholds in ICMP probe options, they are:
1. thr-max (Default: 1s) Fail threshold for round trip time-max (a value above thr-max is a probe fail)
2. thr-avg (Default: 100ms) Fail threshold for round trip time-avg
3. thr-stdev (Default: 250ms) Fail threshold for round trip time-stdev
4. thr-jitter (Default: 1s) Fail threshold for round trip time-jitter
5. thr-loss-percent (Default: 85.0%) Fail threshold for loss-percent
6. thr-loss-count (Default: 4294967295(max)) Fail threshold for loss-count
If, in a single run of the netwatch, ANY of the above is exceeded, the netwatch will fail (trigger the "down" status).

There are 9 properties (or results) of the ICMP probe (which the help doc incorrectly lists with the thr- prefix instead of rtt- or no prefix, and in a jumbled order), here (hopefully) corrected and re-ordered:
1. rtt-max max round trip time
1 bis. rtt-min min round trip time
2. rtt-avg mean value of round trip time
3. rtt-stdev standard deviation of round trip time
4. rtt-jitter jitter ( = max - min) of round trip time
5. loss-percent number of lost packets in percent
6. loss-count number of lost packets
7. sent-count ICMP packets sent out
8. response-count Matching/valid ICMP packet responses received
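
To make the relationship between the thr-* limits and the rtt-* results concrete, here is a rough Python sketch (not RouterOS code). It assumes, as described above, that a run is marked "down" when ANY result exceeds its threshold. The function names and the sample RTT values are made up for illustration, and whether RouterOS uses a population or a sample standard deviation is an assumption (population is used here).

```python
from statistics import mean, pstdev

# Default ICMP thresholds as listed above (times in milliseconds).
DEFAULT_THRESHOLDS = {
    "thr-max": 1000.0,
    "thr-avg": 100.0,
    "thr-stdev": 250.0,
    "thr-jitter": 1000.0,
    "thr-loss-percent": 85.0,
    "thr-loss-count": 4294967295,
}

# Which result each threshold is compared against.
RESULT_FOR = {
    "thr-max": "rtt-max",
    "thr-avg": "rtt-avg",
    "thr-stdev": "rtt-stdev",
    "thr-jitter": "rtt-jitter",
    "thr-loss-percent": "loss-percent",
    "thr-loss-count": "loss-count",
}

def summarize(rtts_ms, sent_count):
    """Build the results of one run from the RTTs of the replies received."""
    response_count = len(rtts_ms)
    loss_count = sent_count - response_count
    stats = {
        "sent-count": sent_count,
        "response-count": response_count,
        "loss-count": loss_count,
        "loss-percent": 100.0 * loss_count / sent_count,
    }
    if rtts_ms:  # rtt-* values only exist if at least one reply arrived
        stats["rtt-max"] = max(rtts_ms)
        stats["rtt-min"] = min(rtts_ms)
        stats["rtt-avg"] = mean(rtts_ms)
        stats["rtt-stdev"] = pstdev(rtts_ms)  # assumption: population stdev
        stats["rtt-jitter"] = max(rtts_ms) - min(rtts_ms)
    return stats

def exceeded(stats, thresholds=DEFAULT_THRESHOLDS):
    """Names of the thresholds exceeded by this run (empty list means "up")."""
    return [thr for thr, limit in thresholds.items()
            if RESULT_FOR[thr] in stats and stats[RESULT_FOR[thr]] > limit]

good = summarize([12.4, 12.8, 13.0], sent_count=10)  # 70% loss, low latency
slow = summarize([120.0, 130.0], sent_count=10)      # 80% loss, avg 125 ms
print(exceeded(good))  # []
print(exceeded(slow))  # ['thr-avg']
```

Note that under this model a run with 70% packet loss still counts as "up" with the default thr-loss-percent of 85%, which matches the "UP is NOT down" reading discussed earlier in the thread.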

Edit: updated with the new info
Last edited by jaclaz on Sun May 04, 2025 10:53 pm, edited 1 time in total.
 
Amm0
Forum Guru
Forum Guru
Posts: 4862
Joined: Sun May 01, 2016 7:12 pm
Location: California

Re: Netwatch UP threshold

Sun May 04, 2025 7:21 pm

Amm0, with all due respect :) , you need to use more linear English if you want to explain something (or maybe you also got the Latvian virus that makes affected people use excessive periphrasis or double negations? :shock: ).
LOL. Perhaps. I'm waiting for Apple Intelligence to do proofreading in an edit box; you'd think with all the AI talk it should be trivial to check grammar ;) But you called me out... too lazy to proofread my posts and/or too lazy to cut-and-paste into a grammar checker.

And I do think the thr-* ones are available in the script, except they'd always return the same value since that's what's configured. Results are in rtt-* etc.
 
Josephny
Forum Guru
Forum Guru
Posts: 1281
Joined: Tue Sep 20, 2022 12:11 am
Location: New York, USA

Re: Netwatch UP threshold

Sun May 04, 2025 10:07 pm

I can't get rtt-loss-percent to work.

This in a script:
:local myrttlossper [get [find where host=$myhost] rtt-loss-percent]
kicks back an error
 
Amm0
Forum Guru
Forum Guru
Posts: 4862
Joined: Sun May 01, 2016 7:12 pm
Location: California

Re: Netwatch UP threshold

Sun May 04, 2025 10:14 pm

It's just $"loss-percent"; $"thr-loss-percent" defines where $"loss-percent" fails. These variables are already predefined in the down/up-script=, so they do not have to be declared or fetched with "get".
 
Josephny
Forum Guru
Forum Guru
Posts: 1281
Joined: Tue Sep 20, 2022 12:11 am
Location: New York, USA

Re: Netwatch UP threshold

Sun May 04, 2025 10:23 pm

It's just $"loss-percent"; $"thr-loss-percent" defines where $"loss-percent" fails. These variables are already predefined in the down/up-script=, so they do not have to be declared or fetched with "get".
That works.

Is there some terminology somewhere that would inform me of which variables are already defined? And what the type of variable that is predefined is called so when I read "loss-percent" somewhere I can know that this is a predefined variable?

Thanks!
 
Amm0
Forum Guru
Forum Guru
Posts: 4862
Joined: Sun May 01, 2016 7:12 pm
Location: California

Re: Netwatch UP threshold

Sun May 04, 2025 10:25 pm

The variable names match winbox, except the name is all lowercase, and any spaces become a dash (-):
netwatch-icmp-variables.png
As noted, both here and in the docs, if the name contains a - (or a space, as shown in winbox), then you need to use $"first-second" in any scripts.
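
The naming rule can be written down as a tiny Python helper, purely as an illustration of the rule stated here. The function name script_variable and the example labels ("Loss Percent", "Host") are hypothetical and not copied from WinBox.

```python
def script_variable(label):
    """Turn a displayed field label into the way it is referenced in a script.

    Rule as described in this post: lowercase the label, turn spaces into
    dashes, and quote the name as $"..." if it contains a dash.
    """
    name = label.strip().lower().replace(" ", "-")
    return f'$"{name}"' if "-" in name else f"${name}"

# Example labels are illustrative only.
for label in ("Loss Percent", "Host"):
    print(label, "->", script_variable(label))
```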

Is there some terminology somewhere that would inform me of which variables are already defined? And what the type of variable that is predefined is called so when I read "loss-percent" somewhere I can know that this is a predefined variable?
The issue is that it should be in the docs.
 
Josephny
Forum Guru
Forum Guru
Posts: 1281
Joined: Tue Sep 20, 2022 12:11 am
Location: New York, USA

Re: Netwatch UP threshold

Sun May 04, 2025 11:01 pm

I do indeed see:
Screenshot 2025-05-04 155716.png
But I would never have figured out with your help that the system has predefined variables matching what is displayed (and that is not considering the inclusion of a "-" or the need for $"<variable-name>")

How would one find this out? I suspect there are plenty of other areas of ROS where predefined variables exist.
 
jaclaz
Forum Guru
Forum Guru
Posts: 2873
Joined: Tue Oct 03, 2023 4:21 pm

Re: Netwatch UP threshold

Sun May 04, 2025 11:58 pm



But I would never have figured out with your help ...
Talking of double negations .. I presume there is an "out" that slipped from your fingers ... (I would try putting it after the "with"). :lol:
 
Josephny
Forum Guru
Forum Guru
Posts: 1281
Joined: Tue Sep 20, 2022 12:11 am
Location: New York, USA

Re: Netwatch UP threshold

Mon May 05, 2025 12:03 am



But I would never have figured out with your help ...
Talking of double negations .. I presume there is an "out" that slipped from your fingers ... (I would try putting it after the "with"). :lol:
Typo: “Without”
 
Josephny
Forum Guru
Forum Guru
Posts: 1281
Joined: Tue Sep 20, 2022 12:11 am
Location: New York, USA

Re: Netwatch UP threshold

Mon May 05, 2025 3:44 pm

Following up for posterity:

Since changing the netwatch parameters to the following:
/tool netwatch
add comment=Netwatch-192.168.2.2 disabled=no down-script=Netwatch-details host=192.168.2.2 http-codes="" interval=2m name=Netwatch-192.168.2.2 packet-count=400 packet-interval=200ms test-script="" thr-avg=\
    400ms thr-loss-percent=95% type=icmp up-script=Netwatch-details
I have not had the nightly triggers (i.e., host down).

My hope is that this confirms that these nightly "events" were effectively momentary drops.

(FYI: 192.168.2.2 is a host across a Wireguard tunnel.)
 
jaclaz
Forum Guru
Forum Guru
Posts: 2873
Joined: Tue Oct 03, 2023 4:21 pm

Re: Netwatch UP threshold

Mon May 05, 2025 8:41 pm

To me it still looks like hammering. :shock:

400 packets sent at 200 ms interval every 2 minutes?

The defaults end up being 6 runs per minute x 10 packets/run=60 packets/minute (which already seem to me a lot).

Your last settings come up as 1/2 run per minute x 400 packets/run=200 packets/minute.
 
Josephny
Forum Guru
Forum Guru
Posts: 1281
Joined: Tue Sep 20, 2022 12:11 am
Location: New York, USA

Re: Netwatch UP threshold

Tue May 06, 2025 3:25 am

To me it still looks like hammering. :shock:

400 packets sent at 200 ms interval every 2 minutes?

The defaults end up being 6 runs per minute x 10 packets/run=60 packets/minute (which already seem to me a lot).

Your last settings come up as 1/2 run per minute x 400 packets/run=200 packets/minute.
By "hammering" do you mean sending an overly large number of pings?

I don't know what "run" is. I'm unable to reverse-engineer it using your "6 runs per minute" reference.

Are you saying there is too much or too little connectivity checking?

400 packets at 200ms interval is 80 seconds of ping packets.

Maybe it is having a "keep-alive" type of effect?

Maybe I should change it to interval=5m? Does that mean it runs the test for 80 seconds, waits 5 minutes, then repeats? Or does it mean that every 5 minutes it starts an 80 second test?
 
Amm0
Forum Guru
Forum Guru
Posts: 4862
Joined: Sun May 01, 2016 7:12 pm
Location: California

Re: Netwatch UP threshold

Tue May 06, 2025 3:58 am

Converting @jaclaz calcs.... you're running 3⅓ pings every second...

Assuming the path is over fiber/cable with decent speed, I don't think it matters much: it's still not a lot of data even at 3⅓ pings/sec. But it does seem like a high frequency. If it's working, I'd leave it... or perhaps copy @jaclaz's spreadsheet to do the math and see the effects of the various settings.

The whole idea of the icmp check is that it allows fine-grained checking. At some level, if the goal is to NOT have to do math... the "simple" netwatch check is the way to go.

One other detail: IMO the main benefit of "icmp" netwatch is that you can monitor latency... which matters a lot if the network is merely congested, not down. Since ping latency going up is often a sign of congestion, I'd make sure you log all the rtt-* ones, especially rtt-jitter and rtt-avg, since those give you a clue when you're looking at logs after a failure (or after complaints that the "network is slow").
 
Josephny
Forum Guru
Forum Guru
Posts: 1281
Joined: Tue Sep 20, 2022 12:11 am
Location: New York, USA

Re: Netwatch UP threshold

Tue May 06, 2025 4:25 am

I like math, and I'm not bad at it.

But how do we get to 3⅓ pings/second? A ping every 200ms is 5 pings/second, right? With each netwatch test being 400 pings, that's 80 seconds of pings, right?

What am I doing wrong?

What I am hoping to have achieved by using the ICMP type of netwatch, and by tweaking these threshold values, is to eliminate the false-positive (i.e., down) results that would occur with a momentary flap (going down) of the connection.
 
Amm0
Forum Guru
Forum Guru
Posts: 4862
Joined: Sun May 01, 2016 7:12 pm
Location: California

Re: Netwatch UP threshold

Tue May 06, 2025 4:43 am

/tool netwatch
add comment=Netwatch-192.168.2.2 disabled=no down-script=Netwatch-details host=192.168.2.2 http-codes="" interval=2m name=Netwatch-192.168.2.2 packet-count=400 packet-interval=200ms test-script="" thr-avg=\
    400ms thr-loss-percent=95% type=icmp up-script=Netwatch-details
Math is simple:
packet-count / interval
so...
400 / 2m or using seconds... 400 / 120s
reduced is
3.33/s
(which is same as @jaclaz's 200/min)
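
The same arithmetic as a small Python sketch, in case anyone wants to plug in other values. The function name average_ping_rate is made up, and this is only the average over a whole interval, not what happens inside one run.

```python
def average_ping_rate(packet_count, interval_s):
    """Average packets per second over one whole netwatch interval."""
    return packet_count / interval_s

per_second = average_ping_rate(400, 120)  # packet-count=400, interval=2m
print(f"{per_second:.2f} pings/s, {per_second * 60:.0f} pings/min")

defaults = average_ping_rate(10, 10)      # packet-count=10, interval=10s
print(f"{defaults * 60:.0f} pings/min with the defaults")
```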
 
Josephny
Forum Guru
Forum Guru
Posts: 1281
Joined: Tue Sep 20, 2022 12:11 am
Location: New York, USA

Re: Netwatch UP threshold

Tue May 06, 2025 4:54 am

/tool netwatch
add comment=Netwatch-192.168.2.2 disabled=no down-script=Netwatch-details host=192.168.2.2 http-codes="" interval=2m name=Netwatch-192.168.2.2 packet-count=400 packet-interval=200ms test-script="" thr-avg=\
    400ms thr-loss-percent=95% type=icmp up-script=Netwatch-details
Math is simple:
packet-count / interval
so...
400 / 2m or using seconds... 400 / 120s
reduced is
3.33/s
(which is same as @jaclaz's 200/min)
What does "packet-interval" mean?

If it's packet-count/interval, then it will continuously (i.e., without a break) send packets every 300ms.
 
Amm0
Forum Guru
Forum Guru
Posts: 4862
Joined: Sun May 01, 2016 7:12 pm
Location: California

Re: Netwatch UP threshold

Tue May 06, 2025 6:08 am

Maybe @jaclaz can share the spreadsheet, that might make it easier to see what's going on. Or you can use /tool/torch or /ip/firewall/connections to see the effects.

packet-interval is how often each icmp packet is sent within the interval. So one packet goes out when netwatch starts, then after packet-interval another icmp packet is sent, and the process repeats until packet-count packets are sent or it reaches the next interval. That isn't the whole story, since there is also the timeout, and interval is the controlling setting... both are explained by the formula shown a few posts above.
 
jaclaz
Forum Guru
Forum Guru
Posts: 2873
Joined: Tue Oct 03, 2023 4:21 pm

Re: Netwatch UP threshold

Tue May 06, 2025 12:34 pm

As I read it, your last settings have these three relevant parameters:
1) interval=2m
2) packet-count=400
3) packet-interval=200ms

#1 means that the probe will run every two minutes (or 120 seconds)
#2 means that at each run 400 packets will be sent
#3 means that the packets will be sent one after the other at 200 ms intervals.

so, in 120s 400 packets will be sent in total, which on average means 400/120=3.33 per second or 400/2=200 per minute.
But what I presume actually happens is that once the probe is run, 400 packets are sent at 200 ms intervals, that will take 400x200=80,000 ms, or 80 seconds, then nothing is sent for the remaining 40 seconds, until the 120 seconds are reached and then the probe is run again.
So you have an initial burst of packets sent at 400/80=5 per second or 400/1.333=300 per minute, and then 40 seconds of inactivity (which include the timeout setting).
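
The burst-then-idle picture above can be reproduced with a few lines of Python. This is only a sketch of what I presume happens, not of how netwatch is implemented: it uses the same simplification as above (burst length = packet-count x packet-interval) and the function name burst_profile is made up.

```python
def burst_profile(packet_count, packet_interval_ms, interval_ms):
    """Return (burst length in ms, idle time in ms, pings/s during the burst)."""
    burst_ms = packet_count * packet_interval_ms  # simplification used above
    idle_ms = interval_ms - burst_ms              # includes the timeout
    burst_rate = 1000 / packet_interval_ms
    return burst_ms, idle_ms, burst_rate

# packet-count=400 packet-interval=200ms interval=2m
print(burst_profile(400, 200, 120_000))  # (80000, 40000, 5.0)

# defaults: packet-count=10 packet-interval=50ms interval=10s
print(burst_profile(10, 50, 10_000))     # (500, 9500, 20.0)
```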

The formula Amm0 posted, that I tried to replicate on the spreadsheet (that I am attaching) has a check column and a ratio one, if the ratio is over 100 % it means that the parameters are wrong, if it is lower the settings are possible.

I added a couple of columns to (hopefully) explain what is actually happening when the probe is run.

As I see it, the default settings are (without reason?) very different from what a "normal" ping check in Linux or Windows uses (much faster than their defaults, and actually faster than the minimum you can set them to), so I personally would try to set them in a way similar to the defaults in those OS's (I do not understand what the advantage is in pinging again and again in a very short time).
 
Josephny
Forum Guru
Forum Guru
Posts: 1281
Joined: Tue Sep 20, 2022 12:11 am
Location: New York, USA

Re: Netwatch UP threshold

Tue May 06, 2025 1:15 pm

@jaclaz

That is exactly as I understood it also: packets are sent every 200ms until 400 packets are sent. This process repeats every 2 minutes. And, a 95% packet loss threshold.

It was with this understanding of how the parameters interact that I chose them, with the goal of netwatch not triggering a DOWN situation for a momentary flap. My thinking is that if 30 seconds (for example) go by without connectivity (which would mean 150 lost packets, i.e. packets sent 200ms apart for 30 seconds, or 37.5% of the 400 sent, well below the 95% threshold) then netwatch would not declare a DOWN situation.

This is obviously not the appropriate setting in every situation (might not even be the best for my goal), but for me, I did not want to know about 30 second (for example) DOWN situations.

And it satisfies the ((packet-count * packet-interval) + timeout) requirement. I like this conception -- I had it as (packet-count * packet-interval) must be less than interval. That is, I did not add the timeout to (packet-count * packet-interval) because I still don't know what exactly triggers the start of a new interval: the sending of the first packet or the sending of the last.

@Amm0 suggests (i.e., states with deserved authority) that the interval countdown starts at the sending of the first packet (of an interval).

Thank you for sharing the xls file -- now I understand "OK" and "WRONG" (I would have used "CORRECT" and "INCORRECT," or "WORKS" and "DOESN'T-WORK," or "ERROR" and "NO-ERROR").
 
jaclaz
Forum Guru
Forum Guru
Posts: 2873
Joined: Tue Oct 03, 2023 4:21 pm

Re: Netwatch UP threshold

Tue May 06, 2025 4:18 pm

OK doesn't really mean "correct", it is more like "NOT WRONG". :wink:

I am still not convinced of the default (or of yours) settings, a machine that goes ping:
https://www.youtube.com/watch?v=VQPIdZvoV4g
should do so at regular intervals.

The graph would be something *like*:
|____|____|____|____|____|____|____|____|____ ...
the graph of the default settings would be more *like*:
||||||||||____________________________________
with an initial high frequency burst of pings, lasting 1/20 of the interval followed by nothing for the remaining 19/20.

Your settings would have a similar shape, but they seem to me "better" as your initial burst is a looooong one and covers 2/3 of the interval, leaving only 1/3 of inactivity.

Still, you run the probe every 120s.

If Amm0 is correct (and I believe he is) nothing happens to netwatch status during the time needed for the amount of packets to be sent + timeout (80-83 seconds), and then there are 37-40 seconds of nothingness before the next run.

If you had ONLY the lost packet percent as threshold (for simplicity) you would be monitoring the interface for 80 seconds, and if instead of 0 lost packets you find more than 95% of 400, i.e. 380 packets lost the netwatch probe will come out as "down".
Let us assume that packets are not lost "here and there" but are lost all in a same sequence or block.
Since you set an interval of 200 ms between packets, a "glitch" in the connection lasting 380x200= 76000 ms or 76 seconds will be needed to trigger the down status.
This - more or less - would be your actual "resolution", i.e. you could try with these settings to physically disconnect the cable for one whole minute or slightly more and then reconnect it, the netwatch ICMP probe should not be able to sense it.
As a matter of fact this is true if you disconnect the cable EXACTLY at the time the ICMP probe starts or within 4 seconds from its start, if you instead disconnect it exactly at the end of the run, you can keep it disconnected for up to 40+76=116 seconds or almost two minutes without netwatch reaching the threshold.

Now what would happen if you run (still with only the packet loss set as 95% and with the default timeout of 3s) with settings like:
interval=120s (same as you have now)
packet-count=80
packet-interval=1,000ms
To reach the threshold you need to lose 95%x80=76 packets, that will take 76 seconds to be sent, so you have more or less the same "resolution" of the above, and you have exactly the same 40s interval where nothing is sensed/happens.

And what if you change the settings to:
interval=120s (still same as you have now)
packet-count=115
packet-interval=1,000ms
thr-loss-percent=66%
To reach the threshold you need 66%x115=76 lost packets that will take the same 76 seconds to be sent, but you no longer have the variability between 76 and 116, because you are actually monitoring during all (almost all) of the 120s interval.
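
The three configurations compared above can be put side by side with a few lines of Python. This is only a sketch of the reasoning in this post, not of how netwatch is implemented: it assumes the loss comes as one contiguous block, ignores the timeout, takes the burst as packet-count x packet-interval, and the function name detection_window is made up.

```python
def detection_window(packet_count, packet_interval_ms, interval_ms, thr_loss_percent):
    """Return (lost packets needed, shortest outage in s that can trigger "down",
    longest outage in s that can still go unnoticed in the worst case)."""
    lost_needed = round(packet_count * thr_loss_percent / 100)
    min_outage_ms = lost_needed * packet_interval_ms
    # Time at the end of each interval during which nothing is sent.
    idle_ms = interval_ms - packet_count * packet_interval_ms
    worst_case_ms = min_outage_ms + idle_ms
    return lost_needed, min_outage_ms / 1000, worst_case_ms / 1000

configs = {
    "400 x 200ms, 95%": (400, 200, 120_000, 95),
    "80 x 1000ms, 95%": (80, 1000, 120_000, 95),
    "115 x 1000ms, 66%": (115, 1000, 120_000, 66),
}
for name, cfg in configs.items():
    print(name, detection_window(*cfg))
```

With these assumptions the first two configurations both give a 76 to 116 second window, while the third narrows it to 76 to 81 seconds.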

So, provided that the way I understood the mechanism is correct :? , it seems to me that:
interval should be as low as possible (with some common sense, the default 10s seems too little, I would settle for 60 seconds or 1 minute)
packet-count should be as low as possible
packet-interval should be as high as possible (1,000 ms or one second sounds good as it is the default on general OS and anyway no less than the minimum Windows allows - 500 ms - or Linux strongly suggests - 200 ms)

The result of the formula ((packet-count-1)*packet-interval)+timeout should be as close to 100% of the interval (Ratio in the spreadsheet) as reasonably possible. Taking into account some slack for the time actually taken to send the pings, staying below 95% or maybe 90% sounds conservative enough.
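
For those without the spreadsheet, the ratio can be computed with a short Python sketch. The function name busy_ratio is made up, and the 3s timeout is the default mentioned earlier in this thread, assumed here for all three examples.

```python
def busy_ratio(packet_count, packet_interval_ms, timeout_ms, interval_ms):
    """(((packet-count - 1) * packet-interval) + timeout) / interval."""
    busy_ms = (packet_count - 1) * packet_interval_ms + timeout_ms
    return busy_ms / interval_ms

examples = {
    "defaults (10 x 50ms, interval=10s)": (10, 50, 3000, 10_000),
    "400 x 200ms, interval=2m": (400, 200, 3000, 120_000),
    "115 x 1000ms, interval=2m": (115, 1000, 3000, 120_000),
}
for name, cfg in examples.items():
    ratio = busy_ratio(*cfg)
    verdict = "WRONG" if ratio > 1 else "OK"
    print(f"{name}: {ratio:.1%} {verdict}")
```

A ratio above 100% means the run cannot finish before the next interval starts, which is what the spreadsheet flags as WRONG.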
 
jaclaz
Forum Guru
Forum Guru
Posts: 2873
Joined: Tue Oct 03, 2023 4:21 pm

Re: Netwatch UP threshold

Tue May 06, 2025 5:33 pm

And now, for no apparent reason :shock: , a spreadsheet with a (nice?) table of values to play with.
I am attaching also a screenshot so that one can preview the thingy.
 
Amm0
Forum Guru
Forum Guru
Posts: 4862
Joined: Sun May 01, 2016 7:12 pm
Location: California

Re: Netwatch UP threshold

Wed May 07, 2025 1:33 am

So, provided that the way I understood the mechanism is correct :? , it seems to me that:
interval should be as low as possible (with some common sense, the default 10s seems too little, I would settle for 60 seconds or 1 minute)
Well, I'd say that setting interval= is more about how often you want any up/down/test scripts to run. At the end of each interval, those scripts may run (and test will run). There is no early return... so at each interval, it runs until sent-count == packet-count THEN runs test script (if any) plus the up/down if status changed from last interval result.

On the value of interval=...
- In @Josephny case, it's just logging... so a longer interval= may be okay/preferable to avoid log clutter.
- Conversely, if the up/down scripts did something like [en/dis]able a /ip/route, you may want that 10s or 30s window
(e.g. in a multiwan failover case, the interval affects the "recovery time", so 1m might be too long to wait for failover to trigger for some)
- MikroTik's 10s default is likely good, since there are NO scripts run by default... But the interval affects how often WinBox/WebFig/app "update" the UI to show a new status. No one wants to wait 1m to see the status update in the UI, and that's all it does without scripts. (Now with scripts, you've got some decisions to make on interval=)
 
jaclaz
Forum Guru
Forum Guru
Posts: 2873
Joined: Tue Oct 03, 2023 4:21 pm

Re: Netwatch UP threshold

Wed May 07, 2025 11:08 am

Yep :) , not so casually the main setting to experiment with in the table of the provided spreadsheet is "interval".
I imagine the values for interval that are actually useful as a Bell curve, with values that range from 0 :shock: to 120 or so, hence the 60 seems to me a good value to start with.

Realistically the default interval=10 is around the minimum that can be set, but in the right scenario it is just fine.

What I really do not agree on (of the default settings) is only the too small packet-interval of 50 ms.

When the two most common OS's around use normally 1000 and at their minimum 200-500, using 50 seems simply "out of range", it may have its uses but it shouldn't (IMHO) be a default.
 
Josephny
Forum Guru
Forum Guru
Posts: 1281
Joined: Tue Sep 20, 2022 12:11 am
Location: New York, USA

Re: Netwatch UP threshold

Wed May 07, 2025 11:59 am

So is this what we all believe happens?

Using an example of:

interval: 10s
timeout: 500ms
packet-interval: 100ms
packet-count: 20

time 0: Netwatch starts the process of sending "packet-count" number of packets (20) at "packet-interval" (100ms)

time 2000ms: Netwatch stops sending packets

time 2500ms: Only at this time (i.e., after sending 20 packets 100ms apart and waiting 500ms for the last packet's timeout) does netwatch evaluate the results (UP or DOWN) using the thr-* settings and then run any script.

time 10s: Netwatch starts process again

Is this accurate?
 
Josephny
Forum Guru
Forum Guru
Posts: 1281
Joined: Tue Sep 20, 2022 12:11 am
Location: New York, USA

Re: Netwatch UP threshold

Wed May 07, 2025 12:14 pm

So, provided that the way I understood the mechanism is correct :? , it seems to me that:
interval should be as low as possible (with some common sense, the default 10s seems too little, I would settle for 60 seconds or 1 minute)
Well, I'd say that setting interval= is more about how often you want any up/down/test scripts to run. At the end of each interval, those scripts may run (and test will run). There is no early return... so at each interval, it runs until sent-count == packet-count THEN runs test script (if any) plus the up/down if status changed from last interval result.

On the value of interval=...
- In @Josephny case, it's just logging... so a longer interval= may be okay/preferable to avoid log clutter.
- Conversely, if the up/down scripts did something like [en/dis]able a /ip/route, you may want that 10s or 30s window
(e.g. in a multiwan failover case, the interval affects the "recovery time", so 1m might be too long to wait for failover to trigger for some)
- MikroTik's 10s default is likely good, since there are NO scripts run by default... But the interval affects how often WinBox/WebFig/app "update" the UI to show a new status. No one wants to wait 1m to see the status update in the UI, and that's all it does without scripts. (Now with scripts, you've got some decisions to make on interval=)
I think the optimal/ideal/appropriate settings will vary not only with whether there exist UP/DOWN/TEST scripts or logging, but also with the goal of using netwatch. I came to my unusual settings and this exploration in general because the defaults were providing, for my purposes, false positives (DOWNs). That is, the chosen combination of settings will be affected by the sensitivity level one desires.

In my case, I am perfectly fine if netwatch considers a connection to be UP even if it goes down for 60 seconds. Others would consider a connection to be DOWN after no connectivity for 1 second.

With even finer granularity, I would set 2 netwatches: one for 60 seconds and one for 1 second, and execute a script for the 60 second DOWN and a log (or different script) for the 1 second (or some such combination/setting).
 
jaclaz
Forum Guru
Forum Guru
Posts: 2873
Joined: Tue Oct 03, 2023 4:21 pm

Re: Netwatch UP threshold

Wed May 07, 2025 12:29 pm

Is this accurate?
Yes, it seems accurate to me, in the sense of an accurate representation of how we have understood it works, not necessarily accurate in an absolute way.

What is missing (very likely irrelevant in practice, still ...) is the actual time the ping takes: while it is likely a very, very small time, it cannot be zero.
I.e. in your example packet 0 is sent at 0 ms, but is packet 1 sent at time 100 ms or at time (say) 101 ms? And does this difference depend on speed of connection?
0. 0 ms
1. 100 ms (or 101 ms?)
2. 200 ms (or 202 ms?)
...
19. 1900 ms (or 1919 ms?)

Then how does the time needed to execute the connected down script (if triggered) affect the behaviour (if it does affect it)?

Let's say that the down script is an evolution of the one posted before :wink: :
#DO NOTHING
delay 15s
#CONTINUE DOING NOTHING
do the 15s come into play?

I.e. does the "next run" actually always start at time +10s from previous start or when a script is triggered it waits for the actions triggered by the previous run to be completed?
 
Josephny
Forum Guru
Forum Guru
Posts: 1281
Joined: Tue Sep 20, 2022 12:11 am
Location: New York, USA

Re: Netwatch UP threshold

Wed May 07, 2025 12:38 pm

Is this accurate?
Yes, it seems accurate to me, in the sense of an accurate representation of how we have understood it works, not necessarily accurate in an absolute way.

What is missing (very likely irrelevant in practice, still ...) is the actual time the ping takes: while it is likely a very, very small time, it cannot be zero.
I.e. in your example packet 0 is sent at 0 ms, but is packet 1 sent at time 100 ms or at time (say) 101 ms? And does this difference depend on speed of connection?
0. 0 ms
1. 100 ms (or 101 ms?)
2. 200 ms (or 202 ms?)
...
19. 1900 ms (or 1919 ms?)

Then how does the time needed to execute the connected down script (if triggered) affect the behaviour (if it does affect it)?

Let's say that the down script is an evolution of the one posted before :wink: :
#DO NOTHING
delay 15s
#CONTINUE DOING NOTHING
do the 15s come into play?

I.e. does the "next run" actually always start at time +10s from previous start or when a script is triggered it waits for the actions triggered by the previous run to be completed?
Excellent points to consider -- I suspect there are other stages at which processing or evaluating or other things might take time and beg the question of whether the timer is running or suspended for these unintended and unspecified delays.
 
rextended
Forum Guru
Forum Guru
Posts: 13078
Joined: Tue Feb 25, 2014 12:49 pm
Location: Italy

Re: Netwatch UP threshold

Thu May 08, 2025 2:11 pm

///
 
anav
Forum Guru
Forum Guru
Posts: 23602
Joined: Sun Feb 18, 2018 11:28 pm
Location: Nova Scotia, Canada

Re: Netwatch UP threshold

Thu May 08, 2025 5:01 pm

/// I will stick with simple ///