What are the default thresholds for a netwatch script to mark a check as UP? I have found forum responses where it seems there used to be netwatch attributes such as success-count but these do not seem to be available in rOS 7.
I have my netwatch “up-script” set to:
:local details ( "sent-count=" . ($"sent-count") . " response-count=" . ($"response-count"))
:log info "PRIMARY INET UP $details"
:if (($"response-count") = ($"sent-count")) do={
/ip route set [find comment="primary_route"] distance=1
}
but realised that if the response-count was less than sent-count (say, if the link is not yet reliable) the primary_route distance will never be set, yet netwatch marks the host as UP and may flap DOWN/UP. I need a way to set my primary_route only when the UP condition is reliable, not after a single successful ICMP result.
I can’t find documentation on how RouterOS determines the netwatch UP condition.
I thought it may just be the reverse of the DOWN threshold, i.e. in my case (I’m using the netwatch ICMP probe type).
The netwatch ICMP probe is (IMHO) complex and mis- or under-documented.
However (from what I understand) the logic seems to be like a double negation, UP is “NOT down”.
But then I don’t understand why you are comparing response-count against sent-count.
If they are equal, it is the same as thr-loss-count=0.
With appropriate values for packet-count and thr-loss-count in the probe settings there should be no need for a comparison within the UP (or DOWN) script: the condition is already checked (and, when met, triggers the up or down state) by netwatch itself.
The problems seem to me to be caused by the other probe settings, which are seemingly very difficult to “tune” correctly.
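To illustrate the point (a sketch only; the host, comment and route distances are placeholders, not from this thread): if netwatch itself requires zero loss via thr-loss-count=0, the up-script can act unconditionally instead of re-comparing counts:

```
# Untested sketch: netwatch only fires "up" when all 10 pings succeed
/tool netwatch add host=1.1.1.1 type=icmp interval=10s \
    packet-count=10 packet-interval=50ms thr-loss-count=0 \
    up-script="/ip route set [find comment=\"primary_route\"] distance=1" \
    down-script="/ip route set [find comment=\"primary_route\"] distance=10"
```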
Thanks - I did look at his script but I still can’t work out how netwatch transitions to the “up” state. I ended up writing my own failover script to run from the scheduler. Will post in the forums for discussion.
For the ICMP check… what’s likely going on is that all values have defaults (which the docs DO show) — so if something is unset/empty/blank, a default is used, and that default still counts towards the UP/DOWN decision.
So for an “icmp” check, ALL of the measured values MUST be below either the thr-* thresholds (like thr-rtt) you set OR, importantly, the default thr-* values. And often it’s a “hidden default” that causes an unexpected down.
The UP/DOWN for an icmp check is determined after <packet-count> packets have been sent, one every <packet-interval>. By default, that means the test runs for 500ms (10 × 50ms), and that’s when netwatch compares ALL of the thr-* stuff like thr-loss-percent etc. against the collected pings. The check is run again at the next <interval>. Also, <packet-count> * <packet-interval> should be below BOTH the “global” <interval> and <timeout> that apply to all netwatch types.
To troubleshoot, it may be easier to look at a failure and compare the measured values against the default thr-* values shown in the docs. If any measured value is higher than its default, explicitly set that thr-* to a higher value in your netwatch config.
There is also the “simple” check, which avoids all the thr-* values in determining the result. The idea of the “icmp” check is that you do want to monitor (and fail) if ping was “slow”, not just unresponsive. But what “slow” means varies, hence all the options to configure the icmp check; in a lot of cases that means you need to define them yourself. (And to re-iterate, the default values are used to determine success/failure, so an empty/unset value is still part of the ICMP check.)
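For contrast (illustrative values only, assuming a reachable test host): a “simple” probe just pings, while an “icmp” probe can also fail on latency:

```
# "simple": up/down based on reachability alone, no thr-* involved
/tool netwatch add host=1.1.1.1 type=simple interval=10s

# "icmp": also goes down if the average RTT exceeds thr-avg
/tool netwatch add host=1.1.1.1 type=icmp interval=10s thr-avg=200ms
```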
The point about:
"< packet-count > * < packet-interval > should be below BOTH the “global” < interval > and < timeout > that apply to all netwatch types. "
is interesting, never actually thought about it.
The defaults are:
General properties:
interval=10s
timeout=3s
ICMP specific probe options:
packet-interval=50ms
packet-count=10
If we use a fictional “my-packet-time” defined as:
my-packet-time=packet-interval*packet-count
the default is
my-packet-time=0.05 * 10=0.50s
So BOTH:
my-packet-time<interval = 0.50<10
and
my-packet-time<timeout = 0.50<3
are true.
The default packet-interval of 50ms does at first sight sound a lot like hammering[1], so it is understandable that ilium007 increased it to 500ms.
But that, combined with the increased packet count from 10 to 15, makes my-packet-time=0.5 * 15=7.5s, which is still less than interval (10s) but larger than timeout (3s).
[1] the default ping timeout in Windows is 4 seconds and in Linux the -i default parameter is 1 second
I dunno actually, hence the hedged advice (“should”)… the point is to avoid having to know. Again, the docs aren’t clear.
Anyway, the fact that the ICMP packet-interval= is very different from interval= is why I mentioned it (it’s potentially confusing). And I guess the minor point was that there may be some value in aligning them, to “spread” the pings across the interval= by adjusting the packet-interval=. Although multiple interval-ish things do require some understanding… i.e. it’s nice that you can really customize everything, but it’s not easy to know what you’re actually checking.
Yep, but until someone manages to “decrypt” the documentation, translating it from Mikrotikish to plain English and adding some commentaries, all this flexibility is counterproductive.
BTW (only as a side-side note) the
thr-loss-percent (Default: 85.0%)
is curious, in the sense that if one leaves the packet count at the default 10, it should mean that losing 8 packets (80%) is still up, but losing 9 (90%) is down; so is the effective threshold actually 85%, or is it 80% (or 90%)?
Are values between 81% and 89% all treated the same, or is the range 80% to 89% (or 81% to 90%)?
I.e., is the comparison against the threshold made with > or with >=?
One has to imagine netwatch is implemented in C, so internally netwatch can do floating point… It remains an open question whether thr-loss-percent is inclusive or not (i.e. == or >=).
However, @ilium007 is correct that user scripting does not do floating point… so computing thr-loss-percent= in a script requires the above trick.
If in that formula 91% comes out as 10, it is not really rounding, it is something else: with that formula, 81% also gives 9.
Mathematical rounding would be a formula where 80→8, 84→8 and 85→9; that formula instead behaves as if in a spreadsheet you used INT():
=INT(((10*91)+99)/100) → 10
so the value where something changes is not 85%, it is 81% (and 81% up to 90% all come out as 9).
With integer limitations, the easier formula would be:
(80+5)/10=8
(81+5)/10=8
(82+5)/10=8
(83+5)/10=8
(84+5)/10=8
(85+5)/10=9
(86+5)/10=9
(87+5)/10=9
(88+5)/10=9
(89+5)/10=9
(90+5)/10=9
(91+5)/10=9
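Both formulas can be checked directly in a RouterOS terminal, since :put with integer operands uses the same truncating integer division:

```
# spreadsheet-style INT() formula, for packet-count=10
:put (((10 * 80) + 99) / 100)
# -> 8
:put (((10 * 81) + 99) / 100)
# -> 9
:put (((10 * 91) + 99) / 100)
# -> 10

# simpler integer formula for packet-count=10
:put ((84 + 5) / 10)
# -> 8
:put ((85 + 5) / 10)
# -> 9
```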
Do I understand this correctly: the following config will check the host once every minute by sending 100 packets with a 100ms interval between packets, and return a ‘HOST-IS-DOWN’ result if the average response time of those 100 packets is greater than 400ms AND the total packet loss rate is 90% or above (i.e., 90 or more packets fail to elicit a response from the host)?
The goal for me here, BTW, is to not wake up to a screen full of notifications about down hosts when the ISP does some 1 minute flapping in the middle of the night.
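A netwatch entry matching that description would look roughly like this (a reconstruction from the prose above; the host and log text are placeholders):

```
# interval=1m, 100 packets 100ms apart, down on avg RTT > 400ms or >= 90% loss
/tool netwatch add host=8.8.8.8 type=icmp interval=1m \
    packet-count=100 packet-interval=100ms \
    thr-avg=400ms thr-loss-percent=90 \
    down-script=":log warning \"HOST-IS-DOWN\""
```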
That looks right to me. As I said, I’d compare against the “Status” values to make sure all the other RTT figures are well within the defaults. If not, or even close, explicitly set the various thr-* higher.
If you temporarily enable topics=netwatch in /system/logging, it will log both the measured values and the “threshold” amounts, so you can see how close you are. You’d likely want to disable the netwatch topic in logging after testing (or use a different log target than the default “memory”), as it could fill your log. But the logging will show pretty clearly how “close” you are to the limits.
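A sketch of that (the comment tag is just a convenience for finding the rule again, not required):

```
# temporarily log netwatch internals to memory
/system logging add topics=netwatch action=memory comment="netwatch-debug"
# ... watch /log print while the probe runs ...
# disable (or remove) after testing so it doesn't fill the log
/system logging disable [find comment="netwatch-debug"]
```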
WinBox4 with 7.19rc1 (perhaps earlier, didn’t check) will give you a warning that it adjusts interval= to match. They give the formula for that.
@Amm0
I guess that writing that formula in the documentation was just too much work?
Seriously, which parameters lead to 19s620ms?
@Josepny
Using variables in scripts with the same names as ROS parameters/values/commands/etc. is usually not a good idea.
point #16 here: http://forum.mikrotik.com/t/gp-csa-for-mikrotik-devices/182176/1
Use (say) a my_ prefix, i.e. my_status, my_comment, etc.