Netwatch on ROS7 False Down?

RandyRiver88 · November 16, 2022, 2:42am

Hi,

I am having trouble with the new Netwatch. I am trying to monitor whether my WireGuard VPN is up and passing traffic.

I created a simple static route, for example 8.8.8.8/32 through the WireGuard gateway, and a second rule for 8.8.8.8 as a blackhole.

I am using an ICMP Netwatch check.

Configured as:
Host: 8.8.8.8
Interval: 00:01:00
Packet Interval: 1.00
Packet Count: 10
Thr Loss Count: 10

Netwatch keeps randomly reporting as DOWN.

When I check the Status for the last check I am seeing:

Sent Count: 10
Response Count: 10
Loss Count: 0
Loss Percent: 0%

Any idea what could be going on here? I am running 7.7 b6, but this issue has been ongoing for the last few releases; I would go as far as to say ever since the new Netwatch was implemented.

Am I doing something wrong here or could this actually be a bug?

I have also tested this with 1.1.1.1, 9.9.9.9, 8.8.8.8 and another public server, all with the same intermittent results, across 3 different routers/sites.

Here is an example with 9.9.9.9 on a 5 min interval:

Status: down
Since: Nov/15/2022 21:44:24
Done Tests: 274
Failed Tests: 23
Sent Count: 60
Response Count: 60
Loss Count: 0
Loss Percent: 0.0 %
RTT Avg: 123.666 ms
RTT Min: 82.502 ms
RTT Max: 160.734 ms
RTT Jitter: 78.232 ms
RTT Stdev: 17.151 ms

Status: up
Since: Nov/15/2022 21:49:24
Done Tests: 275
Failed Tests: 23
Sent Count: 60
Response Count: 60
Loss Count: 0
Loss Percent: 0.0 %
RTT Avg: 99.896 ms
RTT Min: 61.099 ms
RTT Max: 152.939 ms
RTT Jitter: 91.840 ms
RTT Stdev: 21.360 ms

Guntis · November 16, 2022, 6:31am

It’s due to the thr-rtt-avg value (see https://help.mikrotik.com/docs/display/ROS/Netwatch ); you can increase it above 100 to avoid this.
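
As a rough illustration of why a test with 0% loss can still be marked down, here is a minimal Python sketch of the comparison. The function name `exceeded_thresholds` and the dictionary layout are made up for this example; only the 100 ms figure for thr-rtt-avg comes from this thread, and the real Netwatch evaluation may differ in detail.

```python
# Hypothetical sketch, not RouterOS code: a probe result is compared against
# per-metric thresholds, and any exceeded threshold fails the whole test.

def exceeded_thresholds(stats, thresholds):
    """Return the names of thresholds whose measured value is above the limit."""
    return [name for name, limit in thresholds.items() if stats[name] > limit]


# RTT averages (ms) taken from the two 9.9.9.9 results posted above.
down_result = {"rtt-avg": 123.666}
up_result = {"rtt-avg": 99.896}

# Assumed: thr-rtt-avg is 100 ms when left unset (as stated in this thread).
default_thresholds = {"rtt-avg": 100.0}
raised_thresholds = {"rtt-avg": 200.0}

print(exceeded_thresholds(down_result, default_thresholds))  # ['rtt-avg']
print(exceeded_thresholds(up_result, default_thresholds))    # []
print(exceeded_thresholds(down_result, raised_thresholds))   # []
```

Under that assumption, the 123.666 ms result fails on RTT alone even with no packets lost, while the 99.896 ms result just passes.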

RandyRiver88 · November 16, 2022, 1:43pm

Thank you.

However, this is poor design.

In theory, if the field is ‘not enabled’ and has ‘no value set’ then it shouldn’t be used when determining if the test result is UP/DOWN.

If this field is going to be used regardless when determining an UP/DOWN result, then it should be a mandatory field with 100ms as the default.

RandyRiver88 · November 16, 2022, 2:16pm

This still doesn’t work.

After changing it to 200, it doesn’t report a down when losing all packets.

Status: up
Since: Nov/16/2022 09:03:01
Done Tests: 26
Failed Tests: 0
Sent Count: 10
Response Count: 0
Loss Count: 10
Loss Percent: 100.0 %
RTT Avg: 0.000 ms
RTT Min: -0.001 ms
RTT Max: 0.000 ms
RTT Jitter: 0.000 ms
RTT Stdev: 0.000 ms

Edit: After switching from Thr Loss Percent (100%) to checking Loss Count (10) instead, it seems to work now. It took 10 years to get a decent Netwatch update, but this seems like a poor implementation at best.

Znevna · November 16, 2022, 2:34pm

It’s not rocket science to understand the few settings presented with the default values also mentioned in the manual:
https://help.mikrotik.com/docs/display/ROS/Netwatch#Netwatch-ICMPprobeoptions

challado · August 11, 2023, 1:18pm

I disagree. Yes, it is rocket science, because every vendor has its own understanding of the world, nature, global warming, etc.
The manual simply says “thr-loss-percent”. But a threshold over WHAT? Is it the average of all tests? What is it? And thr-rtt-avg? What is that?
And, obviously, this information is omitted from the status. Only basic information is displayed, and you can’t do anything because you can’t guess the values.

pe1chl · August 11, 2023, 1:52pm

I agree it is confusing and incomplete. Like so many other chapters in the HELP system, it immediately dives into explaining properties, without spending even a single paragraph on a global description of how things fit together.

At first when I saw the new modes in Netwatch I believed that I could have the normal way of ping checking, but with some threshold on failures.
E.g. I want to do a single ping each minute, but only after 3 of them have failed I want to go into “down” state.
But that does not seem to be what the new “icmp” type can do, except after very careful tweaking of the config. It would send 3 pings in quick succession every minute and alert me when they do not reply, but it would still be difficult to get a way of monitoring that can, e.g., tolerate a reboot of a remote system that comes back within 3 minutes.

And if it can do it, I will need to craft the proper settings myself.

Amm0 · August 11, 2023, 3:19pm

Docs could be better – totally agree some explanatory text and examples are missing.

But the real issue is that all the ICMP params have some default value that is used if not set. I’m okay with defaults BUT the netwatch ICMP ones are too restrictive.

e.g. these aggressive defaults are what cause the “false down” – and since you may not have set the failing one, it’s not obvious at all… And the defaults are NOT very visible in the UI either, so it’s very hard to know what’s failing.

IMO you’d rather “tighten down” settings from a more forgiving default … rather than “guess how high” something needs to be to not fail…

I like the new netwatch concept, but agree it needs some work/better docs… And using “type=simple” is also “icmp”/ping and avoids this issue if one doesn’t care to monitor the ping metadata…

pe1chl · August 11, 2023, 4:07pm

Yes, type=simple is the classic Netwatch type: a ping sent every [interval] seconds, no reply → down event.
I would have liked a simple extension that allows “N missed pings” before it declares a down event.
What we got was more sophisticated than that, but difficult to tame.
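
To make the wished-for behaviour concrete, here is a small Python sketch of an “N missed pings” rule. It only illustrates the idea and is not something Netwatch is shown to offer in this thread; the class name `MissCounter` and the `max_missed` parameter are invented for the example.

```python
# Hypothetical sketch of "declare down only after N consecutive missed pings".
# Not a RouterOS feature; all names are invented for illustration.

class MissCounter:
    def __init__(self, max_missed=3):
        self.max_missed = max_missed  # consecutive misses needed to go down
        self.missed = 0               # current run of consecutive misses
        self.state = "up"

    def record(self, got_reply):
        """Feed one ping result; return 'up' or 'down' after applying the rule."""
        if got_reply:
            # Any reply resets the run and brings the state back up.
            self.missed = 0
            self.state = "up"
        else:
            self.missed += 1
            if self.missed >= self.max_missed:
                self.state = "down"
        return self.state


# One ping per interval: two misses are tolerated, the third triggers "down".
monitor = MissCounter(max_missed=3)
results = [True, False, False, True, False, False, False, True]
print([monitor.record(r) for r in results])
# ['up', 'up', 'up', 'up', 'up', 'up', 'down', 'up']
```

With one ping per minute and `max_missed=3`, a remote reboot that recovers within two intervals would never produce a down event.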

challado · August 19, 2023, 12:33pm

I would expect that “if the value is not set, I WON’T use this metric”, but MikroTik gives it a DEFAULT value, and all the metrics are combined only with an AND operator, not with OR. It is simply inefficient and badly designed.

anav · August 19, 2023, 2:45pm

Is this Netwatch ping meant to replace doing so in IP Routes?
In other words, to ensure the internet is actually reachable through the ISP?

challado · August 21, 2023, 10:34pm

Anav, I don’t understand your question very well, but my English is poor.
In reality, check-gateway=ping is a good choice for detecting problems up to your router, but if the problem is BEYOND the router, it never detects that the link is down. We have had several problems with that here, with two links for redundancy. The internet goes down but the gateway is still active, because the problem is AFTER the gateway, so the redundancy is useless in this case. I use netwatch to ping some destinations (static routes on link 1 or link 2) to detect whether LINK1 is DOWN or LINK2 (the redundant one) is down too. But some links are STARLINK, where ping shows abrupt latency variations but NO packet loss (the link is active, with poor performance, but NOT OFFLINE).

Amm0 · August 21, 2023, 10:47pm

I put in a feature request for the /ip/route’s check-gateway= to support “linking” to one (or more) netwatch entries (and any allowed things like http and icmp with jitter/etc stats). http://forum.mikrotik.com/t/feature-request-link-check-gateway-in-routes-to-a-netwatch-item-s/163771/1

But today check-gateway=ping checks just the next-hop router, and no IP to check is allowed (other than by using recursive routes, which you’re familiar with).