Netwatch tool

airvb · February 6, 2025, 8:08am

Hello,
To monitor my home server, I have set up Netwatch with the https-get type.
Because I did not understand why the alarm was being triggered, I wrote a script (see below) that runs whenever the alarm is raised.

The alarm is triggered often, though not at regular intervals, and each time the tcpConnectTime value collected with this command ([tonsec [/tool netwatch get terra tcp-connect-time ]]) turns out to be zero.
Less than a minute later, the alarm clears with a normal tcpConnectTime value.

What I don’t understand is that this parameter is not supposed to be a trigger criterion in Netwatch for the https-get type. See the doc: https://help.mikrotik.com/docs/spaces/ROS/pages/8323208/Netwatch#Netwatch-HTTP-GET%2FHTTPS-GETprobepass%2Ffailcriteria

Any ideas?

Netwatch config:

/tool netwatch export where host=192.168.88.200
# 2025-02-06 08:54:25 by RouterOS 7.17.1
# model = RBD52G-5HacD2HnD
# serial number = B4xxxxxxx
/tool netwatch
add comment="Verification Terra repond aux requetes https" disabled=no down-script=terra_offline host=192.168.88.200 http-codes="" interval=1m name=terra port=\
    443 src-address=192.168.88.1 startup-delay=0s test-script="\
    \n" thr-http-time=200ms thr-tcp-conn-time=20s type=https-get up-script=terra_online

Script terra_offline:

# Extract Netwatch stats for the "terra" test
# Parse the stats to extract the required variables
:local tcpConnectTime ([tonsec [/tool netwatch get terra tcp-connect-time ]] )
:local httpStatusCode ([/tool netwatch get terra http-status-code ])
:local httpRespTime ( [/tool netwatch get terra http-resp-time ])

# Log the extracted variables for debugging
:log warning ("TERRA OFF line")
:log warning ("TCP Connect Time: $tcpConnectTime")
:log warning ("HTTP Status Code: $httpStatusCode")
:log warning ("HTTP Response Time ms: $httpRespTime")

# Initialize the retry counter
:local retryCount 0

# Check if tcpConnectTime is zero and retry if necessary
:while (($tcpConnectTime = 0) and ($retryCount < 2)) do={
    :delay 10s
    :set retryCount ($retryCount + 1)
    # Use :set, not :local, so the outer variable is updated rather than shadowed
    :set tcpConnectTime ([tonsec [/tool netwatch get terra tcp-connect-time ]] )
    :log warning ("Retrying... TCP Connect Time: $tcpConnectTime, Attempt: $retryCount")

    # Check again if tcpConnectTime is positive after retry
    :if ($tcpConnectTime > 0) do={
        :log warning ("tcpConnectTime is positive after retry. Exiting the script.")
        :return
    }
}
# tcpConnectTime confirmed to be zero

# Split the time string (hh:mm:ss.uuuuuu) into components
:local hours ([ :pick $httpRespTime  0 2 ])
:local minutes ([ :pick $httpRespTime  3 5 ])
:local seconds ([ :pick $httpRespTime  6 8 ])
# The fraction starts at index 9; its first three digits are the milliseconds
:local milliseconds ([ :tonum [ :pick $httpRespTime 9 12 ] ])

# Convert each component to milliseconds
:local hoursInMs ($hours * 3600000)
:local minutesInMs ($minutes * 60000)
:local secondsInMs ($seconds * 1000)

# Sum all components to get the total milliseconds
:local totalMs ($hoursInMs + $minutesInMs + $secondsInMs + $milliseconds)

# Construct the alert message
:local alertMessage ("Terra is OFF line from MK. TCP Connect Time: $tcpConnectTime, HTTP Status Code: $httpStatusCode, HTTP Response Time (ms): $totalMs")

# Send the alert using /tool fetch
/tool fetch url="https://ntfy.sh/xxxxxxx" mode=https http-method=post http-data=$alertMessage http-header-field="Priority:5, Tags:sos, Title:### ALERTE ###"

Log when the alarm is triggered:

2025-02-06T08:46:57.019327+01:00 mikrotik script,warning TERRA OFF line
2025-02-06T08:46:57.019791+01:00 mikrotik script,warning TCP Connect Time: 0
2025-02-06T08:46:57.020134+01:00 mikrotik script,warning HTTP Status Code: 200
2025-02-06T08:46:57.020581+01:00 mikrotik script,warning HTTP Response Time ms: 00:00:00.003371
2025-02-06T08:47:07.022830+01:00 mikrotik script,warning Retrying... TCP Connect Time: 0, Attempt: 1
2025-02-06T08:47:17.025749+01:00 mikrotik script,warning Retrying... TCP Connect Time: 0, Attempt: 2
2025-02-06T08:47:56.111672+01:00 mikrotik script,warning TERRA ONline
2025-02-06T08:47:56.112094+01:00 mikrotik script,warning TCP Connect Time: 544000
2025-02-06T08:47:56.112579+01:00 mikrotik script,warning HTTP Status Code: 200
2025-02-06T08:47:56.112944+01:00 mikrotik script,warning HTTP Response Time ms: 00:00:00.003542
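
A side note on the string slicing in the script above: since the script already calls tonsec on tcp-connect-time, the same conversion could, as a sketch, be applied to http-resp-time instead of picking characters out of the time string. This assumes tonsec accepts the value returned by http-resp-time the same way it accepts tcp-connect-time; the variable names httpRespNs and httpRespMs are illustrative.

```routeros
# Sketch: convert http-resp-time to whole milliseconds via nanoseconds.
# Assumes :tonsec accepts this value as it does tcp-connect-time.
:local httpRespTime [/tool netwatch get terra http-resp-time]
:local httpRespNs [:tonsec $httpRespTime]
# 1 ms = 1 000 000 ns; integer division drops the sub-millisecond part
:local httpRespMs ($httpRespNs / 1000000)
:log warning ("HTTP Response Time ms: $httpRespMs")
```

Under that assumption, the logged value 00:00:00.003371 would come out as 3 ms rather than being read as several thousand.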

airvb · February 9, 2025, 5:44pm

More information:
After enabling debug logging for Netwatch,
this is what appears when the alarm is raised:

2025-02-09T18:29:53.702975+01:00 mikrotik netwatch,info event down [ terra ]
2025-02-09T18:29:53.702975+01:00 mikrotik netwatch,debug [ terra ] [FAIL] TCP handshake
2025-02-09T18:29:53.702975+01:00 mikrotik netwatch,debug    [ OK ] http-resp-time: 3.092ms [ <= 200.000ms ] 
2025-02-09T18:29:53.703181+01:00 mikrotik netwatch,debug http-status-code: 
2025-02-09T18:29:53.703181+01:00 mikrotik netwatch,debug [ OK ]  [ 100 <= ] 200 [ <= 299 ] 
2025-02-09T18:29:54.640063+01:00 mikrotik netwatch,debug [ terra ] [FAIL] TCP handshake
2025-02-09T18:29:54.640063+01:00 mikrotik netwatch,debug    [ OK ] http-resp-time: 2.788ms [ <= 200.000ms ] 
2025-02-09T18:29:54.640211+01:00 mikrotik netwatch,debug http-status-code: 
2025-02-09T18:29:54.640211+01:00 mikrotik netwatch,debug [ OK ]  [ 100 <= ] 200 [ <= 299 ] 

2025-02-09T18:30:52.819479+01:00 mikrotik netwatch,info event up [ terra ]
2025-02-09T18:30:52.819479+01:00 mikrotik netwatch,debug [ terra ] [ OK ] TCP handshake
2025-02-09T18:30:52.819626+01:00 mikrotik netwatch,debug    [ OK ] tcp-conn-time: 0.376ms [ <= 8000.000ms ] 
2025-02-09T18:30:52.819626+01:00 mikrotik netwatch,debug    [ OK ] http-resp-time: 2.915ms [ <= 200.000ms ] 
2025-02-09T18:30:52.819771+01:00 mikrotik netwatch,debug http-status-code: 
2025-02-09T18:30:52.819771+01:00 mikrotik netwatch,debug [ OK ]  [ 100 <= ] 200 [ <= 299 ]

I still don’t understand why the TCP handshake is tested for the Netwatch https-get type.

It produces a lot of false-positive alarms.

Do you think it’s a bug, or have I misunderstood the Netwatch https-get function?
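
For anyone wanting to reproduce the debug output quoted in this post, a logging rule along these lines should produce it. This is a minimal sketch: the memory action is an assumption, and the poster's timestamps suggest a different log target was in use.

```routeros
# Sketch: send Netwatch debug messages to the in-memory log
/system logging add topics=netwatch,debug action=memory
# Inspect them afterwards
/log print where topics~"netwatch"
```

The rule can be disabled or removed again once the failure has been captured, since debug topics add a few lines per probe.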

jaclaz · February 9, 2025, 7:07pm

The documentation of the default setting is confusing:
thr-tcp-conn-time (Default: 00:05…00:30) Fail threshold for tcp-connect-time, the configuration uses microseconds, if the time unit is not specified (s/m/h), log and status pages display the same value in milliseconds.

You probably have it set to a maximum of 8s (in the log, the successful connection shows [ OK ] tcp-conn-time: 0.376ms [ <= 8000.000ms ]). Possibly the documented defaults are wrong, but then what is the minimum? And yet your config seemingly sets it to 20 seconds (thr-tcp-conn-time=20s).

But from your log it seems that the connection was down immediately before the TCP handshake failure:

2025-02-09T18:29:53.702975+01:00 mikrotik netwatch,info event down [ terra ]
2025-02-09T18:29:53.702975+01:00 mikrotik netwatch,debug [ terra ] [FAIL] TCP handshake
2025-02-09T18:29:53.702975+01:00 mikrotik netwatch,debug [ OK ] http-resp-time: 3.092ms [ <= 200.000ms ]
2025-02-09T18:29:53.703181+01:00 mikrotik netwatch,debug http-status-code:
2025-02-09T18:29:53.703181+01:00 mikrotik netwatch,debug [ OK ] [ 100 <= ] 200 [ <= 299 ]
2025-02-09T18:29:54.640063+01:00 mikrotik netwatch,debug [ terra ] [FAIL] TCP handshake
2025-02-09T18:29:54.640063+01:00 mikrotik netwatch,debug [ OK ] http-resp-time: 2.788ms [ <= 200.000ms ]
2025-02-09T18:29:54.640211+01:00 mikrotik netwatch,debug http-status-code:
2025-02-09T18:29:54.640211+01:00 mikrotik netwatch,debug [ OK ] [ 100 <= ] 200 [ <= 299 ]

2025-02-09T18:30:52.819479+01:00 mikrotik netwatch,info event up [ terra ]

and was restored only after almost a full minute.

So my doubt is rather why, in the meantime, you got two successful HTTP responses (http-status-code: and http-resp-time:); if the event is down, those should have failed as well.

Anyway, when you had the alarm, almost a minute also passed:
2025-02-06T08:46:57.019327+01:00 mikrotik script,warning TERRA OFF line
…
2025-02-06T08:47:56.111672+01:00 mikrotik script,warning TERRA ONline

Maybe there is a timing problem and the 10s delay is too short, or the number of retries needs to be increased.

It is entirely possible that, no matter why Netwatch is triggered, a full minute is needed to recover.
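
One way to act on the timing suggestion is to make the retry window longer than one Netwatch interval. The sketch below assumes that the status fields are only refreshed when Netwatch runs its next probe (interval=1m in the config above), so retries that finish within 20 seconds would keep reading the same value. maxRetries and retryDelay are illustrative values, not tested settings.

```routeros
# Sketch: poll for longer than one Netwatch interval (1m in the config above)
# maxRetries and retryDelay are illustrative values, not tested settings
:local maxRetries 7
:local retryDelay 10s
:local retryCount 0
:local tcpConnectTime [:tonsec [/tool netwatch get terra tcp-connect-time]]

:while (($tcpConnectTime = 0) && ($retryCount < $maxRetries)) do={
    :delay $retryDelay
    :set retryCount ($retryCount + 1)
    # :set updates the variable declared above instead of creating a new one
    :set tcpConnectTime [:tonsec [/tool netwatch get terra tcp-connect-time]]
    :log warning ("Retry $retryCount, TCP Connect Time: $tcpConnectTime")
}

:if ($tcpConnectTime > 0) do={
    :log warning ("Host recovered within the retry window, no alert sent")
} else={
    :log warning ("Host still down after $retryCount retries, alert should be sent")
}
```

With 7 retries of 10s the window is 70 seconds, which covers at least one further probe if the assumption about refresh timing holds.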

airvb · February 15, 2025, 12:56pm

Thank you for your reply.

I’ve tried a lot of different things without getting anywhere.

One thing is sure: when the down event is raised, http-resp-time and http-status-code remain at their last values from before the FAIL.
So it’s impossible to alert on these parameters, since they don’t change once the TCP connect time alert is active.

If I have time, I’ll open a ticket.

In the end I switched back to tcp-conn type detection on port 443, so the alert is raised if the web server is stopped.
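
For reference, the switch to tcp-conn could look roughly like this. It is a sketch that assumes the existing entry is still named terra, as in the export earlier in the thread.

```routeros
# Sketch: change the existing "terra" probe from https-get to tcp-conn on port 443
/tool netwatch set [find name=terra] type=tcp-conn port=443
# Check the result
/tool netwatch print detail where name=terra
```

With this type the probe only checks that the TCP connection can be established, so it no longer evaluates HTTP status codes or response times.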

jaclaz · February 15, 2025, 4:48pm

In your log, http-resp-time is 3.092ms once and 2.788ms once, so the values do not seem to have remained "sticky".

airvb · February 15, 2025, 6:10pm

You’re right … I need to investigate more …

jaclaz · February 15, 2025, 6:28pm

Yes. According to the log, both http-resp-time values (and http-status-code), or at least the second set, were logged after the down event, which makes little sense.
The second set appears almost a second later, which seems too long to me to be an OS delay.

You need to check the whole log, not only the netwatch messages.

jaclaz · February 16, 2025, 10:26am

Also, keep an eye on this:
http://forum.mikrotik.com/t/netwatch-https-get-rarely-false-on-down-triggers/181987/1

If it works, it seems to me like a nice way to double-check and rule out false alarms (if they are false).
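
As an illustration of the double-check idea (not necessarily what the linked thread does), the down-script could ask /tool fetch for a second opinion before alerting. The URL and check-certificate=no are assumptions for a home server with a self-signed certificate, and confirmedDown is a hypothetical variable name.

```routeros
# Sketch: second opinion before alerting, using /tool fetch against the same host
# The URL and check-certificate=no are assumptions, adjust for the real server
:local confirmedDown false
:do {
    /tool fetch url="https://192.168.88.200/" check-certificate=no output=none
} on-error={
    # fetch raised an error, so the request did not complete successfully
    :set confirmedDown true
}

:if ($confirmedDown) do={
    :log warning "terra: fetch also failed, treating the alarm as real"
} else={
    :log warning "terra: fetch succeeded, Netwatch alarm looks like a false positive"
}
```

The notification to ntfy.sh would then only be sent in the confirmedDown branch.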