Trouble with NTP Server - "leap 11 not synchronized"

Hello!

I’m running RouterOS 7.5 on a Chateau 5G (D53G-5HacD2HnD). RouterOS 7 includes the full NTP client & server built in, and doesn’t need a separate package installed.

I am trying to use the Chateau as the NTP server for all the devices on its LAN, so that they all have a single consistent time source, rather than individual connections to potentially different internet time sources. For these devices my goal is to maximise relative consistency among the fleet, rather than absolute accuracy. Additionally, I want all devices on the LAN to have access to time synchronisation, but only a subset of them are actually permitted to send traffic to the WAN & internet. Therefore I want to use the Mikrotik’s NTP server to provide time to the LAN-only devices.

The problem that I have is that RouterOS reports that its NTP client is working and synchronized, but the NTP clients getting time from the Chateau report that the NTP Server is not synchronized. This prevents the downstream clients from using the Chateau’s time for synchronisation. Why could this be happening?

More details on configuration settings and evidence below.

The RouterOS NTP client is enabled, working and synchronized:

[me@MYROUTER] > /system/ntp/client print
         enabled: yes
            mode: unicast
         servers: 0.pool.ntp.org,1.pool.ntp.org,2.pool.ntp.org,3.pool.ntp.org
             vrf: main
      freq-drift: 150.301 PPM
          status: synchronized
   synced-server: 1.pool.ntp.org
  synced-stratum: 2
   system-offset: -346.674 ms
   
[me@MYROUTER] /system/ntp/client/servers> print
Flags: X - disabled; D - dynamic 
 0    address=0.pool.ntp.org resolved-address=27.124.125.251 min-poll=6 
      max-poll=10 iburst=yes auth-key=none 

 1    address=1.pool.ntp.org resolved-address=139.99.222.72 min-poll=6 
      max-poll=10 iburst=yes auth-key=none 

 2    address=2.pool.ntp.org resolved-address=203.29.242.183 min-poll=6 
      max-poll=10 iburst=yes auth-key=none 

 3    address=3.pool.ntp.org resolved-address=27.124.125.252 min-poll=6 
      max-poll=10 iburst=yes auth-key=none 
      
[me@MYROUTER] /system/ntp> monitor-peers
 type="ucast-client" address=139.99.222.72 refid="17.253.66.125" stratum=2 hpoll=9 ppoll=9 root-delay=0.335 ms root-disp=0.305 ms 
   offset=-258.078 ms delay=96.48 ms disp=16.001 ms jitter=120.42 ms 

 type="ucast-client" address=27.124.125.251 refid="162.159.200.123" stratum=4 hpoll=9 ppoll=9 root-delay=2.563 ms root-disp=0.625 ms 
   offset=-269.07 ms delay=138.172 ms disp=41.043 ms jitter=176.6 ms 

 type="ucast-client" address=203.29.242.183 refid="130.95.179.80" stratum=2 hpoll=9 ppoll=9 root-delay=1.831 ms root-disp=18.432 ms 
   offset=-450.306 ms delay=138.533 ms disp=19.173 ms jitter=95.711 ms 

 type="ucast-client" address=27.124.125.252 refid="202.70.69.81" stratum=2 hpoll=9 ppoll=9 root-delay=0.885 ms root-disp=0.366 ms 
   offset=-540.282 ms delay=355.316 ms disp=15.846 ms jitter=182.991 ms

NTP Server is enabled for unicast (which is the only mode I want to use):

[me@MYROUTER] > /system/ntp/server print
              enabled: yes
            broadcast: no
            multicast: no
             manycast: no
  broadcast-addresses: 
                  vrf: main
      use-local-clock: no
  local-clock-stratum: 5
             auth-key: none

However my NTP clients on the LAN are reporting that although they can talk to the NTP server running on the Chateau, it is indicating that is not synchronized. Therefore my client devices are refusing to synchronise to the Chateau.

Here is the output from a Linux client device trying to synchronise to the NTP server at the Chateau’s IP address:

[pi@MYCLIENT:~ $ sudo ntpdate -uvdb 172.16.0.254 
17 Oct 17:10:12 ntpdate[1730]: ntpdate 4.2.8p12@1.3728-o (1) 
Looking for host 172.16.0.254 and service ntp 
host found : 172.16.0.254 
transmit(172.16.0.254) 
receive(172.16.0.254) 
transmit(172.16.0.254) 
receive(172.16.0.254) 
transmit(172.16.0.254) 
receive(172.16.0.254) 
transmit(172.16.0.254) 
receive(172.16.0.254) 
172.16.0.254: Server dropped: leap not in sync 

server 172.16.0.254, port 123 
stratum 3, precision -20, leap 11, trust 000 
refid [139.99.222.72], root delay 0.091080, root dispersion 0.553360 
transmitted 4, in filter 4 
reference time:    e6f797a8.b609c482  Mon, Oct 17 2022 16:58:16.711 
originate timestamp: e6f79a9b.d38c64fd  Mon, Oct 17 2022 17:10:51.826 
transmit timestamp:  e6f79a7b.2579c861  Mon, Oct 17 2022 17:10:19.146 
filter delay:  0.02751  0.02744  0.02734  0.02747  
        0.00000  0.00000  0.00000  0.00000  
filter offset: 32.67808 32.67838 32.67861 32.67889 
        0.000000 0.000000 0.000000 0.000000 
delay 0.02734, dispersion 0.00023 
offset 32.678617 

17 Oct 17:10:19 ntpdate[1730]: no server suitable for synchronization found

Some relevant facts that I want to draw attention to. The ntpdate log shows both transmit(172.16.0.254) and receive(172.16.0.254) events, which indicates that the client device is getting a response packet back from the Chateau. The response from the server contains the correct stratum (3 - since it is synchronising to a stratum 2 pool.ntp.org server) and shows a sensible refid for the upstream pool.ntp.org server. But the leap value is “11”. This is a bitmask of two bits; a value of 00 indicates full synchronisation, 01 and 10 have specific meanings concerned with leap second handling, and 11 means “unsynchronized”.

Looking at some of my Windows devices on the LAN, and examining the status of the standard Windows NTP service w32time, shows similar status info. It displays the leap value bitmask’s decimal value (3) rather than two separate bits (11):

PS C:\Windows\system32> w32tm /query /status
Leap Indicator: 3(not synchronized)
Stratum: 0 (unspecified)
Precision: -23 (119.209ns per tick)
Root Delay: 0.0775065s
Root Dispersion: 8.2703314s
ReferenceId: 0x00000000 (unspecified)
Last Successful Sync Time: 17/10/2022 6:02:40 AM
Source: 172.16.0.254
Poll Interval: 10 (1024s)

w32tm is able to successfully query the NTP server on the Chateau and report on the relative time difference, so this again reiterates that there does not seem to be any issue with routing or blocking NTP packets:

PS C:\Windows\system32> w32tm /stripchart /computer:172.16.0.254
Tracking 172.16.0.254 [172.16.0.254:123].
The current time is 17/10/2022 5:52:03 PM.
17:52:03, d:+00.0006544s o:-00.1444332s  [                           *                           ]
17:52:05, d:+00.0006025s o:-00.1441264s  [                           *                           ]
17:52:07, d:+00.0006019s o:-00.1438235s  [                           *                           ]
17:52:09, d:+00.0005737s o:-00.1435123s  [                           *                           ]
17:52:11, d:+00.0005563s o:-00.1432195s  [                           *                           ]

Why would RouterOS’s NTP client be showing “synchronized” while the server component is telling clients that it’s not synchronized?

It’s possible that this might be related in some way to either the quality of synchronisation, or the use of “NTP pool” DNS names which can yield different servers when re-queried.

I found this morning that my clients were successfully synchronising to the RouterOS NTP server. Then, they stopped synchronising, giving the same information as before (NTP server “leap” value indicating that it wasn’t synchronised). I didn’t change the RouterOS configuration during this time, it just intermittently worked or not.

Since then, I have added two more NTP server names, pointing at NTP servers which are operated by the 5G ISP. After doing this, the RouterOS NTP client chose to synchronise to those servers instead of pool.ntp.org servers. Now the RouterOS NTP server is telling LAN clients that it is synchronised, and the clients are syncing successfully. I don’t know yet if this will prove to be a stable fix.

I’m still not sure whether the reported lack of synchronisation was due to some oddity of DNS resolution interacting with the RouterOS NTP client & server, or dispersion/jitter values, or something else completely.

The RouterOS system logs do show quite a few “ntp change time” events with no apparent actual change:

oct/17 17:44:18 system,info ntp change time Oct/17/2022 17:44:18 => Oct/17/2022 17:44:18
 oct/17 18:18:26 system,info ntp change time Oct/17/2022 18:18:26 => Oct/17/2022 18:18:26
 oct/17 19:09:41 system,info ntp change time Oct/17/2022 19:09:41 => Oct/17/2022 19:09:41
 oct/17 19:46:01 system,info ntp change time Oct/17/2022 19:46:01 => Oct/17/2022 19:46:01
 oct/17 20:03:06 system,info ntp change time Oct/17/2022 20:03:06 => Oct/17/2022 20:03:06
 oct/17 20:45:46 system,info ntp change time Oct/17/2022 20:45:46 => Oct/17/2022 20:45:46
 oct/17 21:05:03 system,info ntp change time Oct/17/2022 21:05:03 => Oct/17/2022 21:05:03
 oct/17 21:45:32 system,info ntp change time Oct/17/2022 21:45:32 => Oct/17/2022 21:45:32
 oct/17 22:02:41 system,info ntp change time Oct/17/2022 22:02:41 => Oct/17/2022 22:02:41
 oct/17 23:13:06 system,info ntp change time Oct/17/2022 23:13:06 => Oct/17/2022 23:13:06
 oct/17 23:47:17 system,info ntp change time Oct/17/2022 23:47:17 => Oct/17/2022 23:47:17
 00:29:57 system,critical,info ntp change time Oct/18/2022 00:29:59 => Oct/18/2022 00:29:57
 00:56:22 system,info ntp change time Oct/18/2022 00:56:22 => Oct/18/2022 00:56:22
 01:30:36 system,info ntp change time Oct/18/2022 01:30:36 => Oct/18/2022 01:30:36
 01:46:45 system,info ntp change time Oct/18/2022 01:46:45 => Oct/18/2022 01:46:45
 03:04:31 system,info ntp change time Oct/18/2022 03:04:31 => Oct/18/2022 03:04:31
 03:40:52 system,info ntp change time Oct/18/2022 03:40:52 => Oct/18/2022 03:40:52
 03:57:57 system,info ntp change time Oct/18/2022 03:57:57 => Oct/18/2022 03:57:57
 05:12:32 system,info ntp change time Oct/18/2022 05:12:32 => Oct/18/2022 05:12:32
 05:29:28 system,info ntp change time Oct/18/2022 05:29:28 => Oct/18/2022 05:29:28
 05:55:05 system,info ntp change time Oct/18/2022 05:55:05 => Oct/18/2022 05:55:05
 06:22:57 system,info ntp change time Oct/18/2022 06:22:57 => Oct/18/2022 06:22:57
 07:37:36 system,info ntp change time Oct/18/2022 07:37:36 => Oct/18/2022 07:37:36
 08:20:20 system,info ntp change time Oct/18/2022 08:20:20 => Oct/18/2022 08:20:20
 09:03:02 system,critical,info ntp change time Oct/18/2022 09:03:02 => Oct/18/2022 09:03:01
 09:37:04 system,info ntp change time Oct/18/2022 09:37:04 => Oct/18/2022 09:37:04
 10:00:13 system,info ntp server record added
 10:00:13 system,info ntp settings changed by REDACTED
 10:00:31 system,info ntp change time Oct/18/2022 10:00:31 => Oct/18/2022 10:00:31
 10:00:34 system,info ntp server record added
 10:00:34 system,info ntp settings changed by REDACTED
 10:19:49 system,info ntp change time Oct/18/2022 10:19:49 => Oct/18/2022 10:19:49

The negative time jumps flagged as “critical” severity are particularly noticeable.

Does this mean that RouterOS NTP client is using clock-jump synchronisation (SNTP-style, like Debian’s systemd-timesyncd) rather than clock rate discipline (NTP-style, like ntpd/chrony/w32time) to achieve synchronisation?

For comparison purposes, here’s the w32tm output from a client machine when the RouterOS NTP server is announcing that it is properly synchronised and the Windows machine is able to synch to it as expected:

PS C:\Windows\system32> w32tm /query /peers
#Peers: 1

Peer: 172.16.0.254
State: Active
Time Remaining: 884.4168654s
Mode: 3 (Client)
Stratum: 3 (secondary reference - syncd by (S)NTP)
PeerPoll Interval: 10 (1024s)
HostPoll Interval: 10 (1024s)

PS C:\Windows\system32> w32tm /query /status
Leap Indicator: 0(no warning)
Stratum: 4 (secondary reference - syncd by (S)NTP)
Precision: -23 (119.209ns per tick)
Root Delay: 0.0679568s
Root Dispersion: 8.1670319s
ReferenceId: 0xAC1000FE (source IP:  172.16.0.254)
Last Successful Sync Time: 18/10/2022 10:33:46 AM
Source: 172.16.0.254
Poll Interval: 10 (1024s)

Ok, adding the new servers to the RouterOS NTP client configuration only temporarily fixed the problem.

In Winbox, the RouterOS NTP Client is still showing “synchronised” state but the LAN clients are once again saying that the RouterOS NTP Server is returning “unsynchronized” status when queried.

My Windows test machine last synchronised successfully at 11:07:54AM; the RouterOS logs show an “ntp change time” at 11:10:00 AM. It seems like after the “ntp change time” event happens, the RouterOS NTP Server loses synchronisation.

Seems like it might be a RouterOS bug. I guess I’d better raise a support ticket.