Starting with 6.44.5 long-term, some services (winbox/ssh to the router, DHCP, wifi, OSPF) hang

I’ve got 3 routers forming a private network using OSPF and SSTP.

#1 RouterBOARD 962UiGS-5HacT2HnT - 6.44.6 (long-term)
#2 and #3 RB951G-2HnD - 6.42.10 (long-term)

Routers #2 and #3 have been running non-stop for more than 300 days (on the older RouterOS release), but router #1 started hanging quite often after it was upgraded to RouterOS 6.44.5 (and later to 6.44.6). When it hangs, I cannot even diagnose what is wrong.

“Hanging” means:

  1. No wifi device can connect, on either the 2.4 GHz or the 5 GHz band.
  2. The wired connection from my Windows PC to the router gets a 169.254.xxx.xxx address (DHCP failed).

If I assign the IP address to that wired connection manually (the same one the router’s DHCP would hand out) and set the router’s IP as gateway and DNS:

  3. I can ping the router - OK
  4. Web browsing works just fine (DNS is working, NAT is working)
  5. I cannot connect to the router via ssh, winbox or http. The ssh command simply hangs indefinitely.
  6. I cannot ping other hosts on the OSPF network

I also have a watchdog configured to reboot the router if 8.8.8.8 is not reachable, but it does not trigger, so I assume 8.8.8.8 is reachable.

Since I cannot connect to the router to read the logs, the only option for me is to power-cycle it by unplugging the AC adapter.
Once it has rebooted I try to read the logs stored on the USB flash drive. Unfortunately, they seem to have stopped recording several hours before I noticed the problem.

If anyone has seen a similar picture or has an idea of what might be wrong or which logs can be presented to identify the problem, please help.

Thanks!

I would recommend making both a backup and an export of the configuration, then netinstalling the router to the same version you have now and restoring the backup.
If the problem still remains (and depending on your skills) you may want to reset it to defaults again and load the export. This is always a bit tricky.

Thanks :) I really wanted to avoid this reset/reload thing. But from what you say it sounds like a routine I will have to resort to, since router #1 is 1 meter away from me. The other routers are 500 km away, and it is hard to make up my mind to upgrade them to the latest long-term RouterOS knowing in advance that something might go wrong.

Maybe there are ways to diagnose RouterOS problems when you temporarily don’t have access to the router. Maybe some scheduled job with diagnostic output could help.

In the earlier routers there was an RS232 port which would be helpful in such cases, but the model you have no longer has that (outdated) feature.
There probably is a serial TTL connection still on the board (some pads or a header), but you would need to open the case to reach it, and if there is a hardware problem that you may want to solve via warranty, it is better not to do that :)
You could connect an RS232-to-USB cable to the USB port and see if you can configure that as a console port, I am not sure if that works.

There is no issue with running that particular RouterOS version on your HAP AC, I have that running as well somewhere and it works OK.
There is either a hardware problem with your router or something wrong in the internal configuration. That is why I suggest the reset and, if that does not work, returning it to the dealer.
It should absolutely be able to run for a year or more without any issue.

Hi! Resets, netinstall and reconfiguration did not help.
But it seems one thing has worked out: I disabled logging on the 16GB USB flash disk attached to the router.

Only one of my 3 routers had a USB disk and logged its activity to it. The others didn’t. I noticed that the problematic device was low on memory (30 MB free vs 100 MB free on the routers without logging). Though I don’t really know how the memory management is done, I started to suspect a memory leak, and I searched for the service that causes the difference in free memory among my routers. Since I ejected the USB disk and switched to logging to syslogd, the free memory on the problematic router has been stable (100 MB free). And I haven’t observed any hanging so far.

Maybe there is some problem with logging to USB disks attached to the router. While researching the problem I scheduled a job to log free memory and CPU usage every 10 seconds. I wanted to understand what happens when the router becomes unresponsive. It turned out that even that basic scheduled job stops logging anything. It produced no output for several minutes, until the watchdog decided to reboot the router.

With “free memory” displays in Linux it is always necessary to know how this is exactly calculated. The system uses a “page cache” to hold disk blocks in memory, and this is used both for holding program code (a copy of the program on disk in memory) and as a disk cache for data files (such as your logging file on external USB).
If the displayed free memory counts the part of the page cache used for data files as in use, it can sometimes show a misleadingly low (and decreasing) value.
That then jumps back up again when you unmount the disk holding the data file.

I’m not sure what figure RouterOS is displaying, I never investigated that (and never used permanently connected external disks). However, I would expect it to be correct and show that value without the page cache.
See this example for my own Linux machine:

              total        used        free      shared  buff/cache   available
Mem:          7.8Gi       2.0Gi       564Mi       191Mi       5.3Gi       5.4Gi

Here you see that of the 8GB of memory (old machine…) 564M is free, but 5.3GB is in the cache, so when that is taken into account 5.4GB is actually available for use.
If “free memory” were displayed as a single figure, it could be either 564M or 5.4G in this case. As you can see, quite a dramatic difference. Only 2GB is really “in use”.
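
To make the difference concrete, here is a minimal Python sketch that parses the two lines of `free -h` output quoted above and compares the "free" and "available" columns. The helper names (`parse_size`, `parse_free_line`) are made up for this illustration, and it only handles the Ki/Mi/Gi suffixes that appear in that sample. It says nothing about which figure RouterOS itself reports.

```python
def parse_size(text):
    """Convert a `free -h` style size such as '564Mi' or '5.3Gi' to MiB."""
    units = {"Ki": 1 / 1024, "Mi": 1, "Gi": 1024}
    return float(text[:-2]) * units[text[-2:]]


def parse_free_line(header, mem_line):
    """Map each column name of the header to the matching value in MiB."""
    columns = header.split()
    values = mem_line.split()[1:]  # drop the leading "Mem:" label
    return {name: parse_size(value) for name, value in zip(columns, values)}


# The sample output quoted in the post above.
HEADER = "              total        used        free      shared  buff/cache   available"
MEM = "Mem:          7.8Gi       2.0Gi       564Mi       191Mi       5.3Gi       5.4Gi"

mem = parse_free_line(HEADER, MEM)
print(f"free:       {mem['free']:.0f} MiB")
print(f"buff/cache: {mem['buff/cache']:.0f} MiB")
print(f"available:  {mem['available']:.0f} MiB")
print(f"available is {mem['available'] / mem['free']:.1f}x the 'free' figure")
```

A tool that reports only the "free" column would show this machine as nearly out of memory, while one that reports "available" would show most of the RAM as usable.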