Uptime rollover bug/SNMP

Posted: Mon Nov 16, 2020 10:31 pm
by sathackr

About 497 days ago we deployed our first Mikrotik CRS326 switches running RouterOS 6.44.3 into production.

Today they are becoming unreachable via SNMP one by one, and the system uptime shown in the Web UI makes it clear that the uptime counter is stored in 32 bits and has rolled over.

We suspect this is causing SNMP to fail.

Has there been any update in versions >6.44.3 to address this issue? We have over 400 of these switches deployed and do not want to have to track rebooting them every 497 days.
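
For reference, the 497-day figure can be checked with a little arithmetic. This is only a sketch, and it assumes the counter is an unsigned 32-bit value that ticks 100 times per second, which is how SNMP TimeTicks (and therefore sysUpTime) is defined; whether RouterOS keeps its internal uptime the same way is not confirmed anywhere in this thread. The function and constant names are made up for illustration.

```python
# Assumption: unsigned 32-bit counter incremented 100 times per second
# (the SNMP TimeTicks unit). Not confirmed for RouterOS internals.
TICKS_PER_SECOND = 100
SECONDS_PER_DAY = 86_400


def rollover_days(bits: int = 32, ticks_per_second: int = TICKS_PER_SECOND) -> float:
    """Days until a counter of the given width wraps back to zero."""
    return (2 ** bits) / ticks_per_second / SECONDS_PER_DAY


print(f"{rollover_days():.2f} days")  # prints "497.10 days"
```

Under that assumption the wrap lands just past day 497, which matches the deployment date given above.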

Re: Uptime rollover bug/SNMP

Posted: Mon Nov 16, 2020 11:20 pm
by joegoldman
497 days is a long time to go without security upgrades etc.

Perhaps set up a yearly maintenance and upgrade cycle.

Or at the very least, have SNMP monitoring start warning at day 450 and go critical at day 480.

Who knows, maybe uptime is a 64-bit integer in newer versions of RouterOS. There have been a lot of releases since your current one.
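
The day 450 / day 480 thresholds could be expressed roughly like this. It is a minimal sketch, assuming the monitoring system has already polled sysUpTime and hands over the raw TimeTicks value (hundredths of a second); `uptime_status` and the constants are hypothetical names, not part of any particular monitoring tool.

```python
# Thresholds from the suggestion above; the names are hypothetical.
WARN_DAYS = 450
CRIT_DAYS = 480
TICKS_PER_DAY = 100 * 86_400  # sysUpTime is reported in hundredths of a second


def uptime_status(sysuptime_ticks: int) -> str:
    """Classify a raw sysUpTime reading against the rollover thresholds."""
    days = sysuptime_ticks / TICKS_PER_DAY
    if days >= CRIT_DAYS:
        return "CRITICAL"
    if days >= WARN_DAYS:
        return "WARNING"
    return "OK"


# Example: a switch that has been up for 460 days.
print(uptime_status(460 * TICKS_PER_DAY))  # prints "WARNING"
```

Note that this only helps while the device still answers SNMP, so the alert has to fire before the rollover, not after.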

Re: Uptime rollover bug/SNMP

Posted: Mon Nov 16, 2020 11:50 pm
by mkx
The Linux kernel has had a 64-bit uptime counter (regardless of the HW platform's "bitness") since version 2.6, which was released in mid-December 2003.
ROSv7 is built around a much newer Linux kernel, so the issue will be gone there. Not with ROSv6 though: MT is not going to upgrade the kernel inside it (it's not a trivial task, and they have stuck with the same kernel for too long).

While I tend to agree that some minimum maintenance is the right thing to do, I don't see it as pressing for a switch where (almost) everything happens inside the ASIC / switch chip.

Re: Uptime rollover bug/SNMP

Posted: Wed Nov 18, 2020 11:54 pm
by sathackr
Yep -- also, we are always hesitant to upgrade firmware unless there is a specific issue to address. The risk of a firmware upgrade, or even just a reboot, is not zero. We know that 6.44.3 & 6.44.5 work very well on hundreds of switches serving thousands of customers. We're not in a hurry to change that every month just because there is a new firmware release and, with it, a potential new firmware regression.

More than a couple of times I've had an MT device fail after a firmware upgrade or a simple reboot (corrupt routerboot, corrupt flash, and self-recovery fails, which causes an outage and requires a subsequent truck roll).

We protect the devices with a robust firewall rule set, and while not perfectly secure, it serves our purposes.

The rollover bug itself isn't necessarily a problem, but SNMP dies somehow in connection with it and makes the devices unmonitorable.