Upgraded to 7.1.2 and losing default route every 12 hours

Hello,

I upgraded my router to 7.1.2 and now I’m losing all routing every 8-12 hours. The DHCP default route just disappears, breaking everything along with it. I tried to work around it by adding a static route, but that got marked invalid.

According to debug syslog, this seems to happen right after “calc” does a bunch of stuff regarding interface reachability:

Feb 12 12:30:51 10.5.0.1 route,debug,calc 1.3.1 Tag updated routes for merging
Feb 12 12:30:51 10.5.0.1 route,debug,calc 1.3.3 Loading new prefix values
Feb 12 12:30:51 10.5.0.1 route,debug,calc 1.3.4 Merge route updates
Feb 12 12:30:51 10.5.0.1 route,debug,calc 2.2 Merge forwarding path updates
Feb 12 12:30:52 10.5.0.1 route,debug,calc Prepare queued LINK/%*c/10-5/0/SRC10.5.0.1%*c
[snip rest of interfaces]
Feb 12 12:30:52 10.5.0.1 route,debug,calc Set initial reachability for interface LINK/%*c/10-5/0/SRC10.5.0.1%*c
[snip rest of interfaces]
Feb 12 12:30:52 10.5.0.1 route,debug,calc Apply reachability to LINK/%*c/10-5/0/SRC10.5.0.1%*c
[snip rest of interfaces]
Feb 12 12:30:52 10.5.0.1 route,debug,calc Resolving LINK/%*c/10-5/0/SRC10.5.0.1%*c
[snip rest of interfaces]
Feb 12 12:30:52 10.5.0.1 route,debug,calc 3 Main publish
Feb 12 12:30:59 10.5.0.1 route,debug,calc 6.1 Cleanup merge
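
For anyone who wants to line up their own logs against a failure, here is a rough Python sketch that pulls the reachability-related calc messages out of a syslog capture like the one above. The line layout is inferred only from this excerpt, the function names are made up for illustration, and the year is an assumption because these syslog lines don't carry one.

```python
import re
from datetime import datetime

# Matches lines shaped like the excerpt above, e.g.:
# Feb 12 12:30:52 10.5.0.1 route,debug,calc 3 Main publish
LINE_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2})\s+"
    r"(?P<host>\S+)\s+(?P<topics>\S+)\s+(?P<message>.*)$"
)


def parse_line(line, year=2022):
    """Return (timestamp, topics, message), or None for non-log lines.

    The year is an assumption: the log lines do not include one.
    """
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    stamp = datetime.strptime(f"{year} {m['stamp']}", "%Y %b %d %H:%M:%S")
    return stamp, m["topics"].split(","), m["message"]


def reachability_events(lines):
    """Collect timestamps of calc messages that mention reachability."""
    events = []
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue  # skips things like "[snip rest of interfaces]"
        stamp, topics, message = parsed
        if "calc" in topics and "reachability" in message.lower():
            events.append(stamp)
    return events


SAMPLE = """\
Feb 12 12:30:52 10.5.0.1 route,debug,calc Prepare queued LINK/%*c/10-5/0/SRC10.5.0.1%*c
[snip rest of interfaces]
Feb 12 12:30:52 10.5.0.1 route,debug,calc Set initial reachability for interface LINK/%*c/10-5/0/SRC10.5.0.1%*c
[snip rest of interfaces]
Feb 12 12:30:52 10.5.0.1 route,debug,calc Apply reachability to LINK/%*c/10-5/0/SRC10.5.0.1%*c
Feb 12 12:30:52 10.5.0.1 route,debug,calc 3 Main publish
Feb 12 12:30:59 10.5.0.1 route,debug,calc 6.1 Cleanup merge
"""

if __name__ == "__main__":
    for stamp in reachability_events(SAMPLE.splitlines()):
        print(stamp.isoformat())
```

Comparing those timestamps with the moment the default route vanishes would show whether the calc burst really comes first or only follows the failure.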

Within 6 seconds of that, VPNs start failing, and within 15 seconds there are DNS failures. It seems to be doing this on ALL interfaces, which doesn’t seem right. I got it to come back up by removing the ping check on the static route, but I still don’t show a DHCP default route, even though I have a valid DHCP lease…

Also odd but notable: I had Winbox open through the failure, and the route list didn’t update at all. Nothing changed, but when I switched tabs and came back, it suddenly updated and removed everything except the connected interfaces and the invalid static route.

So following up, it failed again while I was around paying attention, and I was able to un-break it by releasing and renewing the DHCP lease, even though it was in a bound state.
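
Until the root cause is found, that manual release/renew could in principle be automated. The following is only a sketch of the idea in Python: `has_default_route` and `renew_lease` are hypothetical callables, and how they actually talk to the router (SSH, API, or something else) is deliberately left out.

```python
import time


def watch_default_route(has_default_route, renew_lease, checks,
                        interval=30.0, sleep=time.sleep):
    """Poll for the default route and renew the lease when it is missing.

    has_default_route and renew_lease are placeholder callables supplied
    by the caller; this function only implements the polling loop.
    Returns the number of times a renew was triggered.
    """
    renewals = 0
    for _ in range(checks):
        if not has_default_route():
            renew_lease()
            renewals += 1
        sleep(interval)
    return renewals


if __name__ == "__main__":
    # Dry run with fake state: the route is missing on the second check.
    states = iter([True, False, True])
    triggered = watch_default_route(
        has_default_route=lambda: next(states),
        renew_lease=lambda: print("would release/renew the DHCP lease here"),
        checks=3,
        sleep=lambda seconds: None,  # skip real waiting in the dry run
    )
    print("renewals triggered:", triggered)
```

This only papers over the symptom; it does nothing about whatever makes the route disappear in the first place.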

It does look like the WAN connection was maxed out at 100 Mbps for the ~12 minutes preceding the failure, but that doesn’t really seem like enough throughput to cause anything to stop working. I did see all the same interface reachability messages from calc at about the same time, perhaps a few seconds after the failure.

I’m also attaching new telegraf stats I set up, showing detail during the failure. The attachment seems broken for me, so here is a Dropbox link in case it doesn’t show: https://www.dropbox.com/s/lr73z59hb2s5r3q/mikrotik%20stats%20during%20failure.JPG?dl=0
mikrotik stats during failure.JPG

What does the log say?
There must be some info at the time the lease renews or not…

Also, what program are you using for SNMP monitoring?

It still shows as bound; I was just assuming the DHCP release/renew resets whatever process or mechanism has failed.

The log even says, when it stops, that it has lost its existing IP address, even though the route is broken and the default route doesn’t exist.

The log as it recovers does mention the calc process making changes to the WAN interface’s reachability, so I suspect that is what is failing.

A large block of calc changes setting interface reachability seems to precede every failure, and a message from the same process about the WAN interface is present during the recovery. It seems DHCP triggers calc to do something on the interface.

As for the monitoring, it’s telegraf + influxdb + grafana.