shaoranrch
Member Candidate
Topic Author
Posts: 184
Joined: Thu Feb 13, 2014 8:03 pm

Possible critical routing bug 6.38.7

Fri Feb 02, 2018 12:58 am

Hi everyone,

Here's the scenario:
Hardware: CCR1036-12G-4S
ROS: 6.38.7
What the router is used for: Edge routing, terminates 2 eBGP peers and 2 iBGP peers
Routing protocols in use: OSPF and BGP only
Additional info:
  • No firewalling done except for rules that block access to SSH, winbox, BGP etc.
  • There's a VRF with 2 interfaces tied to it. The VRF has two /30 subnets (private IPs 10.100.0.0/30 and 10.100.0.4/30, one on each interface) and a single default route
  • The router has been up for 217 days with no issues at all
  • No simple queues
  • The only mangle rules mark output packets that we consider high-priority (control traffic, OSPF, BGP, etc.)
  • The only queue trees prioritize traffic marked by mangle

I hope this gives everyone enough context. There are no weird configurations or anything unusual done on this device.

The router has had a route towards 172.16.0.0/24 for weeks, and we had no issues at all communicating with hosts in 172.16.0.0/24. Yesterday, however, a single host in 172.16.0.0/24 became unreachable, and only from this device.

The host's IP in question is 172.16.0.8.

I checked the routing table and could still see 172.16.0.0/24 pointing to the correct exit interface and next-hop. What's more, I could ping 172.16.0.9 and 172.16.0.10 just fine. There are no /32 routes towards 172.16.0.8, neither in the main routing table nor in the VRF table, yet I couldn't reach this host.

Upon examination I found that everything towards 172.16.0.0/24 except 172.16.0.8 is using the intended interface, but traffic towards 172.16.0.8 is being sent out a different interface. Even more annoying, the router is sending this traffic with a DST MAC address of FF:FF:FF:FF:FF:FF (broadcast).

I used
ip route check dst-address
to see what the router is doing. As you can see from the outputs in this screenshot, it's deciding to send traffic towards 172.16.0.8 using ETH1 while the rest of 172.16.0.0/24 uses V100. It's also listing 172.16.0.8 as the next-hop.

Image
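
For anyone following along, here is a minimal Python sketch of the longest-prefix-match lookup one would normally expect in this situation. It is not how RouterOS implements its lookup. Only the 172.16.0.0/24 route and the V100 interface name come from this post; the default route and its label are hypothetical placeholders.

```python
import ipaddress

# Hypothetical routing table modelled on the post. Only the /24 via V100 is
# taken from the description; the default route entry is a placeholder.
ROUTES = [
    (ipaddress.ip_network("0.0.0.0/0"), "upstream (placeholder)"),
    (ipaddress.ip_network("172.16.0.0/24"), "V100"),
]


def lookup(dst, routes=ROUTES):
    """Return (network, interface) of the most specific route covering dst."""
    addr = ipaddress.ip_address(dst)
    candidates = [(net, ifc) for net, ifc in routes if addr in net]
    if not candidates:
        return None
    # Longest prefix wins: a /24 beats the /0 default route.
    return max(candidates, key=lambda route: route[0].prefixlen)


for host in ("172.16.0.8", "172.16.0.9", "172.16.0.10"):
    net, ifc = lookup(host)
    print(f"{host} -> {net} via {ifc}")
```

With no /32 present, all three hosts resolve to the same /24 entry and the same interface in this sketch, which is why 172.16.0.8 alone leaving via ETH1 looks wrong.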

I've spent hours checking the config and operational status of the router. It hasn't been changed, and diffs of the config show no recent changes; this just started by itself. A reboot would probably fix it, but since this is a mission-critical router (one of several) I can't just go and reboot it now; I have to plan the downtime. Luckily this destination isn't something important (as of right now).

There's also the concern that a reboot would fix it only for it to appear again later. We don't want to upgrade to 6.39.3 because we upgraded 2 edge routers at two different physical locations (all CCR1036) to it, and after 30 to 40 days of working they randomly started rebooting with logs showing "Kernel Failure". We rolled back to 6.38.7 and they haven't rebooted since. Nor can we upgrade any higher, due to a strict policy of only using BFO images.

Has anyone experienced this before? If so, did you find a way to avoid it?
