RB5009 dropping all traffic for a few seconds

We are having an issue on the RB5009UPr+S+ on ROS ver 7.18.2 where, at random, ALL the ether ports and the SFP port disconnect and then reconnect a few seconds later. The log shows the drop, and then our BFD & OSPF sessions drop and have to reconnect. It’s a 1-2 sec blip, but it happens over and over, every few mins, or sometimes it will go a few hours between drops.

As a test we completely replaced the router with another RB5009UP unit in case the hardware was failing, but the 2nd unit is now doing the exact same thing.

We have not made any changes to the router; this seems to have started since upgrading to 7.18.2. This router has been running for a few years without any issues prior to the latest round of ROS upgrades we did.

I have a support ticket open with MK already, and sent them the supout, but have not yet heard back from them. Just wondering if anyone else is seeing this or if it’s something specific to us?

Did you reuse the original unit’s PSU on the new one?

I have 24V from the battery bank direct into the 2-wire DC input, and a 24V DC adaptor on the barrel connector as well. It’s been set up this way for a few years with no issues. ROS was upgraded to 7.18.2 and now we are getting this random link dropping issue. It will go several hours with no problems, then drop 4-5 times within a few minutes, each time for only 1 or 2 seconds. I cannot seem to determine what is causing it, unless there’s some sort of bug in this ROS version.
I have already upgraded the firmware, so I am not sure if I can downgrade ROS now?

You can always downgrade (down to factory version, you can not go lower).

Isn’t there anything visible about these drops in the log files?

Two very close voltages will cause the diode bridge inside the device to toggle between sources each time one of the two rises slightly above the other. This in turn looks to the power supplies like a sudden load appearing and then disappearing at each switch.

What you want instead is a significant difference such that the backup DC power source toggles into on-load mode only when the main source is going into dropout. Given a typical regulation spec of 5%, either one source should be ≥ 24×1.05=25.2V or the other should be < 24×0.95=22.8V.
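To put numbers on that rule of thumb, here is a rough Python sketch. The 5% figure is just the rule of thumb above, the function names are made up for illustration, and the 30V supply in the last line is only an example of a clearly separated source:

```python
def regulation_band(nominal_v, regulation=0.05):
    """Return the (low, high) voltages a supply may sit at for a given regulation spec."""
    low = round(nominal_v * (1 - regulation), 2)
    high = round(nominal_v * (1 + regulation), 2)
    return low, high


def bands_overlap(a, b):
    """True if two (low, high) bands share any voltage, so either supply could end up higher."""
    return a[0] <= b[1] and b[0] <= a[1]


battery = regulation_band(24.0)
wall = regulation_band(24.0)
print(battery)                        # (22.8, 25.2)
print(bands_overlap(battery, wall))   # True: either source may win at any instant

# Assumed example: a 30V supply sits entirely above the 24V band.
print(bands_overlap(regulation_band(30.0), battery))  # False
```

When the bands overlap, you cannot say in advance which source will be carrying the load.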

Now we learn that these long-abused DC sources remain in use after the switch to the backup router. Because the problem source has to be in one of the elements you didn’t change, they’re definitely on the “suspect” list.

You claim the unchanging element is the ROS version, and you might be right, except for this: why aren’t we hearing about this happening to all the other users of 7.18.2 on RB5009s?

EDIT: I’ll dare to add a few predictions here:


  1. If you put a data logger on these two DC supplies and run it at a sufficiently high sample rate, capturing voltage and current, you will find a time correlation between the drop-out event and a switch from one power supply to the other.
  2. At the same time, the one being switched to will be found to have dropped below the RB5009’s minimum voltage spec, which happens to be right at 24V. (Minus load regulation, my “5%” rule of thumb above.)
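
If you do capture such a log, the time correlation in prediction 1 could be checked with something like the following Python sketch. The timestamps are invented sample data (seconds since the start of the capture), and the function name and the 1-second window are assumptions, not anything the logger or RouterOS provides:

```python
from bisect import bisect_left


def dropouts_near_switch(dropouts, switches, window_s=1.0):
    """Return the dropout times that fall within window_s seconds of any supply switch."""
    switches = sorted(switches)
    matched = []
    for t in dropouts:
        # First switch at or after the start of the window around this dropout.
        i = bisect_left(switches, t - window_s)
        if i < len(switches) and switches[i] <= t + window_s:
            matched.append(t)
    return matched


# Hypothetical data: link-down times from the router log, source changes from the logger.
dropouts = [120.4, 305.0, 911.7]
switches = [12.0, 120.1, 304.6, 700.2]
print(dropouts_near_switch(dropouts, switches))  # [120.4, 305.0]
```

If most dropouts land in the matched list, the power theory holds up; if almost none do, it is time to look elsewhere.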

Here you are teetering on the hairy edge of success, and then you go a-dancing. :roll_eyes:

What devices are connected to the router? It could be a grounding issue, or failed cabling to one of the cameras/radios/whatever else you have plugged into it. Or it could be related to a switching loop.

Does any of the attached equipment show anything in its logs (besides ports dropping)?

Is it being used to provide power to one of the devices? Is one of those devices possibly surging or reaching a higher level of throughput than you have ever had before?

I have seen where some radios getting upgraded to newer firmware draw more power than previous versions did before.

I don’t think it’s the DC power, as it’s directly attached to a giant 24V battery bank consisting of 12 x 6V batteries, but I will test.

Nothing has been changed or added to the site. All of the radios directly connected are backhaul radios. It’s strange because the issue will happen every 2-5 mins for maybe ten times in a row, then drop to every 10-20 mins, then go quiet for hours, and then the cycle starts all over again. We’re trying a different power source today to eliminate that possibility. We have the same setup at many other tower sites and it’s rock solid.

The potential issue I’m bringing up has nothing to do with the mass of your battery bank, but with the fact that both it and the wall supply will be fluctuating around the ideal 24.0000V by some amount, such that each one will end up running the device in alternation, depending on which one happens to have the higher voltage at any given instant.

If you have it in your mind that the battery supply is purely for backup when the wall supply goes down, that is an illusion, purely in your own mind. Electronics don’t ask where the electrons come from when deciding which ones to use. All that matters when using a diode OR bridge is this: the supply with the higher voltage wins. Nothing more. The thing is, you’ve set up a situation here where you don’t know which supply will “win” at any given instant; they toggle semi-randomly based on factors out of your control.
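
As a toy illustration of "the higher voltage wins", here is a small Python model. It assumes an ideal diode-OR (no forward drop) and uses made-up sine-wave ripple in place of real supply behaviour, so treat it as a sketch of the argument, not a measurement:

```python
import math


def winner(v_battery, v_wall):
    """Ideal diode-OR: whichever input is higher feeds the load."""
    return "battery" if v_battery > v_wall else "wall"


def count_handovers(battery_v, wall_v):
    """Count how often the feeding source changes across paired voltage samples."""
    winners = [winner(b, w) for b, w in zip(battery_v, wall_v)]
    return sum(1 for prev, cur in zip(winners, winners[1:]) if prev != cur)


def ripple(center_v, amplitude_v, period, n=1000):
    """Made-up supply waveform: a sine ripple around a center voltage."""
    return [center_v + amplitude_v * math.sin(2 * math.pi * i / period) for i in range(n)]


# Two sources both hovering around 24V: the load keeps changing hands.
close = count_handovers(ripple(24.0, 0.2, 97), ripple(24.0, 0.2, 61))

# Wall supply well above the battery: the battery never takes over.
separated = count_handovers(ripple(24.0, 0.2, 97), ripple(30.0, 0.2, 61))

print(close > 0, separated)
```

With the two sources a few volts apart, the handover count stays at zero no matter how the ripple lines up.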

Above, I gave two alternatives to solve this, but that was before I read the RB5009 specs and found that 24V is its minimum supported value. That leaves you with one option: make the wall supply high enough that the battery never takes over until the wall supply is falling through its dropout voltage threshold.

If you don’t have an adjustable DC power source there to test with — as all reasonable labs do :nerd_face: — then the next common step up in commercial supplies is 30V. The only reason I can think of not to buy one and try it is that you are using the RB5009’s PoE output and the powered device can’t handle that much voltage.

First of all, I GREATLY appreciate all this information, thank you so much!

Right now the 2pin voltage from the battery bank is at 27.0V and the jack voltage is at 24.3V. There hasn’t been a time in the last few months that the 2pin battery bank voltage would be anywhere close to 24V as the overnight voltage generally won’t dip below even 25V this time of year at least…

I will try a new AC supply to the jack voltage, and I will remove the 2pin connector altogether just for testing purposes for now. If that 100% solves the issues I know for sure it’s a power supply causing the outages.

There’s literally NOTHING else showing in the logs to point me in any direction. It’s just the ethernet and SFP all dropping and then coming back a second later, and then a whole bunch of log entries about BFD going down, OSPF reconnecting, etc.

“If that 100% solves the issues I know for sure it’s a power supply causing the outages.”

And if not, then you will not only have shut my mouth with data, you will have switched to a wall supply that keeps your router from constantly vampiring power from the battery when it isn’t needed.

Even the worst case is a win here. Yay!

Does this imply that the battery bank is actually recharged by some kind of solar plant? :question:

Usually 27V is the float/cutoff voltage for battery chargers used on (nominal) 24V batteries, and 27.6V or 28V is the charging voltage.

The AC power adapter at 24.3V seems normal, possibly a bit on the low side, since it is the no-load voltage (as the battery bank should be clearly prevailing).

Well, this is interesting…

Power - I think I can rule that out. I tried it with a new power supply, same issues. Tried it with ONLY the jack voltage and no 2pin power from DC, same issues. Tried it with ONLY the 2pin DC voltage and no jack connector at all, same issues.

However, somehow, for some reason, it’s different/better???

Yesterday and today, it would stay up (no ethernet drops) for 7hrs, then 11hrs, then as much as 16hrs today. It would never go more than a few hours max before. When it does start to drop the ethernet and SFP, it still drops all of them at exactly the same time, and it usually does this 5+ times in a row, anywhere from a few seconds to a few minutes apart, and then it’s good again for several hours - as much as 16hrs today??
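
For anyone trying to track a pattern like this, a quick Python sketch for grouping link-down times into bursts is below. The timestamps are invented (seconds), and the 10-minute gap used to split bursts is an arbitrary assumption:

```python
def group_bursts(timestamps, max_gap_s=600):
    """Group event times into bursts; a gap longer than max_gap_s starts a new burst."""
    bursts = []
    for t in sorted(timestamps):
        if bursts and t - bursts[-1][-1] <= max_gap_s:
            bursts[-1].append(t)
        else:
            bursts.append([t])
    return bursts


# Hypothetical link-down times: a burst of four drops, ~11 quiet hours, then three more.
events = [0, 120, 300, 450, 40000, 40090, 40200]
for burst in group_bursts(events):
    print(len(burst), "drops starting at", burst[0])
```

Lining the burst start times up against what the attached devices were doing at those moments can help narrow down a trigger.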

Could it be some sort of networking-related issue, like a failing device causing the router to drop all connections? There are zero details in the logs aside from interface down/up and then the multiple BFD/OSPF msgs as things reconnect.

I tried creating a firewall input rule to log packets heading to the router itself, but aside from legit traffic, there’s only a handful of TCP SYN packets, maybe 10 a minute at most, nothing that would really spike the CPU, etc.

I’ve been connected via winbox while the event happens, and often it doesn’t even drop my winbox session at all. The voltage is solid, the CPU temp is at 45C, right in the normal range, the CPU is hovering around 20% with no spikes, and memory is flat at 800+ MB free. There is literally zero indication of something “wrong”, yet it still keeps happening!?? As I am typing this it’s flapping up/down like crazy, every min or so.

I have not been stumped like this for a LONNNNNNG time. My only guess is that it’s some sort of traffic on the network hitting the router and causing it?? Is that even possible?

I am having a similar issue on my CRS310-8G+2S+ on “7.18.2”. It’s very intermittent, and I can’t tell if it’s just the gateway that goes or the interfaces that drop. The logs do not show interfaces dropping, but after some time traffic just stops on all interfaces, even though none of the interfaces show link downs.

Not sure if, in routing > nexthops, I should be seeing vlan10 with a Flap Count of 3041712727 (and similar for all other vlans), or whether this is normal.

But I never had this issue before updating to 7.18.2 a few weeks ago.

Just a thought—maybe try enabling debug logging for interface (perhaps other topics). Other ideas: downgrading to a lower ROS version; making supout.rif while the issue is happening and getting that to MikroTik.

I sent the supout to Mikrotik over a week ago, just waiting on their response. I will look into debug logging, thank you.

So just to update everyone that reads this thread if they run into this in the future… in the end the issue for us was a PXP radio link that is plugged into this router. There was no real indication that this was the root cause, but we tried taking each PXP link offline one by one, and this one UBNT AirFibre 2.4GHz link was causing every single ethernet port and even the SFP to drop??? Now that the link is offline we’ve had zero drops for 5 days straight! Luckily this was just a backup link, so we have opted to leave it off for now. Very strange one!

I was helping a colleague and your post popped into my mind. Just an idea, and don’t treat it as more, but I’ll jot it down here anyway.

A similar behavior arises when a port goes down and it is part of a bridge that doesn’t have an admin-mac assigned manually. Mikrotik (and generally Linux) systems default to taking one of the ports that is running (I’m not sure exactly which - this may depend on the order they were added to the bridge, lowest interface number, etc.) and assigning its MAC address as the bridge’s MAC address. The problem is the running part, because if the selected interface goes down then the MAC address of the bridge changes (and changes back once it comes up again), which leads to apparent traffic loss until the ARP entries on the other hosts get updated. This leads to an effect similar to an all-port flap in reaction to what is actually a one-port flap.

Could this be happening to you?

Interesting. I cannot confirm this behavior on my RB5009 (RouterOS 7.18.2, auto-mac=yes). When the Ethernet interface that provides the bridge MAC address goes down, the bridge MAC address does not change. The Mikrotik documentation does not mention the behavior you describe either. On what hardware and RouterOS version did you observe the bridge MAC address change when the providing port changes state?

AFAIK it is not that the MAC of the bridge changes the very moment the port goes down, for whatever reason; rather, there are some events/changes that may trigger this change, depending on a number of factors.

The behaviour is known, and the official Mikrotik recommendation is to set auto-mac=no and manually assign the MAC to the bridge:
http://forum.mikrotik.com/t/bridge-auto-mac-issue/162131/1

Besides, it is what the default script running on most devices does at first configuration.

Like other Mikrotik “peculiarities” I personally believe that the best course of action is to avoid having a configuration that may - even in only a few, rare cases - create the issue, since it costs nothing or nearly nothing, hence I listed it as Rule #6:
http://forum.mikrotik.com/t/the-twelve-rules-of-mikrotik-club/182164/1