CPU 100% after upgrade on x2 routeros

mdkberry · March 23, 2017, 8:30am

Hi
I have an RB2011 router and a DuxSpot-IN with an RB912UAG-2HPnD wireless CAP AP.
I upgraded both to RouterOS 6.38.5 (from 6.33.5, where everything was fine) and just noticed from the graphs that since then (2 days now) they have both been at 100% CPU with no let-up.

EDIT: just checked the interfaces and all other graphs, and there is no change in traffic or increased load apparent anywhere. It seems to be ticking along as normal, but the CPU load just sits at 100% on both devices. On the CPU resources page, IRQ % fluctuates between 0 and 3 and disk % sticks at 0.
It isn't slow to work on through the management interface, and no one has complained of issues at the office.

Though on a second check, it does respond with a delay when I access it from the terminal; it is fine from a web browser.

EDIT 2: Digging a bit deeper, /tool profile shows ‘management’ using 75% or more.
Any suggestions or ideas?

BartoszP · March 23, 2017, 11:25am

Have you restarted them since the upgrade?

mdkberry · March 24, 2017, 8:25am

astounded that worked but it did. I honestly rebooted those suckers at the time after the upgrade, and yet a restart today fixed the issue. thanks

BartoszP · March 24, 2017, 9:01am

So we have a new rule of thumb: double restart after upgrade.

mdkberry · March 24, 2017, 11:39am

Spoke too soon.
The CPU has been fine since the reboot, but now the RB912UAG CAPsMAN AP cannot get a DHCP lease from the RB2011 router.

I have managed to access it remotely from the RB2011 using MAC-Telnet and can see that no settings have changed in the config,
but it won't pick up its IP from the DHCP server running on the RB2011.

It sits saying ‘searching’ under /ip dhcp-client and says there is no route to host.
The RB2011 is saying ‘offering lease to x.x.x.x without success’.

I can set a static IP on the RB912UAG and then ping each way, but I still cannot access it via web browser and it still refuses to collect a DHCP client address.
What might have changed between RouterOS 6.33.5 and 6.38.5 that would affect this?

I am going through the firewall and changing the logging to see if I can spot something impacting the DHCP service between the boxes. Neighbour discovery is giving me a green light, but I clearly cannot reach the AP. It appears as if the CAPsMAN wifi is up and working, but I cannot test that for certain.

??

I produced a supout.rif and sent it to support, but again, any ideas? Has anyone else had trouble with the DHCP service after upgrading?

mdkberry · March 24, 2017, 11:06pm

Turns out this looks like it may be a bug in the current release, 6.38.5, as others are reporting the same problem here:
http://forum.mikrotik.com/t/ros-6-38-serious-dhcp-server-problem/105259/1

pe1chl · March 25, 2017, 10:28am

In this case, yes. That was written in the release notes of an in-between version, but of course the poster
missed it when he made such a long jump.

What I normally do:

update RouterOS + reboot
go to system/routerboard
upgrade Firmware + reboot

Then you cover everything in 2 reboots.
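As a rough sketch, those steps could look like this on the RouterOS v6 command line. It assumes the router can reach the update servers and that a newer firmware is actually offered; check the output of each print/check command before continuing.

```
# Step 1: check for and download the new RouterOS package, then reboot to install it
/system package update check-for-updates
/system package update download
/system reboot

# Step 2: once the router is back up, compare current-firmware and upgrade-firmware
/system routerboard print
# Write the new firmware (asks for confirmation), then reboot again to activate it
/system routerboard upgrade
/system reboot
```

The firmware written by the upgrade command only becomes active after that second reboot, which is why the procedure needs two.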

pukkita · March 25, 2017, 7:03pm

@mdkberry:

What’s new in 6.38 (2016-Dec-30 11:33):

Important note!!!
RouterOS v6.38 contains STP/RSTP changes which makes bridges compatible with IEEE 802.1Q-2014 by sending and processing BPDU packets without VLAN tag.
To avoid STP/RSTP compatibility issues with older RouterOS versions, upgrade RouterOS to v6.38 on all routers in Layer2 networks with VLAN and STP/RSTP configurations.
The recommended procedure is to start by upgrading the remotest routers and gradually do it to the Root Bridge device.
If after upgrade you experience loss of connectivity, then disabling STP/RSTP on RouterOS bridge interface will restore connectivity so you can complete upgrade process on your network.

Try this, or recreating the bridges.
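For reference, a minimal sketch of disabling STP/RSTP on a bridge, as the release note suggests. The bridge name `bridge1` is a placeholder, not something taken from this thread; substitute your own.

```
# List the bridges and check their current protocol-mode
/interface bridge print detail

# Temporarily turn off STP/RSTP on the affected bridge (placeholder name)
/interface bridge set [find name=bridge1] protocol-mode=none

# After every device on the Layer2 segment has been upgraded, turn it back on
/interface bridge set [find name=bridge1] protocol-mode=rstp
```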

I think that, due to this major change, 6.38.x still needs some glitches ironed out in this regard. I’ve experienced that same symptom even on routers with nothing else attached but PCs.

I tend to stick to bugfix as best practice…

@pe1chl

What I normally do:

update RouterOS + reboot

go to system/routerboard

upgrade Firmware + reboot

Same here

I add an additional step when possible:

update RouterOS + reboot
go to system/routerboard
upgrade Firmware + reboot
reset to no defaults + reboot
connect by mac telnet + enable RoMON
connect by RoMON, restore configuration.
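A hedged sketch of the extra steps, assuming "restore configuration" means re-importing a text export (a binary backup would be restored differently). The file name `pre-reset` is a placeholder; keep a copy of the export off the router too, in case the file does not survive the reset on your device.

```
# Before resetting: save the configuration as a text export (placeholder file name)
/export file=pre-reset

# Wipe the configuration without applying the default config; the router reboots
/system reset-configuration no-defaults=yes

# After reconnecting with MAC-Telnet: enable RoMON
/tool romon set enabled=yes

# Over RoMON: re-import the saved export
/import file-name=pre-reset.rsc
```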

pe1chl · March 25, 2017, 7:48pm

Do you do this after every upgrade, or only when chasing difficult problems?
I have sometimes wondered whether there can be an inconsistency between what you see in an /export and what is
really happening in the router. I believe there is a software layer in RouterOS that handles the command line
and GUI and then creates config files for daemons, and /proc or /sys files, depending on your config. In rare
cases, like crashes or unexpected events during config changes, the two could get out of sync.
However, I have never actually fixed a problem by going back to defaults and restoring the /export.

pukkita · March 25, 2017, 8:04pm

When dealing with devices under suspicion, and every time I set up a new device.

Current RouterOS NAND handling is really robust, I just reset to no defaults now when setting up new devices - but I have fixed troubled devices this way in the past.

mdkberry · March 26, 2017, 2:24am

Thanks for the suggestions.

FYI, there is a solution for the DHCP issue I mentioned (I also posted it in the thread I linked).

Basically, in the wireless bridge settings on the RB2011, enter the bridge's own MAC address as its Admin MAC address, which in my case was empty.
This then allows the AP to find the RB2011 and confirm it has received a DHCP lease.
The AP needs no changes, and STP on the RB2011 does not need to be disabled.
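From the command line, the fix could be sketched like this. The bridge name `bridge-wlan` and the MAC address are placeholders, and setting `auto-mac=no` alongside the Admin MAC is my assumption about how to make the value stick, not something confirmed above.

```
# Show each bridge with its current mac-address, admin-mac and auto-mac values
/interface bridge print detail

# Pin the bridge to a fixed MAC: use the mac-address shown above (placeholder here)
/interface bridge set [find name=bridge-wlan] auto-mac=no admin-mac=AA:BB:CC:DD:EE:FF
```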

simple fix for an epic problem.

pukkita · March 26, 2017, 8:35am

Usually setting no admin address means ROS will auto select it.

Funny I didn’t suggest trying that as caring about this was an issue of the past, seems 6.38 needs some help in this regard.

Do you still have the logs? Which MAC address was being used by DHCP previously?

Good job!

mdkberry · March 27, 2017, 1:13am

Not kept them. With the RB2011 having an empty Admin MAC and the AP having its own MAC address in its Admin MAC, I figure the break in DHCP communication was on the return path, since it had no MAC other than its own and was failing there. Even setting a static IP didn't help. Not sure whether wireless devices would have worked or not, but everyone is in this morning and there are no complaints, so that confirms the Admin MAC solution works.