450G + 5.0 Beta1 issue

After 40 hours I had a lot of hangs.

I have downgraded to 4.9 and will run for a few days to see if I have any hangs at all..

Another issue that affects 4.9 and 5.0B2…

For some reason connecting to the metarouter via winbox will suddenly stop working from the WAN side. The LAN side works fine.

This led me to ping it from the WAN side and this also seems dead..

These are enabled via the firewall. They normally work. At some point they just stop working. Rebooting does not bring them back. It’s just sorta random.

It’s like there is something wacky with the firewall. The rules look fine.. Again, it works sometimes and sometimes it does not. It will fail while running as well.

NAT’ed traffic seems to flow through it ok.

38 hours of perfect operation without any hangs at all by downgrading to 4.9.. So no doubt something got wacky with the metarouter with 5.0+

Well I’m sure the Mikrotik guys will find it. Difficult issue tho as it’s intermittent and not perfectly reproducible.

I’m done testing it for now. I will revisit this as new betas come out.

I believe 4.9, and maybe further back, has the same issue but far less frequently. I just saw a hang for around 3 minutes, then it returned back to normal. 3 days have gone by with perfect operation.. I have a second 450G that I reset/config’ed/started at the same time as this one with an identical setup EXCEPT the metarouter is disabled. It is still going without an issue.

So maybe the problem exists already in 4.X but for some reason gets way worse in 5.X ?

This is where I am with this problem.

This problem appears on the RB450G when MetaROUTER is enabled. But I am not sure how exactly MetaROUTER causes the router to freeze, or why.

I have tried everything, starting from running OpenWRT with Asterisk (no crashes in 2 months or more) and RouterOS with OSPF as part of the network (no freezes for 2 months until today, 15.05.2010).

It seems that disabling some packages makes it more stable, but leaving all the packages there, it still can work stable for long periods of time (OSPF router).

BTW, running 2 instances did not change anything.

This is a difficult problem to isolate. I wish I could help more.

The config I was using caused a hang of a mikrotik Metarouter about once a day. I sent in my config to support and I gave support a login into my router.

I am on 4.9 now and testing for stable operation. Next week I am going to go back to 5.0B2 and will share my config and provide support a login.

Some random thoughts on the problem:

What hangs ? Does the CPU hang or does it have 100% CPU usage ? Some hardware monitoring watching the logic would reveal a lot. A Logic analyzer connected to the router CPU should show exactly what happens when it hangs. This might provide the answer or a very good clue.

Maybe some outside automated attack coming from the internet causes a CPU spike ? There is A LOT of junk that goes around on the net and maybe something causes a MetaROUTER to hang ? To test this, a 450G would have to be disconnected from the net and still fail. I might try this test.

5 days of perfect operation on 4.9 -without- the metarouter running.

I have now installed 5.0B2 with the Metarouter disabled.

I did a full reset after the upgrade and used the terminal to do the config. I noticed it’s on bootloader 2.6 for 5.0B2.

I am using pingplotter on 2 computers. One computer is pinging from the lan side to it and also through it to the cable modem. I am using the second computer to ping it from the wan side and also ping a device that goes through NAT. This is the same setup I used to test 4.9.

Multiping is set to ping 3 times a second so I will see any hang.

I am going out of town for a few days and it will chart the results while I am gone.

After a few days of testing I am going to enable Metarouter and continue testing.

I’m doing this to carefully verify the problem. I was not doing full config resets before, so I want to make sure I have this issue well understood.

During this testing I have discovered an issue with my new DOCSIS 3.0 Cisco cable modem, the DPC 3000. About once a day it loses connectivity for about 10-20 seconds. This was an interesting discovery.

5 days of flawless operation with 5.0B2 on a 450G and 750G.

I’m not sure this has any meaning but I am doing more in depth testing.

A sample 24 hr plot. Note the scale. Top of the scale is 0.1ms.. I am using multiping to ping the LAN side of the router and ping the cable modem thru the router.
I am pinging at .1 Sec intervals, 10 times a second. This provides a high resolution plot.

It turns out the load on the router’s CPU can be sensed by how quickly the ping is returned.

Top chart is the 450G
Bottom chart the cable modem thru the router

Typical 24hr period. The cisco cable modem does some weird stuff.. The 450G is perfect… Over days.
24hrNoMetarouter.gif
Once I started the Metarouter these plots changed and got quite noisy. I presume a higher CPU load to run the Metarouter. But odd patterns emerged. Fairly long periods of what might be higher CPU usage.. The loads on all the routers had not changed at all. This was not seen in 5 days of plotting…

Note this is a 3hr chart and the scale is larger..
3hrMetaRouterON.gif
This is most likely meaningless.. But interesting :slight_smile:

This is logical but maybe not precise. Can someone confirm this? Is this true at least for Linux?

A very good post by the way. I like how you went really deep to investigate. I like that the ping investigation is plotted and can be so easily presented here.

I think that works on any computer.. BUT yea the differences would be small…

Multiping gets down to .1 but you can see .05ms… Of course MANY things can influence it..

So a baseline… This is the router with the MetaRouter set to disabled.. Note the scale. Full scale is 0.1ms. This is a 24 hour chart recorded with 100 pings per second resolution. I’m pinging the LAN side of the router. I have Multiping options @ .01 s ping interval, 0 interval between pings, 28 byte packet size..
WithoutMeatRouter.gif
Then I turned on the MetaRouter.. My CPU usage increased; Winbox said about 2%-7%. In fact it’s amazingly visible.. Note the scale is 10 times higher now at 1ms. So this is at least 10 times noisier.. The red spots are 100% packet loss, this is the “hang”.. During 24hrs it got hung up twice. But at different intervals, with no log entries, and not triggered by any action I can see. They are also not a set distance apart in time. The hangs were 2 min 49 seconds and 9 min 50 seconds.
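For anyone who wants to pull hang lengths out of logged ping data instead of reading red blotches off a chart, here is a minimal Python sketch. The sample format (send time in seconds, RTT in ms, `None` for a lost ping) and the names `find_hangs` and `format_duration` are assumptions made up for illustration; this is not something Multiping exports in this form as far as I know.

```python
from typing import List, Optional, Tuple

# One sample = (send time in seconds, round-trip time in ms or None if the ping was lost).
# This layout is a hypothetical format chosen for the sketch.
Sample = Tuple[float, Optional[float]]


def find_hangs(samples: List[Sample], min_duration_s: float = 1.0) -> List[Tuple[float, float]]:
    """Return (first_lost, last_lost) send times for every run of consecutive
    lost pings that spans at least min_duration_s."""
    hangs = []
    run_start = None   # send time of the first lost ping in the current run
    last_lost = None   # send time of the most recent lost ping
    for t, rtt in samples:
        if rtt is None:
            if run_start is None:
                run_start = t
            last_lost = t
        else:
            # A reply arrived, so any run of losses has ended.
            if run_start is not None and last_lost - run_start >= min_duration_s:
                hangs.append((run_start, last_lost))
            run_start = None
    # Handle a run of losses that continues to the end of the log.
    if run_start is not None and last_lost - run_start >= min_duration_s:
        hangs.append((run_start, last_lost))
    return hangs


def format_duration(seconds: float) -> str:
    """Format a number of seconds as 'M min S s'."""
    minutes, secs = divmod(int(round(seconds)), 60)
    return f"{minutes} min {secs} s"
```

Single dropped pings fall below `min_duration_s` and are ignored, so only real outages like the two hangs above would be reported.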

The more interesting parts are the raised ping times that last for 10-30 minutes.. Must be higher CPU usage ? There is no additional data traffic during these intervals. In fact really heavy data flow does not seem to affect these results. High data flow does not lead to much increase in ping times at all.
MetarouterOn.gif
Later I turned off the MetaRouter and everything returned to normal..
Metarouterturnoff.gif
So what does this tell us… Most likely nothing !.. Might just be normal…

Something fairly serious happens tho when you turn on a metarouter.. The router is obviously stressed more..

I’m a performance freak that would almost go to extremes. So if I needed Asterisk for example I would run that in a separate OpenWRT-enabled device or even - waaaaay better - in a mothaf’ing second hand x86 piece of scrap metal desktop PC from our ancestors :smiley: (overclocked, stress-tested, coooled, and with new caps re-soldered in the PSU and mobo by my trusty electronics guy).

Unless MikroTik can prove to us that the main routing performance will be intact when you run something with Metarouter/KVM :slight_smile: So far you mate, have proven the opposite :smiley:

On PPC boards (RB1000, RB800) MetaROUTER is stable. On the RB433AH it hasn’t hung for several months, and the only reboots are for installing a new version.

The only black sheep is the RB450G, which has weird issues with MetaROUTER running on it.

KVM is a completely different story. As far as I have seen, it has never crashed. And I am running Ubuntu and RouterOS as 2 KVM guests on x86 with 4 cores. There the impact of running a guest OS is negligible, because the main packet processing happens on a single core anyway. The only slowdown can happen if the guest OS does something memory intensive.

Also, on every single-core router you will feel some drag when you start using virtualization. Consider that there is a completely separate OS with its services, and all the interrupts have to be processed by the guest and then by the host OS. And for some operations that introduces quite an overhead. But that should be expected. On the other hand, what good is hardware if it runs with 10 - 20% load, if adding a guest brings it up to 40 - 50% while costs increase only by a maximum of 10%, if that much at all?

Thank you for saving virtualisation for us, JanisK.

So if I have an x86 MT Router that runs at 40% CPU load at max, in peak hours, and if I add a KVM with Asterisk and see that the CPU goes only up to 80% for example… Should I expect no rise in ping times, still good latency.. still perfect routing performance?

P.S. Another important question. Let’s say the x86 MT Router goes to 90% CPU usage but very rarely, for a few moments from time to time in peak hours. And let’s say that I have a KVM with a service running on it that is not so important. For example a website, a forum, a CRM, who knows. What if RouterOS could take complete priority over the KVM? So that when RouterOS needs the CPU - it gets it without any delay whatsoever. And when the KVM needs the CPU - it waits for RouterOS to free some. So the service running in the KVM would work a little slower but this would happen very rarely. Is this prioritising of processes inside the router possible? :slight_smile:

The main router ping time increases as a result of the virtualization are extremely minor. They look dramatic in the chart, but an increase of .5ms is really minor.

The advantages of using virtualization are many. It’s truly amazing you can do it on a $120 router !

Is the performance of the MetaROUTER the same on a single CPU 450G as the main router ? Of course not. That’s to be expected. Are there trade-offs for using a metarouter ? Of course.

It’s a stunningly cool option. It would be cool to get it to work on the 450G correctly :slight_smile:

Virtualization is not exactly simple. This issue is also intermittent and so far not easily reproducible. It’s gotta be annoying to try and fix.

What is interesting to me in the charts is the elevated response times that last 20-30 minutes. I’m not sure what runs for 20-30 minutes at a time in the router. It also occurs somewhat randomly but separated by hours. That’s a long time for a process to run taking up some CPU time.
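One way to pick these long elevated stretches out of the data without eyeballing the chart is to compare per-minute median ping times against the quiet baseline. A rough Python sketch follows; the function names, the 60 second bins, the factor of 2 over baseline and the 10 minute minimum are all assumptions picked for illustration, not anything measured on the 450G.

```python
from collections import defaultdict
from statistics import median


def bin_medians(samples, bin_s=60.0):
    """Group (time_s, rtt_ms_or_None) samples into fixed-width time bins and
    return {bin_index: median RTT}, skipping lost pings."""
    bins = defaultdict(list)
    for t, rtt in samples:
        if rtt is not None:
            bins[int(t // bin_s)].append(rtt)
    return {b: median(v) for b, v in sorted(bins.items())}


def elevated_periods(samples, baseline_ms, factor=2.0, bin_s=60.0, min_bins=10):
    """Return (start_s, end_s) spans where the per-bin median RTT stays above
    factor * baseline_ms for at least min_bins consecutive bins."""
    threshold = baseline_ms * factor
    periods = []
    run = []  # indices of consecutive bins that are above the threshold
    for b, m in bin_medians(samples, bin_s).items():
        if m > threshold and (not run or b == run[-1] + 1):
            run.append(b)
        else:
            # The run ended (or there was a gap in the data): keep it if long enough.
            if len(run) >= min_bins:
                periods.append((run[0] * bin_s, (run[-1] + 1) * bin_s))
            run = [b] if m > threshold else []
    if len(run) >= min_bins:
        periods.append((run[0] * bin_s, (run[-1] + 1) * bin_s))
    return periods
```

Using the median per bin means a few isolated spikes inside a minute do not count as an elevated period; only a sustained shift does.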

I have not run Multiping against the metarouter yet. I am currently just running it against the main router. I will setup multiping to ping the metarouter and see what happens.

Those 20-30 mins graph activity may be some process scheduled underneath. A cron job inside RouterOS. I hope those don’t harm routed packets performance.
Oh wait. Those .05ms increases may be due to a process on the host that sent the ping requests. Or due to network activity along the way.

That’s why I have done a week’s worth of testing… None of the noise or steady increases appears in the tests until I activate the Metarouter. I have 24 hours of flat line charts. .1ms full scale. Just a black flat line almost… This eliminates all other sources of unwanted interference. I am connected with everything gigabit with good 3 foot cables.

I’m also using 3 different computers and see the same spikes at the same time on all three. So the computers are not influencing the results in any meaningful way.
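The "same spikes on all three computers" check can also be done on logged data. Below is a small Python sketch, assuming each computer has a list of spike timestamps in seconds and that the three clocks are reasonably in sync. The name `common_spikes`, the host labels and the 1 second tolerance are hypothetical.

```python
def common_spikes(spikes_by_host, tolerance_s=1.0):
    """Return the spike times from the first host that every other host also
    saw within tolerance_s seconds.

    spikes_by_host maps a host label to a list of spike timestamps (seconds).
    Assumes the hosts' clocks are synchronised to well within tolerance_s.
    """
    hosts = list(spikes_by_host)
    reference, others = hosts[0], hosts[1:]
    shared = []
    for t in spikes_by_host[reference]:
        # Keep the spike only if each other host has one close to it in time.
        if all(any(abs(t - u) <= tolerance_s for u in spikes_by_host[h]) for h in others):
            shared.append(t)
    return shared
```

A spike that shows up on only one computer is more likely that computer’s own doing; one that shows up on all of them points at the router.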

In reality I can run just tons of traffic through the router and see very little difference in these charts. I can get a slight peak when I first start winbox.

If I write a script loop I can create much bigger sustained stuff on the charts.

So data throughput hardly seems to influence the charts at all. CPU usage does influence the charts.

ANYWAY…

It’s interesting… And easy to try yourself. Pick a fast computer with a gigabit interface and hook it to the router with a short high quality cable.

Go get the free demo of Multiping http://www.nessoft.com/multiping/download.html

Add in your router IP address… Change the ping interval by TYPING in .01 and press enter.

Set some options… Edit>Options>Packet
Time Interval between pings: 0
Packet Size 28
Number Of Samples to hold in Memory 0

Apply/save and you’re done…

Right click on the chart to look at different time scales… Its possible to add custom time scales to look at 14 days for example.

O M G…

I might have progress… At the very least I have clearly affected the issue…

I was thinking about how it could only affect the 450G and not other routers based on the same chips. I also remembered that someone thought that the power supply made a difference..

I am an analog electrical engineer. One of the things I do is make analog electronics…

I looked at the router and decided to supplement the 330uF cap that is across the 1.2Vdc supply for the CPU… I went completely overboard and put a 4400uF cap there, which would really make sure that CPU supply was s t a b l e… So a more than 10 times increase… I might get around to doing the other power supply caps..

The idea here is that some spike in CPU usage is causing a lower than normal voltage on the 1.2V supply for the CPU for a very brief time, causing weirdness and causing hangs…
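A rough back-of-the-envelope for why a bigger cap might help, under some loud assumptions: the capacitor alone supplies an extra current step ΔI for a short time Δt, the regulator’s own response and the capacitor’s ESR are ignored, and the 4400uF is added in parallel with the original 330uF (4730uF total). None of these values were measured on the board.

```latex
% Voltage droop on the rail if the capacitor alone supplies the current step
\Delta V \approx \frac{\Delta I \,\Delta t}{C}

% Ratio of droop after the mod to droop before, for the same \Delta I and \Delta t
\frac{\Delta V_{\text{after}}}{\Delta V_{\text{before}}}
  = \frac{C_{\text{before}}}{C_{\text{after}}}
  = \frac{330\,\mu\mathrm{F}}{4730\,\mu\mathrm{F}}
  \approx 0.07
```

So under this simple model the same current spike would pull the 1.2V rail down roughly 14 times less than before. A real supply will behave differently, so treat this as an order-of-magnitude argument only.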

This has had an apparent effect… I am still testing, but I have not had ANY hangs on the main router in 20 hours. That is a huge difference. BUT I am still testing to make sure.. Also I have seen something interesting occur that was not visible before and might indicate something important…

I think what happens now is a CPU usage spike occurs but the main router stays working and does not suffer from a power issue when a huge increase of current is required for the usage spike..

AGAIN IM JUST GUESSING AT MOST OF THIS… BUT I have some interesting results…

This was before my cap mod.. This shows red packet loss when the router hung up. It also shows the ever present increases in ping times sustained for 10-30+ minutes..

I then did the cap mod.. Strangely the plot is cleaner.. But the important thing is no red blotches..

There is however a weird increase in ping times that went off chart..

Zooming in on this section of chart..

Zooming in more on just the 4 minute high spike and changing the vertical scale so I can see the whole spike, I notice a much shorter spike that is WAY off the scale.. And this spike precedes a much higher than normal ping time..

Looking even closer

Zooming in more on just that spike and allowing a MUCH higher vertical scale..

This yielded a HUGE ping spike which was around 6000ms or 6 seconds.. This lasts for around one second and I have about 80 pings for that second.

THIS spike is unlike anything I have seen during my testing so far.. This was some REALLY HIGH CPU usage.. However it did NOT cause the MAIN router to hang up this time.. Maybe that cap allowed the CPU to keep running through this obvious spike in usage and CPU current draw..

The Metarouter however DID hang up. It locked up at exactly the moment of this spike…




So…

By placing a MUCH bigger cap on the power supply rail to the CPU I made it more stable and doing this kept the main router working while something in the metarouter caused a huge increase in CPU demand and current flow from the supply…

Now what software issue is causing the metarouter to produce runaway CPU usage for a VERY short 1 second period, I don’t know…

I think I did confirm what the other guys were saying about power supplies possibly making a difference tho and this might explain why..

I’m going to continue my testing. I want to confirm that by placing a MUCH bigger cap on the power supply rail I have stabilized the CPU to make it through this impressive spike in usage and current.. I am hoping I will not see another main router hang. I expect to see Metarouter hangs however…

INTERESTING… Wow this problem has been pretty fun to work on.

I think if I help isolate and resolve the issue i should get a free 450G :slight_smile:

Please post pics. I want to see the exact capacitor you replaced, and I want to see the one that was put there in manufacturing to compare with my RB450Gs. I can’t believe how helpful your work is. A little research can get you light years ahead. So thank you.

I spoke too soon… I think I am mistaken…

Man this problem is annoying…

From the last 12 hours.. Clearly the problem is still there, just like it was before… I think…

So I am afraid Network Pro you will need to take back your Karma vote ! hehehehe

Hmmmmmmmmmm…


This is from pinging the main router on the LAN side…

And this is from pinging the MetaROUTER from the Meta Router LAN side. Notice how pings went up and stayed up. Notice right at the end when I rebooted the router and pings came back to normal.. And note how some hangs on the main router did not result in lost packets, just really long ping times..