timberwolf
Member Candidate
Topic Author
Posts: 274
Joined: Mon Apr 25, 2011 12:08 pm
Location: Germany

MetaROUTER stability issues on certain MIPSBE and PPC boards

Sun Apr 01, 2012 2:40 pm

As there isn't any news to report on MetaROUTER stability, it seems more and more as if MT has given up on MR.
Personally, I worked around the need for MRs by using dedicated hardware, and, apart from the usual MT software quality issues now and then, everything is well.

Maybe the most effective way of handling MR altogether would be to really give up on this feature, at least on MIPS, although the PPC line doesn't seem to be doing very well either, judging by the recent posts on this topic. The time spent by MT on MR, if any in the recent past, could surely be put to better use elsewhere.

EDIT: Some progress has been made for RB450G boards, see summary http://forum.mikrotik.com/viewtopic.php ... 50#p319788
EDIT2: Not a single word on any progress for about a month, despite regular requests for comment.
EDIT3: Some significant progress has been made with ROS 5.21RC1, see http://forum.mikrotik.com/viewtopic.php ... 38#p333305
EDIT4: Still unstable on MIPSBE and PPC as of 5.22 (rc2) and newer.
Last edited by timberwolf on Thu Jan 10, 2013 8:24 am, edited 8 times in total.
 
sergejs
MikroTik Support
Posts: 6619
Joined: Thu Mar 31, 2005 3:33 pm
Location: Riga, Latvia

Re: Has MikroTik given up on MetaROUTER?

Mon Apr 02, 2012 11:11 am

MetaROUTER is working fine for many customers in different setups.

Could you please give more information about your problem? We would appreciate a detailed problem description, with a support output file attached, sent to support (support@mikrotik.com). We will try our best to solve your issue.
 
NathanA
Forum Veteran
Posts: 801
Joined: Tue Aug 03, 2004 9:01 am

Re: Has MikroTik given up on MetaROUTER?

Mon Apr 02, 2012 11:51 am

Could you please give more information about your problem?
He is probably referring to this thread: http://forum.mikrotik.com/viewtopic.php?f=15&t=35800

I myself have only just recently started playing with MetaROUTER (and on RB450G), and have experienced the same reboot problems, which caused me to find the thread linked to above. I only started playing with it last night and encountered my first reboot within 10 minutes. Host 450G is minimally configured. Guest was latest OpenWRT build made by forum member liquidcz. Host rebooted for the first time minutes after launching the guest, and before I had a chance to make any configuration changes within the guest.

I would be happy to open tickets and submit supouts, but since I'm new to this problem and discussion, I'd like to spend some more time first to see if I can reliably find a way to reproduce the problem; I realize that if I just say "yep, I have the same problem," that doesn't really help you to find the cause, and I'd rather not waste your time. :)

I also have access to some PPC RouterBoards (including 1100AH) that I plan to test as well to see if I have better success.

-- Nathan
 
timberwolf
Member Candidate
Topic Author
Posts: 274
Joined: Mon Apr 25, 2011 12:08 pm
Location: Germany

Re: Has MikroTik given up on MetaROUTER?

Mon Apr 02, 2012 12:04 pm

MetaROUTER is working fine for many customers in different setups.

Could you please give more information about your problem? We would appreciate a detailed problem description, with a support output file attached, sent to support (support@mikrotik.com). We will try our best to solve your issue.
Sergejs, I and a handful of other users have constantly tried to support MT in solving the problems with MetaROUTER on the RB450G and other MIPS-BE boards. Nathan refers to the correct thread, but there are more, including a poll of mine regarding stability on PPC-based boards, with not so promising replies. What more do you expect from us, the users? We did everything we could, without any visible results from MT.

So please tell me and the other users who bought an RB450G just for MetaROUTER what we should do. No one will buy a PPC-based board for over $300 and hope for luck. And no one wants to run a MetaROUTER with basically no configuration inside.

Sorry, but please wake up.
 
sergejs
MikroTik Support
Posts: 6619
Joined: Thu Mar 31, 2005 3:33 pm
Location: Riga, Latvia

Re: Has MikroTik given up on MetaROUTER?

Mon Apr 02, 2012 2:19 pm

We use MetaROUTERs in our network, and they work fine without reboots. MetaROUTER is also being used successfully by many other users.

timberwolf,
I think there might be some specific configuration issue that results in the MetaROUTER reboot/crash on your router. What we need to do is find the problematic point in your setup and fix it, whether in RouterOS code or in your configuration.

We will be very happy to receive step-by-step instructions (something more than the few words "my router reboots")
on how to reproduce the MetaROUTER crash/reboot on the latest MikroTik RouterOS version, just as Nathan mentioned in his posts. We will try our best to fix problems as soon as possible.
 
timberwolf
Member Candidate
Topic Author
Posts: 274
Joined: Mon Apr 25, 2011 12:08 pm
Location: Germany

Re: Has MikroTik given up on MetaROUTER?

Mon Apr 02, 2012 2:34 pm

sergejs
Please bring yourself up to date by starting to read at this post: http://forum.mikrotik.com/viewtopic.php ... 70#p282770
We went through all sorts of tests; even with the simplest setup, MetaROUTER isn't usable on an RB450G.
I/We simply can't afford to do all these tests all over again...
I might be able to arrange remote SSH access to at least one blank RB450G which I have lying around at the moment. This unit is perfectly stable as long as no MetaROUTER is created; it is the unit which I used for the tests in the other thread. Would this be of use to you?
 
barkas
Member Candidate
Posts: 260
Joined: Sun Sep 25, 2011 10:51 pm

Re: Has MikroTik given up on MetaROUTER?

Mon Apr 02, 2012 2:42 pm

We use MetaROUTERs in our network, and they work fine without reboots. MetaROUTER is also being used successfully by many other users.
I can't quite believe you.

There has not been one single user in here who has stated that it works for him so far.
 
NathanA
Forum Veteran
Posts: 801
Joined: Tue Aug 03, 2004 9:01 am

Re: Has MikroTik given up on MetaROUTER?

Mon Apr 02, 2012 2:52 pm

I might be able to arrange remote SSH access to at least one blank RB450G which I have lying around at the moment. This unit is perfectly stable as long as no MetaROUTER is created; it is the unit which I used for the tests in the other thread. Would this be of use to you?
I could also arrange to do something similar on our end. Once I have made the necessary arrangements, I will contact support with access details.

-- Nathan
 
janisk
MikroTik Support
Posts: 6283
Joined: Tue Feb 14, 2006 9:46 am
Location: Riga, Latvia

Re: Has MikroTik given up on MetaROUTER?

Mon Apr 02, 2012 4:14 pm

go and read through my replies in these threads. all information has been delivered to the developers and different configuration retested over and over again.

take RB433AH and run metarouter there.
 
timberwolf
Member Candidate
Topic Author
Posts: 274
Joined: Mon Apr 25, 2011 12:08 pm
Location: Germany

Re: Has MikroTik given up on MetaROUTER?

Mon Apr 02, 2012 8:53 pm

go and read through my replies in these threads. all information has been delivered to the developers and different configuration retested over and over again.

take RB433AH and run metarouter there.
Well that's exactly the kind of answer I was expecting, thanks for nothing.
And of course there are soooo many reports of stability on an RB433AH that we could rely on....
And YES, we always go and buy another piece of hardware when the first one doesn't work reliably, and of course we don't care if the replacement has a different set of features.
If you find any sign of sarcasm in the sentences above, you may keep it.
 
reverged
Member Candidate
Posts: 270
Joined: Thu Nov 12, 2009 8:30 am

Re: Has MikroTik given up on MetaROUTER?

Mon Apr 02, 2012 11:46 pm

sergejs and janisk,

Can you detail the stable metarouter config used in your network?

- Which device?
- Which image (or ROS metarouter?)?
- Which firmware/ROS version/packages?
- export compact the config?
- etc...

Then, perhaps, others can test this config and see if there is stability. If stability exists, then a delta to this basic config can be made to see where stability ceases. supout files then are perhaps more valuable.
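
To make the delta idea concrete, here is a rough sketch in Python. It is only an illustration: `is_stable` is a hypothetical callback that stands for "load this set of features, let the router soak for a while, and report whether it stayed up", and the feature names are made up, not taken from any MikroTik tool.

```python
from typing import Callable, Iterable, List, Optional


def first_unstable_delta(
    baseline: Iterable[str],
    extras: Iterable[str],
    is_stable: Callable[[List[str]], bool],
) -> Optional[str]:
    """Add one feature at a time to a known-stable baseline.

    Returns the first feature whose addition makes the setup unstable,
    or None if every feature could be added. Raises ValueError if the
    baseline itself is not stable, since there is then nothing to diff against.
    """
    config = list(baseline)
    if not is_stable(config):
        raise ValueError("baseline is not stable; no delta can be computed")
    for feature in extras:
        candidate = config + [feature]
        if not is_stable(candidate):
            return feature
        config = candidate
    return None


if __name__ == "__main__":
    # Hypothetical stand-in for a real soak test: pretend "ospf" is the trigger.
    def fake_soak_test(features: List[str]) -> bool:
        return "ospf" not in features

    culprit = first_unstable_delta(
        baseline=["bridge"],
        extras=["dhcp", "firewall-nat", "ospf", "pppoe"],
        is_stable=fake_soak_test,
    )
    print(culprit)
```

Since the reboots seem to be random, a real `is_stable` would probably have to run each step for days before calling it stable, so this only helps once a baseline that really holds is known.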

I have an RB450G spare that I can config and leave running. I will look for a 433AH...

Metarouter is one of the coolest features to ROS, but we need it stable.
I too have tried the liquid image on a 450G and it rebooted with just the metarouter started - no interface; no traffic.
 
neticted
Member Candidate
Posts: 121
Joined: Wed Jan 04, 2012 10:36 am

Re: Has MikroTik given up on MetaROUTER?

Tue Apr 03, 2012 12:12 am

[quote="janisk"]take RB433AH and run metarouter there.[/quote]

No need to buy just to try. I tried that. 433AH, configuration reset, loaded metarouter and with no other custom configuration it reboots.
 
NathanA
Forum Veteran
Posts: 801
Joined: Tue Aug 03, 2004 9:01 am

Re: Has MikroTik given up on MetaROUTER?

Tue Apr 03, 2012 1:38 am

all information has been delivered to the developers and different configuration retested over and over again.
One thing that may be helpful that I'm not sure if anyone has done yet is to attach a console logger to the serial port of a 450G, to try and catch any stack traces that the kernel may have printed to the console before it reboots (if it is even printing anything out; I haven't thought to watch the serial console output until now).

I can also arrange to have that done as well as to give you remote access to the logger.

-- Nathan
 
sergejs
MikroTik Support
Posts: 6619
Joined: Thu Mar 31, 2005 3:33 pm
Location: Riga, Latvia

Re: Has MikroTik given up on MetaROUTER?

Tue Apr 03, 2012 10:09 am

timberwolf
Unfortunately your reply is not very informative; there are no details about the configuration that does not work properly for you, which I could test to find out what is wrong.

barkas,
Do you have any live configuration now?
We would like to receive your report, submit it to support (support@mikrotik.com), the following information is required,
- support output file from physical router running 5.14 version;
- brief description about guest configuration;
- steps required to "crash" guest or instance.
- post your ticket number here, I will follow up the problem.

reverged,
- RB433UAH
- it is RouterOS
/system resource> print 
uptime: 5w5d15h49m25s
version: 5.14
free-memory: 17504KiB
total-memory: 29708KiB
cpu: MIPS 4Kc V0.10
cpu-count: 1
cpu-frequency: 680MHz
cpu-load: 7%
free-hdd-space: 475280KiB
total-hdd-space: 476224KiB
write-sect-since-reboot: 0
write-sect-total: 0
bad-blocks: 0%
architecture-name: mipsbe
board-name: RB MetaROUTER
platform: MikroTik
- default set of packages;
- configuration used on the router:
bridge, DHCP, Firewall Filter, Firewall NAT, DNS cache, OSPF + filters, PPPoE

NathanA,
console output might be very helpful. We are waiting for your report.
 
timberwolf
Member Candidate
Topic Author
Posts: 274
Joined: Mon Apr 25, 2011 12:08 pm
Location: Germany

Re: Has MikroTik given up on MetaROUTER?

Tue Apr 03, 2012 11:08 am

sergejs
Sorry, I am not quite sure what I had posted back then and what not. But what I did was as simple as the following:
1.) Netinstall RB450G with the following packages: routerboard, system, security, advanced-tools, routing, ppp, ntp
2.) Configure Name of RB450G
3.) Create a MR with 32MB RAM and 32MB Disk max.
4.) Add a static interface to MR, which only connects the host and the guest
5.) Configure an IP on each side of the interface
6.) Configure Name of Metarouter
7.) Setup a ping from Metarouter to Host
8.) Wait...
The uptime of this whole setup then varied from minutes to 1 day at best.
As I recall from barkas's reports, such a setup becomes a lot more unstable if you add some OSPF and MPLS setup to the MR.
But we will have to wait and see if he chimes in.

I will try and do the exact same setup this evening, if I find any spare time.

EDIT: I think I also sent at least two supout files. Please have a word with janisk, as he stated again that ALL information has been passed on to the developers.
 
barkas
Member Candidate
Posts: 260
Joined: Sun Sep 25, 2011 10:51 pm

Re: Has MikroTik given up on MetaROUTER?

Tue Apr 03, 2012 11:19 am

barkas,
Do you have any live configuration now?
We would like to receive your report, submit it to support (support@mikrotik.com), the following information is required,
- support output file from physical router running 5.14 version;
- brief description about guest configuration;
- steps required to "crash" guest or instance.
- post your ticket number here, I will follow up the problem.
First, nice that somebody finally woke up and tries to address the problem. I have no live configuration at the moment.
But, if you care to read the many threads here, it is easy to reproduce, because those things will crash periodically in any configuration.
In that respect, your insistence on bureaucracy offends me. Why don't you ask your colleague janisk, who should know everything about it?
 
janisk
MikroTik Support
Posts: 6283
Joined: Tue Feb 14, 2006 9:46 am
Location: Riga, Latvia

Re: Has MikroTik given up on MetaROUTER?

Tue Apr 03, 2012 2:00 pm

sergejs
Sorry, I am not quite sure what I had posted back then and what not. But what I did was as simple as the following:
1.) Netinstall RB450G with the following packages: routerboard, system, security, advanced-tools, routing, ppp, ntp
2.) Configure Name of RB450G
3.) Create a MR with 32MB RAM and 32MB Disk max.
4.) Add a static interface to MR, which only connects the host and the guest
5.) Configure an IP on each side of the interface
6.) Configure Name of Metarouter
7.) Setup a ping from Metarouter to Host
8.) Wait...
The uptime of this whole setup then varied from minutes to 1 day at best.
As I recall from barkas's reports, such a setup becomes a lot more unstable if you add some OSPF and MPLS setup to the MR.
But we will have to wait and see if he chimes in.

I will try and do the exact same setup this evening, if I find any spare time.

EDIT: I think I also sent at least two supout files. Please have a word with janisk, as he stated again that ALL information has been passed on to the developers.
[admin@450G-2] > sy resource print 
                   uptime: 3w9h10m55s
                  version: 5.6
              free-memory: 207140KiB
             total-memory: 257120KiB
                      cpu: MIPS 24Kc V7.4
                cpu-count: 1
            cpu-frequency: 680MHz
                 cpu-load: 6%
           free-hdd-space: 464128KiB
          total-hdd-space: 520192KiB
  write-sect-since-reboot: 115
         write-sect-total: 1325900
               bad-blocks: 1.5%
        architecture-name: mipsbe
               board-name: RB450G
                 platform: MikroTik
[admin@450G-2] > metarouter print 
Flags: X - disabled 
 #   NAME                                                      MEMORY-SIZE     DISK-SIZE     USED-DISK STATE        
 0   mr1                                                             32MiB     unlimited        277kiB running 
[admin@MikroTik] > sy resource print 
                   uptime: 3w9h8m47s
                  version: 5.6
              free-memory: 20988KiB
             total-memory: 29708KiB
                      cpu: MIPS 4Kc V0.10
                cpu-count: 1
            cpu-frequency: 680MHz
                 cpu-load: 6%
           free-hdd-space: 464124KiB
          total-hdd-space: 464401KiB
  write-sect-since-reboot: 0
         write-sect-total: 0
               bad-blocks: 0%
        architecture-name: mipsbe
               board-name: RB MetaROUTER
                 platform: MikroTik

The router was rebooted because I rearranged my desk and had to unplug its power cord. It has been running without crashes since 5.6 was installed on it.
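
For anyone who wants to compare the host and guest uptimes in printouts like the ones above, here is a small Python sketch. It assumes the uptime string only ever uses the w/d/h/m/s units seen in this thread, and that both printouts were taken at roughly the same moment, which may not be exactly true.

```python
import re

# Seconds per unit, for the unit letters that appear in the printouts above.
_UNIT_SECONDS = {"w": 604800, "d": 86400, "h": 3600, "m": 60, "s": 1}
_TOKEN = re.compile(r"(\d+)([wdhms])")


def uptime_to_seconds(text: str) -> int:
    """Convert an uptime string such as '3w9h10m55s' to seconds."""
    tokens = _TOKEN.findall(text)
    # Reject input that contains anything besides number+unit pairs.
    if not tokens or "".join(n + u for n, u in tokens) != text:
        raise ValueError(f"unrecognised uptime string: {text!r}")
    return sum(int(n) * _UNIT_SECONDS[u] for n, u in tokens)


if __name__ == "__main__":
    host = uptime_to_seconds("3w9h10m55s")   # host value from the printout
    guest = uptime_to_seconds("3w9h8m47s")   # guest value from the printout
    print(host - guest)  # difference in seconds
```

A difference of about two minutes between host and guest would be consistent with the guest having started with the host and not having restarted since.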
 
janisk
MikroTik Support
Posts: 6283
Joined: Tue Feb 14, 2006 9:46 am
Location: Riga, Latvia

Re: Has MikroTik given up on MetaROUTER?

Tue Apr 03, 2012 2:05 pm

here is another one, with a somewhat newer version; it was added at a later time
[admin@450G] > sy resource print 
                   uptime: 6w5d53m14s
                  version: 5.13rc1
              free-memory: 208804KiB
             total-memory: 257112KiB
                      cpu: MIPS 24Kc V7.4
                cpu-count: 1
            cpu-frequency: 680MHz
                 cpu-load: 2%
           free-hdd-space: 474292KiB
          total-hdd-space: 520192KiB
  write-sect-since-reboot: 138
         write-sect-total: 506305
               bad-blocks: 0%
        architecture-name: mipsbe
               board-name: RB450G
                 platform: MikroTik
[admin@mr-test] > sy resource print 
                   uptime: 6w4d23h36m12s
                  version: 5.13rc1
              free-memory: 20068KiB
             total-memory: 29700KiB
                      cpu: MIPS 4Kc V0.10
                cpu-count: 1
            cpu-frequency: 680MHz
                 cpu-load: 10%
           free-hdd-space: 474288KiB
          total-hdd-space: 474565KiB
  write-sect-since-reboot: 0
         write-sect-total: 0
               bad-blocks: 0%
        architecture-name: mipsbe
               board-name: RB MetaROUTER
                 platform: MikroTik
The metarouter has 2 static interfaces that are bridged with physical ones; inside, the guest router has bridged both interfaces, so it is passing traffic through.
 
barkas
Member Candidate
Posts: 260
Joined: Sun Sep 25, 2011 10:51 pm

Re: Has MikroTik given up on MetaROUTER?

Tue Apr 03, 2012 2:09 pm

Thanks a lot for cherrypicking the probably only one that is hand-tuned enough that it actually works. How about you post one of the not working ones?
 
timberwolf
Member Candidate
Topic Author
Posts: 274
Joined: Mon Apr 25, 2011 12:08 pm
Location: Germany

Re: Has MikroTik given up on MetaROUTER?

Tue Apr 03, 2012 2:11 pm

janisk
Which packages were installed and what do you suggest?
 
janisk
MikroTik Support
Posts: 6283
Joined: Tue Feb 14, 2006 9:46 am
Location: Riga, Latvia

Re: Has MikroTik given up on MetaROUTER?

Tue Apr 03, 2012 2:43 pm

here are the packages:
[admin@450G] > system package print 
Flags: X - disabled 
 #   NAME                                       VERSION                                      SCHEDULED              
 0 X dhcp                                       5.13rc1                                                             
 1   system                                     5.13rc1                                                             
 2   routerboard                                5.13rc1                                                             
 3 X hotspot                                    5.13rc1                                                             
 4 X ppp                                        5.13rc1                                                             
 5 X advanced-tools                             5.13rc1                                                             
 6   option                                     5.13rc1                                                             
 7   routing                                    5.13rc1                                                             
 8   wireless                                   5.13rc1                                                             
 9   security                                   5.13rc1                                                             
10   ntp                                        5.13rc1                                                             
11   ipv6                                       5.13rc1                                                             
12   mpls                                       5.13rc1  
[admin@MikroTik] > sy package print 
Flags: X - disabled 
 #   NAME                                       VERSION                                      SCHEDULED              
 0   system                                     5.6                                                                 
 1   hotspot                                    5.6                                                                 
 2   routerboard                                5.6                                                                 
 3   ipv6                                       5.6                                                                 
 4   ppp                                        5.6                                                                 
 5   security                                   5.6                                                                 
 6   mpls                                       5.6                                                                 
 7   wireless                                   5.6                                                                 
 8   advanced-tools                             5.6                                                                 
 9   option                                     5.6                                                                 
10   routing                                    5.6                                                                 
11   ntp                                        5.6                                                                 
12   dhcp                                       5.6
these are not cherry-picked routers, or else there would be no point in having them.

Actually I get them in much the same way sales sends them to customers: I request a certain number of devices and set a delivery point.
 
timberwolf
Member Candidate
Topic Author
Posts: 274
Joined: Mon Apr 25, 2011 12:08 pm
Location: Germany

Re: Has MikroTik given up on MetaROUTER?

Tue Apr 03, 2012 3:01 pm

janisk
Then how do you explain the troubles I and many others are having?
It's not like we pick faulty routers just to show you guys up.
And there still aren't any reports from users who use MR without trouble, even on the PPC platform. It's quite normal for negative reports to make up a bigger share in forums than positive ones, but so far you guys from MT are the only ones with detailed positive reports about MR in here. And yet your setups are quite special, without any real-world application.
Try a slightly more complicated setup, for example OSPF, L2TP or MPLS in a MR.
 
barkas
Member Candidate
Posts: 260
Joined: Sun Sep 25, 2011 10:51 pm

Re: Has MikroTik given up on MetaROUTER?

Tue Apr 03, 2012 3:02 pm

Crashes once per day with metarouter activated.

There's one bridge to the metarouter configured.
sy package print
Flags: X - disabled
 #   NAME                    VERSION                    SCHEDULED
 0   security                5.14
 1   system                  5.14
 2   routing                 5.14
 3   ups                     5.14
 4   ntp                     5.14
 5   routerboard             5.14
 6   mpls                    5.14
 7   ppp                     5.14
 8   multicast               5.14
 9   ipv6                    5.14
10   dhcp                    5.14
11   hotspot                 5.14
12   user-manager            5.14
13   advanced-tools          5.14
 
barkas
Member Candidate
Posts: 260
Joined: Sun Sep 25, 2011 10:51 pm

Re: Has MikroTik given up on MetaROUTER?

Tue Apr 03, 2012 3:12 pm

How about that? It seems you have been able to reproduce it, after all.
If resources are available (the router has a few % of CPU left and there is RAM), I have seen no difference in reboot frequency with or without load. Even simple usage patterns did not cause it to reboot more.

Reboots were usually done by the watchdog; disabling it revealed that the router freezes from time to time.

At the moment the idea is that the problem is software related, but this has to be tested on different hardware (like the RB433AH: same CPU, decent amount of RAM). And the problem is that something does not like MetaROUTER being run on the RB450G.
To be clear here: I consider the objective of a bug report to be that the vendor is able to reproduce the bug. Once the bug is reproduced, it is no longer my problem as customer to lobby the vendor into actually fixing it. Nor is it my problem if you don't want to fix a bug - I can switch to a different product, you know. It is also not my problem if you lose sales because your products are unable to successfully complete QA tests that customers may make before choosing to buy.
 
janisk
MikroTik Support
Posts: 6283
Joined: Tue Feb 14, 2006 9:46 am
Location: Riga, Latvia

Re: Has MikroTik given up on MetaROUTER?

Tue Apr 03, 2012 4:05 pm

That is the problem: these routers have been used in tests since the 3.x release, when MetaROUTER as such was introduced. Due to some specific limitations, a lot of testing was done on (wait for it) the RB433AH. If problems were reported, the first setup was made on an RB433AH and on the router model in the report.

These 2 routers have been in use since I started posting about this problem. So: they used to crash, but not anymore.

The main issue with the problem: the RB450G has some weird problem that cannot be reproduced on demand, and no common denominator has been found for what causes the freezes of the router.

What is known: when a freeze happens, the router does not respond over Ethernet. If you are running a script inside the router that does something every second, it works; the same goes for a script running in the guest, no matter whether RouterOS or OpenWRT: it is running after the freeze as if nothing had happened. If you send a small number of packets, like an ICMP ping to the router every second, the router will reply to all of the packets after the freeze, provided the ICMP ping packets do not time out on the sending host during the freeze.

If the watchdog is enabled, the router is rebooted by it no matter how long the freeze is.

I have seen freezes from 3 to 10 seconds; there are some reports of a few minutes.
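
If someone wants to pull such freezes out of a once-per-second ping log, a rough Python sketch could look like this. The log format (one round-trip time in milliseconds per probe, `None` for a timeout), the 1000 ms threshold, and the sample numbers are all assumptions for illustration; this is not output from a real router.

```python
from typing import List, Optional, Tuple


def find_freezes(
    rtts_ms: List[Optional[float]],
    threshold_ms: float = 1000.0,
) -> List[Tuple[int, int, Optional[float]]]:
    """Find runs of delayed or lost replies in a once-per-second ping log.

    rtts_ms[i] is the round-trip time of probe i in milliseconds, or None
    if the probe timed out. Returns (first_index, last_index, longest_rtt_ms)
    for each run; longest_rtt_ms is None when every probe in the run was lost.
    """
    freezes = []
    start = None
    worst = None
    for i, rtt in enumerate(rtts_ms):
        delayed = rtt is None or rtt >= threshold_ms
        if delayed:
            if start is None:
                # A new run begins; forget the previous run's worst value.
                start, worst = i, None
            if rtt is not None and (worst is None or rtt > worst):
                worst = rtt
        elif start is not None:
            freezes.append((start, i - 1, worst))
            start = None
    if start is not None:
        # The log ended while a run was still open.
        freezes.append((start, len(rtts_ms) - 1, worst))
    return freezes


if __name__ == "__main__":
    # Made-up sample: a stall of roughly 4 s in which replies arrive late, not lost.
    sample = [0.4, 0.5, 4001.0, 3000.8, 2000.6, 1000.5, 0.4]
    print(find_freezes(sample))
```

The longest round-trip time in a run gives a lower bound on how long the freeze lasted, because the first probe sent during the freeze waits the longest for its reply.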
 
barkas
Member Candidate
Posts: 260
Joined: Sun Sep 25, 2011 10:51 pm

Re: Has MikroTik given up on MetaROUTER?

Tue Apr 03, 2012 4:18 pm

That is the problem: these routers have been used in tests since the 3.x release, when MetaROUTER as such was introduced. Due to some specific limitations, a lot of testing was done on (wait for it) the RB433AH. If problems were reported, the first setup was made on an RB433AH and on the router model in the report.

These 2 routers have been in use since I started posting about this problem. So: they used to crash, but not anymore.

The main issue with the problem: the RB450G has some weird problem that cannot be reproduced on demand, and no common denominator has been found for what causes the freezes of the router.

What is known: when a freeze happens, the router does not respond over Ethernet. If you are running a script inside the router that does something every second, it works; the same goes for a script running in the guest, no matter whether RouterOS or OpenWRT: it is running after the freeze as if nothing had happened. If you send a small number of packets, like an ICMP ping to the router every second, the router will reply to all of the packets after the freeze, provided the ICMP ping packets do not time out on the sending host during the freeze.

If the watchdog is enabled, the router is rebooted by it no matter how long the freeze is.

I have seen freezes from 3 to 10 seconds; there are some reports of a few minutes.
Exactly that seems to be the problem. It's almost irrelevant what you do, since it will freeze / reboot anyway. That would hint at some core routine that is used in any case.

By the way, Ticket#2012040366000592 .
 
NathanA
Forum Veteran
Posts: 801
Joined: Tue Aug 03, 2004 9:01 am

Re: Has MikroTik given up on MetaROUTER?

Tue Apr 03, 2012 4:39 pm

What is known: when a freeze happens, the router does not respond over Ethernet. If you are running a script inside the router that does something every second, it works; the same goes for a script running in the guest, no matter whether RouterOS or OpenWRT: it is running after the freeze as if nothing had happened. If you send a small number of packets, like an ICMP ping to the router every second, the router will reply to all of the packets after the freeze, provided the ICMP ping packets do not time out on the sending host during the freeze.

If the watchdog is enabled, the router is rebooted by it no matter how long the freeze is.

I have seen freezes from 3 to 10 seconds; there are some reports of a few minutes.
janisk,

I can confirm almost everything you say here.

I hooked a terminal up to the serial port and watched it. I was hoping that the crash was a kernel panic of some kind and that I would be able to capture a stack trace. But you are correct: it is the watchdog that is rebooting it. So I saw nothing of interest on the console. :( If I turn the watchdog off, the reboots stop, but then I see the freezes that you talk about.

So the reboots are not crashes, but simply the watchdog reacting to the router being nonresponsive.

Now, I did end up learning something interesting with the serial console experiment that may or may not be of interest to janisk, sergejs, and crew: it's not just the router being nonresponsive over the network/ethernet. When the router freezes up at random for 1-2 minutes with a MetaRouter guest running, *the console is also nonresponsive*. So if I try to type something out on the serial console, nothing gets echoed back to me. But if I wait and watch the console when it finally "unfreezes" itself, everything that I typed shows up on the console! So yes, it would appear that even the console is buffering characters that I send to it, and eventually it acts on my console input.

To me, it feels like something is eating up all of the CPU cycles -- making the whole router unresponsive -- and then suddenly returns back to normal.

I can also confirm that I am not having any problems with an RB433AH that I configured identically to my RB450G. It runs like a champ with MetaRouter for me.

This is very strange...like you said, RB450G and RB433AH hardware is very similar; at least, the CPU is the same. So there must be some other difference between the two boards that we are all missing. Wild hypothesis: it's a driver issue. A kernel driver/module for a particular piece of hardware on the RB450G has a bug (race condition of some sort?) that is only triggered under rare conditions, but somehow the way that MetaRouter interacts with it is triggering that bug.

One obvious hardware difference between the 450G and the 433AH is the gigabit ethernet. That, and the presence of a switch chip.

I do notice something different in interrupt request hit numbers between the two devices...the RB450G is counting up IRQ hits on the GPIO interface (IRQ 18) at an *astronomical* rate (roughly 200 hits/sec!) I just checked another 450G out in the field and it is doing the same thing. The 433AH, though, has 0 on GPIO. (Note that it looks like it happens whether there is a MetaRouter actively running or not, so it probably has nothing to do with it, but I thought I would mention it on the off-chance that it is related...)
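
For reference, this is roughly how a hits/sec figure like that can be worked out from two readings of the IRQ counters taken a known number of seconds apart. It is a Python sketch with made-up counter values; only the IRQ number 18 for GPIO comes from what I saw, IRQ 4 is a hypothetical second entry, and how you read the counters off the router is left out on purpose.

```python
from typing import Dict


def irq_rates(
    before: Dict[int, int],
    after: Dict[int, int],
    elapsed_s: float,
) -> Dict[int, float]:
    """Return hits per second for every IRQ present in both snapshots."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    return {
        irq: (after[irq] - before[irq]) / elapsed_s
        for irq in before
        if irq in after
    }


if __name__ == "__main__":
    # Made-up counter readings taken 10 seconds apart.
    first = {4: 120_000, 18: 5_000_000}
    second = {4: 120_050, 18: 5_002_000}
    for irq, rate in sorted(irq_rates(first, second, elapsed_s=10.0).items()):
        print(f"IRQ {irq}: {rate:.1f} hits/s")
```

Taking the same two readings on a 433AH and a 450G, with and without a guest running, would make the comparison less of an eyeball estimate.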

-- Nathan
 
timberwolf
Member Candidate
Member Candidate
Topic Author
Posts: 274
Joined: Mon Apr 25, 2011 12:08 pm
Location: Germany

Re: Has MikroTik given up on MetaROUTER?

Tue Apr 03, 2012 4:49 pm

...if you are running script inside router, that does something inside router every second, it works, same goes for script running in guest, no matter RouterOS or OpenWRT, it is running after the freeze as if nothing has happened.
Well you can't be sure if it really runs during the freeze or just catches up, as the script has no idea of time. See also my second idea, further down.
If you send small amount of packets, like ICMP ping to router every second, router after the freeze will reply to all of the packets, that is if ICMP ping packets on host sending them out does not timeout during the freeze time.
I have an idea for that. Assuming the network hardware uses DMA buffers, then all those ICMP echo requests end up in this buffer, waiting to be processed by the CPU. If anything blocks the CPU's interrupt servicing, like a crazy service routine, this could be the behaviour one would see from the outside. The interrupt flag of the NIC gets set but isn't serviced; once the CPU unblocks, it starts processing other interrupt requests again. I know this behaviour from different CPU architectures, like bigger ARM or small AVR controllers. I don't know the Atheros MIPS interrupt controller implementation, as there aren't any datasheets available.
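
The buffering idea can be sketched as a toy model: requests that arrive while the CPU services nothing sit in a buffer and are all answered at the moment the freeze ends, so the sender only loses the ones whose wait exceeds its own timeout. This is a minimal Python sketch under that assumption; the function name, the freeze window and the timeout are hypothetical and say nothing about how the actual hardware behaves.

```python
def simulate_freeze(send_times, freeze_start, freeze_end, timeout_s):
    """Model echo requests queued in a buffer while the CPU services nothing.

    Returns (answered, timed_out): a list of (send_time, reply_time) pairs,
    and a list of send times whose reply came later than the sender's timeout.
    """
    answered, timed_out = [], []
    for sent in send_times:
        # A request arriving during the freeze waits until the CPU resumes.
        if freeze_start <= sent < freeze_end:
            reply = freeze_end
        else:
            reply = sent
        if reply - sent > timeout_s:
            timed_out.append(sent)
        else:
            answered.append((sent, reply))
    return answered, timed_out


# One ping per second for 10 s, with a hypothetical freeze from t=2 to t=7
# and a sender-side timeout of 3 s.
answered, timed_out = simulate_freeze(range(10), 2, 7, timeout_s=3)
```

In this model the pings sent late in the freeze are all answered at once when it ends, which matches the "catches up" pattern described above.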
 
timberwolf
Member Candidate
Member Candidate
Topic Author
Posts: 274
Joined: Mon Apr 25, 2011 12:08 pm
Location: Germany

Re: Has MikroTik given up on MetaROUTER?

Tue Apr 03, 2012 5:02 pm

When the router freezes up at random for 1-2 minutes with a MetaRouter guest running, *the console is also nonresponsive*. So if I try to type something out on the serial console, nothing gets echoed back to me. But if I wait and watch the console when it finally "unfreezes" itself, everything that I typed shows up on the console! So yes, it would appear that even the console is buffering characters that I send to it, and eventually it acts on my console input.
Yes this still fits within my theory.
To me, it feels like something is eating up all of the CPU cycles -- making the whole router unresponsive -- and then suddenly returns back to normal.
I think it is something similar to that.
I do notice something different in interrupt request hit numbers between the two devices...the RB450G is counting up IRQ hits on the GPIO interface (IRQ 18) at an *astronomical* rate (roughly 200 hits/sec!) I just checked another 450G out in the field and it is doing the same thing. The 433AH, though, has 0 on GPIO. (Note that it looks like it happens whether there is a MetaRouter actively running or not, so it probably has nothing to do with it, but I thought I would mention it on the off-chance that it is related...)
Man you are a genius! This could definitely be it! I guess that those GPIO interrupts lock out some service routine for MetaROUTER. Either by contention or by a simple programming error or by a hardware glitch, like incorrect flag clearing sequences.
janisk, sergejs
Please pass this information on to your devs, this is most likely the cause. If your devs can pass on datasheet information, and details for GPIO interrupt service routines, I could also take a look at it, with no obligations on your side.
 
NathanA
Forum Veteran
Forum Veteran
Posts: 801
Joined: Tue Aug 03, 2004 9:01 am

Re: Has MikroTik given up on MetaROUTER?

Tue Apr 03, 2012 6:05 pm

Yes this still fits within my theory.
I agree; I just wanted to make sure to get that out there because I didn't want people to get too hung up on it being a "network layer" problem (processes are still running normally but you lose contact with the device). I think *everything* is "freezing" up, and that none of the normal processes are getting anywhere fast when this happens, and then they "catch up", as you say, after the issue clears up.
Man you are a genius!
...time will tell... :?

Interestingly, I just went through various mipsbe-based RB models on our network and found a few others that are also doing the same thing. I've also found other mipsbe models that don't (and some that don't even show *any* IRQ for GPIO). It would be interesting to try MetaRouter on as wide a variety of devices as possible, note which ones MR freezes up on, and note whether there is a direct correlation between that and the models where GPIO interrupt service counts are sky-high.

I wonder what MikroTik engineers have wired up to the CPU's GPIO lines on the models that do this...

A few that show the GPIO IRQ count issue:

- RB711UA-2HnD
- RB711GA-5HnD
- RB711A-5Hn
- RB493G
- RB450G (obviously)
- RB411AH (! this surprised me)
- SXT

Models that don't have the issue:

- RB751U-2HnD
- RB750
- RB711-5Hn
- RB493
- RB433AH
- RB433
- RB411

I unfortunately have no RB750UP, RB750G/GL, or RB450 at my disposal to look at. Also, it may take me a while before I can run many MetaRouter tests on these as most of these are deployed/in production and I don't have a lot of these sitting in stock on the shelf to test with.

-- Nathan
 
timberwolf
Member Candidate
Member Candidate
Topic Author
Posts: 274
Joined: Mon Apr 25, 2011 12:08 pm
Location: Germany

Re: Has MikroTik given up on MetaROUTER?

Tue Apr 03, 2012 6:16 pm

...time will tell... :?
Sure, but for myself, I am 99.99% sure that this is the cause. I hunted such issues down quite often, so I got a feel for it. ;-)
The other question is, will MikroTik be able to fix it. There are a few scenarios, which might prove difficult to fix.
I wonder what MikroTik engineers have wired up to the CPU's GPIO lines on the models that do this...
Me too, I don't have a RB450G within reach right now, where I am, but an RB750GL, which lists IRQ 4 for switch0 with about 40 IRQs per second but no GPIO IRQ.
Could be anything, judging by the frequency it could be some bit-banged protocol or an open pin. ;-)
 
barkas
Member Candidate
Member Candidate
Posts: 260
Joined: Sun Sep 25, 2011 10:51 pm

Re: Has MikroTik given up on MetaROUTER?

Tue Apr 03, 2012 8:21 pm

Inserting a microsd card in rb450G does not change the interrupt load.
 
NathanA
Forum Veteran
Forum Veteran
Posts: 801
Joined: Tue Aug 03, 2004 9:01 am

Re: Has MikroTik given up on MetaROUTER?

Tue Apr 03, 2012 11:45 pm

Inserting a microsd card in rb450G does not change the interrupt load.
Ooh, good thought. I forgot about the SD card slot. The 433AH, though, also has one, and the SXT doesn't.

Gotta be something else...

-- Nathan
 
timberwolf
Member Candidate
Member Candidate
Topic Author
Posts: 274
Joined: Mon Apr 25, 2011 12:08 pm
Location: Germany

Re: Has MikroTik given up on MetaROUTER?

Wed Apr 04, 2012 1:40 pm

Whatever it is, and even if those 200 interrupts per second are necessary, the bug almost certainly isn't caused by the mere existence of those requests.
The problem is caused either by the involved ISRs (interrupt service routines) or by the interrupt controller of the CPU itself.
In most cases it's a simple race condition while clearing/setting specific IRQ flags or global IRQ enable flags inside those ISRs, which causes IRQs to not be serviced until another race condition triggers processing again.
The more I think about it, the more plausible other effects like queuing of network traffic and serial data become. If the global interrupt queuing stays enabled, the UART and NIC ISRs can still shift data into the correct buffers; this, however, would point to a pure software race condition not involving global IRQ enables but something like a simple variable/mutex/semaphore locking mechanism implemented in software. For example, there might be a lock in place which allows the processing of NIC and UART data only when the correct context is currently active, i.e. when the outer RouterOS is running and not one of the MetaRouters.

So focusing on the source of the GPIO IRQs might only lead to a workaround and not a real solution. In the worst case this bug is caused by a compiler error, which is hard to track down.
 
NathanA
Forum Veteran
Forum Veteran
Posts: 801
Joined: Tue Aug 03, 2004 9:01 am

Re: Has MikroTik given up on MetaROUTER?

Wed Apr 04, 2012 2:23 pm

Whatever it is, and even if those 200 interrupts per second are necessary, the bug almost certainly isn't caused by the mere existence of those requests. [...] this, however, would point to a pure software race condition not involving global IRQ enables but something like a simple variable/mutex/semaphore locking mechanism implemented in software. For example, there might be a lock in place which allows the processing of NIC and UART data only when the correct context is currently active, i.e. when the outer RouterOS is running and not one of the MetaRouters.
So if I understand you correctly: in other words, the presence of the hundreds/sec interrupt requests does not directly cause the problem but merely increases the likelihood/chances that you will trigger the race condition and experience this bug, right? It happens way, way less often on boards that do not have the constant stream of GPIO interrupt requests (e.g., RB433AH), but under the right conditions, it COULD happen on any board model when ANY interrupt is raised. More raised interrupts just means more opportunities for a "collision" to take place.
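
The "more interrupts, more opportunities" argument can be written as simple arithmetic. If each interrupt independently has some tiny chance p of landing in the race window (an illustrative assumption, not something anyone here has measured), then the chance of at least one hit after n interrupts is 1 - (1 - p)^n. A minimal Python sketch, where the value of p and the "quiet board" rate are made up:

```python
import math


def p_at_least_one_hit(p_per_irq, irqs_per_sec, seconds):
    """Probability that at least one of n independent interrupts hits the race."""
    n = irqs_per_sec * seconds
    # 1 - (1 - p)^n, computed via log1p/expm1 to stay accurate for tiny p.
    return -math.expm1(n * math.log1p(-p_per_irq))


P_HYPOTHETICAL = 1e-7  # made-up per-interrupt chance of hitting the race window
ONE_HOUR = 3600

busy_board = p_at_least_one_hit(P_HYPOTHETICAL, 200, ONE_HOUR)  # ~200 GPIO IRQs/sec
quiet_board = p_at_least_one_hit(P_HYPOTHETICAL, 2, ONE_HOUR)   # hypothetical low rate
```

With the same p, the board taking 100 times more interrupts is roughly 100 times more likely to hit the bug within the hour, even though the bug itself is identical on both.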

-- Nathan
 
timberwolf
Member Candidate
Member Candidate
Topic Author
Posts: 274
Joined: Mon Apr 25, 2011 12:08 pm
Location: Germany

Re: Has MikroTik given up on MetaROUTER?

Wed Apr 04, 2012 2:28 pm

So if I understand you correctly: in other words, the presence of the hundreds/sec interrupt requests does not directly cause the problem but merely increases the likelihood/chances that you will trigger the race condition and experience this bug, right? It happens way, way less often on boards that do not have the constant stream of GPIO interrupt requests (e.g., RB433AH), but under the right conditions, it COULD happen on any board model when ANY interrupt is raised. More raised interrupts just means more opportunities for a "collision" to take place.
Yes, you summarized it absolutely correctly. This is IMHO causing the majority of all MetaROUTER-related problems.
 
NathanA
Forum Veteran
Forum Veteran
Posts: 801
Joined: Tue Aug 03, 2004 9:01 am

Re: Has MikroTik given up on MetaROUTER?

Thu Apr 05, 2012 2:43 pm

So I hate to throw any more fuel on this fire and confuse the issue any further, but...I just tried something that has supposedly already been debunked, and SO FAR (an hour into it) it seems to be working for me.

I changed out my 24v 0.8a power supply on my RB450G for a 12v 1a supply.

I've had multiple calls terminated by the Asterisk instance running on the MetaRouter on my 450G, and normally by this time I would have seen a reboot or freeze-up. But I have not ever since changing out the power supply.

Like I said, it's only been an hour, so this may be premature. But it is interesting to note that even IF it doesn't help 100% and these freeze-ups still occur occasionally, it *does* seem like it possibly helps reduce the number of occurrences.

I will note that changing out the PS has not changed the frequency of GPIO interrupts being raised, FWIW.

-- Nathan
 
timberwolf
Member Candidate
Member Candidate
Topic Author
Posts: 274
Joined: Mon Apr 25, 2011 12:08 pm
Location: Germany

Re: Has MikroTik given up on MetaROUTER?

Thu Apr 05, 2012 2:54 pm

I have no idea how the power supply could have an influence on this, although some influence seems to exist.
 
NathanA
Forum Veteran
Forum Veteran
Posts: 801
Joined: Tue Aug 03, 2004 9:01 am

Re: Has MikroTik given up on MetaROUTER?

Thu Apr 05, 2012 10:19 pm

I ran it on the 12v power all night long, often with 2 simultaneous calls going to the Asterisk instance running in the MetaRouter, with no problems.

Today, I put the 24v power supply back on, and within 15 minutes it rebooted. After it came back up, it rebooted a second time 2 minutes later.

This has to be more than a coincidence.

I did notice something: the entire time it ran on the 12v power supply, '/system health' reported a very constant temperature of 47C that it NEVER deviated from. Within the 15 minutes of time from when I switched back to the 24v supply to when it rebooted, it went from 47C to 49C. Just before it crashed the second time (within 2 minutes), it hit 50C.

Right now it's been up for 5 minutes since the last crash, and hasn't died yet. I noticed that the temp is showing 49C again. It's not possible that the "freezing up" is some kind of system health/heat protection mechanism kicking in, is it?

No answers, just more observations. :)

-- Nathan
 
barkas
Member Candidate
Member Candidate
Posts: 260
Joined: Sun Sep 25, 2011 10:51 pm

Re: Has MikroTik given up on MetaROUTER?

Fri Apr 06, 2012 12:12 am

Mine is at 51°c, 16.4V and has rebooted 4 times in the last 24 hours.

No answer to my ticket yet.
 
NathanA
Forum Veteran
Forum Veteran
Posts: 801
Joined: Tue Aug 03, 2004 9:01 am

Re: Has MikroTik given up on MetaROUTER?

Fri Apr 06, 2012 9:23 am

I don't *really* think that it's an overheating thing; it's just a weird coincidence, perhaps. After I posted my last message, I deliberately loaded the CPU (bandwidth-test to localhost) in order to try and raise the temperature and see if that made it more likely to freeze. It was doing this while handing two simultaneous SIP/RTP sessions within the MetaRouter instance. It did make the temperature go up (to 51C), but it did not freeze. I finally quit the bandwidth-test, and then a few minutes after that, it froze and then watchdog rebooted it. It had been up for a total of 30 minutes, a near-record for me with the 24v power supply.

So I put the 12v power supply back in place after that, it has now been up for 10 hours, and temperature has dropped to 47C. Again, not really convinced the temp has anything to do with it. I'm just telling people what I see. What is absolutely not in doubt is that for me, a high-amperage 12v supply seems to cure all of my MetaRouter issues. I agree: it's weird, and I can't make sense of it. I can offer no explanation for WHY it works. Just that it does work, at least in this particular instance with this particular board.
Mine is at 51°c, 16.4V and has rebooted 4 times in the last 24 hours.
At 16.4v, that can't be a 12v you've got plugged in there...probably more like an 18v, maybe with PoE? (Would help explain some of the voltage drop.) Have you tried a 12v adapter plugged into the power jack?

-- Nathan
 
timberwolf
Member Candidate
Member Candidate
Topic Author
Posts: 274
Joined: Mon Apr 25, 2011 12:08 pm
Location: Germany

Re: Has MikroTik given up on MetaROUTER?

Fri Apr 06, 2012 12:25 pm

Well, temperature could indeed have an influence. The two possibilities I can see are:
1.) A glitch in the interrupt controller, which means a hardware design bug inside the Atheros SoC.
2.) An overtemperature shutdown or control mechanism, which we could check if those damn datasheets were available.

Regarding the input voltage, the supply voltage of the CPU shouldn't depend on it, but we don't know whether there is a design bug on MT's part somewhere in there.
Why the CPU temperature changes when you swap the power supply could be related to the topology of the PCB, but the voltage converter isn't very near the CPU...
Wait, there seems to be another converter directly beside the CPU, a rather small one, maybe for the I/O buffers; it seems too small for the core voltage.

Only speculating, but the link between MetaRouter and interrupts still seems valid; I think we only influence the chances when changing power supplies.
 
timberwolf
Member Candidate
Member Candidate
Topic Author
Posts: 274
Joined: Mon Apr 25, 2011 12:08 pm
Location: Germany

Re: Has MikroTik given up on MetaROUTER?

Fri Apr 06, 2012 12:50 pm

Mine is at 51°c, 16.4V and has rebooted 4 times in the last 24 hours.

No answer to my ticket yet.
I just hooked up mine to a lab supply, which far exceeds the value of the RB450G, and get a reading of 12.4V at 12V input. Will see if that changes anything.
I'm just creating 4 MetaRouters with basically no config but an IP, pinging the host over a common bridge for all MRs.

EDIT: Current consumption at 12V is about 200mA, the temperature after half an hour is 58C.
 
timberwolf
Member Candidate
Member Candidate
Topic Author
Posts: 274
Joined: Mon Apr 25, 2011 12:08 pm
Location: Germany

Re: Has MikroTik given up on MetaROUTER?

Fri Apr 06, 2012 5:45 pm

OK, as I was expecting, changing the powersupply from 24V to 12V doesn't really help, my RB450G just did reboot.
I think we can once and for all rule out the PSU ;-)
 
NathanA
Forum Veteran
Forum Veteran
Posts: 801
Joined: Tue Aug 03, 2004 9:01 am

Re: Has MikroTik given up on MetaROUTER?

Fri Apr 06, 2012 5:49 pm

OK, as I was expecting, changing the powersupply from 24V to 12V doesn't really help, my RB450G just did reboot.
I think we can once and for all rule out the PSU ;-)
I want to discount the power supply, too, because it doesn't make any sense to me. But my experience thus far still won't allow me to completely rule it out. Mine has been up for 20 hours now and counting. I haven't been able to get this kind of uptime with the original power supply ever.

I'll leave it running for a while longer, and see how long it takes for it to have an episode, if ever.

-- Nathan
 
timberwolf
Member Candidate
Member Candidate
Topic Author
Posts: 274
Joined: Mon Apr 25, 2011 12:08 pm
Location: Germany

Re: Has MikroTik given up on MetaROUTER?

Fri Apr 06, 2012 9:05 pm

...But my experience thus far still won't allow me to completely rule it out. Mine has been up for 20 hours now and counting. I haven't been able to get this kind of uptime with the original power supply ever...
Well, that is the point: using the same power supply and configuration, I had uptimes between 5 minutes and 1.5 days. With the lab supply I got 3 hours. So my conclusion is that it doesn't have any influence. ;-) At least not via the temperature, but maybe via some really, really minimal shifts in timing at some point in the system. Long story short, it is not the cause, merely a contributor in some strange analog way. :?
 
NathanA
Forum Veteran
Forum Veteran
Posts: 801
Joined: Tue Aug 03, 2004 9:01 am

Re: Has MikroTik given up on MetaROUTER?

Fri Apr 06, 2012 9:28 pm

I totally believe you. All I'm saying is that this board with this power supply is setting uptime records I have not been able to achieve before. In my experience, with the original 24v supply, it might sometimes stay up for as long as 6 hours...if there was absolutely no activity happening. I know that this also seems to go against some people's experiences which suggest that there is no correlation between load (either CPU load or network traffic) and freeze-ups, but in my case there is: after being up for 5-6 hours, if I start placing IP calls to Asterisk in MR, it will freeze within 15-30 minutes. Coincidence? Maybe, but it has happened too many times for me to believe that.
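
A rough way to ask "coincidence or not?": if freezes were memoryless with some mean uptime (an assumption for illustration only, nothing in this thread establishes it), the chance of a board surviving T hours without a freeze would be exp(-T / mean). The Python sketch below plugs in a mean of 6 hours, which is generous given that 6 hours was the best case seen with the 24v supply; the function name is hypothetical.

```python
import math


def survival_probability(hours_up, mean_uptime_hours):
    """Chance of seeing no failure for hours_up, assuming memoryless failures."""
    if mean_uptime_hours <= 0:
        raise ValueError("mean_uptime_hours must be positive")
    return math.exp(-hours_up / mean_uptime_hours)


# Assumed mean uptime of 6 h on the original supply (a deliberately generous guess).
p_20h = survival_probability(20, 6)
p_24h = survival_probability(24, 6)
```

Under that assumption, a 20-hour run would be a few-percent event, so a single long run is suggestive but not proof; a second board or a swap back to the old supply is the stronger test.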

Maybe different boards have different physical tolerance levels for...whatever this thing is. And mine is at the threshold where it never kicks in when using this power supply (or at least extremely rarely...so rarely that I have not experienced it yet).

Another observation I'd like to make (which I haven't heard anyone else comment on) is that there are *definitely* TWO types of "freezes" that occur: those that affect the whole device, including the host OS (the kind most people are talking about here), and another one where *only the guest locks up*. I was getting these almost as routinely as the whole-device kind that would cause the watchdog to kick in. With the guest freeze-up, of course watchdog doesn't kick in, and my Winbox session stays up and I can bring up a MikroTik terminal and interact with the host, but I lose complete contact with the guest for 1-2 minutes, and both the MetaRouter console in Winbox AND any SSH sessions to the guest (or any other type of network traffic to the guest) are completely unresponsive. Then, a minute or so later, it wakes itself up again.

I will also note that I have had 0 of those freezes since changing the power supply, as well. Bizarre but true.

I am almost at 24 hours of uptime, and I've had a SIP call with continuous bidirectional audio up to it for 3 hours now. Not a single hiccup. (And one side of the audio stream is being generated by Asterisk itself, looping various sound files!)

I will let the call run for as long as humanly possible; I may have to interrupt it at some point because it is running from my laptop to an external SIP proxy and then back to the 450G. If I'm going to run it for much longer, I'll want to tear it down and then dedicate another device to it that I can hide in a corner of the building somewhere where it will be out of the way and just run and run and run. I will also continue to periodically update everybody on my success (or lack of...).

-- Nathan
 
NathanA
Forum Veteran
Forum Veteran
Posts: 801
Joined: Tue Aug 03, 2004 9:01 am

Re: Has MikroTik given up on MetaROUTER?

Sat Apr 07, 2012 11:01 pm

Now at 48 hours of uptime. SIP call has run continuously, and is still chugging along.

-- Nathan
 
barkas
Member Candidate
Member Candidate
Posts: 260
Joined: Sun Sep 25, 2011 10:51 pm

Re: Has MikroTik given up on MetaROUTER?

Sun Apr 08, 2012 2:03 am

I'm a bit irritated that mine hasn't yet crashed either.
 
timberwolf
Member Candidate
Member Candidate
Topic Author
Posts: 274
Joined: Mon Apr 25, 2011 12:08 pm
Location: Germany

Re: Has MikroTik given up on MetaROUTER?

Sun Apr 08, 2012 12:33 pm

Hmm, this isn't good. :-( It could mean that there is a hardware glitch inside the SoC/CPU or a design glitch on the board. Either case would mean that we won't see a fix.
 
NathanA
Forum Veteran
Forum Veteran
Posts: 801
Joined: Tue Aug 03, 2004 9:01 am

Re: Has MikroTik given up on MetaROUTER?

Mon Apr 09, 2012 12:51 am

How could it be a SoC issue? It's the exact same part/silicon that's on the 433AH.

BTW, over 3 days of uptime now on mine. I'm telling you: this power supply has made it stable. Maybe I'm crazy, but I'd put this into production...if it were going to crash again, it'd have done it by now.

I'll continue to let it run through the rest of the weekend.

Happy Easter, everyone,

-- Nathan
 
timberwolf
Member Candidate
Member Candidate
Topic Author
Posts: 274
Joined: Mon Apr 25, 2011 12:08 pm
Location: Germany

Re: Has MikroTik given up on MetaROUTER?

Mon Apr 09, 2012 11:01 am

How could it be a SoC issue? It's the exact same part/silicon that's on the 433AH.
Yes the same part, but not the same setup regarding connections to a switch chip and amount of RAM etc.
And I am not even sure, if it is the exact same part and not just the same CPU with a different SoC setup.
 
timberwolf
Member Candidate
Member Candidate
Topic Author
Posts: 274
Joined: Mon Apr 25, 2011 12:08 pm
Location: Germany

Re: Has MikroTik given up on MetaROUTER?

Mon Apr 09, 2012 11:06 am

BTW, over 3 days of uptime now on mine. I'm telling you: this power supply has made it stable. Maybe I'm crazy, but I'd put this into production...if it were going to crash again, it'd have done it by now.
I would be happy if I could report the same; 3 hours was all I got, and not even with some load, just pinging...
 
peson
Trainer
Trainer
Posts: 183
Joined: Tue Jul 20, 2004 10:33 am
Location: Sweden

Re: Has MikroTik given up on MetaROUTER?

Mon Apr 09, 2012 1:56 pm

Good initiative Timberwolf :-D.
Nice work Nathan :-) It looks like you've put lots of effort on this.

I have a ticket at MT about the same reboot problem on RB1100AH.
I've done tons of testing with different configs on the RB1100AH and nothing helped.
I had given up on MR testing until now, but after reading this thread I started to do some tests again.
Now I'm on a 450G and it acts the same.
I have a vague suspicion that the switch chip is causing the problem.
Both the 1100AH and 450G are using the same switch chip, the Atheros 8316.

I have a RB493AH with 4 MR guests, all ROS.
It is a complex setup for the MTCINE lab.
I have an uptime of 8+ days on this router and I have no reboot issues at all. It has the same CPU as the 450G, but a different switch chip (ICPlus178C).

I had a console crashlog on the MR in the RB450G and it looks like this:
MikroTik 5.14
MikroTik Login: Kernel unaligned instruction access[#1]:
Cpu 0
$ 0   : 00000000 0000006e 00000000 00000000
$ 4   : c00c83a0 00000001 c00c83f0 ffffffff
$ 8   : c0c0300c c038e8c0 fff7ffff c03c0000
$12   : 0000000a c03e0958 00000001 00000000
$16   : c0002000 00000000 2aab0000 004edae8
$20   : 00510000 0050db54 0050db30 0050d9d8
$24   : 00000010 c01108a8                  
$28   : c0c9a000 c0c9bec8 7f8a7bc0 c0101538
Hi    : 00000005
Lo    : 00000000
epc   : b0b74c08 0xb0b74c08
    Tainted: P           
ra    : c0101538 do_one_initcall+0x64/0x1ec
Status: 10008203    KERNEL EXL IE 
Cause : 10004010
BadVA : b0b74c08
PrId  : 0001800a (MIPS 4Kc)
Process net (pid: 213, threadinfo=c0c9a000, task=c0c233c0, tls=00000000)
Stack : c014eef0 c084b000 c0c9be78 00000001 c00c83f0 c0140018 2aab0000 004edae8
        00510000 c00c83f0 00000000 2aab0000 004edae8 c0151724 4f44c124 00000001
        000000d1 00000000 00002000 00000000 00000e04 004edae8 004edae8 ffffffff
        0000000e c010d0e4 0042a778 7f8a7bc0 7f8a812c 7f8a7bf4 00000000 00000000
        00000000 00000001 00001020 00000000 2aab0000 00000e04 004edae8 0000000e
        ...
Call Trace:
[<c014eef0>] module_sect_show+0x0/0x18
[<c0140018>] blocking_notifier_call_chain+0x14/0x20
[<c0151724>] sys_init_module+0xb0/0x1dc
[<c010d0e4>] stack_done+0x20/0x3c


Code:
unaligned data access at c0109918 show_code+0x9c/0x150
unaligned data access at c010b660 do_ade+0x1e0/0x420
Unhandled kernel unaligned access[#2]:
Cpu 0
$ 0   : 00000000 0000006e c0c9a000 b0b74bfc
$ 4   : 00000000 00000000 ffffffff 00010000
$ 8   : 35300d0a c0c0956c 00000000 30783963
$12   : 0000000a c03e0958 00000001 00000000
$16   : c0c9bca8 00000007 80000000 fffffffa
$20   : 00000008 00000020 00000006 c0338dd8
$24   : 00000000 c01108a8                  
$28   : c0c9a000 c0c9bc80 0000003e c010b5b4
Hi    : 00000005
Lo    : 0000000d
epc   : c010b660 do_ade+0x1e0/0x420
    Tainted: P           
ra    : c010b5b4 do_ade+0x134/0x420
Status: 10008202    KERNEL EXL 
Cause : 00000010
BadVA : b0b74bfc
PrId  : 0001800a (MIPS 4Kc)
Process net (pid: 213, threadinfo=c0c9a000, task=c0c233c0, tls=00000000)
Stack : c03c1c4e c0109918 c0109918 00010000 c0370000 00000000 fffffffd b0b74bfc
        fffffffa c01047e0 c033a4dc c0370000 c03c0000 c0125138 c03c0000 c0125138
        00000000 0000006e 00000000 0000003c c037687c c0c9bc13 00000000 00010000
        00000000 00000001 00000003 436f6465 0000000a c03e0958 00000001 00000000
        00000000 fffffffd b0b74bfc fffffffa 00000008 00000020 00000006 c0338dd8
        ...
Call Trace:
[<c010b660>] do_ade+0x1e0/0x420
[<c01047e0>] ret_from_exception+0x0/0xc
[<c0109918>] show_code+0x9c/0x150
[<c010a250>] show_registers+0x94/0xac
[<c010a324>] die+0xbc/0x128
[<c010b874>] do_ade+0x3f4/0x420
[<c01047e0>] ret_from_exception+0x0/0xc


Code: 00852024  54800063  8e040098 <88730000> 98730003  24030000  08042da8  00000000  8c450018
---[ end trace 268415cd87e731ca ]---
ip_tables: (C) 2000-2006 Netfilter Core Team
netfilter PSD loaded - (c) astaro AG
Process accounting paused
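
For anyone collecting several of these guest crash logs to compare, the "Call Trace" lines have a regular shape (address, symbol, offset/size) that is easy to pull out with a script. Below is a minimal Python sketch that extracts the frames from pasted console text. The regular expression is written against the lines shown above only, and `parse_call_trace` is a hypothetical helper, not part of any MikroTik tool.

```python
import re

# Matches lines such as "[<c014eef0>] module_sect_show+0x0/0x18".
TRACE_LINE = re.compile(
    r"^\[<(?P<addr>[0-9a-f]{8})>\]\s+"
    r"(?P<symbol>\w+)\+0x(?P<offset>[0-9a-f]+)/0x(?P<size>[0-9a-f]+)$"
)


def parse_call_trace(log_text):
    """Return a list of (address, symbol, offset) tuples found in log_text."""
    frames = []
    for line in log_text.splitlines():
        match = TRACE_LINE.match(line.strip())
        if match:
            frames.append(
                (int(match["addr"], 16), match["symbol"], int(match["offset"], 16))
            )
    return frames


# The first call trace from the log above, pasted as sample input.
sample = """Call Trace:
[<c014eef0>] module_sect_show+0x0/0x18
[<c0140018>] blocking_notifier_call_chain+0x14/0x20
[<c0151724>] sys_init_module+0xb0/0x1dc
[<c010d0e4>] stack_done+0x20/0x3c
"""
frames = parse_call_trace(sample)
```

Collecting the symbol names across several crashes would show quickly whether they always die in the same place (here, during module loading via `sys_init_module`).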

I totally agree that MT should take responsibility for the MR issues; there are too many threads about failures to ignore the problem and just say:
- use a different board
We (their customers, consultants and trainers) spend lots of time trying to give them enough information to solve this.
The idea of MR rocks, but tweaking and squeezing :-) shouldn't be needed
Reboot is the last resort, try to find out what's wrong instead.
 
NathanA
Forum Veteran
Forum Veteran
Posts: 801
Joined: Tue Aug 03, 2004 9:01 am

Re: Has MikroTik given up on MetaROUTER?

Mon Apr 09, 2012 8:40 pm

peson,

That is interesting that you have an 1100AH that is rebooting for you. I also have an 1100AH that I have done some playing with MetaRouter on, and have not had any problems with it. I admit I have not had extended uptime tests on it, though. I will plan to start doing with my 1100AH what I have been doing with the 450G (load Asterisk on, loop a constant call through it for hours).

BTW, there are actually two hardware revisions of the 1100AH. MikroTik never really documented this anywhere, so I will call the older one "rev. A" and the newer one "rev. B". Rev. A boards are based off of the original 1100 board design (just with a different CPU), actually have "RouterBOARD 1100" silkscreened onto the board, generally contain a 512MB NAND flash onboard, and use Atheros 8316 switch chips. Rev. B boards look like the 1100AHx2 (again, just with a different CPU), have "RouterBOARD 1100AH" silkscreened onto the board, generally contain a 64MB NAND flash onboard (although this can vary), and use Atheros 8327 switch chips. I personally have one of the older Rev. A boards, and have access to Rev. B boards at work. If you are correct that the particular switch chip is causing this, then Rev. A boards should have problems while the Rev. B boards should not.

I personally have my doubts that it is related to the switch chip, but without knowing more about the low-level details of each board's design and how MR works, it is hard to say. I still say that there is a link between power input and whether you experience the freeze/reboots, and I personally wonder if it is related to the power regulator. I haven't tried to closely compare the 450G to the 433AH to see if they are using the same component(s) for this.

It is also interesting that you are seeing console crashes on your 450G; none of the rest of us are. The device simply hangs, and then the watchdog kicks it after a few seconds of this. If watchdog is turned off, the board recovers after a minute or so and does not reboot. So your symptoms are decidedly different than most people's. Have you sent those crash logs + a supout to support?

In just 2 short hours, my 450G will have been up for 4 days straight without rebooting or freezing. And this, I am convinced, is because I am using a different power supply. Again, I cannot explain how or why...I can only tell you what I am experiencing.

-- Nathan

EDIT: I'm now at over 4 days of uptime. I'm going to take it down now and relocate it for more extended stress-testing. At the same time, I will find a different 12v power supply to use, just for grins, and I'm building a new OpenWRT image to use for future tests on both the 450G and 1100AH.
 
peson
Trainer
Trainer
Posts: 183
Joined: Tue Jul 20, 2004 10:33 am
Location: Sweden

Re: Has MikroTik given up on MetaROUTER?

Mon Apr 09, 2012 11:23 pm

Nathan,
note that my crash-log came from the MR not from the 450G itself.
The watchdog of the MR rebooted it, but the host stayed alive.

I've noticed the difference between the 1100AH revisions. It's sad that MikroTik didn't put a revision note on the routers.
Another thing is that the Rev A has the encryption chip and the Rev B doesn't.

I did some testing on the 1100AH (rev A) with different PSUs feeding the main power plug, PoE, and both at the same time, with no luck.
My 1100AH has an MR that acts as a gateway to the management net, and there are lots of SNMP (UDP) sessions through it.
So a constant SIP session to/from Asterisk in an MR would be interesting to see.
The same config on a x86 ROS KVM works fine, but the 1100AH reboots.

-Paul
Reboot is the last resort, try to find out what's wrong instead.
 
NathanA
Forum Veteran
Forum Veteran
Posts: 801
Joined: Tue Aug 03, 2004 9:01 am

Re: Has MikroTik given up on MetaROUTER?

Tue Apr 10, 2012 12:46 am

Another thing is that the Rev A has the encryption chip and the Rev B doesn't.
Not to stray too far off-topic here, but...how do you know the Rev. A has the encryption engine on its CPU? I thought only the RB1000 CPU had that.

-- Nathan
 
peson
Trainer
Posts: 183
Joined: Tue Jul 20, 2004 10:33 am
Location: Sweden

Re: Has MikroTik given up on MetaROUTER?

Tue Apr 10, 2012 1:49 am

Another thing is that the Rev A has the encryption chip and the Rev B doesn't.
Not to stray too far off-topic here, but...how do you know the Rev. A has the encryption engine on its CPU? I thought only the RB1000 CPU had that.

-- Nathan
sys resour pr
uptime: 9w6d2h21m50s
version: 5.12
free-memory: 1433200KiB
total-memory: 1555424KiB
cpu: e500v2
cpu-count: 1
cpu-frequency: 1066MHz
cpu-load: 6%
free-hdd-space: 476216KiB
total-hdd-space: 520192KiB
write-sect-since-reboot: 8368
write-sect-total: 56755
bad-blocks: 0%
architecture-name: powerpc
board-name: RB1100AH
platform: MikroTik

sys resource pci pr
 #  DEVICE   VENDOR                       NAME                       IRQ
 0  06:00.0  Attansic Technology Corp.    unknown device (rev: 192)   18
 1  05:00.0  Freescale Semiconductor Inc  MPC8544E (rev: 17)           0
 2  04:00.0  Attansic Technology Corp.    unknown device (rev: 192)   17
 3  03:00.0  Freescale Semiconductor Inc  MPC8544E (rev: 17)           0
 4  02:00.0  Attansic Technology Corp.    unknown device (rev: 192)   16
 5  01:00.0  Freescale Semiconductor Inc  MPC8544E (rev: 17)           0
 6  00:00.0  Freescale Semiconductor Inc  MPC8544E (rev: 17)           0

From http://www.freescale.com/webapp/sps/sit ... e=MPC8544E:
Integrated security engine supporting DES, 3DES, MD-5, SHA-1/2, AES, RSA, RNG, Kasumi F8/F9 and ARC-4 encryption algorithms (MPC8544E)
So, hold on tight to the Rev A routers ;-)
/Paul
Reboot is the last resort, try to find out what's wrong instead.
 
NathanA
Forum Veteran
Posts: 801
Joined: Tue Aug 03, 2004 9:01 am

Re: Has MikroTik given up on MetaROUTER?

Tue Apr 10, 2012 3:56 am

Interesting. So RB1100AH Rev. A had an MPC8544E. RB1000 had an MPC8547E according to the docs, but '/system resource pci print' shows nothing. RB1100AH Rev. B has a P2010E (single-core), but apparently the PCI device ID for the P2010E is the same as the dual-core P2020E, so it shows P2020E in '/system resource pci print' on a Rev. B board.

But it is unclear to me which of the MPC8544E and the P2010E is actually the better processor. The P2010E has double the L2 cache of the MPC8544E (512KB vs. 256KB), AND the Freescale site shows that the P2010E *ALSO* has the encryption engine in it!

http://www.freescale.com/webapp/sps/sit ... 571050A9A1
Integrated security engine: Protocol support includes SNOW, ARC4, 3DES, AES, RSA/ECC, RNG, single-pass SSL/TLS, Kasumi, XOR acceleration
So if you ask me, RB1000, both versions of RB1100AH, and RB1100AHx2 all have the encryption acceleration.

-- Nathan

EDIT: I just took a look at an RB1100, which has an MPC8544 (no E at the end). I then realized what the 'E' probably stands for: encryption. The Rev. B AH claims to have a P2020E in PCI resources, but that can't be right since that's a dual-core chip. So I don't know if I trust the 'E' at the end either. So at this point I don't know if encryption engine on the Rev. B can be either confirmed or denied. Someone will have to be willing to take the heatsink off the CPU of their Rev. B board and collect a model # off of it to know for sure.
 
reverged
Member Candidate
Posts: 270
Joined: Thu Nov 12, 2009 8:30 am

Re: Has MikroTik given up on MetaROUTER?

Tue Apr 10, 2012 11:07 pm

NathanA -
Some good work and you are not crazy about the power supply theory. It has come up in other threads but there was no conclusion.
What is the model number or brand of your 12V supply? How long is the dc cord?

I have some observations to add.
450G, ROS5.14
I installed an OpenWRT mr. 16MB RAM, no disk specified. Basically the winbox defaults.
No interfaces on the mr.
I simply imported it and it booted up.

First, I can't get the right-click reboot to work in winbox. The mr doesn't reboot. Only a console reboot actually reboots the mr.

Second, and this is really weird and might be related to power supply theories, etc.
I see really weird values in system health during a metarouter boot. Like bizarre values bouncing all over.
It is very repeatable each metarouter boot - although the extent of value change differs.
It's difficult to capture in winbox, so I wrote a script to spit it out every 1 second, and I see the same thing from the script.
Here's a screenshot at just random points when I noticed strange values:
erratic.GIF
The bottom image is more or less the normal values.

Why would a metarouter boot cause the system health values to go crazy???
The data is clearly bogus as there is no way it is 6C in my 450G.
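
For anyone logging these readings once per second, a minimal sketch of how the outliers could be picked out afterwards. It assumes the samples have already been collected into a list; the function name, the threshold, and the sample values are all made up for illustration and are not readings from this board.

```python
from statistics import median

def flag_erratic(samples, max_dev):
    """Return the indices of samples that differ from the median by more than max_dev."""
    baseline = median(samples)
    return [i for i, value in enumerate(samples) if abs(value - baseline) > max_dev]

# Hypothetical temperature readings in deg C, one per second (illustrative only).
temps = [48, 48, 49, 6, 48, 71, 48, 47]
print(flag_erratic(temps, max_dev=5))
```

The median is used as the baseline rather than the mean so that a few wild readings do not drag the reference value along with them.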
 
reverged
Member Candidate
Posts: 270
Joined: Thu Nov 12, 2009 8:30 am

Re: Has MikroTik given up on MetaROUTER?

Wed Apr 11, 2012 3:21 am

I can get the same thing to happen if I install a ROS metarouter.
erratic2.gif
This doesn't seem to happen to as great an extent when I reboot a ROS metarouter.
 
cdemers
Member Candidate
Posts: 184
Joined: Sun Feb 26, 2006 3:32 pm
Location: Canada

Re: Has MikroTik given up on MetaROUTER?

Wed Apr 11, 2012 3:40 am

I've been having some trouble getting an RB750G to run a stable MetaROUTER: 15MB of RAM allocated and an otherwise blank configuration, running on 5.14. I tried various adapters, as others have (12V 500mA, 12V 1000mA, 24V 380mA), all with the same results. I even netloaded it with a clean config/OS. Same results. Most of the time I get lots of errors on the metarouter's console at boot. As long as I don't allocate more than 15MB of RAM the unit does not reboot, but the single metarouter fails to run most of the time. And when it has run, it pauses for long periods of time and then eventually stops responding.
 
janisk
MikroTik Support
Posts: 6283
Joined: Tue Feb 14, 2006 9:46 am
Location: Riga, Latvia

Re: Has MikroTik given up on MetaROUTER?

Wed Apr 11, 2012 9:17 am

If PSU changes do affect stability, maybe you should check the capacitors on your board; maybe those pesky things are nearing the end of their life, as a guest OS adds quite some load.
 
NathanA
Forum Veteran
Posts: 801
Joined: Tue Aug 03, 2004 9:01 am

Re: Has MikroTik given up on MetaROUTER?

Wed Apr 11, 2012 10:22 am

If PSU changes do affect stability, maybe you should check the capacitors on your board; maybe those pesky things are nearing the end of their life, as a guest OS adds quite some load.
janisk,

Good thought, but mine is a brand-new 450G...I looked at the capacitors anyway, and they are in good shape still.

Still no problems so far with 12V 1A, but tons of crashes with 24V 0.8A.

-- Nathan

EDIT: I started doing heavier testing on my 1100AH Rev. A, and just had my first crash within 24 hours (actually, just under 24 hours). And it crashed HARD. Watchdog did NOT reboot, and serial console was completely unresponsive. I have powercycled it and will watch it further. 450G with 12V power supply is still humming right along.
 
liquidcz
Frequent Visitor
Posts: 73
Joined: Tue Dec 28, 2010 1:24 pm

Re: Has MikroTik given up on MetaROUTER?

Wed Apr 11, 2012 12:55 pm

I just purchased a brand new RB450G for testing virtualization (MetaROUTER), and I have to acknowledge that with a 12V power supply the RB is more stable, without reboots or freezes.

I'm using a brand new power supply, "Sunny 12V 2A 24W": http://www.sunny-euro.com/data/files/85 ... 353874.pdf

My uptime is just a few hours so far, so I'll report back on it later.
Last edited by liquidcz on Wed Apr 11, 2012 3:00 pm, edited 1 time in total.
 
janisk
MikroTik Support
Posts: 6283
Joined: Tue Feb 14, 2006 9:46 am
Location: Riga, Latvia

Re: Has MikroTik given up on MetaROUTER?

Wed Apr 11, 2012 2:49 pm

And the PSUs are OK too? If an older PSU is used, that could also cause some problems, as under load the voltage drops lower than expected. Just some thoughts.

Some 2-3 years after these boards were given to me, the capacitors went bad; after re-soldering, no crashes.
 
barkas
Member Candidate
Posts: 260
Joined: Sun Sep 25, 2011 10:51 pm

Re: Has MikroTik given up on MetaROUTER?

Wed Apr 11, 2012 5:07 pm

And the PSUs are OK too? If an older PSU is used, that could also cause some problems, as under load the voltage drops lower than expected. Just some thoughts.
Strangely, only the ones with higher voltage seem to cause the reboots, while my cheap 12V power supply works so far.
 
timberwolf
Member Candidate
Topic Author
Posts: 274
Joined: Mon Apr 25, 2011 12:08 pm
Location: Germany

Re: Has MikroTik given up on MetaROUTER?

Wed Apr 11, 2012 8:18 pm

I have two RB450Gs, one at home and one hosted at a datacenter. The one at the datacenter is hardwired to a quite good 24V PSU and crashed really often with MR, even though it is a brand new board. I can't test with this board in the near future, and swapping the PSU isn't an option either.
The RB450G at home crashed even when powered by a lab supply at 12V with the current limit set to 4A; the current draw was in the range of about 200mA.

The strange readings reverged got seem to indicate that something is interfering with an A/D conversion. Are there any details on how this is implemented on the RB450G?
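
To put the supply ratings mentioned so far side by side, here is a small sketch that only multiplies volts by amps. The ratings are the ones quoted in this thread; the function name and labels are just for illustration, and rated power says nothing about ripple or regulation quality.

```python
def rated_watts(volts, amps):
    """Rated output power of a DC supply in watts (P = V * I)."""
    return volts * amps

# Supply ratings quoted in the thread; the labels are only used for printing.
supplies = {
    "12V 1A": (12.0, 1.0),
    "24V 0.8A": (24.0, 0.8),
    "12V 2A (Sunny)": (12.0, 2.0),
}

for label, (volts, amps) in supplies.items():
    print(f"{label}: {rated_watts(volts, amps):.1f} W rated")

# Draw observed on the lab supply: about 200 mA at 12 V.
print(f"observed draw: {rated_watts(12.0, 0.2):.1f} W")
```

On these numbers the 24V 0.8A units are rated higher than the 12V 1A ones, and the observed draw is far below either, so a plain shortage of rated power does not look like the explanation.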
 
timberwolf
Member Candidate
Topic Author
Posts: 274
Joined: Mon Apr 25, 2011 12:08 pm
Location: Germany

Re: Has MikroTik given up on MetaROUTER?

Wed Apr 11, 2012 8:22 pm

Some 2-3 years after these boards were given to me, the capacitors went bad; after re-soldering, no crashes.
I did put quite some load with encryption on my RB450G some time ago, no crashes, so this somehow doesn't quite fit.
 
NathanA
Forum Veteran
Posts: 801
Joined: Tue Aug 03, 2004 9:01 am

Re: Has MikroTik given up on MetaROUTER?

Wed Apr 11, 2012 8:22 pm

What is the model number or brand of your 12V supply? How long is the dc cord?
I have used 3 or so different ones over the course of testing. The first one was a power supply taken from a new shipment of RB751U for North America (markings: Nalin NLB100120W1A). The second one was one that I stole from a Motorola SIP VoIP adapter (markings: Delta Electronics ADP-15ZB), but I had to be careful around it since the DC socket size was mismatched between the PS and the 450G, so if I wiggled the cable even a little, I risked cutting power to the board. This is the power supply I got 4 days of uptime with over the Easter weekend, though. The third one is one that I think originally came with a shipment of refurbished Motorola DOCSIS cable modems (SB51xx), but they were not the correct ones (the DC connector on the PS was too small for the DC jack on the cable modems) so we used them for other things. I will have to get the markings off of this one later.

Interestingly, I had my 450G reboot on me for the first time while hooked up to a 12V 1A or greater power supply; it happened over night last night. It was with the last of the three power supplies I listed. And it was even while it was in an air-conditioned room, which it wasn't in over the weekend when I achieved the 4 day uptime record. I am actually a bit suspicious of the power supply, since the first two measured pretty close to 12V on '/system health print' but this third one is measuring between 14.6V and 14.8V. So I will swap it out with a different one and see if it happens again.
I see really weird values in system health during a metarouter boot. Like bizarre values bouncing all over.
I checked, and I see the same thing, too! My values on the 450G are not as wild as yours, though. But I even see this on the 1100AH! Voltage dipped from 12.9V to 7.6V according to '/system health' when I booted up an MR image on an 1100AH Rev. A.
And the PSUs are OK too? If an older PSU is used, that could also cause some problems, as under load the voltage drops lower than expected. Just some thoughts.
Good thoughts...keep them coming. I doubt it is the PSU age, though, because I've tried a few different 24V ones and even brand-new ones cause it to reboot.

You say that MetaROUTER puts the system under quite a bit more load. Can you elaborate on that? I see CPU load peak when the MR is booting up, but after it is booted, I never see my CPU go above 10% utilization even with a couple of SIP calls running to Asterisk in the MR. But even when CPU load is this low, the 450G still reboots itself under certain circumstances with certain power supplies.

Are all new 450G shipments using better capacitors? Or should I be replacing the capacitors even on new 450G shipments? The one I am using for testing we got about a month ago. I can give you the serial number of the 450G board if that would be helpful.
Some 2-3 years after these boards were given to me, the capacitors went bad; after re-soldering, no crashes.
Very interesting. What power supply are you using with it?

Thanks,

-- Nathan

EDIT: Had my 1100AH "crash" for the second time. It happened just a few minutes after I tried loading it down with some calls again, and this time the watchdog did kick it. I'm restarting those calls, have a console attached up to it now, and will see what happens. This is disappointing...the 1100AH comes with a pretty beefy power supply.

EDIT 2: 1100AH rebooted itself again after about 2 hours of 4 continuously active calls. Serial console from one MT to the other just showed a login screen and no history...argh. Going to turn off the watchdog and see if the 1100AH is just freezing up like the 450G does, or if it is failing in a different way (e.g., kernel panic).
 
janisk
MikroTik Support
Posts: 6283
Joined: Tue Feb 14, 2006 9:46 am
Location: Riga, Latvia

Re: Has MikroTik given up on MetaROUTER?

Thu Apr 12, 2012 8:39 am

Both reported RB450Gs are on 0.8A@24V PSUs, which in turn are not very fresh.
 
NathanA
Forum Veteran
Posts: 801
Joined: Tue Aug 03, 2004 9:01 am

Re: Has MikroTik given up on MetaROUTER?

Thu Apr 12, 2012 9:04 am

Both reported RB450Gs are on 0.8A@24V PSUs, which in turn are not very fresh.
Sounds like the same kind of power supply I'm using with my 450G that causes it to reboot constantly...DVE brand?

Do you think that capacitors on recently manufactured 450Gs still should be replaced, or are current batches using higher-quality caps? (Mine is serial # 33B601757055, if that tells you anything about when it was manufactured and whether or not it might have shipped with substandard caps.)

You might be interested to know that I am having a lot of problems with MetaROUTER on my 1100AH. It is failing in different ways than the 450G. The 450G would hang for a minute or two and then continue to work, unless the watchdog is enabled, in which case the watchdog would reboot the board before it had a chance to start responding again. My 1100AH, however, is sometimes not rebooting when watchdog is enabled, and IS sometimes rebooting when I have watchdog DISabled. If it is printing a crash report to the serial console, I still haven't been able to catch it, but I will keep trying. It is NOT generating an auto-supout, unfortunately.

The 1100AH has rebooted 3 times today. The 450G didn't reboot at all until I put the original 24V 800mA power supply back on it. Now it has rebooted twice.

-- Nathan
 
janisk
MikroTik Support
Posts: 6283
Joined: Tue Feb 14, 2006 9:46 am
Location: Riga, Latvia

Re: Has MikroTik given up on MetaROUTER?

Thu Apr 12, 2012 10:22 am

New boards should have good capacitors on them and do not need replacement.

About the RB1100AH: what do you have configured there? Try to check what you have set, and whether recreating this on another MR, with the original disabled, causes the same problem. Also, you could send the configuration over so I can try to set up the config locally.
 
NathanA
Forum Veteran
Posts: 801
Joined: Tue Aug 03, 2004 9:01 am

Re: Has MikroTik given up on MetaROUTER?

Thu Apr 12, 2012 10:44 am

Also, you could send the configuration over so I can try to set up the config locally.
janisk,

I'm working on putting together a test suite for you. Basically, right now, the RB1100AH is doing almost nothing aside from running the MetaROUTER, as you can see from this '/export compact':
/interface bridge
add name=bridge1
/metarouter
add memory-size=128MiB name=ast-owrt-mr
/interface bridge port
add bridge=bridge1 interface=ether1
/ip dhcp-client
add disabled=no interface=bridge1
/metarouter interface
add dynamic-bridge=bridge1 type=dynamic virtual-machine=ast-owrt-mr
/system routerboard settings
set cpu-frequency=1333MHz
/system watchdog
set watchdog-timer=no
I did forget that I was overclocking to 1333MHz, so I'll try clocking that back down and see if that helps at all. But the router is under almost no load from the MR, so I'd be surprised if that were the problem (?).

Inside the single MR instance, I'm running my OpenWRT build that I mention in this thread: http://forum.mikrotik.com/viewtopic.php ... 00#p311681; I will put together some instructions on how to generate some activity within the MR. I recommend loading this image onto two different RouterBOARDs connected to each other, and then originate several SIP calls with looping audio between the two RBs. This usually gets one of the two RBs to crash after a couple of hours.

Thanks for staying engaged with this!

-- Nathan
 
NathanA
Forum Veteran
Posts: 801
Joined: Tue Aug 03, 2004 9:01 am

Re: Has MikroTik given up on MetaROUTER?

Thu Apr 12, 2012 12:13 pm

janisk,

Okay, here's the easiest way to set up a test that most closely mirrors mine, I think.

Take two RouterBOARDs that can run MetaROUTER. For example, 450G and 1100AH. :) Upgrade to 5.14, latest RouterBOOT, etc. '/system reset-configuration no-defaults=yes'

Grab my OpenWRT rootfs:

http://www.nconx.com/~nathan/ast-owrt-m ... s_mips.tgz (MIPS)
http://www.nconx.com/~nathan/ast-owrt-m ... fs_ppc.tgz (PowerPC)

Import into each RB once. I've been giving the instance 128MB of RAM, but I'm not sure if that's important or not. Assign one dynamic network interface to the MR. Make sure the MR on the first RB (e.g., 450G) is reachable from the MR on the second (e.g., 1100AH). Easiest way to do this would probably be to add one of the ethers on each RB to a bridge that the dynamic vif is in, and have a third RB that is acting as a DHCP server (my OpenWRT image will try to get IP via DHCP by default).

SSH into the MR running on the 450G: username 'root' password 'ast-owrt'. Edit the file /etc/asterisk/sip.conf, and add a SIP peer entry that points at the 1100AH at the very end of the file:
[rb1100ah]
type=peer
host=1.1.1.2
...where the IP on the "host=" line is the IP address that the MR on the 1100AH has. Then edit the same file on the 1100AH, and add a SIP peer entry pointing at the 450G in the same way:
[rb450g]
type=peer
host=1.1.1.1
Next, on both MRs, edit the file /etc/asterisk/extensions.conf, and go to line number 482, which should have a line that looks like this:
exten => t,1,Goto(#,1)
Change it so that it looks like this:
exten => t,1,Goto(s,restart)
This will ensure that once a call is established between the two MRs that the audio keeps looping and neither side hangs up after a period of inactivity.
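
If the same peer entries have to be added on several boards, the stanza can be generated rather than typed by hand. The following is only a sketch: the helper name is hypothetical, and the peer names and addresses are the example values used above.

```python
def sip_peer_stanza(name, host):
    """Render a minimal sip.conf peer entry of the form shown above."""
    return f"[{name}]\ntype=peer\nhost={host}\n"

# Example values from the steps above; substitute the MRs' real addresses.
peers = {"rb1100ah": "1.1.1.2", "rb450g": "1.1.1.1"}

for name, host in peers.items():
    print(sip_peer_stanza(name, host))
```

Each stanza would still need to be appended to the end of /etc/asterisk/sip.conf on the opposite MR, as described above.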

Finally, reboot both MRs. Once they have both booted back up again, log into one of the MRs (it doesn't matter which one: both sides will be transmitting audio to the other, so it doesn't matter which one initiates the call), and place several calls from it to the other MR. I usually do about 10 simultaneous calls. For example, to place a call from the 450G to the 1100AH, SSH into the 450G's MR as username 'admin' password 'ast-owrt', and then run this command at the Asterisk console:
originate sip/rb1100ah extension s
This sends a SIP INVITE to the peer named 'rb1100ah', and then calls local extension 's' in the default context, which starts the demo audio file playback loop. You should be able to see vif1 on both the 450G and the 1100AH transmitting roughly 80kbit/s bidirectionally. (At times, you may see this go to 0 for a few seconds. That's because there is a point in the demo playback loop where it pauses and waits for input. When it doesn't hear any, it restarts from the top.)

You can see all active calls with this command:
core show channels
At this point, you will only see 1 entry, and it will summarize this as "1 active channels, 1 active calls" at the end. To create more calls, simply run the same 'originate' command as many times as you wish. For each time you run that command, 'core show channels' will show additional calls running.
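
When checking the call count repeatedly over a long test run, the summary line can be parsed instead of read by eye. This is a sketch that assumes the summary has the shape quoted above ("1 active channels, 1 active calls"); the function name is made up, and how the console output gets captured is left out.

```python
import re

# Matches the summary text quoted above; a trailing "s" is treated as optional.
SUMMARY_RE = re.compile(r"(\d+) active channels?\D+(\d+) active calls?")

def parse_summary(text):
    """Return (channels, calls) from a 'core show channels' summary, or None."""
    match = SUMMARY_RE.search(text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))

print(parse_summary("1 active channels, 1 active calls"))
```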

At this point, just watch both RouterBOARDs and wait. :)

Thanks again for your help,

-- Nathan
 
peson
Trainer
Posts: 183
Joined: Tue Jul 20, 2004 10:33 am
Location: Sweden

Re: Has MikroTik given up on MetaROUTER?

Thu Apr 12, 2012 2:16 pm

The 1100AH has rebooted 3 times today. The 450G didn't reboot at all until I put the original 24V 800mA power supply back on it. Now it has rebooted twice.
I'm having the same experience with my 1100AH Rev A routers, but after disabling the watchdog they haven't rebooted anymore, at least for the last 3d14h :-)
It's interesting to hear that different PSUs affect the stability, but why does it only happen when the MR is running?
If I stress a 450G or 1100AH Rev A with tons of traffic and CPU load it doesn't reboot.

Questions to MT staff:
- Why does it reboot when running an MR and not when stressing the router?
- When the watchdog is disabled and the router stalls, what happens inside the router?
- Why doesn't it create a supout file when the watchdog reboots it?
- Can you share documents how the networking works between the host and guest in MR?
I've discovered some strange things, read more here:
http://forum.mikrotik.com/viewtopic.php ... 10#p302710

/Paul
Reboot is the last resort, try to find out what's wrong instead.
 
peson
Trainer
Posts: 183
Joined: Tue Jul 20, 2004 10:33 am
Location: Sweden

Re: Has MikroTik given up on MetaROUTER?

Thu Apr 12, 2012 2:40 pm

New boards should have good capacitors on them and do not need replacement.

About the RB1100AH: what do you have configured there? Try to check what you have set, and whether recreating this on another MR, with the original disabled, causes the same problem. Also, you could send the configuration over so I can try to set up the config locally.
Janis!
Following your recommendation in my trouble ticket Ticket#2012012666000134, I've recreated the configs.
I've also reported back. Why keep asking in the forum for the same things that have already been done in trouble tickets?

Please shine :-) a bit and report back if you find anything; my bet is a software/driver problem in the networking part of MR.
Why not set up a list of things to test and assign tasks to a list of people willing to "commit the force"?
From my reading about MR, there are lots of people who have put a lot of effort into getting things working.
Many of those, like me, almost gave up testing, but since we are technicians we have an instinct to get things to work :-)
So, one man cannot do everything, but many can do something. Let us collaborate on this.

/Paul
Reboot is the last resort, try to find out what's wrong instead.
 
NathanA
Forum Veteran
Posts: 801
Joined: Tue Aug 03, 2004 9:01 am

Re: Has MikroTik given up on MetaROUTER?

Thu Apr 12, 2012 10:25 pm

Quick update:

I have set up 2 MetaROUTER "labs" since I posted my step-by-step "howto", and they are as follows:

1 RB450G connected to 1 RB433AH
1 RB1100 connected to 1 RB1100AH Rev. A

So far it has been about 12 hours since I started the first pair, and 8 hours for the second pair, and none of the devices have rebooted or crashed yet. Each pair has 10 active SIP calls running between them, and I set them up exactly as I described in my last post for janisk.

For the first pair (RB450G + RB433AH), I put the Delta Electronics 12V 1.25A power supply back on the 450G, and am running the 433AH with the DVE 24V 0.8A that causes the 450G to reboot.

For the second pair (RB1100 + RB1100AH), the 1100 is running at its factory default clock speed of 800/400 (CPU/RAM), I put the clock speed of the 1100AH back to its factory default of 1066/533 (CPU/RAM), and have watchdog disabled on both the 1100 and the 1100AH. If neither crashes or reboots for 48 hours, then I will re-enable watchdog on both.

I am secretly hoping that both the 1100 and 1100AH either crash or reboot, because I'd hate to think that my 1100AH cannot run reliably at 1333MHz. ;)

-- Nathan
 
peson
Trainer
Posts: 183
Joined: Tue Jul 20, 2004 10:33 am
Location: Sweden

Re: Has MikroTik given up on MetaROUTER?

Fri Apr 13, 2012 2:48 am

I am secretly hoping that both the 1100 and 1100AH either crash or reboot, because I'd hate to think that my 1100AH cannot run reliably at 1333MHz. ;)

-- Nathan
I'm running both my 1100AH Rev A routers at the factory-set speed. One has the watchdog disabled and the other has it enabled.
The one with the WD disabled keeps running and the other reboots.
The first has an ROS image that acts as a gateway to a management network, and the other has an ROS image that does nothing.
Sessions through the management image end in what looks like a communication failure, probably when the system "hangs" and recovers.

So I won't be surprised if it doesn't reboot as long as you have WD disabled.

/Paul
Reboot is the last resort, try to find out what's wrong instead.
 
NathanA
Forum Veteran
Posts: 801
Joined: Tue Aug 03, 2004 9:01 am

Re: Has MikroTik given up on MetaROUTER?

Fri Apr 13, 2012 3:53 am

The one with the WD disabled keeps running and the other reboots.
In my case, with the 1100AH running at 1.33GHz, it was sometimes rebooting with watchdog DISABLED, and sometimes completely hanging with watchdog ENABLED (watchdog did not kick in), requiring a powercycle.

Now that it's back at factory frequency, it has been up for 14 hours now (watchdog still disabled). This is why I suspect overclocking and MetaROUTER don't mix. :( Oh well, 1066MHz is still plenty fast. ;) It would be nice to have a spare RB1000 to play with, though...(I have some, but they are all in production and can't be used for "lab experiments")
So I won't be surprised if it doesn't reboot as long as you have WD disabled.
Right, that would make sense. It would be acting how the 450G acts. What I was seeing, though, was that it would reboot even when watchdog was *disabled*. Probably because the CPU was unstable when overclocked and running under the load of MetaROUTER.

-- Nathan
 
liquidcz
Frequent Visitor
Posts: 73
Joined: Tue Dec 28, 2010 1:24 pm

Re: Has MikroTik given up on MetaROUTER?

Fri Apr 13, 2012 10:00 am

Well, I now have an uptime of 1d 6h without a reboot or freeze. I'm going to leave it running over the weekend.

My configuration is:
- brand new RB450G
- brand new power supply, Sunny 12V 2A 24W
- two OpenWrt-trunk metarouters, each connected by one dynamic interface to the local bridge
- each metarouter has a console attached, running the top command
- I'm connected to each metarouter by SSH from another machine, running the top command
- watchdog is disabled
 
barkas
Member Candidate
Posts: 260
Joined: Sun Sep 25, 2011 10:51 pm

Re: Has MikroTik given up on MetaROUTER?

Fri Apr 13, 2012 10:44 am

I don't see watchdog disabled as a particularly useful testing scenario - I won't risk having one of those crash on me when it's in some datacenter, so watchdog will always be enabled in production environments.
 
NathanA
Forum Veteran
Posts: 801
Joined: Tue Aug 03, 2004 9:01 am

Re: Has MikroTik given up on MetaROUTER?

Fri Apr 13, 2012 11:08 am

I don't see watchdog disabled as a particularly useful testing scenario - I won't risk having one of those crash on me when it's in some datacenter, so watchdog will always be enabled in production environments.
Of course you wouldn't do that in production. The point of the test, though, is to gain a better understanding of what the source of the problem is in the first place. When we tested with the watchdog off, we learned that on the 450G at least, MetaROUTER wasn't directly causing the reboots -- the watchdog was rebooting the system when it became unresponsive. But we also learned that it only remains unresponsive for a relatively short period of time, and then it "wakes up" again. The system isn't crashing and there are no kernel panics happening. This is all useful information for the developers to know.

-- Nathan
 
NathanA
Forum Veteran
Posts: 801
Joined: Tue Aug 03, 2004 9:01 am

Re: Has MikroTik given up on MetaROUTER?

Fri Apr 13, 2012 11:09 am

I've been learning a fair bit about MetaROUTER in these most recent tests. In the process, I am becoming more and more convinced that MetaROUTER is harder on the CPU than RouterOS in general is *and* that most of the crashes and freeze-ups we are seeing are power-consumption related.

1. On MIPS, I believe that CPU utilization is being reported incorrectly by RouterOS. It is severely underreporting CPU usage when a MetaROUTER guest is running. I know this because inside the MR guest, when I started loading Asterisk down more and more, the CPU went down to 0% idle, and both the MR and the RouterOS host became very "sluggish", but RouterOS was only reporting 20-30% CPU load the whole time. Also, the network utilization of the vif1 interface plateaued at about 1.6Mbit/s when I hit about 20 calls or so...additional calls increased the sluggishness but did not show any additional traffic being moved -- a good sign that the CPU is bottlenecking things. Yet '/system resource print' was not showing this (and neither was '/tool profile', which seemed to be in agreement and claimed that the system was largely idle).

2. On PowerPC, I believe CPU utilization reported by RouterOS is closer to being accurate compared to MIPS. When I put a 1100AH under the same kind of load (20 simultaneous SIP calls on Asterisk) in the MR as I did with the MIPS guest (450G or 433AH), it shows 30-40% CPU load. I was able to add more calls on top of that until I got to about 40 or 50 simultaneous calls, at which point RouterOS showed 100% CPU and the vif1 utilization started to show signs of plateauing. However, '/tool profile' on PowerPC did not agree with '/system resource print'...it would show 70% idle (or more) while '/system resource' was claiming 90%+ CPU utilization. So '/tool profile' ought to be fixed to reflect MetaROUTER CPU usage.

3. The strange '/system health' numbers that the user reverged was seeing when he booted a MetaROUTER *also* occur when you increase the CPU load inside a MetaROUTER. So on my 450G, when I started to really hammer the CPU by adding additional active calls on Asterisk, my '/system health' numbers would go wild. Without any CPU load, I would normally see right around 12V and 48C, + or - 0.1v and 1C every now and then. When the MR is loaded down and really working the CPU, my voltage is swinging anywhere between 11 and 13, and my temperature will bounce all around from 52 to 60 to 54 to 59 to 50...just all over the place. Once I kill the CPU-hungry task in the MR, those numbers stabilize again. This *does not happen* if I load the CPU down in RouterOS with something like '/tool bandwidth-test 127.0.0.1 protocol=tcp': CPU goes to 100% but '/system health' values are stable.

4. Overclocking and MetaROUTER don't seem to mix. Of course, all hardware is different and certain batches of the same CPU have higher tolerances or fewer physical imperfections than others. But MetaROUTER seems to really push the CPU to the limits, even when you don't see it or realize it (see point #1 again...the CPU load that RouterOS is reporting to you isn't accurate!). Ever since putting my 1100AH back to 1066MHz from 1333MHz, I have not had a single freeze-up or reboot, and it has been almost 24 hours now. On my 433AH which has always been solid, I tried overclocking from 680MHz to 800MHz. It did not cause RouterOS to reboot or freeze-up, but whenever I started to pile on the calls in Asterisk, after about an hour or so, Asterisk would bomb out with a "Bus error". I put it back down to 680MHz and that problem disappeared, too. Been running for hours now processing 40 simultaneous calls at that speed. Interestingly, my 450G with the 12v supply has no problems running a loaded-down MetaROUTER at 800MHz. Rock solid! (I'm running the 433AH on the 24v, and am thinking about trying to overclock it again while it's on the 12v.)
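
As a sanity check on the plateau in point 1, here is a small sketch of the traffic one would expect if every call added the same rate. It assumes the rough 80kbit/s per call figure quoted earlier in the thread for vif1; the constant and function name are illustrative only.

```python
PER_CALL_KBIT = 80  # rough per-call rate on vif1 quoted earlier in the thread

def expected_mbit(calls, per_call_kbit=PER_CALL_KBIT):
    """Traffic expected if every call adds the same rate, in Mbit/s."""
    return calls * per_call_kbit / 1000

for calls in (10, 20, 40):
    print(f"{calls} calls -> {expected_mbit(calls):.1f} Mbit/s expected")
```

Under that assumption 20 calls come to about 1.6 Mbit/s, which lines up with where the MIPS boards levelled off; calls beyond that point would be expected to add traffic if the CPU were keeping up.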

So in conclusion, I suspect, based on these experiences, that the various crashes, reboots, and freeze-ups that I have been experiencing are a combination of CPU power requirements when running MetaROUTER, and irregularities with the power regulation circuitry on these RouterBOARDs which make them more efficient with certain power supplies/voltage inputs than they are with others. The 450G may run fine with stock 24v supplies with no MetaROUTER to push it, but has a hard time with anything less than a solid 12v supply when running MetaROUTER for whatever reason.

Also, don't overclock MetaROUTER hosts, at least in a production setting. :)

I'm still continuing to allow my lab tests to proceed, and will continue to report my results and observations here. If anyone has any ideas for other paces they'd like to see me put these RBs through, I'm all-ears.

-- Nathan
 
peson
Trainer
Posts: 183
Joined: Tue Jul 20, 2004 10:33 am
Location: Sweden

Re: Has MikroTik given up on MetaROUTER?

Fri Apr 13, 2012 11:16 am


This is all useful information for the developers to know.

-- Nathan
I agree with Nathan, all information for the devs is useful.
That's why I want to put together a "task force" with knowledge from us and MT.
As I wrote in my reply:
http://forum.mikrotik.com/viewtopic.php ... 68#p311895

/Paul
Reboot is the last resort, try to find out what's wrong instead.
 
janisk
MikroTik Support
Posts: 6283
Joined: Tue Feb 14, 2006 9:46 am
Location: Riga, Latvia

Re: Has MikroTik given up on MetaROUTER?

Fri Apr 13, 2012 12:33 pm

Questions to MT staff:
1- Why does it reboot when a MR and not when stressing the router? 
2- When watchdog is disabled and router stall, what happens inside the router?
3- Why doesn't it create a supout file when watchdog reboot it?
1. have no idea - if we knew exactly what the difference is, we could fix it
2. not known, there are no signs that anything has happened
3. it works completely differently from host routeros and that debug information cannot be easily recovered. watchdog detects that nothing is responding and does a kind of power reset. so no crash information is available, since the kernel is required to write down the crash information, but if watchdog kicks in - there is no point in trying to get that info as that is not possible.
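
The watchdog behaviour described in answer 3 can be illustrated with a toy model. This is a generic sketch of how a hardware watchdog timer works in principle, not RouterOS's actual implementation; the class, its method names and the 10 second timeout are all made up for the example.

```python
class Watchdog:
    """Toy hardware watchdog: resets the board if it is not 'kicked' in time."""

    def __init__(self, timeout_s: float) -> None:
        self.timeout_s = timeout_s
        self.last_kick_s = 0.0
        self.resets = 0

    def kick(self, now_s: float) -> None:
        """Called periodically by a healthy system to prove it is alive."""
        self.last_kick_s = now_s

    def poll(self, now_s: float) -> bool:
        """Return True if the timeout expired and a reset was triggered."""
        if now_s - self.last_kick_s > self.timeout_s:
            # The reset is a blind power cycle: the watchdog only knows that
            # kicks stopped, not why, so there is nothing it could log.
            self.resets += 1
            self.last_kick_s = now_s
            return True
        return False


wd = Watchdog(timeout_s=10.0)  # hypothetical timeout
wd.kick(now_s=5.0)             # system still responsive at t=5
print(wd.poll(now_s=12.0))     # only 7 s since the last kick
print(wd.poll(now_s=20.0))     # 15 s of silence: reset
```

The point of the sketch is that the reset path never runs any code on the hung system, which is consistent with no supout file being written after a watchdog reboot.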

for debugging i have a special build and a special device to get at least some useful information - otherwise there is nothing extra that is visible from the host.

edit:
forgot about communication - if you use virtual ethernet, then virtual interface is added that receives packets, other end points to MR. MR then receives the data over internal virtual interface.

if you assign static interface, then hooks are added to physical interface. Some packets can still be received by host the rest of them are sent directly to virtual interface for MR
 
timberwolf
Member Candidate
Topic Author
Posts: 274
Joined: Mon Apr 25, 2011 12:08 pm
Location: Germany

Re: Has MikroTik given up on MetaROUTER?

Sat Apr 14, 2012 10:37 am

Thanks to all, for joining in on this topic! I am in a big project right now, so I can't contribute that much.
As it shows MR isn't even stable on PPC and it seems Nathan is right about the powersupply.
What still doesn't fit, is the fact that we can't get a RB crash with high load not related to MR.
At this point I don't have any more theories from an electronics and embedded engineering point of view, as I don't know how MR is implemented.
There is some link missing between the powersupply and the MR implementation that would make sense of this problem.

It also seems that MikroTik, again, has hit a wall regarding a possible fix. :-(
janisk, sergejs do you have any information to report back from the devs?
 
NathanA
Forum Veteran
Posts: 801
Joined: Tue Aug 03, 2004 9:01 am

Re: Has MikroTik given up on MetaROUTER?

Sat Apr 14, 2012 11:10 pm

timberwolf,

I've got some more data points to share, and I'll do it by way of response to your post:
As it shows MR isn't even stable on PPC [...]
I would read my posts more carefully. What I said was that I suspected my PPC crashes on the 1100AH were due to me overclocking the CPU; I suspected this because the behavior of PPC when crashing/rebooting was different than what I was seeing on MIPS/450G. I have since put the 1100AH back down to the factory-set clock rate, fired up an 1100 running at its factory-set clock rate, I have them both running an MR that is communicating with the other one (50 simultaneous SIP calls between them!), and have not had a single lock-up for 2.5 days on the 1100AH, and the 1100 has never locked up ever. The 1100 is running at 100% CPU continuously, and the AH is near 100%.

I also went ahead and re-enabled watchdog on both the 1100 and 1100AH ahead of schedule, nearly 24 hours after firing up the latest test. Not a single lock-up or freeze-up has occurred that has resulted in watchdog kicking in and rebooting either unit.
[...] and it seems Nathan is right about the powersupply. What still doesn't fit, is the fact that we can't get a RB crash with high load not related to MR.
I believe the powersupply situation is unique to the 450G and a handful of other MIPSBE-based boards, and I strongly believe that this is somehow related to the fluctuating system health sensors that reverged observed. They *only* fluctuate when the MR is under extreme load (such as initial boot-up), and I can reproduce it by forcing the CPU use in the MR to 100% continuously. This is one of the reasons why I believe that for some reason the power draw of the CPU is more when running MR than when not (for whatever reason). The other reason I suspect this is because as I mentioned, I also did some overclocking tests on the MIPSBE boards I'm using (450G and 433AH). The 433AH started having weird crashes and was acting erratically when I overclocked its CPU, but the erratic behavior seemed to be limited to the MR and not the host. (Incidentally, the 433AH voltage health sensor does not fluctuate under load.) The 450G seemed more stable, at least when using my Delta Electronics 12v power supply. HOWEVER, about 18 hours after I started the test with the 450G overclocked, it finally rebooted itself.

So I have started another extended test, this time with neither the 450G nor the 433AH overclocked. What I can tell you so far is this: First, I am now at over 24 hours uptime on both, with no hiccups and 20 continuous SIP calls being passed between them. I plan to let it continue to run this way over the weekend again. Second, the system health sensor numbers vary LESS with the CPU set back to the stock rate than they did when it was overclocked! When it was overclocked, as I mentioned in a prior post, I saw voltage and temp jumping all over the place. Now, with no overclocking, when the MR is under 100% load, the voltage is hanging out at around 11.4v (and will bounce back up to 12 when I shut off the test in the MR), and the temp is holding steady at around 47-48C, OCCASIONALLY jumping up to wild numbers (59-60C) before going back down again. But only very occasionally.
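
One rough way to quantify "jumping all over the place" versus "only very occasionally" is to count how many readings fall outside the idle band. The sketch below assumes an idle baseline of 48C with about 1C of normal movement, as reported earlier in the thread. The first list is the temperature sequence quoted earlier for the loaded 450G; the second list is invented purely to match the "47-48C, occasionally 59-60C" description and is not measured data.

```python
def fraction_outside(samples, baseline, tolerance):
    """Fraction of readings farther than `tolerance` from `baseline`."""
    off = [s for s in samples if abs(s - baseline) > tolerance]
    return len(off) / len(samples)


BASELINE_C = 48   # idle temperature reported earlier in the thread
TOLERANCE_C = 1   # idle readings moved by about 1C

overclocked = [52, 60, 54, 59, 50]   # sequence quoted in the earlier post
stock_clock = [47, 48, 47, 59, 48]   # hypothetical, shaped like the description

print(fraction_outside(overclocked, BASELINE_C, TOLERANCE_C))
print(fraction_outside(stock_clock, BASELINE_C, TOLERANCE_C))
```

With these inputs every overclocked reading is outside the idle band, while only one in five of the stock-clock readings is. As noted elsewhere in the thread, this cannot tell a real fluctuation apart from a sensor readout problem.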

I also still contend that my 450G is more likely to reboot if the MR is under load than if it is doing nothing.
At this point I don't have any more theories from an electronics and embedded engineering point of view, as I don't know how MR is implemented, There is some link missing between the powersupply and MR implementation, to make sense to this problem.
Somehow, even though they both supposedly load the CPU completely, running an MR on either PowerPC or MIPS is harder on the CPU than (e.g.) '/tool bandwidth-test 127.0.0.1'. So when overclocking on either hardware platform, the real limit of your particular CPU die is revealed when running an MR guest.

Now, specifically, on certain MIPSBE boards such as the 450G that still seem to reboot when using certain power supplies even when you are NOT overclocking, here is my current hypothesis: janisk has shown screenshots of 450Gs running MRs in their lab with uptimes measured in WEEKS after they replaced the failed capacitors on their boards. They state they are no longer shipping capacitors on 450Gs that prematurely fail, BUT we do not know for sure that the capacitors that they used to replace the failed ones on their lab boards are the exact same ones that they are now using on new manufacturing runs of the 450G. They may have used much better caps on their lab boards, or at least ones more suited to the task. My suspicion is that the reason why the same SoC on another board (433AH) runs stable at the factory clock rate with the same power supply that causes a 450G to reboot is that the design of the power regulation circuitry on the 450G possibly "cut some corners" (intentionally or unintentionally...not trying to pass judgment here) compared to the design of 433AH and other boards. Perhaps it is just the capacitors themselves: perhaps the ones that 450Gs are now shipping with won't fail and bulge or explode, but perhaps they are still not smoothing out the power flow to the CPU adequately. Fixing the problem could be as simple as using different caps in place of the ones that would routinely fail on older 450G boards. So I would be very interested to know exactly which capacitors they used as replacements on their 450G boards that they are using in the labs. I would get my hands on some (or on some with equivalent or better specs), and then try my hand at replacing the caps on one of my 450Gs with those.

-- Nathan

EDIT: It just occurred to me that there's something that I'm not sure anyone has tried yet: run MR on a 450G with a power supply that they routinely have reboots with, but UNDERCLOCK the CPU? Set it to, say, 400MHz instead of 680? Perhaps the reboots will magically stop? (Again, I realize you wouldn't want to run it this way in a production situation, unless your MR requirements were REALLY low and you didn't care if it ran underclocked or not. This is just a suggestion for a test.)
 
timberwolf
Member Candidate
Topic Author
Posts: 274
Joined: Mon Apr 25, 2011 12:08 pm
Location: Germany

Re: Has MikroTik given up on MetaROUTER?

Sun Apr 15, 2012 11:06 am

I would read my posts more carefully. What I said was that I suspected my PPC crashes on the 1100AH were due to me overclocking the CPU; I suspected this because the behavior of PPC when crashing/rebooting was different than what I was seeing on MIPS/450G. I have since put the 1100AH back down to the factory-set clock rate, fired up an 1100 running at its factory-set clock rate, I have them both running an MR that is communicating with the other one (50 simultaneous SIP calls between them!), and have not had a single lock-up for 2.5 days on the 1100AH, and the 1100 has never locked up ever. The 1100 is running at 100% CPU continuously, and the AH is near 100%.
Sorry I must have missed that point. Ok so MR seems to be stable on PPC, did you test this also with a ROS based MR?

A general observation:
You do all your tests with an OpenWRT instance running in MR, right? The tests I, barkas and some of the other contributors did, as far as I know, are based on ROS based MRs. I don't know if this has any influence.

Regarding powersupply, fluctuating sensordata and capacitors.
We don't know how these points correlate, the sensor fluctuations could be caused by real voltage fluctuations OR by some readout problems due to wrong timing.
Also there hasn't been any insight offered by MT regarding the GPIO IRQs or affected batches with possibly weak capacitors.
EDIT: It just occurred to me that there's something that I'm not sure anyone has tried yet: run MR on a 450G with a power supply that they routinely have reboots with, but UNDERCLOCK the CPU? Set it to, say, 400MHz instead of 680? Perhaps the reboots will magically stop? (Again, I realize you wouldn't want to run it this way in a production situation, unless your MR requirements were REALLY low and you didn't care if it ran underclocked or not. This is just a suggestion for a test.)
I think barkas did that some time ago, with no positive effect.

Taking into account how much data and manpower has been delivered to MT (again), and how little feedback we got (again), I am cutting down time and effort until something worth my time is provided by MT. But at this point, I guess this thread will "end" like the ones before, silently or with one of the sentences "Buy a PPC based RB", "Buy a RB433AH", "Buy a RB493G" or "We don't know how to fix it" which we all know too well.
This is the reason why I still support my suggestion to MT, to simply drop MR on MIPSBE.
And thinking of users, I would suggest, buy a Soekris 6501 or other well designed x86 and VT capable board, and try KVM based ROS instances there, that way hardware issues could be ruled out and a (hopefully better supported and stable) open source technology(KVM) is used.
 
NathanA
Forum Veteran
Posts: 801
Joined: Tue Aug 03, 2004 9:01 am

Re: Has MikroTik given up on MetaROUTER?

Sun Apr 15, 2012 10:53 pm

Ok so MR seems to be stable on PPC...
This is precisely my contention. My 1100 and 1100AH have now been operating for 3 days 8 hours straight under near 100% CPU load conditions passing about 3Mbit/s worth of continuous SIP calls between them, ever since I undid the overclocking on the AH (1333MHz -> 1066MHz). The 1100 (non-AH) never had any problems since day one as I never budged from the original 800MHz factory setting. Watchdog is enabled on both. Not a single hiccup to report.

Also, FYI, my 2 MIPSBE boards (450G and 433AH) have both just hit the 48 hour mark of uptime, passing about 1.5Mbit/s of continuous SIP and RTP traffic between them the whole time while maxing out the CPU (although you'd never know this by the way RouterOS reports CPU usage on MIPSBE, which as I theorized earlier is a bug, and one that I'd like to see fixed. :)). Watchdog enabled on both. Again, not a single hiccup ever since going back to factory clock rate on the CPU (800MHz -> 680MHz).
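
For a rough sense of scale, the traffic figures can be divided by the number of simultaneous calls reported for each test rig earlier in the thread (50 on the PPC pair, 20 on the MIPSBE pair). This is back-of-the-envelope arithmetic only: the totals are "about" figures, and whether they count one or both directions is not stated.

```python
def kbit_per_call(total_mbit_s: float, calls: int) -> float:
    """Average traffic per simultaneous call, in kbit/s."""
    return total_mbit_s * 1000.0 / calls


# (approximate total in Mbit/s, simultaneous calls) as reported in the thread
print(kbit_per_call(3.0, 50))   # PPC pair: RB1100 <-> RB1100AH
print(kbit_per_call(1.5, 20))   # MIPSBE pair: RB450G <-> RB433AH
```

That works out to roughly 60 and 75 kbit/s per call, so the two rigs appear to be carrying comparable per-call loads.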
...did you test this also with a ROS based MR? [...] The tests I, barkas and some of the other contributors did, as far as I know, are based on ROS based MRs.
I did not, and I would be genuinely surprised if it made a difference; after all, you'd think that "RouterOS-within-RouterOS" would be more well-tested and thus more stable than "foreign-OS-within-RouterOS". But I'll humor you: since I have now passed over 3 days of uptime on the PPC boxes and am satisfied that they are stable, I will change the test on them so that I am running an OpenWRT+Asterisk MR AND a RouterOS MR side-by-side. I will also configure it so that all communication to and from the OpenWRT guest has to go through the RouterOS guest. I'm sure this will cut down on the number of simultaneous SIP calls I can make before the CPU maxes out, but I will do it for the sake of science. :)

I will be happy to reproduce this same test on the MIPSBE boards, but I want to see them surpass 72 hours of uptime first under the current test suite, so I will wait until tomorrow afternoon at the earliest before I change their configuration to match the PPC board tests I propose above.
the sensor fluctuations could be caused by real voltage fluctuations OR by some readout problems due to wrong timing.
Of course, you're correct. We don't know. All I can do is look at the available data I've collected in my tests as well as past evidence supplied by tests that you and others (including MikroTik staff!) have done, and try to form a hypothesis that fits that data. And to expand on my post from earlier, what I see suggests that there may, in fact, be two separate -- although interrelated -- problems, and we are lumping them together because the symptoms are so similar. We *assume* that all crashes or reboots are the result of the same problem for everybody, and I'm not sure I buy this.

In short, I think these are the two problems:

1) An unknown power regulation circuitry design flaw on 450G that negatively affects the stability of the CPU under certain conditions when it's under load

2) An unknown number of 450Gs that shipped with capacitors that are out-of-spec to begin with and which are also subject to premature failure

I read through the other (5-page) thread that took place a few months ago, and what I came away with after reading that is that my experiences don't necessarily match up with the experiences of others from a few short months ago.

In my experience, in general, I only have crashes with MetaROUTER on RouterBOARDs that I have overclocked. With one exception: I have seemingly unexplainable reboots only on 450G boards, and only if I use a power supply that exceeds 12V. And it seems to be very much keyed to the voltage and not total power output (watts). It seems like the farther I get from 12V, the more likely I am to encounter a reboot.

So my only problem with MetaROUTER on any RouterBOARD at this point is *only* on the 450G, and I've found a workaround for it that fixes it *for me*.

Now, if you read the old thread, you will come away thinking that -- unlike me -- most people that participated in it had problems regardless of what power supply they were using. Some saw *markedly* better results with 12V (in fact, some were exactly like me, and said that their problems were completely fixed by the change in PSU), some were *helped* by 12V but still saw the occasional reboot, and some saw absolutely no difference at all between 12V, 18V, 24V...whatever. janisk was one who saw no difference...at first! He claimed several times that it wasn't a PSU issue and that the frequency of reboots had no correlation to the power supply he was using! It rebooted no matter what kind of power he fed his board! But then something changed. He said he was forced to replace the capacitors on his test RB450G board because the original ones went sour, and ever since then it has not crashed.

Now what do you make of that?

The best theory I can come up with is that there is obviously a hardware design problem of some kind on the 450G that causes it to struggle if it is fed a voltage higher than 12V. This is common for everybody. But several months ago when the problems with MetaROUTER on the 450G were coming to light, there was an additional problem that exacerbated the situation: the defective capacitors! They probably were not working "in-spec" to begin with, even before they became visibly "swollen". And in that state, who knows what kind of crap power the CPU was being fed regardless of what power supply you had hooked up to the 450G!

As a result, for some people (who either didn't have bad capacitors to begin with or whose capacitors had not yet started to fail), switching to 12V fixed all their problems on the 450G (because it's a common problem for everybody), but for others whose capacitors were already breaking down inside, it didn't matter what power supply they used. So their MetaROUTER problems were caused by a related but *separate* issue.
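
The two-problem hypothesis above can be written down as a tiny decision function. This only encodes the theory as stated in this post (a 12V threshold plus a separate capacitor fault); it is not a measured model, and the function name and return strings are illustrative.

```python
def predicted_450g_behaviour(supply_volts: float, caps_degraded: bool) -> str:
    """Encode the two-problem hypothesis for an RB450G running a loaded MR."""
    if caps_degraded:
        # Problem 2: failing capacitors, so the PSU choice makes no difference.
        return "reboots regardless of PSU"
    if supply_volts > 12.0:
        # Problem 1: suspected power regulation weakness above 12V.
        return "reboots likely"
    return "stable"


for volts, bad_caps in [(12, False), (24, False), (12, True), (24, True)]:
    print(volts, bad_caps, "->", predicted_450g_behaviour(volts, bad_caps))
```

Laid out like this, the mixed reports from the older thread line up with the hypothesis: people with healthy capacitors were fixed by a 12V supply, while people with degraded capacitors saw reboots on every supply.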

In conclusion, I think most MetaROUTER issues so far can be traced back to a hardware problem, and not a software problem. And this is probably why it has been so hard to find a "fix" for it. There isn't any one "fix". At least not one that you can perform in software.

I could be WAY off-base here. But so far, it's the only theory I can come up with that seems to fit the available evidence.
I think barkas did [underclock] some time ago, with no positive effect.
Yeah, I found that in the past thread, too. Thanks. I will run some tests of my own (go back to 24V PSU on my 450G, verify reboots are back, and then start stepping down the CPU clock to see if they become less frequent or not). If my hypothesis so far proves to be correct, there are two possibilities: 1) the board barkas was using had capacitors on it that were already "too far gone", or 2) the power regulation issue on the 450G affects the CPU regardless of clock rate.
Taking into account how much data and manpower has been delivered to MT(again), and how less feedback we got (again), I am cutting down time and effort until something worth my time is provided by MT.
That is certainly your prerogative, and I can understand it...
But at this point, I guess this thread will "end" like the ones before, silently or with one of the sentences "Buy a PPC based RB", "Buy a RB433AH", "Buy a RB493G" or "We don't know how to fix it" which we all know to well.
That might be that they "don't know how to fix it", but I don't believe it is from a lack of effort on their part. For the longest time, the focus was on the software, because everybody thought that it just simply HAD to be a software problem. Again, I'm not convinced. Their test boards are seemingly non-symptomatic ever since having their capacitors replaced (which is why I'm still interested in knowing exactly WHAT capacitors they used on their lab boards).

I think MT should take some new, unmodified 450G straight off of the assembly line and add them to their lab, though, and see if they have the 24V reboot issue. If they do, replace their capacitors with ones equivalent to what they used on the test boards that no longer show symptoms.
This is the cause, why I still support my suggestion to MT, to simply drop MR on MIPSBE.
I would be extremely sad if they did this. The 450G is such a nice board at such a nice price...plenty of RAM, flash storage, and a fairly good CPU. It would make a killer router + IP PBX in one for the SMB market. Plus, the 450G is the only board you can buy now that is guaranteed to have 512MB of flash on it...the new 1100AH Rev. B has only 64MB.

-- Nathan
 
timberwolf
Member Candidate
Topic Author
Posts: 274
Joined: Mon Apr 25, 2011 12:08 pm
Location: Germany

Re: Has MikroTik given up on MetaROUTER?

Mon Apr 16, 2012 12:04 am

That might be that they "don't know how to fix it", but I don't believe it is from a lack of effort on their part.
Given the past and recent posts and their "tone" from janisk I really can't be sure about that. We don't know what they actually did, that's all one can say.
I would be extremely sad if they did this. The 450G is such a nice board at such a nice price...plenty of RAM, flash storage, and a fairly good CPU. It would make a killer router + IP PBX in one for the SMB market. Plus, the 450G is the only board you can buy now that is guaranteed to have 512MB of flash on it...the new 1100AH Rev. B has only 64MB.
My words and my feelings, and by now I am sure that MT has missed a 1000+ units opportunity for RB450G and/or RB1100AHx2. I can't go into more detail, but barkas already gave some hints in his recent posts. Maybe sometime in the future MT will realize that not only beginners or non-professionals are using and/or relying on their forum. I will continue to use MT products in my spare time, but I definitely would have liked to use them in my job too. Well I guess I will have to live with ALu, RAD, ADVA and Cisco, and I am happy that I didn't personally recommend MT...
 
peson
Trainer
Posts: 183
Joined: Tue Jul 20, 2004 10:33 am
Location: Sweden

Re: Has MikroTik given up on MetaROUTER?

Mon Apr 16, 2012 1:51 am

Hmm...
PPC is stable?
450G is unstable?
This might be a big problem for all of us, some says this, some says that. Who are right?
MT staff? No they seem to struggle with the MR implementation.
We the testers and MT committed users? No, at least this is my opinion, we are testing this differently.
Due to my experience, 1100AH Rev A is not stable with ROS guests.

This is why I wrote my thoughts about MT taking some responsibility and setting up a test/task force for doing the testing and problem solving.
I think they have a great opportunity to use our experience and testing ability to track down the problems and correct them, whether they are software or hardware related.
As I said, work together, but with MT in the driver's seat.

/Paul
Reboot is the last resort, try to find out what's wrong instead.
 
NathanA
Forum Veteran
Posts: 801
Joined: Tue Aug 03, 2004 9:01 am

Re: Has MikroTik given up on MetaROUTER?

Mon Apr 16, 2012 2:50 am

Given the past and recent posts and their "tone" from janisk I really can't be sure about that.
"Tone" in written language is such a difficult thing either to "transmit" or to interpret, especially when the people who are communicating with each other are often *all* using a language that is not their native tongue. He might not have been trying to say it the way you think he is saying it. :)
PPC is stable?
...in my experience, yes...
450G is unstable?
...for me, only when overclocked and/or paired with a power supply > 12v.
This might be a big problem for all of us, some says this, some says that. Who are right?
Probably both. :( I don't have your particular 1100AH, and you don't have mine. Who knows: if it is a hardware issue, maybe there is a problem with your board that my board doesn't have?

And MikroTik's problem is that they cannot reproduce the issue on their side. How can they fix something that they cannot reproduce?

Perhaps someone who is having the problem should set up a lab that crashes regularly, and then ship all of that hardware back to MT, and see if it happens for them on the same hardware. But who would be willing to do that? How badly do people want the problem fixed?
We the testers and MT committed users? No, at least this is my opinion, we are testing this differently. Due to my experience, 1100AH Rev A is not stable with ROS guests.
If you can describe for me exactly how yours is set up (or, better yet, send me '/system backup' of both the hosts and the guests), I will try to reproduce your setup on my end. Although you said that you had an 1100AH that just reboots even though the host and the guest are both doing nothing, right? If so, perhaps the test I told timberwolf that I would try (running OpenWRT and RouterOS guests in MetaROUTER side-by-side) will be relevant to your problem (again, assuming the problem is in the software itself).

-- Nathan
 
janisk
MikroTik Support
Posts: 6283
Joined: Tue Feb 14, 2006 9:46 am
Location: Riga, Latvia

Re: Has MikroTik given up on MetaROUTER?

Mon Apr 16, 2012 10:29 am

replacement caps are Suscon 680microF(uF) 6.3V - ones that are used to produce them.

i will get one RB450G from warehouse and feed it with max power it can take and see if that will make it reboot with guest running.

also, in alignment with NathanA's observations - RB450Gs that crashed were running considerably hotter than ones that did not; that has been an indication that a router is about to start crashing.

Also, will check what happens to CPU power when MR is at load.

edit: btw, those are the same parts that are on RB450G i just received.
 
timberwolf
Member Candidate
Topic Author
Posts: 274
Joined: Mon Apr 25, 2011 12:08 pm
Location: Germany

Re: Has MikroTik given up on MetaROUTER?

Mon Apr 16, 2012 5:42 pm

janisk, if I knew more about the dc/dc converter design and topology MT implemented on the RB450G, then I could possibly help. Otherwise it is impossible to come to any conclusion on the relationship of input voltage and the capacitors, as those should sit at the output side of the dc/dc converter(s), judging by the max. voltage of 6.3V.
Also there seems to be more than one dc/dc converter on the board, looks like three of them. If you or the devs could name any testpoints, then I could check the supply rails for any excessive ripple or drops using my testing equipment. I think the top candidate is the small dc/dc converter next to the CPU/SoC at the corner of the PCB.
 
peson
Trainer
Posts: 183
Joined: Tue Jul 20, 2004 10:33 am
Location: Sweden

Re: Has MikroTik given up on MetaROUTER?

Mon Apr 16, 2012 9:32 pm

I don't have your particular 1100AH, and you don't have mine. Who knows: if it is a hardware issue, maybe there is a problem with your board that my board doesn't have?
I have two brand new boards acting the same.
If you can describe for me exactly how yours is set up (or, better yet, send me '/system backup' of both the hosts and the guests), I will try to reproduce your setup on my end.
We don't have IM in the forum, so how do I send it to you?
Although you said that you had an 1100AH that just reboots even though the host and the guest are both doing nothing, right? If so, perhaps the test I told timberwolf that I would try (running OpenWRT and RouterOS guests in MetaROUTER side-by-side) will be relevant to your problem (again, assuming the problem is in the software itself).
I have two 1100AH rev A side by side.
One of them acts as a router, the other is just a host for the guests.
Both of them reboot if I have MR ROS running on them and the watchdog is enabled.
The router reboots even if the MR only runs with virtual interfaces that are not connected anywhere, so from my point of view, the PPC is not stable.

/Paul
Reboot is the last resort, try to find out what's wrong instead.
 
NathanA
Forum Veteran
Posts: 801
Joined: Tue Aug 03, 2004 9:01 am

Re: Has MikroTik given up on MetaROUTER?

Mon Apr 16, 2012 10:07 pm

Well son of a gun...I made the changes to my config that I proposed yesterday (adding RouterOS MR alongside the OpenWRT MR...I have the RouterOS MR performing NAT for the OpenWRT MR), and it ran for 14 additional hours after that without incident. Then my RB1100 rebooted out of the blue! The RB1100AH is still chugging along and has 4d 8h uptime now.

I have restarted the test to see if it happens again, and if so, how long it takes (and whether it happens on both boards or just the 1100). I have found that with the extra RouterOS MR in place doing a bunch of unnecessary work :) (for science!), I can only do about half the number of simultaneous SIP calls between the 1100 and the 1100AH before the 1100 begins to peak out its CPU regularly (so 25 calls instead of 50).

If the 1100 continues to reboot randomly and the AH proves to still be solid, then part of me wonders if the watchdog didn't happen to kick in because the CPU was "too busy" on the 1100...so in order to make things "equal", I will try underclocking the AH, in which case both boards should have roughly the same performance and then both of their CPUs should load down after approximately the same amount of work. (peson, I understand that yours reboot even without load.)

peson, how long would you say it takes on average before one of your RB1100AH reboots?

I have also just completed implementing the same new test scenario (1 ROS MR + 1 OWRT MR) on my RB450G <-> RB433AH lab. I can do about 15 simultaneous SIP calls between them in this configuration before they both feel like they are starting to "bog down". They have both now been up for just over 3 days, so this should be interesting...

If I can show that my PPC boards are rebooting randomly while my MIPSBE boards are rock-solid, then I'm going to go out on a limb and say that the underlying cause for PPC trouble is unrelated to any underlying causes for problems people have experienced with MIPSBE.

-- Nathan
 
NathanA
Forum Veteran
Posts: 801
Joined: Tue Aug 03, 2004 9:01 am

Re: Has MikroTik given up on MetaROUTER?

Tue Apr 17, 2012 12:26 am

We don't have IM in the forum, so how do I send it to you?
Whoa, that's really weird...you used to be able to send private messages on this forum! When did that change? And why?

I have 2 spare RB1100 that I can reproduce your set-up with, and still leave my other 1100/1100AH "lab" doing what it's doing. You can send me your backups (change the passwords first before you make them :)) as attachments on an e-mail to me. nathana@fsr.com

Make sure to include both the backups of the hosts as well as the guests, and tell me which host each guest config should be loaded on. Also, I'm not sure if '/system backup' saves MetaROUTER guest config (RAM, disk space, interfaces, etc.) or not, so you might want to show me a '/metarouter export' for each host as well.

-- Nathan
 
peson
Trainer
Trainer
Posts: 183
Joined: Tue Jul 20, 2004 10:33 am
Location: Sweden

Re: Has MikroTik given up on MetaROUTER?

Tue Apr 17, 2012 12:46 am

how long would you say it takes on average before one of your RB1100AH reboots?
Anywhere between 1 and 24 hours.
Will send you an export compact from the host and guests.
/Paul
Reboot is the last resort, try to find out what's wrong instead.
 
User avatar
NathanA
Forum Veteran
Forum Veteran
Posts: 801
Joined: Tue Aug 03, 2004 9:01 am

Re: Has MikroTik given up on MetaROUTER?

Tue Apr 17, 2012 1:17 am

Anywhere between 1 and 24 hours.
So never longer than 24 hours? And you've definitely not seen anything close to 100 hours?
Will send you an export compact from the host and guests.
I will watch for them and let you know of my results.

-- Nathan
 
peson
Trainer
Trainer
Posts: 183
Joined: Tue Jul 20, 2004 10:33 am
Location: Sweden

Re: Has MikroTik given up on MetaROUTER?

Tue Apr 17, 2012 1:25 am

So never longer than 24 hours? And you've definitely not seen anything close to 100 hours?
No, not with watchdog enabled. Without watchdog: 8d5h and still running.
I will watch for them and let you know of my results.
Sent
I'm looking forward to hearing about your results.

/Paul
Reboot is the last resort, try to find out what's wrong instead.
 
User avatar
NathanA
Forum Veteran
Forum Veteran
Posts: 801
Joined: Tue Aug 03, 2004 9:01 am

Re: Has MikroTik given up on MetaROUTER?

Tue Apr 17, 2012 5:47 am

I still haven't had a chance to reproduce peson's setup (I hope to in a few minutes, here), but I thought I'd post an update on my current PPC test rig:

The RB1100 rebooted a second time, so about 7 hours after the last reboot. The RB1100AH is *still* chugging along without an issue. Current uptime measured at just shy of 4d 16h.

Remember that the RB1100 had an uptime that kept up with the AH's uptime all weekend long, up until I added a second MR guest (running RouterOS instead of OpenWRT). So now both routers have an OpenWRT guest and a RouterOS guest.

It is interesting that the 1100 is rebooting and the AH is not (so far). I do wonder if it has something to do with the fact that the workload I have given to both routers ends up pegging the 1100 but not the AH. I was going to underclock the AH so that the same workload pegged it, too, and see if it also started rebooting under those conditions, but before I do that, I want to let the AH run a bit longer as-is.

In the meantime, I have started up the SIP call test a third time, but this time I'm limiting it to 15 simultaneous calls, which should prevent the 1100 from ever seeing 100% utilization. It will be interesting to see if the 1100 continues to reboot.

In other news, my MIPSBE hosts (450G and 433AH) are running the exact same test (2 guests on each host: 1 ROS, 1 OWRT, 15 simultaneous SIP calls between them) and have been doing so for the past 8 hours. Total uptime for both devices is 3d 9h. No reboots yet.

-- Nathan
 
User avatar
NathanA
Forum Veteran
Forum Veteran
Posts: 801
Joined: Tue Aug 03, 2004 9:01 am

Re: Has MikroTik given up on MetaROUTER?

Tue Apr 17, 2012 4:14 pm

This is so...weird.

My RB1100 has crashed a third time, while I was asleep. But this time, watchdog did not kick in and reboot it. It was frozen solid. No response on serial terminal. I had to powercycle to bring it back. Not good. :(

My RB1100AH? Still has not missed a beat. Over 5 days 2 hours of uptime on it.

So I don't get it. The RB1100 should have been kicked by the hardware watchdog. But it wasn't. In fact, the non-overclocked RB1100 is acting suspiciously like the way the RB1100AH was acting when it was overclocked. Meanwhile, the AH is racking up uptime like nobody's business. It is enough to make one wonder whether there is a problem with this particular 1100. But only when running MetaROUTER?

I don't know what to think anymore.

Meanwhile, my MIPSBE boards are still being awesome. No reboots from them yet for 3 days 20 hours. The 450G is loving that 12v power supply, it would seem. Also, I have 2 other RB1100s running peson's configuration that he sent me. They are both about to hit 6 hours of uptime since I fired them up. I will continue to watch them closely.

One interesting difference between my RB1100 that is crashing and the other two RB1100s and my AH is that the RB1100 that is crashing always shows a very high CPU temperature (even if it starts up after being off for long enough to cool down): 60C. All of the others seem to settle at around 35C. Not sure it's relevant, but I thought I'd put it out there. I just figured that the sensor on the one showing an abnormally high temp. is miscalibrated. Also, it was showing a high temp. that entire weekend when it didn't crash and got to over 3 days of uptime. Again, the only difference is that it was only running 1 MetaROUTER (OpenWRT) before, and is now running 2 MetaROUTERs (OpenWRT and RouterOS).

-- Nathan
 
User avatar
liquidcz
Frequent Visitor
Frequent Visitor
Posts: 73
Joined: Tue Dec 28, 2010 1:24 pm

Re: Has MikroTik given up on MetaROUTER?

Tue Apr 17, 2012 9:36 pm

As I promised earlier in this thread, I will share my results.

I reached more than 2 days of uptime; then ROS 6 beta1 was released and, well, I had to try it. ;-) So now I have 2 days of uptime again, this time with ROS 6 beta1.

It seems more stable with a 12V power supply, from my point of view.
 
User avatar
liquidcz
Frequent Visitor
Frequent Visitor
Posts: 73
Joined: Tue Dec 28, 2010 1:24 pm

Re: Has MikroTik given up on MetaROUTER?

Wed Apr 18, 2012 8:22 am

Now I'm running 4 MetaROUTERs: 2x OpenWRT and 2x ROS.
 
User avatar
NathanA
Forum Veteran
Forum Veteran
Posts: 801
Joined: Tue Aug 03, 2004 9:01 am

Re: Has MikroTik given up on MetaROUTER?

Wed Apr 18, 2012 9:39 am

Update:

My 450G hasn't rebooted or locked up now for 4 days 13 hours, and I don't think it is going to anytime soon...not while it is running on that 12v power supply. (The 433AH is equally stable and has been up for just as long as the 450G, although it is running on the 24v power supply that gave the 450G fits!)

My 1100 has rebooted a fourth time. It seems to be rebooting every 7-8 hours almost like clockwork. I'm trying a couple of different things out, though, and will report on my success (or lack of) tomorrow.

My 1100AH has been up for 5 days 20 hours and shows no signs of quitting. It does not reboot even though it is configured identically to the 1100 that has been rebooting. (Of course, the AH is under less load since it is doing the same amount of work as the 1100 but has a beefier processor.)

The 2 1100s that I configured from the exports that peson sent to me have not crashed/rebooted/frozen, and in about 30 minutes they will have hit the 24 hour uptime mark. I am not convinced that they will exhibit any symptoms, but I will continue to watch them. (Once they hit 48 hours at around this same time tomorrow, I intend to give peson remote access to my test routers to have him confirm my results, and to look over their configuration in order to make sure that I didn't miss anything while setting them up.)

-- Nathan
 
peson
Trainer
Trainer
Posts: 183
Joined: Tue Jul 20, 2004 10:33 am
Location: Sweden

Re: Has MikroTik given up on MetaROUTER?

Wed Apr 18, 2012 9:56 am

The 2 1100s that I configured from the exports that peson sent to me have not crashed/rebooted/frozen, and in about 30 minutes they will have hit the 24 hour uptime mark. I am not convinced that they will exhibit any symptoms, but I will continue to watch them. (Once they hit 48 hours at around this same time tomorrow, I intend to give peson remote access to my test routers to have him confirm my results, and to look over their configuration in order to make sure that I didn't miss anything while setting them up.)
I noticed that the export disabled the watchdog, is it enabled in your config?
My 1100AH Rev. A (with watchdog disabled) gives this resource and health printout:

uptime: 1w2d14h16m6s
version: 5.12
free-memory: 1438284KiB
total-memory: 1555424KiB
cpu: e500v2
cpu-count: 1
cpu-frequency: 1066MHz
cpu-load: 24%
free-hdd-space: 459508KiB
total-hdd-space: 520192KiB
write-sect-since-reboot: 54821
write-sect-total: 60244
bad-blocks: 0%
architecture-name: powerpc
board-name: RB1100AH
platform: MikroTik

fan-mode: auto
use-fan: main
active-fan: main
voltage: 13.5V
temperature: 36C
cpu-temperature: 40C
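
For anyone comparing uptimes across boards, a printout like the one above can be turned into numbers with a few lines of Python. This is only a sketch: the helper names (parse_uptime, parse_printout) are made up for illustration and are not part of RouterOS or any MikroTik tool, and it assumes uptime strings only use the w/d/h/m/s suffixes seen in this thread.

```python
import re

# Seconds per unit for the suffixes seen in the uptime strings in this thread.
UNIT_SECONDS = {"w": 604800, "d": 86400, "h": 3600, "m": 60, "s": 1}


def parse_uptime(text):
    """Convert a string such as '1w2d14h16m6s' (or '8d 5h') into seconds."""
    compact = "".join(text.split())  # tolerate spaces, e.g. '4d 16h'
    parts = re.findall(r"(\d+)([wdhms])", compact)
    # Reject anything that is not made up entirely of <number><unit> pairs.
    if not parts or "".join(n + u for n, u in parts) != compact:
        raise ValueError(f"unrecognised uptime string: {text!r}")
    return sum(int(n) * UNIT_SECONDS[u] for n, u in parts)


def parse_printout(lines):
    """Turn 'key: value' lines into a dict, skipping lines without a colon."""
    result = {}
    for line in lines:
        if ":" in line:
            key, value = line.split(":", 1)
            result[key.strip()] = value.strip()
    return result


# A few lines copied from the printout above.
sample = [
    "uptime: 1w2d14h16m6s",
    "version: 5.12",
    "cpu-load: 24%",
    "cpu-temperature: 40C",
]
stats = parse_printout(sample)
uptime_hours = parse_uptime(stats["uptime"]) / 3600
```

With the sample above, uptime_hours comes out at roughly 230 hours, which matches the 1w2d14h shown in the printout.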

/Paul
Reboot is the last resort, try to find out what's wrong instead.
 
User avatar
NathanA
Forum Veteran
Forum Veteran
Posts: 801
Joined: Tue Aug 03, 2004 9:01 am

Re: Has MikroTik given up on MetaROUTER?

Wed Apr 18, 2012 10:25 am

I noticed that the export disabled the watchdog, is it enabled in your config?
Ugh, you're right! I did not catch that! Watchdog has been disabled this whole time, so the first 24 hours don't count. I have turned it back on on both.

For comparison, on these 2 RB1100s, the health stats are identical: 12.4v / 35C / 35C (power / temp / CPU). On my 1100 that is crashing in my other test, health stats are 13.3v (yes, nearly 1v higher) / 29C / 63C (!).

-- Nathan
 
User avatar
janisk
MikroTik Support
MikroTik Support
Posts: 6283
Joined: Tue Feb 14, 2006 9:46 am
Location: Riga, Latvia

Re: Has MikroTik given up on MetaROUTER?

Wed Apr 18, 2012 10:55 am

Running tests on an RB450G: with a 28V PSU, at both 800MHz and 680MHz, it crashes if MetaROUTER is enabled.

The test runs a TCP bandwidth test (BT) to itself at full speed. The MetaROUTER has static, dynamic and hardware ports assigned. Older boards are working without any problem.

Without MR, it ran without problems, even at 28V and an 800MHz CPU frequency.
 
User avatar
NathanA
Forum Veteran
Forum Veteran
Posts: 801
Joined: Tue Aug 03, 2004 9:01 am

Re: Has MikroTik given up on MetaROUTER?

Wed Apr 18, 2012 11:17 am

Unbelievable! After nearly 6 days of uptime, my 1100AH FINALLY CRASHED! And I mean *crashed*. It does not respond to any input on the serial port. It didn't reboot even though watchdog is ENABLED on it. I am about to powercycle it.

So it would seem as though reality might be exactly opposite what I originally claimed: PPC MetaROUTER is problematic, and MIPSBE is stable! (Well, stable on most boards, and there's a workaround on others, like the 450G: use a different power supply.) Of course, I will continue to let both my MIPSBE test and my other PPC test continue on. Hopefully I will learn something from both of them. :)

-- Nathan
 
User avatar
NathanA
Forum Veteran
Forum Veteran
Posts: 801
Joined: Tue Aug 03, 2004 9:01 am

Re: Has MikroTik given up on MetaROUTER?

Wed Apr 18, 2012 11:19 am

Running tests on an RB450G: with a 28V PSU, at both 800MHz and 680MHz, it crashes if MetaROUTER is enabled. [...] Older boards are working without any problem. Without MR, it ran without problems, even at 28V and an 800MHz CPU frequency.
Fascinating! Okay, so this means you have recreated our issue. Now, see if you can recreate our fix: keep running MetaROUTER, switch to a 12v power supply (of any amperage), and see if it stabilizes.

Thanks!

-- Nathan
 
timberwolf
Member Candidate
Member Candidate
Topic Author
Posts: 274
Joined: Mon Apr 25, 2011 12:08 pm
Location: Germany

Re: Has MikroTik given up on MetaROUTER?

Wed Apr 18, 2012 12:07 pm

Running tests on an RB450G: with a 28V PSU, at both 800MHz and 680MHz, it crashes if MetaROUTER is enabled.

The test runs a TCP bandwidth test (BT) to itself at full speed. The MetaROUTER has static, dynamic and hardware ports assigned. Older boards are working without any problem.

Without MR, it ran without problems, even at 28V and an 800MHz CPU frequency.
Thank you very much, this is exactly what we are seeing. So you finally have a setup which behaves identically to ours. :-)
I checked the parts list of the datacenter setup I mentioned earlier: the PSU I used is a 12V 3.5A model, which also powers an ALIX board.
So it runs on a 12V supply already, but recalling my initial tests with MR on this RB450G board, I got crashes every minute(!).
I am currently planning to upgrade this system to 5.14 this Friday using netinstall, because of some config problems. Maybe I can conduct some tests.
 
barkas
Member Candidate
Member Candidate
Posts: 260
Joined: Sun Sep 25, 2011 10:51 pm

Re: Has MikroTik given up on MetaROUTER?

Wed Apr 18, 2012 12:09 pm

I tested 6.0 beta 1 on 18V, it crashes with MR, too. I will switch it back to 12V when I'm home again.
 
User avatar
NathanA
Forum Veteran
Forum Veteran
Posts: 801
Joined: Tue Aug 03, 2004 9:01 am

Re: Has MikroTik given up on MetaROUTER?

Wed Apr 18, 2012 12:25 pm

...but recalling my initial tests with MR on this RB450G board, I got crashes every minute(!).
...how do the capacitors look on your board?
I tested 6.0 beta 1 on 18V, it crashes with MR, too. I will switch it back to 12V when I'm home again.
I am almost convinced this is a hardware problem at this point. Not fixable in software. Hope to be proven wrong, though. :)

-- Nathan
 
timberwolf
Member Candidate
Member Candidate
Topic Author
Posts: 274
Joined: Mon Apr 25, 2011 12:08 pm
Location: Germany

Re: Has MikroTik given up on MetaROUTER?

Wed Apr 18, 2012 12:30 pm

...but recalling my initial tests with MR on this RB450G board, I got crashes every minute(!).
...how do the capacitors look on your board?
I conducted the tests with the board fresh out of the bag; the caps all looked good, like one would expect for a brand new board. I can check them on Friday if I have time left.
 
User avatar
NathanA
Forum Veteran
Forum Veteran
Posts: 801
Joined: Tue Aug 03, 2004 9:01 am

Re: Has MikroTik given up on MetaROUTER?

Thu Apr 19, 2012 5:51 am

Update:

My RB1100 continues to reboot periodically. The AH hasn't had another complete lock-up episode yet since yesterday. (That one still baffles me...the AH was up for nearly 6 days without any problems, wasn't overclocked, and watchdog was on anyway.) The 2 RB1100 that I have peson's configuration running on have had watchdog enabled for the past 20 hours and have not had any problems, either.

As I mentioned before, I'm seeing some different '/system health' numbers between all of these devices. The AH and the 2 RB1100s running peson's configuration both have their CPU temps hover at around 35C. The 1100 that keeps crashing averages around 61C. The AH and the crashing 1100 show input voltages above 13v, while the 2 1100s that have not crashed yet both show input voltages at 12.4v.
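
The comparison above can be written down as a quick sanity check. The following Python sketch uses only the approximate CPU temperatures quoted in this post; the board labels, the flag_outliers helper and the 10-degree tolerance are assumptions chosen for illustration, not anything RouterOS reports or recommends.

```python
from statistics import median

# Approximate CPU temperatures (degrees C) as quoted in the post above.
# The labels are shorthand for this example only.
cpu_temps = {
    "RB1100AH": 35,
    "RB1100 (peson config, #1)": 35,
    "RB1100 (peson config, #2)": 35,
    "RB1100 (crashing)": 61,
}


def flag_outliers(readings, tolerance):
    """Return names whose reading is more than `tolerance` away from the median."""
    mid = median(readings.values())
    return sorted(
        name for name, value in readings.items() if abs(value - mid) > tolerance
    )


# The tolerance of 10 degrees is an arbitrary illustrative threshold.
suspects = flag_outliers(cpu_temps, tolerance=10)
```

Here only the crashing RB1100 ends up in suspects. That says the reading is unusual compared with the other boards; it does not say whether the sensor or the board is at fault.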

So I've decided that tonight, I'm going to be performing some minor surgery...I'm going to transplant the power supplies from the 2 1100s running peson's config to my 1100AH and the crashing 1100, and I'll put the power supplies from those routers into the 2 stable (so far) 1100s. Then restart all tests, and watch them.

My 450G is still rocking the 12v power supply and has been continuously running 15 active SIP channels now for 5 days 9 hours. If it crashes, I'll be shocked, but my AH crashed after 6 days, so I'm not going to assume anything and will continue to just let it run...

-- Nathan

EDIT: My 1100AH hard-crashed again. What the heck.

EDIT 2: Well, switching the power supplies made no difference. The units that show closer to 13v input show the same voltage level regardless of power supply, and the units that show closer to 12v input always show 12v regardless of power supply. So those numbers must be determined by the resolution of the sensor, which is apparently rather crude. I sure wish I could understand why only the AH is hard-crashing. I'm half-tempted to remove the heatsink, scrape off the stock heat pad, and apply some new thermal grease.
 
User avatar
liquidcz
Frequent Visitor
Frequent Visitor
Posts: 73
Joined: Tue Dec 28, 2010 1:24 pm

Re: Has MikroTik given up on MetaROUTER?

Fri Apr 20, 2012 11:15 am

I'm still running my test RB450G: Sunny 12V 2A power supply, 4 MetaROUTERs (2x ROS + 2x OpenWRT), with a connected console running the TOP command and an SSH connection from an external machine. I now have 4d 14h uptime.
 
User avatar
NathanA
Forum Veteran
Forum Veteran
Posts: 801
Joined: Tue Aug 03, 2004 9:01 am

Re: Has MikroTik given up on MetaROUTER?

Fri Apr 20, 2012 11:25 am

My 450G is at 6 days 15 hours now. Shows no signs of stopping. It will be most interesting to hear janisk's results as well, and any findings MikroTik can come up with to explain why the power supply seems to make a difference on that board!

My PPC test results are troubling to me, though. Different boards seem to act differently.

-- Nathan
 
User avatar
janisk
MikroTik Support
MikroTik Support
Posts: 6283
Joined: Tue Feb 14, 2006 9:46 am
Location: Riga, Latvia

Re: Has MikroTik given up on MetaROUTER?

Fri Apr 20, 2012 11:37 am

The router has been given to the lead developer of MetaROUTER.
 
peson
Trainer
Trainer
Posts: 183
Joined: Tue Jul 20, 2004 10:33 am
Location: Sweden

Re: Has MikroTik given up on MetaROUTER?

Fri Apr 20, 2012 12:48 pm

I'm still running my test RB450G: Sunny 12V 2A power supply, 4 MetaROUTERs (2x ROS + 2x OpenWRT), with a connected console running the TOP command and an SSH connection from an external machine. I now have 4d 14h uptime.
Is this with or without watchdog enabled?
My 1100AH Rev. A keeps running with the watchdog disabled.

Janis,
is the watchdog ROS software based, BIOS firmware/hardware based, or both?
I keep banging my head against the question: why does it not reboot when the MRs are disabled?
The problem might be both software and hardware based.

/Paul
Reboot is the last resort, try to find out what's wrong instead.
 
User avatar
janisk
MikroTik Support
MikroTik Support
Posts: 6283
Joined: Tue Feb 14, 2006 9:46 am
Location: Riga, Latvia

Re: Has MikroTik given up on MetaROUTER?

Fri Apr 20, 2012 1:02 pm

hardware watchdog on all recent (as in several years) RouterBOARD products
 
peson
Trainer
Trainer
Posts: 183
Joined: Tue Jul 20, 2004 10:33 am
Location: Sweden

Re: Has MikroTik given up on MetaROUTER?

Fri Apr 20, 2012 1:46 pm

hardware watchdog on all recent (as in several years) RouterBOARD products
Ok, so it's both soft- and hardware based?
From the Wiki:
This menu allows to configure system to reboot on kernel panic, when an IP address does not respond, or in case the system has locked up. Software watchdog timer is used to provide the last option, so in very rare cases (caused by hardware malfunction) it can lock up by itself. There is a hardware watchdog device available in all RouterBOARD PowerPC and Mipsbe models, which can reboot the system in any case.
Does /sys watchdog set watchdog-timer=(no/yes) control both the software WD and the hardware WD?
Could this be changed, so that we can control both cases, IP not responding and system hang (kernel panic)?

/Paul
Reboot is the last resort, try to find out what's wrong instead.
 
User avatar
janisk
MikroTik Support
MikroTik Support
Posts: 6283
Joined: Tue Feb 14, 2006 9:46 am
Location: Riga, Latvia

Re: Has MikroTik given up on MetaROUTER?

Fri Apr 20, 2012 2:02 pm

MikroTik routers use a hardware watchdog, as stated in your snippet; some mipsle boards have that too.

If you are interested in how that works, here it goes:

There is a software part of the watchdog that refreshes the hardware watchdog timer, to delay the event that occurs if the timer runs out. If nobody is refreshing the timer, it runs out and the watchdog reboots the router. It has to be refreshed frequently.

So you have some software that has to run to refresh the timer; if the OS has crashed or locked up in some other way, the hardware watchdog will reboot the router.

The difference is: a hardware watchdog will always reboot the device, while a software watchdog can lock up. That is why all recent product series use a hardware watchdog.
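
The refresh-or-reboot behaviour described above can be modelled in a few lines of Python. This is a toy simulation under stated assumptions, not how RouterOS or the RouterBOARD hardware is actually implemented: the HardwareWatchdog class, the simulate function and the tick-based timing are all invented for illustration.

```python
class HardwareWatchdog:
    """Toy model of a countdown timer that resets the board when it expires."""

    def __init__(self, timeout_ticks):
        self.timeout_ticks = timeout_ticks
        self.remaining = timeout_ticks
        self.reboots = 0

    def kick(self):
        """Called by the software side to refresh the countdown."""
        self.remaining = self.timeout_ticks

    def tick(self):
        """Advance time by one tick; return True if the watchdog fired."""
        self.remaining -= 1
        if self.remaining <= 0:
            self.reboots += 1
            self.remaining = self.timeout_ticks  # the board restarts
            return True
        return False


def simulate(total_ticks, hang_at, timeout_ticks=5):
    """Run for total_ticks; the OS stops refreshing the timer at hang_at.

    hang_at=None means the OS never hangs. Returns the tick at which the
    watchdog fired, or None if it never fired.
    """
    dog = HardwareWatchdog(timeout_ticks)
    for now in range(total_ticks):
        os_alive = hang_at is None or now < hang_at
        if os_alive:
            dog.kick()  # healthy software keeps refreshing the timer
        if dog.tick():
            return now
    return None


healthy = simulate(20, hang_at=None)  # timer is always refreshed in time
hung = simulate(20, hang_at=8)        # refreshes stop at tick 8
```

In this model a healthy system never triggers a reboot, while a system that stops refreshing is reset a few ticks later. A board that freezes and stays frozen with the watchdog enabled, as reported in this thread, is exactly the case this model says should not happen.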
 
timberwolf
Member Candidate
Member Candidate
Topic Author
Posts: 274
Joined: Mon Apr 25, 2011 12:08 pm
Location: Germany

Re: Has MikroTik given up on MetaROUTER?

Sat Apr 21, 2012 3:07 pm

the difference is - hardware watchdog will reboot device always, while software watchdog can lock up. That is why all recent product series use hardware watchdog.
If I understand this correctly, then Nathan's RB1100AH shouldn't have locked up completely, as he had the watchdog enabled, right?
 
Wazza
newbie
Posts: 39
Joined: Thu Oct 13, 2011 10:43 am

Re: Has MikroTik given up on MetaROUTER?

Sat Apr 21, 2012 4:06 pm

All,

I'd like to add a comment here...

I've purchased several RB1100AH units, with the express purpose of using them for MetaROUTERs. None of them is to be overtaxed: just 4-8 MRs per unit for customer-managed solutions...

The MRs are to use 1 dynamic interface (bridge on the host) and 1 static interface (VLAN on the host), and that's basically it. No routing protocols, just static routes. All have NAT (masquerade) set up, and a few firewall rules (filter). That's it. Nothing fancy... Ideally I'd like to use the MRs to provide SSTP / PPTP VPNs for end users, with no more than 5 concurrent user connections, but at this point that doesn't seem feasible.

I have 3 of these set up (and a 4th in our lab), all at this point with a single MR on them.

They come up, seem to work fine, and then without warning, somewhere between 3 and 5 days after boot, they (the RB1100AHs) reboot. No warning, nothing. In many cases the MR, which has obviously been restarted, doesn't "respond" quite right, and needs to be manually disabled and rebooted several times before it finally comes back.

All are running 5.14.

At this point, I'm starting to seriously regret my choice of this as a solution. In theory this looks good, but clearly it's just not stable, and I'm not sure I want to put my business / customers through the reliability headaches that we've opened ourselves up to.

I love the Mikrotik products, but this is another one of those things that just seems to have slipped through the cracks...

I look at the newly announced CloudCore router product with 36 cores and think that, while I don't have the requirement to route 10+Mpps, such a product would be great for MetaROUTERs; but with the current experience, unless I see some documented fixes, I'm not going to risk it.

Just my 2c worth!

Warren
 
User avatar
normis
MikroTik Support
MikroTik Support
Posts: 24392
Joined: Fri May 28, 2004 11:04 am
Location: Riga, Latvia

Re: Has MikroTik given up on MetaROUTER?

Mon Apr 23, 2012 10:15 am

Wazza, you need to contact support. Send us your image, send us problem description, and steps how to reproduce problems you are facing. If we can repeat it, we can fix it.
No answer to your question? How to write posts
 
peson
Trainer
Trainer
Posts: 183
Joined: Tue Jul 20, 2004 10:33 am
Location: Sweden

Re: Has MikroTik given up on MetaROUTER?

Mon Apr 23, 2012 11:33 am

Wazza, you need to contact support. Send us your image, send us problem description, and steps how to reproduce problems you are facing. If we can repeat it, we can fix it.
Normunds!

This is the problem: we all contribute facts and send you supout files.
I've done this myself and haven't heard anything back on my ticket.
Please read my post in this thread about collaborating, with MT staff in the driver's seat.
Take the opportunity to have us support you with the MR problem; it's better to do this in the forum than to communicate separately with everyone who has the same issues. At least this is my opinion.
If you are using a physical ROS router/firewall for the MikroTik office in Riga, replace the hardware with an RB1100AH and configure an MR to do the same job the router/firewall is doing today.
This will probably reproduce the problem we are facing.

I know that disabling the watchdog helps, but that is not a solution.
Tweaking configurations in special ways for MR other than normal working configuration is not a solution either.

/Paul
Reboot is the last resort, try to find out what's wrong instead.
 
User avatar
normis
MikroTik Support
MikroTik Support
Posts: 24392
Joined: Fri May 28, 2004 11:04 am
Location: Riga, Latvia

Re: Has MikroTik given up on MetaROUTER?

Mon Apr 23, 2012 11:36 am

if you didn't get an answer, paste your ticket number and I will check why. Maybe some experiment is being done, and the responsible person is waiting for result, before replying to you.
No answer to your question? How to write posts
 
peson
Trainer
Trainer
Posts: 183
Joined: Tue Jul 20, 2004 10:33 am
Location: Sweden

Re: Has MikroTik given up on MetaROUTER?

Mon Apr 23, 2012 12:21 pm

if you didn't get an answer, paste your ticket number and I will check why. Maybe some experiment is being done, and the responsible person is waiting for result, before replying to you.
My ticket:
Ticket#2012012666000134

What about my suggestion to collaborate in an organized way?

/Paul
Reboot is the last resort, try to find out what's wrong instead.
 
User avatar
normis
MikroTik Support
MikroTik Support
Posts: 24392
Joined: Fri May 28, 2004 11:04 am
Location: Riga, Latvia

Re: Has MikroTik given up on MetaROUTER?

Mon Apr 23, 2012 12:35 pm

if you didn't get an answer, paste your ticket number and I will check why. Maybe some experiment is being done, and the responsible person is waiting for result, before replying to you.
My ticket:
Ticket#2012012666000134

What about my suggestion to collaborate in an organized way?

/Paul
Latest reply was sent to you on 02/09/2012 14:35:28 and you have not responded to that email, so ticket is closed.
No answer to your question? How to write posts
 
peson
Trainer
Trainer
Posts: 183
Joined: Tue Jul 20, 2004 10:33 am
Location: Sweden

Re: Has MikroTik given up on MetaROUTER?

Mon Apr 23, 2012 1:20 pm

if you didn't get an answer, paste your ticket number and I will check why. Maybe some experiment is being done, and the responsible person is waiting for result, before replying to you.
My ticket:
Ticket#2012012666000134

What about my suggestion to collaborate in an organized way?

/Paul
Latest reply was sent to you on 02/09/2012 14:35:28 and you have not responded to that email, so ticket is closed.
I will email you about this.
Reboot is the last resort, try to find out what's wrong instead.
 
User avatar
NathanA
Forum Veteran
Forum Veteran
Posts: 801
Joined: Tue Aug 03, 2004 9:01 am

Re: Has MikroTik given up on MetaROUTER?

Mon Apr 23, 2012 11:11 pm

If I understand this correctly, then Nathan's RB1100AH shouldn't have locked up completely, as he had the watchdog enabled, right?
This is what I'm concerned about, but even moreso, that it is only happening (so far) on the AH. Truly makes me wonder if there is something wrong with this specific AH. Has anybody ever tried replacing the stock heat pad on the CPU with something like Arctic Silver 5? Is it worth the hassle?

I haven't had any of my RB1100s or the AH powered on and operating for a few days now as I haven't had time to really give the "lab" the attention it needs. I hope to take some time to experiment some more this week.

In the meantime, my 450G has hit 10 days of uptime on the 12v power supply while under constant CPU load. The 433AH that it has been exchanging data with and is identically configured has, of course, also not had a single problem and has been up just as long. I'm calling these both stable. I am still very eager to hear what the MetaROUTER developer(s) find with the power supply issue on the 450G.

-- Nathan
 
timberwolf
Member Candidate
Member Candidate
Topic Author
Posts: 274
Joined: Mon Apr 25, 2011 12:08 pm
Location: Germany

Re: Has MikroTik given up on MetaROUTER?

Sat Apr 28, 2012 4:24 pm

sergejs janisk
Any news to report from the MT developers?
 
timberwolf
Member Candidate
Member Candidate
Topic Author
Posts: 274
Joined: Mon Apr 25, 2011 12:08 pm
Location: Germany

Re: Has MikroTik given up on MetaROUTER?

Wed May 02, 2012 4:04 pm

This is what I'm concerned about, but even moreso, that it is only happening (so far) on the AH. Truly makes me wonder if there is something wrong with this specific AH. Has anybody ever tried replacing the stock heat pad on the CPU with something like Arctic Silver 5? Is it worth the hassle?
I don't know, but I don't think this CPU is that critical with regard to thermal load.
Also this shouldn't affect the watchdog timer, otherwise I would call the design a failure.
In the meantime, my 450G has hit 10 days of uptime on the 12v power supply while under constant CPU load. The 433AH that it has been exchanging data with and is identically configured has, of course, also not had a single problem and has been up just as long. I'm calling these both stable. I am still very eager to hear what the MetaROUTER developer(s) find with the power supply issue on the 450G.
It is suspiciously silent on the side of MT...
What would it mean for MT if they had a design error in all recent RB450G and possibly some other boards?
 
User avatar
NathanA
Forum Veteran
Forum Veteran
Posts: 801
Joined: Tue Aug 03, 2004 9:01 am

Re: Has MikroTik given up on MetaROUTER?

Thu May 03, 2012 8:26 am

Also this shouldn't affect the watchdog timer, otherwise I would call the design a failure.
I guess it would depend on how hardware watchdog is implemented. Does it just ground out a reset pin on the CPU, or does it briefly cut power to it and other parts of the board as a whole? If the former, what if the CPU is already in a sorry state physically (overheating or whatnot)? Perhaps simply "instructing" the SoC to restart might not be enough. (Note that I am not an EE, and I don't know how a hardware watchdog like this might typically be implemented.)
It is suspiciously silent on the side of MT...What would it mean for MT if they had a design error in all recent RB450G and possibly some other boards?
To be fair, the new evidence on the 450G side of things only came to light recently, so I for one am willing to give them some more time on this. If it really is a design flaw on the board, the MetaROUTER devs (who are surely working more on the software-side of things) are probably going to need to put their heads together with the hardware folks to figure this one out.

I've not had the time recently to poke at PPC stuff again (I'll get to it...really!), but I've given the 450G some actual light duty: for a week now, it's been terminating my personal calls, and hasn't skipped a beat.

-- Nathan
 
User avatar
janisk
MikroTik Support
MikroTik Support
Posts: 6283
Joined: Tue Feb 14, 2006 9:46 am
Location: Riga, Latvia

Re: Has MikroTik given up on MetaROUTER?

Thu May 03, 2012 10:43 am

Since the RB433AH and RB450G have the same CPU, and one is reported to crash while the other is not, we compared the similarities and differences in the electrical circuits around the CPU and made changes to an RB450G to mimic the RB433AH. No luck; we have to look elsewhere. One idea down, more on the list.
 
timberwolf
Member Candidate
Member Candidate
Topic Author
Posts: 274
Joined: Mon Apr 25, 2011 12:08 pm
Location: Germany

Re: Has MikroTik given up on MetaROUTER?

Thu May 03, 2012 8:15 pm

janisk
Thank you for the update. Please keep us posted.
since RB433AH and RB450G has the same CPU and one is supposed to crash and other is not
All: can you confirm this? I can't recall exactly, but I think there were negative reports for the RB433AH too.
 
User avatar
NathanA
Forum Veteran
Forum Veteran
Posts: 801
Joined: Tue Aug 03, 2004 9:01 am

Re: Has MikroTik given up on MetaROUTER?

Fri May 04, 2012 12:02 am

All: can you confirm this? I can't recall exactly, but I think there were negative reports for the RB433AH too.
I have never had my AH crash on me, and I have had it running in parallel with my 450G during all of these tests. It's even running off the same 24v power supply that gives more than one of my 450Gs fits. Perhaps previous reports were prior to recent firmwares/OS?

-- Nathan
 
peson
Trainer
Trainer
Posts: 183
Joined: Tue Jul 20, 2004 10:33 am
Location: Sweden

Re: Has MikroTik given up on MetaROUTER?

Fri May 04, 2012 6:51 pm

All: can you confirm this? I can't recall exactly, but I think there were negative reports for the RB433AH too.
I have never had my AH crash on me, and I have had it running in parallel with my 450G during all of these tests. It's even running off the same 24v power supply that gives more than one of my 450Gs fits. Perhaps previous reports were prior to recent firmwares/OS?

-- Nathan
I've written about my 493AH before: it runs 4 ROS guests with MPLS on 5.9, and the uptime as of today is 21 days.
There is no traffic except the internal communication between the guests.
It's running off an 18V PSU over PoE.

/Paul
Reboot is the last resort, try to find out what's wrong instead.
 
telepro
Frequent Visitor
Frequent Visitor
Posts: 63
Joined: Sun Apr 03, 2011 7:50 pm

Re: Has MikroTik given up on MetaROUTER?

Sun May 13, 2012 12:25 am

I have 4 production systems based on the 433AH which as of today have respectively 10, 11, 25, and 53 days of uninterrupted operation. I know I have had one of these systems up for more than 120 days before it was rebooted (for some other reason). This system has ROS 5.6 and a single additional metarouter environment operating, with a non-ROS, OpenWRT system running our application in it. I do not remember an unexplained watchdog restart (or detected freeze) of this configuration. This particular environment seems quite stable for us.

FYI: Porting this same configuration to the 751G has not been as stable, with watchdog restarts occurring at intervals ranging between 1 and 7 days. Moving to later ROS releases (through 5.14) has not resulted in significantly better stability. Continuing to track down the initiating event in restarts in this environment....

Have there been any additional results from the Mikrotik internal testing mentioned earlier in this thread (mid-April, ...)?
 
timberwolf
Member Candidate
Member Candidate
Topic Author
Posts: 274
Joined: Mon Apr 25, 2011 12:08 pm
Location: Germany

Re: Has MikroTik given up on MetaROUTER?

Sun May 13, 2012 10:05 am

This system has ROS 5.6 and a single additional metarouter environment operating, with a non-ROS, OpenWRT system running our application in it. I do not remember an unexplained watchdog restart (or detected freeze) of this configuration. This particular environment seems quite stable for us.
It has been shown that ROS metarouters negatively affect stability even when OpenWRT metarouters run fine.
Have there been any additional results from the Mikrotik internal testing mentioned earlier in this thread (mid-April, ...)?
Unfortunately no.
 
NathanA
Forum Veteran
Forum Veteran
Posts: 801
Joined: Tue Aug 03, 2004 9:01 am

Re: Has MikroTik given up on MetaROUTER?

Sun May 13, 2012 2:21 pm

It has been shown that ROS metarouters negatively affect stability even when OpenWRT metarouters run fine.
In my case, before swapping out the power supply, my 450G was just as likely to lock up while running an OWRT MR as it was running an ROS MR. After swapping the power supply, it ran for 2 weeks without incident while running 1 OWRT guest AND 1 ROS guest simultaneously, and the OWRT guest was forced to send all traffic through the ROS guest (bridged the single vif from the OWRT guest to one of the ROS guest's 2 vifs: one faced OWRT, the other faced the host)! And it was busy sending traffic: there were (on average) 15 active bi-directional RTP streams flowing between my 450G and my 433AH that was configured identically (1 OWRT guest running Asterisk + 1 ROS guest with all Asterisk traffic passing through it for days on end). Oh, and the 433AH was being powered that entire time with the 24v supply that the 450G would demonstrably crash on while running any MR guest.

So I (and others) have no problems with 450G after swapping power supplies, and so far there hasn't been one bad word said about any of the 4xxAH boards in this thread recently, either. Not saying that my experience is the only one that counts...I'm just telling you how it's looking from my perspective.

Now, on the PPC side of things, the jury's still out, but it did seem like adding an ROS guest in the mix reduced stability. However, I'm not yet convinced that it wouldn't have eventually crashed while just running the OWRT guest. Either way, there is a problem on PPC. I really need to find some more time to do additional stress-tests...

-- Nathan
 
timberwolf
Member Candidate
Member Candidate
Topic Author
Posts: 274
Joined: Mon Apr 25, 2011 12:08 pm
Location: Germany

Re: Has MikroTik given up on MetaROUTER?

Thu May 17, 2012 9:06 pm

I had to rebuild my system located at the datacenter, so I just set up the RB450G located there with a very minimal configuration.
The power supply at this location outputs 12V to the RB450G and an Alix 2c3; the RB450G reports 13V.

At this moment there is only a single MR running with one dynamic vif. I configured a static IP on both ends and let the guest ping the host. We will see how stable this setup tends to be. If it appears to be stable, I will further enhance the configuration of the host and guest as time allows.
 
barkas
Member Candidate
Member Candidate
Posts: 260
Joined: Sun Sep 25, 2011 10:51 pm

Re: Has MikroTik given up on MetaROUTER?

Sat May 19, 2012 11:14 pm

I still have random crashes with the 12V power supply. Not as often as before that, but still every 2 days on average.
 
timberwolf
Member Candidate
Member Candidate
Topic Author
Posts: 274
Joined: Mon Apr 25, 2011 12:08 pm
Location: Germany

Re: Has MikroTik given up on MetaROUTER?

Wed May 23, 2012 7:38 pm

Ok, I can confirm the behaviour barkas is seeing, still reboots. :-(
And that even with nothing more than pings running from the MR to the outside.

So maybe barkas and I got boards which are more on the weak side than the one NathanA has. That would be 2 boards for me and one for barkas that won't be stable even when powered with 12V or 13V.
 
NathanA
Forum Veteran
Forum Veteran
Posts: 801
Joined: Tue Aug 03, 2004 9:01 am

Re: Has MikroTik given up on MetaROUTER?

Fri May 25, 2012 12:14 pm

So maybe barkas and I got boards which are more on the weak side than that from NathanA.
I will pull a couple more RB450Gs out of stock and replace the one I'm currently using with a different one, and see if stability varies from board to board.

-- Nathan
 
timberwolf
Member Candidate
Member Candidate
Topic Author
Posts: 274
Joined: Mon Apr 25, 2011 12:08 pm
Location: Germany

Re: Has MikroTik given up on MetaROUTER?

Fri May 25, 2012 4:20 pm

I will pull a couple more RB450Gs out of stock and replace the one I'm currently using with a different one, and see if stability varies from board to board.
Thanks, that was the last idea I had. :(
Without any further word from MT, I by now doubt that I will ever run an MR on one of my RB450G boards, which is a shame because MR is the only thing which justifies buying an RB450G over a cheaper RB750GL in my eyes.
 
NathanA
Forum Veteran
Forum Veteran
Posts: 801
Joined: Tue Aug 03, 2004 9:01 am

Re: Has MikroTik given up on MetaROUTER?

Sat May 26, 2012 9:45 am

Well...crap.

timberwolf, you may be on to something. I pulled out a second brand-new 450G from stock. It came in the same batch as the one I have that has been 100% stable on 12v power, and has a serial # and MAC address that is very, VERY close to the good board, so they were manufactured very close together, possibly as part of the same batch. Every single component on both boards is identical: NAND, RAM, capacitors, etc.

But this second board is absolutely not stable with MetaROUTER. In fact, it is the least stable board I have run across so far. But I managed to gather some interesting data from it! (Warning: this post will probably end up being pretty lengthy.)

I configured it *identically* to the other board, both on the host as well as in the MR guest. On both 12v and 24v power, this board runs fine right up until I enable the MR. After I do that, the watchdog reboots it a few seconds later. After that it gets caught in a vicious reboot cycle: host boots up, starts up MR guest, which starts booting, and then the host does the usual MR "freeze up" thing and then watchdog kicks the host. I had a ping running to the 450G on my Windows laptop during this: sometimes it would boot up and respond to 3 pings before locking up, and sometimes it would respond to 30. But it never lasted longer than that! This is one board that never lasts "maybe a couple hours" before the problem happens: it happens almost IMMEDIATELY.

If I am quick on the draw, I can log in and disable the MR before watchdog reboots it.

I started doing some more experiments, though: first I started off with 12v power, using the very same power supply that the "good" board works just fine with. Then I tried switching to 24v power and it reacted the same way.

After that, I tried 24v power, but decided to also try underclocking the CPU. I set it to 400MHz and rebooted. Underclocking actually helped. It lasted longer before rebooting, but still would reboot within 15 minutes (more often than not it would lock up and reboot between 2-5 minutes).

If you will recall, I have made the observation that there are actually 2 types of lock-ups: ones that affect the whole device (host and all guests), and some that only affect the guest/MR. While on 24v power, I experienced the latter form once, and I noticed some interesting things. First, though, I should establish a normal working baseline for comparison: when there is either no MR running or when the MR is running fine, the 'system health' stats are usually pretty accurate, and the host is responsive to network requests directed at it (e.g., ping). It was 'normal' to see 'system health' show input voltage around 23.4v, and temperature around 49-50C, but you may recall that others have noticed that when the MR is "under load", the 'system health' stats have a tendency to swing around wildly. Also, when pinging the 450G host from the Windows laptop, I was seeing <1ms response times consistently on every ping response.

When the guest MR "locked up" and became completely unresponsive (both via network and on the console), but the host continued to respond, I noticed these extremely odd things:

- CPU load was in the single-digits (1-3%), so no load.
- 'system health' was stuck showing 4.6v input, and temperature of 27C (just moments before, it was showing 23.3v/49C)
- pings to the host showed a very odd pattern! It looked like this:
Reply from 192.168.0.2: bytes=32 time=10ms TTL=64
Reply from 192.168.0.2: bytes=32 time=9ms TTL=64
Reply from 192.168.0.2: bytes=32 time=8ms TTL=64
Reply from 192.168.0.2: bytes=32 time=7ms TTL=64
Reply from 192.168.0.2: bytes=32 time=6ms TTL=64
Reply from 192.168.0.2: bytes=32 time=5ms TTL=64
Reply from 192.168.0.2: bytes=32 time=4ms TTL=64
Reply from 192.168.0.2: bytes=32 time=3ms TTL=64
Reply from 192.168.0.2: bytes=32 time=2ms TTL=64
Reply from 192.168.0.2: bytes=32 time=1ms TTL=64
Reply from 192.168.0.2: bytes=32 time=10ms TTL=64
Reply from 192.168.0.2: bytes=32 time=9ms TTL=64
Reply from 192.168.0.2: bytes=32 time=8ms TTL=64
Reply from 192.168.0.2: bytes=32 time=7ms TTL=64
Reply from 192.168.0.2: bytes=32 time=6ms TTL=64
Reply from 192.168.0.2: bytes=32 time=5ms TTL=64
Reply from 192.168.0.2: bytes=32 time=4ms TTL=64
Reply from 192.168.0.2: bytes=32 time=3ms TTL=64
Reply from 192.168.0.2: bytes=32 time=2ms TTL=64
Reply from 192.168.0.2: bytes=32 time=1ms TTL=64
...do you see it? Latency was jumping to 10ms, and then going down by exactly 1ms every second until it reached 1ms and then would jump BACK to 10ms again, and re-start the descent. Very odd. I should mention that there was no network traffic load on the 450G other than my pings, and that my computer and the 450G were both plugged into the same switch.
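
For anyone who wants to check their own ping logs for the same pattern, here is a minimal Python sketch. It assumes Windows-style reply lines like the ones above; the function names and the `min_run` threshold are made up for illustration and are not part of any MikroTik tool.

```python
import re

def parse_rtts(ping_output: str) -> list[int]:
    """Extract round-trip times (ms) from Windows-style ping reply lines.

    Lines reported as "time<1ms" are counted as 1 ms.
    """
    return [int(ms) for ms in re.findall(r"time[=<](\d+)ms", ping_output)]

def sawtooth_resets(rtts: list[int], min_run: int = 3) -> list[int]:
    """Return the indices where latency jumps back up after a run of
    at least `min_run` consecutive replies that each fell by exactly 1 ms."""
    resets = []
    run = 0  # consecutive -1 ms steps seen so far
    for i in range(1, len(rtts)):
        step = rtts[i] - rtts[i - 1]
        if step == -1:
            run += 1
        else:
            if step > 0 and run >= min_run:
                resets.append(i)
            run = 0
    return resets
```

Feeding it the saved output of a long-running ping and getting back regularly spaced indices would suggest the same cyclical descent described here; an empty list means the pattern was not seen in that capture.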

I wanted to get a supout snapshot of this, so I ran '/sys supout' while it was happening. This, of course, generated CPU load on the host, and the minute it did that, the pings all went back to <1ms response times again AND 'system health' normalized! The guest MR was still unresponsive. Once the supout was finished being generated, and the CPU was again doing next-to-nothing, pings started doing the cyclical jittery latency thing again, and 'system health' started showing erroneously-low values again as well. A few seconds after this, the host locked up and watchdog rebooted it. (I have the supout still, if support thinks it would be helpful at all to look at.)

This episode was interesting for a few reasons:

1) Normally, even if only the guest locks up, it only lasts for a minute or two, and then "unfreezes" itself, much like the host does if you disable watchdog. In this case, this whole episode transpired over about a 10-minute period, and the guest never again became responsive.

2) The bizarre pattern to network latency between the host and my laptop as well as the incorrect numbers under 'system health' both remained UNTIL I told the host to do something that generated some load on the CPU. As long as the CPU was being loaded down by the host, those two oddball symptoms were not observable.

I was not able to repeat this another time; every other episode resulted in the host locking up, which just ended up kicking off the watchdog. I have not tried to see what would happen on this board with watchdog disabled yet. I would guess that it would be more prone to staying locked up for longer periods of time than most other boards, if my experience with the guest locking up is any indication. In fact, I would not be surprised if, after locking up, it remained that way indefinitely until a reboot.

So at this point I had tried 3 combinations:

1) 12v @ 680MHz
2) 24v @ 680MHz
3) 24v @ 400MHz

#1 and #2 seemed to be equally unstable, and #3 was still unstable but less so. I wanted to try one more combination: 12v @ 400MHz. And, guess what: this board so far is stable at these settings. I've had it running for 3.5 hours at this point and neither the host nor the guest have locked up at all. I have also tried rebooting the host and guest several times, and both come up just fine every time. No reboot cycle ensues.

So, barkas and timberwolf: on your boards that do not run stable even at 12v, would you be so kind as to also try underclocking to 400MHz while continuing to use 12v power? Obviously this is not a good solution, but it would be interesting to see if the "weaker" boards out there suddenly become stable when their CPUs are underclocked (== drawing less power and/or outputting less heat?). I know that others have claimed they tried underclocking in the past to no effect, but I don't know that anyone until now has actually tried underclocking while also changing the power supply. Based on my experience today, it seems both can have an effect separately, and have a greater effect when combined together.

I still wonder if this is a capacitor problem. I'm half-tempted to have one of the guys on-staff replace the capacitors on this board for me, and see if it suddenly becomes stable. (The other half of me wants to hold onto this board, maybe so that it can be shipped back to Latvia for analysis, since it is a uniquely extreme example that is VERY easy to reproduce the problem on.) Like I said, it's a brand-new board and the capacitors look perfectly fine, and are the same brand between the "good" board and the "weak" board (brown Su'scon). But I have my suspicions, based on janisk's experience with his original test board becoming completely stable even @ 24v after he replaced the capacitors that went bad.

I have a couple more boards at my disposal for testing, some that came with different capacitors from the factory (black Panasonic/Matsushita), and others that had bad caps that spoiled (green Su'scon) and that we repaired on our own with new caps (brown Nichicon), as well as a couple more brand-new boards from the same batch that the "good" board and the "weak" board both came from. I will keep you all updated on my progress.

-- Nathan
Last edited by NathanA on Sat May 26, 2012 10:24 am, edited 1 time in total.
 
timberwolf
Member Candidate
Member Candidate
Topic Author
Posts: 274
Joined: Mon Apr 25, 2011 12:08 pm
Location: Germany

Re: Has MikroTik given up on MetaROUTER?

Sat May 26, 2012 10:13 am

NathanA, thank you very much for putting that much effort into this.
I may try 400MHz@12V if my time allows, but I am not very tempted to do so at the moment.
This is because, even if the behavior changes, it still won't help me or others having problems with this board. The only one who can contribute any useful input by now is, in my eyes, MT.

Let me ask you a question: do you have the feeling that there might ever be a solution from MT?
I would like to think so, but this thread is identical to every MR thread we had before in two points:
1.) A great deal of information provided by users, confirming the problem and trying to narrow it down.
2.) Very little information from MT.
 
NathanA
Forum Veteran
Forum Veteran
Posts: 801
Joined: Tue Aug 03, 2004 9:01 am

Re: Has MikroTik given up on MetaROUTER?

Sat May 26, 2012 12:19 pm

Let me ask you a question, do you have the feeling, that there might ever be a solution from MT?
How should I know? I'm not a prophet. :P
I would like to think so, [...]
So would I. The way that I look at it, though, is that it is in my best interest -- our best interest -- to work together and with MikroTik to get these problems solved and fixed. MetaROUTER as a concept is brilliant: virtual machine support on little single-board computers? Awesome! Furthermore, I have actually seen MetaROUTER working and working well, and through that experience I've seen its potential. I'm also convinced at this point that MetaROUTER software is solid* and that what we are seeing is a hardware issue, otherwise why would different instances of the same board model act differently?

The fact is that even if the problem never gets solved, at $99USD, there is no other product in its class like the 450G. What other SBC out there for under $100 USD has 0.5GB of flash storage, a plurality of gigabit interfaces, and has a CPU and OS that can run virtual machines? So my desire is for it to become an awesome MetaROUTER platform. If that wish never materializes, there isn't another product I can "switch" to that will be able to fill its shoes. It's either the 450G or nothing.** So why not try to work on the problem?

-- Nathan

* At least, it is solid on MIPSBE. Jury is still out on PPC.
** If you know of alternatives that I'm not aware of, I'm all-ears.
Last edited by NathanA on Sat May 26, 2012 12:25 pm, edited 1 time in total.
 
broadband
just joined
Posts: 21
Joined: Mon Aug 15, 2011 9:38 pm

Re: Has MikroTik given up on MetaROUTER?

Sat May 26, 2012 12:24 pm

My guess is that the problem could be the switching (buck) regulator's transient response time or, less likely, its output ripple. Increasing the core frequency or using more on-chip resources may require a faster transient response at the power pins of the Atheros (and of other SoCs as well). So it is worth checking the power supply requirements in the data sheet. Besides, the placement and quality of the decoupling (ceramic) capacitors (expecting X7R, not X5R) is very important as well. Transient response time is also a function of the input voltage (12v, 24v) of the switching regulator.

Best regards

Ali
 
timberwolf
Member Candidate
Member Candidate
Topic Author
Posts: 274
Joined: Mon Apr 25, 2011 12:08 pm
Location: Germany

Re: Has MikroTik given up on MetaROUTER?

Sat May 26, 2012 2:15 pm

So why not try to work on the problem?
I totally agree with you on the potential of MR. But to keep my answer short, WE are not working on the problem, we are fumbling around, providing information to MT without getting anywhere. And as long as I don't get any proof that MT is working on the problem, I won't waste my time.
 
reverged
Member Candidate
Member Candidate
Posts: 270
Joined: Thu Nov 12, 2009 8:30 am

Re: Has MikroTik given up on MetaROUTER?

Sat May 26, 2012 11:47 pm

So why not try to work on the problem?
I totally agree with you on the potential of MR. But to keep my answer short, WE are not working on the problem, we are fumbling around, providing information to MT without getting anywhere. And as long as I don't get any proof that MT is working on the problem, I won't waste my time.
I'd have to agree with timberwolf here. This has been a very one-sided, thankless (from MT) investigation.
I specifically chose the 450G for a project some time ago based on MR. Played with it for hours and gave up.
Now, I just plug a cheap OpenWRT box into a 750GL!
All I need is something that can make ssl gets and snmp queries. No horsepower required.

We have nothing but theories and empirical data from the limited things we can test.

Nathan: You are doing a lot of good work and you have finally gotten to the point where you have found a 450G that acts as others (myself included) have seen for some time. There is the very old, long thread (circa Oct 2009) http://forum.mikrotik.com/viewtopic.php?f=15&t=35800 describing this problem with the 450G. That thread is full of power supply theories, tests and failures.

2-1/2 years later there is no result from Mikrotik. Not a clue about the cause of this problem.

They continue to deny being able to reproduce the problem, perhaps because it is affecting only some of the boards.
I get that. I'm an EE and I know how that happens. But you pulled a second board from your stock and saw this problem.
Others have reported the problem, sent supout, etc, etc.

MT tells you to 'tweak' your config or disable packages, etc.
That is a complete workaround, if it works, and comes with no explanation.
It is not a solution to a problem.
The only solution to this problem is one that comes with a sane explanation.

Or maybe MT has no clue where to start. I get that too.

Or MT knows what the problem is and refuses to fix it. This would be sad.

Or the MR guru has died or departed MT.

Or.....the list goes on and on.

So what happens next? There needs to be an action step.
Does Nathan ship his kit to Latvia? Do they run it at the same V/F? Wikipedia tells me Latvia is 220V/50Hz.
Does MT send Nathan a shipping label and commercial invoice documents (if required)?
Does MT send him a replacement 450G or 2 for his efforts?

It's long past the time for MT to get in the game and request hardware from those able to reproduce the problem or declare MR a "box of chocolates" feature on the 450G.
 
peson
Trainer
Trainer
Posts: 183
Joined: Tue Jul 20, 2004 10:33 am
Location: Sweden

Re: Has MikroTik given up on MetaROUTER?

Mon May 28, 2012 9:10 am

MT tells you to 'tweak' your config or disable packages, etc.
That is a complete workaround, if it works, and comes with no explanation.
It is not a solution to a problem.
The only solution to this problem is one that comes with a sane explanation.

Or maybe MT has no clue where to start. I get that too.

Or MT knows what the problem is and refuses to fix it. This would be sad.

Or the MR guru has died or departed MT.

Or.....the list goes on and on.
First, I've tried a 12V adapter for one of my 450Gs. It doesn't reboot as frequently as with the 24V adapter, but it still does.
As of today it has been running for 3 days.
I'm in Sweden, so we have a 220V/50Hz supply.

And so to my headache: the fact that MT doesn't respond.
I've invited them to collaborate with us, instead of having us fumbling around and trying things to "solve" our problems; read my posts http://forum.mikrotik.com/viewtopic.php ... 50#p312300 and http://forum.mikrotik.com/viewtopic.php ... 00#p313421
I want to solve this, whether it's a hardware or a software issue. I think it's a combination.

/Paul
Reboot is the last resort, try to find out what's wrong instead.
 
janisk
MikroTik Support
MikroTik Support
Posts: 6283
Joined: Tue Feb 14, 2006 9:46 am
Location: Riga, Latvia

Re: Has MikroTik given up on MetaROUTER?

Mon May 28, 2012 10:18 am

Hi NathanA, that is the behaviour I am seeing on the RB450G when I see the freezes: ping times coming down from some certain value to a normal value (0.3, 0.5 ms). When a freeze happens and not a lot of packets are going to the router, after a while all of the packets get replied to; on Linux/Mac you can actually see that all ICMP requests get replied to at the same time. If there are a lot of packets around on the network (a lot of broadcasts, or you just bash the target router with /tool traffic-generator), buffers fill up and later ICMP messages are missing. That is the problem I am working on right now.

EDIT: if watchdog is enabled, the router is rebooted in a moment, so I see this only with watchdog disabled.
 
NathanA
Forum Veteran
Forum Veteran
Posts: 801
Joined: Tue Aug 03, 2004 9:01 am

Re: Has MikroTik given up on MetaROUTER?

Mon May 28, 2012 12:37 pm

janisk,

Actually, what I was seeing was different. I think you misunderstood my explanation. I understand what you are talking about: when the host (450G) freezes, if you ping the host, it doesn't respond while it is frozen, but it still "queues up" the requests, and then the responses all come back at once after it "un-freezes". I have seen this, too.

But this is not what was happening this time. Like I said, there are 2 different types of "freezes". The most common "freeze" is when the host freezes up, which is what you are talking about. But the second kind of "freeze" is when only the guest MetaROUTER freezes and the host 450G is still responsive. When this happens, the host still responds! I can ping the host, WinBox session doesn't disconnect, I can look at logs, change settings, watch interface utilization, etc. But if I open up the MetaROUTER console for the guest, and try to type something in there, nothing happens. And if I ping the guest, I get no response. Then, 2 minutes later (give or take), the guest "un-freezes" and starts responding normally again to everything: console and networking.

When this second kind of "freeze" happens, then hardware watchdog does not engage, because the host is still responsive; only the guest is not responsive!
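
To make the distinction concrete, here is a small Python sketch of how a monitoring script might label the two kinds of freeze from simple reachability checks. The names (`Freeze`, `classify`, `watchdog_would_fire`) are hypothetical, and how the reachability booleans are obtained (ping, console probe, etc.) is left out; this only encodes the logic described above.

```python
from enum import Enum

class Freeze(Enum):
    NONE = "no freeze"
    GUEST_ONLY = "guest-only freeze"
    HOST = "host freeze"

def classify(host_replies: bool, guest_replies: bool) -> Freeze:
    """Label the current state from two reachability checks."""
    if not host_replies:
        # The guest runs on the host, so a frozen host takes the guest with it.
        return Freeze.HOST
    if not guest_replies:
        return Freeze.GUEST_ONLY
    return Freeze.NONE

def watchdog_would_fire(state: Freeze) -> bool:
    """Per the description above, the hardware watchdog only engages
    when the host itself stops responding."""
    return state is Freeze.HOST
```

Logging the result of `classify` once per second alongside the host ping time would make it easy to line up guest-only freezes with the latency pattern.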

And when this kind of "freeze" happens with the guest, then pings to the host work, but the round-trip ping times to the host have a very strange pattern to them: they jump up to a strange number (like 10ms), and then respond 1ms faster every second. Once ping times to the host get back down to 1ms, then it jumps back up to 10ms again and the pattern repeats. So the host is still responsive during this, but something is slowing down its response time imperceptibly; however, you can see it in the ping jitter. When this happens, the guest is not busy with a task and it is not loading down the CPU (CPU usage is between 1-3% on the host).

When the guest finally "un-freezes", then pings to the host return to normal and the jitter is gone.

I documented this observation in hopes that it might help the developers to better understand what is going on with this problem, because I believe that the 2 kinds of "freezing" are in fact related: they are just 2 different symptoms of the same problem. The reason I say this is because when I do something to work around the problem (switch power supplies, underclock CPU), both kinds of freezes stop happening.

Also, I was wondering if you have any thoughts or comments on the other interesting part of my post: that I found a 450G board that is really easy to reproduce the problem on? On this board, with normal settings (680MHz), even with 12v power supply, I can make the host freeze and watchdog kick the board in under 30 seconds. I can repeat this 100% of the time on this particular board. It is a brand-new board with healthy-looking capacitors, and it works fine as long as MetaROUTER is not being used on it. So I now have 2 boards: 1 board that freezes up occasionally (between 15 minutes and a few hours) on 24v power but works 100% reliably on 12v power, and 1 board that freezes just a few seconds after starting MetaROUTER guest on it, every time, on both 12v and 24v power when CPU is at default clock (680MHz). But this same board becomes stable on the 12v power supply when underclocked to 400MHz!

Because I have 2 boards that were manufactured very closely together and are visually indistinguishable from each other, this suggests that there is a physical hardware problem causing this issue on the 450G, and that some 450G boards are more susceptible (or sensitive) to the problem than others. The question is why: what is different between the 2 450G boards that have very close serial numbers and the same components on them?

The other question is whether this second board would be useful to you guys since the problem is repeatable on it 100% of the time and it only takes seconds for it to happen. It might be helpful for you guys to have it as an aid to you while working on the problem since you don't have to wait around for hours to see if a change you made fixes the issue or not: it literally crashes 30 seconds (at most) after boot, every single time.

-- Nathan
 
NathanA
Forum Veteran
Forum Veteran
Posts: 801
Joined: Tue Aug 03, 2004 9:01 am

Re: Has MikroTik given up on MetaROUTER?

Mon May 28, 2012 2:42 pm

this all looks grim
Don't say that. :( There has to be an answer. We know this because 433AH with same SoC is rock-solid. We're missing something...
[...] voltage supplied to CPU was stable and within acceptable margins.
Even if they are acceptable, have you compared against what you see on a 433AH? Is the observed voltage range delivered to CPU "tighter" on that board, perhaps? When a 450G is about to crash or has crashed, do you see any unusual fluctuations in any live measurements?

I'm sure your team has already checked a lot of this stuff. My job right now I guess is to ask all of the obvious questions in hopes I accidentally hit upon something that hasn't been tried yet. :?

In other news, I've started up my PPC MetaROUTER lab again. I began with the 2 RB1100s that I originally had peson's test setup on. I can get both boards to crash/reboot every 4-8 hours or so if I load down the CPU in the MetaROUTER. Based on my success with underclocking on the 450G, I'm trying a few similar things on the RB1100. I'll let you all know how it turns out.

-- Nathan
 
janisk
MikroTik Support
MikroTik Support
Posts: 6283
Joined: Tue Feb 14, 2006 9:46 am
Location: Riga, Latvia

Re: Has MikroTik given up on MetaROUTER?

Mon May 28, 2012 3:17 pm

yes, the RB433AH was used as an example, and the reference was the actual CPU reference for the voltage you have to supply to the CPU for it to work properly. Deviation is in mV and it stays within the margins. Unfortunately I did not write down the actual values, since the measurements were done by an electrical engineer who actually checks that stuff and I was just watching him.
 
janisk
MikroTik Support
MikroTik Support
Posts: 6283
Joined: Tue Feb 14, 2006 9:46 am
Location: Riga, Latvia

Re: Has MikroTik given up on MetaROUTER?

Mon May 28, 2012 3:21 pm

some time ago there was a question about what GPIO is - it is used for health monitoring (voltage and temperature)
 
janisk
MikroTik Support
MikroTik Support
Posts: 6283
Joined: Tue Feb 14, 2006 9:46 am
Location: Riga, Latvia

Re: Has MikroTik given up on MetaROUTER?

Mon May 28, 2012 4:00 pm

another thought - how many ethernet interfaces do you have linked? Is there any difference when more or fewer than the usual test ports are linked? Which port/ports are you using?
 
timberwolf
Member Candidate
Member Candidate
Topic Author
Posts: 274
Joined: Mon Apr 25, 2011 12:08 pm
Location: Germany

Re: Has MikroTik given up on MetaROUTER?

Mon May 28, 2012 6:27 pm

some time ago there was a question about what GPIO is - it is used for health monitoring (voltage and temperature)
janisk
Have you and the devs tried disabling all functions/drivers using GPIO, to see if it makes a difference?
There have been some reports of strange voltage and temperature readings in conjunction with MR.
 
NathanA
Forum Veteran
Forum Veteran
Posts: 801
Joined: Tue Aug 03, 2004 9:01 am

Re: Has MikroTik given up on MetaROUTER?

Mon May 28, 2012 11:21 pm

I agree -- it would be interesting to disable hardware monitoring entirely and see if it has any effect on the issue. Remember that hardware monitoring between the 450G and the 433AH is different: 433AH only has voltage readout, while 450G adds temperature. I don't believe the 433AH voltage jumps around during MR use but I will double-check.

About ethernet ports: I have been using between 1 and 3 ethernet ports at any given time, and it does not seem to matter how many I have linked or which one(s) I am using. The board that crashes every 30 seconds does it whether or not I only have 1 thing plugged in, and does it if it is plugged into ether1 or ether2-5.

About PPC testing: last night, I set the memory speed on both 1100 routers to 333MHz instead of 400MHz. This of course also had the effect of setting the CPU to run at 666MHz instead of 800MHz because FSB went from 200MHz to 166MHz. Results so far are looking promising as they have both been continuously running for over 13 hours, and they have been running at 100% CPU and exchanging data with each other this whole time. If this ends up working, I will bump the memory speed back to 400MHz and then reduce CPU to 600MHz so that I can try to determine whether it is the memory speed reduction or the CPU core speed reduction that stabilized it.
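The clock relationship described above can be sketched roughly as follows. The 2x and 4x ratios are not taken from any datasheet; they are only inferred from the figures quoted in this post (400/200/800 and 333/166/666), and the function name is made up for illustration.

```python
# Ratios inferred from the numbers in the post, NOT from documentation:
# memory 400 MHz <-> FSB 200 MHz <-> CPU 800 MHz.
MEM_PER_FSB = 2  # assumption: memory clock = 2 x FSB
CPU_PER_FSB = 4  # assumption: CPU clock = 4 x FSB


def derived_clocks(memory_mhz):
    """Return (fsb_mhz, cpu_mhz) implied by a memory clock setting."""
    fsb = memory_mhz / MEM_PER_FSB
    cpu = fsb * CPU_PER_FSB
    return fsb, cpu


for mem in (400, 333):
    fsb, cpu = derived_clocks(mem)
    print(f"memory {mem} MHz -> FSB {fsb:.1f} MHz, CPU {cpu:.0f} MHz")
```

Under these assumed ratios, the follow-up test (memory back at 400MHz, CPU at 600MHz) would correspond to keeping the 200MHz FSB and dropping the CPU multiplier to 3x.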

-- Nathan

EDIT: Update: even with the underclocking, I'm still getting reboots on the PPC side. I will continue to experiment.
 
janisk
MikroTik Support
Posts: 6283
Joined: Tue Feb 14, 2006 9:46 am
Location: Riga, Latvia

Re: Has MikroTik given up on MetaROUTER?

Tue May 29, 2012 3:42 pm

If possible, monitor the guest memory state. Maybe the problem is somewhere else entirely.
 
NathanA
Forum Veteran
Posts: 801
Joined: Tue Aug 03, 2004 9:01 am

Re: Has MikroTik given up on MetaROUTER?

Tue May 29, 2012 3:57 pm

If possible, monitor the guest memory state. Maybe the problem is somewhere else entirely.
Are you talking about on PPC or MIPS? Regardless, on both, I have been allocating 128MB of memory to guests, and the guests are not coming close to using it up...I have been watching. At most about 30MB is in use and the remainder of it is free. Even if the guest was using up all memory allocated to it, that should not cause the *host* to behave erratically.

Also, on PPC, I don't believe the root cause of the problem is the same. I've been running more tests. On RB1100, unlike the RB450G, the watchdog is *more* likely to kick the router when I underclock further: I set CPU to lowest possible clock (333MHz), told the MetaROUTER to make itself busy, and watchdog would kick off anywhere between 2-45 minutes.

So I disabled the watchdog. It lasted much longer: I got almost to 2 hours. During that time, the router never "locked up" for 0.5-2 minutes at a time, like the 450G does. This makes me wonder if the watchdog on PPC is somehow being triggered by "false positives" when MetaROUTER is running?

Also, I believe there are 2 separate problems with MetaROUTER on PowerPC. The first is that watchdog is kicking the router when it shouldn't for some reason. But even when watchdog is off, the router will either crash and hang (requiring me to pull power), or reboot itself! This is what happened after 2 hours with watchdog off ('/system watchdog set watchdog-timer=no') on one of my RB1100s which was still underclocked to 333MHz.

Furthermore, when it reboots itself, it still says "(cause 1)" in the logs, even though watchdog is off! What does "(cause 1)" actually mean? Is it possible this can refer to reboots outside of ones caused by the watchdog? Or is this possibly an indication that the hardware watchdog is not really disabled?

As far as the 450G goes, I still think it would be interesting to see if you guys can make a build of MIPSBE RouterOS that doesn't include any hardware monitoring. If it were a separate package I would just try to disable/uninstall it, but it isn't. We keep asking ourselves what the differences are between the 450G and the 433AH (SoC is the same...so is it the power chain? the switch chip? etc.), and one obvious difference that hasn't been explored is the difference in hardware monitoring: 433AH is voltage only, and 450G has temp and voltage. Also, 450G kicks off hundreds of GPIO interrupts per second (which you told us is related to the hardware monitoring) whereas 433AH does not, even though it has voltage monitoring. So perhaps there is a difference in how the monitoring is implemented between the 433AH and the 450G, and this is somehow contributing to the problem.

-- Nathan
 
janisk
MikroTik Support
Posts: 6283
Joined: Tue Feb 14, 2006 9:46 am
Location: Riga, Latvia

Re: Has MikroTik given up on MetaROUTER?

Wed May 30, 2012 8:53 am

i will try locally without monitoring turned off, as it is not that easy to make npk with that change.
 
timberwolf
Member Candidate
Topic Author
Posts: 274
Joined: Mon Apr 25, 2011 12:08 pm
Location: Germany

Re: Has MikroTik given up on MetaROUTER?

Wed May 30, 2012 9:55 am

i will try locally without monitoring turned off, as it is not that easy to make npk with that change.
Sorry, I don't understand what you are saying. How could you try without turning monitoring off, when the test would be to actually turn monitoring off?
 
NathanA
Forum Veteran
Posts: 801
Joined: Tue Aug 03, 2004 9:01 am

Re: Has MikroTik given up on MetaROUTER?

Wed May 30, 2012 12:40 pm

Okay, I am on top of the world right now. You know why?

'cause I fixed my board. And in the process, I have confirmed that the problem is somehow related to health monitoring.

Okay, "fixed" isn't exactly the right term. I found a workaround. I decided that if MikroTik could not build us a test version of RouterOS that had health monitoring disabled, I was going to find a way to disable it myself. And so I did.

I don't think I should go into step-by-step details of what I did, because MikroTik would probably not appreciate that. But I'm hoping that my discovery can help them zero in on a fix, now that they know where they should be looking. And perhaps in the meantime, until they find a fix, they might decide to re-think giving us users the option to disable health monitoring. Or, perhaps, disable health monitoring if a MetaROUTER is running.

In summary, though, what I discovered is that MikroTik fortunately did not link the drivers to the health monitoring into their main kernel file, but kept it as a separate kernel module file. The file name is voltage.ko. I gained access to the yaffs2 rootfs file system on the NAND, and deleted that file. (If someone else wants to try the same thing, I'm afraid I am going to have to leave that as an exercise to the reader.)

This absolutely solved the problem.

Remember my second RB450G? The one that gets stuck in a reboot loop, and reboots every 30 seconds if there is a MetaROUTER on it? Well, I got 2 more boards out, and found a second one that behaves like it, as well as a second "good" board that seems to work fine on 12v power, like my first one. So far, I am 2 for 4 on boards.

So I took my original "bad" board, NetInstalled it with 5.16 to start with a clean slate, imported my OpenWRT image into it again, and it immediately started "reboot looping" like before. I sat there watching it for 10 minutes rebooting over and over and over and over again...

I then removed the voltage.ko file, and booted the board back up. Here is what I can tell you after doing this:

1) '/system health print' now returns nothing.
2) '/system resource irq print' now shows no GPIO IRQs firing off anymore! It's not even on the list!
3) My board has completely stopped rebooting.
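For anyone who wants to put numbers on the "IRQs firing off" observation, here is a minimal sketch that turns two counter samples into per-second rates and reports which interrupt names disappeared. It deliberately does not parse the output of '/system resource irq print', since its exact format is not assumed here; the counter values below are hypothetical, not measured data.

```python
def irq_rates(before, after, interval_s):
    """Per-second interrupt rate for every IRQ name present in both samples."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    return {
        name: (after[name] - before[name]) / interval_s
        for name in before
        if name in after
    }


def vanished(before, after):
    """IRQ names listed in the first sample but missing from the second."""
    return sorted(set(before) - set(after))


# Hypothetical counter values taken 10 seconds apart (illustrative only).
sample_1 = {"gpio": 12_000, "switch": 4_000}
sample_2 = {"gpio": 15_000, "switch": 4_500}

print(irq_rates(sample_1, sample_2, 10))
print(vanished(sample_1, {"switch": 5_000}))
```

Copying the counters by hand into two dictionaries before and after a change (such as removing the module) is enough to compare rates between boards.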

I have shut down and rebooted the MetaROUTER several times, have generated CPU activity within the MetaROUTER, and have even rebooted the 450G a few times with the MetaROUTER enabled. It is solid now. No more lock-ups, no more reboots. I can't explain it because I still don't know enough about how health monitoring and MetaROUTER both work. I also can't explain why some boards are more "sensitive" to whatever this conflict is than others are; we've already established that there definitely is a difference (physical or otherwise) between boards. But the problem *IS* fixed on my board, and it has been fixed through software.

Oh, and did I mention that it is running on a 24v power supply with the CPU running @ 680MHz? :shock:

I will be doing some more stress-testing on this board tomorrow. I am highly confident that I will find that my MetaROUTER instability problems are cured, even after extended testing. I will report back later.

-- Nathan
 
janisk
MikroTik Support
Posts: 6283
Joined: Tue Feb 14, 2006 9:46 am
Location: Riga, Latvia

Re: Has MikroTik given up on MetaROUTER?

Wed May 30, 2012 1:23 pm

I am seeing something similar. Just a caution: what will happen if something else starts to generate interrupts? Anyway, the message has been relayed to the devs.
 
NathanA
Forum Veteran
Posts: 801
Joined: Tue Aug 03, 2004 9:01 am

Re: Has MikroTik given up on MetaROUTER?

Wed May 30, 2012 1:43 pm

I am seeing something similar. Just a caution: what will happen if something else starts to generate interrupts?
Yes, I wonder about this, too. I suspect that timberwolf was right and that we've got some kind of deadlocking situation occurring, and it is a matter of timing. Minute physical differences between board components are perhaps causing slight timing variances between boards with regard to whatever the root cause is. That's the only way I can explain how it seems to vary between boards, and also why input voltage and CPU clock speed might both affect it.
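To illustrate what a timing-dependent deadlock looks like in general, here is a small userspace sketch of a lock-order inversion. It is not a model of the RouterOS kernel or of MetaROUTER interrupt handling, whose internals are unknown to us; the two barriers exist only to force the "unlucky" timing on every run, and the timeout makes the stall observable instead of hanging forever.

```python
import threading

lock_a = threading.Lock()
lock_b = threading.Lock()
both_hold_first = threading.Barrier(2)  # forces the unlucky interleaving
both_tried = threading.Barrier(2)       # keeps the first locks held until both have tried
results = {}


def worker(name, first, second):
    with first:
        both_hold_first.wait()
        # A plain acquire() would block forever here; the timeout turns
        # the deadlock into a result we can inspect.
        got_second = second.acquire(timeout=0.2)
        results[name] = got_second
        both_tried.wait()
        if got_second:
            second.release()


t1 = threading.Thread(target=worker, args=("t1", lock_a, lock_b))
t2 = threading.Thread(target=worker, args=("t2", lock_b, lock_a))
t1.start()
t2.start()
t1.join()
t2.join()

print(results)
```

Both entries come back False because each thread holds the lock the other one needs. Without the first barrier, whether the two threads collide depends purely on scheduling, which is the kind of variance that could make one board misbehave and an apparently identical one run fine.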

In any case, I plan to run the board through its paces tomorrow, which will involve having 2 RB450Gs on 24v power @ 680MHz (both with this "fix" implemented) both generating traffic to each other from within the MetaROUTERs on each board, and I will leave them to do that for hours. That should exercise the switch chip, which also seems to generate plenty of interrupts all by itself. :)

I think I will also try removing voltage.ko from my RB1100s as well, and see if that makes any difference at all on that platform.
Anyway, the message has been relayed to the devs.
Awesome, thanks. Let us know when you receive word back from them.

-- Nathan

EDIT: I just had a thought...say that the problem with MetaROUTER is with a generic deadlocking on interrupt handling. On PPC, I just realized that I originally wasn't having any trouble with my tests until I introduced a second MetaROUTER on each RB1100. Originally I was only running a single OpenWRT MetaROUTER. Then other people on this thread suggested that maybe the problem was with RouterOS MetaROUTERs, so I made a RouterOS MR on each RB1100 that all traffic from the OpenWRT MR had to go through. What if the problem isn't with RouterOS MR, but with having more than one MR? Each MR instance has its own set of 3 IRQs: vm, xfs, and xdev. If you have 2 MRs running, then there are 6 extra interrupt lines (3 x 2). If the problem is interrupt deadlocking, then it seems like having > 1 MetaROUTER on a RouterBOARD would increase your likelihood of a lock-up/reboot.
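The interrupt-line arithmetic from the paragraph above, as a tiny sketch. The three names are the ones listed in this post; the function name is hypothetical, and the sketch says nothing about whether more lines actually raise the crash probability.

```python
PER_GUEST_IRQS = ("vm", "xfs", "xdev")  # per-MetaROUTER IRQs as named in the post


def extra_irq_lines(guest_count):
    """Additional interrupt lines created by guest_count MetaROUTER instances."""
    if guest_count < 0:
        raise ValueError("guest_count must not be negative")
    return guest_count * len(PER_GUEST_IRQS)


for n in (1, 2, 3):
    print(f"{n} guest(s) -> {extra_irq_lines(n)} extra IRQ lines")
```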
 
timberwolf
Member Candidate
Topic Author
Posts: 274
Joined: Mon Apr 25, 2011 12:08 pm
Location: Germany

Re: MetaROUTER questions

Wed May 30, 2012 4:32 pm

First I would like to state, that I am not pleased by the fact that MikroTik changed the Subject line of this thread!
EDIT: And I therefore restored the original subject.

NathanA
Great work, I think I know what you did. ;-)
I am a little disappointed that you had to come up with this and that MT couldn't conduct this obviously very easy test.
But I think janisk is also right: what you probably did is just lower the chances of this deadlock happening.

janisk
I still maintain the thesis that something is wrong with either your interrupt service routines or the interrupt controller of the SoC itself.
I also understand that debugging this code is not an easy task, but you really have only two sane choices here:
1.) Dive into the code, down to assembly instructions if necessary (and I think it is).
2.) Give up on MR on this type of board.
The debugging need not be done on this specific board, as there seems to be a problem with its debugging interface (JTAG?), if I recall one of your posts correctly; it can be done on any board which uses this SoC. If you then can't find an error in any non-health-monitoring-related ISR, you most probably have one in the health monitoring code. Maybe you can just start there, as I would expect some assembly code in this module anyway, which might be the cause of this problem if you are lucky.
 
NathanA
Forum Veteran
Posts: 801
Joined: Tue Aug 03, 2004 9:01 am

Re: MetaROUTER questions

Wed May 30, 2012 11:03 pm

Great work, I think I know what you did. ;-)
Heh, well, it's not exactly rocket science. ;)
I still maintain the thesis that something is wrong with either your interrupt service routines or the interrupt controller of the SoC itself.
It's possible that the general interrupt servicer was actually written by the SoC manufacturer, or a contractor of the manufacturer, and not MikroTik directly...I could be wrong. But I note that the IRQ code for the RB500 SoC has a copyright notice by an embedded systems software company on it. Perhaps RB4xx interrupt handler code came directly from Atheros, and they may need to be brought in on this discussion.

-- Nathan
 
timberwolf
Member Candidate
Topic Author
Posts: 274
Joined: Mon Apr 25, 2011 12:08 pm
Location: Germany

Re: MetaROUTER questions

Wed May 30, 2012 11:34 pm

It's possible that the general interrupt servicer was actually written by the SoC manufacturer, or a contractor of the manufacturer, and not MikroTik directly...I could be wrong. But I note that the IRQ code for the RB500 SoC has a copyright notice by an embedded systems software company on it. Perhaps RB4xx interrupt handler code came directly from Atheros, and they may need to be brought in on this discussion.
Well, that's the point (again): speculating won't get us anywhere, as there are many possible implementations for an ISR.
But I must confess that this thought, about third-party software outside of MT's control, crossed my mind about 2 years ago when MT first stopped doing anything about the well-known MR instability on these boards...

What really upsets me at this point:
Changing the subject of this thread to "MetaROUTER questions" looks like a maneuver to move this thread and this problem out of sight again. I don't know why nobody at MT can just admit that they are never going to fix this bug on the RB450G, because it looks as if this is exactly where we are heading.
 
NathanA
Forum Veteran
Posts: 801
Joined: Tue Aug 03, 2004 9:01 am

Re: Has MikroTik given up on MetaROUTER?

Thu May 31, 2012 7:07 am

Well it obviously hasn't been a full 24 hours yet, but I've been running my 2 "unstable" 450G boards through a stress-test for about 5 hours now, and they have both been rock-solid after implementing this fix (removing voltage.ko).

Here is the configuration I've got on both; they are both set up identically:

- RouterOS 5.16, and latest RouterBOOT (2.39)
- Clocked @ 680MHz, running on 24v 0.8a power supplies.
- 2 MetaROUTERs each: 1 RouterOS, 1 OpenWRT + Asterisk
- RouterOS MetaROUTER is bridging Asterisk traffic to host, which is then NATting it.
- 30 continuous, simultaneous calls running between Asterisk instances.
(This is chewing up the CPU and generating switch chip interrupts.)
- Bidirectional TCP bandwidth test between hosts
(also chewing up the remainder of the CPU and generating switch chip interrupts.)

So they are both generating load on the CPU both inside and outside the MetaROUTER, and generating network traffic both inside and outside the MetaROUTER; in fact, traffic is originating from one MetaROUTER (OpenWRT), passing through a second (RouterOS), and then being NATted by the host before hitting the other 450G, which is doing the exact same thing. CPUs are at 100% load. Switch chips are generating interrupts at a rate way faster than GPIO ever counted up. And still, no instability has even been hinted at. Mind you that both of these boards couldn't remain running with MetaROUTER before without locking up and rebooting after between 5-30 seconds, unless they were both underclocked as well as undervolted.

I won't declare victory yet and will let this test run for a few days. But 5 hours is longer than I would expect these boards to run if there were still a problem.

-- Nathan
 
timberwolf
Member Candidate
Topic Author
Posts: 274
Joined: Mon Apr 25, 2011 12:08 pm
Location: Germany

Re: Has MikroTik given up on (MIPSBE/RB450G) MetaROUTER?

Thu May 31, 2012 6:21 pm

I just received the following warning from normis:
This is a warning regarding the following post made by you: viewtopic.php?f=15&p=319494#p319494

I'm sorry but the thread title is misleading, JanisK is trying to help, and the issue is nearly resolved now.
As I am not allowed to answer this message, I will have to do so here:

First, I can't confirm that the issue is nearly resolved; the only progress so far was due to NathanA "hacking" a production image.
Second, even if this issue were resolved, the subject and title of this thread would still not be "MetaROUTER questions".
I changed the title to better reflect the board and/or system which still isn't able to run MR.

At this point I would like some input on this political topic, and confirmation that I am not totally nuts, from the other contributors in this thread.

@normis
I am always open to discussions, but I am clearly not the one who is acting unreasonably here.
 
liteforce
newbie
Posts: 44
Joined: Sun Aug 16, 2009 8:06 pm

Re: MetaROUTER stability issues on certain MIPSBE boards

Fri Jun 01, 2012 12:46 am

Hi folks,

I've been lurking on this topic since it was created.

We have a number of RB1100 and RB1000 devices acting as core routers on our network; while we didn't buy the devices solely for MetaROUTER functionality, it was deemed a worthy enough feature to enable for the purpose of running small, single function, OpenWRT instances which would suffice where space/power was at an absolute premium and we could afford to sacrifice some CPU/RAM on the router to handle the task.

This was a big mistake on our part.

It was possible to get an OpenWRT instance to take the router down hard simply by running 'wget -O /dev/null http://some.random.url/' repeatedly - not a very nice thing to happen to a core router supporting a few hundred customers.
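For anyone wanting to reproduce this kind of load from a machine that lacks wget, a minimal Python sketch of the same idea follows: fetch a URL repeatedly and throw the body away. The function names and the injectable `fetch` parameter are made up for illustration, and the URL in the usage comment is just the placeholder from this post.

```python
import urllib.request


def fetch_and_discard(url, timeout=10):
    """Download url and discard the body, similar to wget -O /dev/null."""
    total = 0
    with urllib.request.urlopen(url, timeout=timeout) as response:
        while True:
            chunk = response.read(64 * 1024)
            if not chunk:
                break
            total += len(chunk)
    return total


def hammer(url, iterations, fetch=fetch_and_discard):
    """Call fetch(url) repeatedly; return (bytes_received, failures)."""
    received = 0
    failures = 0
    for _ in range(iterations):
        try:
            received += fetch(url)
        except OSError:  # urllib's URLError and socket timeouts derive from OSError
            failures += 1
    return received, failures


# Example (placeholder URL, as in the post):
# print(hammer("http://some.random.url/", 1000))
```

A sudden run of failures partway through the loop is a rough indicator that the router in the path went down.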

We also managed to duplicate the problem by running a RouterOS MetaROUTER, creating a simple no-firewall router with two interfaces, routing traffic from a PC running the exact same 'wget -O /dev/null' test through it; so we have virtualized MikroTik code, running in a supported MikroTik environment on MikroTik designed hardware - to rename the thread from the original title reeks of arrogance that the problem is not one for MikroTik to resolve and I applaud NathanA for his doggedness and determination in order to try and find the cause of these MetaROUTER issues which have made the use of this feature in production almost impossible.

I have also tried the RB450G - the only mipsbe RouterBOARD I own besides an unstable RB750 that was suffering from dodgy capacitors while I was testing this out - and strangely enough, I couldn't get the RB450G to crash at all regardless of whether it was running a RouterOS MetaROUTER or an OpenWRT MetaROUTER; the RB750 crashes were probably due to the bad capacitors but I'll be happy to test again now I've replaced them.

So, to the original topic author (timberwolf), I would kindly request that the thread title be updated to include PPC RouterBOARDs as well rather than be specific to the mipsbe RouterBOARDs - it might very well be two different issues plaguing both architectures and while NathanA is focusing his efforts on one particular model of RouterBOARD, I'm hoping that he uncovers something which will make the MikroTik devs look at the code again in a different light as I suspect it is going to turn out to be something so stupidly simple that MikroTik may be embarrassed when it is finally solved.

normis/janisk: We have and will continue to use MikroTik hardware as I personally believe that it suits our purposes perfectly and unlike the big vendors such as Cisco, you are willing to listen and engage with your customers in a personal manner via means such as this forum - that should not change - don't issue warnings to valued contributors who are bringing new information to the table, without financial recompense to themselves, with the only aim being to improve your products for the betterment of your own reputation and your standing amongst those customers who would love a stable implementation of MetaROUTER on the identified devices.

Regards,
Terry Froy
Spilsby Internet Solutions
http://www.spilsby.net/
 
normis
MikroTik Support
Posts: 24392
Joined: Fri May 28, 2004 11:04 am
Location: Riga, Latvia

Re: MetaROUTER stability issues on certain MIPSBE boards

Fri Jun 01, 2012 9:26 am

I am always open to discussions,
You can keep the topic title as you like, I just thought it was a better name. The goal of this technical forum is to solve issues and keep to the technical aspects of networking.
 
timberwolf
Member Candidate
Topic Author
Posts: 274
Joined: Mon Apr 25, 2011 12:08 pm
Location: Germany

Re: MetaROUTER stability issues on certain MIPSBE boards

Fri Jun 01, 2012 10:39 am

I am always open to discussions,
You can keep the topic title as you like, I just thought it was a better name. The goal of this technical forum is to solve issues and keep to the technical aspects of networking.
I chose the title very carefully, and the roughly 4000 views and 4 pages of posts seem to confirm that I was right. I don't deny that it is/was a provocative title from your (MT) point of view, but this problem has been around for at least 2.5 years and has always been played down or ignored by MT.
I really would like to focus on solving this problem WITHOUT any political games like changing the thread title or statements like "there is no problem", ok?
 
NathanA
Forum Veteran
Posts: 801
Joined: Tue Aug 03, 2004 9:01 am

Re: MetaROUTER stability issues on certain MIPSBE boards

Fri Jun 01, 2012 10:47 am

Well, now that the drama is (hopefully) over, I thought I would mention that my experiment has been running continuously for 32 hours now, and has been rock-solid the entire time. Switch interrupt requests are still firing off at a rate about 3x what the GPIO interrupt requests were being generated at, but there is no instability.

I will implement this fix on all 450Gs going forward until MikroTik has an official fix. I hope the official fix comes soon, because the one downside to my "fix" is that it will make RouterOS version upgrades impossible to do unless I am physically at the device.

Making the same modification to a PowerPC RouterBOARD for testing purposes is proving to be more difficult, but I will continue to work at it.

Terry (liteforce): That's very interesting that you found that PowerPC boards were more likely to crash when you generated network traffic (wget). I will try to use this technique when I load-test RB1100s as I continue my experiments; I've been having trouble finding a pattern to the MetaROUTER instability on my RB1100, and I haven't run across a PowerPC RouterBOARD that is as predictably unstable as the two 450Gs I now have, which makes testing both difficult and time-consuming (because it may take hours or even days before I know whether or not something I've changed has made any difference). Approximately how long would it take after you started your looping 'wget' test before you would see your 1100 crash and/or reboot?

Also, I would point out to you that a 750, with or without good capacitors, is going to be a very poor MetaROUTER platform. Believe me: I've tried. There just simply isn't enough RAM on the thing. You practically need at least 16MB to do anything useful or interesting, and RouterOS *requires* 16MB at the minimum anyway. I suspect there is overhead with MetaROUTER itself, so if you even have just 1 MetaROUTER on a 750, and you give it 16MB of RAM, that's half the RAM on the device, not counting overhead. I have tried this on a 750, and it just doesn't work...the host will slow to a crawl as it quickly runs out of memory, then the kernel OOM-killer will start wreaking havoc with essential RouterOS processes, and eventually watchdog will kick in. I've gotten a 750 stuck in a reboot-loop this way and had to reset to defaults to get it back.
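The memory budget argument can be written down as a quick calculation. The 32MB total is implied by the "half the RAM" remark above; the per-guest overhead is unknown, so it is left as a hypothetical parameter rather than a real figure.

```python
def host_memory_left(total_mb, guest_allocations_mb, overhead_per_guest_mb=0):
    """Memory left for the host after guest allocations and assumed overhead."""
    guests = list(guest_allocations_mb)
    used = sum(guests) + overhead_per_guest_mb * len(guests)
    return total_mb - used


# One 16MB guest on a 32MB board, with no overhead assumed:
print(host_memory_left(32, [16]))

# Same setup with a purely hypothetical 2MB of overhead per guest:
print(host_memory_left(32, [16], overhead_per_guest_mb=2))
```

Whatever the real overhead is, the host starts with at most half of the board's memory, which fits the out-of-memory behaviour described above.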

-- Nathan
Last edited by NathanA on Fri Jun 01, 2012 11:01 am, edited 1 time in total.
 
timberwolf
Member Candidate
Topic Author
Posts: 274
Joined: Mon Apr 25, 2011 12:08 pm
Location: Germany

Re: MetaROUTER stability issues on certain MIPSBE boards

Fri Jun 01, 2012 10:59 am

So a short summary at this point in the thread.


MIPSBE
-------
We did various tests on RB450G boards, most of them were conducted by NathanA. The conclusions for this board so far are:
1.) The power supply does have an influence but isn't the cause.
2.) Disabling hardware monitoring by a hack seems to improve the stability, presumably because of the much lower IRQ load.
3.) It seems that there has been no significant progress by the MT devs; maybe with point 2 there will be.
UPDATE: It more and more seems to narrow down to the GPIO ISR(s) as NathanA reports that high IRQ load from the switch chip doesn't seem to cause issues.

Other MIPSBE based boards are more stable, but boards which show many or high frequency GPIO IRQs seem also to be unstable.
I must confess that I can't recall by now which boards are user-reported stable and which not. Maybe someone can fill in.

PPC
----
NathanA conducted tests on RB1100 and RB1100AH boards, with only the latter showing some instability issues.
NathanA maybe you could write a short summary?

liteforce reports that there are issues with RB1000 and RB1100 too.

@liteforce
I will include PPC in the thread title. Thanks for your input.
 
NathanA
Forum Veteran
Posts: 801
Joined: Tue Aug 03, 2004 9:01 am

Re: MetaROUTER stability issues on certain MIPSBE boards

Fri Jun 01, 2012 11:34 am

1.) The power supply does have an influence but isn't the cause.
In addition, I would add that the CPU clock speed also has an influence but is also not the cause. And certain boards that appear physically the same and even were manufactured within the same week (maybe even the same day!) as each other behave differently. Putting all of this together further suggests a timing issue of some kind that makes a deadlock more or less possible, given certain circumstances.
2.) Disabling hardware monitoring by a hack seems to improve the stability, presumably because of the much lower IRQ load.
I have jumped to conclusions too soon at other points in this thread, and I feel that the part highlighted in bold might be similarly premature. The fact is that with the 2 boards I have that reboot within 5-30 seconds of booting up, the only rapid-fire interrupts at that point (after bootup) came from the health-related GPIO lines...I had not yet gotten to the point of introducing CPU load or network traffic, since all I had done at that point was import my MR image and start it up! Once I disabled hardware monitoring and set up my load-tests, the interrupts being generated since then by the switch chip have been 3x higher on average than the GPIO interrupts ever were (on account of the network traffic being generated), and yet my boards are still stable 32 hours later. So the number of interrupts being generated may not actually have anything to do with it. But, who knows: timing-related deadlocks can be such an unpredictable phenomenon, as has already been demonstrated...

(EDIT: I see that you already edited your own post and acknowledged the observation about the switch IRQs. :) )
I must confess that I can't recall by now which boards are user-reported stable and which not. Maybe someone can fill in.
So far, it seems like most of the 4xxAH-series are stable: 433AH, 493AH (possibly with the exception of 411AH, but I don't believe anyone has tested it extensively yet). The 4xxG-series are the ones with the problems (493G I believe was also reported to be unstable). If I had to hypothesize, I would think that any MIPS board that monitors more than one health resource (voltage, temperature, and others) is most likely to be unstable, while those that monitor either a single resource (voltage-only) or no resources are most likely to be stable. But this is just an assumption at this point. ;)
NathanA conducted tests on RB1100 and RB1100AH boards, with only the latter showing some instability issues. NathanA maybe you could write a short summary?
Gladly. It doesn't appear to be only the 1100AH boards at this point. For a while, I thought it was only my particular AH board since others had not reported problems, and because my AH board was hard-crashing even when hardware watchdog was enabled. And then before that I was convinced that my AH board was only crashing because I had overclocked it, and was stable after I returned the CPU to the factory-set clock rate (because it had run 5 days straight at one point without a problem).

But at this point I can reproduce crashes and reboots on 3 different 1100 boards as well as my AH board. The instability seems to increase in likelihood when I add additional MetaROUTERs...my initial 5 day record on my AH was accomplished when it was only running a single MetaROUTER (OpenWRT), and the crashes and reboots started to happen after adding a second MetaROUTER (RouterOS). At this point, it seems to happen every few hours (on average, I'd say about 8, but it can range from 2 to 20 or more). Also, I will add that on some boards I disabled watchdog, and I still saw boards reboot with "(cause 1)" being given as the reason in the logs (same "cause" that is generated by a watchdog reboot).

Inspired by my success with the MIPS boards, I tried similar techniques on the 1100 boards to no avail. I can't undervolt them any more than they already are because they ship with 12V power supplies and 12V is the lowest documented voltage that the 1100-series can accept (unlike most MIPS boards which can go down to 8V). Underclocking both the RAM and CPU as far down as they would go seemed to increase the frequency of reboots, as they would happen between 2 minutes and 2 hours! Keep in mind, though, that I was always generating both network and CPU load on these boxes. I have not yet tried to underclock them, boot up a couple of MetaROUTERs, and then just let them sit idle.

So in summary, at this point I would say that the PowerPC reboots do feel different than the 450G ones, especially given that 1) they can spontaneously reboot even when watchdog is disabled, 2) they can hard-crash when watchdog is enabled, 3) CPU underclocking does not help and in fact may make things worse, and 4) all boards so far can take hours to crash and/or reboot at stock settings with a moderate CPU and network load. It feels extremely random so far.

Additionally, I will add that although I know that PowerPC RouterOS also has its own version of voltage.ko, it might not work in the same way on this system. There is no "GPIO" IRQ that shows up on 1100-series boards. I will note, though, that it looks like the same Xilinx CPLD that is on the 450G -- and which I can tell you has some involvement in the health monitoring on that board -- is also present on both the 1100 and the 1100AH "Mark I", which is interesting. Of further interest, though, is the fact that this CPLD appears to no longer be present on the 1100AH "Mark II", and if I'm not mistaken, we have gotten reports of MetaROUTER-related instability on these boards, too. (I have none to test with, sadly.)

I still want to try to remove voltage.ko from an 1100 and see what effect that has, if any. It may take me a while to do on my own. If I had MikroTik's cooperation, it could happen significantly faster, and I could proceed with further testing rather than trying to solve the problem of finding a way to implement the same hack on this board. But I'll get it done one way or the other. :)
liteforce reports that there are issues with RB1000 and RB1100 too.
The RB1000 is interesting...it has no hardware health monitoring capability at all. This may further go to prove that the cause for MetaROUTER instability on the PowerPC boards is completely different from the MIPS boards, and is entirely unrelated (unless it is demonstrated that both are due to a generic interrupt handling routine happening at a higher level, and the health monitoring was just one vector). I have wanted to conduct tests on an RB1000 too, but alas, I don't have any that aren't in production to experiment with. (Actually, I have one, but it turns out it has problems of its own and is definitely defective.)

-- Nathan
 
User avatar
janisk
MikroTik Support
MikroTik Support
Posts: 6283
Joined: Tue Feb 14, 2006 9:46 am
Location: Riga, Latvia

Re: MetaROUTER stability issues on certain MIPSBE and PPC bo

Fri Jun 01, 2012 12:13 pm

How often do you see crashes on PPC RouterBOARDs? I have 2 RB1000s sitting around with MetaROUTER running on each of them (just one guest on each)

but:
uptime: 14w2d21h20m22s
uptime: 2w3d40m59s

Both last rebooted because of a RouterOS version change.

I am fetching a new RB1100AH model to run 8 MetaROUTERs on it.
 
User avatar
NathanA
Forum Veteran
Forum Veteran
Posts: 801
Joined: Tue Aug 03, 2004 9:01 am

Re: MetaROUTER stability issues on certain MIPSBE and PPC bo

Fri Jun 01, 2012 12:17 pm

How often do you see crashes on PPC RouterBOARDs?
As I mentioned, when I run my load-test suite on my pair of 1100s, it can happen between 2-20 hours, sometimes longer. But usually under 24 hours.

I suspect (although I have not yet proven) that the chance for a crash increases if...

1) You are running 2 or more MetaROUTERs
2) They are actively busy (whether CPU load or network activity is the trigger, I do not yet know)

janisk, if you are interested, I will publish instructions on how to reproduce my RB1100 "lab". :) It will be similar to the instructions I gave earlier when I was testing 450Gs, but updated to use 2 MetaROUTERs (1 custom OpenWRT and 1 RouterOS) as well as my more recent custom build of OpenWRT which now includes Asterisk 1.8 instead of 1.4.

-- Nathan
 
User avatar
janisk
MikroTik Support
MikroTik Support
Posts: 6283
Joined: Tue Feb 14, 2006 9:46 am
Location: Riga, Latvia

Re: MetaROUTER stability issues on certain MIPSBE and PPC bo

Fri Jun 01, 2012 12:30 pm

While your tests seem reasonable, I need a more controlled environment when I test and report problems, so I have to use RouterOS as the guest system; that way there is no suspicion that some other guest caused a crash. My plan is to create 8 guests and run bandwidth through them, generating CPU load via network load.
 
User avatar
NathanA
Forum Veteran
Forum Veteran
Posts: 801
Joined: Tue Aug 03, 2004 9:01 am

Re: MetaROUTER stability issues on certain MIPSBE and PPC bo

Fri Jun 01, 2012 12:48 pm

so I have to use RouterOS as the guest system; that way there is no suspicion that some other guest caused a crash
Guests should not be able to cause the host to crash. If they do, that is a RouterOS MetaROUTER bug.

If Windows crashes inside of VMware hypervisor, and VMware hypervisor reboots, is that Windows' fault? If Adobe Photoshop causes Mac OS X to kernel panic, is that Photoshop's fault?

-- Nathan
 
User avatar
janisk
MikroTik Support
MikroTik Support
Posts: 6283
Joined: Tue Feb 14, 2006 9:46 am
Location: Riga, Latvia

Re: MetaROUTER stability issues on certain MIPSBE and PPC bo

Fri Jun 01, 2012 12:58 pm

If there were a kernel panic, we would see it, but there isn't one. The board is being killed in some other way, then. If the log reports a power failure (cause 1), it could be due to the watchdog being unhappy about something. Anyway, I am waiting for the router to arrive; let's see what the test results bring up.
 
User avatar
NathanA
Forum Veteran
Forum Veteran
Posts: 801
Joined: Tue Aug 03, 2004 9:01 am

Re: MetaROUTER stability issues on certain MIPSBE and PPC bo

Fri Jun 01, 2012 1:02 pm

If there were a kernel panic, we would see it, but there isn't one. The board is being killed in some other way, then. If the log reports a power failure (cause 1), it could be due to the watchdog being unhappy about something. Anyway, I am waiting for the router to arrive; let's see what the test results bring up.
Fair enough. But if you can't make it reboot with just RouterOS guests, or if you do, and the devs fix that problem, but it still continues to reboot for me with watchdog disabled while using OpenWRT MetaROUTERs, I'm going to file another bug report, because that should not happen. ;)

Also, remember that it reboots for me with (cause 1) when watchdog is disabled.

-- Nathan
 
timberwolf
Member Candidate
Member Candidate
Topic Author
Posts: 274
Joined: Mon Apr 25, 2011 12:08 pm
Location: Germany

Re: MetaROUTER stability issues on certain MIPSBE and PPC bo

Fri Jun 01, 2012 7:01 pm

janisk
What about packages for the RB450G with hardware monitoring (voltage.ko) disabled? I would invest some time to test those on my RB450G boards; maybe barkas would also try, since AFAIK he still has a ticket open about this issue. You would then get some more feedback from outside your own test setup. I agree with you, however, that this might not be the root cause, although it looks that way in NathanA's setup with many switch chip IRQs/second.
 
ferywu
just joined
Posts: 22
Joined: Fri Feb 17, 2012 6:24 pm

Re: MetaROUTER stability issues on certain MIPSBE and PPC bo

Sun Jun 03, 2012 7:45 pm

Nathan,
did you mean we have to erase voltage.ko, located at
fil nx lib/modules/2.6.35/misc/voltage.ko 1337932414
or can we modify this rc script instead of deleting voltage.ko?
fil ex etc/rc.d/run.d/S08voltage 1337930734
I found these by dumping the npk file with the script from http://routing.explode.gr/node/96
If this script could also unpack and repack, there would be no need to boot an OpenWRT ramdisk to access the yaffs2 partition on the NAND in order to remove voltage.ko.
Is anyone interested in improving http://routing.explode.gr/sites/default ... cripts.zip ?

timberwolf,
I agree: as a fast workaround, the MT devs should provide an option to prevent voltage.ko from being loaded as a kernel module, if fixing the module to run properly with MetaROUTER is going to take some time.
 
User avatar
NathanA
Forum Veteran
Forum Veteran
Posts: 801
Joined: Tue Aug 03, 2004 9:01 am

Re: MetaROUTER stability issues on certain MIPSBE and PPC bo

Sun Jun 03, 2012 10:32 pm

ferywu,

I am removing the voltage.ko file, not the startup script. I tried removing the startup script that loads the module first (actually, I just took the execute permission bits off of it), but the module was still loaded by some other part of the system, so the script is apparently pointless. Versions of RouterOS prior to 5.x didn't even have that startup script, so I'm not sure what its purpose is since it appears that whatever auto-load mechanism they were using before is still present. Thus, the kernel module has to be completely removed.

I am not modifying NPK files before install; I am netbooting a kernel + ramdisk and then mounting the yaffs partition and modifying the filesystem directly.

-- Nathan
 
ferywu
just joined
Posts: 22
Joined: Fri Feb 17, 2012 6:24 pm

Re: MetaROUTER stability issues on certain MIPSBE and PPC bo

Sun Jun 03, 2012 11:07 pm

As for the npk script, we also found that someone has added support for unpacking:
--- dumpnpk.py 2008-02-17 19:02:28.000000000 +0700
+++ dumpnpk2.py 2012-06-04 02:54:05.000000000 +0700
@@ -48,6 +48,9 @@

import sys
import zlib
+import os
+import os.path
+import stat

from struct import pack, unpack
from time import ctime
@@ -135,3 +138,25 @@
if type == 129:
type = "fil"
print type, perm, k["file"], tim
+ filename=k["file"]+"_test"
+#now write the dirs and files
+#sometimes the files have a / in front of them and we can't have that so lets just strip it,
+#just keep in mind that some file paths are absolute and some are not
+ filename_len=len(filename)
+ filename_len-=1
+ filename_temp=filename[ :-filename_len]
+ if filename_temp=="/":
+ filename=filename[1: ]
+#create the dirs
+ dir = os.path.dirname(filename)
+ if dir:
+ print "dir = ",dir
+ try:
+ os.stat(dir)
+ except:
+ os.mkdir(dir)
+#create the files
+ FILE = open(filename,"w")
+#FILE write data?????
+ FILE.close()
+ print "length of data = ", len(k["data"])
At first I could unpack any npk except system*.npk.
After I changed the indentation, everything broke; now any modification only manages to create and unpack the var/pdb folder.
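
For anyone picking this up, here is a minimal Python 3 sketch of what the patch above appears to be attempting: strip a leading slash, create the directory tree, and write the payload. It assumes each parsed entry is a dict with "file" and "data" keys, as the diff suggests; the function name extract_entry and the out_dir argument are made up for illustration and are not part of dumpnpk.py. One relevant detail: os.mkdir() creates only a single directory level, while os.makedirs() also creates any missing parents.

```python
import os


def extract_entry(entry, out_dir):
    """Write one parsed entry under out_dir and return the path written.

    Assumes entry is a dict with a "file" path (str) and a "data" payload
    (bytes); this layout is inferred from the diff above, not confirmed.
    """
    # Some paths are absolute and some are not, so drop any leading "/".
    rel_path = entry["file"].lstrip("/")
    target = os.path.join(out_dir, rel_path)

    # makedirs creates every missing parent directory in one call.
    parent = os.path.dirname(target)
    if parent:
        os.makedirs(parent, exist_ok=True)

    # Binary mode, so the payload is written byte for byte.
    with open(target, "wb") as handle:
        handle.write(entry["data"])
    return target
```

This sketch does not guard against entries whose paths contain "..", so it should only be pointed at a scratch directory.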
 
timberwolf
Member Candidate
Member Candidate
Topic Author
Posts: 274
Joined: Mon Apr 25, 2011 12:08 pm
Location: Germany

Re: MetaROUTER stability issues on certain MIPSBE and PPC bo

Sun Jun 03, 2012 11:36 pm

ferywu
Thanks for your effort, but modifying npk files is outside the scope of this thread and not in MT's interest, I guess.
We want an officially supported MT solution, not an unsupported work-around.
 
User avatar
NathanA
Forum Veteran
Forum Veteran
Posts: 801
Joined: Tue Aug 03, 2004 9:01 am

Re: MetaROUTER stability issues on certain MIPSBE and PPC bo

Mon Jun 04, 2012 8:03 am

RB1100 update:

Over the weekend, I believe I finally spotted a pattern to the crashes. They still occur somewhat randomly, but I can now predict *which* of my 2 RB1100s will crash within a 24-hour period, based on what each one is doing.

Remember that with my current test setup, I have 2 RB1100s connected together, and each RB1100 is running 1 OpenWRT MetaROUTER with Asterisk, and 1 RouterOS MetaROUTER. Both OpenWRT instances are sending data to each other through the RouterOS instances. Asterisk running on top of OpenWRT is being instructed to open up ~50 simultaneous IVR calls to the other Asterisk, and they are both sending audio to each other constantly. Because they are configured identically to loop through the same set of audio files that they play back to the other side, both the total throughput as well as the PPS on send and receive are roughly symmetric during the test (about 4.5Mbit/s constantly sent and received simultaneously, at around 2500pps in any one direction). 50 simultaneous calls also easily puts the CPU at 100% the entire time, so this way I'm putting strain on both network load and CPU load.
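
As a rough sanity check on those figures, the numbers quoted above (about 50 calls, 4.5 Mbit/s and 2500 pps in each direction) imply a per-call packet rate and an average packet size. The sketch below only does that arithmetic; the function names are illustrative, and since it is not stated whether the 4.5 Mbit/s figure includes Ethernet or IP headers, the packet size is approximate.

```python
def per_call_pps(total_pps, calls):
    """Packets per second contributed by each call, in one direction."""
    return total_pps / calls


def avg_packet_bytes(bits_per_second, total_pps):
    """Average bytes per packet for a given throughput and packet rate."""
    return bits_per_second / 8 / total_pps


# Figures quoted in the post above, per direction.
CALLS = 50
TOTAL_PPS = 2500
THROUGHPUT_BPS = 4.5e6

print(per_call_pps(TOTAL_PPS, CALLS))               # 50.0 packets/s per call
print(avg_packet_bytes(THROUGHPUT_BPS, TOTAL_PPS))  # 225.0 bytes per packet
```

50 packets per second per call corresponds to one packet every 20 ms, which is a common RTP packetization interval, so the quoted totals are at least self-consistent.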

What will typically happen during a test is that one RB1100 or the other will reboot in the middle of the test. When that happens, I typically restart the test after I notice that it has stopped because one of them has rebooted. Well, over Friday night, one of the RB1100s rebooted while I was asleep, so I didn't get around to restarting the test until the next morning, several hours later. And what I noticed when I logged in is that the RB1100 that rebooted actually ended up rebooting 4 times overnight, not just once. The other RB1100 rebooted 0 times. (I added a script under the scheduler to my test setup that generates a file every time an RB1100 boots up; it waits long enough for the NTP client to set the clock so that the timestamp of the file reflects when it booted up, and as a result I not only know how many times it rebooted by the number of files generated, but what the exact interval was between reboots.)
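
The bookkeeping described in that parenthetical can be sketched offline. Assuming the per-boot marker files have been copied off the router and their timestamps parsed, the intervals between reboots fall out directly. The timestamps below are made-up examples, not results from these tests, and reboot_intervals is a hypothetical helper name.

```python
from datetime import datetime


def reboot_intervals(boot_times):
    """Return the gaps between consecutive boots, oldest first."""
    ordered = sorted(boot_times)
    return [later - earlier for earlier, later in zip(ordered, ordered[1:])]


# Hypothetical timestamps, for illustration only.
boots = [
    datetime(2012, 6, 2, 1, 10),
    datetime(2012, 6, 2, 3, 40),
    datetime(2012, 6, 2, 9, 5),
]

for gap in reboot_intervals(boots):
    print(gap)
```

The number of marker files gives the reboot count, and the list returned here gives the spacing between them.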

Another thing that would be helpful for you all to know is that when Asterisk restarts on one RB1100 because it has rebooted, in the default configuration, the other Asterisk does not realize that the first Asterisk has stopped. That's because audio for a call set up by SIP is typically RTP, which runs over UDP transport, and which in turn of course has no built-in reliability (retransmit or timeout) mechanisms. And by default, Asterisk doesn't have RTP timeout checking enabled. So the Asterisk running on the RB1100 that didn't reboot is oblivious to the fact that the other Asterisk is no longer listening, and that all 50 calls really are no longer valid. So it continues to send audio to the other IP address even though it is not getting any audio back from the other side.

This tells you a few things about the states of both RB1100s:

1. The CPU and *transmit* network load on the RB1100 that didn't reboot continue to remain the same after the other RB1100 reboots.
2. The *receive* network load on the RB1100 that didn't reboot is virtually at 0, because the other RB1100 isn't transmitting anymore.
3. The CPU and *transmit* network load on the RB1100 that DID reboot is lower: it's no longer transmitting because it rebooted and Asterisk restarted.
4. The *receive* network load on the RB1100 that DID reboot is still the same, because the other RB1100 didn't reboot and its Asterisk continues to send over the same amount of audio as before.

Also, you should know that the CPU continues to remain somewhat busy (~50% instead of 100%) on the RB1100 that did reboot, probably because the host and the RouterOS MetaROUTER are still having to process all of the packets coming from the other RB1100 that didn't reboot.

Based on this, I hypothesized that the problem cannot be due to CPU (because the one that is still at 100% CPU is not rebooting) and cannot be due to network transmits (because the one that is still sending 2500pps is not rebooting), so it must be rebooting due to *receiving* network traffic. Think about it: the one that rebooted first by chance continued to reboot multiple times even though I did not restart the test on it. It was just sitting by idly while the other end continued to pound it with 2500pps-worth of UDP traffic, but it rebooted 4 times while the other RB1100 that was sitting at 100% CPU for now 12 hours that was transmitting 2500pps never rebooted.

To test this theory, I ran 2 further tests.

The first test was that I made each RB1100 switch roles: I made Asterisk on the RB1100 that was rebooting begin to transmit again, and I stopped Asterisk on the "stable" RB1100 from transmitting any more. It took some time before the first reboot occurred (about 8 hours), but eventually the RB1100 that wasn't rebooting before started to reboot (5 times over about 20 hours), and the other RB1100 that was rebooting before stopped rebooting completely.

So far, so good. The final test was to make sure that this didn't have anything to do with the OpenWRT MetaROUTER. So I shut down the OpenWRT MetaROUTER and created a second RouterOS MetaROUTER. I forced the second RouterOS MR to send all traffic through the first RouterOS MR, the same way that OpenWRT was previously configured. Then I ran a one-way TCP bandwidth test to the second RouterOS MR, through the first RouterOS MR, from the other RB1100 (on the host, not one of its MetaROUTERs). Sure enough, the RB1100 that was transmitting to the other never rebooted, but the RB1100 that was receiving the traffic rebooted again. And again. And again. (Unfortunately, RouterOS bandwidth test, unlike my Asterisk test, will quit when one side reboots, so I had to keep manually restarting the bandwidth test every time the receiving RB1100 would reboot itself. These reboots were still spaced 2-3 hours apart.)

I would say that all of this data seems to corroborate my theory: if a MetaROUTER on an RB1100 is doing nothing but transmitting, it will not reboot. But if a MetaROUTER on an RB1100 is *receiving* data over the network, it will reboot. It may take a while and the timing still seems random (it can still easily last for a few hours), but there is a definite pattern there that is not related to CPU load or transmitted packets...just received packets.
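
The elimination argument can be written down compactly. The sketch below encodes the first two tests as rough per-board observations and asks which factor is high on every board that rebooted and low on every board that did not. The "high"/"low" labels are a simplification of the figures above (for example, ~50% CPU is recorded as "low" relative to 100%), and all names are illustrative.

```python
# One row per board per test, using simplified high/low labels.
OBSERVATIONS = [
    # Friday night: one board only receiving, the other still sending.
    {"cpu": "low", "transmit": "low", "receive": "high", "rebooted": True},
    {"cpu": "high", "transmit": "high", "receive": "low", "rebooted": False},
    # Roles swapped: same pattern with the boards exchanged.
    {"cpu": "low", "transmit": "low", "receive": "high", "rebooted": True},
    {"cpu": "high", "transmit": "high", "receive": "low", "rebooted": False},
]


def factors_matching_reboots(observations, factors=("cpu", "transmit", "receive")):
    """Return the factors that are high exactly on the boards that rebooted."""
    return [
        name
        for name in factors
        if all((row[name] == "high") == row["rebooted"] for row in observations)
    ]


print(factors_matching_reboots(OBSERVATIONS))  # ['receive']
```

Four observations are a small sample, so this is consistent with the receive-side theory rather than proof of it.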

Also, after I came to this conclusion, I realized that this theory also fits very nicely with liteforce's experience that his RB1100s are likely to reboot if he executes a large, looping download test with wget. Running wget means you are doing a download (receive), not an upload (transmit), which fits my experiences exactly. Of course, when you run either a RouterOS TCP bandwidth test or an HTTP download with wget, you are using TCP which will cause the transmitting side to also receive ACKs from the opposite side. Because the RB1100 that was doing the transmit on the TCP bandwidth test never rebooted, it would also seem that you increase your chances of a reboot as you pull more data (ACKs are pretty minimal, traffic-wise).

Because of this, I very much doubt removing voltage.ko from an RB1100 would make any difference. It would still be interesting to try, but at this point it really does seem as though the RB1100 problem is related to the network layer.

-- Nathan
 
ferywu
just joined
Posts: 22
Joined: Fri Feb 17, 2012 6:24 pm

Re: MetaROUTER stability issues on certain MIPSBE and PPC bo

Mon Jun 04, 2012 8:57 am

ferywu
Thanks for your effort, but modifying npk files is outside the scope of this thread and not in MT's interest, I guess.
We want an officially supported MT solution, not an unsupported work-around.
Never mind, I just thought of this as a temporary workaround, since Nathan says he has to be physically at the router to boot the OpenWRT ramdisk, mount the NAND, and then delete voltage.ko.
I hope that with that script, upgrading to the next MT version remotely would be okay too (with a modified npk, of course), if the MT devs still take a long time to fix this MetaROUTER issue.

Still, I am also waiting for an official fix from the MT devs.
 
ferywu
just joined
Posts: 22
Joined: Fri Feb 17, 2012 6:24 pm

Re: MetaROUTER stability issues on certain MIPSBE and PPC bo

Mon Jun 04, 2012 9:19 am

Nathan,

From your experiment, I understand that if the RB1100 receives packets, it reboots.
Maybe it's related to the virtual interface module or port flapping?

normis said in the other thread, dated May 25, 2012, http://forum.mikrotik.com/viewtopic.php ... 86#p318887
that the MetaROUTER issue on PPC is fixed with the latest BIOS version,
5.17 maybe? Or 5.18rc1?
Can you check this out?

thank you.
 
User avatar
normis
MikroTik Support
MikroTik Support
Posts: 24392
Joined: Fri May 28, 2004 11:04 am
Location: Riga, Latvia

Re: MetaROUTER stability issues on certain MIPSBE and PPC bo

Mon Jun 04, 2012 9:29 am

Nathan,

From your experiment, I understand that if the RB1100 receives packets, it reboots.
Maybe it's related to the virtual interface module or port flapping?

normis said in the other thread, dated May 25, 2012, http://forum.mikrotik.com/viewtopic.php ... 86#p318887
that the MetaROUTER issue on PPC is fixed with the latest BIOS version,
5.17 maybe? Or 5.18rc1?
Can you check this out?

thank you.
I actually was responding to something else in that thread, to this:
nathana: doesn't have to do with the NAND size. Apparently the problem is that in certain situations RouterOS on the 1100AH is accidentally configured to load the multi-CPU kernel meant for the 1100AHx2
No answer to your question? How to write posts
 
User avatar
janisk
MikroTik Support
MikroTik Support
Posts: 6283
Joined: Tue Feb 14, 2006 9:46 am
Location: Riga, Latvia

Re: MetaROUTER stability issues on certain MIPSBE and PPC bo

Mon Jun 04, 2012 9:30 am

btw you can look at
 /tool traffic-generator
http://wiki.mikrotik.com/wiki/Manual:To ... _Generator

You do not have to make a loop if you do not want to see the return traffic; just blast the other end with packets that you can tailor to whatever size/rate you want.

That way you do not have to restart /tool bandwidth-test when the other end crashes. Also, which ports are you using for that on the RB1100AH?
 
ferywu
just joined
Posts: 22
Joined: Fri Feb 17, 2012 6:24 pm

Re: MetaROUTER stability issues on certain MIPSBE and PPC bo

Mon Jun 04, 2012 9:47 am

I actually was responding to something else in that thread, to this:
nathana: doesn't have to do with the NAND size. Apparently the problem is that in certain situations RouterOS on the 1100AH is accidentally configured to load the multi-CPU kernel meant for the 1100AHx2
OK, sorry, my mistake.

let's wait again.
 
User avatar
NathanA
Forum Veteran
Forum Veteran
Posts: 801
Joined: Tue Aug 03, 2004 9:01 am

Re: MetaROUTER stability issues on certain MIPSBE and PPC bo

Mon Jun 04, 2012 9:57 am

btw you can look at
 /tool traffic-generator
Neat! Thanks for telling me about this! I did not know it existed. I will play with it.
Also, which ports are you using for that on the RB1100AH?
Oh, sorry, I forgot to mention that. I am only using ether1 on both units. So, yes, it is going through the first switch chip. What ports are connected directly to the SoC? ether13? Perhaps I should try running the same test again, but using ether13.

Also, janisk, I am curious if you have had time to test yourself. You said you were getting some devices and were preparing to test them? Have you seen any reboots yet on your end?

-- Nathan

EDIT: Sorry, I also should mention for the sake of clarification that for my tests, I am using RB1100, not AH. I do have a single 1100AH "Mark I" but I wanted to try to get consistent results with the other boards first, rather than "mix and match". I will test with the AH later as well.
Last edited by NathanA on Mon Jun 04, 2012 10:31 am, edited 1 time in total.
 
timberwolf
Member Candidate
Member Candidate
Topic Author
Posts: 274
Joined: Mon Apr 25, 2011 12:08 pm
Location: Germany

Re: MetaROUTER stability issues on certain MIPSBE and PPC bo

Mon Jun 04, 2012 10:20 am

On PPC it somehow sounds like some receive DMA function goes crazy, maybe some setup error or memory overflow/allocation error?

janisk
Still no updates for MIPSBE/RB450G? How hard can it be to at least build a testing release for us to try, so we can provide feedback to you?
By now two forum members have already managed to do this, so it can't be that hard for your devs themselves.
 
peson
Trainer
Trainer
Posts: 183
Joined: Tue Jul 20, 2004 10:33 am
Location: Sweden

Re: MetaROUTER stability issues on certain MIPSBE and PPC bo

Mon Jun 04, 2012 2:36 pm

Oh, sorry, I forgot to mention that. I am only using ether1 on both units. So, yes, it is going through the first switch chip. What ports are connected directly to the SoC? ether13? Perhaps I should try running the same test again, but using ether13.
Nathan:
Look in "/int eth swi po print" and you will see which switch the ports are connected to.
In my 1100AH rev 1 I have the following config:

> int eth swi pr
Flags: I - invalid
 #  NAME     TYPE          MIRROR-SOURCE  MIRROR-TARGET  SWITCH-ALL-PORTS
 0  switch2  Atheros-8316  none           none
 1  switch1  Atheros-8316  none           none
> int ethernet swi po pr
Flags: I - invalid
 #  NAME         SWITCH   VLAN-MODE  VLAN-HEADER
 0  ether6       switch2  fallback   leave-as-is
 1  ether7       switch2  fallback   leave-as-is
 2  ether8       switch2  fallback   leave-as-is
 3  ether9       switch2  fallback   leave-as-is
 4  ether10      switch2  fallback   leave-as-is
 5  ether1       switch1  fallback   leave-as-is
 6  ether2       switch1  fallback   leave-as-is
 7  ether3       switch1  fallback   leave-as-is
 8  ether4       switch1  fallback   leave-as-is
 9  ether5       switch1  fallback   leave-as-is
10  switch1_cpu  switch1  fallback   leave-as-is
11  switch2_cpu  switch2  fallback   leave-as-is

So ether11-13 are not in any switch configuration.
It would be interesting to see what happens when the switch group is really configured as a switch with a master port.
I'm sorry that I cannot help with testing right now; my time is limited :-(. I still think there is a driver or hardware problem with the Atheros switch chip. :-)

Janis:
If I look in the ethernet interface table, it says that ether2 and ether3 are in slave mode.
/int ether print
 9  S ether2     1500 00:0C:42:99:17:7C enabled    none    switch1    
10  S ether3     1500 00:0C:42:99:17:7D enabled    none    switch1    
But the export says no master port:
/int ether export
set 9 arp=enabled auto-negotiation=yes bandwidth=unlimited/unlimited \
    disabled=no full-duplex=yes l2mtu=1598 mac-address=00:0C:42:99:17:7C \
    master-port=none mtu=1500 name=ether2 speed=100Mbps
set 10 arp=enabled auto-negotiation=yes bandwidth=unlimited/unlimited \
    disabled=no full-duplex=yes l2mtu=1598 mac-address=00:0C:42:99:17:7D \
    master-port=none mtu=1500 name=ether3 speed=100Mbps
If I try to change it with:
/int ether set 9 master-port=ether1
It says:
already enslaved
No, I haven't reset the configuration, because it is in a production env.
It runs two MR RouterOS, it's the one that I already have a ticket about.
/Paul
Reboot is the last resort, try to find out what's wrong instead.
 
User avatar
NathanA
Forum Veteran
Forum Veteran
Posts: 801
Joined: Tue Aug 03, 2004 9:01 am

Re: MetaROUTER stability issues on certain MIPSBE and PPC bo

Mon Jun 04, 2012 2:42 pm

I still think there is a driver or hardware problem with the Atheros switch chip. :-)
Not sure what you're basing that on. The 8316 that is on the 1100 and 1100AH "Mark I" is the same switch chip that is on the 450G, and my 450Gs are still not rebooting after my "fix". Also, the switch chip on the 1100AH "Mark II" was changed to the 8327. The MIPS problem and the PPC problem are quite clearly different at this point. (Though maybe there is a problem just with Atheros switch support on PPC? I suppose it is possible.)

I will continue to run tests, including using a port that is not attached to a switch chip.

-- Nathan
 
User avatar
janisk
MikroTik Support
MikroTik Support
Posts: 6283
Joined: Tue Feb 14, 2006 9:46 am
Location: Riga, Latvia

Re: MetaROUTER stability issues on certain MIPSBE and PPC bo

Mon Jun 04, 2012 2:54 pm

It is a bit off-topic, but the ports could be part of a bridge interface.

And NathanA, I got the router this morning and am currently testing with 10K pps of 128-byte packets. Tomorrow I plan to add extra guests; currently I am running only 2, one receiving the stream and the other idling.
 
User avatar
NathanA
Forum Veteran
Forum Veteran
Posts: 801
Joined: Tue Aug 03, 2004 9:01 am

Re: MetaROUTER stability issues on certain MIPSBE and PPC bo

Fri Jun 08, 2012 1:50 pm

I haven't had much time to play with RB1100s this week, but I did make some changes, and have been getting some confusing results. I got rid of my OpenWRT guests and decided to try with just RouterOS guests on both in order to try to come up with a scenario that predictably crashes in a way that janisk might be able to reproduce. I also switched over to using '/tool traffic-generator'.

Unfortunately, unlike when I'm using a mix of OpenWRT and RouterOS, when I have 2 RouterOS guests running on both RB1100s, I'm getting reboots on both RB1100s, not just the one receiving.

In addition, I have also seen...

- RouterOS MetaROUTERs reboot with "(cause 1)"! (In other words, the RB1100 didn't reboot...only the MetaROUTER itself did!)
- Traffic-generator traffic from a MetaROUTER suddenly stopping, and then re-starting.
- RB1100s continuing to reboot with "(cause 1)" even though the watchdog is disabled

I'll continue to run various tests to see if I can come to a conclusion on a pattern; I still think there is something to my network transmission observation from before since the results of my OpenWRT + RouterOS tests were too predictable to be coincidental.

janisk, how have your tests been going? Have you had any "(cause 1)" reboots occur on your boards yet?

-- Nathan
 
User avatar
janisk
MikroTik Support
MikroTik Support
Posts: 6283
Joined: Tue Feb 14, 2006 9:46 am
Location: Riga, Latvia

Re: MetaROUTER stability issues on certain MIPSBE and PPC bo

Fri Jun 08, 2012 2:28 pm

tests are running fine and i have similar results to what you are having.

I used traffic-generator to send 10k pps to the guest. It did not matter how many guests I had, but this was the config:
/metarouter add memory-size=32 name=tst-guest
/interface bridge add
/interface bridge port add bridge=bridge1 interface=ether11
/metarouter interface add type=dynamic dynamic-bridge=bridge1 virtual-machine=tst-guest
Configuration on the guest: IP address and default route.

Sometimes the guest crashes, sometimes the whole host.

Results have already been delivered to the devs.
 
User avatar
NathanA
Forum Veteran
Forum Veteran
Posts: 801
Joined: Tue Aug 03, 2004 9:01 am

Re: MetaROUTER stability issues on certain MIPSBE and PPC bo

Fri Jun 08, 2012 2:35 pm

tests are running fine and i have similar results to what you are having [...] results are already delivered to devs.
Alright, good to know I and my other friends here are not crazy and that it isn't just our boards doing this. :) Thanks.

-- Nathan
 
timberwolf
Member Candidate
Member Candidate
Topic Author
Posts: 274
Joined: Mon Apr 25, 2011 12:08 pm
Location: Germany

Re: MetaROUTER stability issues on certain MIPSBE and PPC bo

Fri Jun 08, 2012 7:03 pm

janisk
Two questions: should we include the RB2011 in this thread, and what's the status for RB450G/MIPSBE?
 
User avatar
janisk
MikroTik Support
MikroTik Support
Posts: 6283
Joined: Tue Feb 14, 2006 9:46 am
Location: Riga, Latvia

Re: MetaROUTER stability issues on certain MIPSBE and PPC bo

Mon Jun 11, 2012 12:05 pm

Currently the RB450G status is unchanged - there is an unresolved problem.

RB2011 can be added here if you wish. I set up the RB2011 similarly to the RB1100AH; without any packet load it did not reboot or crash, whereas on the RB450G you could see manifestations of the problem even without interfaces added. (Uptime of the RB2011 and its running guest was over 2 weeks.)
 
timberwolf
Member Candidate
Member Candidate
Topic Author
Posts: 274
Joined: Mon Apr 25, 2011 12:08 pm
Location: Germany

Re: MetaROUTER stability issues on certain MIPSBE and PPC bo

Mon Jun 11, 2012 7:52 pm

Currently the RB450G status is unchanged - there is an unresolved problem.
Ok, so you don't think it would be worth gathering some results from the field, with disabled hardware monitoring?
RB2011 can be added here if you wish. I set up the RB2011 similarly to the RB1100AH; without any packet load it did not reboot or crash, whereas on the RB450G you could see manifestations of the problem even without interfaces added. (Uptime of the RB2011 and its running guest was over 2 weeks.)
I will try to keep an eye on the RB2011 thread, and include the information here should there be more reports.
 
User avatar
janisk
MikroTik Support
MikroTik Support
Posts: 6283
Joined: Tue Feb 14, 2006 9:46 am
Location: Riga, Latvia

Re: MetaROUTER stability issues on certain MIPSBE and PPC bo

Tue Jun 12, 2012 8:46 am

Ok, so you don't think it would be worth gathering some results from the field, with disabled hardware monitoring?
I have a router that started to crash just before NathanA's discoveries; it is running a MetaROUTER guest (maybe even some hours longer than NathanA's router) and the developers have access to it. The router has not crashed since the changes were made.
I will try to keep an eye on the RB2011 thread, and include the information here should there be more reports.
sounds good.
 
guy1with1mr1problems
newbie
Posts: 43
Joined: Wed Jun 13, 2012 12:34 pm

Re: MetaROUTER stability issues on certain MIPSBE and PPC bo

Wed Jun 13, 2012 1:12 pm

We are also having metarouter problems. On RB800 (PPC), it reliably crashes when transmitting too much text (like dmesg or opkg list) over the metarouter virtual console (either winbox or over ssh in mikrotik). Otherwise, it was stable. On RB411AH (MIPS) though, it crashes about 20 seconds after connecting to it and doing anything. Also, the virtual network interfaces never come up automatically. Most of the time, the crashes are hardware (the router crashes along with the vm and reboots). But sometimes just the VM becomes unreachable over both ssh and virtual console and just doesn't respond until disabled and enabled.

We used your default OpenWRT images - by the way, the package links are wrong in both, so opkg is unusable. So now we are using images from http://openwrt.wk.cz . With both the default images and the second set, there are exactly the same problems.

Both boards were using decent power supplies over ethernet.
 
User avatar
janisk
MikroTik Support
MikroTik Support
Posts: 6283
Joined: Tue Feb 14, 2006 9:46 am
Location: Riga, Latvia

Re: MetaROUTER stability issues on certain MIPSBE and PPC bo

Wed Jun 13, 2012 1:33 pm

on mips board you can try the NathanA trick, problem on ppc boards is under investigation
 
guy1with1mr1problems
newbie
Posts: 43
Joined: Wed Jun 13, 2012 12:34 pm

Re: MetaROUTER stability issues on certain MIPSBE and PPC bo

Thu Jun 14, 2012 11:28 am

on mips board you can try the NathanA trick, problem on ppc boards is under investigation
Thank you !

For future readers: that means mounting the yaffs2 filesystem of router os (I used openwrt running from RAM booted over network.) and renaming the 'voltage.ko' module.
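A minimal sketch of the rename step, assuming the RouterOS filesystem is already mounted somewhere from the netbooted OpenWRT. The function name, the mount point in the example and the ".disabled" suffix are made up for illustration; the module's real location is not given in this thread, so the sketch searches for it instead of hard-coding a path.

```shell
#!/bin/sh
# Sketch only: rename every voltage.ko found below an already mounted
# RouterOS root so the module can no longer be loaded under that name.
# The ".disabled" suffix is an arbitrary choice, not an official convention.

disable_voltage_module() {
    root="$1"
    # Search for the module rather than guessing its directory.
    find "$root" -type f -name 'voltage.ko' | while IFS= read -r mod; do
        mv "$mod" "$mod.disabled" && echo "renamed: $mod"
    done
}

# Example (hypothetical mount point):
#   disable_voltage_module /mnt/routeros
```

Renaming instead of deleting keeps the change reversible: moving the file back to its old name should restore the original state.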

This worked like a charm for some reason ! It is now running for about a day and requesting phpinfo() from an embedded apache + php(cgi) server didn't drop a single request.

By the way we had another observation before I tried the fix: underclocking the RB411AH CPU to 300MHz helped a lot - from about 2 minutes uptime it got to about 30.

So the only unresolved issue that remains (that we don't mind that much now that we know) is that on the PPC board, metarouter crashes every time too much text is transmitted over metarouter console (we just use ssh all the time now anyway - we use the metarouter console only to setup initial networking now.)
 
User avatar
janisk
MikroTik Support
MikroTik Support
Posts: 6283
Joined: Tue Feb 14, 2006 9:46 am
Location: Riga, Latvia

Re: MetaROUTER stability issues on certain MIPSBE and PPC bo

Thu Jun 14, 2012 2:42 pm

about PPC - using ssh it does not reboot? only when console output? If you are using ssh how long do your guest OSes stay up?
 
timberwolf
Member Candidate
Member Candidate
Topic Author
Posts: 274
Joined: Mon Apr 25, 2011 12:08 pm
Location: Germany

Re: MetaROUTER stability issues on certain MIPSBE and PPC bo

Thu Jun 14, 2012 3:34 pm

For future readers: that means mounting the yaffs2 filesystem of router os (I used openwrt running from RAM booted over network.) and renaming the 'voltage.ko' module.

This worked like a charm for some reason ! It is now running for about a day and requesting phpinfo() from an embedded apache + php(cgi) server didn't drop a single request.
I am still hoping for a official MT version. :-(
 
guy1with1mr1problems
newbie
Posts: 43
Joined: Wed Jun 13, 2012 12:34 pm

Re: MetaROUTER stability issues on certain MIPSBE and PPC bo

Fri Jun 15, 2012 1:03 pm

about PPC - using ssh it does not reboot? only when console output? If you are using ssh how long do your guest OSes stay up?
Yes exactly. I think our record was about two days? But we didn't try any longer, it didn't crash.
On the other hand - EVERY time we did dmesg or opkg list (after fetching package list with opkg update) on the metarouter console, it crashed immediately. The same commands ran just fine via ssh, even connecting to mt first and then using /tool ssh to ssh into the openwrt vm.
Last edited by guy1with1mr1problems on Fri Jun 15, 2012 1:07 pm, edited 1 time in total.
 
guy1with1mr1problems
newbie
Posts: 43
Joined: Wed Jun 13, 2012 12:34 pm

Re: MetaROUTER stability issues on certain MIPSBE and PPC bo

Fri Jun 15, 2012 1:05 pm

about PPC - using ssh it does not reboot? only when console output? If you are using ssh how long do your guest OSes stay up?
Yes exactly. I think our record was about two days? But we didn't try any longer, it didn't crash.
On the other hand - EVERY time we did dmesg or opkg list (after fetching package list with opkg update) on the metarouter console, it crashed. The same commands ran just fine via ssh, even connecting to mt first and then using /tool ssh to ssh into the openwrt vm.
Yeah also I forgot to mention we tried on two of those RB800 boards - one very old hw rev with some power stability issues and one brand new. On both it behaved exactly the same.
 
spite
just joined
Posts: 7
Joined: Sat Jun 16, 2012 5:23 am

Re: MetaROUTER stability issues on certain MIPSBE and PPC bo

Sat Jun 16, 2012 5:35 am

I have been lurking in these kinds of threads for some time after initially purchasing RB1100AH's to complete firewall virtualisation for our customers. Sadly, we've had no end of 'reboots' due to MetaRouter, all at random intervals etc. Given the work Nathan has described I thought I would shed some light on what I have experienced.

Before I installed MetaRouters on this Rb1100 it ran for 120 days without a crash, was forwarding between 10 and 50mbit/s all day every day.

MR Environment Details:

RB1100AH, using standard untagged interface (ether13) for upstream connection and then local bridging from the WAN ports on the MetaRouters to the host.
The LAN ports for the MetaRouters are typically a VLAN Interface on the host RB1100 on Port 12 facing our switching network.

Guests
- some are standard Firewalls with just MASQ and a few inbound NAT.
- some have been more advanced for testing with BGP etc
- some have had VPN servers on them for terminating other L2TP tunnels from sites

No matter which guests I was running, there were random reboots. Sometimes it was 3 days sometimes it was 2 weeks.

I have tried every RouterOS from 5.4 to Current.

I have logged numerous Support faults which all ask me to basically reconfigure the machine from scratch. If I don't reconfigure and just disable all the MetaRouters it stops crashing.


I had a customer ask me to deploy a temporary firewall for them while they replaced their original over a few weeks. The customer's setup was a standard firewall with 2 inbound NAT. They had a constant BitTorrent flow of about 4mbit/s INBOUND to the MetaRouter from the internet. This caused the host RB1100 to reboot almost every 15 - 60 minutes. I had to redeploy them on a x86 Virtualised host to get around the immediate reboot issue.

Given the instability the box has 2 MetaRouters left on it that I intend to migrate to x86 because it still reboots randomly between 1 and 14 days.

If this problem could be solved I would be really happy, and hopefully this post can at least lend some weight to the argument that higher traffic loads increase the frequency of reboots.
 
reverged
Member Candidate
Member Candidate
Posts: 270
Joined: Thu Nov 12, 2009 8:30 am

Re: MetaROUTER stability issues on certain MIPSBE and PPC bo

Sun Jun 17, 2012 3:12 am

For future readers: that means mounting the yaffs2 filesystem of router os (I used openwrt running from RAM booted over network.) and renaming the 'voltage.ko' module.

This worked like a charm for some reason ! It is now running for about a day and requesting phpinfo() from an embedded apache + php(cgi) server didn't drop a single request.
I am still hoping for a official MT version. :-(
I tried the method NathanA used and I can only see /lost+found when I mount rootfs. No other files.....?

Am I missing something? I tried several different elf files and they all do the same thing.
 
User avatar
NathanA
Forum Veteran
Forum Veteran
Posts: 801
Joined: Tue Aug 03, 2004 9:01 am

Re: MetaROUTER stability issues on certain MIPSBE and PPC bo

Tue Jun 19, 2012 5:40 am

I tried the method NathanA used and I can only see /lost+found when I mount rootfs. No other files.....?
There is a trick to this. The NAND "partition table" doesn't exist on the NAND itself, as it would on a traditional hard drive. Instead, it is hard-coded into the kernel. I thought this was strange when I first discovered this, but okay... In any case, some versions of OpenWRT use a different version of the partition table layout in their kernel than what MikroTik uses in RouterOS...they increased the size of the boot partition so that it can hold a larger kernel image, which means that the offset of the data partition is off compared to MikroTik's. So you need to make sure that you netboot a kernel that has a matching NAND partition table to what MikroTik uses.

Another thing I discovered is that if certain versions of YAFFS (maybe all versions?) don't recognize the filesystem structure in the block device you try to mount with it, it will automatically initialize the partition at mount time. What this means is that if you use a kernel with the wrong NAND partition layout, attempting to mount a partition will in essence cause it to be re-formatted (without prompting you...basically an involuntary mkfs), and you will lose everything on it. I did this the first time, after which I also only saw 'lost+found'. At first, I figured what you did...that it was just a problem with the image I was using; I didn't think it had actually made any changes. But when I tried to boot RouterOS again, it was no longer there, and I had to reinstall with Netinstall. Fortunately, the license key was not lost...perhaps it is stored in the boot partition somewhere.

So if you only saw 'lost+found' and then tried to netboot a few other images without trying to boot RouterOS again, it may be that some of the images you tried to boot with later did in fact have a "correct" partition table in their kernel, but the first image you used had already formatted the RouterOS partition on the NAND for you, and you just don't know it yet because you haven't tried booting RouterOS since then.

Re-Netinstall RouterOS on your device if necessary, and then try booting with this image:

http://www.nconx.com/~nathan/openwrt-ar ... tramfs.elf

It's pretty much an untouched copy of OpenWRT "Backfire" 10.03.1 built for MIPS, and it seems to work on RB450G and uses the original MikroTik partitioning layout "out-of-the-box". I think it was after Backfire's release that OpenWRT increased the size of the boot partition from 4MB to 6MB.
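Since YAFFS may silently re-initialize a partition it does not recognize, it seems worth looking at the partition layout of the netbooted kernel before mounting anything. The following is a minimal sketch that prints the layout from /proc/mtd; the function name is made up, and which names and sizes count as "matching" RouterOS is something to verify yourself (the only hint in this thread is the 4MB versus 6MB boot partition mentioned above).

```shell
#!/bin/sh
# Sketch only: list MTD partitions with their sizes in KiB.
# Reads a file in /proc/mtd format; defaults to /proc/mtd itself.

mtd_sizes() {
    src="${1:-/proc/mtd}"
    # The first line is a header; the size column is hexadecimal bytes.
    tail -n +2 "$src" | while read -r dev size erasesize name; do
        echo "${dev%:} $name $(( 0x$size / 1024 )) KiB"
    done
}

# Example: run this first, and only mount if the layout looks right.
#   mtd_sizes
```

If the sizes do not look like what you expect, do not mount; booting a different image is cheaper than a Netinstall.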

-- Nathan
 
reverged
Member Candidate
Member Candidate
Posts: 270
Joined: Thu Nov 12, 2009 8:30 am

Re: MetaROUTER stability issues on certain MIPSBE and PPC bo

Tue Jun 19, 2012 7:39 am

AHH! That makes sense.

I did have to netinstall before I shipped it out. Now I have to find another 450G.....

Thanks.
 
User avatar
janisk
MikroTik Support
MikroTik Support
Posts: 6283
Joined: Tue Feb 14, 2006 9:46 am
Location: Riga, Latvia

Re: MetaROUTER stability issues on certain MIPSBE and PPC bo

Tue Jun 19, 2012 9:12 am

hopefully soon enough (obviously not within days) you will not need to do that any more to run MetaROUTER on RB450G
 
guy1with1mr1problems
newbie
Posts: 43
Joined: Wed Jun 13, 2012 12:34 pm

Re: MetaROUTER stability issues on certain MIPSBE and PPC bo

Tue Jun 19, 2012 1:52 pm

Well after testing a PHP app we developed for apache running on Metarouter (rb411ah), my experience is that it is extremely unstable. Sometimes for no apparent reason php crashes mid-script and EVERY program executed on the vm from that point on segfaults until vm reboot. Can't post dmesg because dmesg itself segfaults, lol. Tried running from flash and also from ram - same behavior. Pretty sad. Unfortunately this is a grey area and it is uncertain if this is metarouter's fault or openwrt. If the vm kernel didn't need any patches, I'd try a normal distribution and report back.

On PPC - we didn't have this problem, strangely enough. Everything we threw at it just worked. Only difference is there was lighttpd with fastcgi available for ppc so we used that.

Could I humbly request that you add support for some well-known and supported MIPS/PPC serial port controller? (like 8250 would be on x86 for example). You could maybe use some code qemu already has - they emulate several devices.

Or, at the very least, could you please make sure your default OpenWRT images that you recommend in your wiki are actually working? As in: can install packages and it doesn't segfault-working.
 
User avatar
janisk
MikroTik Support
MikroTik Support
Posts: 6283
Joined: Tue Feb 14, 2006 9:46 am
Location: Riga, Latvia

Re: MetaROUTER stability issues on certain MIPSBE and PPC bo

Tue Jun 19, 2012 2:38 pm

if possible, check memory usage of the guest. Try to increase the RAM available for the guest as i know that apache uses a lot of ram if compared to lighttpd
 
guy1with1mr1problems
newbie
Posts: 43
Joined: Wed Jun 13, 2012 12:34 pm

Re: MetaROUTER stability issues on certain MIPSBE and PPC bo

Tue Jun 19, 2012 2:44 pm

if possible, check memory usage of the guest. Try to increase the RAM available for the guest as i know that apache uses a lot of ram if compared to lighttpd
1) there is lots of free RAM there
2) OOM killer should kill the offending process and NOT cause the entire system to start segfaulting like crazy.
 
User avatar
NathanA
Forum Veteran
Forum Veteran
Posts: 801
Joined: Tue Aug 03, 2004 9:01 am

Re: MetaROUTER stability issues on certain MIPSBE and PPC bo

Tue Jun 19, 2012 2:52 pm

guy1with1mr1problems,

I believe the OpenWRT images that MikroTik built were intended for demonstration purposes only, and not for actual production use. They certainly did not build all of the packages and create a package repository, and at this point their images are quite out-of-date. They distributed their patches in hopes that their customers would build their own images for themselves or to share with others. I would recommend you use more recent images prepared by fellow forum members such as liquidcz...details can be found in the "Metarouter images" sticky thread at the top of this forum. He has a package repository and the whole works.

janisk's suggestion to increase available memory to the MR makes sense: I have experienced similar weird crashes after what seemed like an OOM (out-of-memory) condition within the MetaROUTER instance itself, and I don't believe it's MetaROUTER's fault. One thing I discovered, for example, is that when copying large files to/from the MR over the network via something like scp, the kernel VM cache would grow to an obscene size and sometimes fill up the available free memory completely. The cache is supposed to "make way" for other applications that need RAM if memory is full, but I have seen it fail to do so. Perhaps it is a kernel bug; after all, 2.6.31 is a bit long-in-the-tooth these days. I would try an "echo 3 > /proc/sys/vm/drop_caches" the next time you see this happen within your OpenWRT MR. "echo" should be built into your shell, which is already running, so I doubt you'll see a segfault when you attempt to do that.
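In the same spirit as using the built-in "echo", free and cached memory can be read with shell built-ins only, so nothing new has to be exec'd while external programs are segfaulting. This is a sketch under that assumption: the function name is made up, and it relies on "read" and "echo" being built into the guest's shell.

```shell
#!/bin/sh
# Sketch only: report MemFree and Cached using shell built-ins only.
# Reads a file in /proc/meminfo format; defaults to /proc/meminfo itself.

mem_summary() {
    src="${1:-/proc/meminfo}"
    free="?"
    cached="?"
    while read -r key value unit; do
        case "$key" in
            MemFree:) free="$value" ;;
            Cached:)  cached="$value" ;;   # does not match "SwapCached:"
        esac
    done < "$src"
    echo "free=${free} kB cached=${cached} kB"
}

# Example: compare the numbers before and after dropping the caches.
#   mem_summary
#   echo 3 > /proc/sys/vm/drop_caches
#   mem_summary
```

A large "cached" figure next to a tiny "free" figure would be consistent with the cache-growth behaviour described above.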

-- Nathan

EDIT: You posted your response just as I was about to post mine. You say there is lots of free RAM available. You mean that you can verify that there is lots of free RAM at the time you see the segfaults occurring? How can you know that for certain when any attempt to check it (e.g., "top") would cause a segfault according to you?
 
guy1with1mr1problems
newbie
Posts: 43
Joined: Wed Jun 13, 2012 12:34 pm

Re: MetaROUTER stability issues on certain MIPSBE and PPC bo

Tue Jun 19, 2012 2:57 pm

guy1with1mr1problems,
...
I have experienced similar weird crashes after what seemed like an OOM (out-of-memory) condition within the MetaROUTER instance itself
...
Actually that makes perfect sense ! I gave the VM so much memory that there wasn't too much left over for Mikrotik. I'll try reducing VM memory and report back. Thanks !
 
guy1with1mr1problems
newbie
Posts: 43
Joined: Wed Jun 13, 2012 12:34 pm

Re: MetaROUTER stability issues on certain MIPSBE and PPC bo

Tue Jun 19, 2012 3:10 pm

I can see right away that indeed it helped. If Mikrotik doesn't have some free ram in reserve for itself, it causes very strange behavior in metarouter VMs - disappearing files, segfaulting programs, etc. I now stressed out the VM to severe OOM conditions (many php processes running silex and using lots of ram) and it behaved like a normal linux system would. Perfect !
 
timberwolf
Member Candidate
Member Candidate
Topic Author
Posts: 274
Joined: Mon Apr 25, 2011 12:08 pm
Location: Germany

Re: MetaROUTER stability issues on certain MIPSBE and PPC bo

Fri Jun 22, 2012 8:27 am

hopefully soon enough (obviously not within days) you will not need to do that any more to run MetaROUTER on RB450G
Any update or rough estimate you could give us?
 
timberwolf
Member Candidate
Member Candidate
Topic Author
Posts: 274
Joined: Mon Apr 25, 2011 12:08 pm
Location: Germany

Re: MetaROUTER stability issues on certain MIPSBE and PPC bo

Tue Jun 26, 2012 8:38 am

janisk
Any news? Are any fixes/changes implemented in ROS 5.18? The changelog doesn't mention MR related fixes, so I didn't try on my own.
 
User avatar
janisk
MikroTik Support
MikroTik Support
Posts: 6283
Joined: Tue Feb 14, 2006 9:46 am
Location: Riga, Latvia

Re: MetaROUTER stability issues on certain MIPSBE and PPC bo

Wed Jun 27, 2012 9:14 am

no changes have been made public yet.
 
guy1with1mr1problems
newbie
Posts: 43
Joined: Wed Jun 13, 2012 12:34 pm

Re: MetaROUTER stability issues on certain MIPSBE and PPC bo

Fri Jun 29, 2012 2:26 pm

Metafs autodetection for mount is broken. When I try to mount a loop file, kernel detects the filesystem incorrectly as metafs (actually is squashfs) and mounts the metafs that is already mounted as root read/write without slightest hesitation.

I think metafs should only be considered to be autodetected if mounted from 'none' and it shouldn't allow itself to be mounted rw more than once.
I think a good source for inspiration for the autodetection code would be probably tmpfs - should be similar in principle.

Also: when umounting such second mounted metafs, the kernel reliably segfaults every time.
When I bind the loop file via losetup and then mount -t squashfs, it works perfectly.
 
User avatar
janisk
MikroTik Support
MikroTik Support
Posts: 6283
Joined: Tue Feb 14, 2006 9:46 am
Location: Riga, Latvia

Re: MetaROUTER stability issues on certain MIPSBE and PPC bo

Fri Jun 29, 2012 3:04 pm

what and where exactly are you mounting that causes you these issues on MetaROUTER?
 
guy1with1mr1problems
newbie
Posts: 43
Joined: Wed Jun 13, 2012 12:34 pm

Re: MetaROUTER stability issues on certain MIPSBE and PPC bo

Fri Jun 29, 2012 3:49 pm

what and where exactly are you mounting that causes you these issues on MetaROUTER?
The umount problem is easy to replicate, just do in openwrt:
# mkdir /tmp/test
# mount -t metafs none /tmp/test

-- this mounted the metafs that is already mounted rw, that shouldn't be allowed to happen.

And then:
# umount /tmp/test
segfault

The squashfs is not so easy, you have to compile your own kernel with squashfs support for the guest vm.

Then:
mount -o loop /some/squashfs.loopfile /tmp/somewhere

this mounts metafs instead of squashfs from /dev/loop0 - causing the root metafs filesystem to be mounted again, rw on /tmp/somewhere.

The problem is only with the autodetection, because manually doing losetup and then:
mount -t squashfs /dev/loop0 /tmp/somewhere
mounts the squashfs correctly.

Summary:
- Metafs shouldn't be mountable readwrite more than once
- Kernel filesystem autodetection should never return metafs as detected filesystem
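To confirm which filesystem really ended up on a mount point after "mount -o loop", the type can be looked up in /proc/mounts. A minimal sketch, with a made-up function name; how metafs shows up in the mounts table of a given guest is an assumption to check on your own system.

```shell
#!/bin/sh
# Sketch only: print the filesystem type currently mounted on a path.
# $1: mount point, $2: mounts table (defaults to /proc/mounts).

fstype_of() {
    target="$1"
    src="${2:-/proc/mounts}"
    result=""
    # Fields per line: device, mount point, type, options, dump, pass.
    while read -r dev mnt type rest; do
        # Keep the last match, since later mounts shadow earlier ones.
        [ "$mnt" = "$target" ] && result="$type"
    done < "$src"
    [ -n "$result" ] && echo "$result"
}

# Example: expect "squashfs" here, not "metafs".
#   fstype_of /tmp/somewhere
```

If this prints "metafs" for a mount point that should hold a squashfs image, autodetection picked the wrong type.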
 
User avatar
janisk
MikroTik Support
MikroTik Support
Posts: 6283
Joined: Tue Feb 14, 2006 9:46 am
Location: Riga, Latvia

Re: MetaROUTER stability issues on certain MIPSBE and PPC bo

Fri Jun 29, 2012 4:32 pm

Is this inside the OpenWRT guest? If it is not, I cannot comment on the internals of RouterOS.
 
guy1with1mr1problems
newbie
Posts: 43
Joined: Wed Jun 13, 2012 12:34 pm

Re: MetaROUTER stability issues on certain MIPSBE and PPC bo

Fri Jun 29, 2012 5:18 pm

Is this inside the OpenWRT guest? If it is not, I cannot comment on the internals of RouterOS.
sounds like I'll have to post the patch here myself, once I have a little time.
 
guy1with1mr1problems
newbie
Posts: 43
Joined: Wed Jun 13, 2012 12:34 pm

Re: MetaROUTER stability issues on certain MIPSBE and PPC bo

Fri Jun 29, 2012 5:25 pm

Found the root of the metafs autodetection problem:

/ # cat /proc/filesystems
nodev sysfs
nodev rootfs
nodev bdev
nodev proc
nodev sockfs
nodev pipefs
nodev anon_inodefs
nodev tmpfs
nodev inotifyfs
nodev devpts
metafs
squashfs
nodev ramfs
nodev autofs


metafs should have nodev too.
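A quick way to spot this on any guest is to list the entries of /proc/filesystems that are not marked "nodev", on the assumption that these are the types mount will probe when no -t option is given. The function name below is made up for illustration.

```shell
#!/bin/sh
# Sketch only: print filesystem types that are NOT flagged "nodev".
# Reads a file in /proc/filesystems format; defaults to the real one.

blockdev_filesystems() {
    src="${1:-/proc/filesystems}"
    # Lines look like "nodev<TAB>tmpfs" or "<TAB>squashfs".
    while read -r first second; do
        [ -n "$first" ] && [ -z "$second" ] && echo "$first"
    done < "$src"
    return 0
}

# Example:
#   blockdev_filesystems
```

With the listing above, this would print metafs and squashfs, which shows metafs sitting among the block-device filesystems.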
 
guy1with1mr1problems
newbie
Posts: 43
Joined: Wed Jun 13, 2012 12:34 pm

Re: MetaROUTER stability issues on certain MIPSBE and PPC bo

Fri Jun 29, 2012 5:32 pm

This should do the trick, I'll recompile and confirm:
--- old/fs/metafs/inode.c	2012-06-29 16:31:48.331049440 +0200
+++ new/fs/metafs/inode.c	2012-06-29 16:32:31.652047941 +0200
@@ -841,7 +841,6 @@
 	.name		= "metafs",
 	.get_sb		= mfs_get_sb,
 	.kill_sb	= kill_block_super,
-	.fs_flags	= FS_REQUIRES_DEV,
 };
 
 static void init_once(void *foo)
edit: improved

At the very least, I think you'll then need to modify the kernel config:

CONFIG_CMDLINE="init=/etc/preinit rootfstype=metafs root=none"

(didn't need the root specification before because it for some reason always found metafs)



 
guy1with1mr1problems
newbie
Posts: 43
Joined: Wed Jun 13, 2012 12:34 pm

Re: MetaROUTER stability issues on certain MIPSBE and PPC bo

Fri Jun 29, 2012 6:00 pm

This should do the trick, I'll recompile and confirm:
...
edit: improved

At the very least, I think you'll then need to modify the kernel config:

CONFIG_CMDLINE="init=/etc/preinit rootfstype=metafs root=none"

(didn't need the root specification before because it for some reason always found metafs)



Confirmed working - now shows metafs as 'nodev' and autodetection is not broken anymore. And exactly as I said - if you forget: rootfstype=metafs root=none in kernel parameters, it will refuse to boot with 'unable to mount root'.
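Since a missing parameter only shows up as a failed boot, it may help to check a running guest's command line before relying on a rebuilt kernel. A minimal sketch with a made-up function name; it only does an exact word match on the contents of /proc/cmdline.

```shell
#!/bin/sh
# Sketch only: succeed if the kernel command line contains a given word.
# $1: parameter to look for, $2: cmdline file (defaults to /proc/cmdline).

cmdline_has() {
    want="$1"
    src="${2:-/proc/cmdline}"
    line=""
    # read returns non-zero when the file lacks a trailing newline.
    read -r line < "$src" || true
    for word in $line; do
        [ "$word" = "$want" ] && return 0
    done
    return 1
}

# Example:
#   cmdline_has rootfstype=metafs && cmdline_has root=none || echo "missing"
```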

So the only issue that remains now is being able to mount metafs more than once and the segfault when umounting it.
 
timberwolf
Member Candidate
Member Candidate
Topic Author
Posts: 274
Joined: Mon Apr 25, 2011 12:08 pm
Location: Germany

Re: MetaROUTER stability issues on certain MIPSBE and PPC bo

Sun Jul 08, 2012 11:20 am

Let's not get too OT in this thread, which is about stability issues with the MetaROUTER feature itself.
Which by now still aren't fixed even on some PPC boards.

So MT staff, what about news? You claimed to be able to provide a fix "soon" on June 19.
 
guy1with1mr1problems
newbie
Posts: 43
Joined: Wed Jun 13, 2012 12:34 pm

Re: MetaROUTER stability issues on certain MIPSBE and PPC bo

Wed Jul 11, 2012 1:07 pm

Let's not get too OT in this thread, which is about stability issues with the MetaROUTER feature itself.
Which by now still aren't fixed even on some PPC boards.

So MT staff, what about news? You claimed to be able to provide a fix "soon" on June 19.
Request for comment seconded.
 
timberwolf
Member Candidate
Member Candidate
Topic Author
Posts: 274
Joined: Mon Apr 25, 2011 12:08 pm
Location: Germany

Re: MetaROUTER stability issues on certain MIPSBE and PPC bo

Mon Jul 16, 2012 4:23 pm

In 3 days, exactly one month will have passed without any news on progress. So it's relatively safe to assume that MT is back to the old habit of ignoring this problem altogether for maybe another year or so.
What a shame.
 
User avatar
nz_monkey
Forum Guru
Forum Guru
Posts: 1825
Joined: Mon Jan 14, 2008 1:53 pm
Location: Straya
Contact:

Re: MetaROUTER stability issues on certain MIPSBE and PPC bo

Mon Jul 16, 2012 11:17 pm

I doubt they are ignoring it. Possibly they are working on a fix for 6.x release
http://thebrotherswisp.com/ | Mikrotik MTCNA, MTCRE, MTCINE | Fortinet FTCNA, FCNSP, FCT | Extreme Networks ENA
 
timberwolf
Member Candidate
Member Candidate
Topic Author
Posts: 274
Joined: Mon Apr 25, 2011 12:08 pm
Location: Germany

Re: MetaROUTER stability issues on certain MIPSBE and PPC bo

Tue Jul 17, 2012 9:20 am

That's why they talked about "soon" and are ignoring any request for comment in this thread.
Sure...
 
guy1with1mr1problems
newbie
Posts: 43
Joined: Wed Jun 13, 2012 12:34 pm

Re: MetaROUTER stability issues on certain MIPSBE and PPC bo

Wed Jul 18, 2012 1:22 pm

That's why they talked about "soon" and are ignoring any request for comment in this thread.
Sure...
Yeah, it frustrates us too because we have to keep renaming voltage.ko modules inside mips routerboards. By the way not easy on boards without official serial port. We have a 3.3V to RS232 level converter on the way, then I'll post a way to access the bootloader in routerboards without RS232 so that others can make the voltage.ko fix too.
 
peson
Trainer
Trainer
Posts: 183
Joined: Tue Jul 20, 2004 10:33 am
Location: Sweden

Re: MetaROUTER stability issues on certain MIPSBE and PPC bo

Wed Jul 18, 2012 2:41 pm

That's why they talked about "soon" and are ignoring any request for comment in this thread.
Sure...
Yeah, it frustrates us too because we have to keep renaming voltage.ko modules inside mips routerboards. By the way not easy on boards without official serial port. We have a 3.3V to RS232 level converter on the way, then I'll post a way to access the bootloader in routerboards without RS232 so that others can make the voltage.ko fix too.
Instead of RS232, have you looked into this?
/sys routerb settings set boot-device=try-ethernet-once-then-nand
Reboot is the last resort, try to find out what's wrong instead.
 
guy1with1mr1problems
newbie
Posts: 43
Joined: Wed Jun 13, 2012 12:34 pm

Re: MetaROUTER stability issues on certain MIPSBE and PPC bo

Wed Jul 18, 2012 4:40 pm

That's why they talked about "soon" and are ignoring any request for comment in this thread.
Sure...
Yeah, it frustrates us too because we have to keep renaming voltage.ko modules inside mips routerboards. By the way not easy on boards without official serial port. We have a 3.3V to RS232 level converter on the way, then I'll post a way to access the bootloader in routerboards without RS232 so that others can make the voltage.ko fix too.
Instead of RS232, have you looked into this?
/sys routerb settings set boot-device=try-ethernet-once-then-nand
No, we didn't realize that, thanks a lot ! Still, a method to save those routerboards in case of firmware failure and the ability to debug a native openwrt on them would be nice.
 
barkas
Member Candidate
Member Candidate
Posts: 260
Joined: Sun Sep 25, 2011 10:51 pm

Re: MetaROUTER stability issues on certain MIPSBE and PPC bo

Sun Jul 22, 2012 11:55 am

hotfix update please
 
guy1with1mr1problems
newbie
Posts: 43
Joined: Wed Jun 13, 2012 12:34 pm

Re: MetaROUTER stability issues on certain MIPSBE and PPC bo

Mon Jul 30, 2012 5:52 pm

Because of constant problems with Metarouter, we're trying to run custom programs alongside router os. Whenever we put a native mips binary in there and modify an init script to run it, something removes it before it runs. Is this some intentional protection? What is MikroTik's official standpoint on running custom programs inside router os? Thanks for any information.
 
User avatar
janisk
MikroTik Support
MikroTik Support
Posts: 6283
Joined: Tue Feb 14, 2006 9:46 am
Location: Riga, Latvia

Re: MetaROUTER stability issues on certain MIPSBE and PPC bo

Wed Aug 01, 2012 9:46 am

it is not possible to run 3rd party code directly as a part of RouterOS. You have to have MetaROUTER guest or have to use other OS on the hardware directly to run them.
 
guy1with1mr1problems
newbie
Posts: 43
Joined: Wed Jun 13, 2012 12:34 pm

Re: MetaROUTER stability issues on certain MIPSBE and PPC bo

Wed Aug 01, 2012 9:58 am

it is not possible to run 3rd party code directly as a part of RouterOS. You have to have MetaROUTER guest or have to use other OS on the hardware directly to run them.
Allright, so then I guess you're also willing to release all the modifications to wireless drivers (including for example nstreme) as open source to comply with the GPL license linux is under? We have to use them, routeros performs much better than regular embedded linux on all cards we tested. But Metarouter would be acceptable, if it were completely stable. Any news on that front ?
 
timberwolf
Member Candidate
Member Candidate
Topic Author
Posts: 274
Joined: Mon Apr 25, 2011 12:08 pm
Location: Germany

Re: MetaROUTER stability issues on certain MIPSBE and PPC bo

Sat Aug 11, 2012 11:09 am

Is anyone still believing that MikroTik has NOT given up on MetaRouter?
 
barkas
Member Candidate
Member Candidate
Posts: 260
Joined: Sun Sep 25, 2011 10:51 pm

Re: MetaROUTER stability issues on certain MIPSBE and PPC bo

Sat Aug 11, 2012 2:42 pm

It seems to take quite a while for such a simple hotfix.
 
guy1with1mr1problems
newbie
Posts: 43
Joined: Wed Jun 13, 2012 12:34 pm

Re: MetaROUTER stability issues on certain MIPSBE and PPC bo

Sat Aug 11, 2012 3:42 pm

It seems to take quite a while for such a simple hotfix.
Are you kidding? You think making a hypervisor is simple ? Allright then, show us your own mips hypervisor, then you can talk. The first thing I thought when MT released metarouter was: 'wow, MikroTik have some serious balls'. But nothing can possibly excuse the behavior towards LEGITIMATE CUSTOMERS like that. Even giving a freaking explanation or maybe even taking it to a next level and being polite would make a world of difference. When I call them (not even easy to find the number), they ask me whether I bought RB from them directly and if not, they send me to hell, even though our local retailer is not able to solve any simplest problem. What kinda behavior is that?
 
guy1with1mr1problems
newbie
Posts: 43
Joined: Wed Jun 13, 2012 12:34 pm

Re: MetaROUTER stability issues on certain MIPSBE and PPC bo

Sat Aug 11, 2012 3:46 pm

and by the way, when you release kernel patches for metarouter guests, you can at least expect they will be developed, so the least you can do is to set up some open source infrastructure (maybe a git repo? or even a stupid html page with changes and e-mail to send patches to) - and no -forum pages don't count, requiring registration is EXTREMELY impolite. I sent you a patch right here and in another topic that fixes a serious problem for anyone who wants to mount filesystems inside guests and what do you do? You play totally stupid and accuse me of tampering with MT internals !
 
User avatar
janisk
MikroTik Support
MikroTik Support
Posts: 6283
Joined: Tue Feb 14, 2006 9:46 am
Location: Riga, Latvia

Re: MetaROUTER stability issues on certain MIPSBE and PPC bo

Mon Aug 13, 2012 9:34 am

nobody has given up on this problem.

about possible patches - you can send them to support@mikrotik.com and make it clear that we can use your code freely.
 
User avatar
NathanA
Forum Veteran
Forum Veteran
Posts: 801
Joined: Tue Aug 03, 2004 9:01 am

Re: MetaROUTER stability issues on certain MIPSBE and PPC bo

Mon Aug 13, 2012 11:43 am

Hey janisk,

I'm curious about any possible headway on the PPC side of things...has anything more been determined about PowerPC reboots? Does the PowerPC issue appear to be related at some level to the MIPSBE problem?

Thanks,

-- Nathan
 
User avatar
janisk
MikroTik Support
MikroTik Support
Posts: 6283
Joined: Tue Feb 14, 2006 9:46 am
Location: Riga, Latvia

Re: MetaROUTER stability issues on certain MIPSBE and PPC bo

Mon Aug 13, 2012 3:37 pm

developers are working on the issue. Unfortunately i cannot report anything more on the issue regarding PPC arch boards.
 
barkas
Member Candidate
Member Candidate
Posts: 260
Joined: Sun Sep 25, 2011 10:51 pm

Re: MetaROUTER stability issues on certain MIPSBE and PPC bo

Tue Aug 14, 2012 2:02 pm

It seems to take quite a while for such a simple hotfix.
Are you kidding? You think making a hypervisor is simple ? Allright then, show us your own mips hypervisor, then you can talk. The first thing I thought when MT released metarouter was: 'wow, MikroTik have some serious balls'. But nothing can possibly excuse the behavior towards LEGITIMATE CUSTOMERS like that. Even giving a freaking explanation or maybe even taking it to a next level and being polite would make a world of difference. When I call them (not even easy to find the number), they ask me whether I bought RB from them directly and if not, they send me to hell, even though our local retailer is not able to solve any simplest problem. What kinda behavior is that?
Simple hotfix as in taking out the hardware monitoring module in the interim.
 
rmichael
Forum Veteran
Forum Veteran
Posts: 718
Joined: Sun Mar 08, 2009 11:00 pm

Re: MetaROUTER stability issues on certain MIPSBE and PPC bo

Tue Aug 21, 2012 11:06 pm

I'm testing MetaROUTER using liquidcz basic image (http://openwrt.wk.cz/trunk/mr-ppc/openw ... sic.tar.gz) with no extra packages installed (I just want to use it for dnsmasq).

It can run fine for days, but whenever I run logread or access /dev/log, the RB1100AH freezes and only a power cycle restores it. Is this a known issue?

Can someone recommend a better image just for running dnsmasq?

PS.: I just read this thread from the beginning...I had no idea MetaROUTER can affect the whole router. Does the Atheros SoC have an MMU? Why is it possible to crash the hypervisor from a guest?
 
rmichael
Forum Veteran
Forum Veteran
Posts: 718
Joined: Sun Mar 08, 2009 11:00 pm

Re: MetaROUTER stability issues on certain MIPSBE and PPC bo

Wed Aug 22, 2012 3:35 am

It seems to take quite a while for such a simple hotfix.
Are you kidding? You think making a hypervisor is simple? All right then, show us your own MIPS hypervisor, then you can talk. The first thing I thought when MT released MetaROUTER was: 'wow, MikroTik have some serious balls'. But nothing can possibly excuse behavior like that towards LEGITIMATE CUSTOMERS. Even giving a freaking explanation, or maybe even taking it to the next level and being polite, would make a world of difference. When I call them (the number is not even easy to find), they ask me whether I bought the RB from them directly and, if not, they send me to hell, even though our local retailer is not able to solve even the simplest problem. What kind of behavior is that?
Simple hotfix as in taking out the hardware monitoring module in the interim.
It appears that ver 5.20 on the RB1100AH stopped reporting temperature, and I wonder if it's MetaROUTER related...
 
guy1with1mr1problems
newbie
Posts: 43
Joined: Wed Jun 13, 2012 12:34 pm

Re: MetaROUTER stability issues on certain MIPSBE and PPC bo

Wed Aug 22, 2012 8:56 am

I'm testing MetaROUTER using liquidcz basic image (http://openwrt.wk.cz/trunk/mr-ppc/openw ... sic.tar.gz) with no extra packages installed (I just want to use it for dnsmasq).

It can run fine for days but whenever I try to logread or go to /dev/log the RB1100AH freezes and only power cycle restores it. Is this a known issue?

Can someone recommend a better image just for running dnsmasq?

PS.: I just read this thread from the beginning...I had no idea MetaROUTER can affect the whole router. Does Atheros SoC have a MMU? Why is it possible to crash hypervisor with a guest?
I'd recommend cross linux from scratch, if you know what you're doing. Eglibc is really extremely fast on the target. Getting the kernel is easy - just fetch the same version openwrt uses from kernel.org and apply only the metarouter patch and you're good to go. We are using this to bootstrap Gentoo for metarouter now, since there is no softfloat stage1 image we could find.
http://trac.cross-lfs.org/
 
guy1with1mr1problems
newbie
Posts: 43
Joined: Wed Jun 13, 2012 12:34 pm

Re: MetaROUTER stability issues on certain MIPSBE and PPC bo

Wed Aug 22, 2012 9:00 am

I'm testing MetaROUTER using liquidcz basic image (http://openwrt.wk.cz/trunk/mr-ppc/openw ... sic.tar.gz) with no extra packages installed (I just want to use it for dnsmasq).

It can run fine for days but whenever I try to logread or go to /dev/log the RB1100AH freezes and only power cycle restores it. Is this a known issue?

Can someone recommend a better image just for running dnsmasq?

PS.: I just read this thread from the beginning...I had no idea MetaROUTER can affect the whole router. Does Atheros SoC have a MMU? Why is it possible to crash hypervisor with a guest?
I'd recommend cross linux from scratch, if you know what you're doing. Eglibc is really extremely fast on the target. Getting the kernel is easy - just fetch the same version openwrt uses from kernel.org and apply only the metarouter patch and you're good to go. We are using this to bootstrap Gentoo for metarouter now, since there is no softfloat stage1 image we could find.
http://trac.cross-lfs.org/
Oh, I also forgot to mention: don't even think about using qemu for the native compilations. We tried; it segfaults like crazy on many things and is apparently unusable for mips. We use distcc though, which works really well.
 
timberwolf
Member Candidate
Member Candidate
Topic Author
Posts: 274
Joined: Mon Apr 25, 2011 12:08 pm
Location: Germany

Re: MetaROUTER stability issues on certain MIPSBE and PPC bo

Wed Aug 22, 2012 10:12 am

PS.: I just read this thread from the beginning...I had no idea MetaROUTER can affect the whole router. Does Atheros SoC have a MMU? Why is it possible to crash hypervisor with a guest?
Because MetaROUTER isn't a hypervisor-based virtualization environment. Judging from the patches, it's more of a paravirtualized approach.
 
janisk
MikroTik Support
MikroTik Support
Posts: 6283
Joined: Tue Feb 14, 2006 9:46 am
Location: Riga, Latvia

Re: MetaROUTER stability issues on certain MIPSBE and PPC bo

Wed Aug 22, 2012 4:04 pm


It appears that ver 5.20 on RB1100AH stopped reporting temperature and wonder if it's MetaROUTER related...
No, it has no connection to MetaROUTER; it is a different problem that will be resolved in future releases.

Also, the "voltage.ko" "fix" is only confirmed on mipsbe routers.
 
janisk
MikroTik Support
MikroTik Support
Posts: 6283
Joined: Tue Feb 14, 2006 9:46 am
Location: Riga, Latvia

Re: MetaROUTER stability issues on certain MIPSBE and PPC bo

Thu Aug 30, 2012 9:30 am

Good news, everyone: there has been some progress on the problem of MetaROUTER guests causing severe problems on the RB450G and similar mipsbe RouterBOARDs. We hope the issue is addressed in this development build, which you can try out to see whether it works for you:
http://www.mikrotik.com/download/share/ ... .21rc1.npk
 
Wazza
newbie
Posts: 39
Joined: Thu Oct 13, 2011 10:43 am

Re: MetaROUTER stability issues on certain MIPSBE and PPC bo

Thu Aug 30, 2012 9:48 am

That is good news... I'll be testing this shortly...
Having said that, I really need some stability on the PowerPC boards...
Any news on that front?

Thanks,

Wazza
 
timberwolf
Member Candidate
Member Candidate
Topic Author
Posts: 274
Joined: Mon Apr 25, 2011 12:08 pm
Location: Germany

Re: MetaROUTER stability issues on certain MIPSBE and PPC bo

Thu Aug 30, 2012 11:09 am

Thanks for the update, I will test as soon as my time allows.
 
janisk
MikroTik Support
MikroTik Support
Posts: 6283
Joined: Tue Feb 14, 2006 9:46 am
Location: Riga, Latvia

Re: MetaROUTER stability issues on certain MIPSBE and PPC bo

Thu Aug 30, 2012 1:28 pm

Some time ago someone reported that, using the liquidcz basic image, he could hang an RB1100AH with the logread command.

In this build that problem should go away:
http://www.mikrotik.com/download/share/ ... .21rc1.npk

It is a newer build; however, there are no additional mipsbe changes, so the previous build can still be used.
 
guy1with1mr1problems
newbie
Posts: 43
Joined: Wed Jun 13, 2012 12:34 pm

Re: MetaROUTER stability issues on certain MIPSBE and PPC bo

Thu Aug 30, 2012 3:20 pm

some time ago there was someone saying that using liquidcz basic image he could hang RB1100AH using logread command.

in this build that problem should go away:
http://www.mikrotik.com/download/share/ ... .21rc1.npk

it is newer build, however there are no additional mipsbe changes and previous build can be used.
Thanks for the great news! I expect we'll be testing it heavily tomorrow :)
 
barkas
Member Candidate
Member Candidate
Posts: 260
Joined: Sun Sep 25, 2011 10:51 pm

Re: MetaROUTER stability issues on certain MIPSBE and PPC bo

Thu Aug 30, 2012 8:20 pm

Thanks for the update, only I was at 6.0beta1 and downgrading renamed my network interfaces to ether6-10. Strange.
 
NathanA
Forum Veteran
Forum Veteran
Posts: 801
Joined: Tue Aug 03, 2004 9:01 am

Re: MetaROUTER stability issues on certain MIPSBE and PPC bo

Fri Aug 31, 2012 1:37 am

Great news on both fronts, MIPS and PPC! I will be testing these shortly as well.

I am rather curious to know what changed. :)

-- Nathan
 
timberwolf
Member Candidate
Member Candidate
Topic Author
Posts: 274
Joined: Mon Apr 25, 2011 12:08 pm
Location: Germany

Re: MetaROUTER stability issues on certain MIPSBE and PPC bo

Fri Aug 31, 2012 11:53 am

I conducted some simple tests, so far it looks very promising.

I did a very simple setup for now:
1 bridge in host with a single IPv4 address.
2 Metarouters, each with a dynamic interface on that bridge and single IPv4 address.

I then pinged:
- Each MR from the host, in a Winbox terminal
- The host from each MR, started from the MR console, also in Winbox
- Each MR from the other, started from an SSH session inside a Winbox terminal.

I noted two things:
1.) The MR Console in Winbox freezes after a while with no impact on the MR itself.
2.) Ping round-trip time is a little strange:
-Host to MR is usually 1ms
-MR to Host is also usually 1ms
-MR to MR is usually 2ms
-Sometimes all of those round-trip times go up to 10, 30 or sometimes even 60ms.

Neither "issue" has affected MR stability so far, but the second one just feels a little strange.
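
To put numbers on observations like these, a minimal Python sketch follows. The sample values are made up to resemble the report (mostly 1-2ms with occasional spikes); they are not measurements from this thread. Jitter is taken here as the mean absolute difference between consecutive samples, which is only one of several common definitions.

```python
from statistics import mean, median

def summarize_rtts(rtts_ms):
    """Return basic round-trip statistics for a list of RTT samples in ms."""
    if len(rtts_ms) < 2:
        raise ValueError("need at least two samples")
    # Jitter: mean absolute difference between consecutive samples.
    diffs = [abs(b - a) for a, b in zip(rtts_ms, rtts_ms[1:])]
    return {
        "min": min(rtts_ms),
        "median": median(rtts_ms),
        "max": max(rtts_ms),
        "mean": mean(rtts_ms),
        "jitter": mean(diffs),
    }

# Hypothetical samples shaped like the report, not real data.
samples = [1, 1, 2, 1, 10, 1, 1, 30, 1, 2, 60, 1]
stats = summarize_rtts(samples)
for name, value in stats.items():
    print(f"{name}: {value:.2f} ms")
```

A low median next to a large maximum and jitter is the pattern described above: the typical ping is fine, but the outliers dominate the variance.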
 
vk7zms
Member Candidate
Member Candidate
Posts: 227
Joined: Thu Jun 29, 2006 3:01 am
Location: Hobart, Tasmania
Contact:

MetaROUTER stability issues on certain MIPSBE and PPC boards

Fri Aug 31, 2012 2:25 pm

How long did you run the test for?
 
timberwolf
Member Candidate
Member Candidate
Topic Author
Posts: 274
Joined: Mon Apr 25, 2011 12:08 pm
Location: Germany

Re: MetaROUTER stability issues on certain MIPSBE and PPC bo

Fri Aug 31, 2012 2:27 pm

I started the tests yesterday at around 18:00 CEST. They are still running (I hope). ;-)
 
janisk
MikroTik Support
MikroTik Support
Posts: 6283
Joined: Tue Feb 14, 2006 9:46 am
Location: Riga, Latvia

Re: MetaROUTER stability issues on certain MIPSBE and PPC bo

Fri Aug 31, 2012 4:05 pm

I noted two things:
1.) The MR Console in Winbox freezes after a while with no impact on the MR itself.
2.) Ping roundtrip time is a little strange:
-Host to MR is usually 1ms
-MR to Host is also usually 1ms
-MR to MR is usually 2ms
-Sometimes all of those round-trip times go up to 10, 30 or sometimes even 60ms.

Neither "issue" has affected MR stability so far, but the second one just feels a little strange.
This is how it will stay, since each router has its own time slot in which to do work, and the host has to serve the guests. And as you noted, it does not interfere with data passing through the router. The devs say that this is how it is, and that a better round-trip time can only be achieved with a faster CPU. However, the MR1 to MR2 speed seems reasonable.
 
timberwolf
Member Candidate
Member Candidate
Topic Author
Posts: 274
Joined: Mon Apr 25, 2011 12:08 pm
Location: Germany

Re: MetaROUTER stability issues on certain MIPSBE and PPC bo

Fri Aug 31, 2012 4:11 pm

I noted two things:
1.) The MR Console in Winbox freezes after a while with no impact on the MR itself.
2.) Ping roundtrip time is a little strange:
-Host to MR is usually 1ms
-MR to Host is also usually 1ms
-MR to MR is usually 2ms
-Sometimes all of those round-trip times go up to 10, 30 or sometimes even 60ms.

Neither "issue" has affected MR stability so far, but the second one just feels a little strange.
this is how it will stay, since each router has its time when to do something and host has to serve the guests. And as you noted that is not interfering with data passing through the router. Devs say that this is how it is and better round-trip time can be achieved only by using faster CPU. However MR1 to MR2 speed seems reasonable.
Ok, but do you have any idea why there are such big deviations, like 1ms vs 60ms? I would expect those times to stay relatively stable. I am not concerned about the absolute amount of time, just the big deviations.
 
NathanA
Forum Veteran
Forum Veteran
Posts: 801
Joined: Tue Aug 03, 2004 9:01 am

Re: MetaROUTER stability issues on certain MIPSBE and PPC bo

Sat Sep 01, 2012 4:03 am

Okay, for MIPS, I have good news and bad news:

I took one of our "special" 450Gs that normally reboots every 30 seconds after importing my Asterisk OpenWRT image, and installed 5.21rc1 on it.

The good news is that it stopped rebooting!

The bad news is that this board still freezes up. The difference now is that it recovers a lot faster, so Watchdog never kicks in to reboot it. It seems to stop responding both to ethernet and to serial console every 6-7 seconds, and manages to recover after 6-7 seconds. It is very steady and very regular.

Another 450G board that didn't reboot nearly as often when running 5.20 or below seems much more stable on 5.21rc1, but I even observed that it froze for about 5-6 seconds at one point. It also recovered before Watchdog could kick in.

So it seems as though the fix in 5.21rc1 doesn't stop the freezing up, but it does make it recover a lot faster, and so it prevents Watchdog from going into a reboot cycle.

If I remove voltage.ko from either of these boards, the freezing stops happening completely. On the first board that freezes every 6 seconds or so, without voltage.ko, I achieved 90+ day uptime without dropping any pings to it before I rebooted it to upgrade it to 5.21rc1.

Progress has obviously been made, but it's not fixed yet. :)

-- Nathan
 
timberwolf
Member Candidate
Member Candidate
Topic Author
Posts: 274
Joined: Mon Apr 25, 2011 12:08 pm
Location: Germany

Re: MetaROUTER stability issues on certain MIPSBE and PPC bo

Sat Sep 01, 2012 10:21 am

Nathan, did you test if those freezes also happen when running ROS based MetaROUTERS?
I didn't notice them by now, but I only run 2 ROS instances.

EDIT: Ok, I can confirm your observation: I got random hangs of 4-5 seconds which recovered by themselves without watchdog intervention. So I also conclude: much better, but not fixed.
 
janisk
MikroTik Support
MikroTik Support
Posts: 6283
Joined: Tue Feb 14, 2006 9:46 am
Location: Riga, Latvia

Re: MetaROUTER stability issues on certain MIPSBE and PPC bo

Mon Sep 03, 2012 9:26 am

AFAIK, a lot of stuff is happening in the OS all the time, and both guest and host have to do these things continuously, so each gets a time-slice in which to do its work. If a ping comes in at the wrong time, it can miss the current time-slice and land in the next one, hence the deviation.

About freezing - any configuration details would be helpful. Which interfaces are used, and how many of them? Is load on the router required, or do the problems appear even without load?
 
peson
Trainer
Trainer
Posts: 183
Joined: Tue Jul 20, 2004 10:33 am
Location: Sweden

Re: MetaROUTER stability issues on certain MIPSBE and PPC bo

Mon Sep 03, 2012 9:52 am

AFAIK since a lot of stuff is happening in OS all the time, and guest and host has to do these things all the time, so each gets time-slice when to do stuff. If ping comes in wrong time it can miss current time-slice and get to the other one, hence the deviation.
Can this be true, since the "hangs" stop when voltage.ko is deleted? Confirmed by Nathan.

I can confirm the interruption in the ping tests for the 450G.
I will try my RB1100AH production routers one of these days.
Reboot is the last resort, try to find out what's wrong instead.
 
timberwolf
Member Candidate
Member Candidate
Topic Author
Posts: 274
Joined: Mon Apr 25, 2011 12:08 pm
Location: Germany

Re: MetaROUTER stability issues on certain MIPSBE and PPC bo

Mon Sep 03, 2012 10:18 am

AFAIK since a lot of stuff is happening in OS all the time, and guest and host has to do these things all the time, so each gets time-slice when to do stuff. If ping comes in wrong time it can miss current time-slice and get to the other one, hence the deviation.
Could be, but 60ms seems too big. I would expect +/-2-4ms on an unloaded system, but I don't know the details of your system, so I can't help much.
About freezing - any configuration details would be helpful. What interfaces are used, how many of them. Is load on the router required or problems will appear even without load?
For me it's just the simple setup I mentioned earlier: one bridge with a private IP, two MRs connected to it via dynamic interfaces with a private IP each, and one IP on the host unrelated to the MR setup. The only load on the system is pings spaced 1s apart coming into the external interface; I haven't yet connected/pinged the MRs to the external network.

Judging by Nathan's posts, this seems to be a little hardware dependent as he has two boards acting differently.
 
janisk
MikroTik Support
MikroTik Support
Posts: 6283
Joined: Tue Feb 14, 2006 9:46 am
Location: Riga, Latvia

Re: MetaROUTER stability issues on certain MIPSBE and PPC bo

Mon Sep 03, 2012 12:43 pm

My explanation is about variable ping times when an ICMP echo is sent from an MR to itself or to another MR. Of course, 60ms is excessive, but as I understand it, that happens rarely.

About the freezes - thanks for the description, I will recreate that configuration. Hopefully with the same results as you guys got.
 
NathanA
Forum Veteran
Forum Veteran
Posts: 801
Joined: Tue Aug 03, 2004 9:01 am

Re: MetaROUTER stability issues on certain MIPSBE and PPC bo

Mon Sep 03, 2012 1:06 pm

Judging by Nathan's posts, this seems to be a little hardware dependent as he has two boards acting differently.
That is correct. I'm also sure I mentioned this before in past posts to this thread. I have some boards that have always locked up with more or less frequency than other boards. Most of the boards we have take a few hours or a few days before they hang or reboot. But in rummaging through various 450G boards that we have laying around, I have managed to unearth two boards that both lock up within SECONDS of starting a MetaROUTER when they are used with a 24V power supply. I have checked the capacitors on both boards, and they are both fine.

These boards are 100% solid when voltage.ko is deleted.

I am also sure that I made this offer before, too, but I am more than happy to send one of these boards back to MikroTik for them to test with.

As far as what my set-up is, I am using one of these boards that locks up within seconds to run my tests, since they make it VERY easy to reproduce any MetaROUTER problems. I am running 5.21rc1, completely virgin ("/system reset-configuration skip-backup=yes keep-users=no no-defaults=yes"). I then import my Asterisk MetaROUTER. Seconds after it begins booting, I start seeing the freezing-every-6-seconds-for-6-seconds problem. I don't even configure any IP addresses or add any vif interfaces; I just boot the image. When I delete voltage.ko, the problem is gone.

I have not tried running RouterOS MetaROUTER on these same boards, but I will. I suspect they will act exactly the same.

-- Nathan
 
timberwolf
Member Candidate
Member Candidate
Topic Author
Posts: 274
Joined: Mon Apr 25, 2011 12:08 pm
Location: Germany

Re: MetaROUTER stability issues on certain MIPSBE and PPC bo

Mon Sep 03, 2012 1:13 pm

my explanation is about variable ping times when ICMP echo is sent from MR to itself or other MR. Of course, 60ms is excessive, but as i understand, that happens rarely.
Well it happens frequently, 60ms is an extreme value but 14-30ms seem to happen quite often.
 
guy1with1mr1problems
newbie
Posts: 43
Joined: Wed Jun 13, 2012 12:34 pm

Re: MetaROUTER stability issues on certain MIPSBE and PPC bo

Mon Sep 03, 2012 2:00 pm

my explanation is about variable ping times when ICMP echo is sent from MR to itself or other MR. Of course, 60ms is excessive, but as i understand, that happens rarely.
Well it happens frequently, 60ms is an extreme value but 14-30ms seem to happen quite often.
I think, considering the hardware constraints, this is a pretty good latency.
 
barkas
Member Candidate
Member Candidate
Posts: 260
Joined: Sun Sep 25, 2011 10:51 pm

Re: MetaROUTER stability issues on certain MIPSBE and PPC bo

Mon Sep 03, 2012 11:06 pm

my explanation is about variable ping times when ICMP echo is sent from MR to itself or other MR. Of course, 60ms is excessive, but as i understand, that happens rarely.
Well it happens frequently, 60ms is an extreme value but 14-30ms seem to happen quite often.
I think, considering the hardware constraints, this is a pretty good latency.
No it's not. I can get 60ms through my DSL over the Atlantic and back again.
And my DSL router is not faster than the RB450G, which is actually pretty fast for such a device.
 
guy1with1mr1problems
newbie
Posts: 43
Joined: Wed Jun 13, 2012 12:34 pm

Re: MetaROUTER stability issues on certain MIPSBE and PPC bo

Tue Sep 04, 2012 7:12 am

...
No it's not. I can get 60ms through my DSL over the Atlantic and back again.
And my DSL router is not faster than the RB450G, which is actually pretty fast for such a device.
Yes it is. It is a virtual machine, you have to remember that. It may not be full virtualization, but I'm pretty sure things only happen inside a running task on the host OS. If you want proof, try uploading something large via FTP to the board and keep pinging from MetaROUTER to RouterOS, for example. You'll see the latency go up significantly. Maybe the VM's network subsystem keeps an event queue or something and generates an IRQ only when it is MetaROUTER's turn to work. But if you understand embedded systems at all, you'll know this is a pretty amazing achievement.

Anyway, my theory on how to make it better: if MetaROUTER indeed runs inside a task on the host system, why not provide a facility to renice the task? Maybe give it realtime priority? Some option in RouterOS for this would be nice.

Also, if you give the VM realtime priority, you hang the entire host whenever the VM experiences a high load. To counter this, I'd suggest some hard CPU throttling, something like what ACPI does when the CPU is overheating on x86 systems. For example, you'd be able to say that the guest may run for at most 80% of the allocated time. This could be done either in the virtual machine's kernel or in MetaROUTER.
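
The hard cap suggested above can be illustrated with a small sketch. This is only a toy model of a per-window CPU quota, not how MetaROUTER or RouterOS schedules guests (that is not publicly known); the function name, the 10-tick window and the percentage parameter are all assumptions made for the example.

```python
def throttle_schedule(demand, quota_percent=80, window=10):
    """Decide, tick by tick, whether a guest is allowed to run.

    demand: list of bools, True when the guest wants the CPU on that tick.
    quota_percent: share of each window the guest may use (assumed knob).
    window: accounting window length in ticks (assumed value).
    """
    budget = window * quota_percent // 100  # ticks the guest may use per window
    granted = []
    used = 0
    for tick, wants in enumerate(demand):
        if tick % window == 0:
            used = 0  # a new accounting window starts
        run = wants and used < budget
        if run:
            used += 1
        granted.append(run)
    return granted

# A guest that wants the CPU on every tick for two windows.
granted = throttle_schedule([True] * 20)
print(sum(granted), "of", len(granted), "ticks granted")
```

Even a guest that demands the CPU on every tick leaves the last two ticks of each window to the host, which is the property the suggestion is after.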
 
guy1with1mr1problems
newbie
Posts: 43
Joined: Wed Jun 13, 2012 12:34 pm

Re: MetaROUTER stability issues on certain MIPSBE and PPC bo

Tue Sep 04, 2012 7:25 am

Also forgot another suggestion: since barkas needs high interactive responsiveness, it might be a good idea to provide an option that sets CONFIG_HZ in the host's kernel to a higher value, so that MetaROUTER gets its turn more often and the ping time decreases. This, plus making the VM a realtime task, plus the hard CPU cap for the VM, would guarantee you a regular timeslot and should make the ping very stable regardless of what's happening on the host (while probably also reducing the host's computing power a lot).
 
timberwolf
Member Candidate
Member Candidate
Topic Author
Posts: 274
Joined: Mon Apr 25, 2011 12:08 pm
Location: Germany

Re: MetaROUTER stability issues on certain MIPSBE and PPC bo

Tue Sep 04, 2012 10:19 am

I got those 60ms while running two idling MR instances; most of the pings were at around 1-2ms. Also, the host responded in about 1ms to external pings. I don't think this variance is caused by load, but by the same problem that causes the system to freeze up. I'm still holding to my theory that all of this is caused by some pretty bad interrupt handling and/or locking inside either the systick handler or the "voltage.ko" logic.
But if you understand embedded systems at all, you'll know this is a pretty amazing achievment.
You must be kidding; achieving 1ms latency on this CPU (680MHz, MMU etc.) isn't a big deal. And as for the amazing achievement, I have achieved better (read: less jitter) latencies on an Intel 8051 clocked at 8MHz running multiple tasks in realtime. Virtualization isn't a big deal at all; it is just another word for some pretty old technologies, especially in the case of MetaROUTER.
So in this thread, please stop acting as if you were god's answer to IT questions, would you please?
 
guy1with1mr1problems
newbie
Posts: 43
Joined: Wed Jun 13, 2012 12:34 pm

Re: MetaROUTER stability issues on certain MIPSBE and PPC bo

Tue Sep 04, 2012 11:08 am

I got those 60ms while running two idling MR instances, most of the pings where at arroung 1-2ms. Also the host responded in about 1ms to external pings. I don't think this variance is caused by load but by the same problem which causes the system to freeze up. I'm still hilding my theorie that this all is caused by some pretty bad interrupt handling and/or locking inside either the systick handler or the "voltage.ko" logic.
But if you understand embedded systems at all, you'll know this is a pretty amazing achievment.
You must be kidding, achieving 1ms latency on this CPU(680MHz, MMU etc.) isn't a big deal. And for the amazing achievment, I have achieved better(read less jitter) latencies on an Intel 8051 clocked at 8MHz running multiple tasks in realtime. Virtualization isn't a big deal at all, is just another word for some pretty old technologies, especially in the case of MetaROUTER.
So in this thread, please stop acting as If you were god's answer to IT-questions, would you please?
Programming microcontrollers and writing kernel drivers are two very different things. By the way, I do both. Well if I ping a routerboard I have here right now, I get like 0.3ms latency. Do you know why? Because when the ethernet interface in the ar71xx chip gets data, it triggers an interrupt. Linux can react to this data at once - queue an icmp response for example.
With virtualization, this is probably very different. The data we're seeing are consistent with metarouter vm running inside a userspace task. They probably use tun/tap to make a network interface to the vm (its easy and already in place). That means network data is queued in a buffer until the task wakes up and can handle it. If CONFIG_HZ is set to 100 (which it mostly is, in non-desktops), that means metarouter will 'poll' for the network data with at most 100 Hz frequency - that means 10ms BEST case scenario with HZ_100.

Well how do we arrive at 50ms then? Consider this: making the guest aware that it is 'being task switched' is very hard. I think it isn't aware of it at all and just checks some shm queue on every system interrupt (another 100 Hz in openwrt). And then also consider that it may get task switched when it processes the icmp echo and before it sends a reply. That would explain some great variance in time.

Want to do more than just whine about it? Ok compile your guest kernel with HZ_1000 and report back what latencies you get then - you should get 11-ish times and 20ms in worst case then. You'll also prove my point. Of course, the processing power of your mr guest will be a lot lower then, but hey, you want low latency, right ?

All my previous points stand - for your situation, most beneficial would be allowing host to run with HZ_1000, setting the mr task realtime and throttling its available cpu time with a hard limit.
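
The tick arithmetic in this post can be written down as a short sketch. It only encodes the generic relation between a scheduler tick rate and its period, and it takes on the post's own assumption that the guest is serviced at most once per tick, which nobody in this thread has confirmed. The function names and the "missed ticks" parameter are made up for illustration.

```python
def tick_period_ms(hz):
    """Length of one scheduler tick in milliseconds for a given CONFIG_HZ value."""
    if hz <= 0:
        raise ValueError("hz must be positive")
    return 1000.0 / hz

def worst_case_wait_ms(hz, missed_ticks=0):
    """Longest wait if a packet arrives just after a tick and the guest is
    then skipped for `missed_ticks` further ticks (an assumed scenario)."""
    return tick_period_ms(hz) * (1 + missed_ticks)

for hz in (100, 250, 1000):
    print(f"HZ={hz}: period {tick_period_ms(hz):.1f} ms, "
          f"wait with 4 missed ticks {worst_case_wait_ms(hz, missed_ticks=4):.1f} ms")
```

Under these assumptions, a 50ms outlier at HZ=100 would correspond to the guest missing four consecutive ticks, and the same scenario at HZ=1000 would cost 5ms.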
 
timberwolf
Member Candidate
Member Candidate
Topic Author
Posts: 274
Joined: Mon Apr 25, 2011 12:08 pm
Location: Germany

Re: MetaROUTER stability issues on certain MIPSBE and PPC bo

Tue Sep 04, 2012 11:19 am

Well guy1with1mr1problems, some comments then:
1.) I am getting 1ms latency to a MR, which wouldn't be possible with 100Hz polling frequency at all.
2.) Ever heard of interrupt controllers which might be able to trigger software interrupts? Have a look at the "active" interrupts when you run one or more MR instances.

As long as you haven't worked with or for MT and designed MR with or for them:
"So in this thread, please stop acting as If you were god's answer to IT-questions, would you please?"
Also I wasn't whining about latency, I was reporting it.
 
guy1with1mr1problems
newbie
Posts: 43
Joined: Wed Jun 13, 2012 12:34 pm

Re: MetaROUTER stability issues on certain MIPSBE and PPC bo

Tue Sep 04, 2012 11:42 am

1.) I am getting 1ms latency to a MR, which wouldn't be possible with 100Hz polling frequency at all.
2.) Ever heard of interrupt controllers which might be able to trigger software interrupts? Have a look at the "active" interrupts when you run one or more MR instances.
Fair enough, then they don't use tun/tap. As you say, getting <10ms latency with that would be impossible with HZ_100.
But metarouter still needs to run inside a thread at least. That means it needs to sleep and can 'wake up' at 100Hz at most.
If you send it icmp request just before it wakes up, then you get a very low latency (if it can reply in the same timeslot - which it most probably can). If you send icmp request just after it went to sleep, that would give you about 10ms latency. Makes sense?

Still the same thing stands - if host had HZ_1000, it would be 10 times better.
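
The wake-up argument above can be sketched as a small simulation. It assumes, as the post does, that a request is only handled at the next tick boundary and that requests arrive at a uniformly random point within a tick; whether MetaROUTER actually behaves this way is not known. The function name and parameters are hypothetical.

```python
import random

def simulate_wait_ms(hz, samples=10000, seed=1):
    """Simulate how long a request waits for the next tick at a given HZ.

    Returns (min, mean, max) of the simulated waits in milliseconds.
    """
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    period = 1000.0 / hz
    # A request arriving `phase` ms into a tick waits `period - phase` ms.
    waits = [period - rng.uniform(0.0, period) for _ in range(samples)]
    return min(waits), sum(waits) / len(waits), max(waits)

for hz in (100, 1000):
    lo, avg, hi = simulate_wait_ms(hz)
    print(f"HZ={hz}: min {lo:.3f} ms, mean {avg:.3f} ms, max {hi:.3f} ms")
```

In this model the wait is spread evenly between zero and one tick period, so the mean is about half a period and it scales down by the same factor of ten when HZ goes from 100 to 1000.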

And I repeat, do you realize that when other tasks load the host, MetaROUTER gets a smaller timeslot to run in? It may even be suspended for a few task switches before it gets its turn. MetaROUTER isn't the only thing that's running.

If you gave it realtime priority, it would be guaranteed to run in every task switch. But if you do that, you have to limit the guests cpu usage in some other way.
As long as you haven't worked with or for MT and designed MR with or for them:
Look a few posts up, I already provided a tiny patch for a tedious problem. I plan to fix the other issue too.

Also, I think you're the one acting like you were god's answer to IT questions - after all, you program an almighty 8051! (By the way, I mostly write bare metal code for ARM MCUs now, but I did go through the 8-bit phase too.) You're the only one thinking that just because of a 50ms ping to an MR, something is wrong. I, and apparently MikroTik, think not. The way it is implemented, it runs surprisingly well. What do you need a better ping for anyway?
 
timberwolf
Member Candidate
Member Candidate
Topic Author
Posts: 274
Joined: Mon Apr 25, 2011 12:08 pm
Location: Germany

Re: MetaROUTER stability issues on certain MIPSBE and PPC bo

Tue Sep 04, 2012 11:55 am

Fair enough, then they don't use tun/tap. As you say, getting <10ms latency with that would be impossible with HZ_100.
But metarouter still needs to run inside a thread at least. That means it needs to sleep and can 'wake up' at 100Hz at most.
If you send it icmp request just before it wakes up, then you get a very low latency (if it can reply in the same timeslot - which it most probably can). If you send icmp request just after it went to sleep, that would give you about 10ms latency. Makes sense?

Still the same thing stands - if host had HZ_1000, it would be 10 times better.

And I repeat, do you realize that when other tasks load the host, metarouter gets a smaller timeslot to run in? It may even be suspended for a few task switches before it gets its turn. Metatouter isn't the only thing thats running.

If you gave it realtime priority, it would be guaranteed to run in every task switch. But if you do that, you have to limit the guests cpu usage in some other way.

Quote:
You are basing all that on assumptions; there are more programming models than threads, and neither you nor I know how MT implemented MR.
Look a few posts up, I already provided a tiny patch for a tedious problem. I plan to fix the other issue too.
You worked on the guest side, not the host side. That isn't what I meant.
Also I think you're the one acting like you were god's answer to IT-questions - after all, you program an almighty 8051 ! (by the way, I mostly write bare metal code for ARM mcus now, but I did go through the 8-bit phase too).
Another assumption which couldn't be more wrong. The Intel 8051 was an example; I wouldn't say that "I went through an 8-bit phase", however. I hope you are not one of those guys who use a Cortex-Mx or even Ax for solving a problem which an Intel 8051 or AVR8 would handle in its spare time. But nice to hear that you work with MCUs anyway.
You're the only one thinking that just because of a 50ms ping to mr, something is wrong. Me and apparently Mikrotik think not. The way it is implemented it runs suprisingly well. What do you need a better ping for anyway ?
Also wrong: janisk, who isn't all of MT, >thinks< this is ok, but passed it on to the devs anyway.
And again, so maybe even you will get it: I just reported this jitter; I did not complain or whine or anything else.
 
barkas
Member Candidate
Posts: 260
Joined: Sun Sep 25, 2011 10:51 pm

Re: MetaROUTER stability issues on certain MIPSBE and PPC bo

Tue Sep 04, 2012 12:02 pm

You're the only one thinking that just because of a 50ms ping to mr, something is wrong. Me and apparently Mikrotik think not.
We are on page 6 of a thread about a problem that is only halfway fixed and that took MikroTik a year to acknowledge, so I do not think that counts for much.
The way it is implemented it runs surprisingly well. What do you need a better ping for anyway?
60ms ping jitter may be ok for a laboratory, even if only just, but it is completely unacceptable for production deployment.
 
guy1with1mr1problems
newbie
Posts: 43
Joined: Wed Jun 13, 2012 12:34 pm

Re: MetaROUTER stability issues on certain MIPSBE and PPC bo

Tue Sep 04, 2012 12:22 pm

Fair enough, then they don't use tun/tap. As you say, getting <10ms latency with that would be impossible with HZ_100.
But metarouter still needs to run inside a thread at least. That means it needs to sleep and can 'wake up' at 100Hz at most.
If you send it an ICMP request just before it wakes up, then you get a very low latency (if it can reply in the same timeslot - which it most probably can). If you send an ICMP request just after it went to sleep, that would give you about 10ms latency. Makes sense?

Still the same thing stands - if host had HZ_1000, it would be 10 times better.
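
The arithmetic behind the HZ_100 versus HZ_1000 point can be sketched as follows. This is only a toy model, assuming the guest can be woken solely on host scheduler ticks and replies within the same timeslot; how MetaROUTER is actually scheduled is not known here, and the function names are made up for illustration.

```python
def tick_period_ms(hz: int) -> float:
    """Length of one scheduler tick in milliseconds."""
    return 1000.0 / hz


def wakeup_delay_ms(hz: int, ms_since_last_tick: float) -> float:
    """Delay until the next tick for a packet that arrives
    ms_since_last_tick after the guest last went to sleep (toy model)."""
    period = tick_period_ms(hz)
    return period - (ms_since_last_tick % period)


if __name__ == "__main__":
    for hz in (100, 1000):
        worst = tick_period_ms(hz)
        # With uniformly distributed arrivals the mean wait is half a tick.
        print(f"HZ_{hz}: worst case {worst:.1f} ms, average {worst / 2:.1f} ms")
```

In this model the worst-case wait is one full tick, so going from 100 Hz to 1000 Hz shrinks it by a factor of 10. Host load and missed timeslots, as described in the post, would only add to these numbers.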

And I repeat, do you realize that when other tasks load the host, metarouter gets a smaller timeslot to run in? It may even be suspended for a few task switches before it gets its turn. MetaROUTER isn't the only thing that's running.

If you gave it realtime priority, it would be guaranteed to run in every task switch. But if you do that, you have to limit the guest's CPU usage in some other way.

Quote:
You are basing all that on assumptions; there are more programming models than threads, and neither you nor I know how MT implemented MR.
Look a few posts up, I already provided a tiny patch for a tedious problem. I plan to fix the other issue too.
You worked on the guest side, not the host side. That isn't what I meant.
Also I think you're the one acting like you were god's answer to IT questions - after all, you program an almighty 8051! (By the way, I mostly write bare-metal code for ARM MCUs now, but I did go through the 8-bit phase too.)
Another assumption which couldn't be more wrong. The Intel 8051 was an example; I won't say that "I went through an 8-bit phase", however. I hope you are not one of those guys who uses a Cortex-Mx or even Ax for solving a problem which an Intel 8051 or AVR8 would handle in its spare time. But nice to hear that you work with MCUs anyway.
You're the only one thinking that just because of a 50ms ping to mr, something is wrong. Me and apparently Mikrotik think not. The way it is implemented it runs surprisingly well. What do you need a better ping for anyway?
Also wrong: janisk, who isn't all of MT, >thinks< this is ok, but passed it on to the devs anyway.
And again, so maybe even you will get it: I just reported this jitter, I did not complain or whine or anything else.
True, I base everything on assumptions. Although I saw the aforementioned shm event buffer in the mr guest patches. And I also agree it could be a lot faster, but it would be much more complex and prone to programming errors. I think the way it is now is ok. You were the one saying that 50ms latency is something absolutely unheard of and comparing virtualized network handling to something like a serial irq or a software interrupt on an 8051.

Btw. consider the alternative - you could execute the receive irq handler in the guest immediately. But you need to somehow guarantee that this handler plus all the other code of metarouter only runs as long as its priority allows. This is really very complex stuff. Imagine being on the implementer's side.

You made one good point though - someone could want a lower interactive latency and a guaranteed CPU timeslot for something (audio processing, etc.?). That's why I suggested a few improvements (option for HZ_1000 in host, realtime mr priority and cpu throttling). I'm sorry you took it as something other than trying to be helpful. But imho getting it stable first is much more important.

Btw. I also hate when complex MCUs are misused. But Cortex-M0 chips are now often cheaper, and the development software stack (gcc, etc.) is a lot better than for many 8-bit MCUs. But for hobby projects, I use AVRs a lot.
We are on page 6 of a thread about a problem that is only halfway fixed and that took MikroTik a year to acknowledge, so I do not think that counts for much.
True, it took a long time. But trust me, it's not easy. Just look at and try to understand the guest patches they provided. It really is complex stuff. But the fact that no one even acknowledged the problems for a long time is inexcusable.
60ms ping jitter may be ok for a laboratory, even if only just, but it is completely unacceptable for production deployment.
In a non-virtualized system - yes, that's alarming. But with a kernel running inside a thread, that's pretty nice, considering that on typical embedded kernels, task-switching responsiveness is pretty bad. I agree it could be a lot better, but we're running another kernel inside a container on a MIPS board; that's pretty amazing imo. I don't know of any other embedded system that can do that (except the new ARMv8 architecture that'll have hardware virtualization).
 
barkas
Member Candidate
Posts: 260
Joined: Sun Sep 25, 2011 10:51 pm

Re: MetaROUTER stability issues on certain MIPSBE and PPC bo

Tue Sep 04, 2012 2:01 pm

60ms ping jitter may be ok for a laboratory, even if only just, but it is completely unacceptable for production deployment.
In a non-virtualized system - yes, that's alarming. But with a kernel running inside a thread, that's pretty nice, considering that on typical embedded kernels, task-switching responsiveness is pretty bad. I agree it could be a lot better, but we're running another kernel inside a container on a MIPS board; that's pretty amazing imo. I don't know of any other embedded system that can do that (except the new ARMv8 architecture that'll have hardware virtualization).
It's paravirtualized anyway. It should have little or no performance impact. Certainly not this much.
 
janisk
MikroTik Support
Posts: 6283
Joined: Tue Feb 14, 2006 9:46 am
Location: Riga, Latvia

Re: MetaROUTER stability issues on certain MIPSBE and PPC bo

Tue Sep 04, 2012 2:21 pm

About the problem at hand - we fixed what could be reproduced. So NathanA, if you are willing to send us one of your really bad RB450G boards that is freezing up (those 6 - 7 seconds you mention), then please do so. My test routers seem to have entered a 2-week cycle and are not showing these issues.
 
janisk
MikroTik Support
Posts: 6283
Joined: Tue Feb 14, 2006 9:46 am
Location: Riga, Latvia

Re: MetaROUTER stability issues on certain MIPSBE and PPC bo

Tue Sep 04, 2012 3:02 pm

Some update.

Please check at what rate the GPIO interrupt count increases on your routers, and what health readings you have. If possible, report the IRQ count increase on a seemingly good router and the same value on a bad router.

NathanA, you can hold off on sending the router for a bit.
 
NathanA
Forum Veteran
Posts: 801
Joined: Tue Aug 03, 2004 9:01 am

Re: MetaROUTER stability issues on certain MIPSBE and PPC bo

Wed Sep 05, 2012 8:49 am

Please check at what rate the GPIO interrupt count increases on your routers, and what health readings you have.
Okay, I did so, and I'm getting some very interesting results.

First, I need to mention that I tried making a RouterOS MetaROUTER instead of an OpenWRT/Asterisk one on the "bad" routers. I found that it still does it, but not as often as when running Asterisk. Sometimes it takes 2-3 minutes before it kicks in instead of every 6 seconds. But it will eventually do it in under 10 minutes at the most (usually under 5), which is way more frequently than on the "good" routers (which can go for hours or days before they freeze).

When I'm running 1 RouterOS MetaROUTER and the problem happens on the "bad" routers, sometimes it will only happen once, and then work again for another couple minutes. But often what happens is it gets "stuck" in a loop or a pattern of doing a bunch of freezes all in a row. So, it will be working fine for a couple of minutes, then it will freeze for 6 seconds, and then unfreeze for 6 seconds, and then freeze again...and it might do that 2 times or it might do that 200 times...freeze, unfreeze, freeze, unfreeze...every 6 or so seconds. Eventually it will manage to break out of the cycle, but then it will just start again a few minutes later.

Now I can talk about the GPIO IRQs and health stats:

On both the "good" and "bad" routers, they are roughly the same. I ran them both on 24V power supplies; the "bad" router was showing roughly 22V @ 55C and the "good" router was showing 24.5V @ 62C. And if no MetaROUTERs are running, or if MetaROUTERs are running but everything is working for the moment, both routers show roughly 200 GPIO requests-per-second.

Here is where it gets weird. On the "bad" router, whenever it gets stuck in a 6-second freeze-up loop, in between each freeze-up, I observe that the GPIO rate slows to an average of 50 requests/sec. It stays at 50 GPIO/sec for the entire 6 seconds that it is responsive, before it freezes up again. After it unfreezes, once I see that the GPIO rate has gone back up to 200 requests/sec, I know that it has managed to break itself free of the freeze-up loop. If it hasn't and is still showing 50 requests/sec, then I know that it will freeze again in another 6 seconds.

Now, while it is "frozen" for 6 seconds, I can tell you that no GPIO interrupts get requested or processed. I can tell you this because I have "/system resource irq print interval=1" running the entire time in a MAC-Telnet session to the "bad" router, and after it "unfreezes", the GPIO count does not suddenly jump to a much higher number: it starts counting up again at either 50 requests/sec or 200 requests/sec from the value it was at right before it froze up.
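
A rough way to turn such periodic counter readings into the rates described above is sketched below. The sample values are invented for illustration, not taken from a real router, and the 25 and 125 requests/sec thresholds are arbitrary cut-offs placed between the ~0, ~50 and ~200 figures mentioned here.

```python
def irq_rates(samples):
    """Turn (time_s, gpio_irq_count) samples into per-second rates."""
    rates = []
    for (t0, c0), (t1, c1) in zip(samples, samples[1:]):
        rates.append((c1 - c0) / (t1 - t0))
    return rates


def classify(rate):
    """Label a rate using assumed thresholds (not official values)."""
    if rate < 25:
        return "frozen"
    if rate < 125:
        return "slow"
    return "normal"


if __name__ == "__main__":
    # Hypothetical readings: normal, slow, a 6-second freeze, slow again.
    samples = [(0, 1000), (1, 1200), (2, 1250), (8, 1250), (9, 1300)]
    for rate in irq_rates(samples):
        print(f"{rate:6.1f} req/s  {classify(rate)}")
```

A counter that resumes from the same value after a gap, as in the third interval here, matches the observation that no GPIO interrupts are counted while the router is frozen.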

I have also noticed that once the "bad" router gets stuck in a freeze-up pattern, occasionally the health stats show some real bizarre numbers which get "stuck" there. For example, right now as I am typing this, the "bad" router is in the middle of one of its freeze-up loops, and "/system health print interval=1" is stuck at 4.3V @ 13C.

If it's in the middle of one of its freeze-up loops and I manage to type "/metarouter disable [find]" fast enough during one of the 6-second windows when it is responsive, the GPIO rate immediately goes back to ~200 requests/sec AND the health stats will also immediately bounce back up to the right values.

I hope this helps.

-- Nathan
 
timberwolf
Member Candidate
Topic Author
Posts: 274
Joined: Mon Apr 25, 2011 12:08 pm
Location: Germany

Re: MetaROUTER stability issues on certain MIPSBE and PPC bo

Wed Sep 05, 2012 10:53 am

Thank you for your investigation Nathan, I can't afford time for tests right now.
This once again shows that the core problem seems to be interrupt handling.
