We are experiencing big issues with The Dude running with about 1000 devices. I would like to know the limits and make a decision whether I should fix it (if possible) or move to another system, like Zabbix. I have people saying that Dude is designed to run with 10 - 100 devices and I don’t know if this is true.
The problem is: the whole MikroTik virtual machine just hangs. It happened twice this month and we had to reboot the whole virtual machine. Another issue is that some of the graphs just stop reading and drawing until the Dude process is restarted, but when these graphs start working again, other ones stop. There are also many failures when reading custom SNMP probes (like RSSI on Ubiquiti links, UPS voltages). These just fail randomly and Dude says there is no response from the device.
We moved half of our network to a Dude Agent, but the problem is exactly the same. The agent did not help at all!
Does anybody have a similar problem? What is the maximum number of devices that Dude can support? How can I debug or troubleshoot this situation? What factors can decrease Dude performance, like SNMP version, RouterOS checks, etc.?
Server characteristics:
ROS 6.43.16 on a CHR.
The host is running ESXi 6.5.0 Update 1.
The host has a Xeon X5670 @ 2.93GHz with an HP P410 RAID controller and 2 Samsung Pro SSDs.
The Dude VM has 4 GB RAM (only 0.16 GB active) and 6 CPUs. The CPU load does not go above 25% at peaks.
I run an RB1100 Dude Edition on stable with 400 clients, so the CHR should be capable enough. What I changed was how it polled, the retention, and what was polled.
Dude is a good system for up/down and trend analytics.
Retention 180 days
Poll every 15 minutes
Don’t poll everything. I poll the internet connection. When it overloads, I poll the system behind it to find the culprit. But with trend analytics that is almost never needed.
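To put the polling advice above into numbers, here is a rough sketch of how the poll interval and retention drive the load. The function name and the example figure of 5 probes per device are assumptions for illustration only, and the sample count assumes nothing is consolidated, which is not necessarily how Dude stores its data.

```python
def polling_load(devices, probes_per_device, interval_s, retention_days):
    """Return (polls per second, raw samples kept over the retention window)."""
    data_sources = devices * probes_per_device
    polls_per_second = data_sources / interval_s
    # Number of samples one data source produces during the retention window.
    samples_per_source = retention_days * 86400 // interval_s
    return polls_per_second, data_sources * samples_per_source


if __name__ == "__main__":
    # 400 devices and 180 days come from the post above; 5 probes is made up.
    for interval_s in (30, 900):
        rate, stored = polling_load(400, 5, interval_s, 180)
        print(f"interval {interval_s:>4}s: {rate:8.2f} polls/s, {stored:,} raw samples")
```

Going from a 30 second interval to a 15 minute interval cuts both the poll rate and the stored samples by a factor of 30, which is the kind of reduction the settings above are aiming for.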
Dear Friends,
I’ve got major issues with Dude instability as well. We run the Dude server on an RB1100AH with probably 200 devices. It flat out won’t run, and the client becomes unresponsive on most of our laptops. I reorganized the layout of various sites on a large map, and have several submaps. I noticed that when I laid out the map to be larger, so I could see white space between individual sites, it really started crashing.
I think maybe this has more to do with Dude client limitations or some glitch within the Windows client, and not the Dude server itself. Since upgrading to 6.46.2 the RB seems to have been stable, whereas some previous versions caused a complete lockup of the server.
Can Mikrotik please fix the Dude Client and investigate this issue?
Could Mikrotik please add some limits, design considerations, etc. to the Dude Wiki so we know how many nodes it can handle, and how many services we can add to however many nodes before we should expect to see issues?
I have a CHR version 6.42.2 running on an AWS t2.medium instance with over 3,000 devices and around 50 submaps, and I haven’t experienced any major issues.
Right now we are running out of space and I need to increase our storage from 1 GB to 16 GB. I haven’t done this before, but I’m sure someone has already posted how.
These are some of the issues that we are having.
The timer for a resolved outage sometimes doesn’t stop and just keeps going.
We are not able to modify the device settings for some of the devices when a service like ping is down. It doesn’t happen with all the devices.
PNG image files disappear from the dude\file directory. We created some custom backgrounds and a few of them disappeared.
Email/SMS alerts take a lot of resources, so we rarely turn them on; they are off most of the time.
Well, it was abandoned for some time, and now they are doing something with it. I’m not defending Mikrotik at all. But you need to understand the limits and the purpose of The Dude. It is for new companies, startups… It is a good tool to start with, but once you grow, you need to change it. When you get to a size of 1k devices, replace it. When you have more than 3 Gbps of traffic, replace it. It will read the traffic from RouterOS devices (it will be a CCR1072, but it will reboot randomly every so often), and other devices will draw strange graphs due to 32-bit counters for SNMP (because at the end of the day they want you to buy RouterOS to have 64-bit counters). So, personally, I would suggest Zabbix. It’s very similar to The Dude, a little bit more complicated, but at the end of the day it’s a working solution. Now we are running it on a dedicated x86 server with SSD and we have some issues. We are slowly deploying Zabbix.
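For anyone wondering why 32-bit SNMP counters produce strange graphs at these speeds, here is a small sketch of the arithmetic. A 32-bit octet counter wraps after 2**32 bytes, so if the poller samples less often than the wrap time, traffic goes missing. The function names are my own, and the wrap handling assumes at most one wrap between two polls; this is not a description of what Dude does internally.

```python
COUNTER32_MODULUS = 2 ** 32  # a 32-bit counter wraps to 0 after 2**32 - 1


def seconds_until_wrap(bits_per_second):
    """Time for a 32-bit octet counter to wrap at a constant traffic rate."""
    bytes_per_second = bits_per_second / 8
    return COUNTER32_MODULUS / bytes_per_second


def counter_delta(previous, current, modulus=COUNTER32_MODULUS):
    """Octets between two readings, assuming the counter wrapped at most once."""
    return (current - previous) % modulus


if __name__ == "__main__":
    # 3 Gbps is the figure mentioned in the post above.
    print(f"wrap time at 3 Gbps: {seconds_until_wrap(3e9):.1f} s")
    # A reading taken just before a wrap followed by one just after it.
    print(counter_delta(4_294_967_000, 704))
```

At 3 Gbps the counter wraps roughly every 11 seconds, so any poll interval longer than that cannot tell how many times it wrapped, and only 64-bit counters give a correct graph.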
I am looking for the user “antoxic”, who has implemented Zabbix. It seems that I will have to do the same here this year.
Maybe we could exchange useful information and work together?
Could you give me your email address, or send me a message at friedl@rauter-it.at? Thanks
Currently I am running Dude with 1500+ devices, 2000+ SNMP links and 8000+ data sources on a CHR.
Well, I did not deploy Zabbix fully. The major problem that I’ve found is that it’s super complicated to set up the parent dependencies. When one host goes down, all the connected hosts are also reported as down. At the moment we have the core equipment there and we use Grafana to see the traffic.
P.S. But from time to time I really do suffer from Dude’s slowness.
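On the parent dependency problem mentioned above, the idea is simple even if the configuration is not: report only the down hosts whose parent is still up. Here is a minimal sketch of that logic. The host names and the `parent_of` mapping are hypothetical, and this is plain Python for illustration, not the Zabbix API or its configuration format.

```python
def root_cause_hosts(parent_of, down):
    """Return the down hosts whose parent is up (or that have no parent)."""
    down = set(down)
    roots = set()
    for host in down:
        parent = parent_of.get(host)
        # A host is a root cause if nothing above it is down as well.
        if parent is None or parent not in down:
            roots.add(host)
    return roots


if __name__ == "__main__":
    # Hypothetical topology: core -> towers -> customer devices.
    parent_of = {
        "core": None,
        "tower-a": "core",
        "tower-b": "core",
        "cpe-1": "tower-a",
        "cpe-2": "tower-a",
    }
    print(root_cause_hosts(parent_of, {"tower-a", "cpe-1", "cpe-2"}))
```

With this topology, an outage of tower-a produces a single alert for tower-a instead of three, which is what the dependencies are meant to achieve.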