Dude refuses login after router reboot

I have been running The Dude v3.6 on RouterOS 5.11 on NA-820 x86 Architecture for a month now.
It happened once before that The Dude showed a lot of services on a lot of devices as down for no reason, but that cleared up after 5 minutes or so. Well, this happened again today, but this time they stayed down for more than one hour. I decided to reboot the RouterOS box hosting The Dude to try to clear the situation.
Now I cannot log in :slight_smile: I get “Login failed: Invalid username or password”

I have tried my own user/pass, the master admin as well as the helpdesk user/pass we have. None seem to work.
I have tried rebooting 2 times already. RouterOS shows the dude package as running/enabled.

Any ideas how to remedy this situation?

Holy s***!
I tried logging in with the default admin password, and I got in :frowning: The Dude configuration is blank!!!

Why? WTF?

Please, anyone, any ideas on how/why this happened? How can I protect myself from experiencing this again?


I started importing a week old backup and will let you know if the newly imported configuration doesn’t survive the next reboot.
I lost A LOT of configuration here! Please help.

We had this issue occurring quite frequently on v4.X versions of the dude running on top of RouterOS.

We ended up running The Dude only on Windows machines, where we can take direct backup copies of the Dude folder. :frowning:
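
For anyone who goes the Windows folder-copy route, here is a minimal Python sketch of a timestamped copy that also prunes old copies. The function name, the `dude-` prefix and the paths in the usage comment are made up for illustration; point it at wherever your install keeps its data. It is probably safest to stop the Dude service before copying so files are not copied mid-write, but that part is not shown.

```python
import shutil
import time
from pathlib import Path
from typing import Optional


def backup_folder(source: Path, backup_root: Path, keep: int = 7,
                  stamp: Optional[str] = None) -> Path:
    """Copy `source` into a new timestamped folder under `backup_root`.

    Only the newest `keep` backups are retained. Returns the new backup path.
    """
    if stamp is None:
        stamp = time.strftime("%Y%m%d-%H%M%S")
    target = backup_root / f"dude-{stamp}"
    # copytree creates missing parent folders and fails if target exists.
    shutil.copytree(source, target)

    # Timestamped names sort chronologically, so the oldest come first.
    backups = sorted(p for p in backup_root.glob("dude-*") if p.is_dir())
    for old in backups[:max(len(backups) - keep, 0)]:
        shutil.rmtree(old)
    return target


# Example call (hypothetical paths, adjust to your own install):
# backup_folder(Path(r"C:\Program Files\Dude\data"), Path(r"D:\backups\dude"))
```

Run once a day from Task Scheduler, this would have limited the loss above to one day instead of one week.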

The import procedure worked well. I tried several reboots after the import and there were no problems.
But I lost a week's worth of data + some nifty tweaks :frowning:

I would rather not move The Dude to Windows, as I would have to purchase a Windows license, and I already have good MikroTik hardware and distributed Dude agents.

Is there a way to do automatic daily backups/exports ?

I am curious whether your settings caused the database to grow larger than 2 GB. How big was your import? If you are saving more than 2 days of raw values (the raw value keep time), the database can get huge. FYI, I recently vacuumed my database and it shrank from about 1.6 GB to 1 GB.

As I said, I’ve been using The Dude for approximately one month, so my import file is very small, only 17MB :slight_smile:

And I haven’t changed the settings on what data to keep and for how long. I didn’t even know that you can do that with the 3.6 version.

Is there a way to vacuum the db on RouterOS?

That makes more sense. You can’t export configurations on RB in the v4 beta, so you had an export from 3.6 and were able to import it, but you lost one week’s worth of work since you could no longer log in. Sorry for your troubles (and I should have read the whole thread to be clear). You are probably way past this by now, but I am guessing that you did not have a way to save the corrupt configuration file?

Anyhow, if you did, you can find the descriptions of functions and probes in the XML and copy and paste them back into your rebuild of The Dude. You might already know this, but I have seen pasting XML directly into The Dude cause problems in the past. Maybe that caused your trouble?

Here is an example of a single probe in XML.

<?xml version="1.0" ?> 13 22917079 IOrec 8 oid("1.3.6.1.4.1.2021.11.58.0",10,29) if(oid("1.3.6.1.4.1.2021.11.58.0",10,29), "" ,"IOPS error") ratessiorawrec() IOPS

I believe that a duplicate system ID in the XML could be very troubling for The Dude and could break your configuration. Even if you have not pasted XML to acquire new probes, I would be careful and/or only type information in manually to create probes and functions.

There are the occasional users who lose their configurations. Anyone would hate to have that trouble, but since you’re new and just got started, thankfully not too much was lost. At this point I am trying to help you find out what could have caused it so it doesn’t happen in the future. Are there any other additions or changes you might be able to point to? Since the v4 beta doesn’t export from RB, you might consider the Windows version so you can make backups.

Lastly, if a service is marked as down, the default negative cache of 300 seconds causes the service to stay down for 5 minutes (examine the “oid” function). So when building your own probes you should set the negative cache time to something below the retry interval. This way the device will come back up as soon as the next polling interval hits. Ping doesn’t have this negative cache time set.
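
To make the timing concrete, here is a small Python simulation. It is only a toy model of my own, not how The Dude is implemented: it assumes a failed lookup is reused for the whole negative cache time and ignores the normal (positive) cache. The function name and parameters are made up for the example.

```python
def first_up_poll(poll_interval, negative_cache, down_from, down_until):
    """Return the time (s) of the first poll that reports 'up' again.

    Polls happen at 0, poll_interval, 2*poll_interval, ...
    The device is really down for down_from <= t < down_until.
    Returns None if the outage fell between polls and was never noticed.
    """
    cache_expires = None  # until this time a cached failure is reused
    seen_down = False
    t = 0
    while True:
        cached = cache_expires is not None and t < cache_expires
        if not cached:
            if down_from <= t < down_until:
                # Real failure: remember it for `negative_cache` seconds.
                cache_expires = t + negative_cache
                seen_down = True
            elif seen_down:
                return t  # fresh lookup succeeded after an outage
            elif t >= down_until:
                return None
        t += poll_interval


# A 30 s outage (t=10..40) with 30 s polling:
print(first_up_poll(30, 300, 10, 40))  # default negative cache -> 330
print(first_up_poll(30, 15, 10, 40))   # negative cache below the interval -> 60
```

In this model a 30-second blip keeps the service red for about five minutes with the default, but only until the next poll once the negative cache is shorter than the polling interval.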

HTH,
Lebowski

Thanks for your valuable input.

What happened is, The Dude marked a lot of services on a lot of devices as down and they stayed that way for many hours. I tried re-probe, I tried everything, and nothing helped. So I decided to reboot.
After the RouterOS reboot, The Dude would not let me log in. That is when I wrote the post.
After another reboot and testing many user accounts, I decided to try to log on with the default admin user and an empty password, and guess what :slight_smile: I logged in, and The Dude was reset to factory defaults.

Thinking back, I should have done an export before the reboot. (mental note) So no export for the failed state :frowning:

Not sure if I was clear on this, but I have never run the v4 beta. I have always used v3.6 on ROS 5.11 x86.

I have never imported from backup before, and never tweaked the xml files, so that is not the case.
I cannot think of anything that could have caused this. My mikrotik device is pretty powerful and is never out of resources. It just happened. Well, maybe I am missing something :slight_smile:

What is negative cache and where do I configure it for my probes? Any documentation?

When you get some time, read the probe thread and the wiki page on how to build a good probe. Examine the functions in contents and read their descriptions. If you plan to stay on RB, you should stay on 3.6, since 4 won’t export. You had mentioned it; I just glossed over it.

I saw all probes go down once, about 2 weeks ago. They were down for a long time, since it started on a Friday night. I restarted the service and everything was fine. I am, however, using w2k3 SP2 and 4.3 beta.

Here is the description of OID from the functions contents: OID returns value of given snmp OID. Only first parameter mandatory. First parameter - oid string, second - cache time - default 5 seconds (5.0), third - negative cache time - default 5 minutes (300.0), forth - ip address (overrides context device), fifth - snmp profile (overrides context device).
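
If you generate a lot of probes, a tiny helper can keep the positional arguments straight. This is just a Python sketch of mine that builds the expression text. The helper name is invented, it covers only the first three parameters from the description above, and it uses the defaults quoted there.

```python
import re


def oid_expr(oid, cache_time=5, negative_cache_time=300):
    """Build the text of an oid(...) call for pasting into a probe.

    Arguments are positional in The Dude, so the cache time always has
    to be written out before the negative cache time.
    """
    # Accept only dotted-decimal OIDs such as 1.3.6.1.2.1.1.3.0
    if not re.fullmatch(r"\d+(\.\d+)*", oid):
        raise ValueError(f"not a numeric OID: {oid!r}")
    return f'oid("{oid}",{cache_time:g},{negative_cache_time:g})'


print(oid_expr("1.3.6.1.4.1.2021.11.58.0", 10, 29))
# oid("1.3.6.1.4.1.2021.11.58.0",10,29)
```

That output matches the call used in the IOPS probe posted earlier in the thread.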

Here is an example of setting the cache time to 10 seconds and the negative cache time to 15 seconds. I just wish negative cache time could be configured globally, since I never want to wait 5 minutes for a probe to come back up. I have found that The Dude has too many false positives at 30-second polling; after upping it to 1 minute I almost never have a false positive.
oid("1.3.6.1.2.1.2.2.1.14.10625",10,5)

I hope you don’t have any more trouble getting your monitoring set up the way you want. The Dude is certainly a very neat product.

Lebowski

Your example is set to 5 sec negative cache :slight_smile:

I plan to stay on v3.6, as v4 seems to have many, many bugs. I would also like to stay on ROS, as it makes more sense to me. But I am willing to move to Windows (preferably Linux) if needed. If I could export/back up the Dude configuration on a daily basis (maybe from ROS), then I would be very happy.

I have already reconfigured The Dude to poll every minute instead of every 30 sec, and that seems to have eliminated a lot of false positives. I still get some false positives on the devices I have in the Middle East, which is fine as they are far away in terms of RTT :slight_smile: I have already ordered an RB1100AHx2 for Dubai, which will run a Dude agent and should eliminate such false positives.

If you don’t mind I would like to befriend you on Skype, as this is a very inefficient way of communicating :slight_smile: My skype account is: fbsdmon

Good catch on that negative cache, I will friend you in a bit…

Lebowski, I have exactly the same problem with v4 beta3 on Windows Server 2003. Could you solve the problem? I’m stuck with a one-month-old backup…

Hey benjamingois,

I can give advice on how to proceed, but if you have lost your configuration and you are running the RB version of 4.3b, you are looking at rebuilding the whole thing. You might want to consider a switch to Windows for the backup reason (if this is your trouble).

I am not sure what your exact trouble is since there were several issues discussed. Post a bit more detail so folks can help you out.

Lebowski