version 3.3 so far

Posted: Tue Apr 21, 2009 12:08 am
by lebowski
I upgraded to 3.3 at around 1pm MST. So far it looks like it is much better at SNMP collection.
Here is an example SNMP graphic of 3.2 vs 3.3.

Re: version 3.3 so far

Posted: Tue Apr 21, 2009 1:11 pm
by Trisc
SNMP/DNS seem to be fixed but my device legends are still missing - including bit rates on links - and http probes aren't working correctly either.

Devices report "connection closed"; however, I can reach the devices in my browser.

Re: version 3.3 so far

Posted: Tue Apr 21, 2009 11:08 pm
by lebowski
I am wondering if running the Dude on a quad-core processor is causing my issue. Although graphing in 3.3 is much better than in 3.2, I still have times when I am not graphing. If I snmpwalk an OID that it claims is timing out, I see the value. Is anyone running w2k3 sp2 on a quad core?

Re: version 3.3 so far

Posted: Wed Apr 22, 2009 3:47 am
by adamd292
I have a dual core. I don't have any single core machines :(
Every now and then, I have "gaps" in my snmp traces.
If I snmpwalk the device during a gap, I can see the data values that are being traced.
I have two Dude 3.1 servers. If one Dude has a "gap", the other Dude will be ok.

It appears regularly for particular device/oid combinations.
For example a device will accurately report CPU, but temperature will have gaps.
Some of the devices with problems are on the same VLAN as Dude.
Some devices are worse than others.

Is this just a problem for Sweetdude and me?

Maybe it's time I used Wireshark to track exactly what is happening?

Re: version 3.3 so far

Posted: Wed Apr 22, 2009 10:38 am
by oscarBravo
Just upgraded - it looks like I've lost the ability to identify interfaces using the "routeros" method:


Re: version 3.3 so far

Posted: Wed Apr 22, 2009 3:48 pm
by lebowski
I have forced affinity to a single core... have to see if it makes a difference.

I have run Wireshark, and when a probe is timing out I can hit reprobe and see the response in the capture, yet the probe stays in timeout.
It was much worse on 3.2 and 3.3 is much better, but I still can't use the Dude for notification since it has many false positives.

If you want to force affinity, you can just right-click the process in Task Manager. There is also a program to force affinity that seems to be a good candidate; it runs at startup, etc. ... /index.htm
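
For anyone who would rather script the affinity change than click through Task Manager, here is a minimal sketch in Python. It is not the tool mentioned above. It assumes Windows and calls the Win32 functions OpenProcess and SetProcessAffinityMask through ctypes. The helper names affinity_mask and pin_process are made up for this example, and the PID has to be looked up separately (for example in Task Manager).

```python
import ctypes
import sys

# Win32 access rights needed to change another process's affinity.
PROCESS_QUERY_INFORMATION = 0x0400
PROCESS_SET_INFORMATION = 0x0200


def affinity_mask(cores):
    """Build the affinity bitmask: bit N set means core N is allowed."""
    mask = 0
    for core in cores:
        if core < 0:
            raise ValueError("core index must be non-negative")
        mask |= 1 << core
    return mask


def pin_process(pid, cores):
    """Restrict an already-running process to the given cores (Windows only)."""
    if sys.platform != "win32":
        raise OSError("this sketch uses the Win32 API and only runs on Windows")
    from ctypes import wintypes

    kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
    kernel32.OpenProcess.argtypes = [wintypes.DWORD, wintypes.BOOL, wintypes.DWORD]
    kernel32.OpenProcess.restype = wintypes.HANDLE
    kernel32.SetProcessAffinityMask.argtypes = [wintypes.HANDLE, ctypes.c_size_t]
    kernel32.SetProcessAffinityMask.restype = wintypes.BOOL
    kernel32.CloseHandle.argtypes = [wintypes.HANDLE]

    handle = kernel32.OpenProcess(
        PROCESS_QUERY_INFORMATION | PROCESS_SET_INFORMATION, False, pid
    )
    if not handle:
        raise ctypes.WinError(ctypes.get_last_error())
    try:
        if not kernel32.SetProcessAffinityMask(handle, affinity_mask(cores)):
            raise ctypes.WinError(ctypes.get_last_error())
    finally:
        kernel32.CloseHandle(handle)
```

Calling pin_process(pid_of_dude, [0]) would limit the process to the first core. The setting does not survive a restart of the process, so it would have to be reapplied each time the service is restarted.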

Re: version 3.3 so far

Posted: Wed Apr 22, 2009 3:51 pm
by lebowski
Never mind, forcing affinity to a single core does not fix it.

Re: version 3.3 so far

Posted: Wed Apr 22, 2009 5:09 pm
by lebowski
I restarted the service and forced affinity as soon as I could and things have been much better since. I will report back in a few hours. Maybe you should try it as well oscarBravo.

Re: version 3.3 so far

Posted: Wed Apr 22, 2009 5:18 pm
by oscarBravo
Any way to do that without restarting the router the Dude server is running on?

Re: version 3.3 so far

Posted: Wed Apr 22, 2009 5:23 pm
by lebowski
Yeah, open the Services control panel... stop and start the service. If it won't restart, then kill dude.exe.
Download "Set Affinity II" beforehand, and as soon as the process is restarted you will see it and can set its affinity then.

It is supposed to be able to set affinity upon a reboot but I have not tried it.
If you add dude.exe as a favorite, it will show up at the top of the list once you enable that option in the program, so you might want to run Set Affinity II before you stop and start the service...
I hope you have good results!!! It would be so nice to have graphing work correctly.

Re: version 3.3 so far

Posted: Wed Apr 22, 2009 5:25 pm
by lebowski
Oh doh, you're running on a RouterBoard... never mind that, oscarBravo. adamd292 is the one running a dual-core PC. I don't know how to deal with a RouterBoard.

Re: version 3.3 so far

Posted: Thu Apr 23, 2009 4:30 pm
by lebowski
Ok, so it looks like dual and quad core have collection issues. With the program added to Set Affinity II and affinity automatically forced to a single core, my graphs are much better. I did have a hiccup (many false outages) when I was adding some devices to a submap yesterday morning, but restarting the service fixed it. That restart was also the first one that automatically went to a single core, so that might have had something to do with it. So now, about 20 hours later, things are still mostly clean in the graphs. Much better than without setting the affinity.
w2k3 sp2 quad core...

Re: version 3.3 so far

Posted: Mon Apr 27, 2009 4:23 pm
by lebowski
Just a quick update: since I forced affinity with the program above, the collection of all the probes has stabilized. I have not seen any false positives and there are no gaps in my graphs.

We need someone to determine what needs to be fixed to support multi-core processors.

Any Ideas?


Re: version 3.3 so far

Posted: Wed Apr 29, 2009 8:12 pm
by lebowski
Another update, the collection issue on multicore is not completely solved. It seems collection works just fine until discovery adds some devices.

I have added some auto discovery items to the Dude and it looks like it is stable, but if a few new items are discovered or if I manually add devices, polling of other devices becomes inconsistent. Stopping and restarting the service causes it to work fine for a while.

Re: version 3.3 so far

Posted: Thu Apr 30, 2009 3:27 am
by adamd292
I have upgraded to Dude 3.3 and started using Set Affinity II to limit Dude to my 1st Core.

I have *fewer* obviously false outages, but there still are some.

All of the false outages involve SNMP - "ping" and probes based on external commands do not generate false outages.
During a false outage, SNMP on the device in question can be walked and the individual oid associated with a failing probe can be requested and a valid value obtained.

CPU% on the Dude server and target device are <30%.
N/W traffic is <1Mb/s on 100Mb or 1Gb NICs.
No errors are reported in any logs (except the false outage in Dude).
Devices are all types: AIX servers, Fibre Switches, Windows servers, Storage Arrays, etc.

I get at least 2 false outages a day. Strangely they each last either 3:00 (20%) or 4:30 (80%).
My SNMP poll interval is set to 0:30 and timeout set to 0:29.
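
The two outage lengths can be checked against the poll interval with a few lines of Python. This is only arithmetic on the numbers quoted above. The helper names are made up for the example, and what the result implies about the Dude's internals is a guess.

```python
def to_seconds(mmss):
    """Convert an 'M:SS' string such as '4:30' to seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)


POLL_INTERVAL = to_seconds("0:30")  # poll interval from the post above


def missed_polls(outage, poll_interval=POLL_INTERVAL):
    """Return (whole poll intervals in the outage, leftover seconds)."""
    return divmod(to_seconds(outage), poll_interval)


for outage in ("3:00", "4:30"):
    polls, remainder = missed_polls(outage)
    print(f"{outage} outage = {polls} poll intervals, remainder {remainder}s")
```

Both durations come out as whole numbers of poll intervals (6 and 9), which may hint that each false outage ends on a poll boundary rather than at a random moment. That reading is an assumption, not something confirmed in this thread.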

Re: version 3.3 so far

Posted: Fri May 01, 2009 1:12 am
by lebowski
I have a new theory on why probes become unstable.
I am thinking that probes based directly on OIDs and not on functions are more susceptible to becoming sporadic.

For example the battery capacity probe I made looks like this...
I just converted my Cisco CPU probe to work like this...
I am going to disable the affinity forced to core 0 and let it run overnight. I will let you know as soon as I find out whether CiscoCPU stabilizes compared with my other probes built with OIDs.

Re: version 3.3 so far

Posted: Mon May 04, 2009 5:26 pm
by lebowski
Hey adamd292, could you convert one of the probes that you get false positives with into a probe that calls a function that reads the OID instead, like I did in the previous post?

I will be watching my ciscocpu probe this week to see how things go.

Re: version 3.3 so far

Posted: Mon May 04, 2009 7:45 pm
by lebowski
Never mind that... What I have found is that it runs much better with single-core affinity, but any time I add a device, either manually or through discovery, within an hour or so I start getting random failed SNMP reads.

The probe based on a function is better in that it will not get added to a device that does not have that counter. It also doesn't show as down when the counter is actually returning 0 as the result.

Re: version 3.3 so far

Posted: Tue May 05, 2009 11:15 am
by adamd292
Hi Sweetdude,

I agree, since going affinity=cpu0 my dude is much better behaved.
Whether this is a threading problem, or just a problem that gets fixed when the dude runs slower, I do not know.

I did some quick stats on SNMP Test Probe vs Function Probe with direct OID vs Function Probe that calls functions.

False positives over the past month: (Probes/Falses/Avg)

SNMP Test Probes 4/39/9.75
Function Probes with OID() calls 10/237/23.7
Function Probes with F'n calls 5/767/153.4
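
The Avg column is falses divided by probes. A short Python check, using only the figures in the table above (the function name is made up for the example):

```python
# (number of probes, false positives over the month), copied from the table above.
stats = {
    "SNMP Test Probes": (4, 39),
    "Function Probes with OID() calls": (10, 237),
    "Function Probes with F'n calls": (5, 767),
}


def falses_per_probe(probes, falses):
    """Average number of false positives per probe."""
    return falses / probes


for name, (probes, falses) in stats.items():
    print(f"{name}: {falses_per_probe(probes, falses)} falses per probe")
```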

Since I started using affinity, I've only had 7 falses.
This is much better.

Re: version 3.3 so far

Posted: Tue May 05, 2009 4:41 pm
by lebowski
Hey adamd292,

How many devices are you running? I have about 100. What operating system are you running it on?

With the better cpu probe and affinity set to quad core I had 400 false positives overnight.

I have changed back to single core. Automatic discovery didn't discover any devices last night, so even if discovering new devices helps break it sooner, it clearly breaks faster with more than one core.

So at this point it looks like there is still a bug in collection that doesn't align the SNMP result with the SNMP get on Windows 2k3 sp2.


Re: version 3.3 so far

Posted: Wed May 06, 2009 10:11 am
by adamd292
I have only 75 devices, but most devices have 10-12 probes.
I have automatic discovery turned off.
I'm running Win2k3 with latest patches.
The Dude servers run at about 10% CPU (as measured by Dude) and average 350kbps on a 1Gb card.

I did have a Dude 3.0RC running on WindowsXP with the same problems.

I'm looking at the possibility that losing/delaying the SNMP UDP packets causes Dude some grief.
I have ipMonitor from SolarWinds too, so I might do a lower level comparison of their activities.

Re: version 3.3 so far

Posted: Wed May 06, 2009 4:52 pm
by lebowski
Good info... I used to run about 100 devices, and once I got the affinity thing working I was rarely having outages. But we recently started a project to find out which computers are not being shut off overnight, so I am auto-discovering about a third of the computers, which I just ping. They are all on the same map. It is kind of funny.

Even with affinity set the dude starts getting false positives in about 1 hour now that I am monitoring about 800 devices. It could be auto discover that is throwing the dude off and I will be disabling that later today to find out.

I am now running 3.4 with the same experience.

My kbps is really not that high, only around 25kbps with occasional jumps to 300kbps about every 30 seconds, which makes perfect sense.

Here is a shot of 670 devices on one page. I wish there was an easier way to organize them and sort them.
Most people are not at work yet....

Re: version 3.3 so far

Posted: Mon Aug 31, 2009 4:17 am
by adamd292
I've worked it out.

Some of my devices, for their own fiendish reasons, take more than 10 seconds to respond to SNMP.
I found this out using the ipMonitor tool from SolarWinds. The worst are taking almost 60 seconds on some occasions.

Maybe the way to fix this is to increase Try Timeout?
It seems to be limited to 10 seconds, but if I could get it to 60 seconds, I think that I could get rid of my remaining graph gaps.
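
One way to check from outside the Dude whether a device is slower than the 10 second limit is to time a single SNMP get with a long timeout. The sketch below is not the ipMonitor method from the post. It assumes the net-snmp snmpget command is installed and on the PATH, and the host, community string and OID used with it are placeholders. The function names are made up for the example.

```python
import subprocess
import time

DUDE_TRY_TIMEOUT_S = 10  # limit reported in the post above


def build_snmpget_cmd(host, oid, community="public", timeout_s=60):
    """Build a net-snmp snmpget command: SNMP v2c, one try, long timeout."""
    return [
        "snmpget", "-v", "2c", "-c", community,
        "-t", str(timeout_s),  # seconds to wait for a reply
        "-r", "0",             # no retries, so one slow reply is timed cleanly
        host, oid,
    ]


def time_snmp_get(host, oid, community="public", timeout_s=60):
    """Run one snmpget and return (succeeded, elapsed_seconds)."""
    start = time.perf_counter()
    result = subprocess.run(
        build_snmpget_cmd(host, oid, community, timeout_s),
        capture_output=True,
        text=True,
    )
    elapsed = time.perf_counter() - start
    return result.returncode == 0, elapsed


def exceeds_dude_timeout(elapsed_s, limit_s=DUDE_TRY_TIMEOUT_S):
    """True if a reply this slow would arrive after the 10 second limit."""
    return elapsed_s > limit_s
```

For example, time_snmp_get("192.0.2.10", "1.3.6.1.2.1.1.3.0") would time a sysUpTime request against a placeholder address. Running it in a loop during a graph gap would show whether the slow replies line up with the gaps.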