version 3.3 so far

I upgraded to 3.3 at around 1pm MST. So far it looks much better at SNMP collection.
Here is an example SNMP graph comparing 3.2 and 3.3.
vercomp.JPG

SNMP/DNS seem to be fixed, but my device legends are still missing - including bit rates on links - and HTTP probes aren’t working correctly either.

Devices report “connection closed”, but I can still reach them in my browser.

I am wondering whether running the Dude on a quad-core processor is causing my issue. Although graphing in 3.3 is much better than in 3.2, there are still periods when nothing is graphed. If I snmpwalk an OID that the Dude claims is timing out, I see the value. Is anyone else running W2K3 SP2 on a quad core?
graph.JPG

I have a dual core. I don’t have any single core machines :(
snmp_gaps.png
Every now and then, I have “gaps” in my snmp traces.
If I snmpwalk the device during a gap, I can see the data values that are being traced.
I have two Dude 3.1 servers. If one Dude has a “gap”, the other Dude will be ok.
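To compare the two servers more precisely than by eyeballing the graphs, the sample times could be exported and checked for gaps. This is only a minimal sketch: the timestamps below are made up, and the 30-second poll interval and the 1.5x tolerance are assumptions, not values taken from the Dude.

```python
from datetime import datetime, timedelta

def find_gaps(timestamps, poll_interval, tolerance=1.5):
    """Return (start, end) pairs where two consecutive samples are
    further apart than tolerance * poll_interval."""
    limit = poll_interval * tolerance
    ordered = sorted(timestamps)
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if b - a > limit]

# Hypothetical data: one sample every 30 s, with polls 4, 5 and 6 missing.
start = datetime(2008, 1, 1, 13, 0, 0)
samples = [start + timedelta(seconds=30 * i) for i in range(10) if i not in (4, 5, 6)]

gaps = find_gaps(samples, timedelta(seconds=30))
for a, b in gaps:
    print(a.time(), "->", b.time(), "no data for", b - a)
```

Running the same check on exports from both servers would show whether their gaps ever overlap.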

It appears regularly for particular device/oid combinations.
For example a device will accurately report CPU, but temperature will have gaps.
Some of the devices with problems are on the same VLAN as Dude.
Some devices are worse than others.

Is this just a problem for Sweetdude and me?

Maybe it’s time I used Wireshark to track exactly what is happening?

Just upgraded - it looks like I’ve lost the ability to identify interfaces using the “routeros” method:

I have forced affinity to a single core… have to see if it makes a difference.

I have run Wireshark, and when a probe is timing out I can hit reprobe and see the response arrive, yet the probe stays in timeout.
It was much worse on 3.2 and 3.3 is much better, but I still can’t use the Dude for notification because it produces many false positives.

If you want to force affinity, you can just right-click the process in Task Manager. There is also a program for forcing affinity that looks like a good candidate: it runs at startup, etc.
http://www.geocities.com/edgemeal_software/SetAffinity/index.htm
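For anyone who would rather script this than use a GUI tool, here is a rough sketch of the same idea using the Win32 `OpenProcess` and `SetProcessAffinityMask` calls through `ctypes`. It assumes a Python 3 interpreter is available on the Dude server and that the script runs with enough rights to open the process. The helper names `affinity_mask` and `pin_process` are made up for illustration, and you have to look up the PID of dude.exe yourself.

```python
import ctypes
import sys

PROCESS_QUERY_INFORMATION = 0x0400
PROCESS_SET_INFORMATION = 0x0200

def affinity_mask(cores):
    """Build a bitmask with one bit set per allowed core (core 0 -> bit 0)."""
    mask = 0
    for core in cores:
        mask |= 1 << core
    return mask

def pin_process(pid, cores):
    """Restrict an already running Windows process to the given cores."""
    if sys.platform != "win32":
        raise OSError("this sketch only applies to Windows")
    kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
    kernel32.OpenProcess.restype = ctypes.c_void_p
    kernel32.OpenProcess.argtypes = [ctypes.c_uint32, ctypes.c_int, ctypes.c_uint32]
    kernel32.SetProcessAffinityMask.restype = ctypes.c_int
    kernel32.SetProcessAffinityMask.argtypes = [ctypes.c_void_p, ctypes.c_size_t]
    kernel32.CloseHandle.argtypes = [ctypes.c_void_p]

    access = PROCESS_QUERY_INFORMATION | PROCESS_SET_INFORMATION
    handle = kernel32.OpenProcess(access, False, pid)
    if not handle:
        raise ctypes.WinError(ctypes.get_last_error())
    try:
        if not kernel32.SetProcessAffinityMask(handle, affinity_mask(cores)):
            raise ctypes.WinError(ctypes.get_last_error())
    finally:
        kernel32.CloseHandle(handle)

# Core 0 only gives mask 1; all four cores of a quad core give mask 15.
print(affinity_mask([0]), affinity_mask([0, 1, 2, 3]))
```

Calling `pin_process(pid, [0])` right after the service restarts would correspond to forcing dude.exe onto the first core. Like the Task Manager setting, this does not survive a restart of the process.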

Never mind, forcing affinity to a single core does not fix it.

I restarted the service and forced affinity as soon as I could, and things have been much better since. I will report back in a few hours. Maybe you should try it as well, oscarBravo.

Any way to do that without restarting the router the Dude server is running on?

Yeah, open the Services control panel, then stop and start the service. If it won’t restart, kill dude.exe.
Download “Set Affinity II” beforehand; as soon as the process has restarted you will see it in the list and can set its affinity.

It is supposed to be able to set affinity upon a reboot but I have not tried it.
affinity.JPG
If you add dude.exe as a favorite, it will show up at the top of the list once you enable that option in the program, so you might want to run Set Affinity II before you stop and start the service…
I hope you have good results!!! It would be so nice to have graphing work correctly.

Oh, doh, you’re running on a RouterBOARD… never mind that, oscarBravo. adamd292 is the one running a dual-core PC. I don’t know how to deal with a RouterBOARD.

OK, so it looks like both dual and quad core have collection issues. With dude.exe added to Set Affinity II and affinity automatically forced to a single core, my graphs are much better. I did have a hiccup (many false outages) when I was adding some devices to a submap yesterday morning, but restarting the service fixed it. That restart was also the first one where the process automatically went to a single core, so that might have had something to do with it. Now, about 20 hours later, the graphs are still mostly clean. Much better than without setting the affinity.
affinity.JPG
w2k3 sp2 quad core…

Just a quick update: since I forced affinity with the program above, collection for all the probes has stabilized. I have not seen any false positives and there are no gaps in my graphs.

We need someone to determine what needs to be fixed to support multi-core processors.

Any ideas?

Thanks
SD

Another update: the collection issue on multicore is not completely solved. It seems collection works just fine until discovery adds some devices.

I have added some auto discovery items to the Dude and it looks stable, but if discovery adds a few new items or I manually add devices, polling of other devices becomes inconsistent. Stopping and restarting the service makes it work fine for a while.

I have upgraded to Dude 3.3 and started using Set Affinity II to limit Dude to my first core.

I have fewer obviously false outages, but there still are some.

All of the false outages involve SNMP - “ping” and probes based on external commands do not generate false outages.
During a false outage, SNMP on the device in question can be walked and the individual oid associated with a failing probe can be requested and a valid value obtained.
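To line these manual checks up with the false outages, one option is to log an independent `snmpget` at the same interval and compare timestamps afterwards. The sketch below assumes the Net-SNMP `snmpget` command is installed and on the PATH of the machine doing the checking. The host address, community string and OID (sysUpTime.0) are placeholders, and the function names are made up for illustration.

```python
import subprocess
import time
from datetime import datetime

def build_snmpget_cmd(host, oid, community="public", timeout_s=5):
    """Net-SNMP snmpget command line: SNMP v2c, one attempt, no retries."""
    return ["snmpget", "-v", "2c", "-c", community,
            "-t", str(timeout_s), "-r", "0", host, oid]

def poll_once(host, oid, **kwargs):
    """Run snmpget once and return (succeeded, output text)."""
    result = subprocess.run(build_snmpget_cmd(host, oid, **kwargs),
                            capture_output=True, text=True)
    return result.returncode == 0, (result.stdout or result.stderr).strip()

def log_polls(host, oid, interval_s=30, count=10):
    """Print one timestamped line per poll so it can be matched to the Dude's outage log."""
    for _ in range(count):
        ok, text = poll_once(host, oid)
        stamp = datetime.now().isoformat(timespec="seconds")
        print(stamp, "OK" if ok else "FAIL", text)
        time.sleep(interval_s)

# Placeholder target: 192.0.2.10 is a documentation address, not a real device.
print(" ".join(build_snmpget_cmd("192.0.2.10", "1.3.6.1.2.1.1.3.0")))
```

If this log shows clean answers during a period the Dude reports as down, that would point at the collector rather than the device or the network.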

CPU% on the Dude server and target device are <30%.
N/W traffic is <1Mb/s on 100Mb or 1Gb NICs.
No errors are reported in any logs (except the false outage in Dude).
Devices are all types: AIX servers, Fibre Switches, Windows servers, Storage Arrays, etc.

I get at least 2 false outages a day. Strangely they each last either 3:00 (20%) or 4:30 (80%).
My SNMP poll interval is set to 0:30 and timeout set to 0:29.
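A quick bit of arithmetic on those numbers, offered only as an observation and not as an explanation of the cause: both outage lengths are whole multiples of the 0:30 poll interval.

```python
def polls_in(duration, interval):
    """Convert two 'm:ss' strings to seconds and return duration / interval."""
    def seconds(text):
        minutes, secs = text.split(":")
        return int(minutes) * 60 + int(secs)
    return seconds(duration) / seconds(interval)

for outage in ("3:00", "4:30"):
    print(outage, "=", polls_in(outage, "0:30"), "poll intervals")
```

So the short outages span 6 polls and the long ones 9, which fits outages that start and end on poll ticks.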

I have a new theory on why probes become unstable.
I am thinking that probes based directly on OIDs and not on functions are more susceptible to becoming sporadic.

For example the battery capacity probe I made looks like this…
cap.JPG
I just converted my Cisco CPU probe to work like this…
ciscocpu.JPG
I am going to disable the forced affinity to core 0 and let it run overnight. I will let you know as soon as I find out whether CiscoCPU is more stable than my other probes built directly on OIDs.

Hey adamd292, could you convert one of the probes that gives you false positives into a probe that calls a function that reads the OID instead, like I did in the previous post?

I will be watching my ciscocpu probe this week to see how things go.

Never mind that… What I have found is that it runs much better with single-core affinity, but any time I add a device, either manually or through discovery, I start getting random failed SNMP reads within an hour or so.


The probe based on a function is better in that it will not get added to a device that does not have that counter. It also doesn’t show as down when the counter is actually returning 0 as the result.

Hi Sweetdude,

I agree, since going affinity=cpu0 my dude is much better behaved.
Whether this is a threading problem, or just a problem that gets fixed when the dude runs slower, I do not know.

I did some quick stats on SNMP Test Probe vs Function Probe with direct OID vs Function Probe that calls functions.

False positives over the past month: (Probes/Falses/Avg)

SNMP Test Probes 4/39/9.75
Function Probes with OID() calls 10/237/23.7
Function Probes with F’n calls 5/767/153.4
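The Avg column is simply falses divided by probes. Here is a small sketch that recomputes it from the figures above and adds the overall totals, which are derived here and were not in the original numbers:

```python
# (number of probes, false positives over the past month), taken from the table above
stats = {
    "SNMP Test Probes": (4, 39),
    "Function Probes with OID() calls": (10, 237),
    "Function Probes with F'n calls": (5, 767),
}

averages = {name: falses / probes for name, (probes, falses) in stats.items()}
for name, avg in averages.items():
    print(f"{name}: {avg:g} falses per probe")

total_probes = sum(probes for probes, _ in stats.values())
total_falses = sum(falses for _, falses in stats.values())
print(f"All probes: {total_falses} falses over {total_probes} probes")
```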

Since I started using affinity, I’ve only had 7 falses.
This is much better.

Hey adamd292,

How many devices are you running? I have about 100. What operating system are you running it on?

With the better CPU probe and affinity set to all four cores, I had 400 false positives overnight.

I have changed back to single core. Automatic discovery didn’t discover any devices last night, so even if discovering new devices helps break it sooner, it clearly breaks faster with more than one core.

So at this point it looks like there is still a bug in collection on Windows 2k3 SP2 that fails to match the SNMP result with the SNMP get.

Thanks,
SD