Common Factor for Failure....

beerfiend · Thu Mar 13, 2008 6:23 pm

We recently started playing with the Dude (someone with a sense of humor named this product, didn't they?), and after running it for about a month we hit a device drop issue. Specifically, the devices went into a completely unresponsive state and had to be rebooted. This happened three times. The only affected devices were Cisco wireless APs (1200 Series) and Cisco 1400 Series bridges. Basically our entire wireless network plopped on us. We had a few theories as to what happened, and this is what we came up with.

The first time we were just reacting to a blow like that: rebooted devices, yadda yadda yadda, got them all back online and started checking logs. Funny thing is, when a 1200 AP goes down it erases its logs, so that sucks. We had no indication as to what happened. The second time we wondered if it might be our WLSE appliance, which manages all these devices, so we took that offline. The third time it happened, we grasped at straws and tried to console into a device to get the logs before reboot. This was unsuccessful. Still grasping at straws, we disabled the Dude server. Since disabling the Dude server we have not had this problem recur. What I'm wondering is whether anyone else has ever seen anything like this in relation to running the Dude?

Don't get me wrong here. I'd really love to continue running the Dude, as it's a more versatile and easier to use piece of equipment than a lot of other network monitors. I just need to be sure of stability before investing any more time in it.

VTWifiGuy · Thu Mar 13, 2008 10:23 pm

What services are being monitored on these devices?

beerfiend · Fri Mar 14, 2008 3:14 pm

Telnet, ping, SNMP, HTTP, and prop wireless (checks an OID for the SSID to identify). Nothing too heavy. We are also monitoring the links via SNMP for traffic stats.

VTWifiGuy · Fri Mar 14, 2008 4:13 pm

I'd betcha a beer it's the telnet service. With our Waverider brand end-user and AP systems, if telnet was monitored, the telnet server on them would eventually crash (we're talking a matter of hours here, not weeks), making the device inaccessible until reboot. Although your example is more dramatic, it's the only thing I have seen that could account for the slow-motion disaster you're describing. I'd think it was something cumulative, because the Dude doesn't just randomly send out death packets.
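For anyone wanting to test the telnet theory outside the Dude, here is a minimal sketch of a TCP-level service check in Python. It is not how the Dude implements its telnet probe (that is not documented in this thread); `tcp_probe` is a hypothetical helper that only opens a connection to the port and always closes it again, so the check itself does not leave sessions hanging on the device.

```python
import socket


def tcp_probe(host: str, port: int = 23, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port can be opened.

    Hypothetical helper, not the Dude's probe. The 'with' block closes
    the socket on exit, so each check tears its connection down instead
    of leaving a session open on the monitored device.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts and timeouts.
        return False
```

Running something like this in a loop against a lab AP, with telnet as the only monitored service, would be one way to see whether repeated connects alone are enough to wedge the device.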

I've also recently been told that the HTTP monitor done by the Dude generated a stack trace error in the log for our mail server's web interface; the log files were like 100 times larger than normal.

So what am I saying? I don't monitor telnet at all for anything on my network and honestly just stick to ping and SNMP, keepin' it simple.

Although there is another person on here saying the SNMP queries were killing his Trango APs... or was it a switch...

What was the polling interval on your probes? I'm curious whether that has a part to play in how long it took your APs to lock up.

beerfiend · Mon Mar 17, 2008 2:57 pm

Well, I was going to try to send you a digital beer but it didn't work, so that'll have to wait. Anyway, the probing interval was the default (30), I think. I've been running a test environment for the past week trying to recreate the issue. In one attempt I dropped the probing interval from 30 to 5 and set the timeout to 30, just to try and stack up the traffic, and so far no luck recreating the issue. Like I said though, it's weird that we ran it for like three or four months without an issue and then all of a sudden, boom, it downed the wireless three times in a week. Wondering if maybe I've got a corrupt file somewhere. Random death packets, I like that, but I'm fairly positive that there are very few of them flying around on networks.
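As a rough back-of-the-envelope check on the interval change above, here is a small Python sketch. It rests on assumptions the thread does not confirm: that the interval and timeout are in seconds, that each of the five services listed earlier is probed once per interval, and that a new probe goes out even while an earlier one is still waiting on its timeout. The function names are made up for illustration.

```python
import math


def probes_per_hour(services: int, interval_s: float) -> float:
    """Probes sent to one device per hour, one probe per service per interval."""
    return services * 3600 / interval_s


def max_outstanding(interval_s: float, timeout_s: float) -> int:
    """Worst-case probes in flight per service if every probe hangs
    until its timeout (assumes probes are not serialized)."""
    return math.ceil(timeout_s / interval_s)


# Values from the thread: 5 services, interval 30 vs 5, timeout 30.
default_load = probes_per_hour(5, 30)
stress_load = probes_per_hour(5, 5)
stress_backlog = max_outstanding(5, 30)
```

Under those assumptions, 5 services at a 30 s interval come to 600 probes per device per hour, and 3600 at a 5 s interval. A 30 s timeout with a 5 s interval allows up to 6 probes per service in flight at once, which is the "stacking" the test was aiming for.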

I totally agree it's got to be cumulative. Hmmm, the APs do run HTTP as well.

Well, at any rate I'll continue running, and if this mysterious crash happens again I'll turn off my telnet monitoring. Think real hard here: what debugs would you turn on to log traffic issues like that, so you can get an early warning or at least a boot print after the fact? I'm thinking debug HTTP, and possibly the Ethernet interface, as that's the only way the Dude should be able to see them. Tried debug all, and well... we all know what happens when you do that.
