lebowski
Forum Guru
Topic Author
Posts: 1619
Joined: Wed Aug 27, 2008 5:17 pm

Gaps in graphs... just FYI.

Fri Dec 04, 2009 6:01 am

I recently started mucking around with the timing in the SNMP community names. I moved the timings up to 10000ms. I started getting gaps in my graphs. I changed it back down to 3000ms and the gaps went away. If you are seeing labels on your links that show up with the rx:, tx:, etc. text instead of actual values, you might want to set the SNMP timing back to the default.

I have not tested setting them shorter.
 
Toepfe
newbie
Posts: 30
Joined: Fri Oct 31, 2008 11:48 am

Re: Gaps in graphs... just FYI.

Fri Dec 11, 2009 12:41 pm

I also fight with timeouts/false alarms. As adamd292 described in

http://forum.mikrotik.com/viewtopic.php ... d8#p169372

By the way, thanks for this excellent "detective work"! ;-) It was always a mystery to me why these false alarms happen.

I also tested the highest timeout value that is currently possible (10000 ms), but it doesn't get any better. So I switched back to 3000 ms, but with 20 tries. It seems to be a little better, but unfortunately the timeouts/false alarms are still there. In my opinion the gaps in graphs go hand in hand with these timeouts/false alarms.

Maybe the guys from MikroTik will raise the maximum value of the "Try Timeout" field in the SNMP configuration.
 
lebowski
Forum Guru
Topic Author
Posts: 1619
Joined: Wed Aug 27, 2008 5:17 pm

Re: Gaps in graphs... just FYI.

Mon Dec 14, 2009 6:44 pm

Well, the list of interesting bugs is fairly long. I should try to compile it, but since the new version is going to be so different I will just wait and see...

BUT you will find that when certain types of probes fail (function based?) no amount of re-probing will bring it back up. But after about 4 or 5 minutes it will come back up all by itself. Probably a cached failed read?

I don't understand the relationship between the SNMP timeout (the post from adamd292) and the polling timeout. I suspect that trying to tune these causes lots of weirdness.

Do you set your Dude priority to Real time?

One nasty bug that I have not had to deal with is a memory leak when polling services on servers.
Well, it is hurry up and wait for the new version; maybe they will open it up after the next version.
 
Toepfe
newbie
Posts: 30
Joined: Fri Oct 31, 2008 11:48 am

Re: Gaps in graphs... just FYI.

Mon Dec 21, 2009 4:36 pm

Sorry for the late answer, the flu got me :?

I am not really sure what you mean by "Do you set your Dude priority to Real time?". I have seen that in the past you installed special software for dual-core processors to reduce false alarms. I tried deactivating the dual-core function in the BIOS of our Dude computer, but with no success.
 
lebowski
Forum Guru
Topic Author
Posts: 1619
Joined: Wed Aug 27, 2008 5:17 pm

Re: Gaps in graphs... just FYI.

Sun Jan 03, 2010 10:12 pm

NP I have been on vacation, going to get back to work :)

I have tried about every combination of disabling multiple cores, forcing priority, changing polling timeout and snmp timeout.

I have not found a fix for my situation. I get the fewest false positives with the SNMP probe timeout at 8000ms and the Dude forced to real time priority. When I set the SNMP probe timeout too high (10000ms) I get gaps in the graphs. Forcing real time priority can be done with Sysinternals Process Explorer.

I have not found running on a single core to decrease false positives. At this point I am really just waiting for the next version.
 
lebowski
Forum Guru
Topic Author
Posts: 1619
Joined: Wed Aug 27, 2008 5:17 pm

Re: Gaps in graphs... just FYI.

Mon Jan 04, 2010 4:42 pm

Just fyi I upgraded to 3.5, still have false positives.
 
lebowski
Forum Guru
Topic Author
Posts: 1619
Joined: Wed Aug 27, 2008 5:17 pm

Re: Gaps in graphs... just FYI. - possible fix?

Mon Jan 04, 2010 9:50 pm

Maybe I was premature or maybe I had set something wrong but the last false positive I had was hours ago and things seem much better.

I did make more changes this morning right after I had a false positive on 3.5 so now I am starting to believe that I caused all the trouble to begin with.

My current polling settings are 30 seconds for probe, 16 for timeout and 3 for count. In the SNMP settings, every community string timeout is set to 9000 ms, but no-snmp is set to 10000 ms. Could changing no-snmp to a higher value than all the other SNMP timeouts have cleaned it up? I don't know how the code works internally, so why that would be true is beyond me. Also, I started running 3.5 first thing this morning.
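
A rough way to reason about how the SNMP try timeout, the number of tries and the polling timeout interact is to add up the worst case. The Python sketch below is only arithmetic under assumptions this thread does not confirm: that tries run one after another, that each failed try waits its full timeout, and that the polling timeout of 16 is in seconds. The function names are made up for illustration; nothing here is the Dude's actual logic.

```python
def worst_case_wait_ms(try_timeout_ms: int, tries: int) -> int:
    """Longest wait if every try times out and tries run back to back (assumed)."""
    return try_timeout_ms * tries


def tries_that_fit(try_timeout_ms: int, budget_ms: int) -> int:
    """How many full-length tries fit inside a time budget such as a polling timeout."""
    return budget_ms // try_timeout_ms


if __name__ == "__main__":
    # Values quoted in this thread; the units of the polling timeout are assumed.
    print(worst_case_wait_ms(3000, 20))   # 3000 ms try timeout with 20 tries
    print(tries_that_fit(9000, 16000))    # 9000 ms tries inside a 16 s polling timeout
    print(tries_that_fit(10000, 30000))   # 10000 ms tries inside a 30 s probe interval
```

Under those assumptions, 20 tries at 3000 ms could take up to 60 s, which is longer than a 30 s probe interval, and only one 9000 ms try would fit inside a 16 s polling timeout.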

If you are having graphing issues or false positives will you try the settings for polling and SNMP that I describe and post your results here!!! Also I am still forcing high priority on the dude server service.

Thanks
SD
 
lebowski
Forum Guru
Topic Author
Posts: 1619
Joined: Wed Aug 27, 2008 5:17 pm

Re: Gaps in graphs... just FYI.

Tue Jan 05, 2010 3:51 pm

Well, no joy: I had 42 false positives overnight. The reason I am certain they are false is that we have a completely separate system that pings devices every 5 minutes, and if something had been down it would have generated notifications as well. NMIS is that system and it had none.

But at least my graphs don't have gaps...
 
Toepfe
newbie
Posts: 30
Joined: Fri Oct 31, 2008 11:48 am

Re: Gaps in graphs... just FYI.

Tue Jan 05, 2010 4:12 pm

Thanks for the information. I updated my second Dude server from 3.3 to 3.5. But after your last entry, I am not very confident :?
 
lebowski
Forum Guru
Topic Author
Posts: 1619
Joined: Wed Aug 27, 2008 5:17 pm

Re: Gaps in graphs... just FYI.

Tue Jan 05, 2010 4:51 pm

The only thing I have come up with after all the mucking around is that sometimes when I restart the service it just starts good and other times it doesn't. When I am getting more than the average number of false positives, I restart the service and force it to real time priority as fast as possible. This seems to correct it, but it also seems that when I add and remove devices false positives just start happening.

I suppose I could capture traffic until I get a false positive and see if the response is actually there although most of my network is fiber so I know there is not much traffic if any getting dropped. None of the gigabit links are over 40Mbps. None of the switches or routers are high cpu utilization.

I still like setting real time priority since it just makes it run smoother but I don't think it is helping with false positives either.

Anyhow if anyone has any ideas why a dude server running as a service on win 2k3 sp2 with 4gb of ram and a quad core 3.4ghz xeon could have false positives I would like to hear from you.
 
gsandul
Member Candidate
Posts: 154
Joined: Mon Oct 19, 2009 1:42 pm

Re: Gaps in graphs... just FYI.

Wed Jan 06, 2010 3:14 pm

Hi, sweetdude.
1) How many probes do you have? Currently I have about 300 services monitored, about 50 SNMP links and about 100 different functions displayed on devices, and I have no problems with devices, but I had to set a 5000 ms SNMP timeout for Windows servers.
2) It is possible that the problem is not in the Dude itself but in the network. In some situations devices (routers) can limit the amount of UDP traffic (new UDP connections per second, to be more precise), and from time to time you will get no response from a device behind such a router.
Note: it seems that when you create new probes and functions the Dude reschedules all the probes, and that is probably why you get false positives (the new probe schedule is just not good for your network, because the number of UDP connections per second is limited on an intermediate router).

Note 2: I know UDP is a connectionless protocol. When I say new connection, I mean a new (SrcIP:newSrcPORT to newDstIP:DstPORT) packet. Maybe it is better to say new UDP flow (if you know Cisco NetFlow you will understand me).
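
The "new UDP flow" idea can be sketched in Python as counting address/port combinations that have not been seen before. This is only an illustration: the `Packet` type, the addresses and the per-second bucketing are assumptions made up for the example, and a real router would also age flows out of its table, which this sketch does not do.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple


class Packet(NamedTuple):
    ts: float       # capture time in seconds
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int


def new_flows_per_second(packets: Iterable[Packet]) -> Dict[int, int]:
    """Count first-seen (src ip, src port, dst ip, dst port) tuples per whole second."""
    seen = set()
    per_second: Dict[int, int] = defaultdict(int)
    for p in sorted(packets, key=lambda pkt: pkt.ts):
        flow = (p.src_ip, p.src_port, p.dst_ip, p.dst_port)
        if flow not in seen:
            seen.add(flow)
            per_second[int(p.ts)] += 1
    return dict(per_second)


# Hypothetical poller at 10.0.0.5 sending SNMP requests (UDP port 161).
SAMPLE = [
    Packet(0.1, "10.0.0.5", 40001, "10.0.1.1", 161),
    Packet(0.2, "10.0.0.5", 40002, "10.0.1.2", 161),
    Packet(0.9, "10.0.0.5", 40001, "10.0.1.1", 161),  # same flow again, not new
    Packet(1.3, "10.0.0.5", 40003, "10.0.1.3", 161),
]

if __name__ == "__main__":
    print(new_flows_per_second(SAMPLE))
```

If a reschedule bunches many probes into the same second, the count for that second goes up, which is the situation where a per-second limit on an intermediate router could start dropping requests.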
 
lebowski
Forum Guru
Topic Author
Posts: 1619
Joined: Wed Aug 27, 2008 5:17 pm

Re: Gaps in graphs... just FYI.

Wed Jan 06, 2010 5:21 pm

Hey gsandul,

I have about 350 probes and very few items added to labels.

The dude server is directly connected via gigabit to a 3750 stack that is then connected to a 6509 through a 10gb connection. I could move the connection directly to the 6509. If you're familiar with Cisco you will know that a 6509 with a sup720 should be sufficient. Although if I do intermittently have too many udp connections I could someday break the server into two.

I am thinking maybe I broke it adding MIBs or something long ago, although with the current settings the false positives are not as bad as they were.

I am looking forward to the database version...

Thanks,
SD
 
danielillu
Member Candidate
Posts: 111
Joined: Sun Aug 27, 2006 5:37 am
Location: Barcelona, Spain

Re: Gaps in graphs... just FYI.

Wed Feb 24, 2010 2:34 am

Hi,
I'm testing a new setup and trying to monitor tests using a fresh dude:
RB450 with all ethernets bridged.
ether1 is connected to a 10mbps line.
bridgeFilter+queues to shape bandwidth to/from ether1 to ether5@4mbps (max 5) and ether4@6mbps (max 7).
ether4 has a w2003r2 acting as dude server & FTPserver (to push traffic over the line).
Dude 3.5

If I set the map labels to "mastering type: routeros", the bitrate label gets stuck after a few reads and always shows the same values.

If I set the labels to "mastering type: SNMP", I get the text most of the time. Often I cannot even select the ethernet interface in the dropdown list, because only one or two interfaces appear (the RB450 has five, plus a bridge).

In both cases there are many gaps in the graphs. In fact, sometimes there are only some lines.

And now my big discovery :)

I've just found out that gaps only appear when labels are set to a "mastering type" other than "simple" ("simple" being when there is no label showing the bitrate).
 
lebowski
Forum Guru
Topic Author
Posts: 1619
Joined: Wed Aug 27, 2008 5:17 pm

Re: Gaps in graphs... just FYI.

Wed Feb 24, 2010 4:16 pm

You have to use mastering type SNMP, the other is broken.

Under settings SNMP, what is your timeout? I found that setting the timeout above about 7 seconds seemed to cause issues, and setting it to 10 seconds seems to cause gaps in the graphs. I am not using RB so ymmv.
 
blackpaw
newbie
Posts: 37
Joined: Thu Jan 28, 2010 7:19 pm
Location: Amsterdam, NL

SNMP timeouts - Re: Gaps in graphs... just FYI.

Tue Oct 26, 2010 10:29 pm

I am having similar issues: gaps in graphs, false positives and no idea what is causing them.
The network does not have problems and the devices I monitor have no SNMP bottlenecks, so I am starting to wonder what these timeout values mean... (I also tried a lot of combinations with no effect.)
What also happens is that some probes just time out for no particular reason at all. Just timeout.

A restart of the Dude service always helps and clears all false positives immediately, so I have set up a daily Dude server restart using the "net stop" and "net start" commands.
This took care of the graphing problem but I still have these timeouts...

So: (brainstorming)
- SNMP uses UDP, meaning there is no "TCP session" to track the polling. Dude sends, router answers.
- There must therefore be some kind of intelligence somewhere inside the Dude that keeps track of all pending SNMP operations and waits for an answer to each probe. But how does it do that?
- Assuming the Dude is the limit, I will try to set up two extra Dude servers that will each poll 1/3 of my devices on behalf of the "master" Dude server. If this helps I will know for sure.
- If the timeouts are caused by probes not returning an answer in time, increasing the time in the SNMP preferences should help, but it doesn't. Setting it shorter does not help either... (100 ms and less)
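
On the question of how a poller can keep track of pending operations over UDP: SNMP messages carry a request-id that the agent copies into its reply, so a poller can keep a table of outstanding requests keyed by that id. The Python sketch below shows that general technique only. It is not the Dude's implementation, and the class and method names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Pending:
    target: str
    deadline: float   # time after which the request counts as timed out


class PendingTable:
    """Track outstanding requests by request id (illustrative, not the Dude's code)."""

    def __init__(self, timeout_s: float) -> None:
        self.timeout_s = timeout_s
        self._next_id = 1
        self._pending: Dict[int, Pending] = {}

    def sent(self, target: str, now: float) -> int:
        """Record a request that was just sent and return its request id."""
        request_id = self._next_id
        self._next_id += 1
        self._pending[request_id] = Pending(target, now + self.timeout_s)
        return request_id

    def received(self, request_id: int, now: float) -> Optional[str]:
        """Return the target if the reply matches a live request, else None."""
        entry = self._pending.get(request_id)
        if entry is None or now > entry.deadline:
            return None   # unknown id, duplicate reply, or reply arrived too late
        del self._pending[request_id]
        return entry.target

    def expire(self, now: float) -> List[str]:
        """Remove and return the targets whose deadline has passed."""
        late = [rid for rid, e in self._pending.items() if now > e.deadline]
        return [self._pending.pop(rid).target for rid in late]
```

In this sketch a reply that arrives after its deadline is discarded, which is one way a device that did answer could still end up reported as timed out.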

Any comments on this? Ideas how to monitor?

P.S. Using Dude 4.0 beta 2 - love the database and improved graphing!
P.P.S. monitoring about 30 devices with about 10-40 probes each - calling various OIDs and functions in them (CPU, interface description, etc...)

Andreas
 
lebowski
Forum Guru
Topic Author
Posts: 1619
Joined: Wed Aug 27, 2008 5:17 pm

Re: Gaps in graphs... just FYI.

Tue Oct 26, 2010 11:52 pm

Sounds like you have gone probe craZy :)

There are two things to do, modify all probes to use a 29 second negative cache (or any time as long as it is shorter than your probe interval) and force the dude service to real time priority if running windows. Changing the negative cache time will let probes that do fail come back up quickly. Running the dude service real time priority reduces the chance that a response is not processed.
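
To show why the negative cache time should be shorter than the probe interval, here is a minimal Python sketch of a negative cache. It assumes a failed read is remembered for a fixed hold time and that later reads inside that window return "failed" without asking the device again. That is a guess at the general mechanism, not the Dude's documented behaviour. `NegativeCache`, `polls_reported_down` and the 300 second hold are invented for the example; 300 is chosen only to resemble the 4 or 5 minutes mentioned earlier in this thread.

```python
from typing import Callable, Dict, Optional


class NegativeCache:
    """Remember failed reads for hold_s seconds and answer from memory meanwhile."""

    def __init__(self, hold_s: float) -> None:
        self.hold_s = hold_s
        self._failed_at: Dict[str, float] = {}

    def read(self, key: str, fetch: Callable[[], Optional[int]], now: float) -> Optional[int]:
        failed_at = self._failed_at.get(key)
        if failed_at is not None and now - failed_at < self.hold_s:
            return None                      # still held: the device is not asked
        value = fetch()
        if value is None:
            self._failed_at[key] = now       # start (or restart) the hold window
        else:
            self._failed_at.pop(key, None)   # success clears the failure
        return value


def polls_reported_down(hold_s: float, interval_s: float = 30.0, polls: int = 12) -> int:
    """Count polls that see a failure when the device misses only the first request."""
    cache = NegativeCache(hold_s)
    answers = iter([None])                   # one lost reply, then the device is fine

    def fetch() -> Optional[int]:
        return next(answers, 42)

    down = 0
    for i in range(polls):
        if cache.read("cpu", fetch, now=i * interval_s) is None:
            down += 1
    return down


if __name__ == "__main__":
    print(polls_reported_down(29))    # hold shorter than the 30 s probe interval
    print(polls_reported_down(300))   # hypothetical long hold
```

With a 29 second hold only the poll that actually failed is reported down. With the hypothetical 300 second hold, the single lost reply keeps the probe down for ten polls in a row.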

At this point I believe that the false positives are caused by items that just fail to get processed.

I thought this was interesting. There is no way any of the devices below would ever be higher than 10ms to ping... Notice how the graphs are similar. From what I can tell, if the dude calculates a rtt of 50ms it will most likely calculate that again and again even if it is not correct.
Service_2010-10-26_14-46-39.png
 
chrisd13
Frequent Visitor
Posts: 68
Joined: Mon Feb 20, 2006 4:05 pm
Location: UK

Re: Gaps in graphs... just FYI.

Wed Oct 27, 2010 7:33 pm

I agree with the "fail to get processed" statement. I am monitoring somewhere between 400-500 devices, most are just ping but 150 have diskspace and service probes assigned. The majority of the false positives I get are service or disk space probes, they will alert saying the probe has failed and then be back up in a matter of seconds. I am running my Dude server on a DL380 G3 dualcore Xeon 2.4 with 4Gb RAM on W2k3 R2. The server is just ticking over, so I am a little confused as to why I get these false positives.

Just adding my observations :)
 
lebowski
Forum Guru
Topic Author
Posts: 1619
Joined: Wed Aug 27, 2008 5:17 pm

Re: Gaps in graphs... just FYI.

Wed Oct 27, 2010 10:31 pm

Like you I am running w2k3 sp2 on a quad core Xeon, 10% CPU utilization... My false positives seem to repeat on the same device: virt mem on x up and down 10 times... then cpu on x up and down a few times...
 
blackpaw
newbie
Posts: 37
Joined: Thu Jan 28, 2010 7:19 pm
Location: Amsterdam, NL

Re: Gaps in graphs... just FYI.

Thu Oct 28, 2010 9:57 pm

Lebowski and other probe-crazies ;)

I tried out the probe/function ideas that you assembled in the wiki at
http://wiki.mikrotik.com/wiki/Getting_s ... and_probes

however, I ran into a problem:

[code]#sho proc cpu sort
CPU utilization for five seconds: 46%/9%; one minute: 63%; five minutes: 44%
PID Runtime(ms) Invoked uSecs 5Sec 1Min 5Min TTY Process
196 21695537441564813062 1386 31.35% 44.85% 29.82% 0 SNMP ENGINE
191 2585685402353894333 0 2.87% 2.68% 2.28% 0 IP Input
184 2824427122856171744 0 1.83% 1.78% 1.21% 0 IP SNMP
10 58310624 390475751 149 0.87% 0.74% 0.66% 0 ARP Input
185 472294481442857906 32 0.23% 0.30% 0.19% 0 PDU DISPATCHER
148 4694300 99594750 47 0.07% 0.03% 0.02% 0 HSRP IPv4
[/code]

This is from a 7600/6500 Cisco router - quite some impact :)

I think it may be a little too much to call a whole SNMP walk every time you execute the function and call the same function 5 times in the probe.. but that may be just me...

I have set the SNMP timeouts to the standard 3000 ms and the retries to 5, later back to 3, but it kept flooding my router. Back to square one, I guess.

Maybe I really should distribute the load over several Dude servers...

Any hints would be appreciated


Andreas


My code:

[code]
<?xml version="1.0" ?>
<dude version="4.0beta2">
<Function>
<sys-type>57</sys-type>
<sys-id>1026822</sys-id>
<sys-name>Cisco_CPU_1min</sys-name>
<code>if(array_size(oid_column("1.3.6.1.4.1.9.2.1.57", 10 ,29)), oid("1.3.6.1.4.1.9.2.1.57.0", 10, 29)+1 ,"False")</code>
<descr>Reads the 1 minute CPU of a Cisco device.</descr>
</Function>
</dude>


<?xml version="1.0" ?>
<dude version="4.0beta2">
<Probe>
<sys-type>13</sys-type>
<sys-id>1026825</sys-id>
<sys-name>CiscoCPU</sys-name>
<typeID>8</typeID>
<functionAvailable>Cisco_CPU_1min() <> "False"</functionAvailable>
<functionError>if(Cisco_CPU_1min()<>"False",if(Cisco_CPU_1min() < 60, "", concatenate("Warning: high CPU: ", Cisco_CPU_1min(), "%")), "CPU polling fault")</functionError>
<functionValue>oid("1.3.6.1.4.1.9.2.1.57.0",10,29)</functionValue>
<functionUnit>%</functionUnit>
</Probe>
[snip a lot of service IDs and other stuff the Dude adds over time...]


Here the interface probes:

<?xml version="1.0" ?>
<dude version="4.0beta2">
<Function>
<sys-type>57</sys-type>
<sys-id>1106054</sys-id>
<sys-name>if_0_status</sys-name>
<code>if(array_size(oid_column("1.3.6.1.2.1.2.2.1.8", 10 ,29)), oid_raw("1.3.6.1.2.1.2.2.1.8.0", 10, 29),"False")</code>
<descr>polls the status of ifindex 0 (1 means 'up', 2 means 'down')</descr>
</Function>
</dude>

<?xml version="1.0" ?>
<dude version="4.0beta2">
<Function>
<sys-type>57</sys-type>
<sys-id>1106327</sys-id>
<sys-name>if_0_name</sys-name>
<code>oid("1.3.6.1.2.1.2.2.1.2.0", 10, 29)</code>
<descr>polls the interface name of ifindex 0 </descr>
</Function>
</dude>


<?xml version="1.0" ?>
<dude version="4.0beta2">
<Function>
<sys-type>57</sys-type>
<sys-id>1106600</sys-id>
<sys-name>if_0_desc</sys-name>
<code>oid("1.3.6.1.4.1.9.2.2.1.1.28.0", 10, 29)</code>
<descr>polls the interface description of ifindex 0 </descr>
</Function>
</dude>


<?xml version="1.0" ?>
<dude version="4.0beta2">
<Probe>
<sys-type>13</sys-type>
<sys-id>1106873</sys-id>
<sys-name>IFindex_0</sys-name>
<typeID>8</typeID>
<functionAvailable>if_0_status() = 1</functionAvailable>
<functionError>if(if_0_status()<>"False",if(if_0_status() = 1, "", concatenate(if_0_desc()," connected to interface: ",if_0_name(),"is down") ), "SNMP polling fault - most likely false alarm")</functionError>
</Probe>
</dude>
[/code]
 
lebowski
Forum Guru
Topic Author
Posts: 1619
Joined: Wed Aug 27, 2008 5:17 pm

Re: Gaps in graphs... just FYI.

Thu Oct 28, 2010 10:19 pm

CPU utilization for five seconds: 46%/9%; one minute: 63%; five minutes: 44%

Yep that is high, but probes only read the oid values and not the entire table. Separating the probes onto different servers will not reduce how much SNMP data is being requested. The issue is most likely too many probes being run against your router or being run too often.

There is a huge difference between snmpwalk and requesting one oid. See the below example...
oid_column("1.3.6.1.4.1.9.2.1.57") returns all the values contained in that oid and every oid below it. It is actually only 1 counter. (not bad)
oid_column("1.3.6") returns all the values contained in that oid and every oid within it. It can be thousands of counters. (Horrible)
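
The difference between a specific prefix and a short one can be shown with a toy model in Python. `FAKE_MIB` below is a made-up stand-in for an agent with invented values, and `walk` only imitates what a subtree walk returns; it does not talk SNMP and is not how `oid_column` is implemented.

```python
from typing import Dict, List, Tuple

Oid = Tuple[int, ...]


def parse_oid(text: str) -> Oid:
    """Turn a dotted OID string into a tuple of integers."""
    return tuple(int(part) for part in text.strip(".").split("."))


def walk(mib: Dict[str, object], prefix: str) -> List[str]:
    """Return every OID in mib that sits under prefix, in numeric order."""
    want = parse_oid(prefix)
    hits = [oid for oid in mib if parse_oid(oid)[:len(want)] == want]
    return sorted(hits, key=parse_oid)


# Toy agent: a handful of OIDs with invented values.
FAKE_MIB: Dict[str, object] = {
    "1.3.6.1.2.1.1.3.0": 123456,        # sysUpTime
    "1.3.6.1.2.1.2.2.1.2.1": "Gi1/1",   # ifDescr, ifIndex 1
    "1.3.6.1.2.1.2.2.1.2.2": "Gi1/2",   # ifDescr, ifIndex 2
    "1.3.6.1.2.1.2.2.1.8.1": 1,         # ifOperStatus, ifIndex 1
    "1.3.6.1.2.1.2.2.1.8.2": 2,         # ifOperStatus, ifIndex 2
    "1.3.6.1.4.1.9.2.1.57.0": 63,       # the Cisco CPU OID used in this thread
}

if __name__ == "__main__":
    print(len(walk(FAKE_MIB, "1.3.6.1.4.1.9.2.1.57")))  # specific prefix
    print(len(walk(FAKE_MIB, "1.3.6")))                  # short prefix: everything
```

Even in this six-entry toy the short prefix returns the whole table while the specific prefix returns a single value. On a real router the short prefix could cover thousands of entries.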

I double checked that you are using an OID that is as complete as possible in your probes, they look good. Are you running many other built in probes? How often are you running your probes?

Divide and conquer... Remove probes one at a time from the router to determine which is the offending probe?
 
gsandul
Member Candidate
Posts: 154
Joined: Mon Oct 19, 2009 1:42 pm

Re: Gaps in graphs... just FYI.

Fri Oct 29, 2010 9:07 am

This is from a 7600/6500 Cisco router
Do you run eBGP and have a lot of routes in the routing table? Or do you have a lot of virtual interfaces (e.g. PPPoE, PPTP) on your router?
