This is a bug. There are many things that could be discussed here, but it would be so much simpler if they would just open the code and fix some bugs... My solution was to disable link labels on the "dashboard" map so I don't see them as often.
(continued reading)
It seems to me that any probe whose response is slightly slow is more prone to false positives. Most likely the dude does not wait correctly for packets to return.
I could never solve link labels going haywire, but I reduced false positives dramatically by doing two things: configuring every probe with a 5-second negative cache value and setting the polling interval to 1 minute. I know you are not calling these false positives, but here is why I am calling them that.
With Wireshark I have watched a specific request for an SNMP counter get a response, yet the packet somehow did not get processed: I saw the request, I saw the response, and the probe still showed one failure. With the default 300-second negative cache time, the dude will not correctly retry any re-probes. This is a very specific bug. The negative cache time should not be set until the configured retries are exhausted, but it is set on the first failed probe, and since a response is sometimes lost (it doesn't seem to get processed), a device that is certainly on the network will show as down. No matter how many times you reprobe it while the negative cache is in effect, it will not come back up, even though you are logged into the device and know it is not down. Reducing the negative cache time below the reprobe interval is how to solve this. You have to do this for every single function, since negative cache time is not a variable. I also can't think of any reason for a negative cache time: if a device is down, you want to know the exact moment it came back up, not 300 seconds after a failed probe.
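A minimal sketch of the behavior I'm describing (this is my own toy model, not the dude's actual code; all names are hypothetical):

```python
class Probe:
    """Toy model of how a negative cache can suppress retries.

    Hypothetical sketch, not the dude's real implementation: the point
    is that the cache is set on the FIRST failure, before retries run.
    """

    def __init__(self, negative_cache_secs):
        self.negative_cache_secs = negative_cache_secs
        self.cached_down_until = 0.0

    def poll(self, device_responds, now):
        # While the negative cache is in effect, the cached "down" result
        # is returned and no packet is ever sent -- retries are wasted.
        if now < self.cached_down_until:
            return "down (cached)"
        if device_responds:
            return "up"
        # One lost response marks the device down for the full cache window.
        self.cached_down_until = now + self.negative_cache_secs
        return "down"

# One lost response with the default 300 s cache and 60 s polling:
probe = Probe(negative_cache_secs=300)
print(probe.poll(device_responds=False, now=0))   # down (response lost)
print(probe.poll(device_responds=True, now=60))   # down (cached) -- device is fine
# With a 5 s cache (shorter than the 60 s poll interval), the next poll recovers:
probe = Probe(negative_cache_secs=5)
print(probe.poll(device_responds=False, now=0))   # down
print(probe.poll(device_responds=True, now=60))   # up
```

This is why dropping the negative cache below the reprobe interval works: the cache expires before the next poll, so one lost packet costs one cycle instead of five minutes.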
So now let's apply this to links, which are probably based on the label refresh interval. Do they also have a global negative cache time? (I don't know if it can be modified.) A failed poll of an interface causes the label to show the raw RX[interface.in.bitrate] text. The next successful poll shows a 0 on the label, since the previous value is no longer on record and the dude can't do any math to determine how much the link has been utilized. Interface utilization is calculated from octet counters, so you have to have a value to start with and then diff against it based on the time since the last successful read. I would bet that the probe of the interface was successful and the packet was never processed, which is basically a false positive.
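The counter math works roughly like this (my own sketch of the standard calculation, not the dude's code):

```python
def link_bitrate(prev_octets, curr_octets, elapsed_secs, counter_bits=32):
    """Rate from two SNMP octet-counter samples (e.g. ifInOctets).

    Sketch of the standard calculation, not the dude's code. If the
    previous sample was lost, there is nothing to diff against and the
    rate is unknowable -- which is why the label falls back to 0.
    """
    if prev_octets is None or elapsed_secs <= 0:
        return None  # no baseline: can't compute a rate
    diff = curr_octets - prev_octets
    if diff < 0:  # counter wrapped between samples
        diff += 2 ** counter_bits
    return diff * 8 / elapsed_secs  # octets -> bits per second

# Two good reads 60 s apart:
print(link_bitrate(750_000, 7_500_000, 60))  # 900000.0 bps
# Previous read failed or was dropped by the poller:
print(link_bitrate(None, 7_500_000, 60))     # None -- nothing to show
```

One dropped poll therefore poisons the *next* reading too: even a perfect read right after the failure has no baseline to diff from.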
Here are the effects of your attempts to resolve it:
1) increased polling intervals to 5 min - all outages now take 5 min to detect and false positives still last 5 min; doesn't affect labels.
2) increased polling Probe Timeout to 1 min - the packet probably made it back to the dude so the probe didn't actually need a longer time.
3) increased Label Refresh to 30 sec - Decreased the number of failed reads by reducing total reads. (labels might be correct more often)
4) increased Snmp (v1) Try Timeout to 10 sec (max); - snmp v1 or v2 should take well under 5 seconds to respond.
5) un-selected Resolve MAC Address Manufacturer; - shouldn't matter
6) increased Mac Mapping Refresh Interval to 1 hr; - mine is set to 24 hours; this just maps MAC addresses to switchports
7) increased the Database Commit Interval to 30 sec; - shouldn't matter
8) increased ROS Connection Timeout to 15 sec; - shouldn't matter
9) increased the ROS connection Interval to 1 min; - shouldn't matter
10) and finally, forced CPU core affinity to a single core for the dude - this was an early attempt to understand the issue. I am still forcing the dude to a single core and running it at real-time priority; this seems to cause slightly fewer false positives.
To see the dude fail to read some devices, click on Services, then sort by Problem and watch the list; occasionally devices will show "not available". When that happens and the negative cache has not been modified, those "not available" probes will go to down, since none of the retries ever get sent.
The HTTP request probe also shows signs of this troubled processing: many times one HTTP probe will show down, and clicking Retry will quite often show "connection closed". Watching this in a protocol analyzer, even though the server responded in 0.009 seconds, the dude reports "connection closed" before the three-way handshake has even finished.
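You can cross-check what the analyzer shows with an independent timed probe; here is a rough sketch (the host below is a placeholder, point it at the server in question):

```python
import socket
import time

def http_probe(host, port=80, timeout=5.0):
    """Independent, timed cross-check of an HTTP probe.

    Sketch only: if the three-way handshake and the first response bytes
    both arrive in milliseconds here, a "connection closed" from the
    monitor at the same moment points at its own packet processing.
    """
    start = time.monotonic()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        connected = time.monotonic() - start  # handshake complete
        sock.sendall(b"HEAD / HTTP/1.0\r\nHost: " + host.encode() + b"\r\n\r\n")
        first_bytes = sock.recv(1024)         # first response bytes
        responded = time.monotonic() - start
    return connected, responded, first_bytes.split(b"\r\n", 1)[0]

# Example usage (placeholder host):
# connect_s, respond_s, status = http_probe("example.com")
# print(f"handshake {connect_s:.3f}s, response {respond_s:.3f}s, {status!r}")
```

If this reports a sub-second handshake and response while the dude simultaneously logs "connection closed", the failure is in the monitor, not the server.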
Since they are not going to make it open source and they don't have anyone developing it, maybe they could "hire" someone, have them sign an NDA, and let them work on the code for free... (me). That way the code would get some much-needed attention.
the wall of text,
Lebowski