macsrwe
Long time Member
Topic Author
Posts: 646
Joined: Mon Apr 02, 2007 5:43 am
Location: Arizona, USA

Queues lie

Thu Jun 13, 2019 3:46 am

This isn't a question, it's a warning.

There is a bug of long standing in RouterOS that causes invisible, internal queue corruption. I have experienced it with both tree queues and simple queues, over a period of something like eight years, and have incontrovertibly proved it is happening.

The symptom is that target devices do not receive the full configured bandwidth, despite the queue entry being correct (and unmodified over extremely long periods). In all cases, exporting the queue collection to a file, removing all queue entries, then reimporting the file to recreate the "same" queue entries magically fixes the problem. We have termed this operation "rejuvenating the queues." We recently instituted a timed script that performs this operation automatically once per month, to work around this bug.

We do not know how long it takes for a perfectly functioning queue to become corrupted; we settled on the monthly rejuvenation totally arbitrarily.
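
A minimal sketch of the export/remove/re-import cycle described above, assuming RouterOS v6 simple queues whose entries are all static. The file name `queues-rejuvenate`, the script name `rejuvenate-queues`, and the 30-day interval are hypothetical placeholders; this is an illustration of the procedure, not necessarily the script actually in use.

```routeros
# Body of a script saved as "rejuvenate-queues" (hypothetical name).

# Save the current simple queue entries to queues-rejuvenate.rsc
/queue simple export file=queues-rejuvenate
# Give the export a moment to finish writing the file
:delay 2s
# Remove every simple queue entry
/queue simple remove [find]
# Re-create the same entries from the exported file
/import file-name=queues-rejuvenate.rsc
```

```routeros
# Run the script above every 30 days (an approximation of "once per month")
/system scheduler add name=rejuvenate-queues interval=30d on-event=rejuvenate-queues
```

Between the remove and the import there are no queues at all, so subscribers are briefly unshaped; for queue trees the same cycle would presumably be applied under `/queue tree` instead.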

We also do not know if the corruption works equally in the direction of raising bandwidth limits, because in our 12 years of operation we have never received a single phone call from anyone saying, "My Internet is too fast, please fix it." We have, however, received an oral report of near ludicrous speeds from at least one customer, so there is a possibility.

More information at:

http://www.grandavebb.com/blog/2019/03/27/q-lies/
http://www.grandavebb.com/blog/2019/03/ ... up-q-lies/
 
mducharme
Trainer
Posts: 788
Joined: Tue Jul 19, 2016 6:45 pm

Re: Queues lie

Thu Jun 13, 2019 6:04 am

> This isn't a question, it's a warning.
>
> There is a bug of long standing in RouterOS that causes invisible, internal queue corruption. I have experienced it with both tree queues and simple queues, over a period of something like eight years, and have incontrovertibly proved it is happening.
>
> The symptom is that target devices do not receive the full configured bandwidth, despite the queue entry being correct (and unmodified over extremely long periods). In all cases, exporting the queue collection to a file, removing all queue entries, then reimporting the file to recreate the "same" queue entries magically fixes the problem. We have termed this operation "rejuvenating the queues." We recently instituted a timed script that performs this operation automatically once per month, to work around this bug.

"Incontrovertibly proved"? Do you mind sharing this "incontrovertible" proof?

We haven't experienced this issue, and we are a reasonably sized ISP. I haven't heard of an issue like this before either. I would be surprised if there actually were a problem here; I suspect instead that something in the way you have the queues configured is causing this problem for you.
 
macsrwe

Re: Queues lie

Thu Jun 13, 2019 6:49 am

Sure. Let me give you the most recent instance.

Subscriber calls me. "My internet is slow." I run the standard BTest, and he has headroom up the wazoo, 3x contract rate. (On our network, the ports used by MT BTest avoid the queues so we can measure true available headroom, not just the max contract rate.) For kicks, I run BTest between his in-home hAP lite and our gateway, just in case the home cable is bad, and I get the same.

He's adamant, so I figure maybe it's a radio problem in the hAP. I roll truck out to the residence. Sure enough, both he and I are reading 2Mb if he's lucky, and not even steadily. So I swap out the hAP with a known good hAP. Same problem. I drag out a long ether cable and F/F connector and eliminate the hAP altogether. Same problem. Meanwhile, BTest between the gateway and CPE is still showing headroom up the wazoo, and BTest from my portable diagnostic MT to the customer's CPE is nearly the full 100Mb.

Remembering the queue problem I saw in the past, I run the same fix: safestore the queues to a file, remove them all, then re-import them. Bang, the customer is now getting full contract rate, high and tight, both over the cable and wirelessly from his hAP. And has ever since, and it's been several months.

You're not going to tell me the queues were not at fault here. You're also not going to tell me I had them configured wrong, because the configuration I put back was EXACTLY the same configuration that I wiped clean, and it worked just fine from then on… not only for this customer, but for about five others I heard from after applying the "rejuvenation" fix to all the towers.

I had precisely the same experience with queue trees some seven years ago — the symptoms were the same, and the fix was the same.

This problem exists. Deny it at your peril.
 
mducharme

Re: Queues lie

Thu Jun 13, 2019 7:14 am

> You're not going to tell me the queues were not at fault here. You're also not going to tell me I had them configured wrong, because the configuration I put back was EXACTLY the same configuration that I wiped clean, and it worked just fine from then on… not only for this customer, but for about five others I heard from after applying the "rejuvenation" fix to all the towers.
>
> I had precisely the same experience with queue trees some seven years ago - the symptoms were the same, and the fix was the same.
>
> This problem exists. Deny it at your peril.

Just because the problem went away when you deleted and reimported the exact same queues from exactly the same configuration doesn't mean that you don't have a problem with your queue setup. Often I find it is because people are attempting a rather complicated queue configuration without understanding how to do it properly. There are some configurations of queues that are incorrect and cause unexpected behavior.

You say that your btest traffic bypasses the queues - how are you having this traffic bypass the queues? This is certainly a more advanced setup and you are probably doing something that is a bit wrong and is causing random-seeming unexpected behavior.

Can you export and share your queue configuration here?
Last edited by mducharme on Thu Jun 13, 2019 7:39 am, edited 1 time in total.
 
macsrwe

Re: Queues lie

Thu Jun 13, 2019 7:39 am

It's pretty difficult to misconfigure simple queues. Queue, queue, queue, that's the one that matches this customer, we're done. There are no other queues. Removing them and putting exactly the same queues back should not fix a bandwidth problem like this... and yet it does.

To bypass the queues, I have a single queue at the head end that matches on a packet mark and has unlimited bandwidth. The packet mark is mangled onto traffic to and from the defined BTest ports. There is no other mangling or other packet marks on the router.
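
A rough sketch of the kind of bypass described above, assuming RouterOS v6 and BTest traffic that is routed through the tower. The mark name `btest`, the queue name `btest-bypass`, and the port numbers are assumptions for illustration; check the ports against your own BTest server settings.

```routeros
# Mark forwarded BTest traffic in either direction.
# Port numbers are assumptions; adjust them to the BTest setup in use.
/ip firewall mangle
add chain=forward protocol=tcp port=2000 action=mark-packet new-packet-mark=btest passthrough=no
add chain=forward protocol=udp port=2000-2100 action=mark-packet new-packet-mark=btest passthrough=no

# Head-end simple queue: matches only the packet mark and sets no max-limit,
# so it is unlimited. It must sit above the per-customer queues in the list.
/queue simple
add name=btest-bypass target=0.0.0.0/0 packet-marks=btest
```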

Attachment: sq.jpg
 
mducharme

Re: Queues lie

Thu Jun 13, 2019 7:43 am

> It's pretty difficult to misconfigure simple queues. Queue, queue, queue, that's the one that matches this customer, we're done. There are no other queues. Removing them and putting exactly the same queues back should not fix a bandwidth problem like this... and yet it does.

Can you share an export of that section instead of a screenshot? That screenshot doesn't show if you have parent/child set, the queue type used, etc.

Also, what is 1-POE-Host?
 
macsrwe

Re: Queues lie

Thu Jun 13, 2019 7:52 am

1-POE-Host is port 1 on the PowerBox, which would normally supply the host's own service, except the property is vacant currently so there's no connection.

Here is a partial printout of the queue; the rest of the rules are identical except for the address.

Attachment: sqcli.jpg
 
mducharme

Re: Queues lie

Thu Jun 13, 2019 8:05 am

> 1-POE-Host is port 1 on the PowerBox, which would normally supply the host's own service, except the property is vacant currently so there's no connection.
>
> Here is a partial printout of the queue; the rest of the rules are identical except for the address.
>
> Attachment: sqcli.jpg

OK, thanks.

So, there are a few things you are doing that can cause problems.

First, your bandwidth test exception rule at the beginning will not work as expected. One thing that I have found with simple queues is that if it is set for no limit on upload and download, it behaves as though that queue is not there. If you wish to try this for yourself it is easy to reproduce - create two queues for the same IP, the first of which will be set to unlimited, the second of which has an upload and download limit. You will find that traffic skips the first queue and goes to the second - the first gets ignored because it is unlimited. Because of this behavior, in cases like yours, I set a very large limit (more than I would ever need) for a "bypass" queue instead of simply leaving it unlimited. I think you will find that your btest traffic to the customer IPs is not bypassing the way you want and is instead getting counted in the customer's queue.

Second, is the Tower (self) queue for the device itself? Why is that even there?

Third, in my experience, most interface types do not work properly as "target" for a queue. This includes ethernet ports. Target=1-POE-Host will likely not work as you expect.
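
The reproduction test from the first point can be sketched as follows, assuming RouterOS v6. The address `192.0.2.10` and the queue names are hypothetical, and this is best run on a lab router rather than a production tower.

```routeros
# Two queues for the same IP: the first unlimited, the second limited.
/queue simple
add name=test-unlimited target=192.0.2.10/32
add name=test-limited target=192.0.2.10/32 max-limit=5M/10M

# After pushing traffic to and from 192.0.2.10, compare the counters
# to see which of the two queues actually handled it.
/queue simple print stats where name~"test-"

# The suggested alternative for a "bypass" queue: a very large limit
# instead of none ("btest-bypass" is a hypothetical queue name).
/queue simple set [find name=btest-bypass] max-limit=1G/1G
```

If the counters on `test-unlimited` stay at zero while `test-limited` climbs, that would match the behavior described above; if not, the unlimited queue is matching as configured.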
Last edited by mducharme on Thu Jun 13, 2019 8:26 am, edited 1 time in total.
 
macsrwe

Re: Queues lie

Thu Jun 13, 2019 8:25 am

> First, your bandwidth test exception rule at the beginning will not work as expected. One thing that I have found with simple queues is that if it is set for no limit on upload and download, it behaves as though that queue is not there. If you wish to try this for yourself it is easy to reproduce - create two queues for the same IP, the first of which will be set to unlimited, the second of which has an upload and download limit. You will find that traffic skips the first queue and goes to the second - the first gets ignored because it is unlimited. Because of this behavior, in cases like yours, I set a very large limit (more than I would ever need) for a "bypass" queue instead of simply leaving it unlimited. I think you will find that your btest traffic to the customer IPs is not bypassing the way you want and is instead getting counted in the customer's queue.

Manifestly improbable, as our bandwidth tests quite often show speeds well in excess of any customer's bursted queue. I can run BTest from one end of our network to the other (three or four wireless hops) and see speeds in the range of 80-140Mb. That speed will also be reflected in the first queue entry of all the intermediate towers.

> Second, is the Tower (self) queue for the device itself? Why is that even there?

Each night, the towers can swap large data files among themselves (ROS updates and the like). This line was an attempt to ensure that this administrative traffic proceeded at all possible speed. It never was entirely successful, and as it never was a source of problems, we didn't pursue the effort.

> Third, most interface types do not work as "target" for a queue. This includes ethernet ports. Target=1-POE-Host will not work as you expect.

I recalled that one or the other wouldn't work well, which is why I specified both the interface and the IP address, knowing one would work. Again, only about half our towers have hardwired hosts, but relief from this problem was seen at both types of tower.

It doesn't sound to me like you have found any problem with these queues gross enough to have reliably suppressed the bandwidth of a few (but not all) subscribers to an unsteady 20% of the configured rate. I have no idea how long you have been operating your own ISP with the same queue entries day in and day out, but I would not be at all surprised if the corruption took years to develop. On the tower serving the subscriber in my most recent diagnosis, the PowerBox was upgraded to a PowerBox Pro in July of 2017, which was the last time the queues were "fresh" until we began rejuvenating them monthly in March of this year. Similarly, when I saw the same syndrome on our queue tree, back when our bandwidth limiting was centralized at the gateway, the queues had been up for over a year before I finally determined that certain links were not getting the bandwidth they were allowed; again, wiping and reinitializing the queues with the same entries fixed the issue immediately.
 
mducharme

Re: Queues lie

Thu Jun 13, 2019 8:50 am

> Manifestly improbable, as our bandwidth tests quite often show speeds well in excess of any customer's bursted queue. I can run BTest from one end of our network to the other (three or four wireless hops) and see speeds in the range of 80-140Mb. That speed will also be reflected in the first queue entry of all the intermediate towers.

Try what I said to reproduce it. If you make two queues for the same thing, the first unlimited and the second limited, the first will be skipped and the second used. It is easy to test. The first queue will show a rate of zero. I am not making it up - try it for yourself. If BTest is not getting limited to the customer rate, perhaps there is a reason for that. Are you routing all APs or are you bridging? What is the btest to - is it to the same IP that is listed in the simple queue for the customer? What IP is it coming from? A simple queue will not limit traffic that the device is bridging (although I have heard that enabling "Use IP Firewall" can change this behavior). Maybe there is also something different on the intermediate towers with the way the queue is set up that is making it work?

> Each night, the towers can swap large data files among themselves (ROS updates and the like). This line was an attempt to ensure that this administrative traffic proceeded at all possible speed. It never was entirely successful, and as it never was a source of problems, we didn't pursue the effort.

Why would you need this? Why would it be limited in the first place? What would limit the traffic between the towers? I really don't understand. Also, if it doesn't do anything, why leave it? You seem to be ignoring the possibility that it could cause problems.

> I recalled that one or the other wouldn't work well, which is why I specified both the interface and the IP address, knowing one would work. Again, only about half our towers have hardwired hosts, but relief from this problem was seen at both types of tower.

Just specify the IP. Remove the interface; it doesn't really help you.

> It doesn't sound to me like you have found any problem with these queues gross enough to have reliably suppressed the bandwidth of a few (but not all) subscribers to an unsteady 20% of the configured rate.

If you put lines into your router and they don't do what you expect, don't leave the garbage in there, remove it. All of your problems could be caused by this stuff that you are leaving in because you think it is doing nothing, and maybe most times it is doing nothing until you trigger some unusual corner-case bug that wouldn't happen if you didn't have those config problems.

We run several thousand queues on hundreds of routers over several years and I have not once seen the behavior you describe.
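
For reference, the "Use IP Firewall" option mentioned above is a bridge-wide setting. A minimal sketch of checking and enabling it, assuming RouterOS v6; whether it makes simple queues apply to bridged traffic on a given setup is something to verify rather than assume.

```routeros
# Show the current bridge settings, including use-ip-firewall
/interface bridge settings print

# Pass bridged traffic through the IP firewall
/interface bridge settings set use-ip-firewall=yes
```

Enabling it sends bridged packets through the IP firewall, which costs extra CPU, so it is worth testing on one tower first.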
 
reiniss2
MikroTik Support
Posts: 47
Joined: Wed Jan 02, 2019 12:14 pm
Location: Latvia

Re: Queues lie

Thu Jun 13, 2019 9:01 am

> This isn't a question, it's a warning.
> There is a bug of long standing in RouterOS that causes invisible, internal queue corruption. I have experienced it with both tree queues and simple queues, over a period of something like eight years, and have incontrovertibly proved it is happening.

Have you tried to generate a supout.rif file when the client experiences the issues and send it to support@mikrotik.com?
 
macsrwe

Re: Queues lie

Fri Jun 14, 2019 6:01 am

> Have you tried to generate a supout.rif file when the client experiences the issues and send it to support@mikrotik.com?

Sadly, no. As I mentioned, I have corrected for this bug only three times in all of 12 years. Each time, it was a Hail Mary attempt: "I'm totally buffaloed... let's try this, it couldn't POSSIBLY be the problem, could it?" And by the time I discovered that yes, indeed, the queues were causing the problem and rebuilding them solved it, it was too late to grab a "before" supout.

As the saying goes, one time is happenstance, two times is coincidence, and three times is enemy action. If I had to manually rebuild queues in the future, I would definitely remember to preserve a supout. But since it's more important to my users that I prevent this problem (with the periodic auto rebuilds) rather than catch it red-handed, I suspect that opportunity will not present itself in the future.
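
For anyone who does hit this by hand, a sketch of capturing the "before" state ahead of a manual rebuild, assuming RouterOS v6 simple queues. The file names are hypothetical.

```routeros
# Capture diagnostics while the fault is still present
/system sup-output name=queues-before-rebuild.rif

# Only then rebuild the queues (the same export/remove/re-import fix described earlier)
/queue simple export file=queues-rebuild
/queue simple remove [find]
/import file-name=queues-rebuild.rsc
```

The resulting `.rif` file can then be downloaded from the router's file list and sent to support along with a description of the symptom.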
