We're using v5.4 for now, and sometimes our PCQ queue shows erratic behaviour. This is my queue type:
add kind=pcq name=unlim-35-upload pcq-classifier=src-address \
pcq-rate=512k pcq-limit=256 pcq-total-limit=65536
I'll preface this by saying I'm far from a PCQ expert, but my initial thought would be your queue is WAY too large. During your trouble times, you have approximately 15k or 18k packets queued up, but you specifically allow for up to 64k packets in the queue before discarding. *Technically* your queue appears to be working as programmed.
Intuitively, we would think that a bigger buffer is always better. Packet loss is the enemy, so the more data we can queue up, instead of dropping, the better, right? Unfortunately TCP doesn't work that way.
Large queue lengths have a very adverse effect on latency and overall throughput. TCP has built-in congestion-avoidance mechanisms which are designed with the understanding that all routers along the way simply drop packets when overloaded. If you buffer everything instead of dropping, TCP keeps sending data and never backs off, causing more congestion. In your screenshots you had as much as 2MB worth of data queued up. I'm not sure what your link speed is, but 2MB in queue for a router seems like a lot.
If you figure you have 2MB worth of packets queued up, and a 10Mbit/s line, then it would take roughly 1600ms to clear the queue (2MB is about 16Mbit, and 16Mbit at 10Mbit/s is 1.6 seconds). That means the last packet in that queue has a 1600ms delay added to it before you send it. Now imagine if every router along the way was doing the same thing. Each one adds another 10-20ms of delay due to excessive queue length, and pretty soon the delay is enough to cause the TCP packets to time out.
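The arithmetic behind that delay figure can be checked with a short sketch. It assumes "2MB" means 2,000,000 bytes and that the link drains the queue at its full rate with nothing new arriving; the function name is just for illustration.

```python
def queue_drain_ms(queued_bytes: int, link_bits_per_s: int) -> float:
    """Milliseconds needed to transmit everything already sitting in the queue.

    Assumes the link runs at its full rate and no new packets arrive.
    """
    return queued_bytes * 8 * 1000 / link_bits_per_s


# 2 MB queued (taken as 2,000,000 bytes) on a 10 Mbit/s line
print(queue_drain_ms(2_000_000, 10_000_000))  # 1600.0
```

The same queue on a slower link is proportionally worse: halve the link rate and the worst-case added delay doubles.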
This problem is commonly known as bufferbloat. Jim Gettys has done a very nice job of documenting the effects of it on his blog:
http://gettys.wordpress.com/bufferbloat-faq/
The MikroTik wiki also has some slides that support this idea, showing that when the queue is unlimited, most of the packets end up delayed:
http://wiki.mikrotik.com/wiki/Manual:Queue_Size
In the systems I have worked with, I typically pick a pcq-limit of around 20-75 (20 works fine for smaller offices / homes, 75 for a lot of "bursty" traffic), then a pcq-total-limit of pcq-limit * max users * 80% (so the queue can hold 80% of the maximum expected users, all at their full pcq-limit). I usually try to keep pcq-total-limit under 10000 to avoid the bufferbloat problem. Janis talks a bit about this in his QoS presentation (pages 26 and 27):
http://mum.mikrotik.com/presentations/US08/janism.pdf
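As a rough sketch of that rule of thumb (pcq-limit * max users * 80%, kept under about 10000), something like the following works. The user counts are made-up examples, and the function and constant names are only for illustration; this is my own sizing habit, not anything RouterOS enforces.

```python
SUGGESTED_CAP = 10000  # personal rule of thumb from this post, not a RouterOS limit


def pcq_total_limit(pcq_limit: int, max_users: int, fill_percent: int = 80) -> int:
    """Total queue size (in packets) for the given per-user limit and user count.

    Sized so that fill_percent of the expected users can each fill their
    own sub-queue to pcq_limit at the same time.
    """
    return pcq_limit * max_users * fill_percent // 100


# Hypothetical small office: 20 packets per user, 100 users
print(pcq_total_limit(20, 100))  # 1600

# Hypothetical bursty site: 50 packets per user, 200 users
total = pcq_total_limit(50, 200)
print(total, total <= SUGGESTED_CAP)  # 8000 True
```

If the result lands above the cap, I would lower pcq-limit rather than let pcq-total-limit grow.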
As I said before, I'm not incredibly familiar with the finer workings of PCQ, nor do I have much carrier-level QoS experience, but this would be my best educated guess. Turn down your pcq-limit and pcq-total-limit and the latency and timeout issues should clear up. Best of luck.
--CC_DKP