Intermittent loss of packets... argh

Setup: RB450Gx4 router
Two WAN IPs; the primary is 1 Gbit fibre
In between: DGS-1100-24 switch
All on a UPS
I have two users, on two separate VLANs, who play poker online.
They describe short outages, long enough to lose packets. They see slowdowns, and on some sites this is enough to boot them off, while other sites are more forgiving (better servers).

They noticed that it happened at the same time (on both VLANs, each using different servers), which points to a COMMON source.

Since this is seemingly sporadic/random, I am trying to work out what would cause intermittent delays that are more prevalent during the day than at night.
The first thing I am going to do is put them on separate WAN IPs to rule the primary ISP in or out as the culprit.
If they both still experience the issue simultaneously, then it’s not the ISP, and that leaves the router and the switch (and the UPS).
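
One way to make the "same time on both VLANs" observation less anecdotal is to have each user note outage times and then pair them up. The sketch below is only an illustration: the function name `shared_outages`, the timestamps in seconds, and the 5 second tolerance are all assumptions, not anything from the setup above.

```python
def shared_outages(times_a, times_b, tolerance_s=5.0):
    """Pair outage timestamps (in seconds) from two users that fall
    within tolerance_s of each other. Hypothetical helper."""
    pairs = []
    b_sorted = sorted(times_b)
    j = 0
    for a in sorted(times_a):
        # Skip timestamps from user B that are too old to match this one.
        while j < len(b_sorted) and b_sorted[j] < a - tolerance_s:
            j += 1
        if j < len(b_sorted) and abs(b_sorted[j] - a) <= tolerance_s:
            pairs.append((a, b_sorted[j]))
    return pairs
```

If most outages pair up while the two users are on different WAN IPs, that would point away from a single ISP and towards shared equipment.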

Are switches prone to such intermittent traffic interruptions?
Can using the router as DNS server (cache on router) cause such issues?
Can a UPS cause such issues?
Any other ideas?

Does the game use UDP? Some packet loss is normal with UDP. For example, if a buffer is full, new UDP packets are simply dropped; unlike TCP, UDP does not retransmit them.
See also http://forum.mikrotik.com/t/high-packet-loss-switching-udp-traffic/101746/1 and http://forum.mikrotik.com/t/unrecorded-packet-drops-and-low-level-udp-packet-loss/45531/1
Enabling Ethernet flow control might help, as might increasing buffer sizes and/or queue sizes, if possible.
And of course giving such traffic higher priority via QoS could help as well…
A traffic loop can cause packet loss too: the TTL then reaches 0 (and the packet is automatically dropped) before the packet can reach its destination…
See also this MT diagram where packet drop is visualized: http://forum.mikrotik.com/t/queue-bucket-size-option-explained/98061/1
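
To illustrate the "buffer is full, new packets are dropped" point, here is a minimal tail-drop simulation. It is a toy model under stated assumptions (a single FIFO, a fixed drain rate per tick, made-up numbers), not a description of how RouterOS queues actually behave.

```python
from collections import deque

def simulate_tail_drop(arrivals, capacity, drain_per_tick):
    """Simulate a fixed-size FIFO buffer.

    arrivals: packets arriving in each tick (list of ints).
    Returns (delivered, dropped). All parameters are illustrative.
    """
    queue = deque()
    delivered = dropped = 0
    for burst in arrivals:
        for _ in range(burst):
            if len(queue) < capacity:
                queue.append(1)
            else:
                dropped += 1  # buffer full: the new packet is discarded
        for _ in range(min(drain_per_tick, len(queue))):
            queue.popleft()
            delivered += 1
    delivered += len(queue)  # whatever is still queued drains afterwards
    return delivered, dropped
```

A burst of 10 packets into a 4-packet buffer loses 6 of them even though the average rate is low, which is why short bursts can hurt UDP traffic while CPU load stays near zero.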

What I have been able to determine so far:

  • that I am easily getting 30% or more packet loss to the sites in question (not good)
  • that both ISPs are showing 5-10% packet loss to their respective gateways (bad)
  • that traffic flow on one of the ISPs has significant gaps simultaneously in both TX and RX, but the router’s logs do not show the connection going down (very bad)
  • CPU load is always around 0-1%
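
The simultaneous TX/RX gaps mentioned above could also be picked out of exported interface samples rather than read off a graph. The following is a rough sketch that assumes the traffic has been sampled at regular intervals into (tx, rx) pairs; the function name and the `min_len` threshold are hypothetical.

```python
def find_gaps(samples, min_len=3):
    """Return (start, end) index ranges, inclusive, where both TX and RX
    are zero for at least min_len consecutive samples."""
    gaps = []
    start = None
    for i, (tx, rx) in enumerate(samples):
        idle = tx == 0 and rx == 0
        if idle and start is None:
            start = i
        elif not idle and start is not None:
            if i - start >= min_len:
                gaps.append((start, i - 1))
            start = None
    # Close a gap that runs to the end of the data.
    if start is not None and len(samples) - start >= min_len:
        gaps.append((start, len(samples) - 1))
    return gaps
```

Comparing the gap ranges with the router log timestamps would show whether the link stays "up" while no traffic passes.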

On the bad ISP connection, I will try a different cable to the modem and a different Ethernet port on the router.
Overall, I will try a replacement router when able, to see if I can quickly test traceroute to the gateway, obviously looking for the 0% packet loss that should be expected.
That would be a good first step.

You should ensure that Ethernet flow control (EtherType 0x8808) is operational/activated on your device.

I’m not seeing a place to inspect or modify that?

In firewall or ACL accept (don’t block) these L2 packets.

For example I have in one of my devices these ACL rules:

add switch=switch1 ports=$myPorts mac-protocol=0x8808 comment="L2 Ethernet flow control"
add switch=switch1 ports=$myPorts mac-protocol=802.2 comment="essential L2"

See also:
https://en.wikipedia.org/wiki/Ethernet_flow_control
https://en.wikipedia.org/wiki/EtherType#Examples

I am afraid it may not be that easy. Ethernet flow control packets are usually processed by the hardware itself at a very low level, so it is a challenge even to capture them, let alone process them using switch rules. Whether it is possible may depend on the particular switch chip used.

The manual talks about the 8327 switch chip (which is used on the OP’s RB450Gx4) using flow control towards the CPU (so a slow speed negotiated on one port can affect sending packets via other ports), but it doesn’t mention whether it itself handles ethernet flow control frames arriving from outside (so also X-off received from one uplink could affect transmission through the other one).

Hmm. it’s just an Ethernet frame.
In the first link above there is an image showing a Wireshark screenshot of an Ethernet “Pause” frame. It’s about that very frame.
Since Wireshark is a PC application, then one could wonder how it was able to capture that frame…

On this MT wiki page it’s described: https://wiki.mikrotik.com/wiki/Manual:Interface/Ethernet

tx-flow-control (on | off | auto; Default: off) When set to on, the port will generate pause frames to the upstream device to temporarily stop the packet transmission. Pause frames are only generated when some routers output interface is congested and packets cannot be transmitted anymore. auto is the same as on except when auto-negotiation=yes flow control status is resolved by taking into account what other end advertises.

rx-flow-control (on | off | auto; Default: off) When set to on, the port will process received pause frames and suspend transmission if required. auto is the same as on except when auto-negotiation=yes flow control status is resolved by taking into account what other end advertises.
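
For reference, the pause frame these settings generate or act on has a simple fixed layout: the reserved multicast destination 01:80:C2:00:00:01, EtherType 0x8808, opcode 0x0001, and a 16-bit pause time counted in units of 512 bit times. The sketch below only builds and recognises such a frame in memory; the function names are made up for illustration, and it says nothing about whether a given NIC or switch chip will pass these frames up to software.

```python
import struct

PAUSE_DST = bytes.fromhex("0180c2000001")  # reserved MAC control multicast address
ETHERTYPE_MAC_CONTROL = 0x8808
OPCODE_PAUSE = 0x0001

def build_pause_frame(src_mac: bytes, quanta: int) -> bytes:
    """Build an Ethernet pause frame without the trailing FCS.

    quanta is the pause time in units of 512 bit times (0..65535).
    """
    if len(src_mac) != 6:
        raise ValueError("src_mac must be 6 bytes")
    if not 0 <= quanta <= 0xFFFF:
        raise ValueError("quanta must fit in 16 bits")
    header = PAUSE_DST + src_mac + struct.pack("!H", ETHERTYPE_MAC_CONTROL)
    payload = struct.pack("!HH", OPCODE_PAUSE, quanta)
    # Pad to the 60-byte minimum frame size (64 bytes once the FCS is added).
    return (header + payload).ljust(60, b"\x00")

def is_pause_frame(frame: bytes) -> bool:
    """Check EtherType and opcode of a captured frame."""
    return (len(frame) >= 18
            and frame[12:14] == b"\x88\x08"
            and frame[14:16] == b"\x00\x01")
```

A capture filter or switch rule matching on EtherType 0x8808 is looking at bytes 12-13 of exactly this layout.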

The fact that Wireshark can dissect a pause frame says nothing about how easy or difficult it is to actually capture it, using Wireshark/dumpcap/tcpdump. The key is the ability of the network card to let the pause frames through to the software layer rather than handling them internally. https://osqa-ask.wireshark.org/questions/56214/best-nic-for-detecting-pause-frames

Here’s a screenshot. At least for the WAN port the “Tx Flow Control” and “Rx Flow Control” should be set to “Auto” or “Yes”.
On my device I’ve set them all to Auto.
Auto Negotiation is by default enabled.
Of course such packets (in and out) must not be blocked by a firewall rule, meaning these have to be accepted…
(Attached screenshot: Interface-Flow-Control.png)

Both are off on mine. I changed both to auto on my VLAN Bell connection and there was no change in packet loss to the gateway of the ISP.
After running for about 1.5 hours, both were sitting at about 50%

Then I think iperf is your best friend… :slight_smile:
I think I would get rid of the VLANs and use pure IP routing instead, and also use an MT switch with RouterOS instead of the D-Link switch, since with ROS you have much more control.
Sorry, can’t help any further, I just tried. Maybe @sindy has some more ideas.

Much appreciated, right now I just connected another router (spare hex) to the bell fibre connection so am checking out any differences. By the way my connection on the hex is via a vlan with vlan bridge filtering.

Good news is that on the hex, after 10 minutes, there is not a single failure on traceroute to the Bell gateway, whereas on the RB450G it was around 50% on Bell and 20% on Eastlink.

Here is a WinMTR report for reaching Google on the hex via Bell fibre:

|------------------------------------------------------------------------------------------|
|                                      WinMTR statistics                                   |
|                       Host              -   %  | Sent | Recv | Best | Avrg | Wrst | Last |
|------------------------------------------------|------|------|------|------|------|------|
|                             192.168.2.1 -    0 |   97 |   97 |    0 |    0 |    3 |    0 |
|       loop0.52w.ba06.drmo.ns.aliant.net -    0 |   97 |   97 |    0 |    2 |   29 |    2 |
|         be12-83.cr01.drmo.ns.aliant.net -    0 |   97 |   97 |    1 |    2 |    5 |    1 |
|      hg-0-4-0-0.cr01.hlfx.ns.aliant.net -    0 |   97 |   97 |    1 |    2 |    4 |    2 |
|            be19.bx01.nycm.ny.aliant.net -    0 |   97 |   97 |   18 |   19 |   24 |   20 |
|                            72.14.220.96 -    0 |   97 |   97 |   17 |   19 |   68 |   21 |
|                          108.170.248.33 -    0 |   97 |   97 |   19 |   20 |   24 |   19 |
|                          216.239.42.165 -    0 |   97 |   97 |   18 |   19 |   22 |   18 |
|                lga34s14-in-f3.1e100.net -    0 |   97 |   97 |   18 |   18 |   21 |   18 |
|________________________________________________|______|______|______|______|______|______|
   WinMTR v0.92 GPL V2 by Appnor MSP - Fully Managed Hosting & Cloud Provider

Here is an example of the bad router with the more stable internet of eastlink, still can use it despite the losses…

|------------------------------------------------------------------------------------------|
|                                      WinMTR statistics                                   |
|                       Host              -   %  | Sent | Recv | Best | Avrg | Wrst | Last |
|------------------------------------------------|------|------|------|------|------|------|
|                             192.168.0.1 -    0 |   98 |   98 |    0 |    0 |    5 |    0 |
|                             10.80.128.1 -    4 |   86 |   83 |    8 |   13 |   26 |   14 |
|              ns-trur-asr002.eastlink.ca -    4 |   86 |   83 |    9 |   13 |   20 |   13 |
|            ns-hlfx-dr001.ns.eastlink.ca -    4 |   86 |   83 |    9 |   14 |   21 |   13 |
|            ns-hlfx-br001.ns.eastlink.ca -    4 |   86 |   83 |   10 |   14 |   21 |   16 |
|            ns-hlfx-br002.ns.eastlink.ca -    4 |   86 |   83 |    9 |   13 |   23 |   13 |
|                           74.125.52.182 -    4 |   86 |   83 |   22 |   28 |  129 |   30 |
|                          108.170.251.17 -    4 |   86 |   83 |    0 |   27 |   38 |   24 |
|                          108.170.231.65 -    4 |   86 |   83 |    0 |   28 |   37 |   28 |
|                yul03s05-in-f3.1e100.net -    4 |   86 |   83 |    0 |   27 |   36 |   31 |
|________________________________________________|______|______|______|______|______|______|
   WinMTR v0.92 GPL V2 by Appnor MSP - Fully Managed Hosting & Cloud Provider
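
For comparing reports like the two above, the Sent and Recv columns can be turned into a loss figure per hop. This is a rough parser that assumes the plain-text export layout shown here (seven pipe-separated cells per hop row); the function name `parse_winmtr` is hypothetical, and WinMTR’s own % column may be rounded differently from the computed value.

```python
def parse_winmtr(report):
    """Parse WinMTR text rows into dicts with host, sent, recv and a
    loss percentage computed from sent/recv."""
    hops = []
    for line in report.splitlines():
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        if len(cells) != 7 or " - " not in cells[0]:
            continue  # borders, title and footer lines
        if not (cells[1].isdigit() and cells[2].isdigit()):
            continue  # column header row
        host = cells[0].rpartition(" - ")[0].strip()
        sent, recv = int(cells[1]), int(cells[2])
        loss = 100.0 * (sent - recv) / sent if sent else 0.0
        hops.append({"host": host, "sent": sent, "recv": recv, "loss": loss})
    return hops
```

Loss that first appears at the ISP’s first hop and stays constant further out, as in the Eastlink report, is consistent with the problem sitting at or before that hop rather than deeper in the ISP network.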

I was doing about 350-450 Mbps and up to 500,000 pps (~80% UDP and 20% TCP) on a Linux router with an old kernel (2.6…) on 200 Euro hardware with 0% packet loss.

Don’t waste time with non-working MikroTik hardware. Just replace it (maybe with another cheap vendor) and get rid of the problems.

A netinstall was advised to rule out some nand corruption etc…

In any case the hex is up and running and I am seeing 0 to 0.1 percent loss on traceroute to the gateways of each ISP, so I am somewhat relieved.
Sad to have to junk the rb… Looking back it was always quirky but never had enough proof… Out of warranty too.

Anav, first do as you preach, i.e. post a diagram with copies of the config here.

Then someone can have a more educated look at things

Hi CZFAN, I did exhaustive testing with someone far more knowledgeable than me, so not to worry I wasn’t bumbling around like a complete fool.
Suffice to say, even if a miracle occurred and the 450G was resurrected from insidious hell, it would remain led lights off as I am going on a new voyage with a CCR1009.
Yes, I have graduated to the next level. Perhaps Normis will send me an MT t-shirt. :wink:

I just completed a netinstall and I will post the traceroute to the ISP gateway and the WinMTR to google.ca results, as well as the gap image from earlier testing, which shows one ISP with no gap and the other with a huge gap.

No change! :frowning: Originally using 6.46.6 firmware, now using 6.45.9 firmware - no difference!


https://imgur.com/1y5Gd9s
https://imgur.com/DYPJGH4
https://imgur.com/OHdK0RH

@Anav,

Not sure if I missed anything, but I have not seen any evidence in this thread that indicates any problems on the 450. Changing things from the default, i.e. flow control, etc is going to make your environment more complicated and more prone to problems.

You are welcome to throw money at it and buy a CCR1009, but I don’t think there are any guarantees that it will solve the problem.

So let’s start with the basics, work from the RB450 outwards, and provide:

  1. a diagram / more info on how this connects to the ISP network
  2. anonymized export of the 450 config

Hi CZFAN, I have been testing a netinstall version of the 450 using a backup config and a default config with interesting results.
Suffice to say, that it may be something in the config and will confirm tomorrow.