Community discussions

MikroTik App
 
User avatar
rfc1149
Frequent Visitor
Frequent Visitor
Topic Author
Posts: 77
Joined: Fri May 15, 2020 4:26 am
Location: England

RBMetalG-52SHPacn link constantly disconnecting under load and randomly

Sat Apr 02, 2022 1:02 am

I have a RBMetalG-52SHPacn (International) configured for a client to link their downstairs 5GHz CPE (DWA0122) to a PC upstairs.
The Metal is configured to connect to the 5GHz network and then distribute an IP via DHCP to the PC linked to it via Ethernet with a PoE injector.
This room is directly above the router broadcasting the 5GHz network and has the stock dipole omni antenna installed with the right angle N connector.
Under load, the link will average 150 to a maximum of 265Mbps UL or DL. CPU runs between 7% and 72%, temperature is nominal 34C - 38C.
Link power is nominal (between -61dBm and -54dBm) and I see no indication of DFS issues on the CPE side (there are detailed logs).

The wireless mode is set to station. There is no bridge (STP or RSTP) enabled. DHCP server for ether1 providing Ethernet client with IP address.
Default route for wlan1. Fasttrack enabled and from all tests, wired connectivity is sane. Here is the config for the wireless link and the security profile.

name="wlan1" mtu=1500 l2mtu=1600 mac-address=REMOVED arp=enabled interface-type=Atheros AR9888 mode=station ssid="Devon" frequency=auto band=5ghz-n/ac channel-width=20/40/80mhz-Ceee secondary-frequency="" scan-list=default wireless-protocol=802.11 vlan-mode=no-tag vlan-id=1 wds-mode=disabled wds-default-bridge=none wds-ignore-ssid=no bridge-mode=enabled default-authentication=yes default-forwarding=yes default-ap-tx-limit=0 default-client-tx-limit=0 hide-ssid=no security-profile=Devon compression=no
name="Devon" mode=dynamic-keys authentication-types=wpa2-psk unicast-ciphers=aes-ccm group-ciphers=aes-ccm wpa-pre-shared-key="" wpa2-pre-shared-key="PassphraseRemoved" supplicant-identity="" eap-methods="" tls-mode=no-certificates tls-certificate=none mschapv2-username="" mschapv2-password="" disable-pmkid=yes static-algo-0=none static-key-0="" static-algo-1=none static-key-1="" static-algo-2=none static-key-2="" static-algo-3=none static-key-3="" static-transmit-key=key-0 static-sta-private-algo=none static-sta-private-key="" radius-mac-authentication=no radius-mac-accounting=no radius-eap-accounting=no interim-update=0s radius-mac-format=XX:XX:XX:XX:XX:XX radius-mac-mode=as-username radius-called-format=mac:ssid radius-mac-caching=disabled group-key-update=1h management-protection=disabled management-protection-key=""

Network connectivity and reachability when initially established works as configured, no firewall issues.
Under constant or random load of 10Mbps to 100 Mbps UL+DL, the wireless link will disconnect repeatedly and erratically. There is no predictable pattern to this.
The reasons are most commonly extensive data loss and less commonly no beacons received.
Reconnecting within 15-30 seconds, a bit longer because the DHCP client won't come straight back up once reassociated.

All other 5GHz clients are unaffected by this, it's just the Metal. The other wireless clients have no DHCP delay once connected and get addresses within 1s.
Additionally, other 5GHz devices connecting to the CPE with either 1x1, 2x2 or 3x3 have no trouble with traffic throughput in the 350Mbps to 650Mbps range without breaking a sweat.
The CPE has a 3x3 5GHz radio capable of 1300Mbps. The 5GHz radio config is a/n/ac, 3x3, Channel auto (Chooses 60@80MHz, DFS, region UK), Auto channel width (20/40/80MHz),
SGI support, STBC enabled, LDPC enabled, Frame bursting disabled.

The fibre link for the WAN side of the CPE is testing at 850+Mbps easily over Ethernet, so when I speed test the Metal at 250Mbps, I know it's getting less than theoretical throughput due to real world conditions (noise, attenuation, humidity, backscatter, reflections etc etc) but the constant disconnections are the deal breaker. Only the Metal exhibits this behaviour and it's driving me and the client insane. There is no indication of what it could be like CPU bottlenecking because it never even hits 100%. There isn't even a great distance to blame it on! Worse so, sometimes the speed tests don't even cause a dropout, making it seem like it's finally functioning as intended. Then with lower than 1Mbps either direction, the disconnections happen again! The CPE side has no error or issue other than losing visibility of the client and recording the reassociation 25-35s later. This is not seen for other 5GHz clients as I mentioned, some of which have uptimes of over 6 days on the 5GHz CPE network (iPhones, iPads, Laptops).

I have tried various settings, manual-txpower, regulatory-domain (UK), no_country_selected with both regulatory-domain and manual-txpower,
changing the operating mode to 5GHZ-only-AC, N/AC, A/N/AC, A/N, only-N, A, increasing the Group Rekey from 30m to 1h, changing installation from
indoor, outdoor and any, station roaming disabled (there's only one SSID for 5GHz on the CPE), manually setting the frequency to connect on,
I've even gone so far as to lock it to the currently broadcast frequency which is less than ideal because of DFS for the CPE side.
In every case, the disconnections still reoccur mid-load and bring the test to a halt.
All test periods were at least 15 minutes with 5 minutes constant load, 5 minutes random load and 5 minutes of doing nothing.
Constant load tests were done with StarTrinity CST and speed tests with Ookla Speedtest app for Windows.

Link tests were done with the Metal to a hAP AC2 over a 15cm Cat6a cable and btest server, 94-96% gigabit speeds with CPU between 13% and 28%.
The firewall function is enabled but with less than 10 rules total for input and forward chain, there is no indication that this is causing any issues.
What's making this worse is that I cannot determine any cause for the wireless link itself because there is nothing descriptive about the link loss. See below.


metal-52shpacn-disconnects.png
metal-52shpacn-throughput-graph.png


I have tried the extra wireless logging and there is just radar detection output, so no useful feedback was gleaned.
Should I keep shaking my fist at god or is this an indication of a faulty unit?
I know that the wireless line up from MikroTik doesn't tend to be super great but this...why this?
I'm starting to wonder if any of the other Metals I've deployed are performing poorly like this unit.
I'm very sad because the Metal is such a hardy, well built piece of kit so to see it struggle like this is just painful. :(
You do not have the required permissions to view the files attached to this post.
 
User avatar
rfc1149
Frequent Visitor
Frequent Visitor
Topic Author
Posts: 77
Joined: Fri May 15, 2020 4:26 am
Location: England

Re: RBMetalG-52SHPacn link constantly disconnecting under load and randomly

Sat Apr 02, 2022 11:30 am

I have logged a support request, I really hope they have an answer or some insight for this.
 
User avatar
rfc1149
Frequent Visitor
Frequent Visitor
Topic Author
Posts: 77
Joined: Fri May 15, 2020 4:26 am
Location: England

Re: RBMetalG-52SHPacn link constantly disconnecting under load and randomly

Sat Apr 02, 2022 4:50 pm

Update April-02: Stable function on 2.4GHz with no disconnects since I switched it over.
Seems to be just 5GHz that has this issue.

Update 2-April 02: Very stable on 2.4GHz since, but at the loss of so much link throughput since I can't get the CPE to use 40MHz due to crowding.

metal-52shpacn-2.4ghz-stable-link.png

Update 3-April 03: Tried load testing at 15Mbps UL+DL again and the 2.4GHz link crapped out and stopped responding to traffic, requiring me to disable/enable the link to get it functional again. It didn't drop its DHCP lease or disconnect from the network, it just stopped delivering packets via the default route and was still reachable via Winbox. This is worse than I expected.

Update 4-April 04: The same problem as update 3 continues. The 2.4GHz link to the CPE appears connected with no issues, wireless link rates indicate connectivity hasn't dropped so it doesn't appear that the wireless driver is faulting (although who really knows what is going on because I can't see the kernel logs) and then suddenly, without warning, the Metal stops delivering packets to the default route. Pinging Cloudflare DNS returns "unreachable from <CPE DHCP pool IP address>". Disabling and re-enabling the link restores connectivity but this really shouldn't be the case. I can script restarts of the link but is that really what it has come to? Still waiting on support response (I'm reasonable, I'll give them at least 30 days since I'm not some major corporate customer or anything like that). I should note that the above scenario is occurring randomly, without link load or with link load. There is still no predictable pattern to the issues. CPE WAN and LAN uptime is 25+ days with all directly connected clients logging 6 to 12+ days uptime and no stability issues. I'm going mad.
You do not have the required permissions to view the files attached to this post.
 
User avatar
rfc1149
Frequent Visitor
Frequent Visitor
Topic Author
Posts: 77
Joined: Fri May 15, 2020 4:26 am
Location: England

Re: RBMetalG-52SHPacn link constantly disconnecting under load and randomly

Thu Apr 07, 2022 6:49 pm

For the client, I've gone ahead and put in CAT6a cabling free of charge to bear the embarrassment of my failed recommendation with a proper solution for them. Note that they wanted a wireless option with very high speed and high reliability, hence the choice of hardware. I won't make this mistake twice.

I got restless waiting for a support response and decided to take a crack at the issue myself.
The online supout.rif viewer wasn't helping much so I found a client-side tool to unpack them. Bursting full of useful information.
(Warning: Do not share your .rif's with websites you do not trust as they contain a lot of identifying information about your network and configuration)

From the initial assessment and some digging around (a completely unrelated setup but similar wireless family https://bugs.freebsd.org/bugzilla/show_ ... ?id=213621) it seems like after a packet corruption event involving the RSC bit, in this specific reoccurring scenario, every subsequent frame after that is considered a replay attack. Also it seems that for the beacon related disconnects, it might be related to a beacon miss so either packet corruption or power saving settings at play? Lots of guess work, I know. Not sure if this can be or is fixed in recent firmware's for the SoC. I'll start digging further into qca9889 to see if fixes for these issues ever made it to later firmware's but I am limited to the output from the snippets in the rif's in debugging this so I'm capturing opportunistic data when generating a supout and trying to time it just right so that the buffers for syslog don't fill up with replay messages so I can see what output happens right before the disconnection event. Less than ideal. :(

Edit: Completely missed that the full syslog is actually available when I was glancing at the output, I just need to restructure it for a continuous timeline.
@wireless@
2022.04.02-07:31:10.12@0: chan: 5300/20-eeCe/ac/DP(23dBm) defaultrate: 0x2000 rates: OFDM:6-54 BW:1x-4x SGI:1x-4x HT:0-7 VHTMCS:SS1=0-9
2022.04.02-07:31:10.12@0: VapDevice::vapInit: wdsmode 0
2022.04.02-07:31:10.12@0: StaVapMode::join: master, wlansmode
2022.04.02-07:31:10.12@0: HwDevice::hwStateSwitch: 17 curr 0/0 pend ...
2022.04.02-07:31:10.13@0: statechange: curr 0/0 pend 17/66 old 0/0 new 17/66
2022.04.02-07:31:10.14@0: StaVapMode::masterNlStatechange: state 3, change 0->17
2022.04.02-07:31:10.14@0: SuppContext::SuppContext 51d540
2022.04.02-07:31:10.14@0: Context::setupParams: ucast 3, group 3, auth 1
2022.04.02-07:31:10.15@0: StaVapMode::nlJoinresult
2022.04.02-07:31:10.15@0: SuppContext::associated
2022.04.02-07:31:10.15@0: SuppContext::doAuth
2022.04.02-07:31:10.15@0: SuppContext::doAuth: passToPSK...
2022.04.02-07:31:10.15@0: SuppContext::doAuth: passToPSK complete: pass <CPE_AP_PSK> ssid <SSID_REMOVED> pmk <PMK_REMOVED>
2022.04.02-07:31:10.15@0: SuppContext::doHandshake
2022.04.02-07:31:10.15@0: StaVapMode::nlJoinresult: joined
2022.04.02-07:31:10.15@0: StaVapMode::roamingInit: disabled
2022.04.02-07:31:10.15@0: StaMode::masterStateCb: masterstate 1 vapstate 5 info: 
2022.04.02-07:31:10.17@0: StaPeer::nlDot1x
2022.04.02-07:31:10.17@0: SuppContext::handle1
2022.04.02-07:31:10.17@0: SuppContext::handle1: isRsn, lookup pmkid
2022.04.02-07:31:10.18@0: StaPeer::nlDot1x
2022.04.02-07:31:10.18@0: SuppContext::handle3
2022.04.02-07:31:10.18@0: SuppContext::handle3: installing group, idx 1
2022.04.02-07:31:10.18@0: SuppContext::handle3: setKernelKey
2022.04.02-07:31:10.18@0: SuppContext::handle3: all done, go to IDLE
2022.04.02-07:31:10.18@0: NetDev::authorizedPeer: now 1
2022.04.02-07:31:10.22@0: WPeer::changeState: minsig -120, maxsig 120
2022.04.02-07:31:10.22@0: StaPeer::onStateChange: authorized
2022.04.02-07:50:49.78@0: NetDev::unauthorizedPeer: now 0
2022.04.02-07:50:49.78@0: Peer::removeFromRegTable <CPE_AP_ADDRESS>
2022.04.02-07:50:49.78@0: StaPeer::wdsUnauthorized: bridgeportid 4294967295
2022.04.02-07:50:49.78@0: StaVapMode::delPeer: extensive data loss
2022.04.02-07:50:49.78@0: StaVapMode::disconnected
2022.04.02-07:50:49.78@0: StaMode::masterStateCb: masterstate 3 vapstate 7 info: 
2022.04.02-07:50:49.78@0: StaMode::disable
2022.04.02-07:50:49.78@0: StaMode::disablingCheck: all stopped
2022.04.02-07:50:49.78@0: HwDevice::hwIdle: curr 17/66 pend ...
2022.04.02-07:50:49.80@0: statechange: curr 17/66 pend 0/67 old 17/66 new 0/67
2022.04.02-07:50:49.80@0: StaMode::disablingOver


@syslog@
2022.04.02-07:50:40.09@0: [37169.063089] aesccm_check: replay detected
2022.04.02-07:50:41.10@0: [37170.081837] aesccm_check: replay detected
2022.04.02-07:50:41.10@0: [37170.081965] aesccm_check: replay detected
2022.04.02-07:50:41.48@0: [37170.458112] leg_rx_get: msdulen>space (2972,1548)
2022.04.02-07:50:41.48@0: [37170.458146] leg_rx_get: att 0x82300002 frags 0 proc 29 hw 31 fw 28 fill 28
2022.04.02-07:50:46.33@0: [37175.303006] w_node_maybelost: <CPE_AP_MAC> entered maybelost (3687536)
2022.04.02-07:50:46.43@0: [37175.406740] node_lostwakeup
2022.04.02-07:50:46.53@0: [37175.506737] ath_txtimeout: monitor reset
2022.04.02-07:50:46.53@0: [37175.506769] ath_reset
2022.04.02-07:50:46.53@0: [37175.510153] ar9888_rx_disable: timeout waiting for DMA rx stop
2022.04.02-07:50:46.53@0: [37175.510183] ch2whal: mode 10
2022.04.02-07:50:46.55@0: [37175.525731] ar9888_rx_setup: bufsize 1728, layout 180
2022.04.02-07:50:46.55@0: [37175.528126] ht: flush rx, seq 3462 start 3286 offs 3983
2022.04.02-07:50:46.55@0: [37175.528980] ht_handle_bar!!!
2022.04.02-07:50:46.56@0: [37175.536758] node_lostwakeup
2022.04.02-07:50:46.56@0: [37175.536802] leg_setup_key: idx 4, kt 6
2022.04.02-07:50:46.56@0: [37175.536831] ar9888_dst_asthash: m31_0 <CPE_AP_MAC>, m47_32 0xcaa6, keyid 0, mcast 0
2022.04.02-07:50:46.56@0: [37175.536853] ar9888_dst_astsetup: mac <CPE_AP_MAC> h 472 ast_idx 472 idx 4 mcast 0 keyid 0
2022.04.02-07:50:46.73@0: [37175.704450] w_node_maybelost: <CPE_AP_MAC> entered maybelost (3687576)
2022.04.02-07:50:46.83@0: [37175.806767] node_lostwakeup
2022.04.02-07:50:47.01@0: [37175.986757] node_lostwakeup
2022.04.02-07:50:47.15@0: [37176.125364] aesccm_check: replay detected
2022.04.02-07:50:47.19@0: [37176.166743] node_lostwakeup
2022.04.02-07:50:47.37@0: [37176.346733] node_lostwakeup
2022.04.02-07:50:47.55@0: [37176.526736] node_lostwakeup
2022.04.02-07:50:47.73@0: [37176.706735] node_lostwakeup
2022.04.02-07:50:47.91@0: [37176.886732] node_lostwakeup
2022.04.02-07:50:48.07@0: [37177.045056] aesccm_check: replay detected
2022.04.02-07:50:48.09@0: [37177.066724] node_lostwakeup
2022.04.02-07:50:48.27@0: [37177.246732] node_lostwakeup
2022.04.02-07:50:48.45@0: [37177.426731] node_lostwakeup
2022.04.02-07:50:48.63@0: [37177.606729] node_lostwakeup
2022.04.02-07:50:48.81@0: [37177.786729] node_lostwakeup
2022.04.02-07:50:48.99@0: [37177.966736] node_lostwakeup
2022.04.02-07:50:49.17@0: [37178.146773] node_lostwakeup
2022.04.02-07:50:49.19@0: [37178.171982] aesccm_check: replay detected
2022.04.02-07:50:49.35@0: [37178.326727] node_lostwakeup
2022.04.02-07:50:49.53@0: [37178.506727] node_lostwakeup
2022.04.02-07:50:49.71@0: [37178.686732] node_lostwakeup
2022.04.02-07:50:49.78@0: [37178.759844] w_node_maybelost: disconnect_timeout exceeded (3687882)
2022.04.02-07:50:49.78@0: [37178.759891] wl_update_wmm: update active_wmm
2022.04.02-07:50:49.78@0: [37178.759905] ath_reset
 
User avatar
rfc1149
Frequent Visitor
Frequent Visitor
Topic Author
Posts: 77
Joined: Fri May 15, 2020 4:26 am
Location: England

Re: RBMetalG-52SHPacn link constantly disconnecting under load and randomly

Fri Apr 08, 2022 12:47 pm

I now understand that the wireless daemon is dying (signal 17 = SIGCHLD) during these disconnects.

2022.04.02-07:51:10.31@0: /bndl/wireless/nova/bin/wireless
2022.04.02-07:51:10.31@0: --- signal=17 --------------------------------------------
2022.04.02-07:51:10.31@0:
2022.04.02-07:51:10.32@0: pc=0x77fd4654 at=0x00000001 v0=0x00000004 v1=0x00000000
2022.04.02-07:51:10.32@0: a0=0x0050ab00 a1=0x00000007 a2=0xffffffff a3=0x00000001
2022.04.02-07:51:10.32@0: t0=0x00000000 t1=0xfeffffff t2=0x7fffa2dc t3=0x0050f940
2022.04.02-07:51:10.32@0: t4=0x00000001 t5=0x10000000 t6=0x0050cf00 t7=0x77f5a000
2022.04.02-07:51:10.32@0: s0=0x7fffa2dc s1=0xffffffff s2=0x77faa000 s3=0x7fffa148
2022.04.02-07:51:10.32@0: s4=0x0000004f s5=0xffffffff s6=0x00000004 s7=0x00010001
2022.04.02-07:51:10.32@0: t8=0x004eb050 t9=0x77fd4640 k0=0x77faa6f0 k9=0x00000000
2022.04.02-07:51:10.32@0: gp=0x78001380 sp=0x7fffa0d8 s8=0x004c0000 ra=0x77f6dbfc
2022.04.02-07:51:10.32@0:
2022.04.02-07:51:10.32@0: maps:
2022.04.02-07:51:10.32@0: 00400000-004db000 r-xp 00000000 07:00 884        /bndl/wireless/nova/bin/wireless
2022.04.02-07:51:10.32@0: 004eb000-004ed000 rwxp 000db000 07:00 884        /bndl/wireless/nova/bin/wireless
2022.04.02-07:51:10.32@0: 004ed000-00533000 rwxp 00000000 00:00 0          [heap]
2022.04.02-07:51:10.32@0: 77d08000-77d2a000 r-xp 00000000 07:00 240        /lib/libuc++.so
2022.04.02-07:51:10.32@0: 77d2a000-77d2b000 rwxp 00012000 07:00 240        /lib/libuc++.so
2022.04.02-07:51:10.33@0: 77d2c000-77e9e000 r-xp 00000000 07:00 254        /lib/libcrypto.so.1.0.0
2022.04.02-07:51:10.33@0: 77e9e000-77eb0000 rwxp 00162000 07:00 254        /lib/libcrypto.so.1.0.0
2022.04.02-07:51:10.33@0: 77eb0000-77ed5000 r-xp 00000000 07:00 249        /lib/libeap.so
2022.04.02-07:51:10.33@0: 77ed5000-77ed6000 rwxp 00015000 07:00 249        /lib/libeap.so
2022.04.02-07:51:10.33@0: 77ed6000-77f06000 r-xp 00000000 07:00 243        /lib/libucrypto.so
2022.04.02-07:51:10.33@0: 77f06000-77f07000 rwxp 00030000 07:00 243        /lib/libucrypto.so
2022.04.02-07:51:10.33@0: 77f08000-77f1e000 r-xp 00000000 07:00 247        /lib/liburadius.so
2022.04.02-07:51:10.33@0: 77f1e000-77f1f000 rwxp 00006000 07:00 247        /lib/liburadius.so
2022.04.02-07:51:10.33@0: 77f20000-77f49000 r-xp 00000000 07:00 246        /lib/libubox.so
2022.04.02-07:51:10.33@0: 77f49000-77f4a000 rwxp 00019000 07:00 246        /lib/libubox.so
2022.04.02-07:51:10.33@0: 77f4a000-77faa000 r-xp 00000000 07:00 242        /lib/libumsg.so
2022.04.02-07:51:10.33@0: 77faa000-77fac000 rwxp 00060000 07:00 242        /lib/libumsg.so
2022.04.02-07:51:10.33@0: 77fac000-77fae000 rwxp 00000000 00:00 0
2022.04.02-07:51:10.33@0: 77fae000-77fea000 r-xp 00000000 07:00 239        /lib/libc.so
2022.04.02-07:51:10.33@0: 77ff4000-77ff5000 rwxs 00000000 00:0a 366        /ram/rtrace/control
2022.04.02-07:51:10.33@0: 77ff6000-77ff8000 r-xp 0302c000 00:06 392        /dev/jiffies
2022.04.02-07:51:10.33@0: 77ff9000-77ffa000 rwxp 0003b000 07:00 239        /lib/libc.so
2022.04.02-07:51:10.33@0: 77ffa000-77ffb000 rwxp 00000000 00:00 0
2022.04.02-07:51:10.33@0: 7ffda000-7fffb000 rwxp 00000000 00:00 0          [stack]
2022.04.02-07:51:10.34@0: 7fffb000-7fffc000 r-xp 00000000 00:00 0
2022.04.02-07:51:10.34@0:
2022.04.02-07:51:10.34@0: stack: 0x7fffb000 - 0x7fffa0d8
2022.04.02-07:51:10.34@0: <stack_data_removed>
2022.04.02-07:51:10.34@0: <stack_data_removed>
2022.04.02-07:51:10.35@0:
2022.04.02-07:51:10.35@0: code: 0x77fd4654
2022.04.02-07:51:10.35@0: 10e00002 00402025 00022023 8f998078 100099ea 00000000 00000000 3c1c0003
2022.04.02-07:51:10.35@0:
2022.04.02-07:51:10.35@0: backtrace: 0x77fd4654 0x77f6dbfc 0x77f6de54 0x77f7a3f0 0x0041b9e0 0x77fb798c

However, testing the Metal associated to a MikroTik 5GHz AP yields stable throughput so this is leading me to believe the issue is some vendor intolerance for the DWA0122's implementation. I guess the wider question now is what is the condition or bit field that is causing the issue. I don't think MikroTik can be faulted if it's vendor intolerance but the lack of a response from them and the inability to diagnose this issue is forever going to leave a question in my mind.

Who is online

Users browsing this forum: No registered users and 16 guests