CAPsMAN Forcing All Access Points To Rejoin

Hi Everyone

We have an issue at the moment where, at roughly the same time each day, CAPsMAN kicks most access points off and forces them to re-register.

It looks like this

19:27:24 caps,info 50:1A:C5:E6:56:E1@MikroTik-823 disconnected, interface disabled
19:27:24 caps,info A8:DB:03:11:740:CC@MikroTik-823 disconnected, interface disabled
19:27:24 caps,info CC:2D:B7:9E:12:D7@MikroTik-638 disconnected, interface disabled
19:27:30 caps,info [::ffff:10.40.1.4:509,Join,[B6:69:F4:as:CB:C1]] joined, provides radio(s): B8:69:F4:6A:CB:C4
19:27:31 caps,info [::ffff:10.40.1.19:495,Join,[B6:69:F4:57:9E:42]] joined, provides radio(s): B8:69:F4:84:9E:45
19:27:31 caps,info [::ffff:10.40.1.9:464,Join,[B6:69:F4:51:C5:DB]] joined, provides radio(s): B8:69:F4:6A:C5:DE

It does this for every AP inside CAPsMAN, and all 4 of our wireless controllers seem to do it.

We initially thought it was the APs trying to force an update, but we turned the update policy off and are still facing the same issue.
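(For anyone else ruling this out: assuming the built-in CAPsMAN manager on RouterOS v6, the upgrade policy can be checked and disabled from the CLI — exact output varies by version.)

```
# show current manager settings, including upgrade-policy
/caps-man manager print
# stop CAPsMAN pushing firmware to CAPs on join
/caps-man manager set upgrade-policy=none
```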

Any answers?

ROS version?
How does the CAP see the CAPsMAN? Through L2? L3? CAPsMAN name?
Equipment used?

Any other events in log immediately before disconnects (can be seemingly unrelated)?
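You can pull most of this straight off the controller; a sketch assuming RouterOS v6 CAPsMAN (the log filter syntax may need adjusting on older versions):

```
/system resource print            # RouterOS version, board, uptime
/caps-man remote-cap print        # how each CAP is connected (address, identity, state)
/log print where topics~"caps"    # CAPsMAN-related log entries around the disconnects
```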

We had this happen with a high number of connected CAPs, static interfaces and generally high load from heavy traffic (firewalling, NAT, etc.). We asked support whether CAPsMAN is multi-threaded or single-threaded, and they recommended using a CHR with more CPU power to avoid getting into unstable situations. However, we substituted the CCR1009 with a CCR1036, which uses the same CPU frequency, and the situation improved. We have not run into similar situations since; I guess they may also have tweaked CAPsMAN a little since 6.42.9.

Which router are you using? How is the general system load (/tool profile or /system resource cpu print)? Are you using an older or current RouterOS?
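For the load question, something like this (v6 syntax; /tool profile runs interactively until you stop it):

```
/system resource cpu print   # per-core load snapshot
/tool profile cpu=all        # live per-process CPU usage; watch the "wireless" and "management" entries
```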

What is considered a high amount of caps?

No - more questions for you though.

What version of ROS are you using?
What model clients are you using?
Are you using local forwarding or not?
Are all the devices on the same L2/broadcast domain?
What make and model switches are you using?
What model is the CAP controller?
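Most of the forwarding/topology answers can be read off the controller as well; a sketch, assuming v6 CAPsMAN:

```
/caps-man datapath print detail     # shows local-forwarding=yes/no per datapath
/caps-man configuration print detail
/caps-man remote-cap print          # CAP models and how they reach the controller
```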

I had a lot of trouble on one site with CAPs disconnecting. It drove me bonkers until I replaced the old TP-Link switch and a handful of other switches with MikroTik devices, and the problem went away. One thing I noticed was that one segment of CAPs would drop out regularly; I determined it was the fridge motor on the same power circuit causing the junk switch to reset or play up, and replacing it stopped that issue.

As for scaling your CAPsMAN controller CPU, I have one site using a hEX with 15 CAPs connected — a mix of wAP ac, hAP ac and cAP ac — using non-local forwarding. It is very stable and works well with over 100 client devices at a time (iPads and Chromebooks). The WAN is 50Mbit/50Mbit so the actual traffic is low, but at times I push out new apps (the iPads are managed) or do app updates, and the common bridge throughput goes over 500Mbit as the apps come off the local macOS cache. No single core of the hEX maxes out. I have taken a belt-and-braces approach and also made sure the hEX, the switch and all the CAPs are using flow control on Ethernet. There are a surprising number of Ethernet pauses on the hEX, but they aren't regular; they seem to happen in bursts, which would indicate the CPU might get stalled for some reason now and then. Even so, the CAPsMAN system is very stable. I am using CRS328 switches.

I have installed another CAPsMAN system recently with 17 cAP acs, using non-local forwarding with an RB4011 connected to a CRS328 over 10Gbit fibre. The CPUs are catatonic on this system, but for the small amount of extra money the RB4011 gives plenty of headroom for the future. Again, all the important interfaces have Ethernet flow control enabled.
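For reference, flow control is set per Ethernet port; a sketch, assuming ether1 is the relevant uplink:

```
/interface ethernet set [find default-name=ether1] rx-flow-control=on tx-flow-control=on
# pause-frame counters show up in the interface stats
/interface ethernet print stats
```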

I get the impression that when CAPsMAN is used on an L2 LAN, the encapsulated traffic and control run over a UDP or UDP-like transport, and there is a low tolerance for any of these packets getting lost, including any broadcasts CAPsMAN does. I have not tried CAPsMAN over a routed (L3) network. Others I know have mentioned that they run CAPsMAN controllers over L3 from a cloud VPS to CAPs on a site, and that it can be intermittently unreliable at times, which suggests there is a low state-timeout threshold in the CAPsMAN system that can initiate disconnections.

Aha, from the CAPSMAN manual: “If the CAPsMAN or CAP gets disconnected from the network, the loss of connection between CAP and CAPsMAN will be detected in approximately 10-20 seconds.”

So you don't have much time before a cascade of CAPs drops off if a segment of your LAN goes down due to a switch reboot or your WAN connection fails. ADSL/VDSL retraining, cable network dropouts, ISP core network faults and routing problems somewhere out past the WAN can easily last more than 10-20 seconds. Even obscure things like broadcast rate limiting and storm suppression in a switch could have you cursing and swearing at CAPsMAN for hours.

Normis: it would be good if this CAPsMAN heartbeat/keepalive timeout were a value we could adjust as needed — at our own peril, of course.

In our case: ~200 access points, ~1000 interfaces, and we are using CAPsMAN forwarding, not local forwarding.

Wow. Well, if CAPsMAN is mostly single-core and thus CPU-bound, you're best off getting the best single-core performance and using the 10Gbit interface. An x86 box or an RB4011 would give better performance and be vastly cheaper than a CCR1036 just for CAPsMAN.

Hello,

The short version: CAPsMAN forwarding is full of bugs. The "fun" begins when you have ~1000 clients on the CCR controller doing some traffic (500 Mbps / 30000 sessions)…
Long version: no time to tell you everything today.

So, let's make it short:

  • The CCR will crash in certain high-load situations
  • The CCR will disconnect all CAPs if you configure settings which make CAPsMAN reprovision the CAP devices
    => They will disconnect and reconnect

On cAP ac, clients disconnect because the 5 GHz interface is unstable. Some say the driver module crashes and reloads.

I don't know whether this will solve your problems. By the way, do you also see thousands of "DHCP offer no lease" messages in your logs?
=> Imagine you sit in front of one cAP and watch what happens.
=> Your notebook connects to the cAP but it doesn't get an IP.
=> You disconnect and reconnect, but your notebook still doesn't get an IP.
=> You look at the DHCP server on the CCR and you can see the offer timer count down from 30 seconds to 0, then restart at 30 seconds again. It stays on "offered" but nothing happens.
=> Then you hit the provisioning button for the cAP and, guess what, your notebook gets an IP address immediately afterwards.
=> Now imagine this problem across 300 cAP devices running on the CCR/CAPsMAN. Sometimes the clients work without any problems for some hours until there is more load on the CCR. Starting at ~10:30 am each day I see "DHCP offer no lease" in the logs. 6.46beta59 doesn't help. No one knows why, and I did everything this forum asked me to do.
=> The solution for me: Drop CAPSMAN, use single access points with local management and local forwarding. Sad? Of course.
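For anyone wanting a halfway step before dropping CAPsMAN entirely: local forwarding is a per-datapath switch, so client traffic stays on the CAP and only management rides the tunnel to the controller. A sketch, assuming an existing datapath named datapath1:

```
/caps-man datapath set [find name=datapath1] local-forwarding=yes
# CAPs must be reprovisioned for the change to take effect
/caps-man remote-cap provision [find]
```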

With CAPsMAN-based forwarding enabled, the load was distributed to all cores on my CCR1036 devices.