Any answers?
No - more questions for you though.
What version of ROS are you using?
What model clients are you using?
Are you using local forwarding or not?
Are all the devices on the same L2/broadcast domain?
What make and model switches are you using?
What model is the CAP controller?
I had a lot of trouble on one site with CAPs disconnecting - it drove me bonkers until I replaced the old TP-Link switch and a handful of other switches with MikroTik devices, and the problem went away. One thing I noticed was that one segment of CAPs would drop out regularly. I determined it was the fridge motor on the same power circuit causing the junk switch to reset or play up - replacing the switch stopped that issue.
As for scaling your CAPsMAN controller CPU, I have one site using a hEX with 15 CAPs connected - a mix of wAPAC, hAPAC and cAPAC - using non-local forwarding. It is very stable and works well with over 100 devices at a time (iPads and Chromebooks). The WAN is 50Mbit/50Mbit so the actual traffic is low, but at times I push out new apps (the iPads are managed) or do app updates and the common bridge throughput goes over 500Mbit as the apps come off the local macOS cache. No single core of the hEX maxes out. I have taken a belt and braces approach with this and also made sure the hEX, the switch and all the CAPs are using flow control on ethernet. There are a surprising number of ethernet pauses on the hEX, but the pauses aren't regular; they seem to happen in bursts, which would indicate the CPU might get stalled for some reason now and then. Even so, the CAPsMAN system is very stable. I am using CRS328 switches.
I have installed another CAPsMAN system recently with 17 cAPACs; it uses non-local forwarding with an RB4011 connected to a CRS328 over 10Gbit fibre. The CPUs are catatonic on this system, but for the small amount of extra money the RB4011 gives plenty of headroom for the future. Again, all the important interfaces have ethernet flow control enabled.
I get the impression that when CAPsMAN is used on an L2 LAN, the encapsulated traffic and control run over a UDP or UDP-like transport, and there is a low tolerance for any of these packets getting lost, including any broadcasts that CAPsMAN does. I have not tried CAPsMAN over a routed (L3) network. Some others I know have mentioned that they run CAPsMAN controllers over L3 from a cloud VPS to CAPs on a site and that it can be intermittently unreliable, which suggests there is a low state timeout threshold in the CAPsMAN system that can initiate disconnections.
Aha, from the CAPsMAN manual: "If the CAPsMAN or CAP gets disconnected from the network, the loss of connection between CAP and CAPsMAN will be detected in approximately 10-20 seconds."
So you don't have much time before a cascade of CAPs drops off if a segment on your LAN drops out due to a switch reboot, or your TCP WAN connection fails. ADSL/VDSL retraining, cable network dropouts, ISP core network faults and routing problems somewhere out past the WAN can easily last more than 10-20 seconds. Even obscure things like broadcast rate limiting and storm suppression in a switch could have you cursing and swearing at CAPsMAN for hours.
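To put rough numbers on that, here is a minimal Python sketch that compares an outage length against the 10-20 second detection window quoted from the manual. The function name, the three outcome labels and the example outage durations are my own assumptions for illustration, not measurements and not anything from the CAPsMAN documentation.

```python
# Detection window as quoted from the CAPsMAN manual: approximately 10-20 seconds.
DETECT_MIN_S = 10
DETECT_MAX_S = 20


def cap_outcome(outage_s: float) -> str:
    """Classify an outage by whether a CAP would be expected to drop.

    Hypothetical helper: the three labels are an assumption, chosen only
    to show where an outage falls relative to the quoted window.
    """
    if outage_s < DETECT_MIN_S:
        return "likely survives"
    if outage_s <= DETECT_MAX_S:
        return "may drop"
    return "likely drops"


# Hypothetical outage durations in seconds, for illustration only.
examples = {
    "brief link flap": 3,
    "short storm-control burst": 15,
    "switch reboot": 45,
    "DSL retrain": 60,
}

for name, secs in examples.items():
    print(f"{name}: {secs}s -> {cap_outcome(secs)}")
```

Anything that keeps the controller unreachable for longer than the upper end of the window should be assumed to take the CAPs down with it.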
*Normis*: It would be good if the CAPsMAN heartbeat/keepalive timeout was a value we could adjust as needed - at our own peril of course.