PPPoE connections dropping at random intervals

HELP! I can’t figure this out, and I am going CRAZY!!!


We have a small WISP setup with the following:

  • PPPoE router is a MK RB1100AHx2 v5.24
  • Multiple APs that are UBNT gear
  • One PtP link using Tranzeo gear
  • MOST residential clients are UBNT radios in router mode, with PPPoE enabled
  • A few larger clients are set up with UBNT radios in bridge mode, with PPPoE on the CX router itself
  • PPPoE authenticates to a RADIUS server
  • REMOTE site has a few APs on a different ISP connection, but authenticates over a VPN
  • Aircontrol is set up on a VLAN, and we have management ports enabled on that VLAN on the UBNT gear

REMOTE SITE

  • RB 450 with PPPoE server enabled on one port on a VLAN, and this site seems to be all fine
  • PPPoE stays up all the time, no issues really
  • No Aircontrol setup here yet, but I ruled out Aircontrol by turning off the Aircontrol server and the clients still dropped

MAIN SITE
On the RB1100AHx2 v5.24 (I’ve tried the newer v6 ROS; it didn’t help and actually made things worse, so I have since downgraded. I have also purchased TWO CCR1036s, and they basically do the same thing)

  • Each AP is plugged into the RB on its own interface
  • RADIUS Manager server is plugged into its own interface (Freeradius, DMAsoftlab Radius Manager)
  • PPPoE server is set up on each of the AP interfaces, each one has its own service name, IP pool, etc.


PROBLEM:
On the MK, the PPPoE connections drop all at once, then come back seconds later. Usually it’s JUST the UBNT radios that are in router mode; HOWEVER, two or three times a day ALL radios go offline. The MK logs just say USERNAME logged out, XXXX XXXXX XXXXX XXX XX (the Xs are random numbers).
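
One way to tell the "a few sessions dropped" events apart from the "everything dropped at once" events is to bucket the logout entries by time. Below is a minimal Python sketch. It assumes the log has been exported as text lines in the hypothetical form `2013-05-01 14:03:22 alice logged out`; the real MK log layout will differ, so `parse_line` would need adjusting. The function names and the 10-second window are illustrative choices, not anything taken from RouterOS.

```python
from datetime import datetime, timedelta


def parse_line(line):
    # Hypothetical log format: "YYYY-MM-DD HH:MM:SS username logged out"
    date, time, user = line.split()[:3]
    return datetime.strptime(f"{date} {time}", "%Y-%m-%d %H:%M:%S"), user


def cluster_logouts(events, window_seconds=10):
    """Group (timestamp, username) events into clusters.

    An event joins the current cluster when it occurs within
    window_seconds of the previous event; otherwise it starts a new one.
    """
    window = timedelta(seconds=window_seconds)
    clusters = []
    for ts, user in sorted(events):
        if clusters and ts - clusters[-1][-1][0] <= window:
            clusters[-1].append((ts, user))
        else:
            clusters.append([(ts, user)])
    return clusters


# Made-up sample data, for illustration only.
sample = [
    "2013-05-01 14:03:22 alice logged out",
    "2013-05-01 14:03:23 bob logged out",
    "2013-05-01 14:03:25 carol logged out",
    "2013-05-01 16:40:00 dave logged out",
]
clusters = cluster_logouts([parse_line(line) for line in sample])
for c in clusters:
    print(c[0][0], len(c), "sessions:", [user for _, user in c])
```

Comparing the size of each cluster against the number of sessions per AP interface would show whether a mass drop is confined to one AP or hits the whole router at once.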

I have tried various MTU settings in the PPPoE server: 1500, 1492, and currently 1480
I have tried various MTU settings on the client devices also
I have powered down the UBNT Aircontrol server thinking it MIGHT be causing the disconnects
The BRIDGED clients seem to stay up more often, yet SOME of them always drop at the same time as well
The ONE client on the Tranzeo link has been up solid for several days; however, it WAS doing the same thing last week, and it’s magically working OK now???
The logs on the RADIUS server don’t show anything wrong, and because the remote site uses this same RADIUS server without issue, I can’t see it being that, but maybe?
I tried enabling RADIUS logging, and it doesn’t show anything wrong; there is authentication and accounting traffic
I have tried changing the accounting interval in the MK from 1 min to 2 min to 10 min, with no effect
I have increased the RADIUS timeout to 5000 ms, but realistically I can’t see it being RADIUS at this point, as the other site doesn’t have issues

I’m not sure what the MK interface MTU should be set to on the interfaces plugged into the APs (it’s currently 1500, and the L2 MTU is 1592); those are the default settings
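
For reference, the usual PPPoE arithmetic behind the 1492 figure can be sketched in a few lines of Python. PPPoE adds a 6-byte header plus a 2-byte PPP protocol ID inside each Ethernet frame, so a 1500-byte Ethernet MTU leaves 1492 bytes for the PPP payload. The function names are only illustrative, and the MSS calculation assumes plain IPv4 and TCP headers without options.

```python
PPPOE_HEADER_BYTES = 6      # PPPoE header carried inside the Ethernet payload
PPP_PROTOCOL_ID_BYTES = 2   # PPP protocol field that follows the PPPoE header


def pppoe_mtu(ethernet_mtu=1500):
    """Largest IP packet that fits in one PPPoE frame on this Ethernet MTU."""
    return ethernet_mtu - PPPOE_HEADER_BYTES - PPP_PROTOCOL_ID_BYTES


def tcp_mss(mtu, ip_header=20, tcp_header=20):
    """TCP MSS for a given MTU, assuming IPv4 and TCP headers without options."""
    return mtu - ip_header - tcp_header


print(pppoe_mtu(1500))   # 1492
print(tcp_mss(1492))     # 1452
print(tcp_mss(1480))     # 1440
```

A PPPoE MTU of 1480 is therefore more conservative than it needs to be on a 1500-byte Ethernet path, but it should not by itself cause sessions to drop.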

I’m so burnt out, confused and mad I don’t know where to go from here…

If you have an extra available, it might be good to try a 400G series router in place of the RB1100AHx2.

I’ve had a lot of issues with my RB1100AHx2. MikroTik support says that the complex config I was running on it was responsible for the several problems on 6.0rc*. I had to move to 6.0 because I was getting kernel panics on 5.x. My interpretation of the responses from MikroTik support is that the dual core in this model needs help that is only in 6.0. But 6.0 has its own issues for now.

So, my theory is, try something with a single CPU running the same RouterOS as the working site. See if things smooth out.

I would like to try the same type of router you have at the currently stable location, but it may not have enough horsepower for your load. If the same model router doesn’t smooth things out, swap the working router to the non-working site and see if the problem moves with the router or stays with the location.

This type of problem could be caused by just enough occasional packet loss in the network between the clients and the RB1100AHx2 to disconnect a few of the sessions and sometimes all of them. The PPPoE client on the UBNT radios may be more sensitive to temporary issues in the middle than the PPPoE client on the customers’ routers. That could explain why the UBNT radios in router mode tend to have more disconnects.
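
To get a feel for how little loss it takes, here is a rough Python sketch. It assumes a session is torn down after a fixed number of consecutive keepalives go unanswered, and that each keepalive is lost independently with the same probability. The loss rates, the 3-miss threshold, the 10-second interval and the 100 sessions are made-up illustration values, not measured from this network or taken from RouterOS defaults.

```python
def p_keepalive_timeout(loss_rate, missed_to_drop):
    """Probability that missed_to_drop consecutive keepalives are all lost,
    assuming independent losses with probability loss_rate each."""
    return loss_rate ** missed_to_drop


def expected_drops_per_day(loss_rate, missed_to_drop, interval_s, sessions):
    """Rough expected session drops per day across all sessions.

    Treats every keepalive as the possible start of a fatal run of losses,
    which is an approximation that is reasonable for small loss rates.
    """
    keepalives_per_day = 86400 / interval_s
    return sessions * keepalives_per_day * p_keepalive_timeout(loss_rate, missed_to_drop)


# Hypothetical numbers: 100 sessions, keepalive every 10 s, drop after 3 misses.
print(expected_drops_per_day(0.05, 3, 10, 100))  # roughly 108 drops per day
print(expected_drops_per_day(0.01, 3, 10, 100))  # roughly 0.9 drops per day
```

Under these assumptions, going from 1% to 5% loss moves the network from about one drop a day to over a hundred. Independent loss like this would scatter the drops across sessions, though; many sessions dropping in the same second would point more toward a short burst of loss on a shared link.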

I do not see how the problem you are describing could be caused by RADIUS. I also don’t think you are dealing with an MTU issue.

Thanks, I will try a 450, I have one here collecting dust. So funny that a $60 product may fix the problem with a $600 one. I’ll post my results.