We run a WISP made up of 10 towers. Nine of them are connected to the main tower (HQ) with PtP links in a hub-and-spoke topology, and each tower has a different number of sectors depending on the number of customers there. We currently serve a little over 2000 customers.
There is one gateway, a CCR1036, which hosts all the PPPoE servers. These servers run on 50 VLAN interfaces.
Authentication and billing are handled by a FreeRADIUS server, which is also at HQ. All CPEs connect to the gateway over PPPoE and are assigned a private IP address from the corresponding VLAN pool (10.x.x.x). The gateway NATs all customers through a single public IP address.
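For a rough sense of scale, here is a back-of-envelope sketch in Python using only the figures above. It assumes the NAT picks source ports from 1024-65535 and looks at the worst case where every customer talks to the same destination IP, port and protocol, since that is when translations behind one public IP compete for the same source ports. The function names are just for illustration, and I have not confirmed that this is related to our problem.

```python
# Figures taken from the post; "a little over 2000" is rounded down to 2000.
CUSTOMERS = 2000
VLANS = 50
PUBLIC_IPS = 1

# Assumption: the NAT may choose any source port in the range 1024-65535.
USABLE_PORTS = 65535 - 1024 + 1


def avg_sessions_per_vlan(customers: int, vlans: int) -> float:
    """Average number of PPPoE sessions per VLAN if customers were spread evenly."""
    return customers / vlans


def ports_per_customer(customers: int, public_ips: int,
                       usable_ports: int = USABLE_PORTS) -> int:
    """Fair share of NAT source ports per customer toward ONE destination
    (same destination IP, port and protocol), which is the worst case."""
    return (public_ips * usable_ports) // customers


if __name__ == "__main__":
    print("Average sessions per VLAN:", avg_sessions_per_vlan(CUSTOMERS, VLANS))
    print("Ports per customer toward one destination:",
          ports_per_customer(CUSTOMERS, PUBLIC_IPS))
```

Traffic to different destinations does not share this limit, so the second number only matters for destinations that nearly every customer uses at the same time.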
Everything runs smoothly most of the time, but occasionally the network becomes congested and customers complain. This does not affect one particular tower but the whole network, and the overall bandwidth usage drops. It usually doesn't last long, but when it reaches half an hour we get complaints.
We have disabled our own DNS server and are using the Google public DNS servers instead, because the monitoring system showed a spike in DNS queries just as the problem goes away.
As far as I know, OSPF wouldn't fix this, as there are no failover paths for routing to take.
So we are thinking of decentralizing PPPoE termination so that it is done at the towers instead of at the gateway.
What do you guys think? What else could we do to improve our network?
Please advise, and thanks in advance.