WISP with PPPoE and VLANs

Hello,
We run a WISP made up of 10 towers. Nine are connected to the main tower (HQ) with PtP links in a hub-and-spoke topology, and each tower has a different number of sectors depending on how many customers it serves. We currently serve a little over 2000 customers.
There is one gateway, a CCR1036, which hosts all the PPPoE servers on 50 VLAN interfaces.
Authentication and billing are handled by a FreeRADIUS server, which is also at HQ. All CPEs use PPPoE to connect to the gateway and are assigned a private IP address from the corresponding VLAN pool (10.x.x.x). The gateway NATs all customers behind a single public IP address.
Everything runs smoothly most of the time, but occasionally the network becomes congested and customers complain. This does not affect one particular tower but the whole network, and overall bandwidth usage drops. It usually doesn't last long, but when it reaches half an hour we get complaints.

We have disabled our own DNS server and are using Google Public DNS instead, because the monitoring system showed a spike in DNS queries just as the problem goes away.

As far as I know OSPF wouldn’t fix this as there are no failover paths for routing to take.

So we were thinking of decentralizing the PPPoE termination so it is done in the towers instead of the GW.
What do you guys think? What else could we do to improve our network?
Please advise and thanks in advance

PPPoE termination on the tower routers would be the way to go, IMO.

IMO, your problem may be that one public IP address is not enough.

At home I see about 60 TCP connections with fairly light Internet usage. That rises to nearly 200 when several devices are in use.

If you have only one public IP address for 2000 customers, and a TCP port is a 16-bit number with 65536 possible values, that leaves roughly 32 TCP connections per customer.

You should definitely consider using more than one public IP address; at peak times there may not be enough ports available for all connections.
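
To put rough numbers on this, here is a minimal Python sketch of the port-budget arithmetic. It follows the simplification used above, that each public IP offers at most 65536 TCP source ports shared by all customers; real NAT implementations may be able to reuse a port toward different destinations, so treat the result as a conservative estimate. The function names are made up for illustration, and the figure of 200 connections per customer is only the peak home-usage number mentioned above, not a measurement from this network.

```python
import math

# 16-bit TCP port space per public IP (simplifying assumption from the post above).
PORTS_PER_IP = 65536


def ports_per_customer(customers: int, public_ips: int = 1) -> int:
    """Average number of NAT source ports available to each customer."""
    return (PORTS_PER_IP * public_ips) // customers


def public_ips_needed(customers: int, connections_per_customer: int) -> int:
    """Smallest number of public IPs that covers the given per-customer budget."""
    return math.ceil(customers * connections_per_customer / PORTS_PER_IP)


if __name__ == "__main__":
    # 2000 customers behind a single public IP
    print(ports_per_customer(2000))
    # Public IPs needed if every customer peaks at 200 connections at once (assumed)
    print(public_ips_needed(2000, 200))
```

With 2000 customers the first call gives 32 ports per customer on one IP, and the second suggests 7 public IPs if every customer really opened 200 connections at the same time, which is a worst case.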

Yes, I guess that's a step we will take. We will create a VLAN for every sector and bridge it to the RADIUS server, and NAT will still be done at the GW (centralized).
Thank you

Good point, we will work on assigning more public IP addresses right away.
Thank you

It seems our CSS106-1G-4P-1S switches are auto-negotiating down to 100M, which might be the problem.
http://forum.mikrotik.com/t/css106-1g-4p-1s-autonegotiation-drops-to-100m/118059/1