Hi guys, we are having a problem with our PPPoE connections: they disconnect every weekday (Monday to Friday) at around 02:15, and the clients come back online at around 02:35. For some reason weekends don't have the issue; connections stay stable every Saturday and Sunday. DHCP connections also never drop. This has been going on for 6 months now, and we have had a few people look into it with no results. Thank you in advance for any feedback.
The system is set up as follows:
Router 1 - Internet facing: CCR1072-1G-8S+ (V6.49.15). We have tried upgrading this to V7, but it gave a lot of issues after the upgrade.
Router 2 - Carries traffic between the client-facing and Internet-facing routers: CCR1072-1G-8S+ (V7.15.2). Recently upgraded to V7; it was also running 6.49.15 until around 2 weeks ago.
Client-facing routers:
Router 3 - CCR1072-1G-8S+ (V6.49.15), 450 PPPoE clients. We also tried upgrading it to V7, but it failed to come back online, so we had to revert it to V6.
Router 4 - CCR2116-12G-4S+ (V7.15.2), 200 PPPoE clients. Installed last week to test whether Router 3 was the issue, but this one is also having PPPoE drops at the exact same time.
First, there is a significant difference between DHCP and PPPoE. DHCP is only used to assign the IP configuration, so it does not have (or even need) any means to monitor the network path between the DHCP server and the DHCP client. PPPoE, on the other hand, does use keepalives to monitor the path state, so it is quite sensitive to packet loss. So the fact that DHCP clients “don’t drop” says nothing about the path between your routers and the clients, unless some of the DHCP clients use some kind of link monitoring and report no losses during that time. So as the first step I would set up such a monitor: run a continuous ping to the address of one of the DHCP clients, saving the results to a file, and look into it the next morning. Have you done this, and is this what you actually had in mind when you wrote that they don’t ever drop?
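If you have no monitoring tool at hand, the ping monitor could look roughly like the sketch below. It is only an illustration: it assumes a Linux box with the usual `ping` binary (`-c` for count, `-W` for the timeout in seconds), and the function names, the CSV layout and the address used in the example are made up for this post.

```python
import csv
import subprocess
import time
from datetime import datetime


def ping_once(host, timeout_s=1):
    """Send one ICMP echo using the Linux ping binary; True if a reply came back."""
    cmd = ["ping", "-c", "1", "-W", str(timeout_s), host]
    result = subprocess.run(cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return result.returncode == 0


def log_pings(host, path, interval_s=1.0, count=None, probe=ping_once):
    """Append one 'timestamp,ok' row per probe to a CSV file (count=None runs forever)."""
    sent = 0
    with open(path, "a", newline="") as f:
        writer = csv.writer(f)
        while count is None or sent < count:
            stamp = datetime.now().isoformat(timespec="seconds")
            writer.writerow([stamp, int(probe(host))])
            f.flush()  # keep the file usable even if the script is killed
            sent += 1
            time.sleep(interval_s)


def find_outages(rows):
    """Return (first_lost, last_lost, lost_count) for every run of consecutive losses."""
    outages = []
    run = []
    for stamp, ok in rows:
        if ok:
            if run:
                outages.append((run[0], run[-1], len(run)))
                run = []
        else:
            run.append(stamp)
    if run:  # the log ended while the target was still unreachable
        outages.append((run[0], run[-1], len(run)))
    return outages
```

You would start it in the evening with something like `log_pings("192.0.2.10", "ping.csv")` (the address is a placeholder for one of your DHCP clients) and feed the rows of the file to `find_outages` the next morning. If it reports a run of losses around 02:15, the DHCP clients are affected too and just don't notice.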
As for the PPPoE itself, sniffing is your friend here. Take a dedicated Mikrotik device that can handle a USB disk, or a laptop, whichever is easier for you, and let it sniff into a file on the Ethernet port to which the PPPoE client is attached, starting a few minutes before the earliest time the issue began in past days and running until a few minutes past the latest time it began. For sniffing at the CCR side, you will need to insert additional hardware between the CCR and the network, because sniffing is a low-priority process in RouterOS, so even if you filtered by the MAC address of the client, you might nevertheless get false positives (i.e. frames missing from the capture although they actually did pass through the interface). So a switch that supports selective traffic mirroring in hardware would be my choice:
E.g. hAP ac² (which has the QCA8337 switch chip) seems to support this (at least it does not reject the configuration), but it is a rarely used feature so test it before betting your reputation on it.
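For the client-side capture on a laptop, here is a small sketch of how the capture window could be derived from the onset times of past days and turned into a `tcpdump` command line. The assumptions are mine: a Linux laptop with `tcpdump` and the coreutils `timeout` command, onset times nowhere near midnight, and a capture restricted to the client's MAC address and the two PPPoE EtherTypes (0x8863 discovery, 0x8864 session). The helper names, interface name and MAC address are placeholders.

```python
from datetime import datetime, timedelta


def capture_window(past_starts, margin_min=5):
    """Return (start as 'HH:MM', duration in seconds) covering all past onset times.

    past_starts holds 'HH:MM' strings; the window opens margin_min minutes before
    the earliest one and closes margin_min minutes after the latest one.
    """
    times = [datetime.strptime(s, "%H:%M") for s in past_starts]
    start = min(times) - timedelta(minutes=margin_min)
    stop = max(times) + timedelta(minutes=margin_min)
    return start.strftime("%H:%M"), int((stop - start).total_seconds())


def tcpdump_argv(interface, out_file, client_mac, duration_s):
    """Build the argument list for a time-limited capture of one client's PPPoE frames."""
    bpf = f"ether host {client_mac} and (ether proto 0x8863 or ether proto 0x8864)"
    # timeout(1) terminates tcpdump after duration_s seconds; -w writes raw frames to a file.
    return ["timeout", str(duration_s), "tcpdump", "-i", interface, "-n", "-w", out_file, bpf]
```

With onsets of 02:13, 02:15 and 02:17, `capture_window` gives a start at 02:08 and a 14-minute capture. If you also want to see the reconnection, add the expected outage length to the duration. The resulting argument list can be handed to `subprocess.run` from a cron job or started by hand.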
By comparing the contents of the two captures, you should be able to tell whether:
there is traffic loss in the network between the CCR and the client,
an overload of the CCR makes it respond to the client’s keepalives too late (the reason for an overload at this time of day would then have to be investigated further), or
the CCR actively tears down the PPPoE connections (the least likely option, in my view).
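To make the comparison less of an eyeballing exercise, the three cases above can be told apart mechanically once the keepalives (LCP Echo-Request and Echo-Reply) and any PADT frames have been exported from both captures. The sketch below is only an illustration: the `Frame` record, the verdict strings and the one-second threshold are my own assumptions, and it expects the LCP identifiers to be unique within the analysed window. Delays are measured inside a single capture only, so the clocks of the two sniffing devices don't need to be in sync.

```python
from collections import namedtuple

# ts: capture timestamp in seconds, kind: "echo-request" / "echo-reply" / "padt",
# ident: LCP identifier (None for PADT), sender: "client" or "ccr"
Frame = namedtuple("Frame", "ts kind ident sender")


def index(frames, kind, sender):
    """Map LCP identifier -> timestamp for frames of one kind from one sender."""
    return {f.ident: f.ts for f in frames if f.kind == kind and f.sender == sender}


def classify_keepalives(client_cap, ccr_cap, max_delay_s=1.0):
    """Give a verdict for every keepalive the client sent, keyed by LCP identifier."""
    req_at_client = index(client_cap, "echo-request", "client")
    req_at_ccr = index(ccr_cap, "echo-request", "client")
    rep_at_ccr = index(ccr_cap, "echo-reply", "ccr")
    rep_at_client = index(client_cap, "echo-reply", "ccr")
    verdicts = {}
    for ident in req_at_client:
        if ident not in req_at_ccr:
            verdicts[ident] = "lost between client and CCR"
        elif ident not in rep_at_ccr:
            verdicts[ident] = "CCR did not reply"
        elif rep_at_ccr[ident] - req_at_ccr[ident] > max_delay_s:
            # both timestamps come from the CCR-side capture, so clock skew is irrelevant
            verdicts[ident] = "CCR replied late"
        elif ident not in rep_at_client:
            verdicts[ident] = "lost between CCR and client"
        else:
            verdicts[ident] = "ok"
    return verdicts


def ccr_tore_down(ccr_cap, verdicts):
    """True if the CCR sent a PADT although every keepalive was answered in time."""
    padt = any(f.kind == "padt" and f.sender == "ccr" for f in ccr_cap)
    return padt and all(v == "ok" for v in verdicts.values())
```

Verdicts of the "lost between" kind point to the network, "CCR did not reply" or "CCR replied late" point to the CCR being overloaded, and a PADT from the CCR with all keepalives fine points to an active teardown.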