Part of this is indeed our architecture / configuration, but the deeper and more worrying issue is that in some cases one has almost zero visibility into when the architecture or configuration is causing a particular problem. Had we such tools at our disposal (or were the wiki more useful in some areas), the noise on this forum attributed to "architecture" or "configuration" would die down considerably.
Let's take DNS as our next "mystery headache" example. I'm going to stop going on about all this in this thread and open a more general one on "Lack of tools and visibility into performance issues", but it serves as a useful illustration:
Screen Shot 2020-08-17 at 07.46.46.png
Notice anything odd? Not really, right? Tool -> Profile didn't either, with DNS sitting at less than 1% CPU usage. Yet we had hundreds of customer tickets along the lines of "some sites don't load" and "DNS doesn't resolve", i.e. genuinely poor service. We had been trying to pin down the issue for quite some time. Packet traces only showed that DNS responses sometimes never arrive at all and sometimes come back as SERVFAIL, but nothing that signalled a performance problem on the CCRs or in the network topology. This is our architecture (in terms of configuration, we tried all sorts of changes to the DNS settings on all the boxes involved):
dns.png
The arrows show the upstream resolver for each device. Each customer hAP also acts as a caching DNS server for the devices the customer connects to it. Technically this is quite an efficient setup: the central CCR only needs to query upstream once (within TTL limits, of course) and caches the result for all the network-level CCRs, and each network-level CCR does the same for the up to 1,500 CPEs connected to it.
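For illustration, one way to spot-check where in this chain answers go missing is to query each tier directly for the same name from outside the routers and compare return codes and response times. A minimal sketch (assuming the Python dnspython package; the tier addresses and test name below are placeholders, not our real ones):

[code]
import time

import dns.exception
import dns.message
import dns.query
import dns.rcode

NAME = "example.com"   # placeholder test name

# Placeholder addresses for each tier of the caching chain
TIERS = {
    "customer hAP":      "192.168.88.1",
    "network-level CCR": "10.10.0.1",
    "central CCR":       "10.0.0.1",
    "upstream resolver": "8.8.8.8",
}

for label, server in TIERS.items():
    query = dns.message.make_query(NAME, "A")
    start = time.monotonic()
    try:
        resp = dns.query.udp(query, server, timeout=2)
        rtt_ms = (time.monotonic() - start) * 1000
        print(f"{label:18} {server:15} {dns.rcode.to_text(resp.rcode()):8} {rtt_ms:6.1f} ms")
    except dns.exception.Timeout:
        print(f"{label:18} {server:15} TIMEOUT (no response within 2 s)")
[/code]

Run from a box that can reach all four tiers, a SERVFAIL or timeout at one tier but not the next immediately narrows down which resolver is misbehaving.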
Fed up with coming up with nothing, the final proof was to set up a caching bind9 server on a DigitalOcean droplet as a test. We noticed an immediate 15-20% uptick in traffic on two network-level CCRs which had this DNS server set as their upstream source.
We then set up a bind9 server next to the 1072 at the DC and pointed all 20 network-level CCRs to it, noting a jump in aggregate peak-time traffic from 8.3Gbps to just over 9.5Gbps. Customers now report they are fine.
Back to my first graph: there is zero positive impact on CPU from removing DNS duties from the main CCR. The change was made mid-day on Friday.
My BIG question to you now is: can you predict, or failing that, observe directly, when the network-level CCRs will "give up" on DNS, just as they "give up" on PPPoE? Yes, we can then point the customer hAPs directly at the bind9 server, but do we need to wait for customers to scream and trash us on social media before we notice? And what will be the next feature the CCRs "give up" on?
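In the absence of any built-in counter or alert for this, about the best one can do today is probe the resolvers from outside and watch the failure rate. A rough sketch of such a monitor (again assuming dnspython; the resolver addresses, test names, threshold and interval are all placeholders):

[code]
import time

import dns.exception
import dns.message
import dns.query
import dns.rcode

# Placeholder addresses of the network-level CCRs' resolvers
CCR_RESOLVERS = ["10.10.1.1", "10.10.2.1", "10.10.3.1"]
# Placeholder names to resolve each cycle
TEST_NAMES = ["example.com", "example.net", "example.org", "wikipedia.org", "mikrotik.com"]
FAIL_THRESHOLD = 0.2   # alert if more than 20% of probes fail in one cycle
INTERVAL = 60          # seconds between cycles

def probe(server, name):
    """Return True if the resolver answers NOERROR within 2 seconds."""
    try:
        resp = dns.query.udp(dns.message.make_query(name, "A"), server, timeout=2)
        return resp.rcode() == dns.rcode.NOERROR
    except dns.exception.DNSException:
        return False   # timeout, network error, malformed reply, ...

while True:
    for server in CCR_RESOLVERS:
        failures = sum(1 for n in TEST_NAMES if not probe(server, n))
        rate = failures / len(TEST_NAMES)
        if rate > FAIL_THRESHOLD:
            # In practice: push to a graphing/alerting system instead of printing
            print(f"ALERT: resolver {server} failed {rate:.0%} of probes this cycle")
    time.sleep(INTERVAL)
[/code]

But that is exactly the point: it should not take an external script polling the routers to discover that a core service on them has quietly given up.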
We need far, far greater visibility into issues before certain problems can fairly be blamed on "architecture" or "configuration".