probably loop problem and pppoe server issue

Hi all,

we have a problem in one of our networks at the moment. We use a CCR1036 (6.43.1) as PPPoE server, which terminates ~1000 sessions. Behind the router we have several CRS switches, CCRs (mainly in bridged mode, no routing), DSLAMs, and sadly also ubnt radio networks. To separate customer traffic from mgmt traffic we use some VLANs.

Now to the problems:

A few days ago we unplugged one of the DSLAMs from its uplink switch and plugged it into a new uplink switch. We expected that our customers behind that DSLAM would have a service interruption of <1min. Of course the ~250 PPPoE sessions of the affected customers got lost at the CCR1036 for that minute. So far as expected… But the CCR started to kick ALL PPPoE sessions! Over a period of a few minutes all sessions were gone! After a reboot things started to work fine again. So the result was not a service interruption of <1min for 250 customers but one of some minutes for everyone behind that CCR, because of the reboot. In the logfile you see hundreds of entries like “PPPoE connection established from [caller MAC]” but no “[user] logged in” entries.
A few days later a branch of the ubnt network went down. => A bit less than 100 PPPoE sessions got lost. => Same behavior at the CCR; all sessions kicked until reboot. We did not have that issue in earlier days, but I can’t tell when exactly this behavior started because we try to avoid losing a bunch of sessions at one blow :slight_smile: Has anyone experienced the same behavior with the PPPoE server?


Now to the main problem:

A lot of MT devices behind the CCR (but also the CCR itself) write to the log:

[interface]: bridge port received packet with own address as source address, probably loop

Those messages occur several times a day at the CCR (but sometimes just once…) and sporadically at, I think, all MT devices (but also non-MT devices). I guess it depends on the amount of traffic through the respective bridges. In most cases the bridges contain VLAN interfaces. We observe the loop messages on all the different VLAN bridges. (Normally a device has at least a bridge for mgmt, which contains mgmt VLAN interfaces, and a bridge for customer traffic, which contains customer access VLAN interfaces. The PPPoE server runs on the customer access bridge.) Since we observe those messages at all points in the network, it’s hard to find the origin of the error. Is it a real loop or could it be another error? We are checking device by device for an error in the bridge configuration that causes the loop, but so far we haven’t found anything… The log messages say “probably” loop; what else could lead to bridges at any point in the network complaining about receiving frames with their own src MAC?

While researching this I got confused by different statements concerning the MAC addresses of bridges and their slave interfaces: In RouterOS, when I create a bridge and put interfaces in it, the bridge gets the same MAC address as the first physical interface to come up. But I have read that bridges should have a different MAC address than their slave devices. In one case someone also had loop error messages until he changed the MAC of the bridge. But I can’t imagine that the default behavior of the RouterOS bridge implementation leads to faulty configurations?! :open_mouth:
Also, can it be a problem if various bridges on one device have the same MAC? In my understanding it should not…
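
For anyone who wants to test whether the bridge MAC plays a role, here is a minimal sketch of pinning a bridge MAC manually in RouterOS. The bridge name bridge-customer and the MAC 02:00:00:00:00:01 are placeholders, not values from our network, and whether this affects the loop messages is untested:

```
/interface bridge
# bridge-customer and the MAC below are hypothetical placeholders;
# choose a MAC that is unique within the L2 domain
set [find name=bridge-customer] auto-mac=no admin-mac=02:00:00:00:00:01
```

With auto-mac=no the bridge keeps the configured admin-mac instead of taking over the MAC of one of its ports.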

Conclusion: I need help with how to handle the loop errors. Also, could the strange behavior of the PPPoE server be a symptom of the loop problem? Thank you in advance for any help and hints!

Michael

I want to give an update, maybe someone has to deal with the same issues…

We found the reason the CCR runs amok when it loses many PPPoE clients at once. The solution is: don’t use the NAT action masquerade if you have a lot of dynamic interfaces, because it forces the connection tracking tree to be rebuilt every time a dynamic interface appears or disappears. The masquerade action is designed for cases where the IP address you want to NAT to can change; if that address is static, you can use the action src-nat instead. For src-nat it’s not necessary to rebuild the tracking tree…

Details can be found here

I am having the same problem. Could you give an example of how you use src-nat as the action instead of masquerade? For example, for the PPPoE pool address and the WAN address in /ip firewall nat, which is where I currently use masquerade. My example:
/ip firewall nat
add action=masquerade chain=srcnat comment="PPPOE-4MBS " src-address=172.168.4.0/24
add action=masquerade chain=srcnat out-interface=ISP1
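
A minimal sketch of what the src-nat variant of these two rules could look like, assuming ISP1 has a static public address. The address 203.0.113.10 is a placeholder (documentation range), and adding out-interface=ISP1 to the first rule is my assumption so that only traffic leaving via the WAN gets translated:

```
/ip firewall nat
# 203.0.113.10 is a placeholder; replace it with the static address assigned to ISP1
add action=src-nat chain=srcnat comment="PPPOE-4MBS" out-interface=ISP1 src-address=172.168.4.0/24 to-addresses=203.0.113.10
add action=src-nat chain=srcnat out-interface=ISP1 to-addresses=203.0.113.10
```

This only works if the WAN address does not change. If ISP1 gets its address dynamically, to-addresses would go stale after a change, and masquerade is the action meant for that case.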