Urgently need help with strange forwarding issue

We’ve got an issue popping up that’s causing a lot of grief and is seemingly unsolvable.

We’ve got VDSL2 modems connected to a DSLAM, and the DSLAM connects to an RB3011 (we have also used a hEX).
This is a common setup that we’ve installed across almost a hundred DSLAMs, and the config is the same everywhere.

VDSL2 --> DSLAM --> Mikrotik Customer Router --> MikroTik ISP Routers --> Internet

Two sites share a common problem: sporadically, just one (or a handful) of the VDSL2 modems can ping the router, and the router can ping back, but the modems cannot ping anything beyond the RB3011. The packets hit the MikroTik customer router, but there’s never a reply. That is, until I forcefully change the IP address assigned to that modem on the MikroTik customer router.
For example, if the modem was assigned 100.64.1.2, I just force it to 100.64.1.3 and it all magically works, without requiring a re-sync or anything else. Then it stops working again after a few days, so I change it back to 100.64.1.2 (or anything else) and hey, it’s all working again.

I still don’t know whether the problem is with the DSLAM or with the router. All signs point to the MikroTik router: if it were something like an IP conflict or a corrupted ARP table in the DSLAM, the modem and the router wouldn’t be able to ping each other on the local IP subnet, but they can. Packets going any further are still just local MAC-to-MAC communication, so there shouldn’t be any difference from the DSLAM’s perspective, as it only sees the IP/MAC of the local devices.
Whereas the router is, well, doing the routing.

I’ve gone through the firewall and NAT rules with a fine-tooth comb and have literally disabled every single drop/reject rule. Everything going out to the internet is NAT’d, so the ISP routers see no discernible difference regardless of which specific IP any VDSL modem gets.
I can verify the correct MAC address is mapped to the IP address in the ARP cache.
There are no bridge filter rules.
The only thing I can think of is some kind of weird IP connection bug that prevents return traffic from going back to the VDSL modem. But I’m stumped. There is nothing apparent here, but I’m open to having a look at anything, as this is now urgent to resolve and nothing makes sense apart from random bugs or a hardware issue (we have entirely replaced all hardware twice now).

Have you checked the firewall connection tracking table for any icmp entries when the issue occurs?
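A minimal sketch of that check, run on the 3011 while the affected modem is pinging. The address 100.64.1.2 is only the example from the first post, so substitute the modem’s real address.

```
# List all tracked ICMP flows while the affected modem is pinging
/ip firewall connection print detail where protocol=icmp

# Narrow the list to the affected modem (example address from the first post).
# The ~ operator is a regex match, so this also matches e.g. 100.64.1.20
/ip firewall connection print detail where protocol=icmp src-address~"100.64.1.2"
```

If an entry exists, comparing its original and reply addresses should show whether the NAT translation looks sane. If there is no entry at all, the echo request was probably never tracked in the first place.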

The first thing I’d do would be sniffing on both routers simultaneously while the issue exists - ping, say, 8.8.8.8 from the VDSL modem, make the command line window as wide as your screen allows on both routers’ management connection, and run /tool sniffer quick ip-address=8.8.8.8 on both, without specifying any interface. This should show you how far the echo request gets and whether an echo response ever comes from the internet and whether it gets back, and to what MAC address the response is sent if it is. I could imagine some ARP response conflict, possibly one of the customer routers not affected by the issue having arp=proxy-arp set at its 3011-facing interface.
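As a rough sketch, the commands involved would look like this. The sniffer line is the one quoted above; the second command is just one assumed way to inspect the ARP mode, to be run on a customer router suspected of answering ARP on behalf of others.

```
# Run on each router while the modem pings 8.8.8.8; deliberately no interface filter
/tool sniffer quick ip-address=8.8.8.8

# On a suspect customer router: the detail output includes the arp= setting of each
# Ethernet interface; look for arp=proxy-arp on the interface facing the 3011
/interface ethernet print detail
```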

Also, I am lost in your explanation regarding the DSLAM’s knowledge of MAC addresses. If your topology diagram is correct (I would expect the “customer router” to be connected at the VDSL modem end, not between the DSLAM and the 3011, plus your text description clearly states that the DSLAM is connected to the 3011), then neither the VDSL modem nor the DSLAM ever learns the MAC address of the 3011, as the customer router sits between the DSLAM and the 3011. But this uncertainty changes nothing about the steps suggested above.

I’ll draw up a diagram tomorrow that explains it better.

But no, the DSLAM sits between all customers and the RB3011. The 3011 is the ‘internet gateway’, or essentially just ‘the router’ in most organisations, and the DSLAM is the switch. All VDSL routers behind it act as their own router (so double NAT is occurring; however, all VDSL models have support for ping/traceroute and the problem still exists there, so double NAT is not the issue).
The 3011 is acting as the router for the entire network.
Just assume the DSLAM is a switch in any typical environment. The only real difference is that the DSLAM natively does port isolation (all traffic is allowed upstream, but not between clients, which is desirable).

The DSLAM, just like any switch, has an ARP table of sorts. I imagine it functions slightly differently due to the VDSL technology, but for all intents and purposes it’s an ARP table, so that the return traffic for e.g. 100.64.1.2 ends up back at VDSL port #2 and not at VDSL port #9.

However, I still don’t know whether this is a layer 2 or a layer 3 problem. And I don’t know whether the problem is the DSLAM getting confused about return traffic and not returning it to the correct port, or whether the issue is with the MikroTik router.
Logically it should all work - as I said in the opening post, there is nothing different about this config vs many, many other sites - but it’s very important that this gets diagnosed, as just throwing parts and money at the problem is not my idea of effective networking.

I have no problem with the [VDSLmodem] - [DSLAM] - [3011] topology, nor with the fact that the DSLAM acts as a switch with port isolation between the VDSL ports. Only the “Mikrotik Customer Router” element in your diagram causes the confusion :)

I can also understand that since you assign RFC 6598 addresses to the clients, there must be at least double NAT, and I agree with you that there’s no reason why it should cause trouble as such. But I’ve seen various boxes not behaving the way you’d expect them to behave.

Still, sniffing is essential to identify the issue, or at least the guilty element. But that doesn’t mean a diagram is not necessary as it may help interpret the results of the sniffing. Are there multiple DSLAMs connected to the same 3011 or is there a single 3011 dedicated to each DSLAM?

Is there any chance to switch the VDSL modem to bridge mode and connect another Mikrotik device to its Ethernet side?

Sorry, yes, I see now how it’s confusing.
‘Customer’ in this instance is the company to which we provide the primary ISP service. But we still manage the entire infrastructure.

It’s much the same as an ISP that sells to a business with its main router, yet also manages the internal equipment, i.e. switches/PCs, and in this case the DSLAM/VDSL routers.
The 3011 is essentially our direct customer, and the VDSL modems are indirect customers, if that makes sense.

We can put a MikroTik router behind the VDSL routers; however, since the issue hits a random device at random times, and with 100+ connections, picking even a dozen of them as a test bed would not be reliable, and isn’t feasible anyway. Those may never have an issue while others do.

Ah, click, so on the original diagram, the 3011 is the “Mikrotik Customer Router”, not the “MikroTik ISP Routers”. So no need to sniff at two routers, just at the 3011 itself, but still without filtering on interface, just on the remote IP (which is not affected by all the src-nats). If you see the ping responses arrive on the internet-facing interface but never leave through the DSLAM-facing one, check the ARP table for the IP address of the VDSL modem and use /ip route check for that IP address as well.
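A rough sketch of those two follow-up checks, run on the 3011. The address 100.64.1.2 stands in for the affected modem (the example from the first post), and the exact /ip route check syntax is assumed from RouterOS v6, so it may differ on other versions.

```
# Which MAC address and interface does the 3011 currently associate with the modem's IP?
/ip arp print where address=100.64.1.2

# Which route, interface and next hop would the 3011 use to reach the modem?
/ip route check 100.64.1.2
```

If the ARP entry points at an unexpected MAC or interface, or the route check resolves to something other than the DSLAM-facing interface, that would explain replies arriving from the internet but never leaving towards the modem.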