CCR1036-8G-2S+ regular packet loss

Hello,

I’m having trouble with one of our CCR1036-8G-2S+ routers: it regularly loses packets every 10 seconds.
In the attached diagram, CCR #1 is the one with the problem. If I ping any of its IPv4 addresses from any other CCR, the ping regularly fails for a fraction of a second, every 10 seconds. It happens both when I ping its Ethernet interface (ether8, which is connected to our internal office network for management) and when I ping its SFP+ interface addresses from any other CCR.

CCR #1 is connected to the internet, and the failures are not limited to the input chain: it also regularly drops packets it is forwarding.

I can verify this easily by using the ping utility with a short timeout such as 100ms: every 10 seconds (or 100 packets), one or two packets are lost. If I ping any internet address that is routed through CCR #1, it fails in the same way.
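For anyone who wants to confirm the pattern from a ping log rather than by eye, here is a minimal Python sketch. It is only an illustration: it assumes you have already turned the ping output into a list of booleans (True = reply, False = timeout), one per probe, sent at a fixed 100ms interval. The function names `loss_bursts` and `burst_period` are made up for this example, and the data at the bottom is synthetic, not taken from my router.

```python
from statistics import median


def loss_bursts(results, interval_s=0.1):
    """Group consecutive lost probes into bursts.

    results: list of booleans, True = reply received, False = timeout,
             one entry per probe, probes sent every interval_s seconds.
    Returns a list of (start_time_s, lost_count) tuples.
    """
    bursts = []
    i = 0
    while i < len(results):
        if not results[i]:
            start = i
            # Walk to the end of this run of lost probes.
            while i < len(results) and not results[i]:
                i += 1
            # Round to avoid floating-point noise in the timestamps.
            bursts.append((round(start * interval_s, 3), i - start))
        else:
            i += 1
    return bursts


def burst_period(bursts):
    """Median gap in seconds between burst starts, or None if fewer than 2 bursts."""
    starts = [t for t, _ in bursts]
    gaps = [b - a for a, b in zip(starts, starts[1:])]
    return median(gaps) if gaps else None


# Synthetic example (assumption): 30 seconds of probes at 100ms,
# with one or two losses roughly every 100 probes.
results = [True] * 300
for k in (50, 150, 151, 250):
    results[k] = False

bursts = loss_bursts(results)
period = burst_period(bursts)
print(bursts)
print(period)
```

If the median gap stays close to 10 seconds across a long capture, the loss is periodic rather than random, which points at something running on a timer instead of at congestion.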

I have read in some topics here and on other forums that the “Detect Internet” feature can cause this, and that some people solved the problem by disabling it. In my case, “Detect Internet” was never enabled. I’ve been having this issue for quite some time now and have not been able to find a solution.

Does anyone have a clue how to solve this, or how to find out what the problem is?
diagram.jpg

Are you sure no loops are being created (logical ones, not physical ones)?
How have you achieved a loop-free topology?

Hello,

There are no bridges on the CCRs, and on CRS #2 (where a loop might seem possible) interfaces 2 and 3 are set to forward only to interface 1:

[Nettfacil@CRS-2_RACK-1] /interface ethernet switch port-isolation> pr
Flags: I - invalid 
 0   name="sfp-sfpplus1" switch=switch1 

 1   name="sfp-sfpplus2" switch=switch1 forwarding-override=sfp-sfpplus1 

 2   name="sfp-sfpplus3" switch=switch1 forwarding-override=sfp-sfpplus1

The thing is, the switch on the right is an internal office switch with our computers connected to it. If I open an SSH/WinBox session to any CCR or CRS and ping any CCR or CRS other than CCR #1, it works normally. The only one showing regular packet loss every 10 seconds is CCR #1.

I actually have another CRS317 above CCR #1, which is where I connect our services. Even from that CRS317 (let’s call it CRS #0) to CCR #1, the loss still happens.

All management interfaces (ether8 on the CCRs and ether1 on the CRSs) have addresses, and when I ping the address of that interface on CCR #1 from any source, the same problem happens. There is no high CPU load and no unusually high PPS on any interface.

CCR #1 peaks at 5Gbps at most, but the problem happens even under low load (2Gbps or less).

I have updated the diagram. On CRS #0 I added a local address to the port connected to CCR #1, and the packet loss is still there.

I don’t have any fancy firewall configuration:

[usafibra@RB_BGP] /ip firewall filter> pr
Flags: X - disabled, I - invalid, D - dynamic 
 0    ;;; Someone directly connected to us (probably from IX SP) trying to use our router as gateway, This is just wrong.
      chain=forward action=drop in-interface-list=WAN out-interface-list=WAN log=no log-prefix="WAIT WHAT???" 

 1    ;;; Accept traffic from BGP peers, so the BGP session can exchange information normally
      chain=input action=accept src-address-list=bgp log=no log-prefix="" 

 2    ;;; Accept ICMP on WAN limited to 100 pps, so pings and other ICMP messages can work normally
      chain=input action=accept protocol=icmp in-interface-list=WAN limit=100,5:packet log=no log-prefix="" 

 3    ;;; Drop all traffic in input chain that comes from internet
      chain=input action=drop src-address-list=!addr-access in-interface-list=WAN log=no log-prefix="" 

 4    ;;; IX.BR rule - Drop all IPv4 packets that would go out by IPv6 vlan
      chain=output action=drop out-interface=vlan-PTT-IPv6 log=no log-prefix="" 

 5    ;;; IX.BR rule - Drop Multicast packets that would go out IPv4 vlan
      chain=output action=drop dst-address-type=multicast out-interface=vlan-PTT-IPv4 log=no log-prefix="" 

 6    chain=forward action=drop dst-address-type=multicast out-interface=vlan-PTT-IPv4 log=no log-prefix="" 

 7    ;;; Add dst-address of DNS requests to addr-access address-list, so when the DNS server responds, it's packets are not dropped
      chain=output action=add-dst-to-address-list protocol=udp address-list=addr-access address-list-timeout=1m out-interface-list=WAN dst-port=53 log=no log-prefix="" 

 8    ;;; Accept traffic comming from vlan-oper2 from oper2's public addresses
      chain=forward action=accept dst-address=a.b.c.d/29 in-interface=vlan-telium log=no log-prefix="" 

 9    ;;; Drop traffic comming from WAN interfaces that are not destinated to our AS addresses
      chain=forward action=drop dst-address-list=!bgp-announcements in-interface-list=WAN log=no log-prefix="" 

[usafibra@RB_BGP] /ip firewall raw> pr
Flags: X - disabled, I - invalid, D - dynamic 
 0    ;;; Traffic comming in WAN interfaces can't have our AS addresses as source
      chain=prerouting action=drop in-interface-list=WAN log=no log-prefix="" src-address=a.b.1.0/24 

 1    chain=prerouting action=drop in-interface-list=WAN log=no log-prefix="" src-address=a.b.2.0/24 

 2    chain=prerouting action=drop in-interface-list=WAN log=no log-prefix="" src-address=a.b.3.0/24 

 3    ;;; Traffic comming in WAN interfaces can't have local addresses as source
      chain=prerouting action=drop in-interface-list=WAN log=no log-prefix="" src-address-type=local 

 4    ;;; Documentation Addresses
      chain=prerouting action=drop log=no log-prefix="DOC" src-address-list=documentation 

 5    ;;; Traffic comming in WAN interfaces can't have LAN addresses as source
      chain=prerouting action=drop in-interface-list=WAN log=no log-prefix="" src-address-list=local-net 

 6  D ;;; /ip firewall connection tracking set enabled=no
      chain=prerouting action=notrack 

 7  D ;;; /ip firewall connection tracking set enabled=no
      chain=output action=notrack

I have tried disabling all firewall rules, and the loss doesn’t stop.

For now, I have no other idea where to look for the cause.
diagram.jpg

Have you checked the logs on all your devices?

Yes, all the time. There are no errors or warnings of any kind.
I have a CCR1072 which I can use to replace that CCR #1 (it is a CCR1036-8G-4S+), but I want to find where the issue is, so I know whether I can keep using that CCR1036 or should send it in under warranty.
I’m starting to think it is some kind of physical issue, but not one limited to a single Ethernet chip, for example, because it happens on any port, Ethernet or SFP+.

I should also add that I was using stable 6.48.x (I don’t remember exactly which release), and I changed to long-term 6.47.10 and the problem persists.