iradev
just joined
Topic Author
Posts: 8
Joined: Thu Oct 28, 2021 9:59 am

Unexpected rebooting of CRS326-24S+2Q+RM

Thu Oct 28, 2021 11:17 am

Hello,

I have deployed a network of CRS326-24S+2Q+RM units working as L3 switches with l3-hw-offloading=yes for better performance.
To use hw-offloading, the CRS326-24S units were upgraded to version 7.1rc4.
OSPF is used as the routing protocol, and after several CRS326 units were deployed in the production environment they began rebooting unexpectedly.

In our monitoring system the following behavior is observed:
  • RAM usage slowly increases over time until the device runs out of memory, RouterOS crashes, and the CRS326 reboots.
We stopped OSPF and configured static routes, but the problem was still present.
After that we upgraded the CRS326 units in our lab to version 7.1rc5.
This helped with some of the L3 switches, but one of them still has the out-of-memory reboot issue.

I've attached a graph of the memory usage.
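
A growth pattern like this can be quantified from monitoring samples. The following is a minimal Python sketch, assuming the samples are available as (hours since boot, used RAM in MiB) pairs; the sample values are hypothetical and not taken from the graph, and the 64 MiB total is the RAM size quoted for this model by MikroTik support in this thread.

```python
# Hypothetical samples: (hours since boot, used RAM in MiB). Not real data.
samples = [(0.0, 30.0), (1.0, 34.0), (2.0, 38.0), (3.0, 42.0)]
TOTAL_RAM_MIB = 64.0  # RAM size of the CRS326-24S+2Q+RM as quoted in the thread


def fit_line(points):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def hours_until_full(points, total_mib):
    """Hours from the last sample until the fitted line reaches total_mib.

    Returns None when the fitted trend is flat or decreasing.
    """
    slope, intercept = fit_line(points)
    if slope <= 0:
        return None
    last_x = points[-1][0]
    return (total_mib - intercept) / slope - last_x


slope, intercept = fit_line(samples)
print(f"growth: {slope:.2f} MiB/h, start: {intercept:.2f} MiB")
print(f"estimated hours until RAM is exhausted: {hours_until_full(samples, TOTAL_RAM_MIB):.1f}")
```

A straight-line fit is only a rough guide: the device will normally reboot before usage reaches the full RAM size, so the estimate is an upper bound on the time to the next reboot.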
 
raimondsp
MikroTik Support
Posts: 267
Joined: Mon Apr 27, 2020 10:14 am

Re: Unexpected rebooting of CRS326-24S+2Q+RM

Thu Oct 28, 2021 4:02 pm

Hi there,

Are you the one who created the SUP-64006 ticket?
 
iradev
just joined
Topic Author
Posts: 8
Joined: Thu Oct 28, 2021 9:59 am

Re: Unexpected rebooting of CRS326-24S+2Q+RM

Thu Oct 28, 2021 5:35 pm

Hello,

My colleague opened the SUP-64006 ticket.
 
raimondsp
MikroTik Support
Posts: 267
Joined: Mon Apr 27, 2020 10:14 am

Re: Unexpected rebooting of CRS326-24S+2Q+RM

Fri Oct 29, 2021 10:40 am

We are investigating the issue. Supout data does not show anything unusual. We will put CRS326-24S+2Q+RM under heavy load for the weekend to see if the issue is reproducible in our lab.

What is the common route rate received via OSPF on your side (i.e., an average number of routes per second or minute received via OSPF by CRS326)?
 
iradev
just joined
Topic Author
Posts: 8
Joined: Thu Oct 28, 2021 9:59 am

Re: Unexpected rebooting of CRS326-24S+2Q+RM

Fri Oct 29, 2021 11:58 am

Hello,

The attached picture shows our topology from the first lab test.

In our LLD we implement ECMP (equal-cost multi-path routing) with OSPF.

Currently we have a production test environment with OSPF and version 7.1rc4, and a lab environment with static routes and version 7.1rc5.
I changed the routing in our lab to reduce the control-plane CPU load, but this didn't resolve the problem.
After upgrading to version 7.1rc5 the situation is as follows: CoreSW1 is working fine, but CoreSW2 still has the problem.
We swapped the configurations between CoreSW1 and CoreSW2 and still have the same issue with CoreSW2.

Today I plan to stop using ECMP and configure traditional redundant static routing.
As for the typical route rate received via OSPF: with the default 30-minute LSA refresh, I calculated an average of 1.4 routes/min.
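
The arithmetic behind a figure like this can be sketched as below. This assumes the rate was obtained by dividing the number of OSPF routes by the LSA refresh interval; the count of 42 routes is inferred from 1.4 × 30 and is not stated in the post.

```python
LSA_REFRESH_MINUTES = 30  # default LSA refresh interval mentioned above


def routes_per_minute(route_count, refresh_minutes=LSA_REFRESH_MINUTES):
    """Average route updates per minute if every route is refreshed once per interval."""
    return route_count / refresh_minutes


# 42 is a hypothetical route count, chosen because 42 / 30 = 1.4
print(routes_per_minute(42))
```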
 
blazarov
just joined
Posts: 12
Joined: Mon Jun 27, 2016 2:12 pm

Re: Unexpected rebooting of CRS326-24S+2Q+RM

Fri Oct 29, 2021 1:26 pm

Hi,
I am also part of the team working on this issue, and I am trying to understand how everything works around L3HW offloading. In this respect, can you shed some light on L3HW offloading and how it relates to fast path and the route cache?

Could the issue we are suffering from be somehow related to fast path and/or the route cache?
So far, in this same lab, it seems to me that everything works a little better with fast path and the route cache disabled. Am I on the right track?
Unfortunately, this RouterOS version has no /ip route cache ... command.
 
raimondsp
MikroTik Support
Posts: 267
Joined: Mon Apr 27, 2020 10:14 am

Re: Unexpected rebooting of CRS326-24S+2Q+RM

Mon Nov 01, 2021 11:47 am

Thank you for the detailed feedback!

We put a CRS326-24S+2Q+RM under heavy stress-test load in our lab over the weekend. Unfortunately (or fortunately?), we were unable to reproduce your issue and didn't detect any memory leaks (RAM usage stayed stable during the entire session). However, we didn't use ECMP, so maybe that is the cause.

I suggest you try (if possible):
  1. Disable ECMP - use only single nexthop gateways.
  2. Disable OSPF - try with static routing.
  3. Disable both ECMP and OSPF.
Of course, I understand that the above suggestions are not solutions - but those may help find the root cause of the issue by reducing the scope.


Answering your questions:
  • RouterOS 7 does not support legacy route caching. Here is my post explaining why.
  • Layer 3 Hardware Offloading Wiki
  • In Full Hardware Routing mode (your case, "l3-hw-offloading=yes" on the switch and ports), packets are routed solely by the switch chip. Forwarded packets do not enter the CPU, and therefore neither the firewall nor the fast path (FastTrack) affects them.
  • In Firewall-compatible mode ("l3-hw-offloading=yes" on the switch but "no" on ports), packets get processed by the CPU first, then FastTrack connections get offloaded to the hardware. This is where the fast path takes place. However, that is not your case.
 
blazarov
just joined
Posts: 12
Joined: Mon Jun 27, 2016 2:12 pm

Re: Unexpected rebooting of CRS326-24S+2Q+RM

Wed Nov 03, 2021 1:01 pm

Hi Raimond, and thanks for your feedback.

We did as instructed, and so far it seems ECMP is the main culprit, as disabling it in the lab resulted in a major improvement.

However, exactly the same setup in production (not really production: no active users, but a few L2 access switches connected) is still very unstable, with obvious memory leaks causing a reboot every 2 to 4 hours.

We have spent hundreds of man-hours so far trying to find the cause, with little success. Any ideas are welcome!
 
raimondsp
MikroTik Support
Posts: 267
Joined: Mon Apr 27, 2020 10:14 am

Re: Unexpected rebooting of CRS326-24S+2Q+RM

Wed Nov 03, 2021 1:04 pm

Hi,

Did disabling ECMP completely solve the issue, or are there still router reboots even with ECMP disabled?

Meanwhile, our engineers will double-check the ECMP implementation.
 
blazarov
just joined
Posts: 12
Joined: Mon Jun 27, 2016 2:12 pm

Re: Unexpected rebooting of CRS326-24S+2Q+RM

Wed Nov 03, 2021 1:08 pm

In the lab, where we have just 3 switches (2 core + 1 distribution), it seems to have completely solved it. We have 2-3 days of uptime, which we have never seen before.

However, the strange thing is that the same config in the real environment is still very unstable.

We are disabling, one by one, any feature or command that might cause this, but still no luck.

Are there any commands or steps that can give us a hint about where this leak is coming from?
We tried enabling all logs, but even there we see nothing interesting.
 
raimondsp
MikroTik Support
Posts: 267
Joined: Mon Apr 27, 2020 10:14 am

Re: Unexpected rebooting of CRS326-24S+2Q+RM

Thu Nov 04, 2021 1:57 pm

After many attempts, we were finally able to reproduce the issue. By the looks of it, the memory leak is unrelated to L3HW/OSPF/ECMP but MIPS-specific (CRS326-24S+2Q+RM uses MIPS CPU). The case has received top priority and is under investigation to identify the root cause.

Thanks for the feedback, and we're sorry for the inconvenience.
 
blazarov
just joined
Posts: 12
Joined: Mon Jun 27, 2016 2:12 pm

Re: Unexpected rebooting of CRS326-24S+2Q+RM

Thu Nov 04, 2021 3:02 pm

Hi Raimond,

Great news!
We are already running the debug firmware in the lab and collecting logs to share.
A MIPS CPU-specific issue, rather than a feature-specific one, might explain the inconsistent behavior and reproduction of the problem. We have made hundreds of tests and small changes and still couldn't find what exactly causes the problem. For example, the last thing we did before we "stabilized" the production network was just to change some interface descriptions, which is extremely strange.
We are looking forward to a solution, and we are ready to contribute anything that might help the MikroTik team find one.

Thanks!
 
iradev
just joined
Topic Author
Posts: 8
Joined: Thu Oct 28, 2021 9:59 am

Re: Unexpected rebooting of CRS326-24S+2Q+RM

Mon Nov 08, 2021 11:00 am

Hi,

We have updated the production environment to debug firmware 7.99 from support.
We also restored the previous settings, such as OSPF with ECMP.
The debug firmware is more stable than the previous ones, but this morning Distribution sw4 rebooted during an attempt to access it.
On sw4 the link to the second core switch is not down, and the traffic is routed to the first core switch.
Memory usage on sw4 is still increasing.
I have attached a few screenshots from our monitoring system.

Thanks!
 
raimondsp
MikroTik Support
Posts: 267
Joined: Mon Apr 27, 2020 10:14 am

Re: Unexpected rebooting of CRS326-24S+2Q+RM

Mon Nov 08, 2021 12:02 pm

Hey there,

We are sorry that the fix didn't fully address your issues, and we're continuing to investigate the possible problems.

Which one is sw4 on the diagram above? I see CoreSwitch1 and 2 and DistributionSwitch1, 2, and 3 there.
 
raimondsp
MikroTik Support
Posts: 267
Joined: Mon Apr 27, 2020 10:14 am

Re: Unexpected rebooting of CRS326-24S+2Q+RM

Mon Nov 08, 2021 12:16 pm

Try disabling connection tracking and see if that prevents RAM consumption:
/ip/firewall/connection/tracking/set enabled=no
Requires router reboot after disabling conntrack.
 
iradev
just joined
Topic Author
Posts: 8
Joined: Thu Oct 28, 2021 9:59 am

Re: Unexpected rebooting of CRS326-24S+2Q+RM

Mon Nov 08, 2021 1:35 pm

Hi,

I will disable connection tracking.

The diagram is for information only; switch4 is installed in place of switch3 for testing purposes.
 
iradev
just joined
Topic Author
Posts: 8
Joined: Thu Oct 28, 2021 9:59 am

Re: Unexpected rebooting of CRS326-24S+2Q+RM

Mon Nov 08, 2021 5:47 pm

Hi,

Connection tracking was already disabled, and the switch crashed again.

We will install a new firmware version and try to collect logs from the production site.
 
iradev
just joined
Topic Author
Posts: 8
Joined: Thu Oct 28, 2021 9:59 am

Re: Unexpected rebooting of CRS326-24S+2Q+RM

Tue Nov 09, 2021 10:05 am

Hi,

Switch4 has been upgraded to the new firmware version 7.99-2, and the originally observed memory leak problem is still present.

We will arrange to collect logs from production switch4 for analysis.
 
raimondsp
MikroTik Support
Posts: 267
Joined: Mon Apr 27, 2020 10:14 am

Re: Unexpected rebooting of CRS326-24S+2Q+RM

Tue Nov 09, 2021 12:03 pm

Hi, we have identified another issue. Actually, it is not a memory leak - it is just an increased memory consumption due to unresolved ARP entries (or IPv6 neighbors). We will test the solution and send you a new beta soon.
 
blazarov
just joined
Posts: 12
Joined: Mon Jun 27, 2016 2:12 pm

Re: Unexpected rebooting of CRS326-24S+2Q+RM

Tue Nov 09, 2021 12:09 pm

Great news!
Actually, in our environment we have completely disabled IPv6.
Furthermore, we continuously check the ARP cache, and it seems pretty much static and empty. Or are you referring to something that is not normally visible via the interface?
 
raimondsp
MikroTik Support
Posts: 267
Joined: Mon Apr 27, 2020 10:14 am

Re: Unexpected rebooting of CRS326-24S+2Q+RM

Tue Nov 09, 2021 12:33 pm

You may check:
/ip/arp/print count-only where !complete
We have identified that incomplete (a.k.a. unresolved) ARP entries use excessive memory. While it is not an issue on routers with a large amount of RAM, it might be problematic for the CRS326-24S+2Q+RM with only 64MB, where it is possible to run out of memory before the garbage collector kicks in. We have fixed the issue and are currently testing the solution.

Would you mind checking the number of incomplete ARP entries on your problematic device? If there are hundreds or even thousands of entries, then that is the cause. Otherwise, we'll need to dig elsewhere.
 
blazarov
just joined
Posts: 12
Joined: Mon Jun 27, 2016 2:12 pm

Re: Unexpected rebooting of CRS326-24S+2Q+RM

Tue Nov 09, 2021 1:25 pm

Our problematic devices have very few entries in the ARP table in total, and normally none of them are incomplete. So we are probably suffering from a different problem, as yet unknown, unfortunately.

For example, on the same switch from the graph above:
[confmaster@DistributionSW4] > /ip/arp/print count-only where !complete
1
 
raimondsp
MikroTik Support
Posts: 267
Joined: Mon Apr 27, 2020 10:14 am

Re: Unexpected rebooting of CRS326-24S+2Q+RM

Tue Nov 09, 2021 1:28 pm

Then it looks like a different problem :(

And how many ARP entries are there in total?
 
blazarov
just joined
Posts: 12
Joined: Mon Jun 27, 2016 2:12 pm

Re: Unexpected rebooting of CRS326-24S+2Q+RM

Tue Nov 09, 2021 1:31 pm

Since we discovered the instability we moved all users off the new network, so basically we are just keeping 2 core switches and 3 distribution switches with no real users connected.
Normally the ARP table contains only 2 to 4 entries in total. This makes sense, as the distribution switch has an ARP entry for each core switch (L3 links), and the core switch has an entry for each distribution switch and the FW (default gw).
 
raimondsp
MikroTik Support
Posts: 267
Joined: Mon Apr 27, 2020 10:14 am

Re: Unexpected rebooting of CRS326-24S+2Q+RM

Tue Nov 09, 2021 3:19 pm

Then I suppose that our ARP fix won't help. Well, thanks for indirectly helping us to identify and solve those issues anyway.

Now back to your problem. We did a side-by-side comparison using the DistributionSW4 configuration you provided. Since we are using the same hardware and software but cannot reproduce the issue, the problem must be somewhere in the configuration or in different usage patterns. Please take the following steps to narrow down the scope:
  • Replace OSPF with static routes.
  • Disable log saving to disk:
    /system/logging/disable [find where action=disk]
  • Disable RADIUS.
  • Disable SNMP. I know that disabling SNMP also reduces monitoring capabilities, but we need to ensure that the monitoring does not cause memory leaks under some conditions. As a temporary alternative, you may log in via SSH and run:
    /system/resource/monitor interval=5s without-paging
    This way, you will still notice an increase in memory usage or a device reboot.
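
The elimination procedure above can be sketched as follows. This is a minimal Python illustration, not a RouterOS script: the feature names and the `leaks` callable are hypothetical stand-ins for toggling a feature on the switch and watching memory usage for several hours, and the simulated trigger is chosen arbitrarily for the demo.

```python
FEATURES = ["ospf", "disk-logging", "radius", "snmp"]


def find_culprits(features, leaks):
    """Return the features that leak when enabled alone on an all-disabled baseline.

    `leaks` is a hypothetical callable: it receives the set of enabled
    features and returns True if memory usage grew while they were active.
    """
    if leaks(frozenset()):
        raise ValueError("baseline still leaks; the cause is outside this feature list")
    return [f for f in features if leaks(frozenset({f}))]


def simulated_leaks(enabled):
    # Simulated observation for illustration only; the trigger is arbitrary.
    return "snmp" in enabled


print(find_culprits(FEATURES, simulated_leaks))
```

Testing one feature at a time keeps each observation easy to interpret, at the cost of missing problems that only appear when two features are combined.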
 
iatanasov
just joined
Posts: 3
Joined: Thu Nov 04, 2021 11:10 am

Re: Unexpected rebooting of CRS326-24S+2Q+RM

Wed Nov 10, 2021 10:32 am

Hello Team,

We have progress here. After disabling SNMP, logging to disk, and RADIUS on Distribution 4, RAM has been stable for 15 hours. Today we re-enabled logging to disk and RADIUS on Distr4 (RAM continues to be stable). For the past 2 hours only SNMP has been stopped on Distr4-LAB, and its RAM is stable too. I attach screenshots from the devices.
 
iatanasov
just joined
Posts: 3
Joined: Thu Nov 04, 2021 11:10 am

Re: Unexpected rebooting of CRS326-24S+2Q+RM

Tue Nov 16, 2021 11:53 am

Hello Team,

I am updating the ticket with some information about our issue. For one week there has been no change in the situation. DistrSW4, with only the SNMP service stopped, is very stable. This is the only switch with clients attached. We have one switch, DistrSW1, with the SNMP service active, and it is stable too. All other switches have SNMP stopped and show no problems. The picture is this: devices without clients attached work correctly (with or without SNMP), but if clients are attached and SNMP is active, RAM usage starts to grow and the switch restarts.
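
The observations in this post can be laid out as a small table and checked against candidate explanations. This is a minimal Python sketch; the row labels and the `explains` helper are illustrative, and the first row reflects the earlier reports of growing memory on DistrSW4 while SNMP was still active.

```python
# (switch, clients_attached, snmp_active, ram_grows)
observations = [
    ("DistrSW4, SNMP on", True, True, True),
    ("DistrSW4, SNMP off", True, False, False),
    ("DistrSW1", False, True, False),
    ("other switches", False, False, False),
]

# Candidate explanations: each predicts RAM growth from the two factors.
candidates = {
    "clients only": lambda clients, snmp: clients,
    "SNMP only": lambda clients, snmp: snmp,
    "clients AND SNMP": lambda clients, snmp: clients and snmp,
}


def explains(predicate, rows):
    """True if the predicate matches the observed outcome on every row."""
    return all(predicate(clients, snmp) == grows for _, clients, snmp, grows in rows)


consistent = [name for name, pred in candidates.items() if explains(pred, observations)]
print(consistent)
```

Only the combined condition matches every row, which agrees with the conclusion stated above.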
 
raimondsp
MikroTik Support
Posts: 267
Joined: Mon Apr 27, 2020 10:14 am

Re: Unexpected rebooting of CRS326-24S+2Q+RM

Tue Nov 16, 2021 4:10 pm

Hello there,

I'm glad that everything besides SNMP works fine on your end. Our support team should contact you shortly, asking for details of your SNMP setup and use cases. If they forget, ping them via support email ;)

Have a good day!
 
iradev
just joined
Topic Author
Posts: 8
Joined: Thu Oct 28, 2021 9:59 am

Re: Unexpected rebooting of CRS326-24S+2Q+RM

Thu Nov 18, 2021 10:51 am

Hello,

We installed 3 additional CRS326-24S+2Q+RM units with version 7.99-2 in the production environment and had a problem with the OSPF routing protocol:
the following error appeared and the neighbors got stuck in the Exchange state.

OspfInterface { { 2 *18 0.0.0.0 0 10.48.248.194 } Backup DR Broadcast } auth data corrupted from 10.48.248.193

After we disabled OSPF authentication, it works fine.
 
raimondsp
MikroTik Support
Posts: 267
Joined: Mon Apr 27, 2020 10:14 am

Re: Unexpected rebooting of CRS326-24S+2Q+RM

Wed Nov 24, 2021 9:33 am

Hi,

An issue with OSPF authentication will be fixed in the next version.
Also, we were able to reproduce an increasing memory usage by SNMP, and developers are looking for a solution.

Thanks for the feedback!
 
raimondsp
MikroTik Support
Posts: 267
Joined: Mon Apr 27, 2020 10:14 am

Re: Unexpected rebooting of CRS326-24S+2Q+RM

Fri Nov 26, 2021 11:26 am

Please upgrade to RouterOS v7.1rc7. The OSPF authentication issue should be fixed there.

Meanwhile, the developers have identified the cause of the increasing SNMP memory usage and are working on a fix.
 
iatanasov
just joined
Posts: 3
Joined: Thu Nov 04, 2021 11:10 am

Re: Unexpected rebooting of CRS326-24S+2Q+RM

Tue Nov 30, 2021 6:17 pm

Hello,

Two days ago we updated the units in our lab to test version 7.2beta21, provided to us by MikroTik Support. The routers have stable RAM usage and no problems with OSPF authentication. Today we ran SNMP walks dozens of times while watching 4K TV at the same time, and the switch is stable. Tomorrow we start updating the devices in production.
 
mozerd
Forum Veteran
Posts: 872
Joined: Thu Oct 05, 2017 3:39 pm
Location: Canada

Re: Unexpected rebooting of CRS326-24S+2Q+RM

Tue Nov 30, 2021 6:56 pm

version 7.2beta21 …… now that is interesting :)
 
alphaonezero
just joined
Posts: 3
Joined: Sat Mar 27, 2021 7:24 am

Re: Unexpected rebooting of CRS326-24S+2Q+RM

Tue Nov 30, 2021 7:08 pm

I've been watching this thread since the beginning, and I'm very interested to hear how stable the switches are in production with the updated RouterOS. We were looking to deploy these switches in the core part of our network, but during initial testing we had issues that I believe are related to this SNMP memory leak. I'd love to hear an update once you've had a chance to run them in production for a while.
