CRS317 - HW Offloading and Virtual Cluster IPs

Good morning!
I am wondering if somebody could help me with the config of my “core” switch? It's a CRS317-1G-16S+ running RouterOS 7.5.
The device is responsible for switching and inter-VLAN routing, and it works pretty well. However, I have an issue with my failover cluster.
Under normal circumstances “it just works” ™, but when a failover occurs, the cluster IP moves from one node to the other and problems arise.
It's fine for all devices in the same subnet/VLAN, but the cluster IP stops being reachable from machines in other subnets/VLANs.
I read http://forum.mikrotik.com/t/how-does-l3hw-actually-works/155752/1 and came to the conclusion that I probably need FastTrack Connection HW Offloading for the cluster links, so that the switch notices that the MAC address for that IP has just changed.
So I turned off l3-hw-offloading on the ports (on the switch, not on the bridge) for all links going to the cluster nodes and created a firewall rule for that:

/interface ethernet switch
set 0 l3-hw-offloading=yes
/interface ethernet switch port
set 4 l3-hw-offloading=no
set 5 l3-hw-offloading=no
set 6 l3-hw-offloading=no
set 7 l3-hw-offloading=no

/ip firewall filter
add action=fasttrack-connection chain=forward connection-state=established,related,untracked hw-offload=yes

But that didn't work (I rebooted the switch after making the change).
I can ping the virtual cluster IP from the switch, but I get two echo replies per echo request.
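To see which MAC address the switch currently associates with the cluster IP right after a failover, the ARP table can be inspected from the CLI. This is a minimal sketch; `10.0.20.50` is a hypothetical placeholder for the virtual cluster IP, not an address from this thread. It shows the router's own ARP entry only and does not by itself prove what the hardware-offloaded tables contain.

```routeros
/ip arp
# 10.0.20.50 is a hypothetical cluster IP; replace it with your own
print where address=10.0.20.50
```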
Kind regards!

Hi,
Do you use bonding by any chance? I have the same issue as you with my Proxmox cluster and live migration, but it occurs only when a node is connected with LACP and when l3-hw is on. With a single connection everything works fine.
I spent hours trying to tweak the configuration of my CRS317 to overcome this problem, but I did not find any solution, so I gave up on LACP for the moment. I was about to describe my problem here when I noticed your post. By the way, I found out that quickly disabling and re-enabling the VLAN interface the VM is attached to gets inter-VLAN traffic flowing again. Disabling and re-enabling L3HW offloading at the switch level has the same effect. This is of course not a solution, but maybe it will help someone knowledgeable find one :)
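The temporary workaround described above can be sketched with the same switch setting used earlier in the thread. `vlan20` is a hypothetical name for the VLAN interface the VM is attached to. Either toggle only restores traffic until the next migration or failover; it is not a fix.

```routeros
# Option 1: turn L3HW offloading on switch 0 off and back on
/interface ethernet switch
set 0 l3-hw-offloading=no
set 0 l3-hw-offloading=yes

# Option 2: bounce the VLAN interface the VM sits on (vlan20 is hypothetical)
/interface vlan
disable [find name=vlan20]
enable [find name=vlan20]
```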

Morning gents!
With the help of a friend, we figured out that I have to disable l3-hw-offloading on all switch ports to get it to work.
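Disabling L3HW offloading on every switch port, as described above, could look like the sketch below. It assumes `[find]` selects exactly the ports you want in the `/interface ethernet switch port` menu on your unit; check the `print` output first and narrow the selection if, for example, the CPU port rejects the setting. Traffic ingressing ports without offloading is routed by the CPU instead, which on the CRS317 is much slower than the switch chip.

```routeros
/interface ethernet switch port
# review which ports exist before changing them
print
# disable L3HW offloading on all listed ports
set [find] l3-hw-offloading=no
```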
Kind regards

Oh Boy…
The device is restarting randomly:

oct/10/2022 11:11:02 system,critical,info ntp change time Oct/10/2022 11:10:10 => Oct/10/2022 11:11:02
oct/10/2022 15:54:09 system,error,critical router was rebooted without proper shutdown
oct/10/2022 15:54:11 system,error,critical kernel failure in previous boot

That's all I can find in the logs, because they are flooded with interface up/down events.

Have you tried 7.4.1? 7.5 did some buggy things for me on my CCR2116. All my (critical) production gear is 7.4.1 (for arm64).

Morning!
Thanks @sirbryan! That one finally works!!!
I tried 7.6RC1 as well, but: uiuiuiuiui. That's seriously not a good one.
The CRS317 corrupted packets when routing: simple inter-VLAN traffic became “invalid” after passing through the CRS.
We tried to figure out what exactly was wrong with the packets, but it was hard to catch and debug, and I was desperate to get a working network :wink:

Urgh. Seeing the same again on 7.4.1.
I am going to pull it out of the rack and put it back on the shelf.

Hi @thefriendlyguy, please check the second post in the thread. I wrote it on Sunday, but it has only now been released by the mod. I am still curious whether you have bonding in your setup.

Hi martinidry!
Thanks a lot for your participation in this discussion.
Well, I do have LACP trunks.
BUT: those are not related or connected to the cluster:

  • LACP Trunk coming from my Access Switch - Cisco 2960x
  • LACP Trunk going to my Edge Router - CCR2004-1G-12S+2XS

The cluster nodes use Switch Embedded Teaming (SET) to connect to the CRS317.
→ about SET: https://charbelnemnom.com/deploying-switch-embedded-teaming-set-on-hyper-v-using-powershell-dsc-powershell-dsc-hyperv/
So, each node has two links to the crs but there is no LACP involved there.

OK, so it is not quite the same setup as mine, because one of my nodes in the cluster was indeed connected via LACP. Nevertheless, without LACP everything works with l3-hw enabled. I can live-migrate the VMs from one node to another back and forth without losing connectivity.

It might be a long shot, but maybe it is worth trying again without LACP and SET before putting the switch on the shelf? :slight_smile: