I’ve got two CRS326’s and I’m trying to set them up as a pair in a lab so I can have servers run LACP connections across both.
My config is below. If I enable MLAG across the SFP+1 ports, the link comes up but no VLAN1 traffic flows across it, so with my ethernet connection to one switch I get disconnected from the other, both by IP and Winbox. If I connect Ether24 as an additional link, STP fails to block ports and it loops, taking out the entire network and everything connected to it. If I remove the MLAG config, STP still fails like this (on connection of the additional Ether24 link) until I reboot the switches. Layer3 HW-Offload is disabled and I’m using RSTP (MSTP is allowed in this release, but if anything it makes things worse). Am I missing something fundamental or is this very broken?!
But the PVID of your MLAG port has to be a different VLAN. You’re tagging VLAN 1 across what I assume is supposed to be your MLAG port (your MLAG line is missing), and that won’t work.
Here is the relevant (scrubbed) config from one of my working MLAG setups with a CRS312 and CRS354 in the lab. This has two MLAG-enabled bonds to other devices, a CCR2116 and a pair of CRS317’s configured as another MLAG stack in my networking core.
(I’m using an LACP bond between the two, but with only one port in the bond for now. A bond isn’t necessary, of course, but configuring it this way means I can add or change ports in the bond without changing the rest of the MLAG config.)
Also, all settings corresponding to RSTP (priority, path costs, edge ports, etc.) have to be identical on both switches. I was chasing down odd one-way issues on this stack just this past weekend until I realized some of those settings were different because of previous configuration.
/interface bridge
add admin-mac=XX:XX:XX:XX:XX:XX auto-mac=no name=bridge port-cost-mode=short vlan-filtering=yes
/interface ethernet
# This is the CRS312 side. The config on the CRS354 is identical, except for the port names
# This port goes to our MLAG peer
set [ find default-name=combo4 ] combo-mode=sfp name=combo4-lag-crs354
# This port is physically connected to CRS317 that is part of a separate MLAG stack
set [ find default-name=combo3 ] combo-mode=sfp name=combo3-tsdc-mlag-02
# The aforementioned ports are put into two LACP bonds
/interface bonding
# This is the MLAG peer link with a single port in the bond for now
add lacp-rate=1sec mode=802.3ad name=bond-mlag-0-peer slaves=combo4-lag-crs354
# This bond spans our local MLAG set up, with one port on each switch going to the other switch stack
add lacp-rate=1sec min-links=1 mlag-id=20 mode=802.3ad name=bond-tsdc-mlag slaves=combo3-tsdc-mlag-02 transmit-hash-policy=layer-3-and-4
# Set up MLAG capability locally
/interface bridge mlag
set bridge=bridge peer-port=bond-mlag-0-peer
# I'm using VLAN 4094 for the MLAG peer port VLAN
/interface bridge port
add bridge=bridge interface=bond-mlag-0-peer pvid=4094
# MLAG is now set up; Add the other bonds to the bridge
add bridge=bridge interface=bond-tsdc-mlag
# bond-2116 is the bond to the CCR2116; its definition is not part of this scrubbed excerpt
add bridge=bridge interface=bond-2116
# To make it useful, add relevant VLAN tags to both the bond and to any other ports in the bridge
/interface bridge vlan
# VLAN 1 as "Native" (PVID) for office/lab network
add bridge=bridge tagged=bond-mlag-0-peer untagged=bridge,bond-2116,bond-tsdc-mlag vlan-ids=1
# The following VLANs go from the CCR2116 to the CRS317 stack by way of our MLAG stack
# This goes on both switches in the MLAG pair; each peer needs to see all VLANs
add bridge=bridge tagged=bond-mlag-0-peer,bond-2116,bond-tsdc-mlag vlan-ids=75-76
add bridge=bridge tagged=bond-mlag-0-peer,bond-2116,bond-tsdc-mlag vlan-ids=101
add bridge=bridge tagged=bond-mlag-0-peer,bond-2116,bond-tsdc-mlag vlan-ids=102
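To check that the peer link and the MLAG bond actually came up after applying a config like the one above, something along these lines can be run on each switch. This is only a sketch: the interface names follow the excerpt above, and the exact fields shown may differ between RouterOS versions.

```
# MLAG peering state over the peer port
/interface bridge mlag monitor
# LACP state of the MLAG-enabled bond
/interface bonding monitor bond-tsdc-mlag once
# Per-port STP role and state
/interface bridge port monitor [find] once
# VLAN membership as the bridge currently sees it
/interface bridge vlan print
```

If the peer port shows as connected on both sides but a VLAN is missing from the peer port in the VLAN table, traffic for that VLAN would not be expected to cross between the two switches.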
Thanks for replying and including detailed config.
Yes, this is the config without MLAG active as doing so breaks STP.
Yes, my MLAG peer port is sfp-sfpplus1 and I’d set its PVID to 999 which was to be my dedicated ICCP VLAN.
I’d tagged VLAN1 because one of the guides I was following said it was required to tag all VLANs across the ICCP link, but I’m pretty sure I tried it first without. Aren’t you doing the same though?
One of the things I’ve not found anywhere is mention of how much bandwidth there needs to be on this link. Is it just control traffic, so a 1G link will cover it, or does it potentially need the max possible? If less bandwidth is necessary, could the problem be that I’m using a 10Gbps DAC and those ports may be connected differently to the 1G ports (haven’t checked the block diagram but assume that’s the case)?
I factory reset everything at a few points because I was chasing down problems, so the RSTP settings ought to be the same on both sides, although not in the network the pair was attached to for lab’ing it out and updating things.
What version of ROS are you running on both those boxes? That could also be relevant.
The only differences I can instantly see - and please, correct me if I’m wrong - are that you’ve turned on MLAG, you’re using 1G ports instead of 10G ports and you’re bonding them (a group of one and I understand why). It’s either got to be one of those, something I’ve missed, the ROS version or a hardware difference between your 312/354 and my 326’s. Yours are both MIPSBE and mine ARM32 and the switch chips are different in all.
Thanks Effitall. I did spot some posts along the same lines, after I’d posted, obviously. My gut is that it’s either broken or just too risky when it downs the entire network beyond remote repair. I am curious as to why it works for some though. Maybe like you say, it works on the slightly older versions…but then I’d be stuck as to whether it was safe to update or not.
It’s a shame as the CRS326 has been pretty reliable for customers that didn’t need MLAG…but because of the remote location and the required uptime, it’s a deal-breaker not to have it. Hopefully VRRP on the CCRs doesn’t pull something similar on me!
Without the rest of the MLAG config I assumed that you might have wanted to use 1 as your ICCP VLAN.
Think of this link as similar to a backplane link on a switch stack. This link will be a bottleneck for all connected LACP peers. If all you need across all connected devices is 1-2Gbps, then you can use 1Gbps ports. Otherwise, an SFP+ would be recommended.
I’m using 7.13.2 on one pair of CRS317’s, 7.14.1 on another pair of CRS317’s, and 7.14.2 on the 354 and 312. I have a number of 2116’s and other CRS300’s connected to them using LACP.
The trick is to start with a clean (blank) switch setup, get the MLAG peer ports configured and running first (the basics I posted), then start adding LACP peers across them.
One pair of 317’s were fresh out of the box. The second set of 317’s started with one in production, and I added a second switch to it. Then I migrated some of the other on-site devices over to the pair.
All are using SFP+ ports as peer ports. The two 317 setups are using a pair of DACs on two SFP+ ports in a bond, and then peering over the bond. That gives me a 20Gbps “backplane”. The CRS354 and CRS312 have just one DAC between them on SFP+ ports.
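A two-port peer bond like the one described for the 317’s might look like the sketch below. The port names and the bond name are placeholders, not copied from a running config, and VLAN 4094 is reused from the earlier excerpt.

```
/interface bonding
# Two SFP+ DACs bonded together form the MLAG peer link
add mode=802.3ad name=bond-mlag-peer slaves=sfp-sfpplus1,sfp-sfpplus2
/interface bridge port
# The peer link gets its own dedicated PVID
add bridge=bridge interface=bond-mlag-peer pvid=4094
/interface bridge mlag
set bridge=bridge peer-port=bond-mlag-peer
```

Because MLAG only references the bond, ports can later be added to or removed from the bond's slaves list without touching the MLAG settings.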
For the lab, it was a bit trickier because I had lots of devices connected to them (they double as my desk/office switches) and had muddled with the bridge settings. But all the blockages, loops, and other issues on that pair were 100% related to bad/dissimilar STP settings. I can now reboot any of those two and/or the 2116 and things work as expected.
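One way to spot that kind of mismatch is to dump the bridge settings on both switches and compare them side by side. A rough sketch, assuming the bridge is named bridge as in the config posted earlier:

```
# Full bridge config including defaults; run on both peers and diff the output
/interface bridge export verbose
# Bridge-level STP values (protocol-mode, priority, port-cost-mode)
/interface bridge print detail
# Per-port STP values (path-cost, edge, priority)
/interface bridge port print detail
# Live view of the root bridge and port roles
/interface bridge monitor bridge once
```

The verbose export is the useful part here, since leftover non-default values from earlier experiments show up next to the defaults on the other switch.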
I have a pair of 326’s I could lab up for you to see if I can come up with a working config.
Thanks SirBryan. I really appreciate the offer but I think the situation just changed. I panicked a bit when it went so badly wrong and my colleague has just got them RMA’d. I think the plan now is to go 2nd hand Cisco stacked pair - fewer bells and whistles (no wireguard etc) but a known-good for this setup.
In this lab, I have two CRS326’s that have been reset to defaults and upgraded to 7.14.3. There is no configuration, which means no bridge and no IP addresses. This way all ports are disconnected from each other, which rules out bridge loops or other oddities during configuration while still letting you have everything plugged in the way you want. This requires configuring them with Winbox or via the console.
Ports 1 and 23 are connected to a CRS354. The 326’s receive power and Winbox management through Port 1, and Port 23 on each 326 is plugged into one of a pair of copper ports (43 & 44) on the 354 that were preconfigured as an LACP bond.
I created a single-port LACP bond using SFP+1 so that I could add to or change the ports in the bond in the future without breaking the MLAG. That bond will be the MLAG peer link (ICCP).
I created another LACP bond interface, one member on each switch, using Port 23, which goes to the 354, and its LACP settings match those of the 354. I arbitrarily chose 25 for the MLAG ID (which only goes on the 326’s, not the 354).
VLAN 4094 is my ICCP VLAN, so the peer interfaces are set to PVID 4094. VLAN 1 is tagged on the peer interfaces and untagged to the other interfaces (the bridge and the bonded uplink to the 354).
I added a DHCP client to the bridge to ensure that each switch is able to communicate all the way back to the router through the 354.
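Put together, the steps described above come out roughly as follows on each 326. This is a reconstruction from the description, not an export from the switches: the bridge and bond names are made up, sfp-sfpplus1 and ether23 are assumed to be the default port names, and the uplink bond's LACP options are left at defaults where they would need to match the 354.

```
/interface bridge
# Filtering is switched on at the end, once the VLAN table is in place
add name=bridge protocol-mode=rstp
/interface bonding
# Single-port bond used as the MLAG peer link (ICCP)
add mode=802.3ad name=bond-peer slaves=sfp-sfpplus1
# Uplink to the 354; same mlag-id on both 326's
add mode=802.3ad mlag-id=25 name=bond-uplink slaves=ether23
/interface bridge port
add bridge=bridge interface=bond-peer pvid=4094
add bridge=bridge interface=bond-uplink
/interface bridge mlag
set bridge=bridge peer-port=bond-peer
/interface bridge vlan
# VLAN 1 tagged on the peer link, untagged elsewhere
add bridge=bridge tagged=bond-peer untagged=bridge,bond-uplink vlan-ids=1
/interface bridge
set [find name=bridge] vlan-filtering=yes
/ip dhcp-client
add interface=bridge disabled=no
```

Enabling vlan-filtering last is a choice made for this sketch so that management access is not cut off halfway through; the earlier excerpt shows it simply set on the bridge.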
I don’t see where you specified an STP mode on the bridge…I’m not sure what that defaults to…But you need to define that for “proper” operation.
Try enabling RSTP or MSTP on the bridge and see how it goes. Note that MSTP only works as of 7.14.0 I think.
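For reference, the STP mode is the bridge's protocol-mode setting. The RouterOS documentation lists rstp as the default, so an unset value should not mean STP is off, but setting it explicitly on both peers removes any doubt. A small sketch, assuming the bridge is named bridge:

```
/interface bridge
# Set the mode explicitly; use the same value on both MLAG peers
set [find name=bridge] protocol-mode=rstp
# Confirm what the bridge is actually configured with
print detail where name=bridge
```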
I would further suggest you start at, say, 7.3, set up an MLAG pair, then step it through some upgrade cycles and see how well it holds up. You’ll find it breaks pretty badly on 7.7. After that point, you might have to fully wipe your config and rebuild from scratch to get it to be “stable” again. That’s not “production ready.”
Note that by “break” I don’t mean the network grinds to a halt immediately…You’ll need to be doing some packet captures to see the trashed data or watch some actual production systems struggle. Some sessions will still flow through, but things will get progressively worse. Again…Just “labbing” up the connections and seeing the “state” as “up” isn’t enough to see how badly this implementation fails.
I don’t know if you’re talking to me or to the OP (who has returned his 326’s), but I’m running MLAG in production on three pairs of switches, two of which are CRS317’s pushing 2-3Gbps continuously for 600+ customers in two different data centers. One pair has been running on 7.14.2 for two months and a second for three months on 7.13.2. The third pair is handling my office, so while that one doesn’t carry a lot of traffic, I certainly notice when things go awry.
I’m not discounting nor disputing the failures and issues others have reported on the forum, but I personally haven’t encountered anything that wasn’t caused by misconfiguration on my part. I’m also not using large switches or routers from other vendors, and I have gone through and limited the size/scope of my STP domains.