Balance-xor bonding on CRS328

I am having a frustrating day with link aggregation. On a CRS328 I have two (static) bonded interfaces configured, one connected to a Linux server and the other to an upstream router. The Linux server is configured to use balance-xor with layer3+4. The bonds on the switch are configured as balance-xor and are members of a bridge with hardware offloading enabled. Assuming I have understood the docs correctly, this means they are using the switch chip's layer2+3+4 hash.

To test the configuration I have created an appropriate load on the Linux host (several parallel iperf3 instances to diverse remote hosts). You can see below (instantaneous snapshot of the relevant interfaces) that, as expected, traffic pushed by the host is divided between the two bonded interfaces (ether9, ether15). This, of course, is based on the decision-making of the Linux host, not ROS.

You can see, however, that this traffic is then squeezed by the switch down a single interface of the bond to the upstream router (ether22, ether24). I have not been able to create simulated traffic which does not exhibit this behaviour, and I am not sure why.

I would very much welcome any help to figure out why this might be. I wonder if the packets being VLAN tagged might have some part to play, but there's nothing in the docs I can find that suggests this would be a problem and it's not a particularly exotic scenario.

                       name: ether9.xxx ether15.xxx ether22.trunk ether24.trunk
      rx-packets-per-second:     43 087      38 056            40        23 045
         rx-bits-per-second:  524.1Mbps   463.3Mbps      74.3kbps      13.7Mbps
   fp-rx-packets-per-second:          0           0             2             0
      fp-rx-bits-per-second:       0bps        0bps       4.3kbps          0bps
      tx-packets-per-second:         48      23 032            63        81 077
         tx-bits-per-second:  150.0kbps    13.6Mbps     144.3kbps     987.4Mbps
   fp-tx-packets-per-second:          0           0             0             0
      fp-tx-bits-per-second:       0bps        0bps          0bps          0bps
  tx-queue-drops-per-second:          0           0             0             0
  • ROS 7.23
  • I have tried disabling hardware offloading and have observed the same behaviour (but this time, of course, pegging the CPU)
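A quick arithmetic check on the snapshot makes the imbalance explicit. The figures below are copied from the table above. Because it is a single instantaneous sample, only approximate agreement between what enters from the server and what leaves towards the router should be expected.

```python
# Per-interface rates copied from the monitor snapshot above.
rx_bps_from_server = {"ether9": 524.1e6, "ether15": 463.3e6}
tx_bps_to_router = {"ether22": 144.3e3, "ether24": 987.4e6}
rx_pps_from_server = {"ether9": 43_087, "ether15": 38_056}
tx_pps_to_router = {"ether22": 63, "ether24": 81_077}

# Aggregate load entering from the server bond and leaving on the router bond.
server_bps = sum(rx_bps_from_server.values())
uplink_bps = sum(tx_bps_to_router.values())
server_pps = sum(rx_pps_from_server.values())
uplink_pps = sum(tx_pps_to_router.values())

# Fraction of the uplink bond's egress carried by each member.
bps_share = {k: v / uplink_bps for k, v in tx_bps_to_router.items()}
pps_share = {k: v / uplink_pps for k, v in tx_pps_to_router.items()}

print(f"server -> switch: {server_bps / 1e6:.1f} Mbps, {server_pps} pps")
print(f"switch -> router: {uplink_bps / 1e6:.1f} Mbps, {uplink_pps} pps")
print(f"ether24 carries {bps_share['ether24']:.4%} of bits, "
      f"{pps_share['ether24']:.2%} of packets")
```

The ~987.4 Mbps the server pushes in over ether9 and ether15 matches what leaves on ether24 almost exactly, which is consistent with the switch forwarding all of these flows onto a single member.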

Thanks

Maybe the switch does not look deeper into a VLAN-tagged frame and falls back to Layer2 as the only option. Try with untagged traffic and you will know, or try another bonding mode such as LACP (802.3ad).
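The VLAN hypothesis can be illustrated with a toy model. A hasher that reads the EtherType at byte offset 12 without skipping an 802.1Q tag sees 0x8100, does not recognise the frame as IPv4, and falls back to a MAC-based hash, which is the same for every flow between the same two MAC addresses. This is only a sketch: the formulas are the documented Linux bonding layer2 and layer3+4 ones, used as stand-ins because the switch chip's actual hash is not described here, and all MAC addresses, IP addresses and source ports are hypothetical.

```python
import ipaddress
import struct

ETH_P_IP = 0x0800
ETH_P_8021Q = 0x8100


def make_frame(sport: int, dst_ip: str, vlan_id: int = 100) -> bytes:
    """Build a VLAN-tagged Ethernet/IPv4/TCP header prefix (hypothetical addresses)."""
    dst_mac = bytes.fromhex("020000000001")  # hypothetical router MAC
    src_mac = bytes.fromhex("02000000000a")  # hypothetical server MAC
    tag = struct.pack("!HH", ETH_P_8021Q, vlan_id)  # TPID + TCI (PCP/DEI = 0)
    ip_hdr = struct.pack(
        "!BBHHHBBH4s4s",
        0x45, 0, 40, 0, 0, 64, 6, 0,  # IPv4, IHL=5, TCP; checksum left as 0
        ipaddress.IPv4Address("192.0.2.10").packed,
        ipaddress.IPv4Address(dst_ip).packed,
    )
    ports = struct.pack("!HH", sport, 5201)  # 5201 is iperf3's default port
    return dst_mac + src_mac + tag + struct.pack("!H", ETH_P_IP) + ip_hdr + ports


def pick_member(frame: bytes, parse_vlan: bool, n_links: int = 2) -> int:
    ethertype = struct.unpack_from("!H", frame, 12)[0]
    l3_offset = 14
    if parse_vlan and ethertype == ETH_P_8021Q:
        ethertype = struct.unpack_from("!H", frame, 16)[0]  # EtherType after the tag
        l3_offset = 18
    if ethertype == ETH_P_IP:
        # Stand-in: documented layer3+4 formula (assumes IHL=5 and TCP/UDP).
        src_ip, dst_ip = struct.unpack_from("!II", frame, l3_offset + 12)
        sport, dport = struct.unpack_from("!HH", frame, l3_offset + 20)
        h = ((sport << 16) | dport) ^ src_ip ^ dst_ip
        h ^= h >> 16
        h ^= h >> 8
        h >>= 1
    else:
        # Not recognised as IP: layer2 formula, i.e.
        # dst MAC[5] XOR src MAC[5] XOR the (outer) EtherType.
        h = frame[5] ^ frame[11] ^ ethertype
    return h % n_links


flows = [(40000 + i, f"198.51.100.{i + 1}") for i in range(4)]
frames = [make_frame(sport, dst) for sport, dst in flows]
vlan_aware = [pick_member(f, parse_vlan=True) for f in frames]
vlan_blind = [pick_member(f, parse_vlan=False) for f in frames]
print("VLAN-aware:", vlan_aware)
print("VLAN-blind:", vlan_blind)
```

With `parse_vlan=True` the four flows spread over both members; with `parse_vlan=False` they all land on the same one, which is the pattern seen in the snapshot. Whether the CRS328's switch chip really behaves like the VLAN-blind variant is exactly what the untagged test would show.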

PS: I would not hash over Layer4, because not all frames contain a port (ICMP, for example). A more predictable setup is L2+L3.

Thank you for your thoughts. Yes, I agree it does seem as though it's falling back to a layer two hash. With hardware offloading enabled (which it must be given the CRS328's modest CPU) there is no choice of hashing algorithm, so I am somewhat restricted from moving away from layer four (or whatever it's actually doing) there. The VLAN-less arrangement sounds like a good test to perform and I will do so when I next have the necessary free time.

A further data point: I've just tried disabling hardware offloading (and thus restoring my ability to select a hashing mechanism) and switching the trunk to layer-2-and-3. I was then able to observe the expected balancing, but alas throughput was limited by the CPU to an aggregate ~600 Mbps. Given that my simulated traffic is passing through a router, the MAC addresses are the same for every flow and so contribute little to the hash; this gives further weight to the idea that 3+4 isn't doing what one would expect it to.
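The MAC-address point can be made concrete. For routed traffic every frame from the server towards the router carries the same source and destination MAC, so a pure layer2 hash picks the same member for every flow, whereas a layer2+3 hash also mixes in the IP addresses and therefore varies with the remote host. The sketch below uses the Linux bonding documentation's layer2 and layer2+3 formulas as a stand-in (RouterOS's software hashing may differ in detail); the MAC octets and IP addresses are hypothetical.

```python
import ipaddress

ETH_P_IP = 0x0800
SERVER_MAC_LAST = 0x0A  # hypothetical last octet of the server's MAC
ROUTER_MAC_LAST = 0x01  # hypothetical last octet of the router's MAC
SERVER_IP = int(ipaddress.IPv4Address("192.0.2.10"))  # hypothetical


def layer2_member(n_links: int = 2) -> int:
    # hash = source MAC[5] XOR destination MAC[5] XOR packet type ID.
    # Nothing here depends on the remote host, only on the two MACs.
    return (SERVER_MAC_LAST ^ ROUTER_MAC_LAST ^ ETH_P_IP) % n_links


def layer23_member(remote_ip: str, n_links: int = 2) -> int:
    h = SERVER_MAC_LAST ^ ROUTER_MAC_LAST ^ ETH_P_IP
    h ^= SERVER_IP ^ int(ipaddress.IPv4Address(remote_ip))  # mix in both IPs
    h ^= h >> 16  # fold the 32-bit value down
    h ^= h >> 8
    return h % n_links


remote_hosts = [f"198.51.100.{d}" for d in range(1, 5)]
l2 = [layer2_member() for _ in remote_hosts]
l23 = [layer23_member(ip) for ip in remote_hosts]
print("layer2:  ", l2)
print("layer2+3:", l23)
```

Under these assumptions the layer2 result never changes, while layer2+3 alternates between members as the remote address changes, in line with the balancing observed after switching the trunk to layer-2-and-3.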

Summary:

Mode         Hash                   HW offload  Result
balance-xor  layer-2-and-3          no          Both ifaces used
balance-xor  layer-3-and-4          no          Only one iface used
balance-xor  N/A (hardware 2+3+4)   yes         Only one iface used

Did you try hard-booting the device after setting up bonding? The bonding manual does mention that changes to bonding settings made while the bond is active don't take effect, which leads me to think that a hard boot might be in order...

Regarding L4 hashing: usually it's L3+L4, or even L2+L3+L4 (as in the case of ROS and Marvell Prestera), so non-L4 protocols are covered as well. And even if they weren't, they normally make up such a small portion of total traffic that they won't affect overall bond performance. So IMO the suggestion by @Guscht is not a good one.

Thank you for your thoughts. I did, yes. This has not made any difference to the observed behaviour.

Don't forget fragmented frames, which don't have Layer4 information either:

For fragmented TCP or UDP packets and all other IPv4 and IPv6 protocol traffic, the source and destination port information is omitted. For non-IP traffic, the formula is the same as for the layer2 transmit hash policy.

I would still say: don't hash over Layer4! We had so much trouble with interoperability with Layer4 bonds.

If L4 is missing, an L3+L4 hash will still hash on the L3 information. I'd be wary of the L3+L4 hash policy for non-L3 traffic (because if a device takes the L3+L4 policy literally, it won't consider L2 for non-L3 frames) ... but I'm still convinced that in normal networks non-L3 traffic is almost negligible. So IMO @Guscht is overthinking hash policies.
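Both points can be seen by applying the layer3+4 formula from the Linux bonding documentation quoted above, with the port information omitted for ICMP and fragments as that text describes. This is a sketch of the documented formula only (byte-order details and any refinements in particular kernel versions are ignored), and the addresses and ports are hypothetical.

```python
import ipaddress
from typing import Optional


def layer34_member(src_ip: str, dst_ip: str, sport: Optional[int] = None,
                   dport: Optional[int] = None, n_links: int = 2) -> int:
    """Documented layer3+4 formula; ports are None when they are not used."""
    h = 0
    if sport is not None and dport is not None:
        h = (sport << 16) | dport  # unfragmented TCP/UDP only
    # For ICMP, fragments and other IP traffic only the addresses remain.
    h ^= int(ipaddress.IPv4Address(src_ip)) ^ int(ipaddress.IPv4Address(dst_ip))
    h ^= h >> 16
    h ^= h >> 8
    h >>= 1
    return h % n_links


SERVER = "192.0.2.10"  # hypothetical
tcp_to_one_host = [layer34_member(SERVER, "198.51.100.1", 40000 + i, 5201)
                   for i in range(4)]
no_ports_to_one_host = [layer34_member(SERVER, "198.51.100.1") for _ in range(4)]
no_ports_to_two_hosts = [layer34_member(SERVER, f"198.51.100.{d}") for d in (1, 2)]
print("TCP, one host:     ", tcp_to_one_host)
print("no ports, one host:", no_ports_to_one_host)
print("no ports, 2 hosts: ", no_ports_to_two_hosts)
```

TCP flows to one host spread across members because their ports differ. Traffic without ports between the same two hosts always lands on one member, yet traffic to a different host can still land on the other, since the IP addresses remain part of the hash.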

Devices on both sides of a bond must use the same bond type, but when a hash policy is used, the two ends can have different hash policies. After all, hashing only determines which of the (active) bond members will be used to transmit a particular Ethernet frame. It doesn't matter in the receive direction at all: the bond has to treat every received frame the same, regardless of the bond member used to transport it. If the two ends of the bond use (wildly) different hash policies, then the load on individual bond members will differ between the two directions ... but that will likely happen even if the hash policies are identical, due to the asymmetrical nature of individual connections.