MTU is an L3 setting … which means at least these two things:
switches (as L2 entities) don’t have much to do with it, they just have to be able to pass those jumbo frames (L2MTU has to be at least MTU + Ethernet overhead + VLAN overhead if used)
the whole IP subnet has to use the same MTU … all devices and the router. It’s OK if the router doesn’t do fragmentation, but ICMP packets have to be able to get back to the sender of jumbo frames. And the router has to be able to generate those; I’m not sure if L3HW offloaded routes can do it (if my fears are real, then CRS is not a feasible router).
So using anything but industry standard MTU of 1500 bytes can be a real PITA.
If using the standard MTU overloads “your poor CPUs on the Proxmox Server”, then you won’t notice the difference anyway.
Please clarify what issue you are encountering. I can see you have set the (L3) MTU to 3000 on all your Ethernet interfaces as well as on the bond itself. Do you get an error when you try to set 9000, or do the packets not actually pass through when you set 9000? The L3 MTU must fit into the L2 MTU minus the basic Ethernet header minus the VLAN tags; the max-L2-mtu of the CRS310-8G+2S+ is 10218 bytes, so an L3 MTU of 9000 bytes fits with a generous margin.
Plus, if you dedicate a VLAN to the “backend” Ceph traffic (the data synchronisation between physical storages), only the Ceph machines themselves need to send IP traffic via that VLAN, so the L3 MTU on the switch is irrelevant for that VLAN; it is only necessary to set the L2 MTU on the ports to 9018 if I count properly (L3 MTU 9000 + 14 byte basic Ethernet header + 4 byte single VLAN tag). And to avoid the PITA that @mkx has mentioned, you can set the (L3) MTU to the conservative 1500 for all the other VLANs.
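The count above can be double-checked with a few lines of Python. This is just a sketch of the arithmetic; the function names are made up for illustration and have nothing to do with RouterOS itself. It assumes a 14-byte basic Ethernet header and 4 bytes per VLAN tag, as counted above, and uses the 10218-byte maximum L2 MTU quoted for the CRS310-8G+2S+.

```python
ETH_HEADER = 14  # basic Ethernet header: dst MAC + src MAC + EtherType
VLAN_TAG = 4     # one 802.1Q VLAN tag


def required_l2mtu(l3_mtu: int, vlan_tags: int = 1) -> int:
    """L2 MTU needed to carry an IP packet of l3_mtu bytes (hypothetical helper)."""
    return l3_mtu + ETH_HEADER + VLAN_TAG * vlan_tags


def fits(l3_mtu: int, max_l2mtu: int, vlan_tags: int = 1) -> bool:
    """True if the resulting frame does not exceed the port's maximum L2 MTU."""
    return required_l2mtu(l3_mtu, vlan_tags) <= max_l2mtu


print(required_l2mtu(9000))   # 9018 (single VLAN tag)
print(fits(9000, 10218))      # True
```

With two stacked tags (QinQ) the same helper gives 9022, which would still be well below 10218.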
Other than that, Ceph also recommends that you use a dedicated physical interface for the “backend” traffic where data is synchronized between storages, so that it doesn’t have to compete for bandwidth with the general networking traffic of the VMs. If the storage is not colocated with the VMs, it is even better if the “frontend” Ceph traffic (the VMs’ access to the virtual disks) is also physically separated from the general network traffic of the VMs. So in my case, each physical host has 4 Ethernet ports: two are used to connect it to the switches, and the remaining two are used to set up a ring of 3 members where IP routing is used for failover rather than any L2 mechanism. The goal is that the direct path between any pair of hosts is used whenever it is available; neither OVS nor Linux bridge supports MSTP or any flavor of mesh, and to use bonding, MLAG would have to work flawlessly, plus some other limitations would kick in.
Bonding does not add any headers to the frames, it is just a dispatcher that chooses which physical path to use for a particular frame, based on the contents of the existing headers. So the L2MTU of the bond is the same as that of its physical member interfaces (all of which must be identical). I.e. L2MTU 9018 should indeed be enough for VLAN frames carrying 9000 byte IP packets to pass through.
What keeps me wondering is that the actual MTU doesn’t seem to have any influence in terms of stopping everything.
Let me explain: As I was playing around, I set the Actual MTU to 9000 and my servers were configured similarly. After everything was working, I couldn’t resist trying to break everything. So the MTU got configured to 3000 on the servers and the Actual MTU was set to 2000, but it worked flawlessly (L2MTU was of course 9018). Why didn’t it break my ping command “ping -M do -s 2972 10.0.10.1”?
First, I cannot see how you set actual-mtu, as to me it is a read-only value.
Second, the MTU is basically an informative parameter: the interface informs the IP stack that it has to create packets no larger than that (and the stack may in turn translate it into information for the remote peer at a higher protocol level, e.g. into TCP MSS). So if the actual MTU of an interface shows 2000, a ping -M do -s 2972 sent out of that interface should indeed respond ping: local error: Message too long, mtu=2000.
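To make the numbers concrete: the -s option of Linux ping sets the ICMP payload size, and an 8-byte ICMP header plus a 20-byte IPv4 header (assuming no IP options) are added on top of it. A rough sketch of the size check that applies when -M do forbids fragmentation; the function names are invented for illustration:

```python
IPV4_HEADER = 20  # IPv4 header without options (assumption)
ICMP_HEADER = 8   # ICMP echo request header


def ip_packet_size(ping_payload: int) -> int:
    """Total IPv4 packet size produced by 'ping -s <ping_payload>'."""
    return ping_payload + ICMP_HEADER + IPV4_HEADER


def can_send_unfragmented(ping_payload: int, mtu: int) -> bool:
    """With DF set, the packet can only leave if it is no larger than the MTU."""
    return ip_packet_size(ping_payload) <= mtu


print(ip_packet_size(2972))                # 3000
print(can_send_unfragmented(2972, 3000))   # True
print(can_send_unfragmented(2972, 2000))   # False
```

So -s 2972 produces a packet of exactly 3000 bytes: it just fits an interface with MTU 3000, while an interface whose MTU really is 2000 should refuse to send it.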
To which interface is the IP address attached, to the bonding? And on which interface have you set the 2000 and how exactly?