Hello everyone,
this is my first post, I hope I get everything right.
This is my current network setup:
- Ceph hosts with 40Gbit LACP bond each
- Clients with 20Gbit LACP bond each
- CHR with 200Gbit LACP bond ( connected to 2 stacked switches where all other hosts are connected )
CHR is virtualized on a Proxmox server where nothing else is running, has the following resources:
- 40cores ( type: host )
- RAM 32GB
- 1GB NVMe disk
- 2x100Gbit NICs passed through as PCI devices ( I noticed better performance compared to a bridge ) and LACP-bonded on the CHR side
What I'm experiencing is high latency between clients and Ceph OSDs routed through CHR ( ranging from 10ms to 50ms ), even though ping times and transfer tests look good:
- pings (from client network to Ceph network) take about 0.25ms
- running iperf3, I'm able to saturate the NICs ( 40Gbit LACP bond ) and CHR CPU usage stays within 15%
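For reference, this is roughly how I run the throughput test. The addresses are placeholders, and the parallel streams ( -P 8 ) matter because a single TCP flow hashes onto only one LACP member, so one stream alone can't saturate the bond:

```shell
# On a Ceph-side host (hypothetical address 10.0.2.10):
iperf3 -s

# On a client, 8 parallel streams for 30s so the LACP hash
# spreads flows across multiple bond members:
iperf3 -c 10.0.2.10 -P 8 -t 30
```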
Unfortunately, during a Flent test I observed significant UDP packet loss, which with TCP would of course cause even more issues (image attached).
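This is the kind of Flent run I mean ( the hostname is a placeholder; rrul needs netperf/netserver running on the target ). The rrul test loads all four traffic classes at once and plots latency under load, which is where the loss shows up:

```shell
# Latency-under-load test through the CHR toward a Ceph-side host
# running netserver; writes a summary plot to rrul.png.
flent rrul -H 10.0.2.10 -l 60 -t "client-to-ceph via CHR" -o rrul.png
```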
Something I noticed is that when traffic increases between clients and Ceph hosts, the latency decreases. This made me think the latency could be caused by CPU frequency scaling, and that things improved once higher frequencies were reached. So I set a narrow range of higher frequencies, from 3000MHz to 3500MHz, but this changed nothing.
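For completeness, this is roughly how I pinned the frequency range on the Proxmox host ( assuming the cpufreq driver honors min/max limits ):

```shell
# Pin the scaling range to 3.0-3.5GHz on all cores:
cpupower frequency-set --min 3000MHz --max 3500MHz

# Verify governor and current limits:
cpupower frequency-info
```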
I also noticed that the Proxmox host spends a lot of time in the C6 C-state, which could affect CHR routing latency ( maybe ), but I have no counter-proof since I haven't tried forcing the CPUs to stay at C1 at most. I'd need a scheduled maintenance window for that, as CHR is my primary router.
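When I do get a maintenance window, these are the two approaches I'd try ( both are sketches, not yet tested on this host ). The second one uses the kernel's PM QoS interface and doesn't need a reboot: as long as a process holds /dev/cpu_dma_latency open with a latency of 0, all cores are kept out of deep C-states:

```shell
# 1) Persistent, needs reboot: add to GRUB_CMDLINE_LINUX_DEFAULT, then
#    update-grub and reboot:
#      intel_idle.max_cstate=1 processor.max_cstate=1

# 2) Runtime, no reboot: hold /dev/cpu_dma_latency open with a 4-byte
#    zero; deep C-states are blocked until this process exits.
python3 -c "import time; f=open('/dev/cpu_dma_latency','wb'); f.write(b'\0'*4); f.flush(); time.sleep(3600)" &

# Check which idle states exist / are being used:
grep . /sys/devices/system/cpu/cpu0/cpuidle/state*/name
```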
I use CheckMK to monitor CHR via SNMP and I don't see anything strange there: the NICs are all up and no errors are reported.
Do you have any experience with a setup like this one: Proxmox + CHR + Ceph ?
I'd like to understand whether I'm doing something wrong in CHR or the problem is somewhere else.