Hi.
Strange problem this morning.
Remote working users complained everything is slow.
Early this morning I had a new backup running. It is Veeam on the LAN which takes data from another server, and stores to a Linux SMB server. The backup was very long-running and slow, but I left it going anyway. This shouldn’t affect the remote users.
When I looked at the router, the LAN was showing a constant 100 Mbps. I thought maybe it was just the managers streaming CCTV to their computers, but when I looked closer, the traffic was Veeam box → Linux SMB. Very strange. Also, all the interfaces and NICs are 1 Gbps anyway, so doubly strange.
I think it may have something to do with the bond configuration on Linux which maybe isn’t right. I am going to drop it anyway because between NFS and Samba, it doesn’t get utilised very well.
I have an IP address on the bond, but I also have IP addresses on the slave interfaces. Maybe this is the source of the problem? I have IPs on the slaves because multichannel SMB doesn’t use a bond but is supposed to use multiple interfaces. NFS, however, should benefit from the bond when multiple hosts are writing, so I sort of wanted both.
I can’t see 192.168.1.50 in the ARP table on the Mikrotik at all by the way, which is odd, isn’t it?
Anyway, here are some screenshots. Any ideas what’s going on?
The bond was down because the interface names had changed after the quad-port NIC was moved to a different PCIe slot some time ago, meaning the network-scripts had incorrect interface names in them (enp4s0f0 instead of enp2s0f0).
I still don’t understand why this pushes all the packets through the router at 100 Mbps, and brought the network to a halt.
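For anyone hitting the same thing, here is a rough Python sketch of how a stale name in the network-scripts could be spotted. It assumes RHEL-style ifcfg files under /etc/sysconfig/network-scripts with a DEVICE= line, and it compares them against the names in /sys/class/net. The function names are made up for illustration. Virtual devices such as the bond itself may be reported too if they have not been created yet.

```python
import glob
import os
import re


def ifcfg_device(text):
    """Return the DEVICE= value from the text of an ifcfg file, or None."""
    match = re.search(r'^\s*DEVICE\s*=\s*"?([^"\s#]+)"?', text, re.MULTILINE)
    return match.group(1) if match else None


def stale_configs(configs, present):
    """Map config file name -> device name for devices that do not exist.

    configs: dict of file name -> file text
    present: iterable of interface names that exist right now
    """
    present = set(present)
    stale = {}
    for name, text in sorted(configs.items()):
        device = ifcfg_device(text)
        if device is not None and device not in present:
            stale[name] = device
    return stale


if __name__ == "__main__":
    # Assumed location of the config files; adjust for your distribution.
    configs = {}
    for path in glob.glob("/etc/sysconfig/network-scripts/ifcfg-*"):
        with open(path) as handle:
            configs[os.path.basename(path)] = handle.read()
    present = os.listdir("/sys/class/net") if os.path.isdir("/sys/class/net") else []
    for name, device in stale_configs(configs, present).items():
        print(f"{name}: DEVICE={device} does not exist on this host")
```

Running it after any hardware move should list every config file that still points at an old name.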
There is a Panasonic smart TV in the boardroom, which, when connected to WiFi (the Boardroom CAP has been turned off for the last year during lockdown, until last week), seems to be doing some sort of broadcasting. I guess this has something to do with device discovery on Windows - where media devices just show up automatically. Or it could be something more sinister, I’m not sure. This device shouldn’t be on the main WiFi anyway, it should be on isolated WiFi.
Anyway, once this device is allowed to connect to the WiFi again, I start to see local network data bouncing off the CAPs access point, which makes no sense.
Here you can see 192.168.1.35 (RRAS dial-in IP on Windows server) connecting to a local RDP box - somebody’s PC. Why is this going across the CAPs interface, and only when this Panasonic TV is allowed to connect to the WiFi? Is it just because the AP isn’t running until the TV connects (there are no other devices in range of it)?
Even after turning off that CAPs interface, 1 hr 22 min into running the backup again, the network grinds to a halt and I see 100 Mbps on eth1-LAN for 192.168.1.34 → 192.168.1.50. Arghh. Any ideas, folks?
I will get some more info, but I think it is resolved now.
Some things I noticed and changed
There are two switches - [1] a ProCurve V1910-48G and [2] an HP ProCurve 2810-48G. They are linked with 4x 1GbE ‘dynamic trunk’, whatever that means in HP speak.
The source server was in switch [1] with 4x 1GbE, ‘Teamed’ in Windows Server 2016.
The Veeam machine, which reads from source and writes to target, was in switch [1] with a single NIC, also via a cheap 8-port unmanaged switch, since it is sat in a temporary location for setup and tweaking.
The target Linux SMB box was connected in switch [2] with 4x 1GbE, in a LACP bond, but also had its onboard NIC connected (maybe to switch [1], I can’t remember.. but it is disconnected now), and after some time, it seems Veeam was sending to it via the onboard NIC and its dynamic IP. Note that Veeam is set to write to it via its hostname ‘backup’, rather than direct to an IP. I should probably set it to the IP of the bond.
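A quick way to check what the hostname actually resolves to from the Veeam machine is something like the Python sketch below. It uses the standard socket.getaddrinfo call, so it only sees what the system resolver returns. Treating 192.168.1.50 as the bond address is an assumption on my part for the example, and the function names are illustrative.

```python
import socket


def resolved_ipv4(hostname, resolver=socket.getaddrinfo):
    """Return the sorted, de-duplicated IPv4 addresses a hostname resolves to."""
    infos = resolver(hostname, None, socket.AF_INET, socket.SOCK_STREAM)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr is (ip, port).
    return sorted({info[4][0] for info in infos})


def unexpected_addresses(hostname, expected_ip, resolver=socket.getaddrinfo):
    """Return every resolved address that is not the one we expect to be used."""
    return [ip for ip in resolved_ipv4(hostname, resolver) if ip != expected_ip]


if __name__ == "__main__":
    # 'backup' is the hostname from the post; the expected IP is an assumption.
    try:
        extra = unexpected_addresses("backup", "192.168.1.50")
        print("unexpected addresses:", extra)
    except socket.gaierror as error:
        print("could not resolve 'backup':", error)
```

If the list is not empty, the name also points at another interface (for example a DHCP lease on the onboard NIC), and pinning the job to the IP would avoid that.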
I moved all three machines into the same 2810 switch, and disconnected that separate onboard NIC from the Linux SMB box. After that I didn’t notice any of the traffic bouncing off the router any more.
It seems there may be something strange with the switch->switch trunking and the Mikrotik router.
But even if that’s the case, unless Proxy ARP is enabled on the eth-LAN on the Mikrotik, I don’t understand why the packets would be bouncing off the router. It seems the 100 Mbps limit was possibly due to the CAPs interface, which is on 100 Mbps ethernet, being a member of the LAN bridge.
HPE ‘dynamic trunk’ is dynamic LACP / 802.3ad. There are limitations as detailed in the documentation, ‘static LACP’ or ‘trunk’ are the other options.
NIC Teaming has several MAC address use methods and caveats as detailed in the Microsoft documentation. Inconsistent use of MAC addresses can lead to traffic in one direction being flooded across parts of the network as FDB entries age out.
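To illustrate the flooding mechanism, here is a toy Python model of a MAC-learning switch with an ageing timer. It is a sketch of the general behaviour only, not of any particular HPE or Mikrotik implementation. The MAC labels "A", "T" and "T2", the port numbers and the 300 second age limit are all made up for the example.

```python
class LearningSwitch:
    """Toy MAC-learning switch with forwarding-table ageing."""

    def __init__(self, ports, age_limit):
        self.ports = list(ports)
        self.age_limit = age_limit
        self.fdb = {}  # MAC -> (port, time it was last seen as a source)

    def receive(self, now, in_port, src, dst):
        """Learn src on in_port and return the ports the frame is sent out of."""
        self.fdb[src] = (in_port, now)
        entry = self.fdb.get(dst)
        if entry is not None and now - entry[1] <= self.age_limit:
            out_port = entry[0]
            return [] if out_port == in_port else [out_port]
        # Unknown or aged-out destination: flood to every other port.
        return [port for port in self.ports if port != in_port]


sw = LearningSwitch(ports=[1, 2, 3, 4], age_limit=300)
sw.receive(0, 2, "T", "A")           # T is learned on port 2
print(sw.receive(1, 1, "A", "T"))    # entry is fresh: leaves on port 2 only
sw.receive(200, 2, "T2", "A")        # replies use another source MAC, T is not refreshed
print(sw.receive(400, 1, "A", "T"))  # entry for T has aged out: flooded to ports 2, 3, 4
```

Once frames are flooded, every port in the VLAN gets a copy, including the uplink to the router and any slow link bridged into the LAN.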
Thank you, it seems I need to do some reading on the various caveats and limitations. I am aware that a single connection will rarely make use of more than one interface at a time, and some other things (out-of-order packet possibilities and stuff), but I need to read up on the NIC teaming and the switch to switch trunking.
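As a note to self on why one connection tends to stick to one link, here is a simplified Python sketch of a layer 2 style transmit hash next to round-robin. The hash here is just the XOR of the last byte of each MAC address modulo the number of links. Real policies differ in detail and can include other fields, so treat this as an approximation. The MAC addresses are invented.

```python
def layer2_link(src_mac, dst_mac, link_count):
    """Pick a link from the XOR of the last MAC bytes (simplified hash)."""
    src_last = int(src_mac.split(":")[-1], 16)
    dst_last = int(dst_mac.split(":")[-1], 16)
    return (src_last ^ dst_last) % link_count


def round_robin_links(frame_count, link_count):
    """Link index used for each successive frame under round-robin."""
    return [index % link_count for index in range(frame_count)]


# Hypothetical pair of hosts: every frame between them hashes to the same link.
print(layer2_link("00:11:22:33:44:0a", "00:11:22:33:44:1f", 4))
# Round-robin spreads consecutive frames over all links instead.
print(round_robin_links(6, 4))
```

With a hash policy the same pair of hosts always lands on the same member link, so a single backup stream cannot exceed one link's speed. Round-robin spreads the frames out, at the cost of possible out-of-order delivery.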
I suppose it’d be helpful if I were to capture the MAC forwarding tables of the switches during the problem.
I’m not actually sure the 802.3ad between Linux and the HPE 2810 switch is working.
The switch says ‘LACP Partner: No’ for the 4 ports that are connected to Linux.
Yet it is happily creating a dynamic configuration for the 4 ports linking both switches, and also the 4 ports that go to Windows.
LACP Partner: No - LACP is enabled on the switch, but either LACP is not enabled or the link has not been detected on the opposite device.
On the Linux host, cat /proc/net/bonding/bond0 may shed some light on what’s going on. Having distinct IP addresses on the members of the bond is not a usual setup and may have side effects.
Yeah, I did have IPv6 autoconfiguration still enabled on 3 of the slaves, by accident.
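If it helps, the output of that file can be summarised with a short Python sketch like the one below. It only looks at the "Bonding Mode", "Slave Interface", "MII Status" and "Aggregator ID" lines, and the exact layout can vary between kernel versions, so treat the parser as an assumption-laden helper rather than a complete one. The function names are illustrative.

```python
def parse_bonding(text):
    """Parse the text of /proc/net/bonding/<bond> into a small summary dict."""
    summary = {"mode": None, "slaves": []}
    current = None  # the slave section currently being filled, if any
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue
        key, value = key.strip(), value.strip()
        if key == "Bonding Mode":
            summary["mode"] = value
        elif key == "Slave Interface":
            current = {"name": value, "mii_status": None, "aggregator_id": None}
            summary["slaves"].append(current)
        elif current is not None and key == "MII Status":
            current["mii_status"] = value
        elif current is not None and key == "Aggregator ID":
            current["aggregator_id"] = value
    return summary


def lacp_looks_healthy(summary):
    """True if all slaves are up and report one shared aggregator ID."""
    slaves = summary["slaves"]
    ids = {slave["aggregator_id"] for slave in slaves}
    all_up = all(slave["mii_status"] == "up" for slave in slaves)
    return bool(slaves) and all_up and len(ids) == 1 and None not in ids
```

Feeding it the contents of /proc/net/bonding/bond0 and printing the summary shows at a glance whether every slave is up and in the same aggregator. In 802.3ad mode, slaves reporting different aggregator IDs usually means they have not negotiated with the same LACP partner, which would fit the switch saying ‘LACP Partner: No’.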
It has made no difference though.
I have now changed to mode 0 (round-robin) on Linux, and set the port to ‘trunk no-protocol’ i.e. not LACP on the switch.