SSH negotiation not completing in one direction over an EoIP tunnel

I have an EoIP tunnel running over an L2TP/IPsec VPN between a hEX and an RB3011. The hEX is NAT’d behind an ADSL modem operating in PPPoE mode, and the RB3011 has a static public IP address. The hEX connects to the L2TP server on the RB3011, and the EoIP tunnel is then established across the L2TP VPN. On both ends the EoIP tunnel is added to a bridge. The L2TP interface is not part of a bridge on either end.

The RB3011 has an additional SSTP VPN used for external management.

The RB3011 is at 10.10.30.1. The hEX is at 10.10.30.32.

There is a FreeNAS device at 10.10.30.254 which is on the same physical network as the RB3011.
There is a second FreeNAS device at 10.10.30.253 which is on the same physical network as the hEX.

The SSTP VPN is at 192.168.100.2.
The L2TP/IPsec VPN uses 192.168.105.1 on the RB3011 and 192.168.105.2 on the hEX.

An Ubuntu 20.04 box at 192.168.100.1 can log in over SSH to both FreeNAS boxes.
10.10.30.253 can successfully SSH into 10.10.30.254, 10.10.30.32 and 10.10.30.1.
10.10.30.254 can successfully SSH into 10.10.30.1.
10.10.30.254 can connect to port 22 on both 10.10.30.32 and 10.10.30.253, but the connection stalls before it reaches the login prompt, and it stalls at a different point in the handshake on each device.

When connecting to 10.10.30.32, it gets to:
debug1: SSH2_MSG_KEX_DH_GEX_REQUEST(2048<7680<8192) sent
After a minute or two, the connection times out.

When connecting to 10.10.30.253, it gets to:
debug1: expecting SSH2_MSG_KEX_ECDH_REPLY
After a minute or two, it also times out.

I’ve checked the firewall and bridge settings. There are no firewall rules at all on the hEX side, and the RB3011’s firewall is still fairly stripped down, with nothing out of the ordinary.

The biggest differences I can see between the two sides: on the RB3011 side, root bridge is unchecked for the bridge, while on the hEX side it is checked. Also, on the RB3011 the EoIP tunnel port has root port status, while on the hEX it does not.

Does anyone have an idea why traffic succeeds 100% of the time in one direction but stalls during protocol negotiation in the other? And how might I troubleshoot the root cause?
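One way to narrow this down is to find the largest packet that actually makes it across the tunnel. Below is a minimal Python sketch of that search. The helper names largest_passing_size and ping_probe are hypothetical, the probe assumes Linux iputils ping flags, and the binary search assumes that if a given size gets through, every smaller size does too (random packet loss will throw it off).

```python
import subprocess
from typing import Callable


def largest_passing_size(probe: Callable[[int], bool], low: int, high: int) -> int:
    """Binary-search the largest size in [low, high] for which probe(size) is True.

    Assumes the probe is monotonic: if a size gets through, every smaller
    size does too. Returns low - 1 if even the smallest size fails.
    """
    best = low - 1
    while low <= high:
        mid = (low + high) // 2
        if probe(mid):
            best = mid      # mid got through, so try larger sizes
            low = mid + 1
        else:
            high = mid - 1  # mid was lost, so try smaller sizes
    return best


def ping_probe(host: str) -> Callable[[int], bool]:
    """Build a probe that sends one ICMP echo with the given payload size.

    Uses Linux iputils flags (-c count, -W timeout in seconds, -s payload
    size). FreeBSD's ping, as used by FreeNAS, has different timeout flags.
    """
    def probe(size: int) -> bool:
        result = subprocess.run(
            ["ping", "-c", "1", "-W", "2", "-s", str(size), host],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        return result.returncode == 0  # 0 means a reply came back
    return probe
```

For example, largest_passing_size(ping_probe("10.10.30.253"), 0, 1472) run from a host on the RB3011 side would report the largest ICMP payload that got a reply. The upper bound of 1472 is the largest payload that fits in a single 1500-byte IPv4 packet (1472 + 8 bytes ICMP + 20 bytes IP), so the sending host never fragments locally.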

Here’s a detailed SSH log from when 10.10.30.254 is trying to connect to 10.10.30.253:

tunneldevice any:any
controlpersist no
escapechar ~
ipqos lowdelay throughput
rekeylimit 0 0
streamlocalbindmask 0177
root@officenas[~]# ssh -v 10.10.30.253
OpenSSH_7.5p1, OpenSSL 1.0.2s-freebsd 28 May 2019
debug1: Reading configuration data /etc/ssh/ssh_config
debug1: Connecting to 10.10.30.253 [10.10.30.253] port 22.
debug1: Connection established.
debug1: permanently_set_uid: 0/0
debug1: Fssh_key_load_public: No such file or directory
debug1: identity file /root/.ssh/id_rsa type -1
debug1: Fssh_key_load_public: No such file or directory
debug1: identity file /root/.ssh/id_rsa-cert type -1
debug1: Fssh_key_load_public: No such file or directory
debug1: identity file /root/.ssh/id_dsa type -1
debug1: Fssh_key_load_public: No such file or directory
debug1: identity file /root/.ssh/id_dsa-cert type -1
debug1: Fssh_key_load_public: No such file or directory
debug1: identity file /root/.ssh/id_ecdsa type -1
debug1: Fssh_key_load_public: No such file or directory
debug1: identity file /root/.ssh/id_ecdsa-cert type -1
debug1: Fssh_key_load_public: No such file or directory
debug1: identity file /root/.ssh/id_ed25519 type -1
debug1: Fssh_key_load_public: No such file or directory
debug1: identity file /root/.ssh/id_ed25519-cert type -1
debug1: Enabling compatibility mode for protocol 2.0
debug1: Local version string SSH-2.0-OpenSSH_7.5 FreeBSD-20170903
debug1: Remote protocol version 2.0, remote software version OpenSSH_8.0-hpn14v15
debug1: match: OpenSSH_8.0-hpn14v15 pat OpenSSH* compat 0x04000000
debug1: Authenticating to 10.10.30.253:22 as ‘root’
debug1: SSH2_MSG_KEXINIT sent
debug1: SSH2_MSG_KEXINIT received
debug1: kex: algorithm: curve25519-sha256
debug1: kex: host key algorithm: ecdsa-sha2-nistp256
debug1: kex: server->client cipher: chacha20-poly1305@openssh.com MAC: <implicit> compression: none
debug1: kex: client->server cipher: chacha20-poly1305@openssh.com MAC: <implicit> compression: none
debug1: expecting SSH2_MSG_KEX_ECDH_REPLY

So far this sounds like an MTU issue to me. Since this is an L2 connection, PMTU discovery cannot work: there is no router between the client and the server to notify the sender that a packet is too large to fit (and there is no L2 message with a similar meaning), so oversized frames are simply dropped without the sender ever finding out. So you have to set the EoIP tunnel’s MTU to 1500; this will cause fragmentation, but the frames will get through.
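To see why full-size frames cannot fit, here is a back-of-envelope Python sketch of the encapsulation stack. Every per-layer size in it is an assumption (IPv4, L2TP over IPsec ESP in transport mode with NAT-T, AES-CBC with HMAC-SHA1-96, PPPoE on the ADSL side). The real figures depend on the IPsec proposal and L2TP options actually configured.

```python
import math

# All sizes in bytes; every value is an assumption for a typical setup.
WAN_MTU = 1492         # PPPoE on the ADSL link: 1500 - 8
OUTER_IP = 20          # IPv4 header of the L2TP packet (kept in transport mode)
NAT_T_UDP = 8          # UDP encapsulation of ESP, since the hEX is behind NAT
ESP_HEADER = 8         # SPI + sequence number
ESP_IV = 16            # AES-CBC initialisation vector
ESP_TRAILER = 2        # pad length + next header
ESP_ICV = 12           # HMAC-SHA1-96 integrity check value
ESP_BLOCK = 16         # AES block size; plaintext + trailer is padded to it
L2TP_UDP = 8           # UDP header of the L2TP packet
L2TP_HEADER = 8        # L2TPv2 data header with length field (assumed)
PPP_HEADER = 4         # PPP address/control + protocol (assumed uncompressed)
EOIP_IP = 20           # IPv4 header added by EoIP
EOIP_GRE = 8           # GRE header with key field
INNER_ETHERNET = 14    # Ethernet header of the bridged frame


def outer_packet_size(inner_payload: int) -> int:
    """Size of the outer IP packet on the WAN for one bridged frame."""
    # Everything from L2TP's UDP header inward is encrypted by ESP.
    plaintext = (L2TP_UDP + L2TP_HEADER + PPP_HEADER
                 + EOIP_IP + EOIP_GRE + INNER_ETHERNET + inner_payload)
    padded = math.ceil((plaintext + ESP_TRAILER) / ESP_BLOCK) * ESP_BLOCK
    return OUTER_IP + NAT_T_UDP + ESP_HEADER + ESP_IV + padded + ESP_ICV


def max_unfragmented_payload(wan_mtu: int = WAN_MTU) -> int:
    """Largest inner L3 packet that crosses the tunnel without fragmentation."""
    size = 0
    while outer_packet_size(size + 1) <= wan_mtu:
        size += 1
    return size


print(max_unfragmented_payload())  # prints 1360 under these assumptions
```

Under these assumptions the largest inner packet that crosses in one piece is 1360 bytes, well short of the 1500 bytes a host on either LAN will send, so full-size frames only get through if they are fragmented somewhere along the way.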

Regardless of the above: since you use L2TP, do you need vlan-filtering on the bridges interconnected by the L2 tunnel? If not, you can save some overhead by using L2TP’s own L2 tunneling capability (BCP) rather than transporting EoIP inside an L3 tunnel.
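A rough per-frame comparison of the two options, again as a Python sketch with assumed header sizes: GRE with a key field for EoIP, and a 2-byte BCP header (flags + MAC type) as described in RFC 3518. The L2TP and IPsec layers underneath are the same in both cases and are left out.

```python
# Bytes carried inside the L2TP/PPP payload for one bridged Ethernet frame,
# on top of the inner L3 packet. All sizes are assumptions.
EOIP_OVER_L2TP = {
    "EoIP outer IPv4 header": 20,
    "GRE header with key": 8,
    "inner Ethernet header": 14,
}
BCP_OVER_L2TP = {
    "BCP flags + MAC type": 2,
    "inner Ethernet header": 14,
}


def overhead(layers: dict[str, int]) -> int:
    """Total per-frame bytes added on top of the inner L3 packet."""
    return sum(layers.values())


saved = overhead(EOIP_OVER_L2TP) - overhead(BCP_OVER_L2TP)
print(f"EoIP: {overhead(EOIP_OVER_L2TP)} B, "
      f"BCP: {overhead(BCP_OVER_L2TP)} B, saved per frame: {saved} B")
```

With these figures BCP saves 26 bytes on every bridged frame (42 vs 16 bytes).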