[Solved] TLS SSL_ERROR_SYSCALL with bonding

After switching to bonded DSL lines, I have a problem with TLS handshakes to some sites. Some work (e.g. Google), some do not.
Setup: 4 DSL modems + LTE modem (backup) → ISP’s RB3011 doing bonding → our RB3011 → NATed LAN
When we used a single DSL line from another ISP, everything worked with the same configuration.
Now the TLS handshake fails with some sites:

curl -ILv https://p3.zdusercontent.com
* Rebuilt URL to: https://p3.zdusercontent.com/
*   Trying 185.12.82.12...
* TCP_NODELAY set
* Connected to p3.zdusercontent.com (185.12.82.12) port 443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* successfully set certificate verify locations:
*   CAfile: /etc/ssl/certs/ca-certificates.crt
  CApath: /etc/ssl/certs
* TLSv1.2 (OUT), TLS handshake, Client hello (1):
* OpenSSL SSL_connect: SSL_ERROR_SYSCALL in connection to p3.zdusercontent.com:443 
* stopped the pause stream!
* Closing connection 0
curl: (35) OpenSSL SSL_connect: SSL_ERROR_SYSCALL in connection to p3.zdusercontent.com:443

With the client connected directly to the ISP’s RB3011, the handshake succeeds:

 ...
* TLSv1.2 (OUT), TLS handshake, Client hello (1):
* TLSv1.2 (IN), TLS handshake, Server hello (2):
* TLSv1.2 (IN), TLS handshake, Certificate (11):
* TLSv1.2 (IN), TLS handshake, Server key exchange (12):
* TLSv1.2 (IN), TLS handshake, Server finished (14):
* TLSv1.2 (OUT), TLS handshake, Client key exchange (16):
* TLSv1.2 (OUT), TLS change cipher, Client hello (1):
* TLSv1.2 (OUT), TLS handshake, Finished (20):
* TLSv1.2 (IN), TLS handshake, Finished (20):
* SSL connection using TLSv1.2 / ECDHE-RSA-AES256-GCM-SHA384
* ALPN, server accepted to use http/1.1
...

I added rules to accept and log all traffic to and from that IP:

chain=forward action=accept dst-address=185.12.82.12 log=yes log-prefix="DEBUG <" 
chain=forward action=accept src-address=185.12.82.12 log=yes log-prefix="DEBUG >"

The log shows:

jul/22 01:01:58 firewall,info DEBUG < forward: in:bridge1 out:wan1, src-mac xxx, proto TCP (SYN), lan_client_ip:42540->185.12.82.12:443, len 60 
jul/22 01:01:58 firewall,info DEBUG > forward: in:wan1 out:bridge1, src-mac yyy, proto TCP (SYN,ACK), 185.12.82.12:443->lan_client_ip:42540, NAT 185.12.82.12:443->(our_rb_ext_ip:42540->lan_client_ip:42540), len 60 
jul/22 01:01:58 firewall,info DEBUG < forward: in:bridge1 out:wan1, src-mac xxx, proto TCP (ACK), lan_client_ip:42540->185.12.82.12:443, NAT (lan_client_ip:42540->our_rb_ext_ip:42540)->185.12.82.12:443, len 52 
jul/22 01:01:58 firewall,info DEBUG < forward: in:bridge1 out:wan1, src-mac xxx, proto TCP (ACK,PSH), lan_client_ip:42540->185.12.82.12:443, NAT (lan_client_ip:42540->our_rb_ext_ip:42540)->185.12.82.12:443, len 275 
jul/22 01:01:58 firewall,info DEBUG > forward: in:wan1 out:bridge1, src-mac yyy, proto TCP (ACK), 185.12.82.12:443->lan_client_ip:42540, NAT 185.12.82.12:443->(our_rb_ext_ip:42540->lan_client_ip:42540), len 52 
jul/22 01:01:58 firewall,info DEBUG > forward: in:wan1 out:bridge1, src-mac yyy, proto TCP (ACK,PSH), 185.12.82.12:443->lan_client_ip:42540, NAT 185.12.82.12:443->(our_rb_ext_ip:42540->lan_client_ip:42540), len 632 
jul/22 01:01:58 firewall,info DEBUG < forward: in:bridge1 out:wan1, src-mac xxx, proto TCP (ACK), lan_client_ip:42540->185.12.82.12:443, NAT (lan_client_ip:42540->our_rb_ext_ip:42540)->185.12.82.12:443, len 64 
jul/22 01:02:59 firewall,info DEBUG < forward: in:bridge1 out:wan1, src-mac xxx, proto TCP (ACK), lan_client_ip:42540->185.12.82.12:443, NAT (lan_client_ip:42540->our_rb_ext_ip:42540)->185.12.82.12:443, len 64 
jul/22 01:03:10 firewall,info DEBUG > forward: in:wan1 out:bridge1, src-mac yyy, proto TCP (ACK,RST), 185.12.82.12:443->lan_client_ip:42540, NAT 185.12.82.12:443->(our_rb_ext_ip:42540->lan_client_ip:42540), len 40

The problem seems to be caused by the combination of bonding on the ISP’s router and NAT on ours: a single line behind our router works, and the bonded lines with a directly connected client work as well.

NAT is done by:

chain=srcnat action=masquerade out-interface-list=WAN_all

WAN_all contains wan1 and wan2; wan2 is no longer connected.

Any ideas what exactly could be causing the problem and how to fix it?

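The problem was caused by MTU sizes in combination with the DF (don’t fragment) bit set on some TLS streams: a packet with DF set that is larger than the path MTU cannot be fragmented and is dropped.
Fixed by changing the MTU values.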
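
For anyone hitting the same symptom, here is a minimal sketch of how the usable path MTU could be probed from a LAN client. It is not what was actually run here. It assumes a Linux client with iputils ping (the -M do option sets DF) and a target that answers ICMP echo, which not every host does. The function names ping_df, find_path_mtu and tcp_mss_for are made up for this example, and so is the 576..1500 search range.

```python
import subprocess
from typing import Callable

IP_HEADER = 20    # IPv4 header without options
ICMP_HEADER = 8   # ICMP echo header
TCP_HEADER = 20   # TCP header without options


def ping_df(host: str, payload: int) -> bool:
    """Send one ICMP echo with DF set (Linux iputils ping assumed).

    Returns True if a reply came back within 2 seconds.
    """
    result = subprocess.run(
        ["ping", "-M", "do", "-c", "1", "-W", "2", "-s", str(payload), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def find_path_mtu(probe: Callable[[int], bool], low: int = 576, high: int = 1500) -> int:
    """Binary-search the largest IP packet size (bytes) that passes with DF set.

    probe(payload) receives the ICMP payload size and returns True on success.
    Returns 0 if even a packet of size `low` does not pass.
    """
    if not probe(low - IP_HEADER - ICMP_HEADER):
        return 0
    while low < high:
        mid = (low + high + 1) // 2
        if probe(mid - IP_HEADER - ICMP_HEADER):
            low = mid        # packet of size mid passed, try larger
        else:
            high = mid - 1   # packet of size mid was dropped, try smaller
    return low


def tcp_mss_for(mtu: int) -> int:
    """Largest TCP segment payload that fits in one IPv4 packet of size mtu."""
    return mtu - IP_HEADER - TCP_HEADER


# Example call (host is a placeholder, replace with a target that answers ping):
# mtu = find_path_mtu(lambda size: ping_df("example.com", size))
# print(mtu, tcp_mss_for(mtu))
```

The probe is passed in as a function so the search can be tried with a stand-in before sending real pings. If the result is below 1500, that is the value the MTU (or TCP MSS) on the router has to accommodate.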