Long lived TCP sessions dying in 6.35?

I usually leave a lot of SSH sessions logged into remote hosts. Many of them are either idle or watching log files that receive data only sporadically. Recently I’ve been getting lots of random disconnections that I’ve confirmed are not due to network outages on either the local or remote ends. The only thing I can think of doing recently that might cause this is upgrading to 6.35.4.

It seems ROS is losing track of the NAT mappings at a greatly accelerated rate. I have a TCP established timeout of 1d but I’m seeing TCP sessions being dropped in under an hour:

4616.124510 192.168.88.209 x.x.x.x TCP 51324 -> 22334 [PSH, ACK] Seq=1181 Ack=1177 Win=252 Len=36 252 36
4616.139492 x.x.x.x 192.168.88.209 TCP 22334 -> 51324 [ACK] Seq=1177 Ack=1217 Win=163 Len=0 163 
4616.139493 x.x.x.x 192.168.88.209 TCP 22334 -> 51324 [PSH, ACK] Seq=1177 Ack=1217 Win=163 Len=36 163 36
4616.337183 192.168.88.209 x.x.x.x TCP 51324 -> 22334 [ACK] Seq=1217 Ack=1213 Win=252 Len=0 252 
6623.219482 192.168.88.209 x.x.x.x TCP 51324 -> 22334 [PSH, ACK] Seq=1217 Ack=1213 Win=252 Len=1084 252 1084
6623.233160 x.x.x.x 192.168.88.209 TCP 22334 -> 51324 [RST] Seq=1213 Win=0 Len=0 0

As shown here, the connection is working fine at 4616 seconds into the capture. Then there is idle time during which no data is sent or received. Then at 6623 seconds, I attempt to use the session only to get an RST sent back. This is only 33 minutes since the previous transmissions, far below the 1 day timeout established connections are supposed to be tracked for! After re-establishing the connection, I see that the remote end tried to send data and was not successful, likely due to a NAT entry timeout, thus the SSH session was destroyed with no RST coming from the remote side.
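
For anyone who wants to check the arithmetic, here is a minimal Python sketch that works out the idle gap from the two timestamps in the trace and compares it with the configured 1d established timeout. The variable names are illustrative only.

```python
# Timestamps (seconds into the capture) copied from the trace above.
last_packet_before_idle = 4616.337183   # final ACK before the idle period
first_packet_after_idle = 6623.219482   # first packet sent after the idle period

idle_seconds = first_packet_after_idle - last_packet_before_idle
idle_minutes = idle_seconds / 60

# Configured TCP established timeout of 1d, expressed in seconds.
established_timeout = 24 * 60 * 60

print(f"idle gap: {idle_seconds:.1f} s ({idle_minutes:.1f} min)")
print(f"share of the established timeout used: {idle_seconds / established_timeout:.1%}")
```

The gap is a small fraction of the 1d timeout, so the tracking entry should not have expired if the established timeout had been in effect.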

Is anyone else seeing this behavior?

Check the following topic, it looks related: http://forum.mikrotik.com/t/tcp-session-connection-tracking-bug/99283/1

I checked and found that I did indeed have some SSH connections with tracking entries stuck at exactly 5 mins (nf_conntrack_tcp_timeout_unacknowledged) in the tracking table, along with an assortment of other non-SSH TCP connections. The connection is still active and bi-directional communication is working just fine. Nothing I could do (lots of traffic in both directions etc) seemed to make the connection go back to the 24h timeout.
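
To spot such entries on a plain Linux box (not RouterOS), something like the following Python sketch could be used. It assumes lines in the style of `/proc/net/nf_conntrack` or `conntrack -L`, where the protocol name is followed by the protocol number, the remaining timeout in seconds, and the state. The sample lines and the addresses in them are hypothetical, and the function name is made up for illustration.

```python
def suspicious_entries(lines, threshold=300):
    """Return (remaining_seconds, line) for ESTABLISHED TCP entries whose
    remaining timeout is at or below `threshold` seconds."""
    hits = []
    for line in lines:
        fields = line.split()
        if "tcp" not in fields:
            continue
        i = fields.index("tcp")
        # Expected layout after the protocol name:
        # protocol number, remaining seconds, connection state.
        if len(fields) < i + 4:
            continue
        try:
            remaining = int(fields[i + 2])
        except ValueError:
            continue
        state = fields[i + 3]
        if state == "ESTABLISHED" and remaining <= threshold:
            hits.append((remaining, line))
    return hits


# Hypothetical sample data in the assumed format.
sample = [
    "ipv4     2 tcp      6 299 ESTABLISHED src=192.168.88.209 dst=203.0.113.10 "
    "sport=51324 dport=22334 src=203.0.113.10 dst=198.51.100.7 sport=22334 "
    "dport=51324 [ASSURED] mark=0 use=2",
    "ipv4     2 tcp      6 86391 ESTABLISHED src=192.168.88.209 dst=203.0.113.11 "
    "sport=51400 dport=22 src=203.0.113.11 dst=198.51.100.7 sport=22 "
    "dport=51400 [ASSURED] mark=0 use=2",
]

for remaining, line in suspicious_entries(sample):
    print(remaining, line)
```

Note that a connection with data genuinely in flight can show a low timeout for a moment, so only entries that stay low across repeated checks are interesting.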

For now I will apply the same workaround and increase the unacked timeout to 1d also, but this definitely looks like a bug. As it seems trivial for me to reproduce this, I can provide tcpdump or any other relevant information if necessary, just let me know what is useful.

I’d start with generating the supout and sending it to support@. Make sure there are some buggy entries in the connection tracking table at the moment you generate supout.

I agree it appears to be the same as in my posting and I have seen the problem occur with SSH as well.
However, it is still unclear to me how to trigger this problem, as far from all TCP sessions are affected.
My plan is to look in the kernel sources to see exactly what this state means (it is not documented!) but
I have not yet found an opportunity for that.

As far as I could tell from the kernel sources, unacknowledged means that data is in flight that hasn’t yet been acked OR that the connection was a hot-pickup (no existing translation entry, e.g. due to router reboot) and hasn’t been replied to yet. Neither of these scenarios explains why a working bidirectional TCP stream would remain in that state though.

Relevant commits: https://github.com/torvalds/linux/commit/ae375044d31075a31de5a839e07ded7f67b660aa and https://github.com/torvalds/linux/commit/6547a221871f139cc56328a38105d47c14874cbe

That is what I would have guessed, but indeed it does not match what we see.
My guess is that it may fail when a “TCP Keepalive” appears on the session, which would explain
why it fails on long-living SSH sessions. First everything is OK, then a keepalive appears and is
somehow misinterpreted, and from then on the session is treated as unacked. If nothing
is transmitted within those 5 minutes, the tracking entry disappears, and the next time the
session comes alive from the server end it fails after retrying because the tracking entry is gone.

However, this does not match my problem in the other thread where it is immediately in the
wrong state after the connection has been made and some GET requests are handled, without
any keepalive as far as I see in my trace.

One other thing, did you also notice the problem occurring only recently? I’m pretty sure I never saw this behavior under 6.34.

I think I have seen the problem before on other Linux environments (not RouterOS), e.g. Debian Wheezy (3.2.0 kernel).

At the time, I changed all my SSH servers to include:
ClientAliveInterval 120
ClientAliveCountMax 3

That solves all issues with NAT and connection tracking because there is regular traffic.
Of course it also means that the sessions die when the connection is temporarily down e.g. due to ISP maintenance :frowning:
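
A rough Python sketch of the timing behind this trade-off, using the two settings above and the 5 min unacked timeout mentioned in this thread. The "roughly interval times count" figure for when sshd gives up on an unresponsive client follows the sshd_config documentation; the variable names are illustrative only.

```python
# Values from the sshd settings above.
client_alive_interval = 120   # seconds between server-side alive probes
client_alive_count_max = 3    # unanswered probes tolerated before disconnect

# Connection tracking timeouts discussed in this thread, in seconds.
unacked_timeout = 5 * 60             # the low 5 min timeout
established_timeout = 24 * 60 * 60   # the configured 1d timeout

# Traffic every 120 s refreshes the tracking entry before either timeout expires.
keeps_entry_alive = client_alive_interval < min(unacked_timeout, established_timeout)

# An unresponsive client is dropped after roughly interval * count seconds.
outage_tolerance = client_alive_interval * client_alive_count_max

print(f"probes arrive before the tracking entry expires: {keeps_entry_alive}")
print(f"session survives an outage of at most about {outage_tolerance} s")
```

So the probes comfortably beat even the 5 min timeout, at the cost of losing the session after an outage of about six minutes.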

However, my current problem described in the other thread was only tested on 6.35.2 (it is a new setup).

I suspect similar strange behavior on a test board (running the current bugfix 6.34.6 and current 6.35.x). In production I’m always very conservative and still use the old bugfix 6.32.x and 6.30.4 everywhere, and there I have never noticed problems.
Do you have any updates on this?

No update unfortunately. My impression that it appeared in recent versions is based only on the number of SSH interruptions I was getting. Increasing the unacked timeout to 1d has effectively mitigated the problem until Mikrotik are able to reproduce and fix it.

I saw this in the RC changelog which looks like it may be designed to help fix this:

!) fastpath - let one packet per second through slow path to properly update connection timeouts;

Unfortunately I don’t think this will help, since this particular low timeout is triggered by conntrack seeing a packet which requires a specific ack. Thus if that ack is still going via fastpath / fasttrack, netfilter considers the ack as never received so the timeout remains low. Unfortunately I don’t have any boards on which to experiment with RCs, but it’s encouraging that it looks like this issue has some attention.
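
To illustrate the reasoning, here is a toy Python model. It is a sketch under the assumption that the tracker switches to the low timeout when it sees new data and only returns to the long timeout once it sees an ACK covering that data. It is not the netfilter code, and all class and method names are made up.

```python
from dataclasses import dataclass

ESTABLISHED_TIMEOUT = 24 * 60 * 60  # 1d, as configured in this thread
UNACKED_TIMEOUT = 5 * 60            # 5 min, the low timeout observed


@dataclass
class Direction:
    end: int = 0           # highest sequence number seen from this side
    unacked: bool = False  # True while data from this side awaits an ACK


class TrackedConnection:
    """Toy model of a tracked TCP connection (assumed behaviour only)."""

    def __init__(self):
        self.sides = {"client": Direction(), "server": Direction()}
        self.timeout = ESTABLISHED_TIMEOUT

    def see_packet(self, sender, seq_end, ack):
        """Process a packet that actually reaches the tracker (slow path)."""
        receiver = "server" if sender == "client" else "client"
        s, r = self.sides[sender], self.sides[receiver]
        if seq_end > s.end:   # new data: it now requires an ACK
            s.end = seq_end
            s.unacked = True
        if ack == r.end:      # acknowledges everything the peer has sent
            r.unacked = False
        pending = any(d.unacked for d in self.sides.values())
        self.timeout = UNACKED_TIMEOUT if pending else ESTABLISHED_TIMEOUT


# Case 1: both the data packet and its ACK pass through the tracker.
slow = TrackedConnection()
slow.see_packet("client", seq_end=100, ack=0)
slow.see_packet("server", seq_end=0, ack=100)
print("ACK seen by tracker:", slow.timeout)

# Case 2: the data packet is tracked, but its ACK bypasses the tracker
# (modelled simply by never calling see_packet for it).
fast = TrackedConnection()
fast.see_packet("client", seq_end=100, ack=0)
print("ACK bypassed tracker:", fast.timeout)
```

In the second case the entry stays on the 5 min timeout, and if the session then goes idle nothing else arrives to correct it.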

I have seen that changelog entry and I have my doubts as well. At least it explains why it is not working correctly,
but it is doubtful that it is now completely fixed.
Fortunately on the network where most of the MikroTiks are, I always disable fastpath/fasttrack because of too
many issues and limitations and we don’t need it for performance. On the network where I had this issue, I
have just bumped up that specific timeout and when another problem occurs, out goes the fasttrack/fastpath.