I have been debugging a problem that was partly caused by some other equipment, but a MikroTik
router was involved in the following way:
The router contains a Forward ruleset that allows Established/Related, drops Invalid, and allows New
connections in one direction. Nothing unusual. There is no NAT involved, it is pure routing.
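A ruleset of this shape could be written roughly as follows in RouterOS. This is only a sketch of
the kind of configuration described above, not the actual one: the interface name ether2 and the
final catch-all drop rule are assumptions.

```
/ip firewall filter
# accept packets that belong to an existing tracking entry
add chain=forward action=accept connection-state=established,related
# drop packets that connection tracking cannot match to any entry
add chain=forward action=drop connection-state=invalid
# allow new connections in one direction only (ether2 is a hypothetical inside interface)
add chain=forward action=accept connection-state=new in-interface=ether2
# assumed: everything else is dropped
add chain=forward action=drop
```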
What I have observed: some TCP sessions that have no traffic are ticking down from the
“TCP unacked” timeout instead of the “TCP established” timeout set in the connection tracking.
The unacked timeout is only 5 minutes by default, whereas the established timeout is one day.
After those 5 minutes the tracking entry is removed. Some time later the remote end closes
the idle connection with a FIN ACK, which the router drops as “invalid” (correctly, since it no
longer has an entry). The local side never sees the connection being closed and assumes it is
still open. When the local side wants to send more data, the router creates a new tracking entry
and forwards the data. However, there is another firewall further down the path (not MikroTik)
which has seen that unanswered FIN ACK and has by now deleted its own tracking entry, so it drops
this data as invalid. It does not send any reply to signal this. The local system
sees a dead TCP connection and complains.
I worked around this at first by sending a TCP reset for TCP traffic that would otherwise be
dropped as invalid (this makes the local system realize something is wrong and re-establish the
session). Then I noticed what was actually triggering the problem, increased the “TCP unacked”
timeout, and now the session is closed neatly.
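Both steps can be sketched in RouterOS terms as follows. The 1d value and the rule placement are
assumptions for illustration only; the value that was actually chosen is not stated here.

```
# Workaround 1: answer invalid TCP traffic with a reset instead of dropping it silently.
# This rule has to be placed above the generic "drop invalid" rule in the forward chain.
/ip firewall filter
add chain=forward action=reject reject-with=tcp-reset protocol=tcp connection-state=invalid

# Workaround 2: raise the "TCP unacked" timeout (default 5m).
# 1d is an assumed value, picked here only because it matches the established timeout.
/ip firewall connection tracking
set tcp-unacked-timeout=1d
print
```

The final print simply shows the resulting connection tracking timeouts so the change can be checked.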
However, the real root cause of this problem is that an idle session is misclassified.
I vaguely recall seeing this before: SSH sessions that died after being idle for some time.
In the case I was investigating it was a connection from a Linux system to a Windows system,
but in the SSH case it was from Linux to Linux.
What does “TCP unacked” mean anyway? There is another timer for “TCP retransmit”, but that one
is not involved (I changed the timeout values to see which one the entry was ticking down from).
“TCP unacked” is not documented on the wiki.
Could it be that this timer is wrongly applied when a TCP keepalive is sent by one side and not by the other?