I have a TCP connection that tears itself down and reconnects constantly, and I can’t comprehend the pcaps well enough to understand where the disconnect is. It is SIP, and the SIP helper is turned off. The bizarre thing is that if I double-NAT the connection with another router behind the MT, it works fine. I also tried src-nat instead of masquerade, with no change. Here is a sample from the pcap. Everything is fine until that Encrypted Alert; then both server and client send FIN, ACK, and then the server sends the resets. Then they reconnect fine and the process repeats. Any TCP masters out there who can decipher this?
You don’t need a TCP master - TCP does what the applications ask of it. And the application here is TLS - the server dislikes something in the initial exchange with your client and sends you the FIN. But maybe the reason is the Encrypted Alert your client sends to the server.
The double NAT affecting that behaviour is interesting. Is 10.170.1.113 the WAN IP of the Mikrotik or the IP of the phone/PBX?
Replacing action=masquerade with action=src-nat effectively changes nothing except for two things: with src-nat you choose the new source IP address yourself (masquerade picks it automatically), and with masquerade the NAT connection is automatically removed if the reply-dst address of that connection disappears or changes, which is not your case.
Hi sindy. Thank you for the reply. The encrypted alert may have some more information. I’ll see if I can get to the logs on the client 10.170.1.113 to see if I can uncover that alert. The client 10.170.1.113 is plugged directly into a MT bridge port. I then have another router (non MT) plugged into another bridge port and more clients behind that. Those clients all connect to the server 209.217.207.40 just fine. Is there any sort of services or features on the MT that would make it manipulate or change any behavior for TLS applications?
First, compare apples to apples. You cannot say that it works with double NAT if you don’t try with the same client behind the second NAT.
The encrypted alert is encrypted, so if you can see its actual contents anywhere, it is in the log of the client, not in the pcap. I’d expect the client to have a problem recognizing the certificate of the server.
If there was a bug in Mikrotik, it could malform the contents of the TCP exchange, but not modify the payload in such a way that it would still make enough sense for both the client and the server to proceed with the TLS negotiation to some extent. And there is no intentional tampering with TLS contents; Mikrotik is not an application-layer firewall like Fortigate and the like.
Hey sindy,
If I put the client behind the NAT’d router that’s behind the MT, it works as expected. If I move it back to the MT, it continues this behavior. I found a thread that you participated in (http://forum.mikrotik.com/t/router-blocks-some-sip-invites-but-not-all-misconfiguration-or-bug/142083/6) last year and tried all of those steps this morning (also changed the client to 1.5 to get it below .10 in case length played a role). Looks like the OP fixed it by double NATing it. I would really like to figure this thing out so I don’t have to do that. I did change the handset to TCP with no encryption as well as straight UDP and did some captures. Same symptoms with both protocols. I am going to work with the server-side vendor today to hopefully understand why that FIN gets sent by the server. The TCP pcap is below in case you see something I missed. The SIP (remove 1 binding) is odd. It almost reads like the phone is unregistering itself? I’m not super familiar with the protocol.
Well, that’s two different cans of worms. With plaintext SIP (no matter whether TCP or UDP), if the SIP helper (ALG) in the firewall is enabled, the length issues could play a role when handling a payload (SDP) whose length is explicitly stated in the packet, so there may be a mismatch of the declared length and the actual one. But since it breaks already when handling a REGISTER which carries no SDP, it cannot be this.
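To make the length-mismatch idea concrete, here is a minimal Python sketch that compares the Content-Length declared in a plaintext SIP message with the number of body bytes actually present. It assumes you have the raw bytes of exactly one SIP message (for example, copied out of a capture); the function name is made up for illustration.

```python
def declared_and_actual_length(raw):
    """Return (declared, actual) body length of one SIP message,
    or None if the message has no Content-Length header."""
    # Headers and body are separated by an empty line (CRLF CRLF).
    head, _, body = raw.partition(b"\r\n\r\n")
    declared = None
    for line in head.split(b"\r\n")[1:]:  # skip the request/status line
        name, _, value = line.partition(b":")
        # "l" is the compact form of Content-Length in SIP.
        if name.strip().lower() in (b"content-length", b"l"):
            declared = int(value.strip())
    if declared is None:
        return None
    return declared, len(body)
```

If the two numbers differ on the WAN side but match on the LAN side, something in between has rewritten the body without updating the header. A REGISTER normally has no body, so both numbers would be 0 there.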
“Remove 1 binding” means that the phone sends a specific contact and asks to register it for 0 more seconds, which is the way to unregister that single contact selectively. Leaving aside the case when the phone is shutting down or the user explicitly asks it to unregister, this may happen when the phone is too picky and the registrar returns different contents of the Contact header in its 200 response than what the phone has sent. But here, the pcap shows no 200 response to the REGISTER at all; the server sends an ACK to the packet carrying the initial REGISTER and then sends a FIN, so it doesn’t like something about the contents of the REGISTER.
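As an illustration of what "register it for 0 more seconds" looks like on the wire, the following Python sketch lists the contacts a plaintext REGISTER asks the registrar to drop. It is a simplification under stated assumptions: one contact per Contact header line, and an expiry given either as an `;expires=` parameter on the contact or as a separate Expires header. The function name is hypothetical.

```python
import re

def removed_bindings(register_text):
    """Return the Contact values that this REGISTER asks to unregister."""
    default_expires = None
    contacts = []
    for line in register_text.split("\r\n")[1:]:  # skip the request line
        if not line:
            break  # an empty line ends the headers
        name, _, value = line.partition(":")
        name = name.strip().lower()
        if name == "expires":
            default_expires = int(value.strip())
        elif name in ("contact", "m"):  # "m" is the compact form of Contact
            contacts.append(value.strip())
    removed = []
    for contact in contacts:
        # A per-contact expires parameter overrides the Expires header.
        match = re.search(r";\s*expires\s*=\s*(\d+)", contact, re.IGNORECASE)
        expires = int(match.group(1)) if match else default_expires
        if expires == 0:
            removed.append(contact)
    return removed
```

A REGISTER carrying `Contact: <sip:1001@10.170.1.113:5060>;expires=0` would be reported as removing that one contact, which is what a "remove 1 binding" summary describes.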
To tell you more, I’d have to see the very same REGISTER as it enters the router from the phone and as it leaves the router towards the registrar, i.e. you’d have to capture at both the LAN and WAN interface of the Mikrotik and filter on the registrar IP address, and I’d have to see the complete contents of the packets, not the single-row excerpts, to be able to diff them. I know you say the SIP helper is off, but it may not be off enough (seen that elsewhere, not on Mikrotik in particular).
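For the comparison itself, a small Python sketch along these lines would do, assuming the same plaintext REGISTER has been saved as text twice, once from the LAN-side capture and once from the WAN-side capture. How you export the text is up to you, and the function name is hypothetical.

```python
import difflib

def diff_sip_messages(lan_text, wan_text):
    """Return only the lines that differ between the LAN-side and the
    WAN-side copy of the same SIP message ('- ' = LAN, '+ ' = WAN)."""
    lan_lines = lan_text.splitlines()
    wan_lines = wan_text.splitlines()
    # ndiff also emits unchanged ('  ') and hint ('? ') lines; drop those.
    return [line for line in difflib.ndiff(lan_lines, wan_lines)
            if line.startswith(("- ", "+ "))]
```

An empty result means the router forwarded the SIP message untouched. Any output shows exactly which headers were rewritten on the way through, which would be the sign of a helper that is not "off enough".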
But once the SIP exchange is encrypted, the SIP helper (ALG) has no way to manipulate anything inside the SIP message contents even if it was on, so it again suggests something related to the LAN address being sent inside the SIP message. Just an idea, can you create a dedicated subnet inside 192.168.0.0/16 or 172.16.0.0/12 (i.e. a private one but outside 10.0.0.0/8) on the Mikrotik, move the phone to this subnet, and see the outcome? I.e. do a single NAT but from a very different LAN address, ideally similar to the one used at the LAN side of the second router?
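To double-check which private block a candidate test address falls into, a short Python sketch using the standard ipaddress module is enough. The helper name is made up; the three networks are the RFC 1918 private ranges mentioned above.

```python
import ipaddress

# The three RFC 1918 private address blocks.
RFC1918 = [ipaddress.ip_network(n)
           for n in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")]

def rfc1918_block(addr):
    """Return the private block containing addr, or None if it is not private."""
    ip = ipaddress.ip_address(addr)
    for net in RFC1918:
        if ip in net:
            return str(net)
    return None
```

The phone's current address 10.170.1.113 lands in 10.0.0.0/8, so the test subnet should be one for which this returns 192.168.0.0/16 or 172.16.0.0/12.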
Hey sindy, Happy New Year and thank you for your help! I don’t want to get too far ahead of myself, but I set up a new bridge and created a standard class C address space, 192.168.5.0/24. I added the test phone to the bridge, it connected, and I haven’t seen it dump the session like it was before. I’ll still need to do some testing to confirm, but it’s the most progress I’ve had in weeks! We were using a non-standard address space with the first octet 10 and a mask of /24. Maybe that call server doesn’t appreciate it. I will dig into that further to confirm, since we have many of these deployments on MTs with similar address schemes. I’ll have an update next week. Thanks!