SIP client cannot re-register in the SIP server after switching ISP (different NAT)

Issue:

SIP client cannot re-register in the SIP server after switching ISP (different NAT).

Description:

In our setup we have two ISPs, a SIP client with a private IP, and NAT (a different NAT rule for each ISP) with SIP ALG translation, aka SIP NAT helper.

When the default route changes from one ISP to the other (manually, or because the ISP link goes down), the Mikrotik applies the wrong NAT rule. Because of this, the SIP REGISTER messages cannot reach the SIP server and the SIP connection drops.

If we clear the NAT table or reboot the router, everything works again.

Versions affected:

6.38 (mipsbe), 6.38.5 (chr), 6.39.3 (mipsbe), 6.41 (chr)

Note: We have tested some real Mikrotiks (mipsbe) and run some simulations in GNS3 with RouterOS x86 virtual machines (chr).

How to reproduce:

  • Connect a router to two different ISPs (each one giving you a different real IP) and to an internal network;
  • Create proper NAT rules for each ISP;
  • Create proper default routes (static routes) for each ISP (the first ISP with the smaller distance);
  • Set up a SIP client (in the internal network) to register with an external SIP server and perform the registration;
  • Change the distance of the default routes so the second ISP will be the active route (smaller distance);
  • Try to re-register the SIP client with the SIP server and you will see that no SIP message returns and the re-registration fails;
  • Check the NAT table and run a sniffer in the router and you will see that the router is routing the packet via the second ISP but it’s still applying the old NAT rule (for the first ISP) instead of the correct NAT rule.

Network setup and detailed how to reproduce:

My production setup is somewhat complicated. However, I reproduced the problem in a much simpler setup using GNS3, so I’m showing this simplified setup here.

The screenshot of my GNS3 setup is below:

The isp1 and isp2 nodes simulate the two different ISPs. They connect to the Internet via GNS3 NAT nodes (if you don’t know how GNS3 works, just consider that the isp1 and isp2 nodes behave as real ISP routers).

Our router (router) is connected to both ISPs and also to the sip-client node (an Ubuntu 14.04 docker node that simulates a SIP client).

A more detailed diagram is shown below:

The implementation of the router node is:

/ip address
add address=10.10.1.2/24 interface=ether1 network=10.10.1.0
add address=10.10.2.2/24 interface=ether2 network=10.10.2.0
add address=192.168.0.1/24 interface=ether3 network=192.168.0.0
/ip firewall nat
add action=src-nat chain=srcnat out-interface=ether1 to-addresses=10.10.1.2
add action=src-nat chain=srcnat out-interface=ether2 to-addresses=10.10.2.2
/ip route
add distance=1 gateway=10.10.1.1
add distance=2 gateway=10.10.2.1
/system identity
set name=router

To make NAT tests easier, we have also increased the ICMP connection tracking timeout:

/ip firewall connection tracking
set icmp-timeout=1h

And this is the network configuration of the SIP client (/etc/network/interfaces):

auto eth0
iface eth0 inet static
	address 192.168.0.100
	netmask 255.255.255.0
	gateway 192.168.0.1
	up echo nameserver 192.168.0.1 > /etc/resolv.conf

Our SIP client connects to the SIP server using NAT with the help of the SIP ALG translation, aka SIP nat helper:

[admin@router] > /ip firewall service-port print where name=sip
Flags: X - disabled, I - invalid 
 #   NAME                                                                 PORTS
 0   sip                                                                  5060 
                                                                          5061

Now, in the client, we will ping (ICMP) the SIP server and also send a SIP message to our SIP server. After this, we have the following entries in the firewall NAT table:

[admin@router] > /ip firewall connection print detail where protocol=icmp      
Flags: E - expected, S - seen-reply, A - assured, C - confirmed, D - dying, 
F - fasttrack, s - srcnat, d - dstnat 
 0  S C  s  protocol=icmp src-address=192.168.0.100 dst-address=199.87.121.233 
            reply-src-address=199.87.121.233 reply-dst-address=10.10.1.2 
            icmp-type=8 icmp-code=0 icmp-id=521 timeout=58m16s orig-packets=4 
            orig-bytes=336 orig-fasttrack-packets=0 orig-fasttrack-bytes=0 
            repl-packets=3 repl-bytes=252 repl-fasttrack-packets=0 
            repl-fasttrack-bytes=0 orig-rate=0bps repl-rate=0bps 

[admin@router] > /ip firewall connection print detail where connection-type=sip
Flags: E - expected, S - seen-reply, A - assured, C - confirmed, D - dying, 
F - fasttrack, s - srcnat, d - dstnat 
 0  SAC  s  protocol=udp src-address=192.168.0.100:5060 
            dst-address=199.87.121.233:5060 
            reply-src-address=199.87.121.233:5060 
            reply-dst-address=10.10.1.2:5060 connection-type="sip" 
            timeout=58m21s orig-packets=3 orig-bytes=1 347 
            orig-fasttrack-packets=0 orig-fasttrack-bytes=0 repl-packets=3 
            repl-bytes=927 repl-fasttrack-packets=0 repl-fasttrack-bytes=0 
            orig-rate=0bps repl-rate=0bps

To reproduce the problem, let’s change the active default route to the second ISP:

[admin@router] > /ip route print where dst-address=0.0.0.0/0
Flags: X - disabled, A - active, D - dynamic, 
C - connect, S - static, r - rip, b - bgp, o - ospf, m - mme, 
B - blackhole, U - unreachable, P - prohibit 
 #      DST-ADDRESS        PREF-SRC        GATEWAY            DISTANCE
 0 A S  0.0.0.0/0                          10.10.1.1                 1
 1   S  0.0.0.0/0                          10.10.2.1                 2

[admin@router] > /ip route set [find static gateway=10.10.1.1] distance=100    

[admin@router] > /ip route print where dst-address=0.0.0.0/0               
Flags: X - disabled, A - active, D - dynamic, 
C - connect, S - static, r - rip, b - bgp, o - ospf, m - mme, 
B - blackhole, U - unreachable, P - prohibit 
 #      DST-ADDRESS        PREF-SRC        GATEWAY            DISTANCE
 0 A S  0.0.0.0/0                          10.10.2.1                 2
 1   S  0.0.0.0/0                          10.10.1.1               100

The new scenario is shown in the image below:

Now, in the client, we will ping (ICMP) the SIP server and send a SIP message to the SIP server again. We will notice that the ICMP ping works, but the SIP message gets no reply.

The reason is that the new ICMP packets add a new NAT entry (via the second ISP) but the SIP traffic still uses the NAT entry created via the first ISP (the router NATs the packet using the IP of ether1 although it sends the packet via ether2).

[admin@router] > /ip firewall connection print detail where protocol=icmp      
Flags: E - expected, S - seen-reply, A - assured, C - confirmed, D - dying, 
F - fasttrack, s - srcnat, d - dstnat 
 0  S C  s  protocol=icmp src-address=192.168.0.100 dst-address=199.87.121.233 
            reply-src-address=199.87.121.233 reply-dst-address=10.10.1.2 
            icmp-type=8 icmp-code=0 icmp-id=521 timeout=49m38s orig-packets=4 
            orig-bytes=336 orig-fasttrack-packets=0 orig-fasttrack-bytes=0 
            repl-packets=3 repl-bytes=252 repl-fasttrack-packets=0 
            repl-fasttrack-bytes=0 orig-rate=0bps repl-rate=0bps 

 1  S C  s  protocol=icmp src-address=192.168.0.100 dst-address=199.87.121.233 
            reply-src-address=199.87.121.233 reply-dst-address=10.10.2.2 
            icmp-type=8 icmp-code=0 icmp-id=522 timeout=59m36s orig-packets=2 
            orig-bytes=168 orig-fasttrack-packets=0 orig-fasttrack-bytes=0 
            repl-packets=2 repl-bytes=168 repl-fasttrack-packets=0 
            repl-fasttrack-bytes=0 orig-rate=0bps repl-rate=0bps 

[admin@router] > /ip firewall connection print detail where connection-type=sip
Flags: E - expected, S - seen-reply, A - assured, C - confirmed, D - dying, 
F - fasttrack, s - srcnat, d - dstnat 
 0  SAC  s  protocol=udp src-address=192.168.0.100:5060 
            dst-address=199.87.121.233:5060 
            reply-src-address=199.87.121.233:5060 
            reply-dst-address=10.10.1.2:5060 connection-type="sip" 
            timeout=59m38s orig-packets=5 orig-bytes=2 245 
            orig-fasttrack-packets=0 orig-fasttrack-bytes=0 repl-packets=3 
            repl-bytes=927 repl-fasttrack-packets=0 repl-fasttrack-bytes=0 
            orig-rate=0bps repl-rate=0bps 

[admin@router] > /tool sniffer quick port=5060                           
INTERFACE             TIME    NUM DI SRC-MAC           DST-MAC           VLAN  
ether3              12.427      1 <- B2:C2:B4:18:98:07 00:6B:0E:68:7F:02
ether2              12.427      2 -> 00:6B:0E:68:7F:01 00:6B:0E:6B:12:01
-- [Q quit|D dump|C-z pause]

I don’t know if it’s a bug or if this behavior really makes sense, but I guess that the Mikrotik router (when receiving the new SIP packets) should create a new SIP NAT entry with the new reply-dst-address, just as it does with the ICMP messages (because now the packets are sent through a new interface - ether2 - that has a different NAT rule).

Some questions:

  • Is it a bug?
  • Has anyone already seen this problem in another setup - like with another SIP helper or with normal UDP NATs?
  • What is the expected behavior?
  • If it’s a bug, how can I report it to the Mikrotik support team?

Known workaround:

We’re now using the following workaround:

  • The router checks from time to time (via a script that runs in /system scheduler) if the default gateway has changed;
  • If it discovers a change, then it runs the following command:
/ip firewall connection remove [find where connection-type=sip]

After this, all SIP connections started working again.

Disabling SIP ALG is not an option:

Please, there’s nothing wrong with using SIP ALG (as long as it is implemented without bugs). Actually, our case is exactly the case SIP ALG was created for.

Moreover, our server requires SIP ALG to call the SIP client when necessary.

Unfortunately, I think this is a known issue with Mikrotik users. We are a service provider with SIP phones at our clients’ locations, and if we put a backup connection at the site, the SIP connections do exactly what you’re describing, and our workaround has been the same - to wipe all SIP connections out of the connections table.

I hope Mikrotik fixes this. (Or someone who knows a way to properly set this up chimes in with a real solution)

Some new info:

An employee from my company realized that connection-type=sip2 entries may exist (although “sip2” connections are not documented in the MikroTik wiki - only “sip” connections are - https://wiki.mikrotik.com/wiki/Manual:IP/Firewall/Mangle or https://wiki.mikrotik.com/wiki/Manual:IP/Firewall/Connection_tracking).

If that’s true, it may be necessary to change the workaround to:

/ip firewall connection remove [find where connection-type=sip or connection-type=sip2]

I also sent an email to support@mikrotik.com today and I’m waiting for a response. I will post an update here when I have some news.

It is not so much SIP-related except that it is most notable with SIP. NAT is an extension of connection tracking, and the SIP helper as well.

When a connection is established which involves a NAT, the socket quadruple is remembered, which describes the sockets of the endpoints as well as the local sockets used. A packet from an endpoint’s socket to the local socket looking in that endpoint’s direction is identified as part of the connection and forwarded accordingly after replacing socket information if needed.

In your case, the endpoints are the same (the VoIP provider’s equipment and your CPE), but the first successful registration builds a connection record which then reuses the socket quadruple even if the actual output route changes, so the packet leaves through one interface indicating the IP address of another one as its source address.

An ICMP “connection” lasts only from sending the request until receiving the response, so the first ICMP request (ping) sent after the route change already leaves the Mikrotik with the correct source IP. A SIP “connection”, on the other hand, lasts for the configured lifetime (1 hour by default) thanks to the SIP helper. If the registration fails, the connection is not destroyed; only its lifetime is reduced to 3 minutes, as with an ordinary UDP connection.

Assuming that the IP address of your SIP provider is not relevant to anything else but SIP connections, I would not look at “sip” connections when cleaning up the connection table after detecting a change of the active route, and would use the following instead:

/ip firewall connection remove [find where dst-address~"199.87.121.233"]

Just thinking out loud: are these configs using masquerade instead of src NAT? Maybe try using src NAT. It might be caused by the difference in the way they (masquerade vs src NAT) handle connections when the IP changes.

Nope. masquerade vs. src-nat only affects from where the new source address of the packet is taken and when, not how the already established connections are handled without the interface going down or changing address.

So with masquerade, each time the interface goes down or its IP address changes, all tracked connections are cleared and newly established connections are src-nated to the new address (there is a video on that as well, explaining why to use masquerade only where really necessary).

But if the interface does not go down or change address, and just the route in question doesn’t go through it any more, as is the case which the OP describes, use of masquerade instead of src-nat doesn’t change anything.

Hi @sindy, thanks for your explanation. You are right: the SIP problem is not a SIP problem, but a UDP NAT problem (a more general problem). It isn’t even a bug: it’s a UDP NAT limitation (a protocol limitation).

I made some NAT tests and understood better how NAT works in MikroTik. I’m publishing my discoveries here.

@ZeroByte, this can help you.

The general idea

The general idea behind NAT is to divide the NAT translation workflow into two different phases: the NAT table (/ip firewall connection) and the NAT rules (/ip firewall nat).

The idea is shown in the image:

The first packet will trigger the NAT table entry creation (since there is a NAT rule for it). From the second packet onwards, the already created NAT table entry will be used (if the NAT rule is deleted at this point, for example, the NAT will continue to be applied since a NAT table entry already exists).

This is what causes the NAT problem in SIP: once a NAT table entry is created, it doesn’t matter if the default route changes or what NAT rule would be applied - the already created NAT table entry will ALWAYS be applied (even if it NATs to the wrong IP) until it times out.
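
To make the two phases concrete, here is a minimal Python sketch of the idea. It is only a toy model, not how RouterOS is implemented: the function and variable names are made up for the illustration, and the addresses come from the lab above. The NAT rules are consulted only when a flow has no table entry yet.

```python
# Toy model: NAT rules are consulted only when a flow has no connection-table entry.
# Addresses come from the lab in this thread; all names are hypothetical.

NAT_RULES = {"ether1": "10.10.1.2", "ether2": "10.10.2.2"}  # out-interface -> to-addresses

conntrack = {}          # flow 5-tuple -> source address chosen for that flow
active_wan = "ether1"   # interface used by the active default route


def forward(flow):
    """Return (out_interface, translated_source_ip) for one outgoing packet."""
    if flow not in conntrack:
        # First packet of the flow: evaluate the NAT rules and remember the result.
        conntrack[flow] = NAT_RULES[active_wan]
    # Every later packet reuses the stored translation; the rules are not consulted.
    return active_wan, conntrack[flow]


sip_flow = ("udp", "192.168.0.100", 5060, "199.87.121.233", 5060)

first = forward(sip_flow)   # entry created while ether1 is the active route
active_wan = "ether2"       # the default route moves to the second ISP
stale = forward(sip_flow)   # leaves via ether2, but still sourced from 10.10.1.2
conntrack.pop(sip_flow)     # comparable to removing the entry by hand
fresh = forward(sip_flow)   # the rules are evaluated again

print(first, stale, fresh)
```

The middle result is the situation seen in the sniffer output earlier in the thread: the packet goes out through ether2 with the ether1 address as its source.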

In the simplest implementations, each entry in the NAT table will be similar to those in the next image:

There is an additional column (hidden in my image) that is the timeout (or the time of the last processed packet - the timeout can be calculated if we know when the last packet was processed).

If a new packet arrives with the same set of values (int-src-addr, int-src-port, int-dst-addr, int-dst-port and protocol), the same NAT table entry will be applied (the packet matches this entry). If a packet arrives with the set of values (ext-dst-addr, ext-dst-port, ext-src-addr, ext-src-port and protocol), it’s a reply to a natted packet, so the translation will be applied in reverse.
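
The lookup in both directions can be sketched like this. Again, this is a simplified model with hypothetical names (Packet, Entry, translate); the entry mirrors the src-address/dst-address and reply-src-address/reply-dst-address fields of the SIP connection listed earlier.

```python
from typing import NamedTuple, Optional


class Packet(NamedTuple):
    proto: str
    src: str
    sport: int
    dst: str
    dport: int


class Entry(NamedTuple):
    original: Packet  # what the internal host sends
    reply: Packet     # what is expected back from the external host


def translate(table: list, pkt: Packet) -> Optional[Packet]:
    """Rewrite a packet according to the first matching entry, or return None."""
    for entry in table:
        if pkt == entry.original:
            # Outgoing direction: the source becomes the address replies are sent to.
            return pkt._replace(src=entry.reply.dst, sport=entry.reply.dport)
        if pkt == entry.reply:
            # Reply direction: the destination is restored to the internal host.
            return pkt._replace(dst=entry.original.src, dport=entry.original.sport)
    return None  # no entry: in a real router the NAT rules would be evaluated here


table = [Entry(
    original=Packet("udp", "192.168.0.100", 5060, "199.87.121.233", 5060),
    reply=Packet("udp", "199.87.121.233", 5060, "10.10.1.2", 5060),
)]

outgoing = translate(table, Packet("udp", "192.168.0.100", 5060, "199.87.121.233", 5060))
incoming = translate(table, Packet("udp", "199.87.121.233", 5060, "10.10.1.2", 5060))
unknown = translate(table, Packet("udp", "199.87.121.233", 5060, "10.10.2.2", 5060))
```

A reply addressed to 10.10.2.2 matches nothing in this table, because the only entry expects replies on 10.10.1.2.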

The MikroTik implementation

MikroTik has a more efficient NAT implementation. Its NAT table looks like this one (it’s a simplified schema, but it explains well what is done):

MikroTik (as well as other modern routers) uses special columns to make better NAT decisions.

Note: There is no UDP stream flag. But Mikrotik uses more general flags (‘Confirmed’ and ‘Assured’ flags, and maybe others) to check if a UDP packet belongs to a UDP flow. So, I simplified this in the image.

MikroTik also applies different timeout values for each packet type. More info can be found here, in the docs: https://wiki.mikrotik.com/wiki/Manual:IP/Firewall/Connection_tracking#Properties_2.

Now, I will post my test results about NAT in Mikrotik (ICMP, TCP and UDP). I used the same lab environment as in the topic description.

But before, I increased all the relevant NAT timeouts (it’s easier to check the NAT table with greater timeouts):

/ip firewall connection tracking
set icmp-timeout=1h udp-stream-timeout=1h

There is no need to increase the tcp-established-timeout because it is high by default (1d).

ICMP NAT in MikroTik

The ICMP Echo Request packet has the format described here: https://en.wikipedia.org/wiki/Ping_(networking_utility)#Echo_request.

It has four fields that are important in the context of the tests: Type, Code, Identifier (or ID) and Sequence.

ICMP Echo Requests use Type=8 and Code=0. When we run the ‘ping’ command in Linux, a new ID number is chosen (different ‘ping’ executions will use different IDs) and each new ping packet sent increments the sequence number. Each received ICMP Echo Reply will have the same values, except that it uses Type=0 (https://en.wikipedia.org/wiki/Ping_(networking_utility)#Echo_reply). Mikrotik uses this fact to un-NAT a ping reply.

In an image:

When I ran this test, this was the NAT table entry created in my router:

[admin@router] > /ip firewall connection print detail where protocol=icmp      
Flags: E - expected, S - seen-reply, A - assured, C - confirmed, D - dying, 
F - fasttrack, s - srcnat, d - dstnat 
 0  S C  s  protocol=icmp src-address=192.168.0.100 dst-address=199.87.121.233 
            reply-src-address=199.87.121.233 reply-dst-address=10.10.1.2 
            icmp-type=8 icmp-code=0 icmp-id=521 timeout=58m16s orig-packets=4 
            orig-bytes=336 orig-fasttrack-packets=0 orig-fasttrack-bytes=0 
            repl-packets=3 repl-bytes=252 repl-fasttrack-packets=0 
            repl-fasttrack-bytes=0 orig-rate=0bps repl-rate=0bps

My ‘ping’ program chose the ID 521.

When I changed my default route, my ‘ping’ program stopped working (it was using the already created NAT table entry, which was changing the source IP to 10.10.1.2, now a wrong value). But when I killed the program and started another ‘ping’, the new instance chose a different ICMP ID, so the new packets didn’t match the old NAT table entry (because of the different ICMP ID) and a new entry was created:

[admin@router] > /ip firewall connection print detail where protocol=icmp      
Flags: E - expected, S - seen-reply, A - assured, C - confirmed, D - dying, 
F - fasttrack, s - srcnat, d - dstnat 
 0  S C  s  protocol=icmp src-address=192.168.0.100 dst-address=199.87.121.233 
            reply-src-address=199.87.121.233 reply-dst-address=10.10.1.2 
            icmp-type=8 icmp-code=0 icmp-id=521 timeout=49m38s orig-packets=4 
            orig-bytes=336 orig-fasttrack-packets=0 orig-fasttrack-bytes=0 
            repl-packets=3 repl-bytes=252 repl-fasttrack-packets=0 
            repl-fasttrack-bytes=0 orig-rate=0bps repl-rate=0bps 

 1  S C  s  protocol=icmp src-address=192.168.0.100 dst-address=199.87.121.233 
            reply-src-address=199.87.121.233 reply-dst-address=10.10.2.2 
            icmp-type=8 icmp-code=0 icmp-id=522 timeout=59m36s orig-packets=2 
            orig-bytes=168 orig-fasttrack-packets=0 orig-fasttrack-bytes=0 
            repl-packets=2 repl-bytes=168 repl-fasttrack-packets=0 
            repl-fasttrack-bytes=0 orig-rate=0bps repl-rate=0bps

And abracadabra alakazam, my ping worked!
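
The ICMP behaviour above can be sketched in a few lines of Python. This is a toy model of what the listings show, with made-up names; the only point is that the ICMP identifier is part of the lookup key, so a restarted ‘ping’ never matches the old entry.

```python
# Toy model: the ICMP identifier is part of the lookup key.
# Names are hypothetical; addresses and IDs are taken from the listings above.

NAT_ADDRESS = {"ether1": "10.10.1.2", "ether2": "10.10.2.2"}
table = {}


def send_echo_request(src, dst, icmp_id, out_interface):
    """Return the source address the echo request leaves the router with."""
    key = ("icmp", src, dst, 8, 0, icmp_id)  # type=8, code=0, plus the identifier
    # setdefault keeps an existing entry and only creates one for an unseen key.
    return table.setdefault(key, NAT_ADDRESS[out_interface])


client, server = "192.168.0.100", "199.87.121.233"

before = send_echo_request(client, server, 521, "ether1")    # first ping run
same_run = send_echo_request(client, server, 521, "ether2")  # same run, after the route change
new_run = send_echo_request(client, server, 522, "ether2")   # ping restarted, new identifier

print(before, same_run, new_run, len(table))
```

The table ends up with two entries, one per identifier, just like the two ICMP connections printed by the router.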

TCP NAT in MikroTik

TCP is a connection-oriented protocol. After doing its famous three-way-handshake (https://en.wikipedia.org/wiki/Transmission_Control_Protocol#Connection_establishment), our connection will be at an established state.

I ran a test and something like this appeared in the NAT table:

[admin@router]> /ip firewall connection print detail where protocol=tcp
Flags: E - expected, S - seen-reply, A - assured, C - confirmed, D - dying,
F - fasttrack, s - srcnat, d - dstnat
 0  SAC  s  protocol=tcp src-address=192.168.0.100:40000
            dst-address=172.127.30.14:443 reply-src-address=172.127.30.14:443
            reply-dst-address=10.10.1.2:40000 tcp-state=established
            timeout=23h59m51s orig-packets=3 orig-bytes=164
            orig-fasttrack-packets=0 orig-fasttrack-bytes=0 repl-packets=2
            repl-bytes=185 repl-fasttrack-packets=0 repl-fasttrack-bytes=0
            orig-rate=0bps repl-rate=0bps

Note that the tcp-state is ‘established’ and that the timeout value is close to one day!

If we close the connection in the correct way (https://en.wikipedia.org/wiki/Transmission_Control_Protocol#Connection_termination) sending a FIN packet, the connection will be closed in both hosts and Mikrotik will change the entry in the NAT table to the time wait state:

[admin@router]> /ip firewall connection print detail where protocol=tcp
Flags: E - expected, S - seen-reply, A - assured, C - confirmed, D - dying,
F - fasttrack, s - srcnat, d - dstnat
 0  SAC  s  protocol=tcp src-address=192.168.0.100:40000
            dst-address=172.127.30.14:443 reply-src-address=172.127.30.14:443
            reply-dst-address=10.10.1.2:40000 tcp-state=time-wait timeout=5s
            orig-packets=10 orig-bytes=536 orig-fasttrack-packets=0
            orig-fasttrack-bytes=0 repl-packets=8 repl-bytes=611
            repl-fasttrack-packets=0 repl-fasttrack-bytes=0 orig-rate=0bps
            repl-rate=0bps

This is an intermediate state before removing the NAT entry. It’s important to avoid problems with ACK retransmissions after the FIN is sent (https://networkengineering.stackexchange.com/a/19718/17394).

After a few seconds, the TCP NAT entry will be entirely removed.

[admin@router]> /ip firewall connection print detail where protocol=tcp
Flags: E - expected, S - seen-reply, A - assured, C - confirmed, D - dying,
F - fasttrack, s - srcnat, d - dstnat

It’s interesting to see that due to its connection oriented nature, it’s possible to remove a NAT entry without waiting a long time for a timeout expiration.

So, I ran the following test: I started a TCP connection and, in the Mikrotik router, I changed the default route. At this point, the communication between the client and the server went down. After some time, the client connection timed out. At this moment, this was the state of the NAT table:

[admin@router] /ip firewall connection> print detail where protocol=tcp
Flags: E - expected, S - seen-reply, A - assured, C - confirmed, D - dying,
F - fasttrack, s - srcnat, d - dstnat
 0  SAC  s  protocol=tcp src-address=192.168.0.100:40000
            dst-address=172.127.30.14:443 reply-src-address=172.127.30.14:443
            reply-dst-address=10.10.1.2:40000 tcp-state=fin-wait timeout=2s
            orig-packets=12 orig-bytes=646 orig-fasttrack-packets=0
            orig-fasttrack-bytes=0 repl-packets=4 repl-bytes=315
            repl-fasttrack-packets=0 repl-fasttrack-bytes=0 orig-rate=0bps
            repl-rate=0bps

The ‘fin-wait’ state means that the client sent a FIN to close the connection (once it timed out). However, the server did not respond (as at this time we’re applying the wrong NAT table entry - which uses the wrong IP).

The client also keeps its state in ‘fin-wait’:

root@sip-client:~# netstat -nt
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0     12 192.168.0.100:40000    172.127.30.14:443       FIN_WAIT1

And just after a few seconds (and some more state changes), the NAT table is cleared:

[admin@router]> /ip firewall connection print detail where protocol=tcp
Flags: E - expected, S - seen-reply, A - assured, C - confirmed, D - dying,
F - fasttrack, s - srcnat, d - dstnat

So, now, it’s possible to start a new TCP connection with the server (even using the same src-port). Note: To always use the same source port, I used the command nc -p 40000 172.127.30.14 443.

So, what happens if we try to reconnect before the ‘fin-wait’ entry has timed out (in other words, what happens if a new SYN is sent)?

I ran this test and got a big surprise: Mikrotik did not apply the old NAT table entry (which would NAT to the wrong IP). Because the new SYN packet is not related to the previous connection (even though it has the same src-addr, src-port, dst-addr, dst-port and protocol), the router knows it belongs to a new connection. So, the router passes the new SYN packet through the NAT rules and replaces the old NAT table entry with a new one that uses the correct IP.

Before:

[admin@router] /ip firewall connection> print detail where protocol=tcp
Flags: E - expected, S - seen-reply, A - assured, C - confirmed, D - dying,
F - fasttrack, s - srcnat, d - dstnat
 0  SAC  s  protocol=tcp src-address=192.168.0.100:40000
            dst-address=172.127.30.14:443 reply-src-address=172.127.30.14:443
            reply-dst-address=10.10.1.2:40000 tcp-state=fin-wait
            timeout=59m56s orig-packets=11 orig-bytes=580 
            orig-fasttrack-packets=0 orig-fasttrack-bytes=0 repl-packets=2 
            repl-bytes=185 repl-fasttrack-packets=0 repl-fasttrack-bytes=0 
            orig-rate=0bps repl-rate=0bps

And after:

[admin@router] /ip firewall connection> print detail where protocol=tcp
Flags: E - expected, S - seen-reply, A - assured, C - confirmed, D - dying,
F - fasttrack, s - srcnat, d - dstnat
 0  SAC  s  protocol=tcp src-address=192.168.0.100:40000
            dst-address=172.127.30.14:443 reply-src-address=172.127.30.14:443
            reply-dst-address=10.10.2.2:40000 tcp-state=established
            timeout=23h59m55s orig-packets=3 orig-bytes=164 
            orig-fasttrack-packets=0 orig-fasttrack-bytes=0 repl-packets=2 
            repl-bytes=185 repl-fasttrack-packets=0 repl-fasttrack-bytes=0 
            orig-rate=0bps repl-rate=0bps

Note that reply-dst-address was updated to the new IP.
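
Here is a small Python sketch of what I observed in this TCP test. It is a deliberately simplified model with hypothetical names: it only covers the case from my test (a new SYN arriving while the old entry is in ‘fin-wait’), it does not model the handshake or server replies, and how RouterOS treats a SYN in other states is not covered.

```python
# Toy model of the TCP test above; names and state handling are assumptions.

NAT_ADDRESS = {"ether1": "10.10.1.2", "ether2": "10.10.2.2"}
table = {}  # (src, sport, dst, dport) -> {"state": ..., "nat_src": ...}


def client_packet(flow, flags, out_interface):
    """Process one client-to-server segment and return its translated source IP."""
    entry = table.get(flow)
    if "SYN" in flags and (entry is None or entry["state"] == "fin-wait"):
        # New connection attempt: the NAT rules run again and the closing
        # entry (if any) is replaced. The handshake itself is not modelled.
        entry = {"state": "established", "nat_src": NAT_ADDRESS[out_interface]}
        table[flow] = entry
    elif entry is None:
        return None  # non-SYN segment without an entry: ignored in this sketch
    elif "FIN" in flags:
        entry["state"] = "fin-wait"
    return entry["nat_src"]


flow = ("192.168.0.100", 40000, "172.127.30.14", 443)

opened = client_packet(flow, {"SYN"}, "ether1")          # connection via the first ISP
data = client_packet(flow, {"ACK"}, "ether2")            # after the route change: stale NAT
closing = client_packet(flow, {"FIN", "ACK"}, "ether2")  # client gives up and sends FIN
retry = client_packet(flow, {"SYN"}, "ether2")           # same source port, new SYN

print(opened, data, closing, retry, table[flow]["state"])
```

Only the last call gets the ether2 address, which matches the change of reply-dst-address between the “before” and “after” listings.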

UDP NAT in MikroTik

UDP is not a connection-oriented protocol (unlike TCP). So, MikroTik has no way to know what state the connection is in (or even to know that a UDP packet is trying to establish a new connection). It also has no Type/Code/ID fields (like ICMP) that would allow MikroTik to use a new NAT table entry when one of these fields changes. MikroTik only has the five columns (src-addr, src-port, dst-addr, dst-port and protocol) to decide which NAT table entry to use.

It also has the stream state to change the UDP timeout when necessary. But once MikroTik classifies a UDP NAT table entry as a stream, it has no way to discover that the “stream connection” stopped working and the client is trying to start a new stream. It’s a very limited protocol (lightweight, but limited).

I found more useful info about this in the O’Reilly book “High Performance Browser Networking” written by Ilya Grigorik: https://books.google.com.br/books?id=tf--AAAAQBAJ&pg=PT66&lpg=PT66&dq=tcp+nat+logic+connection-state&source=bl&ots=-YAFtzxD3o&sig=w4xw5gMe4iriqvqvUQbVxbB0Wtg&hl=pt-BR&sa=X&ved=0ahUKEwia0PfCm5nZAhWIG5AKHQmdDtoQ6AEIKDAA#v=onepage&q&f=false.

In my tests, I only discovered three ways to make UDP NATs work again after the default route has changed:

  • Wait for the timeout to expire (when using SIP ALG, however, huge timeouts are necessary, and the server and the client frequently send keepalive messages that don’t allow the NAT table entry to time out - so this cannot be applied in my case);
  • Send new UDP packets from a different source port;
  • Manually remove the NAT table entries.
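
The three options in the list above can be compared in a short Python sketch. It is a toy model with hypothetical names, and UDP_TIMEOUT is just an illustrative value, not a RouterOS default I have verified. Each scenario starts from the same stale entry, created through ether1 before the route change.

```python
# Toy model of a UDP NAT table with a per-entry timer; all names are hypothetical.

NAT_ADDRESS = {"ether1": "10.10.1.2", "ether2": "10.10.2.2"}
UDP_TIMEOUT = 180  # seconds, illustrative value only

SIP_FLOW = ("udp", "192.168.0.100", 5060, "199.87.121.233", 5060)
SIP_FLOW_NEW_PORT = ("udp", "192.168.0.100", 5062, "199.87.121.233", 5060)


class UdpNat:
    def __init__(self):
        self.table = {}  # 5-tuple -> {"nat_src": ..., "expires": ...}

    def send(self, flow, out_interface, now):
        """Return the translated source IP of a packet sent at time `now`."""
        entry = self.table.get(flow)
        if entry is None or now >= entry["expires"]:
            # No usable entry: evaluate the NAT rules for the current route.
            entry = {"nat_src": NAT_ADDRESS[out_interface], "expires": 0}
            self.table[flow] = entry
        entry["expires"] = now + UDP_TIMEOUT  # every packet refreshes the timer
        return entry["nat_src"]


def stale_nat():
    """Return a NAT whose SIP entry was created via ether1 at t=0."""
    nat = UdpNat()
    nat.send(SIP_FLOW, "ether1", now=0)
    return nat


# Keepalives after the route change hit the old entry and keep it alive.
keepalive = stale_nat().send(SIP_FLOW, "ether2", now=60)

# 1. Stay silent until the entry expires, then send again.
waited = stale_nat().send(SIP_FLOW, "ether2", now=UDP_TIMEOUT)

# 2. Send from a different source port.
new_port = stale_nat().send(SIP_FLOW_NEW_PORT, "ether2", now=60)

# 3. Remove the entry by hand, then send again.
nat = stale_nat()
del nat.table[SIP_FLOW]
removed = nat.send(SIP_FLOW, "ether2", now=60)

print(keepalive, waited, new_port, removed)
```

The keepalive case is the reason why waiting doesn’t work in my setup: every packet that matches the stale entry also resets its timer.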

Conclusion

There is no bug in Mikrotik SIP ALG. Actually, it’s a limitation of the UDP + NAT scheme. However, a workaround is needed to get around this limitation.

Workaround

1 - Clean the SIP NAT table on default route changes

As I explained before:

  • The router checks from time to time (via a script that runs in /system scheduler) if the default gateway has changed;
  • If it discovers a change, then it runs the following command:
/ip firewall connection remove [find where connection-type=sip]

After this, all SIP connections will start working again.

Note: As I also explained before, a more complete command (/ip firewall connection remove [find where connection-type=sip or connection-type=sip2]) may be necessary.

2 - Change source port

The second workaround is to use a different UDP port in the client when it realizes the connection with the server is not working anymore. When trying new invites, the SIP client should use a new UDP port to force a new NAT table entry to be created.

Note that this is not the default (SIP clients use port 5060). According to the Wikipedia SIP page (https://en.wikipedia.org/wiki/Session_Initiation_Protocol):

SIP can be carried by several transport layer protocols including the Transmission Control Protocol (TCP), the User Datagram Protocol (UDP), and the Stream Control Transmission Protocol (SCTP).[13][14] SIP clients typically use TCP or UDP on port numbers 5060 or 5061 for SIP traffic to servers and other endpoints. Port 5060 is commonly used for non-encrypted signaling traffic whereas port 5061 is typically used for traffic encrypted with Transport Layer Security (TLS).

Also, changing the source port can break QoS rules or other firewall rules, not be supported by SIP clients, etc. But it may be a good solution.

3 - Change the transport protocol used by SIP

Obviously, I’m not proposing to send the VoIP stream over TCP (it will still use RTP), but only the SIP part.

As an important note, even using TCP only for SIP is strongly discouraged (https://www.onsip.com/blog/sip-via-udp-vs-tcp) and can lead to a lot of problems. Moreover, most SIP clients (and possibly servers) don’t allow SIP over TCP traffic.

Hats off to the author of this post. One of the most detailed and precise firewall/connection tracking related posts that I have seen in this forum.

Thank you for this post, which will most likely explain how NAT works to many users here. Often a situation that is reported as a “bug” is actually a requirement or simply how things work in networking.

Countless times we have heard - reboot resolves the issue. Reboot simply clears connection tracking table.

This was a fantastic deep dive into the quantum mechanics of why this issue occurs. I already understood the behavior of the connection tables / NAT, etc quite well, and why the NAT rules weren’t using the new WAN address on route changes, but I did learn the ultimate root cause of why UDP streams are affected while TCP sockets are not. (i.e. the ICMP identifier and TCP identifiers being present but the absence of those hooks in UDP). I just never sat and pondered the mechanics to the point where I could come to that conclusion as you did, so thanks a lot. I think your post is wiki-worthy. Kudos!

Now on to my response to this whole thing:

I can tell you that this is not something that creeps up in any other nat/firewall box I’m aware of, such as Cisco IOS, Cisco ASA, Netgear, PFSense, Sonicwall, etc etc. So the conclusion is that the connection tracking engine’s architecture on RouterOS is the root cause of this. From reading your analysis and doing a little bit of thinking, it seems to me that the issue could be resolved by adding two more fields to the connection tracking table: in-interface / out-interface. Obviously the firewall filter and mangle rules are processed for packets flowing through the router even when they’re part of active entries in the state tracking table. The issue here is that Mikrotik’s acceleration tactic of skipping the nat table for packets found in the connections list is the fault. So I would think that if the in/out interfaces were part of the table, then the NAT attributes of the packet would no longer match for datagrams following a different path through the router, which would require the router to evaluate the NAT table again, which would result in a new connection entry having the updated NAT results.

Obviously I’m not an engineer for Mikrotik, and there could be a bazillion reasons why this would either not fix the behavior, or would break other things, or lead to poor performance, but at the end of the day, the engine needs to be smart enough to realize that the table entries no longer apply after topology changes occur, and deal with it automatically. Having scripts that either run on scheduler or on event triggers is exactly what your post calls them: a workaround.

Mikrotik’s position of “I canna change the laws of physics” doesn’t fly with me. Other vendors don’t have this problem. If an undesirable behavior comes from the engine, then the engine needs to be tweaked.

I would also posit that another work-around would be to use a “netmap” NAT entry which matches only SIP packets - as netmap is stateless, requiring the router to evaluate the NAT table every time for those packets, but it would be a solution not requiring scripting to patch up. I’ll have to take a stab at that in my lab. However, this would only work* for a single internal host on a static internal IP, so it’s not really useful in real-world deployments of desktop SIP phones.

* The reason is that you must also supply an “un-nat” rule when using netmap, so the internal IP must be known in advance for writing the inbound “un-nat” rule.

Hi all, my 50 cents: a working solution.
Linux/Asterisk crontab script:

=========CUT=========
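#!/usr/bin/bash
# Run from cron. If any SIP registration is stuck in "Request Sent",
# remove the UDP port 5060 entries from the router's connection table.

ret=$(/usr/sbin/asterisk -rx "sip show registry" | grep -c "Request Sent")
if [ "$ret" -eq 0 ]; then
    echo "SIP OK"
else
    # Single quotes keep $i for RouterOS instead of expanding it in the local shell.
    ret=$(/usr/bin/ssh username@192.168.xx.1 ':foreach i in=[/ip firewall connection find dst-address~":5060" protocol~"udp"] do={ /ip firewall connection remove $i } ; quit')
fi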
=========CUT=========

Strods, being aware of the explanation, can we expect an improvement to the OS engine in this regard or at least a special function that can be applied to SIP traffic?
I posted my VoIP issue on the forums the other day and am super happy that my great friend mozer pointed this out to me before I lost more sleep and made any more bizarre rules using dstnat and mangle to try to avoid this issue.

Where do we vote for the post of the decade!!!

Zero Byte,
I am interested in your solution, as I have a single host with a static private LAN IP. Do you have more info?
Remember anybody with a voip modem at home is in this scenario, not so small an audience these days!!

Users with a single fixed IP on a single WAN line should not be affected by the above, so they do not need a fix.
The problem described above only occurs when there are multiple WAN lines and the router switches between them (e.g. due to some failover mechanism) without the client knowing about it.
There is another problem (not the one shown above and not fixed by the above suggestion) that occurs when the external IP address changes (dynamic address) without the router noticing it (i.e. no WAN interface down/up event).
However that is not what you have.

Pe1chl

Where did I state I have one WAN IP? I have two WANs and one VoIP modem on a single static private IP behind the router.

You do raise a good question. My fiber connection has a dynamic IP address, so how do I know for sure that the connection will still work when the IP changes? I am not at the router to see which gateway IP is in use in case it needs to be changed for routing.

I have a single fixed IP on a single WAN and I’ve suffered from this particular bug.

Please elaborate

Hi anav,

When you have two WANs and a VoIP modem using a single private IP behind the router, it is the same case discussed in this post. In your case, your modem works as the SIP client, and your router does the SIP NAT (using the SIP ALG helper).

However, if you have only one WAN connection, and your router receives a dynamic IP address via DHCP, there are two possibilities:

  1. Using src-nat

In this case, you WILL HAVE A PROBLEM when the router IP changes. The wrong SIP NAT rule will be applied, much like in the case initially discussed in this post, because the SIP NAT entry in the NAT table will still be using the old IP.

So, you will have to apply a complex workaround (similar to the proposed in the post).

  • Check from time to time (via a script that runs in /system scheduler) if the IP associated with your WAN interface has changed;
  • If you discover a change, then clean the SIP entries from the NAT table.

  2. Using masquerade

When using masquerade, your router will automatically clean all tracked connections when the IP changes, including the SIP NAT entries in the NAT table. This way, new SIP NAT entries (that will point to the new IP) will be created and your SIP connections will keep working after this.

As sindy already commented about masquerade:

So, instead of applying a complex workaround, I really recommend you to JUST USE MASQUERADE and everything will work.
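If you do have to stay with src-nat, the two-step workaround from the first case boils down to the logic below. This is only a Python sketch of the decision, assuming two hypothetical callbacks (`get_wan_ip` and `flush_sip_entries`) that stand in for reading the WAN address and cleaning the SIP entries; on the router itself this would be a script run by /system scheduler.

```python
def check_wan_address(get_wan_ip, flush_sip_entries, last_ip):
    """One scheduler tick. Returns the address to remember for the next tick.

    get_wan_ip and flush_sip_entries are hypothetical callbacks supplied by
    the caller; they are not RouterOS commands.
    """
    current_ip = get_wan_ip()
    # Flush only on a real change, never on the very first run.
    if last_ip is not None and current_ip != last_ip:
        flush_sip_entries()
    return current_ip

# Simulated run: the WAN address changes on the third tick.
observed = iter(["198.51.100.1", "198.51.100.1", "203.0.113.1"])
flushes = []
last = None
for _ in range(3):
    last = check_wan_address(lambda: next(observed),
                             lambda: flushes.append("flush"),
                             last)
print(len(flushes))  # 1
```

The shorter the scheduler interval, the shorter the time the stale entry survives, at the cost of running the check more often.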

Hi rarlyson,

Thanks for a great thread by the way!!
I watched this presentation with interest, captured some Wireshark-type data, and discovered (or so it seems) that the modem is already SIP NAT aware: before the ALG was applied at layer 7, it was clear the modem was aware of the public IP.
https://mum.mikrotik.com/presentations/US17/presentation_4321_1496084451.pdf
https://www.youtube.com/watch?v=tM7wyKdnIKA&feature=youtu.be

In any case I was testing different routers and IP routes and switching back and forth between WANIPs and I noted that the modem would get stuck such that it and the sip server would try to maintain connectivity over the fail over WAN for example and not use the primary which basically shut down operations. This in spite of being a nat aware modem and despite the mikrotik layer 7 ALG.

Thus I think it falls within the scope of the issue you discussed.
By the way I have two srcnat rules and both use action=masquerade so it is no fix (yes sindy can be wrong, touch wood).

What I am most curious about is the post by ZERO BYTE, where he said he could prevent the scenario with a netmap NAT entry but never came back to explain the findings of the test he said he was going to conduct. Thus far I see no easy path to resolve… Do you know what he was talking about or where I can pursue this line of reasoning? To be frank, it works 99% of the time and I only noticed the issue due to (a) changing routers from a Zyxel router to the now-in-place hEX router and (b) of course all my testing. I was kind of happy to know that I was not crazy and that it’s a legitimate issue; I had started making way-out-there dstnat and mangle rules to no effect.

I also should note that a mikrotik rep, strods noted your post and the discussion but we see no changes in place or forthcoming?? So I added it to the beta suggested issues thread…

I can be wrong and when I am I have no problem to admit it, but in this particular case, you may be mixing several things together.

The original purpose of SIP registration is that the SIP end device informs the exchange about its current IP address so that the exchange knows where to send eventual incoming calls. If the device is on a private address behind a public one, one problem is that the device’s own (private) address in the REGISTER message is useless for the exchange. So there are three ways to deal with that:

  • the device itself uses STUN to determine how the NAT behaves and what the public address is (which may not be a simple task, think about load balancing on WANs with different IP addresses) and puts that public address into the REGISTER message
  • the NAT in the router uses an ALG, modifying the SIP message contents (the device puts its private address into the message and the ALG replaces it with the public IP of the WAN it uses to forward the message)
  • the exchange notices that the source socket address of the incoming REGISTER message differs from the one received inside the message and remembers both (so it sends messages to the source socket of the REGISTER and uses the address received inside the REGISTER inside the messages it sends)
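The third option can be sketched in a few lines of Python. This is only an illustration of the principle, with a deliberately naive Contact parser and made-up names and addresses; it is not how any particular exchange implements it.

```python
import re

def contact_host(register_text):
    """Return the host in the first Contact header, or None.

    Naive parser for illustration: handles 'Contact: <sip:user@host:port>'
    but not display names or other header forms.
    """
    m = re.search(r"^Contact:\s*<?sips?:(?:[^@>\s]+@)?([^:;>\s]+)",
                  register_text, re.MULTILINE | re.IGNORECASE)
    return m.group(1) if m else None

def learn_binding(register_text, source_ip, source_port):
    """Remember both the observed source socket and the advertised address."""
    advertised = contact_host(register_text)
    return {
        # Where the exchange actually sends its messages.
        "send_to": (source_ip, source_port),
        # What it keeps using inside the messages it sends.
        "advertised": advertised,
        "behind_nat": advertised is not None and advertised != source_ip,
    }

REGISTER = (
    "REGISTER sip:sip.example.com SIP/2.0\n"
    "Via: SIP/2.0/UDP 192.168.1.10:5060\n"
    "Contact: <sip:1001@192.168.1.10:5060>\n"
    "Expires: 1200\n"
)

binding = learn_binding(REGISTER, "203.0.113.1", 1024)
print(binding["send_to"], binding["advertised"], binding["behind_nat"])
```

Note that the binding is only refreshed when a new REGISTER arrives, which is why the timing of re-registration matters so much in what follows.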

Besides, some exchanges check whether outgoing calls of a particular user account come from the same socket address through which that user account has previously registered, and do not accept call initiation requests coming from any other socket.

So e.g. if the device registers for 20 minutes, and before the 20 minutes expire, the WAN address changes, the device does not learn about the change so it does not know that it has to re-register. So until the registration expires, the device cannot receive incoming calls because the exchange sends them to the old address, and it may even be unable to place calls itself because the exchange sees the call initiation request coming from a different address than the last preceding REGISTER.
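As a quick worked example of that gap, assuming the device re-registers only when the registration expires (the function name and the 5-minute figure are just illustrative):

```python
def stale_window(expires_s, change_after_s):
    """Seconds between a WAN address change and the next scheduled REGISTER.

    Assumes the device sends its next REGISTER only at expiry.
    """
    if not 0 <= change_after_s <= expires_s:
        raise ValueError("change must fall inside the registration period")
    return expires_s - change_after_s

# Registered for 20 minutes, WAN address changes 5 minutes later.
print(stale_window(20 * 60, 5 * 60))  # 900
```

So in this case the exchange would keep sending incoming calls to the old address for up to 15 minutes.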

So the use of masquerade instead of src-nat only addresses one possible issue, which is that an already established connection on the firewall remembers the src-nat address assigned when the connection was set up - as explained earlier, use of masquerade causes all connections to be removed if the address changes, so the next REGISTER from the device establishes a new connection rather than updating the old one, and thus the current WAN address is assigned to that connection.

But the use of masquerade only affects the behaviour of the firewall itself. If the device has previously determined, using STUN, the public address it gets, there is a chance that it never updates that information and keeps using the old one in its messages (until it is restarted or at least its Ethernet connection goes down and up again). I have no idea how Mikrotik’s SIP ALG handles addresses other than the device’s own in the SIP messages from devices on the LAN - maybe it only replaces the device’s own address with the WAN’s address and leaves any foreign addresses alone. You’d have to use packet sniffing and analyse the packets using Wireshark to find out.

What I’m trying to say is that it never came to my mind to use more than one of the three methods of dealing with the customer-side NAT simultaneously. While ALG and STUN can each coexist peacefully with exchange-side auto-learning, I’m afraid they may be incompatible with each other. As STUN is by principle less reliable than ALG, I would choose ALG out of the two. And because many ALG implementations are buggy, if the exchange supports auto-learning, the best approach is to let it deal with the customer-side NAT itself and not use even the ALG.

But even if your VoIP provider’s exchange does support auto-learning, the problem of WAN address change between registrations remains. So a script taking down and up the Ethernet to which the phone is connected when the WAN address changes, or a very short DHCP lease time for the phone, may be necessary to shorten the gap. The best in my understanding would be that you could synchronize the lease times of the local DHCP server with the lease times of the DHCP client on the WAN interface, but nothing like this exists in RouterOS.

Hi Sindy, thanks for the explanation!
I too am very curious as to the interplay between the modem configuration which I have no control over and the ALG.
Do you know what ZeroByte was alluding to with his netmap comments?