The Problem
The problem probably lies in connection tracking. Interestingly, it only happens on a PPPoE connection; on PPTP or plain Ethernet connections the issue can't be reproduced.
The issue was already described in this topic, and there is a workaround: reboot the router or create a new PPPoE interface. Every other option (creating the connection manually, deactivating and reactivating session tracking, and so on; please review the thread) has already been tried without success. Various RouterOS versions have been tried!
A short summary with the most important details was provided by this post:
viewtopic.php?f=2&t=127858
Last but not least: today the problem also occurred when the PPPoE connection was down for only a couple of seconds, so it wasn't the classic 24-hour forced disconnection. Re-initiating and rebooting the PBX didn't help, and clearing the connection manually etc. did not resolve the issue either.
Hi Forum Users, I am delighted to know I am not the only person experiencing this.
Issue Summary
* Same issue. When running a PPPoE tunnel over VDSL, if the VDSL tunnel drops / re-auths, the trunk becomes unreachable until the router has been rebooted.
* The issue is NOT limited to NAT / PBXs on private networks. It also affects systems on PUBLIC IPs.
* All other TCP/UDP traffic remains unaffected and continues to pass.
* SIP debug: no apparent SIP responses are received by either side (e.g. OPTIONS, INVITE).
* Capture via tcpdump reveals that packets are being sent by both instances of Asterisk but nothing is being received on the remote end.
* MikroTik Conntrack shows the session but no repl bytes / packets are recorded. This is further reflected by a lack of 'Seen Reply' flag.
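
To make the last point concrete, here is a minimal Python sketch of the check we do by eye on the connection list: find UDP entries on the SIP ports that have outgoing packets but no reply packets and no 'Seen Reply' flag. The `ConntrackEntry` class, its field names and the sample rows are made up for illustration; they are not an export format of RouterOS, so you would have to fill them from whatever export you use.

```python
from dataclasses import dataclass


@dataclass
class ConntrackEntry:
    """One connection-tracking row; field names are illustrative only."""
    protocol: str
    dst_port: int
    orig_packets: int
    repl_packets: int
    seen_reply: bool


def is_one_way(entry: ConntrackEntry) -> bool:
    """True when packets were sent but no reply was ever counted."""
    return entry.orig_packets > 0 and entry.repl_packets == 0 and not entry.seen_reply


def stuck_sip_entries(entries, sip_ports=(5060, 5080)):
    """Return the UDP entries on the given SIP ports that never saw a reply."""
    return [
        e for e in entries
        if e.protocol == "udp" and e.dst_port in sip_ports and is_one_way(e)
    ]


if __name__ == "__main__":
    # Made-up sample rows, not real router output.
    sample = [
        ConntrackEntry("udp", 5060, orig_packets=42, repl_packets=0, seen_reply=False),
        ConntrackEntry("udp", 5060, orig_packets=10, repl_packets=9, seen_reply=True),
        ConntrackEntry("tcp", 443, orig_packets=5, repl_packets=0, seen_reply=False),
    ]
    for entry in stuck_sip_entries(sample):
        print(entry)
```

Only the first sample row is reported, which matches what we see on the affected routers: the session exists but is strictly one-way.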
The following steps have been attempted to determine the cause and a workable solution at the customer site WITHOUT REBOOTING. They have NOT worked.
* Reset sessions in MikroTik Conntrack.
* Stop Asterisk for 10 Minutes
* Reboot Asterisk
* Reboot Hypervisor
* SIP ALG on/off (tried both, does not matter).
* Static Default Route (with pref src set).
* PPPoE Dialler profile set to 'Default'.
* Redirected and retargeted to 5080 (translated remotely to 5060): the trunk becomes reachable until a subsequent disconnect/reauth.
* Redirected and retargeted to 5060: the trunk remains unreachable.
* Added port forward udp::5080->udp::5060
The following steps have been attempted to determine the cause and a workable solution at the provider site. They have NOT worked.
* Added redirect (IPTABLES POSTROUTING) ports from 5080 -> 5060 on Trunk box.
* Changed customer target port to 5080. The trunk becomes reachable until a subsequent disconnect/reauth.
* Changed customer target port back to 5060. The trunk remains unreachable.
Supplemental.
* I set up the same test conditions in my lab, with an RB2011, PPPoE (over true fibre optic), VMware Workstation and a FreeBSD 10.3 / Asterisk 13 server. I could not reproduce the error.
* Routed a public /30 to the customer.
* Added vlan interface to MikroTik w/ public IP.
* Added vlan interface / portgroup to PBX.
* Assigned public IP to PBX.
* Changed last resort gw to new public /30.
* Removed NAT rules, specific to SIP / VoIP.
* Reconfigured SIP configs to listen/connect via/on new public IP.
* Established bi-directional trunk.
* Forced disconnect/re-auth of PPPoE.
* Trunk becomes unreachable until Reboot.
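
For anyone who wants to repeat the reachability check after a forced disconnect/re-auth without touching the PBX, a rough Python sketch of a SIP OPTIONS probe over UDP follows. It is only an illustration under my own assumptions: the `probe` user part, the 3 second timeout and the function names are hypothetical, and the request carries just the headers RFC 3261 requires.

```python
import socket
import uuid


def build_options(target_host, target_port, local_host, local_port):
    """Build a minimal SIP OPTIONS request as raw bytes."""
    branch = "z9hG4bK" + uuid.uuid4().hex[:16]  # RFC 3261 magic cookie prefix
    lines = [
        f"OPTIONS sip:{target_host}:{target_port} SIP/2.0",
        f"Via: SIP/2.0/UDP {local_host}:{local_port};branch={branch};rport",
        "Max-Forwards: 70",
        f"From: <sip:probe@{local_host}>;tag={uuid.uuid4().hex[:8]}",
        f"To: <sip:{target_host}:{target_port}>",
        f"Call-ID: {uuid.uuid4().hex}@{local_host}",
        "CSeq: 1 OPTIONS",
        f"Contact: <sip:probe@{local_host}:{local_port}>",
        "Content-Length: 0",
        "",
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


def probe(target_host, target_port=5060, timeout=3.0):
    """Send one OPTIONS request; return the reply's status line or None."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.connect((target_host, target_port))
            local_host, local_port = sock.getsockname()
            sock.send(build_options(target_host, target_port, local_host, local_port))
            data = sock.recv(4096)
        except OSError:
            # Covers timeouts, ICMP port unreachable and routing errors.
            return None
    return data.split(b"\r\n", 1)[0].decode("ascii", errors="replace")
```

Calling `probe()` with the trunk address before and after the re-auth should return a status line in the first case and `None` in the second if the behaviour described above is present. Any status line at all, even an error response, means replies are getting through again.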
My thoughts.
* I suspect that, after the disconnect/reauth, the MikroTik's kernel is no longer processing the SIP packets, irrespective of the port used prior to the disconnect.
* It appears the session becomes stuck in the kernel, likely because the internal RouterOS interface / session identification no longer exists.
As a VoIP provider that mainly uses MikroTiks at customer sites, it is really frustrating to manually recreate such interfaces or reboot the MikroTik. Both kill the internet connection all over again and give us a bad reputation with our customers. For now we have implemented a nightly reboot script to avoid this. Sadly, a short downtime on a DSL line is inevitable, so because of this bug we cannot automate the process without resetting every connection. In addition, various filter rules and NAT rules are bound to the PPPoE interface, which means a script that creates a new PPPoE interface is not a viable option either.
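
To illustrate what "automating" the workaround would look like at best, here is a small Python sketch of the decision logic only: trigger the reboot after several consecutive failed trunk checks instead of blindly every night. The class name `FailureGate` and the threshold of 3 are assumptions for the example; how the check is performed and how the router is rebooted are deliberately left out. Note that it does not avoid the downtime, it only limits reboots to the moments the trunk is actually dead.

```python
class FailureGate:
    """Trip only after `threshold` consecutive failed checks."""

    def __init__(self, threshold: int = 3):
        if threshold < 1:
            raise ValueError("threshold must be at least 1")
        self.threshold = threshold
        self.failures = 0

    def record(self, reachable: bool) -> bool:
        """Record one check result; return True when the workaround should run."""
        if reachable:
            self.failures = 0  # any success clears the streak
            return False
        self.failures += 1
        if self.failures >= self.threshold:
            self.failures = 0  # start counting again after tripping
            return True
        return False
```

A single lost OPTIONS reply therefore does not cause a reboot; only an uninterrupted run of failures does.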
Please look into this!