Daily Disconnect

I have had my Mikrotik 750G router installed for a couple of days and so far it is working really well. I have two DSL connections bonded through it and then it connects into my Pix firewall. I had a Cisco where it was before and the Mikrotik is working much better. I have a Blackberry enterprise server on my network behind the Pix and a Microsoft Exchange server. It seems that every day, sometime around 3:00 a.m., the Mikrotik stops allowing outgoing traffic properly. When I get up in the morning I won’t have any messages on my Blackberry past 2:35 or 2:45. The strange thing is that mail is still going to my Exchange server so I know the connection is working. If I start a continuous ping to a host on the internet, many packets are lost, usually about 50% or more.

Doing a system reboot from the Mikrotik gets the connection working again although the ping times are much higher than normal. I think a full power off of the system with the plug fixes that. I also have a free account with Pathview Cloud that basically monitors my internet uptime and performance and sends me an e-mail if the internet is down; that is how I knew the Cisco kept losing connection. When I have these problems with the Mikrotik, I never get any of these alerts, even if I reboot it, presumably because the reboot is so quick.

Has anyone run into problems like this or know why it might be happening?

Thank you very much

Neal

I had these problems with a customer last year. He was accusing the Mikrotik unit of malfunctioning, and hired me to replace it. It was not the MT router. The ISP connection was faulty. I was losing 33% of ping packets to the unit from an external source. Internally, it was fine. The owner called his ISP, they sent a tech out, and he replaced the bad CAT5 connector on their equipment, and now all is well.

How does the log look after one of these episodes? Is it showing a reboot or anything? That was my giveaway. The log on the suspect unit showed the router had not been down at all, just the connection.

That makes sense, thanks. I don’t suspect the router is bad since everything else seems to work fine. I assume that one of the DSL lines is probably going down temporarily. The strange thing to me is that when this (or whatever else) happens, the connection seems to stay up: traffic is still coming in, since I am getting e-mails, and some traffic must be going out, since I don’t get any alerts from Pathview. With the old router, if I lost my connection for even a minute I would get alerts.

I will look at the logs tomorrow morning since I assume it will probably happen overnight again. If one of the links goes down in the mlppp bundle will the connection stay up and will it fix itself if the other link comes up again? I know if both go down I will lose it but I am wondering how resilient it is to a single link failure.

I have been going through the logs when this happens and there doesn’t seem to be anything logged about it. I have enabled syslog capturing of all events and nothing yet.

Both connections seem to remain active and if I check the monitor settings in the Mikrotik they seem fine, yet my connection is still becoming unreliable at about 3:00 a.m. Even though the connection shows as active I end up losing at least 50% of my ping packets.

Is there a way to have the Mikrotik watch for something like this and automatically reset itself if it occurs or does anyone have any idea why this might be happening?

The lack of log entries is usually a good thing. That was my giveaway. No reboot. No internal failures listed.

Have you tried traceroute? It took a bit to see where my problem was because it was, like yours, intermittent. If you can use traceroute when you are having connection problems, that will help, especially if you can traceroute from inside the localnet and from an outside source. They should almost meet somewhere during the outages.

I agree that the lack of logs sounds like a good thing. I decided to stay up late tonight so I could call the ISP when the problem happened. They have 24 hour support. Sure enough, at about 3:00 I noticed that my internet didn’t seem to be working and the continuous ping I had been running for the last few hours was losing packets. It would respond twice, then miss three times, then respond once etc. It was random but looked like it lost half the packets. The ISP ran tests on both of my DSL lines and said they looked fine. There was no heavy congestion on the network or anything.

Normally I had been restarting the router by doing a system reboot but this time I did a set 0 to reset the bonded PPPoE connection. It took 2 seconds and the connection started working fine again. The only things that were logged in the syslog server were the events saying I terminated the connection and it reconnected itself. I had enabled info syslogging, which to my knowledge would have meant any info or above events. At least that is how the Cisco devices handle it. The pings started working fine again and my internet was good. I didn’t reset either of the DSL modems and I didn’t power cycle the Mikrotik.

It seems like something screws up in the PPPoE on the router that causes the connection to drop and it doesn’t re-establish itself. Do I have the logging set up properly or am I missing something that is causing me to not receive the necessary events? Is it also possible to have the firewall watch for packet loss and, if there is any, run a command? Worst case, is it possible to schedule a command to run at a certain time? If I can’t figure out the problem, an option might be to have the interface reset at 3:00 every morning and hope that at least resets the connection and gets it working again. Not the ideal solution obviously, since it doesn’t fix the problem and if this starts happening at a different time I will have the problem all over again, but I am just looking for options now.

I also upgraded the router to 4.5 when I got it but the routerboard version says 2.23 when I do the system routerboard print. When I do the system package print it says 4.5 for all the components including the routerboard. The strange part is that the actual system routerboard print doesn’t show that version and the upgrade firmware says that it is only 2.23. I’m not sure if this could be related but I am trying to provide any information that could help. Would going back to an earlier release be a possible solution? I always like running the newest versions if possible which is why I did the upgrade originally.

Any help would be greatly appreciated.

Thanks

Neal

The firmware and the OS will have different versions.

Did you try traceroute to see where the ping packets were being dropped? Even if traceroute doesn’t fail, it should show a delay at the location causing the drop. If it is your public ip, then you know it is your router. If it makes it to the ISP gateway, then it is your ISP’s problem.

The other way to check is set up a script to ping your public gateway (the one controlled by your ISP) and then a remote location, and log the events. If the gateway ping is ok and the remote ping fails, it is once again your ISP. If both fail, it could be your router. If you are unfamiliar with scripting, I can lend a hand.
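A minimal sketch of such a logging script, in case it helps. It is untested; the two addresses are placeholders from the documentation ranges (assumptions, not real hosts), and the "WANcheck" tag in the log text is just a made-up label. It relies on /ping returning the number of replies received when it is called inside a script:

```routeros
# Placeholders (assumptions): replace with your ISP gateway and a remote host.
:local gwHost 192.0.2.1
:local remoteHost 198.51.100.1

# /ping returns the number of replies received.
:local gwOk [/ping $gwHost count=5]
:local remoteOk [/ping $remoteHost count=5]

# Write both results to the log so they can be compared after an outage.
:log info ("WANcheck: gateway " . $gwOk . "/5, remote " . $remoteOk . "/5")
```

Run from the scheduler every few minutes, this leaves a trail in the log: gateway replies fine but remote replies missing points at the ISP, while both missing points at the router or the line.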

ADD: I like the traceroute solution better. When you call your ISP tech support, and tell them at what ip the failure is happening, they will not question your competence again.

The strange thing is the disconnect didn’t happen the same way last night. At 2:53 it did lose its connection again but instead of intermittent pings I had a continuous string of failing pings. I logged into the router and it still said it was connected. I also checked the DSL modems and when I got back to my computer after doing that, the connection was working again. I never lost it for hours like I had every other night. Since it came back up so quickly I didn’t get a chance to do the other ping and traceroute tests you had mentioned, which I would like to try if it happens again.

I was also wondering, how big a deal is it to downgrade the OS back to 3.3? I spoke to someone else from my area that uses some of these routers with the same ISP that I use and he says that he has had none of these problems. The only thing he mentioned is that he has never run one with an OS later than 3.3. I upgraded the router as soon as I got it, before I started using it, so I am wondering if there might be a problem in this version with this setup. Would it be worth trying this to see if things are more stable?

Thanks.

It sure sounds like your ISP’s connection, but you won’t really know until you can get a traceroute while it is down like that. That scenario would have been perfect had you been ready with traceroute. Have you tried it just to see what it looks like normally?

If I run the traceroute from the Mikrotik it works fine and responds back quickly. I can’t run it from my computer directly, I assume because my firewall blocks the responses. I agree that it would have been a good chance to test that last night. Usually it will stay down till I reboot, so I didn’t expect it to start working again so quickly.

I assume it will happen again tonight and even though I am not going to stay up for it this time, I will do the tests when I get up tomorrow to see if I can isolate the problem.

I would also find someone that has a box that will run traceroute from a remote WAN connection. I finally found mine by running traceroute to the Mikrotik router from outside. When the outages/losses were occurring, traceroute stopped/slowed just short of the ISP’s gateway device on the customer’s end. That is why I mentioned running traceroute when it is up, so you can see what devices you are going through all the way to the router. Pay close attention to the last couple before the router!

Thanks for the suggestions. I will try some diagnostics the next time it goes down to see if I can identify this better. I was looking at the config of my pppoe interface. My max-mtu and max-mru are both set to 1480. I know there can be some instances where the mtu should be set to a different value but I am wondering if this needs to be done on the Mikrotik routers. The other thing: when I do a monitor 0 for that interface, the mru still shows 1480 but the mtu shows 32719. That seems a little strange to me but I haven’t worked with the Mikrotik routers before so I am not sure if this is normal or not.

OK, I am not really sure what to do here. It lost its connection at some point just after 3 this morning again but for the second night in a row, the connection came back on its own. I am fine with this happening over night if the connection restores itself, I just don’t want the long downtime if it doesn’t come back.

You had mentioned scripting. Is there a way to create a script to watch the connection and, in the event that a lot of ping packets are lost or the ability to reach a certain host is compromised, have the router reset itself? The reason I am wondering is that I am going away for a few days starting tomorrow and I am worried that the connection will go down and not come back up again while I am away. Since I won’t be here, I won’t be able to reset the connection to get it working again.

Thanks

Neal

Your original post mentioned a pppoe connection. Is that set up in “/interface pppoe-client”? I would rather reset the connection than reboot the router if possible. How do you reset the connection after a failure with the “set 0” command?

You’re right, it is using the interface pppoe command. Whenever I have had the problem up till now I do a reset 0 which works fine. I agree I would like to be able to automate a reset of that interface if communication stops working properly but I am not sure if that is possible. A reboot of the router isn’t required since that command works but the main problem is that if I am not here, I can’t do it remotely.

http://wiki.mikrotik.com/wiki/Netwatch

Netwatch is built in and can monitor an IP address and fire a script (such as resetting as you are) when the IP address becomes unreachable.

And if you only have high packet loss but not reliably a complete communication loss, “/ping” returns the number of successful probes:

[admin@MikroTik] > :local success [/ping 10.2.0.1 count=3]; :put "Successful pings: $success";
10.2.0.1 64 byte ping: ttl=64 time=1 ms
10.2.0.1 64 byte ping: ttl=64 time=3 ms
10.2.0.1 64 byte ping: ttl=64 time=1 ms
3 packets transmitted, 3 packets received, 0% packet loss
round-trip min/avg/max = 1/1.6/3 ms
Successful pings: 3
[admin@MikroTik] > /ip firewall filter add chain=output protocol=icmp action=drop
[admin@MikroTik] > :local success [/ping 10.2.0.1 count=3]; :put "Successful pings: $success";
packet rejected
packet rejected
packet rejected
3 packets transmitted, 0 packets received, 100% packet loss
Successful pings: 0

So you can write a script that pings 60 times and checks if more than 30 probes were lost and reset on that condition, and use the scheduler to run that. The below may work, it’s untested:

 /system scheduler add name=WANwatch start-time=startup interval=10m on-event=":if ([/ping count=60 ip.to.monitor] <= 30) do={/interface pppoe-client disable 0; /interface pppoe-client enable 0}"

Hi fewi: Nice job, except I am not certain if the script will disable/enable the interface using the line number without a “print” prior to using it. You will probably need to do a “find” to access the interface.

ADD: Like this
/interface pppoe-client disable [/interface pppoe-client find name=interfacename]

Thank you very much for the suggestions. Sure enough, last night the disconnect happened again. The last message I got was at about 2:30 a.m. I started some pings and traceroutes to my gateway first thing this morning. If I did a continuous ping to my gateway I was getting about 40% to 50% packet loss. A traceroute to a host on the internet showed that there were a couple of delays at my gateway and then delays throughout the transmission to the host, presumably because of the gateway delay again. To me, this shows that the problem is related to my connection or router.

I tried the suggestions you made, which were great. Unfortunately the Netwatch won’t work because it isn’t a complete failure of the link, just very high packet loss. I tried the other script you suggested and it executes but it doesn’t seem to reset the interface. I changed the value to <=15 so that 15 or more lost packets would trigger the script, since it is not always failing 50%, and I changed the reset command to /interface pppoe-client set 0 since resetting the interface usually brings it back up. I don’t generally have to disable it and re-enable it. To me, this made the command simpler, with fewer commands to run. Any ideas why this command doesn’t work, or is there a way to see the outputs of a script to see where it might be failing?

I would like to get this resolved before I leave this morning but if I can’t figure it out, I might have to connect my old router for now and work on this problem again when I get back. Any suggestions would be greatly appreciated.

Can you call your ISP and find where their gateway device is?
Just to check, have you replaced the cable between the router and the modem to see if it is faulty?

If you used the line numbers, the script will not work. You need to use the “find” to get it to disable/enable the interface.

ADD: If it isn’t triggering, change
:if ([/ping count=60 ip.to.monitor] <= 30)
to
:if ([/ping count=10 ip.to.monitor] <= 9)

That way if you get 10% or more drops (at least one lost ping out of ten), it will trigger.
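Putting the “find” and the lower threshold together, a sketch of the whole check might look like the following. It is untested; pppoe-out1 and 192.0.2.1 are placeholders (assumptions) for your own interface name and the address to monitor, and the "WANwatch" log text is only a label. The log lines are there so you can see in the log whether the script ran and what it counted:

```routeros
# Placeholders (assumptions): set these to your own interface and target.
:local iface "pppoe-out1"
:local target 192.0.2.1

# /ping returns the number of replies received out of the 10 sent.
:local received [/ping $target count=10]
:log info ("WANwatch: " . $received . " of 10 pings answered")

# 9 or fewer replies means at least one ping was lost.
:if ($received <= 9) do={
    :log warning ("WANwatch: resetting " . $iface)
    # Look the interface up by name instead of using a line number.
    /interface pppoe-client disable [/interface pppoe-client find name=$iface]
    :delay 2s
    /interface pppoe-client enable [/interface pppoe-client find name=$iface]
}
```

Note that the comparison is against the number of replies received, not the number lost, so lowering the number makes the check less sensitive rather than more.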