Diagnosis suggestions

G’day all,
I have a problem at the moment with the stability of my internet connection and would like suggestions as to how I can find out exactly where the problem is.
Some preamble.
My setup has been running for a few years and has only recently started playing up.
The problem started happening after the COVID lockdown caused schools to close (I’m in Sydney, Australia).
The problem occurs randomly; I have not been able to find anything that triggers it. It can happen at any time of the day or night and does not seem to be load or time dependent. Having said that, it does seem to happen more frequently on weekdays when everyone is home schooling and working from home. It is possible that we simply notice it more then, because we are using the connection and need it back. We do regularly wake up and find that the internet is down.
Our setup is a Netgear CG3000v2 cable modem, firmware version 2.08.07, which is in bridge mode.
The modem is connected directly to ethernet port 1 of a Mikrotik RB951G-2HnD, firmware version 6.45.9, which is acting as the router and DHCP server for the house. It is set to be a DHCP client and is emulating the modem’s MAC address.
The problem.
The issue is that we will suddenly lose connection to the internet. If I check the cable modem, all the lights show the correct status but I am unable to connect to it via a web browser. The Mikrotik similarly shows no errors externally. I can log into the Mikrotik and everything seems normal. Nothing in the log shows any errors at all.
To fix the issue, I simply disconnect the ethernet cable between the modem and the router, wait a few seconds and plug it back in. Then within a few seconds we are back up and running.
I have replaced the power supplies for both the modem and the router.
My request
My feeling is that the problem is with the modem or my ISP but I would like to be able to prove this.
Can anybody give me some help with troubleshooting this issue?
Is there some sort of extra logging that I can turn on?
Is there something that I can examine at the time of failure to further diagnose what’s going on?
Any help would be much appreciated.
Thanks in advance.

Blessings
Matt

A well-structured and concise description in a first ever post - appreciated.

What is missing is whether the WAN interface of your Mikrotik uses a PPPoE connection via the Ethernet port connected to the bridge, or whether a DHCP client is attached to that Ethernet port. The initial diagnostic steps will be almost the same for both variants.

To diagnose, set in advance a file name for the /tool sniffer, which is enough to make it write the sniffed packets to Mikrotik’s disk. Then, when the issue shows up:

  • run, in a command-line window, /tool sniffer quick interface=the-ethernet-interface-name, let it run for two to three minutes, then stop it (Ctrl-C), download the file and later open it in Wireshark to see what is going on.
  • if you use PPPoE at WAN, check the output of /interface pppoe-client print detail; if you use DHCP, check the output of /interface ethernet print detail and the output of /ip dhcp-client print detail. If the output doesn’t seem helpful to you, post it here, replacing the IP address with some text if it happens to be a public one.
    As you mention a DHCP server for LAN clients, I omit the possibility that you have an IPv6 WAN.

Thank you Sindy, I will run those commands next time it fails and report back.
I’ve also added the following to my original post:

..It is set to be a DHCP client and is emulating the modem’s MAC address.

Blessings
Matt

It could be a nasty DNS issue; just try another DNS server, and/or put a different one first in the list.
And: do a ping test to an IP in the Internet, for example “ping 8.8.8.8” (on Windows you can use “ping -t 8.8.8.8” for continuous pinging).
If that’s ok and web connects aren’t, then it definitely is a DNS issue.
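If you would rather script that check than watch a ping window, here is a minimal Python sketch of the same reasoning. It is only an illustration: ICMP ping needs elevated privileges from Python, so the sketch assumes a TCP connection to 8.8.8.8 on port 53 is an acceptable stand-in for the ping, and the host name example.com and all function names are my own choices.

```python
import socket


def classify(ip_reachable: bool, dns_resolves: bool) -> str:
    """Turn the two test results into the conclusion drawn in the post above."""
    if not ip_reachable:
        return "no IP connectivity, so DNS is not the (only) problem"
    if not dns_resolves:
        return "IP connectivity is fine but names do not resolve: DNS is suspect"
    return "IP connectivity and DNS both look fine"


def tcp_reachable(ip: str, port: int = 53, timeout: float = 3.0) -> bool:
    """Stand-in for 'ping <ip>': try to open a TCP connection to a raw IP address."""
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return True
    except OSError:
        return False


def resolves(hostname: str) -> bool:
    """Ask the system resolver (i.e. the configured DNS servers) for the name."""
    try:
        socket.getaddrinfo(hostname, None)
        return True
    except OSError:
        return False


def check_now() -> str:
    """Run both tests once; the target address and host name are assumptions."""
    return classify(tcp_reachable("8.8.8.8"), resolves("example.com"))
```

Calling check_now() during an outage and again when things work gives two results to compare. In this thread the modem's own web page also becomes unreachable, so the first answer is the more likely one.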

If it happens with reasonable frequency you could try plugging your PC directly into the cable modem overnight and see if it is locked up in the morning.
Also, is there any particular reason you’re running a few versions behind in ROS? - never mind, I see you’re running Long Term versions.

Sindy, I have all the requested logs but I don’t know how to replace my public IP address in the sniffer log.
Can you provide instructions on how to do this?
Thanks in advance

I worked it out, files attached.
My public address has been replaced with 122.106.0.0.
Thank you again for your help.

Blessings
Matt
202005141243.terminal.txt (2.98 KB)
202005141243.obfuscated.txt (2.99 MB)

Questions:

  • what exactly did you do two minutes after starting the sniff, and how long had the connection already been down before you started sniffing?
  • was the /ip dhcp-client print detail output taken before starting the sniff or after finishing it?

The point is that during those first two minutes, the router just tries to get the MAC address of the gateway in the WAN subnet, and gets no responses; the link itself is not down, because the capture shows some other device getting a DHCP NAK. Two minutes into the capture, the Mikrotik requested an IP address from scratch (not a renewal, just a request for a new one), got it, and everything has gone smoothly since. However, the lease time indicated in the capture is 1d22h39m, whereas the /ip dhcp-client print detail shows a time about 45m longer (1d23h23m), which is very strange. On top of that, the time shown there is what actually remains until expiry, not the lease duration indicated when the lease was granted.

Normally, the DHCP client starts attempting to renew the lease once half the lease time has expired. So run /ip dhcp-client print detail several times, watch expires-after decrease, and note the remaining time somewhere. At roughly the same time the next day, check whether it has risen again (which would indicate that a successful renewal took place in the meantime) or whether it shows about 1d less than the day before and status shows renewing… rather than bound, which would indicate that all the renewal attempts have failed so far. The client normally repeats the renewal request several times before reverting to asking for any address rather than renewing the lease of the current one, but all this normally happens before the original lease expires.
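To make the comparison of two readings less error-prone, here is a small Python sketch. It assumes expires-after is printed in the 1d22h39m style quoted above (weeks, days, hours, minutes, seconds) and that renewal starts at half the lease time; the function names are mine, not anything RouterOS provides.

```python
import re

# Seconds per unit, largest first, as used in values such as 1d22h39m.
_UNITS = {"w": 604800, "d": 86400, "h": 3600, "m": 60, "s": 1}


def parse_duration(text: str) -> int:
    """Convert a duration such as '1d22h39m' to seconds."""
    parts = re.findall(r"(\d+)([wdhms])", text)
    if not parts or "".join(n + u for n, u in parts) != text:
        raise ValueError(f"unrecognised duration: {text!r}")
    return sum(int(n) * _UNITS[u] for n, u in parts)


def format_duration(seconds: int) -> str:
    """Convert seconds back to the same compact notation."""
    out = []
    for unit, size in _UNITS.items():
        count, seconds = divmod(seconds, size)
        if count:
            out.append(f"{count}{unit}")
    return "".join(out) or "0s"


def renewal_due_after(lease: str) -> str:
    """Time after which the client is expected to start renewing (half the lease)."""
    return format_duration(parse_duration(lease) // 2)


def renewed_between(earlier_reading: str, later_reading: str) -> bool:
    """True if expires-after went up between two readings, i.e. a renewal succeeded."""
    return parse_duration(later_reading) > parse_duration(earlier_reading)
```

For the lease seen in the capture, renewal_due_after("1d22h39m") gives 23h19m30s, so a reading taken a day later should already show either a higher expires-after or a status of renewing.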

Or configure the sniffer the following way:
/tool sniffer set filter-mac-protocol=ip filter-ip-protocol=udp filter-port=67 filter-interface=br-wan
and start it using /tool sniffer start (still with some file-name configured). Keep it running until the next failure (you may log out of the router in the meantime), then run /tool sniffer stop. This way you’ll see exactly what was happening. Just don’t be surprised - you’ll also see some of your neighbours’ DHCP traffic even if the ISP uses port isolation properly (some server->client messages are sent to the broadcast MAC address), so you’ll have to use some display filters in Wireshark to show only what is relevant to your issue.
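What makes a DHCP message "yours" is the client hardware address field and the message type option inside the DHCP payload. The following Python sketch shows where those sit, assuming you already have the raw UDP payload bytes of one packet; it is an illustration of the standard BOOTP/DHCP layout, not a replacement for Wireshark, and the function names and the MAC address in the usage note are made up.

```python
import struct

MAGIC_COOKIE = b"\x63\x82\x53\x63"  # marks the start of the DHCP options
MESSAGE_TYPES = {
    1: "DISCOVER", 2: "OFFER", 3: "REQUEST", 4: "DECLINE",
    5: "ACK", 6: "NAK", 7: "RELEASE", 8: "INFORM",
}


def parse_dhcp(payload: bytes) -> dict:
    """Extract transaction id, client MAC and message type from a DHCP payload."""
    if len(payload) < 240 or payload[236:240] != MAGIC_COOKIE:
        raise ValueError("not a DHCP payload")
    hlen = min(payload[2], 16)                    # hardware address length
    xid = struct.unpack("!I", payload[4:8])[0]    # transaction id
    mac = ":".join(f"{b:02x}" for b in payload[28:28 + hlen])
    msg_type = None
    i = 240
    while i < len(payload):
        code = payload[i]
        if code == 255:          # end option
            break
        if code == 0:            # pad option, has no length byte
            i += 1
            continue
        if i + 1 >= len(payload):
            raise ValueError("truncated option")
        length = payload[i + 1]
        value = payload[i + 2:i + 2 + length]
        if len(value) < length:
            raise ValueError("truncated option")
        if code == 53 and length == 1:            # DHCP message type
            msg_type = MESSAGE_TYPES.get(value[0], f"UNKNOWN({value[0]})")
        i += 2 + length
    return {"xid": xid, "client_mac": mac, "type": msg_type}


def is_mine(payload: bytes, my_mac: str) -> bool:
    """True if the message carries my_mac as its client hardware address."""
    return parse_dhcp(payload)["client_mac"] == my_mac.lower()
```

With the WAN MAC address of the router passed as my_mac (for example a placeholder such as 02:00:00:00:00:01), is_mine separates your own requests and the replies addressed to you from the neighbours' broadcasts.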

I started the sniff within minutes (possibly seconds) of internet loss.
I started the sniff, waited for 1 minute and then unplugged the ethernet cable connecting the Mikrotik to the modem, waited for 10 seconds and then plugged it back in.
The /ip dhcp-client print detail was not taken at the same time as the sniffer was running.

I’m very appreciative of someone with your level of knowledge and expertise taking the time to help.

Blessings

Matt

Attached is a ~10 minute sniff with the suggested filter.
As you suggest there is a lot of broadcast DHCP traffic. Are the malformed packets an issue?

Matt

You were either sniffing on all interfaces simultaneously or on the LAN one in particular, as the DHCP traffic seen there shows your Mikrotik acting as a DHCP server for your LAN devices.

I know, I didn’t write put-your-wan-interface-name-here in place of br-wan in the command :)

You may consider removing the sniff file from the previous post as it is useless for the purpose but reveals the MAC addresses of all your home gear.

Regarding the “malformed packet”, it may be an issue with the Wireshark dissector or with the Mikrotik DHCP client. So I would say:

  • modify the filter settings to only record the DHCP traffic on the WAN interface
  • start the sniffer
  • disconnect and reconnect the WAN cable, after an hour or so stop the sniffer (to give the neighbors’ traffic a chance to appear in the sniff) and look into the file to verify that there was a normal discover-offer-request-ack sequence for your 'Tik at the beginning
  • start the sniffer again, disconnect and re-connect the cable again, and only stop the sniffer at the next outage
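Once you have the list of message types for your own client in capture order (read off Wireshark by hand, for instance), the two checks in the steps above can be sketched in Python as follows. The function names are mine, and the input is assumed to be already filtered down to your own 'Tik's messages.

```python
EXPECTED = ("DISCOVER", "OFFER", "REQUEST", "ACK")


def has_normal_handshake(message_types) -> bool:
    """True if discover, offer, request and ack appear in that order."""
    remaining = iter(message_types)
    # Each 'any' consumes the iterator up to the wanted type, so order matters.
    return all(any(seen == wanted for seen in remaining) for wanted in EXPECTED)


def unanswered_requests(message_types) -> int:
    """Number of REQUEST messages at the end of the capture with no reply after them."""
    count = 0
    for msg_type in reversed(list(message_types)):
        if msg_type != "REQUEST":
            break
        count += 1
    return count
```

A capture that starts with a normal handshake but ends in a run of unanswered requests would point at the renewals being ignored, which is the scenario described just below.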

It is possible that the field that confuses the Wireshark dissector only appears in renewal requests, and the DHCP server thus ignores those packets because it is confused too (which does not necessarily mean that the field is incorrect, it may theoretically match some newer DHCP specification which the ISP’s server doesn’t understand).

Also, have a look at Jasper Bongertz’s TraceWrangler software - it can anonymize capture files, but I am not sure whether it speaks DHCP; you will have to try. The benefit is that if it does, it will only modify the actual IP addresses, not other fields of the packets that happen to have the same binary contents (I’m not sure how exactly you’ve anonymized the first trace you’ve posted).