Some Websites Not Reachable (MTU size?)

Hi,

Since updating to 7.2 yesterday I have been having problems reaching some websites. For example, I can reach google.de but not duckduckgo.com.

I found this post that talks about MTU size problems, and the mangle rule it suggests solved the problem.

Anyhow, I’d like to UNDERSTAND what the actual problem is!

a) What does that rule specifically do?

b) What must I change to not need that rule?

Moreover, it seems that this problem is only present on my VLAN. Using a direct link (without bridge and VLAN), it does not occur.

c) What changed in 7.2 that this happens now?

Cheers!

What must I change to not need that rule?

Don’t blanket-block ICMP. That breaks PMTUD (path MTU discovery).


What does that rule specifically do?

It forces the matching packets’ TCP MSS to a hard-coded value. It’s a very dirty solution, suitable only when you can’t fix the actual breakage.
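
For reference, the rule you applied is presumably something along these lines (a sketch only; the 1400 value, the tcp-mss matcher, and the chain are assumptions, not necessarily what your source used):


/ip firewall mangle
add chain=forward protocol=tcp tcp-flags=syn tcp-mss=1401-65535 \
    action=change-mss new-mss=1400 passthrough=yes \
    comment="clamp TCP MSS on forwarded SYNs (example value)"

It rewrites the MSS option in forwarded TCP SYNs that advertise anything larger, so neither end ever tries to send a segment bigger than the hard-coded value.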

For the rest, I can only speculate. Having /export output might help.

@tikker

While it is possible there is a path MTU problem, what made you jump to that conclusion?

What troubleshooting have you done? Have you talked to your buddy and explained the problem to them?

Unless you have evidence that it’s an MTU problem, why are you treating this as if it were an MTU problem?

See The X/Y Problem

If it is an MTU blackhole, usually the symptom is that web sites will open, but then hang when the webpage attempts to send a “full sized” packet.

MSS clamping is a workaround for the problem. If you are interested in what it does, watch the video referenced in this blog post: TCP MSS clamping – what is it and why do we need it? by Ivan Pepelnjak

You can bisect your way toward an inference of the path MTU on the link with hping:


$ sudo hping -p 443 -S -y -d 1400 google.com              # works
$ sudo hping -p 443 -S -y -d 1500 google.com              # fails
$ sudo hping -p 443 -S -y -d 1450 google.com              # works
$ sudo hping -p 443 -S -y -d 1475 google.com              # fails
$ sudo hping -p 443 -S -y -d 1462 google.com              # fails
$ sudo hping -p 443 -S -y -d 1461 google.com              # fails
$ sudo hping -p 443 -S -y -d 1460 google.com              # works

Thus 1460 here, which is sensible since the base header size for TCP is 20, and IP adds another 20 fixed octets, so I’ve got a 1500 octet actual path MTU between me and Google.
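
Spelled out: 1460 octets of TCP payload + 20 octets of TCP header + 20 octets of IPv4 header = 1500 octets, the classic Ethernet MTU.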

what made you jump to that conclusion?

Well, I searched for something like the title and found some posts claiming it might be an MTU issue. That particular one had the workaround I mentioned, which I just tried.

It worked afterwards.

usually the symptom is that web sites will open, but then hang

That also happened, but mostly I received no reply at all.

Any idea what in the 7.2 update might be the reason?

You probably also found posts telling you to turn off port speed autonegotiation and other common voodoo.

Until you’ve done something like my hping test, you’ve definitely got an XY problem. Alternate take.


usually the symptom is that web sites will open, but then hang

That also happened, but mostly I received no reply at all.

I’m not sure how diagnostic that is in these days of fat web sites and highly optimized round-trips. It feels like the sort of thing that you’d see in the HTTP/1.0 days when a fair number of the resources on a given web page could actually come down in a single TCP packet. The banner image at the top of this forum is bigger than many entire pages I wrote in my first decade or so of writing web apps.


Any idea what in the 7.2 update might be the reason?

If the problem were that simple, wouldn’t we be seeing an amazing hue and cry over this?

I think you’ve got a local problem, either in your particular configuration — which you haven’t shared — or in your upstream provider.

The hping test will help you suss the latter possibility out. Try setting a plausible data size of 1460 (the path MTU less the TCP/IP headers), then adding -T to the command to put it into traceroute mode. If it succeeds, but only up to a point, you’ve found the router that’s breaking things for you.
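
Concretely, that is something like this (assuming hping3; the target and data size are just examples):


$ sudo hping3 -p 443 -S -y -T -d 1460 google.com

If the full-size probe stalls partway along the path while a smaller -d value traces all the way through, the hop where it stops is your prime suspect.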

@buckeye

OK, so this is the preferred way? New story:


Yesterday I updated my router to 7.2. This was in the evening. I went to sleep afterwards. I always try to do this when the rest of the family is most likely not using the internet anymore.

This morning I noticed that I have no “reception” on my internet radio anymore. WLAN is there, but no sound. The VoIP phone shows no error, so the internet should be there. I like to listen to the radio during breakfast…

Afterwards I was surfing the web a little. Browsing some news, following links, the usual… It was very strange: some websites worked, some didn’t…

The homepage of my bank worked, but online banking didn’t. Google worked, but duckduckgo didn’t.

What’s wrong? Please help!


Lots of useless information, so next is the request for a full dump of my configuration.

45 firewall rules that haven’t changed since yesterday, but we talk about them anyway… Then dissecting my VLAN config, which hasn’t changed since yesterday, but we talk about it anyway…

Instead I did some troubleshooting on my own:

  1. Same problem over mobile? No. OK, so more likely my local problem…

  2. Any known large scale issues on ISP side? No.

  3. Any packets dropped by firewall? Turn on logging… No.

  4. DNS? Turn off pi-hole, use 8.8.8.8 directly. Still not working.

Meanwhile the family is getting pissed off because the internet isn’t working properly…

  5. Set up a DHCP server on a dedicated interface and add forward and NAT rules for it. Works! Interesting… With VLAN: no; without VLAN: yes…

  6. Search Google for “mikrotik some websites not reachable”. Found some discussion of MTU size. Sounds plausible… Workaround presented… Tried it: works! Yay! But why? Ask the MT forum…

@tangent

OK, even if this is getting a little off-topic, I’ll weigh in here. X/Y problem: what would be X in my case?

“Get internet working again?”

BTW: I didn’t stumble over speed autoneg, but even if I had, that seems a little far-fetched and unlikely.

X would be: find out why hacking the TCP MSS has any effect whatsoever, instead of jumping straight to blaming RouterOS. This hack affects every host along the path, so shouldn’t you rule them out before assigning blame?

If you can show that my hping -T test fails at step 1, then you can blame RouterOS. If it doesn’t fail until 3 steps into your ISP’s network, RouterOS is doing fine, and you’re just using RouterOS’s power to work around your ISP’s brokenness. Maybe that strikes you as a solution, but to me, it’s just a workaround.

If the reason you’re resisting doing my test is that you’re running Windows, first, you have my condolences, and second, you can download a Kali Linux VM, bridge it to your LAN interface, and use hping3 from there while you toggle the MSS mangle rule on and off.
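
If it helps, toggling the rule from the RouterOS CLI is quick; a sketch, assuming the rule’s comment contains “MSS” (adjust the find expression to match whatever your rule actually says):


/ip firewall mangle disable [find where comment~"MSS"]
/ip firewall mangle enable [find where comment~"MSS"]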


I didn’t stumble over speed autoneg, but even if I had, that seems a little far-fetched and unlikely

I threw that out as an example of a troubleshooting step you see given regardless of plausibility. It’s something people try because it’s easy and doesn’t require understanding anything more complicated than how to operate a GUI combo box. They try things like this even when it didn’t work the last dozen times, again because it’s easy. Prior plausibility doesn’t enter into this style of troubleshooting.

Hacking the TCP MSS happened to work for you, but what I want to know is, why? PMTUD is supposed to prevent the need for that, which leads me to ask, “Who’s breaking PMTUD?”

Ah, OK. That’s your point… (so far, I’m not convinced this is an X/Y problem, but anyway)

So first of all: I’m not on Windows.

Thus:

hping3 -p 443 -T -S -y -d 1460 google.com

gives

ICMP Fragmentation Needed/DF set from ip=<internal router IP>

whereas

hping3 -p 443 -T -S -y -d 1400 google.com

gives me many hops…

No idea at all. I am new to RouterOS; I only have a single MikroTik router, an RB760iGS that I got in mid-February 2022, and I immediately updated to v7.2rc3 since I was starting fresh and the hEX S is strictly in a lab environment behind my existing router.

I just noticed that there have been many posts to this thread since I started composing this response, so some of these points may already have been addressed in other posts.

My only point is that you are more likely to get useful info if you mention what problem you want to solve, instead of asking a specific question about how to do something you think would solve it. (In this case it wasn’t quite as bad: you were asking about something you think solved your problem.)

If I understand correctly, you upgraded to v7.2 (from what?) and then duckduckgo was no longer reachable.

I misunderstood your original post as saying that the article claimed the mangle rule solved the problem for them, not that you had added it yourself and it solved your problem.

You then mentioned that it affected only your VLANs.

I am not yet convinced that this was not just a coincidence. The internet is very dynamic, and the route taken between you and duckduckgo can change frequently. Perhaps there was a circuit problem somewhere and you were routed through a link with a lower path MTU. Perhaps the route even changed between the time you tested with the VLAN and the “direct link”. In other words, did you actually verify that the VLAN route still did not work after you tested the “direct” route and verified that it did work?

If you really want to know what the cause is, post the following:

The type of router you have, the version and config that worked, the version and config that failed, and at least the difference made by the “fix” you applied to the v7.2 config.

Did you do any traceroute testing before you made the change? Did you take a backup before you made the change? If so, can you still reproduce the problem on v7.2 without the latest fix?

It is good that you want to understand what the fix did, but I usually try to understand the reasons for applying a “fix” before I blindly follow free advice I see on a forum.

BTW, the hping test provided by @tangent is good advice, as it will test using the same protocol (TCP) and port (443) that you were having problems with. The “usual” method of testing path MTU with standard ping (adding 28 for the IP and ICMP headers) can give either false positive or false negative results with respect to what you would see from a web browser.
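
For reference, the “usual” method being described is something like this on Windows, where 1472 octets of ICMP payload + 28 octets of headers = 1500, and -f sets Don’t Fragment:


ping -f -l 1472 duckduckgo.com

It exercises ICMP rather than TCP, which is part of why its results don’t always match what a browser sees.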

There are various ping utility variants that provide more info than the ping utility that comes standard with Windows. A Google search for “windows tcp ping” finds quite a few different programs named tcping, as well as Microsoft Sysinternals psping (but that seems to be more performance related), and the NirSoft utility PingInfoView. I have tried none of these; I have a Raspberry Pi I use when I need to do troubleshooting. If I were going to try one, I would try the NirSoft one; I have used quite a few of their tools, and even though they don’t have a modern look and feel, I am much more impressed by functionality than eye candy.

Another useful Windows ping tool I do use is hrping, but it does not do TCP ping. You can find hrping with Google, and it is a great tool.

Thank you. Now we have some hard data to work with.

This is with the mangle rule disabled?

Does the result change over your VLAN versus not?



hping3 -p 443 -T -S -y -d 1400 google.com


The next step is to bisect it: try 1430, then {1415, 1445} (direction depending on whether 1430 succeeded or failed), etc.

The precise MTU that works may be enlightening to someone. For instance, you might find that the maximum working value is 1456, which would suggest a problem with the [4 extra octets in the 802.1q header](https://en.wikipedia.org/wiki/IEEE_802.1Q).
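
The arithmetic behind that guess: 1500 - 4 (802.1Q tag) - 20 (IPv4 header) - 20 (TCP header) = 1456 octets of TCP payload.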

I get that you're unwilling to share your full configuration, but since you've got it narrowed to the VLAN part of your configuration, would you at least show that?

It would also ease my mind to see any firewall rules involving ICMP, to verify that it isn't your config that's broken PMTUD.

I guess I have to clarify some things first:

I am using an RB5009.

I upgraded from 7.2RC2 to 7.2.

I did not change my config on the way.

duckduckgo was just an example that clearly worked the day before, because it’s my default search engine. There were other examples: CNN was reachable, the NYTimes was not.

I don’t blame ROS in the first place. It’s absolutely possible that I have a crappy config and the issue only became apparent due to the update. I don’t think it’s a coincidence; it’s very likely the update made the problem show up, which is why I asked the way I did.

Why are you continuing to justify your original post instead of providing the data we want? Until we get that, we’re speculating, and you see how frustrating that is for all concerned.

tangent, replies were overtaking each other, so the clarification wasn’t aimed at your last reply!

So: the mangle rule has no effect on the outcome of the command.

I’ll try the non-VLAN way tomorrow!

Regarding the firewall I have no specific ICMP rules. But I do have rather strict final drop rules, so maybe I need to add specific ICMP accept rules to make PMTUD work (again)? Any hint?

VLAN interface config is rather default:

/interface vlan
add arp=enabled arp-timeout=auto disabled=no interface=bridge1 loop-protect=default \
    loop-protect-disable-time=5m loop-protect-send-interval=5s mtu=1500 name=vlan_31 \
    use-service-tag=no vlan-id=31
add arp=enabled arp-timeout=auto disabled=no interface=bridge1 loop-protect=default \
    loop-protect-disable-time=5m loop-protect-send-interval=5s mtu=1500 name=vlan_32 \
    use-service-tag=no vlan-id=32
add arp=enabled arp-timeout=auto disabled=no interface=bridge1 loop-protect=default \
    loop-protect-disable-time=5m loop-protect-send-interval=5s mtu=1500 name=vlan_33 \
    use-service-tag=no vlan-id=33
add arp=enabled arp-timeout=auto disabled=no interface=bridge1 loop-protect=default \
    loop-protect-disable-time=5m loop-protect-send-interval=5s mtu=1500 name=vlan_34 \
    use-service-tag=no vlan-id=34
add arp=enabled arp-timeout=auto disabled=no interface=bridge1 loop-protect=default \
    loop-protect-disable-time=5m loop-protect-send-interval=5s mtu=1500 name=vlan_35 \
    use-service-tag=no vlan-id=35

I asked for a bisection, not a single test. I really like my “1456” hypothesis, but I won’t know if it’s right until you try it.


Regarding the firewall I have no specific ICMP rules.

The stock firewall has this rule fairly high up:


add action=accept chain=input comment="defconf: accept ICMP" protocol=icmp

Without that, I do believe you’re going to break PMTUD, among other things.
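
If you’d rather not accept all ICMP, a narrower sketch that should keep PMTUD alive might look like this (whether you need it in input, forward, or both depends on your config, and it has to sit above your final drop rules):


/ip firewall filter
add action=accept chain=input protocol=icmp icmp-options=3:4 \
    comment="accept ICMP fragmentation-needed for PMTUD"
add action=accept chain=forward protocol=icmp icmp-options=3:4 \
    comment="accept ICMP fragmentation-needed for PMTUD"

That said, the stock accept-ICMP rule above is the simpler option.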


/interface vlan…mtu=1500


The warning at the bottom of [this section of the docs](https://help.mikrotik.com/docs/display/ROS/VLAN#VLAN-Properties) seems relevant.

While I agree that knowing the precise MTU that works would be enlightening, I find it hard to believe it is related to the extended Ethernet frame size caused by VLAN tags. It is possible that the bridge/switch interface changes in v7.2 are limiting Ethernet frames (L2 MTU + FCS) to 1518 instead of 1522, but I would guess that if that were the case, there would be many more complaints than we have seen; then again, it is possible that not that many users are using VLANs. There have been some weird issues reported with bridge VLANs working on 6.48.6 but not on 7.2 (related to a trunk port between a hEX S and a hAP ac lite).
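
If you want to rule that out on the router itself, the MTU and L2 MTU values in play are visible with something like this (the name pattern is just an example):


/interface print where name~"ether|bridge|vlan"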

First, in my experience I have never seen Ethernet frames with a VLAN header dropped because they are too long, at least on switches made in the last 15 years. As long as the L2 MTU isn’t causing Ethernet frames larger than 1518 octets to be dropped, tagged frames shouldn’t affect the L3 MTU, at least as I understand it. That is, a tagged frame won’t end up with an L3 MTU of 1504; the VLAN interface shim driver has already untagged the frame before the IP driver receives the IP packet. Every “dumb” switch I have ever tested will transparently pass tagged VLANs as-is. Of course they can’t untag them or filter based on ports, and I don’t recommend using dumb switches to carry tagged traffic, but I have done it myself many times in a lab environment.

Also, if it were the VLAN tag, why wouldn’t it affect all sites instead of just specific ones?

I think there is more involved than @tikker is sharing. Perhaps a VPN?

But I will admit that my

usually the symptom is that web sites will open, but then hang

is probably an outdated observation.

:man_shrugging:

OP reports it works without the VLAN in the way but fails over the VLAN.

Wouldn’t it be diagnostically cool — in the House way — if 1456 worked but 1457 failed?

That said, I suspect fixing the firewall as I suggested is the real solution.

That warning is about Ethernet clients with ancient Ethernet adapters whose controllers couldn’t deal with frames larger than 1518 octets. That probably applied to a 3Com 3c509, but are you aware of any adapters made in the last 15 years that it would apply to?

For reference, this is the warning:

MTU should be set to 1500 bytes same as on Ethernet interfaces. But this may not work with some Ethernet cards that do not support receiving/transmitting of full-size Ethernet packets with VLAN header added (1500 bytes data + 4 bytes VLAN header + 14 bytes Ethernet header). In this situation, MTU 1496 can be used, but note that this will cause packet fragmentation if larger packets have to be sent over the interface. At the same time remember that MTU 1496 may cause problems if path MTU discovery is not working properly between source and destination.

Yes, it would be interesting.

But I need to see better evidence than we have been provided.

We don’t even know if the same client was used with and without the VLANs in the picture.

Or what the failure message was.

I’m done with this thread until I see more useful info.

I need to learn from @sob’s advice in a recent post: