Did recent updates break Path MTU discovery?

spammyduck · Sat Jun 16, 2018 5:52 pm

I have 13 wireless PoPs and a small fiber-to-the-home deployment (about 100 customers on the FTTH). All 13 wireless PoPs consist of several APs, all connected via Ethernet to a CCRSwitch. Port 1 of the CCRSwitch is always the backhaul radio. Near our NOC is a tower where all of the wireless comes in. On the tower are 9 backhauls and 5 APs that all connect to a CCRSwitch on the ground. Port 1 of this CCRSwitch is connected via fiber to another CCRSwitch in our NOC. The CCRSwitch in the NOC has 3 ports in use. Port 24 is connected to the CCRSwitch at the tower, so all wireless traffic comes into port 24. Port 22 is connected to the CCR PPPoE server, so all the wireless PPPoE comes and goes there. Port 23 on the NOC switch is connected to the 10.10 port on the core router, so all the management traffic from all the radios, switches and other devices comes and goes here.

Our small FTTH deployment is a Calix GPON system that connects via fiber directly between the Calix C7 and our MikroTik PPPoE server, so there are no MikroTik switches between the fiber customers and the PPPoE server.

This used to work wonderfully: we had very happy customers and almost no issues with the network itself. Then our almost 10 year old x86 RouterOS PPPoE server died...

Our 10'ish year old x86 RouterOS PPPoE server was running RouterOS 6.3something when it died. The replacement is a CCR1036-12G-4S-EM, so the backup config from the old x86 couldn't be loaded onto the new server and I had to configure it from scratch (something I hadn't done for 10'ish years). Because of some major connection issues we were having (that turned out to be caused by a faulty port on a core switch), one of the many things we did while trying to fix the problem was to upgrade all the MikroTik routers and switches on the network to 6.42.3. Most of them had previously been running 6.3something.

Anyway, for 10 years the old x86 PPPoE server ran with a 1492 MTU and we never changed the factory default MSS Clamping setting on the Canopy, 450i and ePMP radios we installed. It was a great network, everything ran great and I had 800ish very happy PPPoE customers. Then, after installing the new PPPoE server and upgrading all the MikroTik stuff to 6.42.3, everything sucks now... constant PPPoE and connectivity problems.

The first thing we discovered was that no one could reach half the internet. They could load MSN, eBay, Gmail, Google and YouTube just fine, but Netflix, Yahoo Mail, Fast.com and lots of other websites wouldn't load at all. The Hulu and Amazon Prime websites would load, but movies/shows would never load a single frame.

Turning on MSS Clamping on the customer radios, a setting we had not messed with in the 10 years of doing PPPoE, fixed it... mostly. It seems that turning MSS Clamping on would fix the problem for hours or days, but eventually the problem would return. However, with MSS Clamping on, rebooting the customer radio would fix it... again for a while.
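
For reference, a rough sketch of what MSS clamping works out to arithmetically, assuming plain IPv4 or IPv6 headers with no options (20 + 20 bytes of IP and TCP header for IPv4, 40 + 20 for IPv6). The name clamp_mss is made up for illustration and is not a RouterOS or radio setting; it only shows why a 1492 MTU corresponds to an MSS of 1452 and a 1480 MTU to 1440.

```python
# Illustrative only: clamp_mss is a hypothetical helper, not a vendor API.
IPV4_HEADER = 20  # bytes, assuming no IP options
IPV6_HEADER = 40  # bytes, fixed IPv6 header, no extension headers
TCP_HEADER = 20   # bytes, assuming no TCP options on data segments


def clamp_mss(advertised_mss: int, link_mtu: int, ipv6: bool = False) -> int:
    """Return the MSS a clamping device would leave in a TCP SYN.

    The MSS is lowered to (link MTU - IP header - TCP header) when the
    advertised value is larger, and left alone when it already fits.
    """
    ip_header = IPV6_HEADER if ipv6 else IPV4_HEADER
    largest_fitting_mss = link_mtu - ip_header - TCP_HEADER
    return min(advertised_mss, largest_fitting_mss)


if __name__ == "__main__":
    # A client on plain Ethernet typically advertises 1460 (1500 - 40).
    for mtu in (1492, 1480):
        print(mtu, clamp_mss(1460, mtu))
```

If the clamp is missing or not applied, the far end keeps sending segments sized for a 1500 byte path, and those only get through if Path MTU discovery works end to end.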

So, it looks like something between 6.3.3'ish and 6.42 broke MTU discovery?

Our fiber-to-the-home customers (only about 100 right now) were not affected by this. They all come into a Calix C7 which connects straight into the PPPoE server, and they have their own PPPoE server instance configured on it. (The PPPoE server has 2 PPPoE servers configured: one for FTTH, bound to one of the SFP ports, and the other for the wireless customers, bound to Ethernet port 4, so each cannot see the other's PPPoE server.)

We were also having issues with many of the wireless radios connecting to PPPoE and instantly dropping. The PPPoE server logs would show they were authenticating and then instantly hanging up.

We changed the MTU to 1480 (from the 1492 it had always been) and suddenly the people on the fiber started having the same "can't reach half the internet" problem. Rebooting their PPPoE client (usually a Netgear/Belkin router) would fix it until it happened again.

So, we set the MTU/MRU to 1492 on the fiber PPPoE settings and set the wireless PPPoE server to 1480. We had several VPN users who were having a great many disconnects every day after the big MikroTik upgrade, and changing the 1492 to 1480 seems to have reduced, but not eliminated, the number of times their VPN disconnects every day.

I made a lot of use of the whole Master / Slave port grouping on the MikroTik switches, and that was eliminated with 6.42. I never messed with the bridge settings on the switches, but I'm seeing that the upgraded switches all now have a bridge with all the ports in it and one seemingly random port called "root". The root port is never the backhaul port; it seems to have been designated as root by some random process and doesn't appear to be a setting I can change.

Anyway, I'm just trying to figure out why, after updating to 6.42, I had to turn on MSS Clamping and change the MTU/MRU from 1492 to 1480 to stop the "can't access half the internet" problem, and why VPN users are now being disconnected several times a day (many times a day if I set MTU/MRU back to 1492) on any devices passing through a MikroTik switch. On the FTTH there are no 'Tik switches (or any other switches) between the customer PPPoE client and the PPPoE server, the MTU/MRU remains at 1492, and those customers have no problems (we actually have problems on the FTTH if we lower the MTU/MRU). I can't speak to MSS Clamping there, because the fiber customer's PPPoE client is their own device, mostly Netgear and Belkin routers that, as far as I know, don't have a setting to enable/disable MSS Clamping, and it is set to "default" on the PPPoE server.

nostromog · Sat Jul 21, 2018 8:38 pm

I seem to be seeing a very similar behaviour:

* We bought a new router, the same model as one of our other four, and I'm experimenting with IPv6 on it using tunnels.
* One of the old routers, whose configuration has not been changed recently, has PPPoE as upstream, another one is NATted under a PPPoE connection and another one is on the public internet. The new one is on PPPoE.
* Since I installed the new one, things seem to work in IPv4, except that the L2TP/IPsec that was stable for hours server-side is now hanging roughly every half an hour when I pass through my new router. Sometimes it is very difficult to restart, other times it is OK.
* IPv6 seems to work OK for ICMPv6 and UDP, but over TCP I can't connect to lots of the IPv6 internet (including this forum, the HE ones, and the whole of Google).
* PMTU discovery using TCP in traceroute (on my laptop) works, PMTU discovery using ICMP on my laptop works, UDP on my laptop does not work.
* One can see in the tries from my laptop that the router is gently telling me that the packet was dropped because of MTU, and the same can be found with ping bisection, but all connections that require a big packet hang forever.
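
For anyone who wants to repeat the ping bisection, here is a minimal sketch of the search itself. It assumes a caller-supplied probe(size) function that returns True when a don't-fragment packet of that total size gets through; both find_path_mtu and probe are hypothetical names, and how the probe is wired to an actual ping is left out on purpose.

```python
from typing import Callable


def find_path_mtu(probe: Callable[[int], bool],
                  low: int = 1280, high: int = 1500) -> int:
    """Binary-search the largest packet size in [low, high] that passes.

    probe(size) is assumed to send one don't-fragment packet of `size`
    bytes and report whether it arrived. The default bounds are the IPv6
    minimum MTU (1280) and the usual Ethernet MTU (1500).
    """
    if not probe(low):
        raise ValueError("even the smallest size did not get through")
    while low < high:
        mid = (low + high + 1) // 2  # round up so the loop always shrinks
        if probe(mid):
            low = mid        # mid fits, the answer is mid or larger
        else:
            high = mid - 1   # mid is too big, the answer is smaller
    return low


if __name__ == "__main__":
    # Stand-in for a real probe: pretend the path carries at most 1480 bytes.
    print(find_path_mtu(lambda size: size <= 1480))
```

The search only tells you the largest size that fits; it says nothing about whether the "packet too big" messages actually reach the sender, which is the part that has to work for TCP to recover by itself.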

I have been trying to find out what is broken specifically with MSS or similar, maybe some problem with the header checksum after rewriting... but IPv6 is mostly unusable now...
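
On the checksum idea: a device that rewrites the MSS option in a SYN also has to fix up the TCP checksum, commonly with the incremental update described in RFC 1624 instead of a full recompute. The sketch below only shows that arithmetic, so it is clear what "correct after rewriting" would mean; the helper names are hypothetical and this is not a claim about how RouterOS implements it.

```python
def ones_complement_sum(data: bytes) -> int:
    """16-bit one's-complement sum of data (odd length is zero-padded)."""
    if len(data) % 2:
        data += b"\x00"
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:  # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return total


def internet_checksum(data: bytes) -> int:
    """Full Internet checksum: complement of the one's-complement sum."""
    return ~ones_complement_sum(data) & 0xFFFF


def incremental_update(old_checksum: int, old_word: int, new_word: int) -> int:
    """RFC 1624 eqn. 3: HC' = ~(~HC + ~m + m') for one changed 16-bit word."""
    total = (~old_checksum & 0xFFFF) + (~old_word & 0xFFFF) + new_word
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


if __name__ == "__main__":
    # Toy buffer (not a real TCP segment) containing the MSS value 1460.
    before = bytes([0x12, 0x34]) + (1460).to_bytes(2, "big") + bytes([0xAB, 0xCD])
    after = bytes([0x12, 0x34]) + (1452).to_bytes(2, "big") + bytes([0xAB, 0xCD])
    patched = incremental_update(internet_checksum(before), 1460, 1452)
    print(patched == internet_checksum(after))
```

A capture taken on both sides of the router, comparing the MSS option and checksum in the SYN, would show whether the rewrite is happening and whether the checksum still verifies.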

$ sudo traceroute -6n -m 10 --mtu -M tcp ipv6.google.com 
traceroute to ipv6.google.com (2a00:1450:4006:801::200e), 10 hops max, 80 byte packets
 1  2001:470:1f0a:5af::2  39.034 ms  39.267 ms  39.475 ms
 2  2001:470:1f0a:5af::1  90.561 ms  90.807 ms  89.062 ms
 3  2001:470:0:69::1  94.718 ms  82.365 ms  93.569 ms
 4  2001:7f8::3b41:0:1  109.970 ms  108.952 ms  129.172 ms
 5  2001:4860:0:11df::10  83.483 ms 2001:4860:0:11e1::e  85.333 ms 2001:4860:0:11e0::11  86.485 ms
 6  2001:4860::c:4000:f873  99.338 ms 2001:4860::c:4000:f874  84.007 ms 2001:4860::8:0:cb95  100.488 ms
 7  2001:4860::9:4001:7bc  94.986 ms 2001:4860::9:4000:e392  101.226 ms 2001:4860::9:4001:7bc  94.020 ms
 8  2001:4860::9:4001:c34  100.444 ms 2001:4860::9:4001:c35  100.847 ms  100.129 ms
 9  2001:4860:0:12e0::1  101.039 ms  102.407 ms 2001:4860:0:1::1a8d  100.987 ms
10  2001:4860:0:1::1a8b  100.679 ms  199.274 ms 2a00:1450:4006:801::200e  100.025 ms
$ sudo traceroute -6n -m 10 --mtu -M icmp ipv6.google.com 
traceroute to ipv6.google.com (2a00:1450:4006:801::200e), 10 hops max, 65000 byte packets
 1  2001:470:1f0a:5af::2  172.334 ms F=1280  61.528 ms  62.984 ms
 2  2001:470:1f0a:5af::1  148.842 ms  111.712 ms  110.753 ms
 3  2001:470:0:69::1  108.884 ms  179.070 ms  204.412 ms
 4  2001:7f8::3b41:0:1  204.591 ms  169.505 ms  136.779 ms
 5  2001:4860:0:11e0::10  204.894 ms  138.173 ms  138.315 ms
 6  2001:4860::c:4000:f874  115.765 ms  102.380 ms  105.135 ms
 7  2001:4860::9:4000:e392  121.256 ms  119.443 ms  181.289 ms
 8  2001:4860::9:4001:c34  142.798 ms  158.054 ms  210.792 ms
 9  2001:4860:0:1::1a8d  204.362 ms  205.533 ms  203.666 ms
10  2a00:1450:4006:801::200e  118.599 ms  187.464 ms  204.898 ms
$ sudo traceroute -6n -m 10 --mtu -M udp ipv6.google.com 
traceroute to ipv6.google.com (2a00:1450:4006:801::200e), 10 hops max, 65000 byte packets
 1  2001:470:1f0a:5af::2  59.650 ms F=1280  58.531 ms  59.042 ms
 2  2001:470:1f0a:5af::1  111.187 ms  129.917 ms  138.186 ms
 3  2001:470:0:69::1  204.219 ms  111.610 ms  118.381 ms
 4  * * *
 5  * * *
 6  * * *
 7  * * *
 8  * * *
 9  * * *
10  * * *


[admin@MikroTik] > /tool traceroute [:resolve ipv6.google.com] protocol=icmp count=4
 # ADDRESS                          LOSS SENT    LAST     AVG    BEST   WORST STD-DEV STATUS                                       
 1 2001:470:1f0a:5af::1               0%    4  51.8ms    50.9    50.4    51.8     0.5                                              
 2 2001:470:0:69::1                   0%    4  53.6ms    53.3    46.6    59.2     4.5                                              
 3 2001:7f8::3b41:0:1                 0%    4  45.4ms    45.5    45.3      46     0.3                                              
 4 2001:4860:0:11e0::11               0%    4  46.2ms    46.1      46    46.2     0.1                                              
 5 2001:4860::c:4000:f873             0%    4  45.9ms    45.8    45.7    45.9     0.1                                              
 6 2001:4860::c:4000:d9af             0%    4  54.9ms    55.1    54.9    55.2     0.1                                              
 7 2001:4860::9:4001:2750             0%    4  72.4ms      74    72.3    79.1     2.9                                              
 8 2001:4860:0:1348::1                0%    4  78.2ms    74.5    73.3    78.2     2.1                                              
 9 2001:4860:0:1::165                 0%    4  71.3ms    71.6    71.3    71.8     0.2                                              
10 2a00:1450:4003:803::200e           0%    4  72.2ms    72.2    72.1    72.2     0.1                                              

[admin@MikroTik] > /tool traceroute [:resolve ipv6.google.com] protocol=udp  count=4  
 # ADDRESS                          LOSS SENT    LAST     AVG    BEST   WORST STD-DEV STATUS                                       
 1                                  100%    4 timeout                                                                              
 2                                  100%    4 timeout                                                                              
 3                                  100%    4 timeout                                                                              
 4                                  100%    4 timeout                                                                              
 5                                  100%    4 timeout
