Community discussions

NathanA
Forum Veteran
Topic Author
Posts: 829
Joined: Tue Aug 03, 2004 9:01 am

VPLS fragment reassembly bug only on TILE-arch

Sun Dec 18, 2022 5:38 am

I just submitted this bug report to MikroTik through their service desk. But I thought I'd also post it here, just in case anybody else has maybe run into weird oddities on their network with random dropped packets and/or stalled or aborted TCP flows that there was seemingly no rhyme or reason for. This little bugger was difficult to isolate, and as you might expect, it took a while to put together a minimum lab where it could be reproduced & to write up the report.

In summary, if you are using VPLS on your network, take advantage of VPLS fragmentation, and are using any CCR1xxx (Tilera SoC) series routers as PEs, you are probably experiencing this bug even if you don't know it. It has existed for years. Maybe the reason it has managed to escape attention for so long is that most people running VPLS on MikroTik are not using fragmentation. But if you still have parts of your network that are not yet "jumbo-clean" and are forced to use VPLS fragmentation as a result, then unless/until this gets fixed, your options are to force a lower MTU for end-users and stop allowing VPLS to fragment your frames, to finish making your network jumbo-clean from end to end, or to rip & replace any TILE-based CCRs acting in the capacity of a PE.

- - - -

We have been chasing a problem for some months now where some network users will mysteriously and randomly experience transmission stalls that eventually lead to premature termination of their TCP data transfer, resulting in incomplete file download or upload, or interruption in streaming.

I think I have finally nailed down the problem, and found it in a place that we were not expecting: a bug in RouterOS related to VPLS pseudowire defragmentation. And bizarrely, it only affects CCR1xxx (Tilera) products. This is actually a strange and complex bug, both to describe and to reproduce, so I will attempt to be very exact as I work to lay out the details below. I have also put together a minimum-viable config and test scenario to reliably reproduce the bug, which I will outline here.

As far as I can tell, this bug has existed throughout every RouterOS 6.x release and also into all of RouterOS 7.x even up to the current beta. Because it only affects TILE-based MikroTik routers, my theory is that there is some MPLS/VPLS processing assist or acceleration support (maybe Fast Path related?) in the Linux driver for Tilera SoC Ethernet interfaces, and the bug exists in there somewhere. This is of course only a guess.

Brief description: if a VPLS payload gets fragmented and then delivered to a CCR1xxx router for reassembly, and particular bits at particular offsets within one of the fragments are set to particular values, then even though the fragment is not corrupt and its contents are perfectly valid, the CCR will reject the fragment, and the entire payload will be lost (the Ethernet frame will never finish being reassembled, and will never be forwarded to the end-user).

If this happens to a TCP packet encapsulated within fragmented VPLS, TCP on the sender side will of course attempt to retransmit when the recipient fails to send back an ACK. But since the contents of the retransmitted packet will be 100% identical to the one that was never delivered, the same bits at the same offsets will still have the same values, and the CCR will reject the same packet every single time. Since the recipient never gets the packet no matter how many times it is retransmitted, and therefore never sends back an ACK to the sender, eventually one party or the other will give up and send TCP RST, aborting the entire transfer.

When encryption is employed (e.g., HTTPS), these failures occur seemingly at random even when downloading the same content (it will work one time but not another, the failure will occur at different places in the transfer, etc.), since the keys for each session will be different, and thus the actual bits sent across the wire will be different every time. But unencrypted streams that are affected will always hang and then abort at the exact same place in the transfer. So once you find a sequence of bytes that can cause the problem every time, the issue is infinitely reproducible with that sequence.

If we dig deeper and look at affected payloads, it appears that what is happening is that if it is possible for the contents of the fragment to be (incorrectly) interpreted as another MPLS-tagged Ethernet frame -- specifically, one with two labels but without a control word -- then the CCR that is tasked with reassembling the fragments will choose to interpret it that way even though it isn't, and throw out the fragment.

Here are some example screenshots from Wireshark of a "good" fragment (one that CCR does not reject), and a "bad" one (that CCR erroneously rejects).

"Good":

Image

"Bad":

Image

Notice that in the "bad" example, even Wireshark tries to interpret the contents of the encapsulated fragment as if it were another MPLS-tagged Ethernet frame, with two MPLS labels in the stack. I believe this is actually exactly what RouterOS running on TILE is doing. You can see here that Wireshark even tries to interpret the bytes at the positions where the destination and source MAC addresses would normally be located as MAC addresses, even though they are just part of a TCP data stream and do not represent host MAC addresses.

Based on this, we can conclude that the following conditions within the packet fragment are necessary to trigger the bug:

For any fragment past the first one, the contents need to be ambiguously (and incorrectly) interpretable as a 2-label MPLS stack with no PW Ethernet Control Word. So...
`
  1. The 16-bit word at relative offset 0x0C past the end of the PW control word (so, absolute offset 0x26 in the capture) needs to contain 0x8847 (the Ethertype for an MPLS-labeled frame)
  2. The following 8 bytes need to be interpretable as an MPLS label stack with 2 labels; so, the 3rd byte of those 8 needs to have its last (bottom-of-stack) bit set to 0, and the byte 4 positions after that one (the 7th byte) needs to have its last bit set to 1
  3. The byte immediately following all of that needs to not have its first 4 bits all set to 0 (the first 4 bits being 0 after the last MPLS label would indicate a PW control word, so flipping any bit on within those 4 bits satisfies this criterion)
`
And in fact, we can see in the "bad" screenshot that all of these are true: octets 0x26 and 0x27 happen to be 0x88 and 0x47 (0x8847) respectively [criterion #1]; the following 64 bits, hex "c0 ae 2c f5 1b 4a 47 b2", are interpreted as two MPLS labels, where the last bit of the even byte "0x2c" (the 3rd of those 8) is 0 and the last bit of the odd byte "0x47" (the 7th) is 1 [criterion #2]; and the first 4 bits of the immediately following octet "0xd6" are not all 0 [criterion #3].

Further validating this theory, if you take the exact same payload, and you replay it while changing ANY one of these factors by simply flipping a bit or two (for example: change "0x8847" to "0x8848", or change "0x2c" to "0x2d", or change "0xd6" to "0x06"), then the CCR will reassemble the PW fragment just fine and forward the whole frame to the end-user.
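
To make these criteria concrete, here is a small Python sketch (my own analysis helper, not anything from RouterOS) that tests whether a fragment's data matches all three conditions; the bytes from the "bad" capture match, and any of the single-bit escapes just described breaks the match:
`
# Hypothetical checker for the three trigger criteria described above.
# Offsets are relative to the start of the fragment data, i.e. just past
# the outer PW control word (relative 0x0C == absolute 0x26 in the captures).
def triggers_tile_bug(frag_data: bytes) -> bool:
    # Criterion 1: Ethertype 0x8847 (MPLS unicast) at relative offset 0x0C
    if frag_data[0x0C:0x0E] != b"\x88\x47":
        return False
    # Criterion 2: the next 8 bytes look like a 2-label MPLS stack; the
    # bottom-of-stack (S) bit is the last bit of each label's 3rd byte
    labels = frag_data[0x0E:0x16]
    if labels[2] & 0x01 != 0:     # 1st label must NOT be bottom-of-stack
        return False
    if labels[6] & 0x01 != 1:     # 2nd label MUST be bottom-of-stack
        return False
    # Criterion 3: the next byte must not start with four 0 bits
    # (which would make it look like a PW control word)
    return frag_data[0x16] & 0xF0 != 0

# Bytes from the "bad" capture, starting at relative offset 0x0C:
bad = bytes(12) + bytes.fromhex("8847c0ae2cf51b4a47b2d6")
assert triggers_tile_bug(bad)

# Flipping "0x2c" to "0x2d" (one of the escapes described above) defuses it:
good = bytearray(bad)
good[0x10] = 0x2D
assert not triggers_tile_bug(bytes(good))
`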

Thus, to reproduce the bug, it should simply be necessary to take a sequence of bytes that would normally pass, and then change the right bits in the right places so that the trigger conditions are satisfied, which you can do once you know where the fragmentation will occur.

For the minimum-viable config, this is precisely what I have done: set up a lab network with an end-to-end path MTU of 1500 and an MPLS MTU of 1492, calculated where a 1500-byte IP packet (== 1514-byte Ethernet frame) would get split by the ingressing PE router in this scenario, sent a 1500-byte IP packet from one end to the other and confirmed that it was received okay, modified the same payload to meet the necessary conditions described above, and then transmitted the modified payload and confirmed that it was NOT received (bug triggered on the egressing PE).
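
Spelling out that calculation (a sketch assuming two MPLS labels plus a 4-byte control word of encapsulation overhead, consistent with the captures above):
`
# Where does the ingress PE split a 1514-byte Ethernet frame when the
# MPLS MTU is 1492?  (Assumes 2 labels + PW control word of overhead.)
MPLS_MTU  = 1492           # /mpls interface ... mpls-mtu=1492
LABELS    = 2 * 4          # tunnel label + VC label
CTRL_WORD = 4              # PW control word (use-control-word=yes)
FRAME     = 14 + 1500      # 1514-byte Ethernet frame (1500-byte IP packet)

first_chunk  = MPLS_MTU - LABELS - CTRL_WORD   # 1480 bytes of the frame
second_chunk = FRAME - first_chunk             # the remaining 34 bytes

# Relative offset 0x0C into the second fragment's data is frame offset
# 1480 + 12 = 1492; subtract the Ethernet (14), IP (20), and ICMP (8)
# headers to find which ICMP payload byte must land on 0x8847:
print(hex(first_chunk + 0x0C - 14 - 20 - 8))   # 0x5aa -- see the payloads below
`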

For this lab, we will set up the following:
`
  • 2x P routers (named P-1 and P-2; MPLS switchers/forwarders)
  • 2x PE routers (named PE-1 and PE-2; MPLS label push/pop, VPLS encapsulation/decapsulation, and fragmentation/reassembly)
  • 2x PC hosts of some sort (I call the sending one "server" and receiving one "client")
`
This is very basic / "MPLS 101": P-1 and P-2 are connected to each other, PE-1 is connected to P-1 and PE-2 is connected to P-2, a VPLS tunnel is established between PE-1 and PE-2, and the server is connected to PE-1 while the client is connected to PE-2.
`
+--------+      +------+      +-----+      +-----+      +------+      +--------+
| Server | <==> | PE-1 | <==> | P-1 | <==> | P-2 | <==> | PE-2 | <==> | Client |
+--------+      +------+      +-----+      +-----+      +------+      +--------+
                                                          TILE
`
For this lab, PE-1, P-1, and P-2 can be any hardware arch; in my own tests I have just been using x86, but it does not matter. PE-2 (the receiving/decapsulating PE that has the "client" attached to it), however, MUST be a TILE-arch CCR1xxx model. Once you reproduce the problem with that CCR, you can then replace PE-2 with a non-TILE device with a 100% identical configuration, retry the test, and see that it passes successfully. We will also start out by running 6.49.7 on everything, then upgrading the TILE-based PE-2 from 6.49.7 to 7.7rc1, re-running the test, and seeing that it still fails even on RouterOS 7. All of this will establish that the problem only happens on Tilera SoCs, and that it affects every single version of RouterOS to date. (I have also tested back to early versions of RouterOS 6 and the problem has been there since the beginning; it was not introduced as a regression in some later version of 6.x. Yes, this bug has existed for YEARS and has somehow flown under the radar this whole time.)

For the "server" and the "client", I am simply attaching two Linux hosts, and running Nping from Nmap on the server to send a custom-crafted 1500-byte ICMP payload to the client. I am attaching two versions of the payload: test-payload-pass.txt and test-payload-fail.txt. The "pass" one is simply a repeating sequence of bytes (0 through 9, a space, roman alphabet a-z lowercase then A-Z uppercase, a space). The "fail" one takes this same payload, changes bytes at offsets 0x05aa and 0x05ab to "0x8847", and the byte at offset 0x05b2 from ASCII "N" to "O" (0x4e to 0x4f). This makes the "fail" one satisfy all 3 requirements previously described in order to reproduce the bug, so if you use Nping on the server to send an ICMP ping with the "pass" payload, you will get a response from the client, but if you use the "fail" payload, you will never get a response from the client, because the client never receives the reassembled packet from PE-2.

(For the lab "server" and "client", I like to use Alpine Linux since it is lightweight -- a default installation from the "extended" ISO can easily fit within 2GiB -- and is very quick to install. But you can of course use whatever is most convenient or desireable for you. Nping is not included by default, so if using Alpine, you can install Nping from Alpine with "apk add nmap-nping". This means you will need to attach your "server" host to the internet briefly to install Nping before attaching it to PE-1 in the lab. On Debian or Ubuntu, Nping is included in the full Nmap package, so "apt install nmap". If using Windows then naturally disable Windows Firewall on both sides.)

In my case, I have assigned the server 192.168.10.1/24, and the client 192.168.10.2/24. On Linux, the nping commands I am using are as follows (while in the directory where I have copies of the test payload files, running as superuser/root, with Bash as my shell):
`
# nping --icmp --data-string "$(cat ./test-payload-pass.txt)" 192.168.10.2
# nping --icmp --data-string "$(cat ./test-payload-fail.txt)" 192.168.10.2
`
(Note that Nping seemingly does not have a way to ingest a custom payload directly from a file, and on Windows I am not sure how to read in a file and incorporate its contents into a parameter with cmd.exe or PowerShell.)

Here are the example outputs demonstrating the problem:
`
# nping --icmp --data-string "$(cat ./test-payload-pass.txt)" 192.168.10.2
WARNING: Payload exceeds maximum recommended payload (1400)

Starting Nping 0.7.70 ( https://nmap.org/nping ) at 2022-12-17 18:15 PST
SENT (0.0054s) ICMP [192.168.10.1 > 192.168.10.2 Echo request (type=8/code=0) id=15922 seq=1] IP [ttl=64 id=649 iplen=1500 ]
RCVD (0.2044s) ICMP [192.168.10.2 > 192.168.10.1 Echo reply (type=0/code=0) id=15922 seq=1] IP [ttl=64 id=23373 iplen=1500 ]
SENT (1.0057s) ICMP [192.168.10.1 > 192.168.10.2 Echo request (type=8/code=0) id=15922 seq=3] IP [ttl=64 id=649 iplen=1500 ]
RCVD (1.0178s) ICMP [192.168.10.2 > 192.168.10.1 Echo reply (type=0/code=0) id=15922 seq=3] IP [ttl=64 id=23393 iplen=1500 ]
SENT (2.0070s) ICMP [192.168.10.1 > 192.168.10.2 Echo request (type=8/code=0) id=15922 seq=3] IP [ttl=64 id=649 iplen=1500 ]
RCVD (2.0344s) ICMP [192.168.10.2 > 192.168.10.1 Echo reply (type=0/code=0) id=15922 seq=3] IP [ttl=64 id=23496 iplen=1500 ]
SENT (3.0078s) ICMP [192.168.10.1 > 192.168.10.2 Echo request (type=8/code=0) id=15922 seq=4] IP [ttl=64 id=649 iplen=1500 ]
RCVD (3.0511s) ICMP [192.168.10.2 > 192.168.10.1 Echo reply (type=0/code=0) id=15922 seq=4] IP [ttl=64 id=23562 iplen=1500 ]
SENT (4.0094s) ICMP [192.168.10.1 > 192.168.10.2 Echo request (type=8/code=0) id=15922 seq=5] IP [ttl=64 id=649 iplen=1500 ]
RCVD (4.0678s) ICMP [192.168.10.2 > 192.168.10.1 Echo reply (type=0/code=0) id=15922 seq=5] IP [ttl=64 id=23608 iplen=1500 ]

Max rtt: 198.913ms | Min rtt: 12.002ms | Avg rtt: 67.988ms
Raw packets sent: 5 (7.500KB) | Rcvd: 5 (7.500KB) | Lost: 0 (0.00%)
Nping done: 1 IP address pinged in 4.07 seconds
`

`
# nping --icmp --data-string "$(cat ./test-payload-fail.txt)" 192.168.10.2
WARNING: Payload exceeds maximum recommended payload (1400)

Starting Nping 0.7.70 ( https://nmap.org/nping ) at 2022-12-17 18:16 PST
SENT (0.0027s) ICMP [192.168.10.1 > 192.168.10.2 Echo request (type=8/code=0) id=64955 seq=1] IP [ttl=64 id=20226 iplen=1500 ]
SENT (1.0031s) ICMP [192.168.10.1 > 192.168.10.2 Echo request (type=8/code=0) id=64955 seq=2] IP [ttl=64 id=20226 iplen=1500 ]
SENT (2.0044s) ICMP [192.168.10.1 > 192.168.10.2 Echo request (type=8/code=0) id=64955 seq=3] IP [ttl=64 id=20226 iplen=1500 ]
SENT (3.0050s) ICMP [192.168.10.1 > 192.168.10.2 Echo request (type=8/code=0) id=64955 seq=4] IP [ttl=64 id=20226 iplen=1500 ]
SENT (4.0063s) ICMP [192.168.10.1 > 192.168.10.2 Echo request (type=8/code=0) id=64955 seq=5] IP [ttl=64 id=20226 iplen=1500 ]

Max rtt: N/A | Min rtt: N/A | Avg rtt: N/A
Raw packets sent: 5 (7.500KB) | Rcvd: 0 (0B) | Lost: 5 (100.00%)
Nping done: 1 IP address pinged in 5.01 seconds
`
NOTE: strangely, once a TILE-based PE router receives a VPLS fragment that meets all the conditions necessary to trigger the bug, it will also stop forwarding non-triggering VPLS frames to the "client" host for a few seconds. So if you send a ping with the "fail" payload and then immediately try to send the "pass" payload afterward, you will likely also drop one or more of the packets with the "pass" payload.
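
An easy way to observe that knock-on effect is to fire the two payloads back-to-back. Here is a trivial wrapper of my own around the same nping invocations (assumes the payload files from above, root privileges on the "server", and a POSIX system, since it passes the payload as raw argv bytes just like bash's "$(cat ...)"):
`
import subprocess

def nping(payload_file: bytes) -> None:
    # Pass the payload through as a single raw argv entry, like "$(cat ...)"
    data = open(payload_file, "rb").read()
    subprocess.run([b"nping", b"--icmp", b"--data-string", data,
                    b"192.168.10.2"])

nping(b"./test-payload-fail.txt")   # triggers the bug on PE-2
nping(b"./test-payload-pass.txt")   # normally fine; expect drops here too
`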

Below you will find the config exports for P-1, P-2, PE-1, and PE-2; in this ticket I am also attaching a supout from the CCR1009 I am using in my lab while it is running 7.7rc1 (first configured while running 6.49.7, then in-place upgraded straight to 7.7rc1; note, however, that after the upgrade I had to manually do "/mpls interface add interface=all mpls-mtu=1492" because the upgrade from 6.x to 7.x "lost" the 'all' MPLS interface that was already configured... so, another bug):

P-1:
`
/system identity set name=mpls-lab-p-1

/interface bridge add name=loopback

/interface ethernet set [ find default-name=ether1 ] name=ether1-to-pe1
/interface ethernet set [ find default-name=ether2 ] name=ether2-to-p2

/ip address add address=192.168.1.1 interface=loopback
/ip address add address=192.168.0.1/30 interface=ether2-to-p2
/ip address add address=192.168.0.5/30 interface=ether1-to-pe1

/routing ospf instance set [ find default=yes ] router-id=192.168.1.1
/routing ospf interface add interface=ether1-to-pe1 network-type=broadcast
/routing ospf interface add interface=ether2-to-p2 network-type=broadcast
/routing ospf network add area=backbone network=192.168.0.0/23

/mpls interface set [ find default=yes ] mpls-mtu=1492
/mpls ldp set enabled=yes lsr-id=192.168.1.1 transport-address=192.168.1.1 use-explicit-null=yes
/mpls ldp interface add interface=ether1-to-pe1
/mpls ldp interface add interface=ether2-to-p2
`

P-2:
`
/system identity set name=mpls-lab-p-2

/interface bridge add name=loopback

/interface ethernet set [ find default-name=ether1 ] name=ether1-to-pe2
/interface ethernet set [ find default-name=ether2 ] name=ether2-to-p1

/ip address add address=192.168.1.2 interface=loopback
/ip address add address=192.168.0.2/30 interface=ether2-to-p1
/ip address add address=192.168.0.9/30 interface=ether1-to-pe2

/routing ospf instance set [ find default=yes ] router-id=192.168.1.2
/routing ospf interface add interface=ether1-to-pe2 network-type=broadcast
/routing ospf interface add interface=ether2-to-p1 network-type=broadcast
/routing ospf network add area=backbone network=192.168.0.0/23

/mpls interface set [ find default=yes ] mpls-mtu=1492
/mpls ldp set enabled=yes lsr-id=192.168.1.2 transport-address=192.168.1.2 use-explicit-null=yes
/mpls ldp interface add interface=ether1-to-pe2
/mpls ldp interface add interface=ether2-to-p1
`

PE-1:
`
/system identity set name=mpls-lab-pe-1

/interface bridge add name=loopback

/interface ethernet set [ find default-name=ether1 ] name=ether1-to-p1
/interface ethernet set [ find default-name=ether2 ] name=ether2-to-server

/ip address add address=192.168.2.1 interface=loopback
/ip address add address=192.168.0.6/30 interface=ether1-to-p1

/routing ospf instance set [ find default=yes ] router-id=192.168.2.1
/routing ospf interface add interface=ether1-to-p1 network-type=broadcast
/routing ospf network add area=backbone network=192.168.0.0/22

/mpls interface set [ find default=yes ] mpls-mtu=1492
/mpls ldp set enabled=yes lsr-id=192.168.2.1 transport-address=192.168.2.1 use-explicit-null=yes
/mpls ldp interface add interface=ether1-to-p1

/interface vpls add disabled=no l2mtu=1500 name=vpls1 remote-peer=192.168.2.2 use-control-word=yes vpls-id=1:1

/interface bridge add name=lan protocol-mode=none
/interface bridge port add bridge=lan interface=ether2-to-server
/interface bridge port add bridge=lan interface=vpls1
`

PE-2:
`
/system identity set name=mpls-lab-pe-2

/interface bridge add name=loopback

/interface ethernet set [ find default-name=ether1 ] name=ether1-to-p2
/interface ethernet set [ find default-name=ether2 ] name=ether2-to-client

/ip address add address=192.168.2.2 interface=loopback
/ip address add address=192.168.0.10/30 interface=ether1-to-p2

/routing ospf instance set [ find default=yes ] router-id=192.168.2.2
/routing ospf interface add interface=ether1-to-p2 network-type=broadcast
/routing ospf network add area=backbone network=192.168.0.0/22

/mpls interface set [ find default=yes ] mpls-mtu=1492
/mpls ldp set enabled=yes lsr-id=192.168.2.2 transport-address=192.168.2.2 use-explicit-null=yes
/mpls ldp interface add interface=ether1-to-p2

/interface vpls add disabled=no l2mtu=1500 name=vpls1 remote-peer=192.168.2.1 use-control-word=yes vpls-id=1:1

/interface bridge add name=lan protocol-mode=none
/interface bridge port add bridge=lan interface=ether2-to-client
/interface bridge port add bridge=lan interface=vpls1
`
 
sid5632
Long time Member
Posts: 553
Joined: Fri Feb 17, 2017 6:05 pm

Re: VPLS fragment reassembly bug only on TILE-arch

Sun Dec 18, 2022 2:45 pm

What an excellent post!
 
Belyivulk
Member Candidate
Posts: 286
Joined: Mon Mar 06, 2006 10:53 pm
Location: Whangarei, New Zealand

Re: VPLS fragment reassembly bug only on TILE-arch

Sun Dec 18, 2022 8:37 pm

Amazing work! I'm sure this very detailed post will help a good many people.
 
oeyre
Member Candidate
Posts: 137
Joined: Wed May 27, 2009 12:48 pm

Re: VPLS fragment reassembly bug only on TILE-arch

Wed Sep 06, 2023 3:12 am

Outstanding work! Please tell me you logged a bug ticket with MT support?

I wonder if the following change from 7.11 is related...
*) mpls - improved MPLS TCP performance;
 
ConradPino
Member
Posts: 337
Joined: Sat Jan 21, 2023 12:44 pm

Re: VPLS fragment reassembly bug only on TILE-arch

Wed Sep 06, 2023 3:58 am

Generous work!
 
clambert
Member Candidate
Posts: 120
Joined: Wed Jun 12, 2019 5:04 am

Re: VPLS fragment reassembly bug only on TILE-arch

Sat Sep 09, 2023 9:03 pm

Amazing post!
 
gunther01
Frequent Visitor
Posts: 50
Joined: Sun Aug 01, 2010 7:00 pm

Re: VPLS fragment reassembly bug only on TILE-arch

Fri Oct 20, 2023 5:45 pm

Any idea if this has been fixed? I seem to have a strange issue between a Tile PE and a new v7 VPLS setup that points to an issue like this (strange slowdowns, web pages not completing, etc.).
 
uCZBpmK6pwoZg7LR
Frequent Visitor
Posts: 54
Joined: Mon Jun 15, 2015 12:23 pm

Re: VPLS fragment reassembly bug only on TILE-arch

Thu Oct 26, 2023 9:40 am

I have lost all hope that the MPLS-related bugs in ROS 7 will be fixed. I have started looking for another router brand, without success so far.
 
nichky
Forum Guru
Posts: 1275
Joined: Tue Jun 23, 2015 2:35 pm

Re: VPLS fragment reassembly bug only on TILE-arch

Sat Oct 28, 2023 6:53 am

Two questions:

1. mpls-mtu=1492: can that cause potential fragmentation?

2. use-explicit-null=yes: why do you need that?
 
DarkNate
Forum Veteran
Posts: 999
Joined: Fri Jun 26, 2020 4:37 pm

Re: VPLS fragment reassembly bug only on TILE-arch

Sat Oct 28, 2023 1:09 pm

I have lost all hope that the MPLS-related bugs in ROS 7 will be fixed. I have started looking for another router brand, without success so far.
Just use baby jumbo frames: at least 1600 L2 MTU and whatever 15xx L3 MTU, and no fragmentation occurs. But personally, I ensure 9k MTU on L3 and 9216 or more on L2 network-wide; for wireless equipment, 2290 L2 MTU and at least 2000 L3 MTU. PMTUD will do the rest: zero fragmentation.
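
As a back-of-envelope check of that advice (using the same two-labels-plus-control-word overhead assumed earlier in the thread):
`
# Largest MPLS packet needed to carry a full 1514-byte customer frame
# over VPLS without fragmenting (2 labels + 4-byte PW control word):
FRAME    = 14 + 1500       # customer Ethernet frame inside the pseudowire
OVERHEAD = 2 * 4 + 4       # tunnel label + VC label + control word

print(FRAME + OVERHEAD)    # 1526 -- comfortably inside a 1600-byte L2 MTU
`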
