On router hAP ac^2 I monitored the traffic using "Tools / Torch" in the GUI and added all observed L2 EtherTypes via ACL into the rule table of the switch-chip.
But as soon as I activate the last rule by setting disabled=no, Internet access stops working. Which other EtherType is most likely missing below?
(Btw, interface ether5 is intentionally excluded for safety reasons, ie. to be able to login to the router via that interface in case the other interfaces get blocked by these rules)
I've now added all documented mac-protocols I could find in the wiki pages, ie.
mac-protocol (802.2 | arp | homeplug-av | ip | ipv6 | ipx | lldp | loop-protect | mpls-multicast | mpls-unicast | packing-compr | packing-simple | pppoe | pppoe-discovery | rarp | service-vlan | vlan)
And the behavior is still the same! It blocks as soon as the last rule gets activated! (cf. OP).
I think this means that RouterOS uses either an L2 protocol that is undocumented yet, or there is a BUG in ACL.
(Btw, if it's not a bug then it must be an L2 protocol, not a higher protocol as they all are included by the above.)
SOS MikroTik! Need help ASAP, else can't continue with my work here.
Here my latest attempt with ALL the documented mac-protocols:
It seems there is a bug in ACL b/c I did use the "Tools / PacketSniffer" tool over interfaces=all, but all the mac-protocols it lists are already present in the ACL...
Packet Sniffer runs on CPU, not hardware. You will need to temporarily disable hardware acceleration on the port(s) that you wish to sniff, otherwise you will only capture the packets that hit the CPU rather than the packets that are being switched by the switch chip.
Ok, thanks, I’ll consider this fact and try to disable hw accel for the test.
I now performed this test:
added a “redirect-to-cpu” rule as the first rule (actually 2nd in my case) to pass all traffic to the CPU (ie. effectively disabling HW acceleration/offloading).
This rule works (ie. is effective) b/c enabling the last rule to “block all other” does not block anymore as it’s not executed due to this redirect-to-cpu rule.
And so now I collected new data using the Sniffer tool.
But there is no new mac-protocol in the capture, i.e. the error cannot be found this way.
Any new ideas to pinpoint the error?
Below is the result of the Packet Sniffer session (all the mac-protocols (below they are in decimal) are already in the ACL, so nothing new in the data):
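The comparison described here (sniffer values in decimal against the EtherTypes already in the ACL) can be sketched in a few lines of Python. This is only an illustration: the function name `missing_ethertypes` is made up, and the sample numbers are standard EtherTypes chosen for the example, not the data captured in this thread.

```python
def missing_ethertypes(sniffed_decimal, acl_ethertypes):
    """Return sniffed EtherTypes (as sorted hex strings) that the ACL does not cover."""
    missing = sorted(set(sniffed_decimal) - set(acl_ethertypes))
    return [f"0x{value:04X}" for value in missing]


# Illustrative values only, NOT the data captured in this thread.
sniffed = [2048, 2054, 34525, 35020]   # decimal, the way the sniffer lists them
acl = {0x0800, 0x0806, 0x86DD}         # hex, the way they are written in the rules

if __name__ == "__main__":
    print(missing_ethertypes(sniffed, acl))
```

An empty list would correspond to the situation reported above: everything the sniffer saw is already covered by the ACL.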
A wild guess here… There is a bug in the bridge filter rules where the two bytes of the 16-bit ethertype field in 802.1Q headers are swapped on some CPU architectures; arm (the architecture of the hAP ac²) is one of them, whereas mipsbe is not affected. There is another bug, in the ppp handling of the CCP protocol, which looks similar to me (again, a 16-bit value is handled wrong), but this time mipsbe is affected and arm is not.
Hence I can imagine that handling of endianness may be a general issue in various parts of the code, so you might want to set the ethertypes in your switch rules as hex values in swapped byte order, to see whether it changes anything about the behaviour.
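To try this, the byte-swapped numbers can be computed rather than typed by hand. Below is a minimal Python sketch; the helper names `swap16` and `swapped_rules` are hypothetical, the three EtherTypes are the standard values for ip, arp and ipv6, and the generated lines just follow the rule format used elsewhere in this thread.

```python
def swap16(value):
    """Swap the two bytes of a 16-bit value, e.g. 0x0806 -> 0x0608."""
    if not 0 <= value <= 0xFFFF:
        raise ValueError("EtherType must fit in 16 bits")
    return ((value & 0xFF) << 8) | (value >> 8)


# Standard EtherTypes for three common protocols.
ETHERTYPES = {"ip": 0x0800, "arp": 0x0806, "ipv6": 0x86DD}


def swapped_rules(ports="$myPorts"):
    """Build one switch-rule line per EtherType, using the byte-swapped number."""
    rules = []
    for name, number in ETHERTYPES.items():
        swapped = swap16(number)
        rules.append(
            f"add switch=switch1 ports={ports} mac-protocol=0x{swapped:04X} "
            f'comment="byte-swapped 0x{number:04X} ({name})"'
        )
    return rules


if __name__ == "__main__":
    print("\n".join(swapped_rules()))
```

The swapped rules would be added in addition to the normal ones, so that traffic matches whichever byte order the switch chip ends up comparing against.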
Man, sindy! You seem to be right! After reading your above reply, I wanted to test this, and:
indeed it seems that there is/are such endian errors in the RouterOS code as I just finished my first test with also such byte-swapped mac-protocol numbers,
and it now worked flawlessly when I enabled the final "block all other" rule!
I'll post shortly my test script for inspection/verification. It has 58 ACL switch rules in total.
Ok, here's my test script. The last rule is disabled by default.
After emptying "/interface ethernet switch rule" and doing "/import myACL.rsc" I use the following command to be on the safe side for such testing:
"enable numbers=57 ; /delay 20 ; disable numbers=57"
BUT: it is still unknown yet which of the mac-protocols is/are affected by this endian-bug --> one needs to do some more tests...
After some lengthy testing, the error finally has been found! :
The endian-error is with the mac-protocol “arp” (EtherType 0x0806).
It can be an endian-error or a simple parsing error from the string “arp” to the right EthType numeric value, maybe mixed up with “rarp”.
If one disables rule #20, which was added via mac-protocol NUMBER,
then ping (and all other traffic) is no longer possible (this takes effect only after about 15 seconds!) even though rule #3 should still work, but it actually doesn’t:
To summarize: there is a difference whether one adds via mac-protocol name or via mac-protocol number! Ie.:
```text
add switch=switch1 ports=$myPorts mac-protocol=arp
add switch=switch1 ports=$myPorts mac-protocol=0x0806 comment="EthType 0x0806 (arp)"
```
The first one (the official one) seems to be buggy, ie. not working; the second variant does work.
I’ll do some more tests to see whether this byte-swapped 3rd variant is necessary too:
It’s strange. On my hAP ac² (running 6.45.9), if I add the rule with mac-protocol=0x0806, it is both _print_ed and _export_ed with mac-protocol=arp, i.e. the conversion seems to work both ways. So I don’t get why in your case there is a difference in behaviour when you add it as “arp” and when you add it as “0x0806”.
No, there is no difference; in my case it is the same, if you compare the “print” as well as the “export” output above: it says “arp” in both cases.
But the real error is what EthType number it internally has assigned to “arp”, ie. add by name (which we can’t see, only guess: but surely it can’t be the 0x0806).
As I tried to explain, it very much looks like a “parsing and assigning error”. Maybe the parsing of “arp” lands in the table item of “rarp”, ie. maybe using the EthType number of rarp for arp
Btw, the same error is also present in RouterOS 6.47; I upgraded from that version to 7.0beta8 because of it, but unfortunately the error is in both versions.
if you enter a protocol name, at some point it gets converted to a protocol number, as the switch chip hardware understands only numbers. And because not every protocol number has a name assigned in the code, it is logical to store the numbers internally, and only do the number<=>name conversion when interacting with the user.
if the above assumption is correct, then if you enter 0x0806, it gets stored as such; when you print or export the rule, the number is found in the number<=>name mapping table, and “arp” is shown at the UI rather than the 0x0806.
another assumption is that even if there are separate number=>name and name=>number mapping tables (as configuration manipulation need not be lightning-fast, there is no actual need to have names indexed by numbers and numbers indexed by names), one of them is most likely generated from the other, as I cannot imagine a sane programmer to put both to the code manually
based on all the above, I assume that when you use the number when creating the rule, and you get back the correct name when displaying the rule, the number<=>name mapping is correct, and thus when you use the name when creating the rule, it is converted to a correct number.
So something in the above assumptions has to be wrong so that creating the rule using the name and creating it using the number would lead to different numbers being used in the hardware table in the switch chip, but both would be translated to the same name when displayed to the user.
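The assumptions above can be put into a small Python model. This is only a sketch of the reasoning, not RouterOS code: the table contents, `parse_mac_protocol` and `show_mac_protocol` are hypothetical names, and only the three EtherType values are standard.

```python
# Hypothetical name -> number table; the three values are standard EtherTypes.
NAME_TO_NUMBER = {"ip": 0x0800, "arp": 0x0806, "rarp": 0x8035}

# The reverse table is generated from the forward one, never typed by hand.
NUMBER_TO_NAME = {number: name for name, number in NAME_TO_NUMBER.items()}


def parse_mac_protocol(text):
    """Accept a protocol name or a number such as '0x0806'; return the stored number."""
    if text in NAME_TO_NUMBER:
        return NAME_TO_NUMBER[text]
    return int(text, 0)  # base 0 lets int() honour the 0x prefix


def show_mac_protocol(number):
    """Return the name if one is known, otherwise the number in hex."""
    return NUMBER_TO_NAME.get(number, f"0x{number:04X}")


if __name__ == "__main__":
    for entered in ("arp", "0x0806"):
        stored = parse_mac_protocol(entered)
        print(entered, "->", hex(stored), "->", show_mac_protocol(stored))
```

In this model, "arp" and "0x0806" end up as the same stored number and are displayed identically, so a difference in behaviour would have to come from somewhere the model does not cover, for example the step that writes the number into the switch chip.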
Here’s another mystery to add to the confusion list:
in my print list the rule #41 gets interpreted as another “802.2” though it has a totally different EthType (0x0008).
The correct “802.2” has EtherType 0x0004 (rule #19 and #2 in the print list).
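One possible explanation, offered as a guess: in IEEE 802.3 the 16-bit field after the MAC addresses is a payload length when it is 1500 (0x05DC) or less, and an EtherType only from 1536 (0x0600) upwards. Both 0x0004 and 0x0008 fall into the length range, which is where 802.2 LLC frames live, and 0x0008 is also exactly the byte-swapped form of 0x0800 (ip). Whether RouterOS uses this rule when it prints the name is an assumption on my side. The Python sketch below only shows the classification; `classify` is a made-up helper.

```python
MAX_LENGTH = 0x05DC      # 1500: largest value treated as a payload length
MIN_ETHERTYPE = 0x0600   # 1536: smallest value treated as an EtherType


def classify(field):
    """Classify the 16-bit type/length field of an Ethernet frame (IEEE 802.3)."""
    if not 0 <= field <= 0xFFFF:
        raise ValueError("field must fit in 16 bits")
    if field <= MAX_LENGTH:
        return "length"      # 802.3 frame, typically carrying an 802.2 LLC header
    if field >= MIN_ETHERTYPE:
        return "ethertype"   # Ethernet II frame
    return "undefined"       # 0x05DD..0x05FF is not assigned either meaning


if __name__ == "__main__":
    for value in (0x0004, 0x0008, 0x0800, 0x0806):
        print(f"0x{value:04X}: {classify(value)}")
```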
I think it could also be a hashing issue, i.e. where hashing is used for fast lookup (or even direct access) of items in a collection, instead of other forms of fast lookup like bsearch etc.
Both are necessary! arp via name as well as via number.
Then this can only mean that “arp by name” uses another essential (undocumented) EtherType.
Otherwise it does not make any sense, IMO.
Unless there is a memory problem caused by “double free’ing”, “use after free”, or overwriting other data, or so…
Can a MikroTik developer not simply take a look in the source code and enlighten us?
Or better make the code snippet available for study/debugging?
Open a ticket and send tech support a ‘supout’ along with your documented evidence and hopefully they will respond.
My question is, will this ‘bug’ affect normal usage?
I already did enough, made them aware of a severe bug and even located the bug. I’m not going to make any more. Enough is enough.
Without fixing this bug the whole ACL cannot be used.
ACL is important for wire-speed firewalling.
It seems under the official mac-protocol names they internally use multiple EtherType codes, maybe not with all of them, but with some.
One comes to that conclusion after testing, analyzing and thinking deep about the observed results.
Because using the numerical versions alone is not sufficient to function correctly… Ie. there is not a 1:1 mapping between names and their EtherType numbers.
Another possible mystery is whether the Packet Sniffer is perhaps hiding some EtherTypes…
If you haven’t opened a support ticket, you actually haven’t done enough. This is a forum where users help each other, not a channel to report bugs. But when reporting the bug by sending an e-mail to support@mikrotik.com, it is enough to provide a link to this topic instead of describing the details again. However, the supout.rif taken when it does not work as expected is a mandatory attachment to the support case - they won’t proceed with handling the case until you provide it.
Only now, while looking for the difference between your setup and mine, have I noticed that you are setting the rules using ROS 7.0beta8 - it can only be seen in the export header, you don’t mention that anywhere in the text.
On long-term (6.45.9), I’ve just tried the following rules:
And it just works - if I disable the “accept arp” rule or the “accept ip” rule, it is not possible any more to ping the device connected to ether5; as soon as both are enabled, pinging works again.
So your findings are definitely an important feedback for the ROS 7 development team, but for normal production deployment, there is no issue.
Do you have any special reason why you need to use a ROS 7 beta for the task requiring use of switch chip rules?
Btw, the same error is also present in RouterOS 6.47; I upgraded from that version to 7.0beta8 because of it, but unfortunately the error is in both versions.
Just try also with the latest stable version 6.47 and tell me that it functions. It doesn’t, much like the current 7.0beta8.