1day
just joined
Topic Author
Posts: 2
Joined: Wed Aug 07, 2024 10:29 pm

CRS309 Bridging and VLANs

Fri Aug 16, 2024 2:39 pm

I have narrowed the problems I am having down to two issues. I can't begin to explain the personal costs, effects and likely outcome of this horror story; I am only posting this in the hope of preventing others from wasting time and money on this fundamentally broken kit, and perhaps of getting some sort of explanation of why it is this way. I used to be a fan of MikroTik - a sad day....

If I were cynical (which I am, now) I would argue the CRS309 is broken by design, lacking enough disk storage (less than 300 kB free) for any reasonable workarounds or diagnostics. I'd love to know how we can use 1/2 GB of RAM when the storage to load that RAM from is just 16 MB, of which the OS takes ~98% at 15.7 MB. And there is no USB port to expand it. I should have returned this (heap of ******) the moment I realised this, and that was months ago!! This thing was obsolete on day one. I have done a lot of searching and found others with the same or similar issues, and no explanation or resolution.

Hardware: CRS309-1G-8S+ (amongst others)
ROS: 7.15.3 (stable)

Short version:
A: The VLAN setup on the CRS309 seems to differ from other ROS devices I've used in the past and am using elsewhere in this setup, despite following advice from the manual.

B: High (like >50%) data loss even with almost zero system load (CPU @ 2%, ~1000pps elsewhere). This is transient, and occurs after some time, minutes to hours.

Conclusion:
The VLAN implementation is broken by design and there is at least one if not multiple bugs with the bridge forwarding, either in the firmware driving the switch chip, or the chip itself.
The hardware lacks enough storage to diagnose any issues (or add workarounds, such as a VM) and the same issues take out the 'independent' RJ45. Only the RS232 port let me rescue the situation and get the following diagnosis, and that port is missing from most, if not all, recent MikroTik products.

Longer version:
With L3 offloading turned off I am getting a very unreliable system; with it on, it's almost impossible to debug. The 'swidge' BR0 is configured more as a switch than a router, with just one IP address used for DHCP and management; the rest of the time it's just a switch. The issue is that after some time it stops doing both. I can't even see it with WinBox, though luckily it does have a serial port.
There is no log indication of why it suddenly fails, though I do notice that a route/calc log message often precedes the point in time where it fails; perhaps that is coincidence. Sometimes resetting or changing a setting seems to fix it. Sometimes resetting actually makes it worse (i.e. permanent), then requiring a factory reset to recover (I've had that happen 9 times). To me this suggests some uninitialised register in the switch chip?

When this happens, downstream becomes unreliable, and upstream devices become unreliable too, with ARPs going unanswered or going up the wrong interface and never returning. Locally hosted web pages partially load or time out. I can see this in the sniffer. At the same time there is very high simple packet loss, as demonstrated below. I also get corrupted pings on some interfaces; my guess is that frame data and frame headers are getting interchanged, or that there is a memory fault, though I've yet to be able to catch and compare a corrupt frame.
Another effect (this time in a live rather than test setup) shows up in the logs as follows; you can see that somehow a 'host' is leaping across trunks on the LAN at a fantastic rate.
 14:51:31 bridge,debug host 98:FD:B4:9E:1B:57:11 changed ports: 18:LAP<5+6>-desk to 1:Trunk-R1-DAC-S1
 14:51:32 bridge,debug host 24:5E:BE:67:C8:5F:11 changed ports: 19:LAP<7+8>-nas6 to 1:Trunk-R1-DAC-S1
 14:51:32 bridge,debug host 24:5E:BE:67:C8:5F:11 changed ports: 1:Trunk-R1-DAC-S1 to 19:LAP<7+8>-nas6
 14:51:32 bridge,debug host 98:FD:B4:9E:1B:57:11 changed ports: 1:Trunk-R1-DAC-S1 to 18:LAP<5+6>-desk
 14:51:32 bridge,debug host 98:FD:B4:9E:1B:57:11 changed ports: 18:LAP<5+6>-desk to 1:Trunk-R1-DAC-S1
 14:51:32 bridge,debug host 98:FD:B4:9E:1B:57:11 changed ports: 1:Trunk-R1-DAC-S1 to 18:LAP<5+6>-desk
 14:51:32 bridge,debug host 98:FD:B4:9E:1B:57:11 changed ports: 18:LAP<5+6>-desk to 1:Trunk-R1-DAC-S1
 14:51:32 bridge,debug host 98:FD:B4:9E:1B:57:11 changed ports: 1:Trunk-R1-DAC-S1 to 18:LAP<5+6>-desk
 14:51:32 bridge,debug host 98:FD:B4:9E:1B:57:11 changed ports: 18:LAP<5+6>-desk to 1:Trunk-R1-DAC-S1
 14:51:32 bridge,debug host 24:5E:BE:67:C8:5F:11 changed ports: 19:LAP<7+8>-nas6 to 1:Trunk-R1-DAC-S1
 14:51:32 bridge,debug host 98:FD:B4:9E:1B:57:11 changed ports: 1:Trunk-R1-DAC-S1 to 18:LAP<5+6>-desk
 14:51:32 bridge,debug host 98:FD:B4:9E:1B:57:11 changed ports: 18:LAP<5+6>-desk to 1:Trunk-R1-DAC-S1
 14:51:32 bridge,debug host 98:FD:B4:9E:1B:57:11 changed ports: 1:Trunk-R1-DAC-S1 to 18:LAP<5+6>-desk
 14:51:32 bridge,debug host 98:FD:B4:9E:1B:57:11 changed ports: 18:LAP<5+6>-desk to 1:Trunk-R1-DAC-S1
 14:51:32 bridge,debug host 98:FD:B4:9E:1B:57:11 changed ports: 1:Trunk-R1-DAC-S1 to 18:LAP<5+6>-desk
 14:51:32 bridge,debug host 98:FD:B4:9E:1B:57:11 changed ports: 18:LAP<5+6>-desk to 1:Trunk-R1-DAC-S1
 14:51:32 bridge,debug host 24:5E:BE:67:C8:5F:11 changed ports: 1:Trunk-R1-DAC-S1 to 19:LAP<7+8>-nas6
 14:51:33 bridge,debug host 98:FD:B4:9E:1B:57:11 changed ports: 1:Trunk-R1-DAC-S1 to 18:LAP<5+6>-desk
 14:51:33 bridge,debug host 98:FD:B4:9E:1B:57:11 changed ports: 18:LAP<5+6>-desk to 1:Trunk-R1-DAC-S1
 14:51:34 bridge,debug host 98:FD:B4:9E:1B:57:11 changed ports: 1:Trunk-R1-DAC-S1 to 18:LAP<5+6>-desk
 14:51:34 bridge,debug host 98:FD:B4:9E:1B:57:11 changed ports: 18:LAP<5+6>-desk to 1:Trunk-R1-DAC-S1
 14:51:35 bridge,debug host 1C:2A:A3:1A:B6:0F:11 changed ports: 1:Trunk-R1-DAC-S1 to 18:LAP<5+6>-desk
 14:51:35 bridge,debug host 98:FD:B4:9E:1B:57:11 changed ports: 1:Trunk-R1-DAC-S1 to 18:LAP<5+6>-desk
 14:51:35 bridge,debug host 98:FD:B4:9E:1B:57:11 changed ports: 18:LAP<5+6>-desk to 1:Trunk-R1-DAC-S1
 14:51:36 bridge,debug host 24:5E:BE:67:C8:5F:11 changed ports: 19:LAP<7+8>-nas6 to 1:Trunk-R1-DAC-S1
 14:51:36 bridge,debug host 98:FD:B4:9E:1B:57:11 changed ports: 1:Trunk-R1-DAC-S1 to 18:LAP<5+6>-desk
 14:51:36 bridge,debug host 98:FD:B4:9E:1B:57:11 changed ports: 18:LAP<5+6>-desk to 1:Trunk-R1-DAC-S1
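To put a number on this kind of flapping from a saved log, a minimal Python sketch along these lines could be used. It is not a RouterOS tool: the function name `count_flaps` and the regular expression are assumptions based purely on the `bridge,debug` line format shown above, and the sample is just the first four lines of that log.

```python
import re
from collections import Counter

# Assumed line format, taken from the log excerpt above:
#   " 14:51:31 bridge,debug host <addr> changed ports: <from> to <to>"
FLAP_RE = re.compile(
    r"^\s*(?P<time>\d{2}:\d{2}:\d{2}) bridge,debug host (?P<host>\S+) "
    r"changed ports: (?P<src>\S+) to (?P<dst>\S+)\s*$"
)


def count_flaps(lines):
    """Return a Counter mapping host address -> number of logged port changes."""
    flaps = Counter()
    for line in lines:
        match = FLAP_RE.match(line)
        if match:  # silently skip lines that are not port-change messages
            flaps[match.group("host")] += 1
    return flaps


# First four lines of the log excerpt above, used as sample input.
SAMPLE = """\
 14:51:31 bridge,debug host 98:FD:B4:9E:1B:57:11 changed ports: 18:LAP<5+6>-desk to 1:Trunk-R1-DAC-S1
 14:51:32 bridge,debug host 24:5E:BE:67:C8:5F:11 changed ports: 19:LAP<7+8>-nas6 to 1:Trunk-R1-DAC-S1
 14:51:32 bridge,debug host 24:5E:BE:67:C8:5F:11 changed ports: 1:Trunk-R1-DAC-S1 to 19:LAP<7+8>-nas6
 14:51:32 bridge,debug host 98:FD:B4:9E:1B:57:11 changed ports: 1:Trunk-R1-DAC-S1 to 18:LAP<5+6>-desk
""".splitlines()

if __name__ == "__main__":
    # Print hosts ordered by how often they changed ports.
    for host, changes in count_flaps(SAMPLE).most_common():
        print(host, changes)
```

Feeding it the full log instead of `SAMPLE` gives a per-host count of port changes, which makes it easier to see which hosts bounce between the trunk and the LAP ports.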
These effects seem to increase over hours. After a lot of effort, a complete loss of sanity and probably my job (I am serious here!!!), I found the following process to be a reliable way of triggering some of the effects I see.

I also demonstrate the other issue, with VLANs. The entire point of ROS is that the system should work the same (OK, with throughput changes and feature differences) with the same configuration regardless of hardware. Here I see something that appears to break that rule, effectively meaning that you can't have more than one VLAN hosting an IP on the CRS309, regardless of whether the IP is hosted on a VLAN interface attached to the bridge or on the bridge interface itself.

Proof
The following is the most minimal example that seems to show these issues that I could work out.

<1> Demonstrates that if the pvid of the bridge does not match the incoming port pvid or the vid of an already tagged frame (i.e. frames are always tagged the same on ingress), then de-tagged VLAN traffic, as presented to the particular bridge VLAN, gets no reply. Given that you can only have one bridge on the CRS309 (according to the manual), that a bridge must have exactly one pvid, and that (again according to the manual) ethernet-related changes should be made with L3 h/w offloading off, surely this means you can ONLY have one VLAN, rendering VLANs useless? I would have thought the bridge pvid should be meaningless except for non-VLAN outbound data originating from the CPU on the bridge port. Here the only CPU-originated data is in reply to a host on a VLAN, so it should be replied to on that VLAN, not the fallback pvid.

<2..4>
Demonstrates high data loss with virtually zero traffic unless L3 offloading is turned on. But the manual suggests that things may break if we make changes with it turned on. From this observation it seems that, with it turned off, any change will likely *appear* to fail, suggesting the change itself was in error, and handing over control of an already unstable system is a bad idea. This ignores the fact that 'on' may not always be desired, since some features of ROS are removed and the reduced throughput of 'off' may be acceptable. Yet here it seems to show that off is not an option. If that is the case, why have the setting at all? Why not always do off-change-on internally for every switch/bridge related configuration change, since the user must do that themselves to be sure the system remains stable?


<START>
Initial state:
IP 192.168.30.129/25 assigned to BR0
IP 192.168.30.137/25 assigned to host on SW3
SW3->RB1 pvid=1301, only-tagged (trunk port)
SW3->PC, pvid=11, only-untagged
RB1-BR0: admit-all, pvid=1000
Switch L3 offload = no
RB1-P1: Other network devices…..
RB1-P3: admit-only-tagged, pvid=1018, connected SW3 trunk port
RB1-BR0-P3: vid=11, admit-only-tagged
VLAN11 interface: vid=11, admit-all
Bridge VLAN: vid=11 on BR0, tagged=P1,P3,VLAN11 untagged=BR0


<1>
Change bridge pvid to 11
/interface/bridge/set pvid=11 numbers=0 #(note here 0=BR0, pvid was 1000)

<2>
Enable l3 offloading
/interface/ethernet/switch set l3-hw-offloading=yes numbers=0 #(note here 0=RB1-switch1)

<3>
Assign IP to VLAN11 interface instead….
/ip/address/set interface=VLAN11 numbers=1 #was BR0

<4>
Disable l3 offloading
/interface/ethernet/switch set l3-hw-offloading=no numbers=0 #(note here 0=RB1-switch1)
(… = time passing)
<START>  each line is 1 ping = 1000ms
> ping -t 192.168.30.129
……
Reply from 192.168.30.137: Destination host unreachable.
Reply from 192.168.30.137: Destination host unreachable.
Reply from 192.168.30.137: Destination host unreachable.
Reply from 192.168.30.137: Destination host unreachable.
<1>
Reply from 192.168.30.129: bytes=32 time=1998ms TTL=64
Reply from 192.168.30.129: bytes=32 time<1ms TTL=64
Request timed out.
Reply from 192.168.30.129: bytes=32 time<1ms TTL=64
Reply from 192.168.30.129: bytes=32 time<1ms TTL=64
Request timed out.
Request timed out.
Reply from 192.168.30.129: bytes=32 time<1ms TTL=64
Request timed out.
Request timed out.
Request timed out.
Reply from 192.168.30.129: bytes=32 time<1ms TTL=64
Request timed out.
Request timed out.
Reply from 192.168.30.137: Destination host unreachable.
Reply from 192.168.30.129: bytes=32 time=994ms TTL=64
Reply from 192.168.30.129: bytes=32 time<1ms TTL=64
Reply from 192.168.30.129: bytes=32 time<1ms TTL=64
Request timed out.
Request timed out.
Request timed out.
Reply from 192.168.30.129: bytes=32 time<1ms TTL=64
Reply from 192.168.30.129: bytes=32 time<1ms TTL=64
Request timed out.
Request timed out.
Reply from 192.168.30.129: bytes=32 time<1ms TTL=64
Reply from 192.168.30.129: bytes=32 time<1ms TTL=64
Request timed out.
Request timed out.
Reply from 192.168.30.137: Destination host unreachable.
<2>
Reply from 192.168.30.137: Destination host unreachable.
Reply from 192.168.30.137: Destination host unreachable.
Reply from 192.168.30.137: Destination host unreachable.
Reply from 192.168.30.137: Destination host unreachable.
Reply from 192.168.30.137: Destination host unreachable.
Reply from 192.168.30.129: bytes=32 time=1010ms TTL=64
[NOTE: log message about forwarding]
Reply from 192.168.30.129: bytes=32 time<1ms TTL=64
Reply from 192.168.30.129: bytes=32 time<1ms TTL=64
Reply from 192.168.30.129: bytes=32 time<1ms TTL=64
Reply from 192.168.30.129: bytes=32 time<1ms TTL=64
Reply from 192.168.30.129: bytes=32 time<1ms TTL=64
….
<3>
….
Reply from 192.168.30.129: bytes=32 time<1ms TTL=64
Reply from 192.168.30.129: bytes=32 time<1ms TTL=64
Reply from 192.168.30.129: bytes=32 time<1ms TTL=64
Reply from 192.168.30.129: bytes=32 time<1ms TTL=64
Reply from 192.168.30.129: bytes=32 time<1ms TTL=64
Reply from 192.168.30.129: bytes=32 time<1ms TTL=64
Reply from 192.168.30.129: bytes=32 time<1ms TTL=64
<4>
Request timed out.
Reply from 192.168.30.129: bytes=32 time<1ms TTL=64
Request timed out.
Request timed out.
Request timed out.
Reply from 192.168.30.129: bytes=32 time<1ms TTL=64
Request timed out.
Request timed out.
….
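The loss in each phase can be totalled from a transcript like the one above with a short Python sketch. This is only an illustration: the function name `loss_by_phase` is made up here, the phase markers are assumed to be the `<START>`, `<1>` ... `<4>` lines used above, and both "Request timed out." and "Destination host unreachable." are counted as lost pings. The sample input is the eight lines shown after `<4>`.

```python
import re

# Phase markers as used in the transcript above: "<START>", "<1>", "<2>", ...
PHASE_RE = re.compile(r"^<(START|\d+)>")


def loss_by_phase(lines):
    """Return {phase marker: (received, sent)} for a Windows ping transcript."""
    stats = {}
    phase = None
    for raw in lines:
        line = raw.strip()
        marker = PHASE_RE.match(line)
        if marker:
            phase = marker.group(0)
            stats.setdefault(phase, [0, 0])
            continue
        if phase is None:
            continue  # ignore anything before the first marker
        if line.startswith("Reply from") and "bytes=" in line:
            stats[phase][0] += 1  # a real echo reply
            stats[phase][1] += 1
        elif line == "Request timed out." or line.endswith(
            "Destination host unreachable."
        ):
            stats[phase][1] += 1  # counted as sent but lost
    return {name: tuple(counts) for name, counts in stats.items()}


# The lines shown after <4> in the transcript above.
SAMPLE = """\
<4>
Request timed out.
Reply from 192.168.30.129: bytes=32 time<1ms TTL=64
Request timed out.
Request timed out.
Request timed out.
Reply from 192.168.30.129: bytes=32 time<1ms TTL=64
Request timed out.
Request timed out.
""".splitlines()

if __name__ == "__main__":
    for name, (received, sent) in loss_by_phase(SAMPLE).items():
        loss = 100.0 * (sent - received) / sent if sent else 0.0
        print(f"{name}: {received}/{sent} replies, {loss:.0f}% loss")
```

Note that the transcript above is abridged (the "…." lines), so any figures computed from it cover only the pings actually shown.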
 
yahelb
just joined
Posts: 8
Joined: Tue Feb 02, 2016 7:41 am

Re: CRS309 Bridging and VLANs

Mon Apr 28, 2025 7:52 pm

I don't understand why you would disable L3 offloading.
This is a switch, not a router...
 
anav
Forum Guru
Posts: 23602
Joined: Sun Feb 18, 2018 11:28 pm
Location: Nova Scotia, Canada

Re: CRS309 Bridging and VLANs

Mon Apr 28, 2025 8:46 pm

I didn't get past the first para where your world has apparently ended, but you have never posted here for help.
Why come here to complain? This is not the complaint department, it's the get-assistance-with-your-config department.
Counselling and mental health well-being are down the hall.

The way it works in most cases is that people's words are next to useless, as they are not trained in explaining issues nor in how to state requirements,
and/or are expressing themselves emotionally vice clinically.

THUS to get config help the best bet is twofold:
a. provide the config /export file=anynameyouwish (minus router serial number, any public WANIP information, keys)
b. provide a detailed network diagram so we can see all attached devices, vlans flowing etc..........
 
jaclaz
Forum Guru
Forum Guru
Posts: 2873
Joined: Tue Oct 03, 2023 4:21 pm

Re: CRS309 Bridging and VLANs

Mon Apr 28, 2025 8:53 pm

anav wrote: "I didn't get past the first para where your world has apparently ended, but ..."
Hopefully in the several months (roughly 8) since the OP started venting, the issue has been either solved or forgotten. :roll:

I wouldn't count too much on the OP ever updating the thread, posting relevant info or even coming back to the forum ...
 
anav
Forum Guru
Forum Guru
Posts: 23602
Joined: Sun Feb 18, 2018 11:28 pm
Location: Nova Scotia, Canada

Re: CRS309 Bridging and VLANs

Mon Apr 28, 2025 9:16 pm

My bad, I looked at the date of the responder and not the original post date LOL.
I blame yahelb for bringing it back to life ;-)