If I were cynical (which I am, now) I would argue the CRS309 is broken by design, lacking enough disk storage (less than 300 kB free) for any reasonable workarounds or diagnostics. I'd love to know how we can use 1/2 GB of RAM when the storage to load that RAM from is just 16 MB, of which the OS takes ~98% at 15.7 MB. And there is no USB port to expand it. I should have returned this (heap of ******) the moment I realised this, and that was months ago!! This thing was obsolete on day one. I have done a lot of searching and found others with the same or similar issues, and no explanation or resolution.
Hardware: CRS309-1G-8S+ (amongst others)
ROS: 7.15.3 (stable)
Short version:
A: The VLAN setup on the CRS309 seems to differ from other ROS devices I've used in the past and am using elsewhere in this setup, despite following advice from the manual.
B: High (>50%) data loss even with almost zero system load (CPU @ 2%, ~1000pps elsewhere). This is transient and occurs after some time, anywhere from minutes to hours.
Conclusion:
The VLAN implementation is broken by design and there is at least one bug, if not several, in the bridge forwarding, either in the firmware driving the switch chip or in the chip itself.
The hardware lacks enough storage to diagnose any issues (or add workarounds, such as a VM) and the same issues take out the 'independent' RJ45. Only the RS232 port let me rescue the situation and get the following diagnosis; a serial port is something missing from most, if not all, recent MikroTik products.
Longer version:
With l3 offloading turned off I am getting a very unreliable system; with it on, it's almost impossible to debug. The 'swidge' BR0 is configured more as a switch than a router, with just one IP address, used for DHCP and management; the rest of the time it's just a switch. The issue is that after some time it stops doing both. I can't even see it with WinBox, though luckily it does have a serial port.
There is no log indication of why it suddenly fails, though I do notice that a route/calc log message often precedes the point where it fails; perhaps that is coincidence. Sometimes resetting or changing a setting seems to fix it. Sometimes resetting actually makes it worse (i.e. permanent), then requiring a factory reset to recover (I've had that happen 9 times). To me this suggests some uninitialised register in the switch chip?
When this happens, downstream becomes unreliable, and upstream devices become unreliable too, with ARPs going unanswered or going up the wrong interface and never returning. Locally hosted web pages partially load or time out. I can see this in the sniffer. At the same time there is very high simple packet loss, as demonstrated below. I also get corrupted pings on some interfaces; my guess here is that frame data and frame headers are getting interchanged, or there is a memory fault, though I've yet to be able to catch and compare a corrupt frame.
Another effect (this time in a live rather than test setup) shows up in the logs like this; you can see hosts somehow leaping across trunks on the LAN at a fantastic rate.
14:51:31 bridge,debug host 98:FD:B4:9E:1B:57:11 changed ports: 18:LAP<5+6>-desk to 1:Trunk-R1-DAC-S1
14:51:32 bridge,debug host 24:5E:BE:67:C8:5F:11 changed ports: 19:LAP<7+8>-nas6 to 1:Trunk-R1-DAC-S1
14:51:32 bridge,debug host 24:5E:BE:67:C8:5F:11 changed ports: 1:Trunk-R1-DAC-S1 to 19:LAP<7+8>-nas6
14:51:32 bridge,debug host 98:FD:B4:9E:1B:57:11 changed ports: 1:Trunk-R1-DAC-S1 to 18:LAP<5+6>-desk
14:51:32 bridge,debug host 98:FD:B4:9E:1B:57:11 changed ports: 18:LAP<5+6>-desk to 1:Trunk-R1-DAC-S1
14:51:32 bridge,debug host 98:FD:B4:9E:1B:57:11 changed ports: 1:Trunk-R1-DAC-S1 to 18:LAP<5+6>-desk
14:51:32 bridge,debug host 98:FD:B4:9E:1B:57:11 changed ports: 18:LAP<5+6>-desk to 1:Trunk-R1-DAC-S1
14:51:32 bridge,debug host 98:FD:B4:9E:1B:57:11 changed ports: 1:Trunk-R1-DAC-S1 to 18:LAP<5+6>-desk
14:51:32 bridge,debug host 98:FD:B4:9E:1B:57:11 changed ports: 18:LAP<5+6>-desk to 1:Trunk-R1-DAC-S1
14:51:32 bridge,debug host 24:5E:BE:67:C8:5F:11 changed ports: 19:LAP<7+8>-nas6 to 1:Trunk-R1-DAC-S1
14:51:32 bridge,debug host 98:FD:B4:9E:1B:57:11 changed ports: 1:Trunk-R1-DAC-S1 to 18:LAP<5+6>-desk
14:51:32 bridge,debug host 98:FD:B4:9E:1B:57:11 changed ports: 18:LAP<5+6>-desk to 1:Trunk-R1-DAC-S1
14:51:32 bridge,debug host 98:FD:B4:9E:1B:57:11 changed ports: 1:Trunk-R1-DAC-S1 to 18:LAP<5+6>-desk
14:51:32 bridge,debug host 98:FD:B4:9E:1B:57:11 changed ports: 18:LAP<5+6>-desk to 1:Trunk-R1-DAC-S1
14:51:32 bridge,debug host 98:FD:B4:9E:1B:57:11 changed ports: 1:Trunk-R1-DAC-S1 to 18:LAP<5+6>-desk
14:51:32 bridge,debug host 98:FD:B4:9E:1B:57:11 changed ports: 18:LAP<5+6>-desk to 1:Trunk-R1-DAC-S1
14:51:32 bridge,debug host 24:5E:BE:67:C8:5F:11 changed ports: 1:Trunk-R1-DAC-S1 to 19:LAP<7+8>-nas6
14:51:33 bridge,debug host 98:FD:B4:9E:1B:57:11 changed ports: 1:Trunk-R1-DAC-S1 to 18:LAP<5+6>-desk
14:51:33 bridge,debug host 98:FD:B4:9E:1B:57:11 changed ports: 18:LAP<5+6>-desk to 1:Trunk-R1-DAC-S1
14:51:34 bridge,debug host 98:FD:B4:9E:1B:57:11 changed ports: 1:Trunk-R1-DAC-S1 to 18:LAP<5+6>-desk
14:51:34 bridge,debug host 98:FD:B4:9E:1B:57:11 changed ports: 18:LAP<5+6>-desk to 1:Trunk-R1-DAC-S1
14:51:35 bridge,debug host 1C:2A:A3:1A:B6:0F:11 changed ports: 1:Trunk-R1-DAC-S1 to 18:LAP<5+6>-desk
14:51:35 bridge,debug host 98:FD:B4:9E:1B:57:11 changed ports: 1:Trunk-R1-DAC-S1 to 18:LAP<5+6>-desk
14:51:35 bridge,debug host 98:FD:B4:9E:1B:57:11 changed ports: 18:LAP<5+6>-desk to 1:Trunk-R1-DAC-S1
14:51:36 bridge,debug host 24:5E:BE:67:C8:5F:11 changed ports: 19:LAP<7+8>-nas6 to 1:Trunk-R1-DAC-S1
14:51:36 bridge,debug host 98:FD:B4:9E:1B:57:11 changed ports: 1:Trunk-R1-DAC-S1 to 18:LAP<5+6>-desk
14:51:36 bridge,debug host 98:FD:B4:9E:1B:57:11 changed ports: 18:LAP<5+6>-desk to 1:Trunk-R1-DAC-S1
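To put a number on how fast hosts are flapping, a log excerpt like the one above could be tallied per host. This is only a rough sketch in Python, not anything ROS provides: the function name `count_port_changes` and the regular expression are my own assumptions, based purely on the line format shown in this excerpt.

```python
import re
from collections import Counter

# Assumed line format, taken from the excerpt above:
# HH:MM:SS bridge,debug host <addr> changed ports: <old port> to <new port>
FLAP_RE = re.compile(
    r"^(\d{2}:\d{2}:\d{2}) bridge,debug host (\S+) changed ports: (\S+) to (\S+)$"
)

def count_port_changes(log_text):
    """Return a Counter mapping host address -> number of port changes logged."""
    changes = Counter()
    for line in log_text.splitlines():
        match = FLAP_RE.match(line.strip())
        if match:
            changes[match.group(2)] += 1
    return changes

# First four lines of the excerpt, used as sample input.
sample = """\
14:51:31 bridge,debug host 98:FD:B4:9E:1B:57:11 changed ports: 18:LAP<5+6>-desk to 1:Trunk-R1-DAC-S1
14:51:32 bridge,debug host 24:5E:BE:67:C8:5F:11 changed ports: 19:LAP<7+8>-nas6 to 1:Trunk-R1-DAC-S1
14:51:32 bridge,debug host 24:5E:BE:67:C8:5F:11 changed ports: 1:Trunk-R1-DAC-S1 to 19:LAP<7+8>-nas6
14:51:32 bridge,debug host 98:FD:B4:9E:1B:57:11 changed ports: 1:Trunk-R1-DAC-S1 to 18:LAP<5+6>-desk
"""

for host, n in count_port_changes(sample).most_common():
    print(host, n)
```

Fed the whole saved log, the per-host totals should make it obvious which hosts are bouncing between the trunk and their access ports.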
I also demonstrate the other issue, with VLANs. The entire point of ROS is that the system should work the same (OK, with throughput changes and feature differences) with the same configuration regardless of hardware. Here I see something that appears to break that rule, effectively meaning that you can't have more than one VLAN hosting an IP on the CRS309, regardless of whether the IP is hosted on a VLAN interface connected to a bridge port or on the bridge interface itself.
Proof
The following is the most minimal example that seems to show these issues that I could work out.
<1> Demonstrates that if the pvid of the bridge does not match the incoming port pvid or the vid of an already tagged frame (i.e. frames are always tagged the same on ingress), then de-tagged VLAN traffic, as presented to that particular bridge VLAN, gets no reply. Given that you can only have one bridge on the CRS309 (according to the manual), that a bridge must have exactly one pvid, and (again according to the manual) that ethernet related changes should be made with l3 h/w offloading off, surely this means you can ONLY have one VLAN, rendering VLANs useless? I would have thought the bridge pvid should be meaningless except for non-VLAN outbound data originating from the CPU on the bridge port. Here the only CPU-originated data is in reply to a host on a VLAN, so it should be replied to on that VLAN, not the fallback pvid.
<2..4>
Demonstrates high data loss with virtually zero traffic unless l3 offloading is turned on. But the manual suggests that things may break if we make changes with it turned on. From this observation it seems that, if it is turned off, any change will likely *appear* to fail, suggesting the change itself is in error, and handing over control of an already unstable system is a bad idea. This is ignoring the fact that 'on' may not always be desired, as it removes some ROS features, if the reduced throughput is acceptable. Yet here it seems to show that off is not an option. If that is the case, why have the setting at all? Why not always do off-change-on internally for every switch/bridge related configuration change, since the user must do that themselves to be sure the system remains stable?
<START>
Initial state:
IP 192.168.30.129/25 assigned to BR0
IP 192.168.30.137/25 assigned to host on SW3
SW3->RB1 pvid=1301, only-tagged (trunk port)
SW3->PC, pvid=11, only-untagged
RB1-BR0: admit-all, pvid=1000
Switch L3 offload = no
RB1-P1: Other network devices…..
RB1-P3: admit-only-tagged, pvid=1018, connected SW3 trunk port
RB1-BR0-P3: vid=11, admit-only-tagged
VLAN11 interface: vid=11, admit-all
Bridge VLAN: vid=11 on BR0, tagged=P1,P3,VLAN11 untagged=BR0
<1>
Change bridge pvid to 11
/interface/bridge/set pvid=11 numbers=0 #(note here 0=BR0, pvid was 1000)
<2>
Enable l3 offloading
/interface/ethernet/switch set l3-hw-offloading=yes numbers=0 #(note here 0=RB1-switch1)
<3>
Assign IP to VLAN11 interface instead….
/ip/address/set interface=VLAN11 numbers=1 #was BR0
<4>
Disable l3 offloading
/interface/ethernet/switch set l3-hw-offloading=no numbers=0 #(note here 0=RB1-switch1)
(… = time passing)
<START> each line is 1 ping = 1000ms
> ping -t 192.168.30.129
……
Reply from 192.168.30.137: Destination host unreachable.
Reply from 192.168.30.137: Destination host unreachable.
Reply from 192.168.30.137: Destination host unreachable.
Reply from 192.168.30.137: Destination host unreachable.
<1>
Reply from 192.168.30.129: bytes=32 time=1998ms TTL=64
Reply from 192.168.30.129: bytes=32 time<1ms TTL=64
Request timed out.
Reply from 192.168.30.129: bytes=32 time<1ms TTL=64
Reply from 192.168.30.129: bytes=32 time<1ms TTL=64
Request timed out.
Request timed out.
Reply from 192.168.30.129: bytes=32 time<1ms TTL=64
Request timed out.
Request timed out.
Request timed out.
Reply from 192.168.30.129: bytes=32 time<1ms TTL=64
Request timed out.
Request timed out.
Reply from 192.168.30.137: Destination host unreachable.
Reply from 192.168.30.129: bytes=32 time=994ms TTL=64
Reply from 192.168.30.129: bytes=32 time<1ms TTL=64
Reply from 192.168.30.129: bytes=32 time<1ms TTL=64
Request timed out.
Request timed out.
Request timed out.
Reply from 192.168.30.129: bytes=32 time<1ms TTL=64
Reply from 192.168.30.129: bytes=32 time<1ms TTL=64
Request timed out.
Request timed out.
Reply from 192.168.30.129: bytes=32 time<1ms TTL=64
Reply from 192.168.30.129: bytes=32 time<1ms TTL=64
Request timed out.
Request timed out.
Reply from 192.168.30.137: Destination host unreachable.
<2>
Reply from 192.168.30.137: Destination host unreachable.
Reply from 192.168.30.137: Destination host unreachable.
Reply from 192.168.30.137: Destination host unreachable.
Reply from 192.168.30.137: Destination host unreachable.
Reply from 192.168.30.137: Destination host unreachable.
Reply from 192.168.30.129: bytes=32 time=1010ms TTL=64
[NOTE: log message about forwarding]
Reply from 192.168.30.129: bytes=32 time<1ms TTL=64
Reply from 192.168.30.129: bytes=32 time<1ms TTL=64
Reply from 192.168.30.129: bytes=32 time<1ms TTL=64
Reply from 192.168.30.129: bytes=32 time<1ms TTL=64
Reply from 192.168.30.129: bytes=32 time<1ms TTL=64
….
<3>
….
Reply from 192.168.30.129: bytes=32 time<1ms TTL=64
Reply from 192.168.30.129: bytes=32 time<1ms TTL=64
Reply from 192.168.30.129: bytes=32 time<1ms TTL=64
Reply from 192.168.30.129: bytes=32 time<1ms TTL=64
Reply from 192.168.30.129: bytes=32 time<1ms TTL=64
Reply from 192.168.30.129: bytes=32 time<1ms TTL=64
Reply from 192.168.30.129: bytes=32 time<1ms TTL=64
<4>
Request timed out.
Reply from 192.168.30.129: bytes=32 time<1ms TTL=64
Request timed out.
Request timed out.
Request timed out.
Reply from 192.168.30.129: bytes=32 time<1ms TTL=64
Request timed out.
Request timed out.
….
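For anyone wanting to compare the steps numerically, the loss per step can be counted from a saved copy of output like the above. A minimal sketch in Python follows; the function name `loss_by_phase` is hypothetical, and it assumes Windows-style ping text with the `<n>` step markers on their own lines, as I have laid them out here. Both "Request timed out." and "Destination host unreachable." are counted as lost.

```python
import re

PHASE_RE = re.compile(r"^<(\w+)>")              # step markers such as <START>, <1>, <2>
REPLY_RE = re.compile(r"^Reply from (\S+): bytes=")  # a genuine echo reply

def loss_by_phase(ping_text, target):
    """Return {phase: (sent, lost)} for ping output split by <n> markers."""
    stats = {}
    phase = None
    for raw in ping_text.splitlines():
        line = raw.strip()
        marker = PHASE_RE.match(line)
        if marker:
            phase = marker.group(1)
            stats.setdefault(phase, [0, 0])
            continue
        if phase is None:
            continue  # ignore anything before the first marker
        reply = REPLY_RE.match(line)
        if reply and reply.group(1) == target:
            stats[phase][0] += 1          # answered ping
        elif line == "Request timed out." or line.endswith(
            "Destination host unreachable."
        ):
            stats[phase][0] += 1          # ping sent...
            stats[phase][1] += 1          # ...but lost
    return {p: tuple(v) for p, v in stats.items()}

# Tiny made-up sample in the same format, for illustration only.
sample = """\
<1>
Reply from 192.168.30.129: bytes=32 time<1ms TTL=64
Request timed out.
Reply from 192.168.30.137: Destination host unreachable.
<2>
Reply from 192.168.30.129: bytes=32 time<1ms TTL=64
"""

for phase, (sent, lost) in loss_by_phase(sample, "192.168.30.129").items():
    pct = 100 * lost / sent if sent else 0
    print(f"<{phase}> sent={sent} lost={lost} loss={pct:.0f}%")
```

Lines that are neither a reply, a timeout, nor an unreachable message (the ping command itself, the dots for time passing, the log note) are simply skipped.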