CRS326-24G - Dual BSD firewalls, CARP, MAC address learning and host timeout issues

Hi there. I have a CRS326-24G. In our setup we have a number of incoming lines plugged into various ports, and two firewalls that also plug into bridge ports on the CRS326.

Basically, the two firewalls share an IP address (they are BSD and using CARP), and if one firewall goes down the other seamlessly takes over. We had no problems for about 2 years when using a 3COM switch. We recently swapped the 3COM out for the CRS326 (we are using it primarily as a switch, but with access to a lot of features), and the functionality of the CARP address sharing has effectively stopped. We have traced the problem to the MAC learning and timeout of mac addresses in the bridge host table. If the IP moves from one FW to the other, everything on the firewalls works fine, but the switch keeps sending the data with the shared MAC address to the original port (even though the IP has changed) until the host entry for that MAC expires, and at that point it sends the data to the new port. Since this can take around 5 minutes, it’s obviously a huge problem for us.

I would like to know if anyone else has overcome this problem, and if so how. If nobody else has had this specific problem then I would like help determining:

  1. Is there a way to get the switch to send data with X mac address to two ports (effectively port mirroring only for that mac), and/or
  2. How to change the timeout on the entries in the host table, ideally only for that mac address, and/or
  3. If it’s possible to create a bridge that puts all the ports into a single “collision domain” where all traffic is mirrored to all ports, and/or
  4. If anyone else has any suggestions.

Thanks in advance, and keep sane while isolating :wink:

J

Hmm. It should work. Mind posting a config (/export hide-sensitive)?

Also, what code version? Newer versions of code added a message to show mac address “flap” between interfaces which should be shown during failover here.

I’ve done some additional sanitizing, so if something looks too alien I probably messed up. But I think not :wink:

I’d like to stress that this is primarily a switch, NOT a router or firewall.

[admin@MikroTik_LSG_RED] > /export hide-sensitive
# apr/05/2020 21:54:10 by RouterOS 6.46.4
# software id = 534F-0VN6
#
# model = CRS326-24G-2S+
/interface bridge
add admin-mac=B8:69:F4:5E:2B:97 auto-mac=no comment="00 Root Bridge" name=inet_bridge \
    vlan-filtering=yes
/interface ethernet
set [ find default-name=ether1 ] comment=ether01 name=FW-LSGFW01 speed=100Mbps
set [ find default-name=ether2 ] comment=ether02 name=FW-LSGFW02 speed=100Mbps
set [ find default-name=ether11 ] comment="ether11 - Fibre Link" name=ISP-AAAA01 speed=100Mbps
set [ find default-name=ether14 ] comment=ether14 name=ISP-DSL speed=100Mbps
set [ find default-name=ether13 ] comment="ether13 - Fibre Link" name=ISP-BBBZA01 speed=100Mbps
set [ find default-name=ether12 ] comment="ether12 - Microwave Link" name=ISP-CCC01 speed=100Mbps
set [ find default-name=ether7 ] comment="ether7 - LSG Guest Network" name=RTR-LSG_Guest speed=\
    100Mbps
set [ find default-name=ether3 ] comment="ether03 - MikroTik VPN (REMOTECLIENT01)" name=RTR-VPN01 speed=\
    100Mbps
set [ find default-name=ether22 ] comment=ether22 name=SRV-DNINGFW01-IF01 speed=100Mbps
set [ find default-name=ether21 ] comment=ether21 name=SRV-DNINGFW01-IF02 speed=100Mbps
set [ find default-name=ether17 ] comment=ether17 name=SRV-LSGLXD02 speed=100Mbps
set [ find default-name=ether4 ] speed=100Mbps
set [ find default-name=ether5 ] speed=100Mbps
set [ find default-name=ether6 ] speed=100Mbps
set [ find default-name=ether8 ] speed=100Mbps
set [ find default-name=ether9 ] speed=100Mbps
set [ find default-name=ether10 ] speed=100Mbps
set [ find default-name=ether15 ] speed=100Mbps
set [ find default-name=ether16 ] speed=100Mbps
set [ find default-name=ether18 ] speed=100Mbps
set [ find default-name=ether19 ] speed=100Mbps
set [ find default-name=ether20 ] speed=100Mbps
set [ find default-name=ether23 ] speed=100Mbps
set [ find default-name=ether24 ] speed=100Mbps
set [ find default-name=sfp-sfpplus1 ] speed=10Gbps
set [ find default-name=sfp-sfpplus2 ] speed=10Gbps
/interface vlan
add comment="Bridge Interface" interface=inet_bridge name=IF-CCC vlan-id=112
/interface list
add name=WAN
add name=LAN
add name=ISP
add name="FW & Routers"
add name=Servers
add name=Unassigned
/interface wireless security-profiles
set [ find default=yes ] supplicant-identity=MikroTik
/ip hotspot profile
set [ find default=yes ] html-directory=flash/hotspot
/interface bridge port
add bridge=inet_bridge comment=ether2 interface=FW-LSGFW02
add bridge=inet_bridge comment=defconf interface=RTR-VPN01
add bridge=inet_bridge comment=defconf interface=ether4
add bridge=inet_bridge comment=defconf interface=ether5
add bridge=inet_bridge comment=defconf interface=ether6
add bridge=inet_bridge comment=defconf interface=RTR-LSG_Guest
add bridge=inet_bridge comment=defconf interface=ether8
add bridge=inet_bridge comment=defconf interface=ether9
add bridge=inet_bridge comment=defconf interface=ether10
add bridge=inet_bridge comment=defconf interface=ISP-AAAA01
add bridge=inet_bridge comment=defconf interface=ISP-CCC01 pvid=112
add bridge=inet_bridge comment=defconf interface=ISP-BBBZA01 pvid=113
add bridge=inet_bridge comment=defconf interface=ISP-DSL pvid=114
add bridge=inet_bridge comment=defconf interface=ether15
add bridge=inet_bridge comment=defconf interface=ether16
add bridge=inet_bridge comment=defconf interface=SRV-LSGLXD02
add bridge=inet_bridge comment=defconf interface=ether18
add bridge=inet_bridge comment=defconf interface=ether19
add bridge=inet_bridge comment=defconf interface=ether20
add bridge=inet_bridge comment=defconf interface=SRV-DNINGFW01-IF02
add bridge=inet_bridge comment=defconf interface=SRV-DNINGFW01-IF01
add bridge=inet_bridge comment=defconf interface=ether23
add bridge=inet_bridge comment=defconf interface=ether24
add bridge=inet_bridge comment=defconf interface=sfp-sfpplus1
add bridge=inet_bridge comment=defconf interface=sfp-sfpplus2
add bridge=inet_bridge comment=ether1 interface=FW-LSGFW01
/interface bridge vlan
add bridge=inet_bridge comment="ISP - AAAA 01 **UNUSED**" tagged=ISP-AAAA01,FW-LSGFW01,FW-LSGFW02 \
    vlan-ids=111
add bridge=inet_bridge comment="ISP - CCC 01" tagged=\
    FW-LSGFW01,FW-LSGFW02,SRV-LSGLXD02,SRV-DNINGFW01-IF01,SRV-DNINGFW01-IF02,inet_bridge untagged=\
    ISP-CCC01 vlan-ids=112
add bridge=inet_bridge comment="ISP - BBBZA 01" tagged=FW-LSGFW01,FW-LSGFW02 untagged=ISP-BBBZA01 \
    vlan-ids=113
add bridge=inet_bridge comment="ISP - ADSL (DDDDD)" tagged=FW-LSGFW01,FW-LSGFW02 untagged=ISP-DSL \
    vlan-ids=114
add bridge=inet_bridge comment="Base VLAN" vlan-ids=1
/interface list member
add interface=FW-LSGFW01 list=WAN
add interface=FW-LSGFW02 list="FW & Routers"
add interface=RTR-VPN01 list="FW & Routers"
add interface=ether4 list=Unassigned
add interface=ether5 list=Unassigned
add interface=ether6 list=Unassigned
add interface=RTR-LSG_Guest list="FW & Routers"
add interface=ether8 list=Unassigned
add interface=ether9 list=Unassigned
add interface=ether10 list=Unassigned
add interface=ISP-AAAA01 list=ISP
add interface=ISP-CCC01 list=ISP
add interface=ISP-BBBZA01 list=ISP
add interface=ISP-DSL list=ISP
add interface=ether15 list=Unassigned
add interface=ether16 list=Unassigned
add interface=SRV-LSGLXD02 list=Servers
add interface=ether18 list=Unassigned
add interface=ether19 list=Unassigned
add interface=ether20 list=Unassigned
add interface=SRV-DNINGFW01-IF02 list=Servers
add interface=SRV-DNINGFW01-IF01 list=Servers
add interface=ether23 list=Unassigned
add interface=ether24 list=Unassigned
add interface=sfp-sfpplus1 list=Unassigned
add interface=sfp-sfpplus2 list=Unassigned
/ip address
add address=192.168.88.1/24 comment=defconf interface=inet_bridge network=192.168.88.0
add address=10.0.0.5/24 interface=IF-CCC network=10.0.0.0
/ip dhcp-client
# DHCP client can not run on slave interface!
add disabled=no interface=FW-LSGFW01
/ip dns
set servers=8.8.8.8
/ip route
add distance=1 gateway=10.0.0.1
/ip service
set telnet address=192.168.88.0/24,10.0.0.0/24
set ftp address=192.168.88.0/24,10.0.0.0/24
set www address=192.168.0.0/16,10.0.0.0/8
set ssh port=21
set api address=192.168.88.0/24,10.0.0.0/24
/ip ssh
set allow-none-crypto=yes forwarding-enabled=remote
/system clock
set time-zone-name=Africa/Johannesburg
/system identity
set name=MikroTik_LSG_RED
/system package update
set channel=long-term
/system routerboard settings
set boot-os=router-os
/tool sniffer
set file-limit=10000KiB filter-interface=ether9,ether15 memory-limit=1000KiB
[admin@MikroTik_LSG_RED] >

I’m unsure what you mean by code version, but I’m running 6.42.5.

Start by updating the code to at least the latest LTS and retest. It’s good for security and rules out other pre-existing bugs. Along the security lines, let SSH be on 22 and use strong crypto.

https://mikrotik.com/download/changelogs/long-term-release-tree

The most current testing release, 6.47b53, has the log message change that I mentioned. This will give an easy way to watch the failover happen. If upgrading to the LTS release doesn’t help, that would be interesting.

Your config has a bunch of interface lists but doesn’t seem to use them anywhere. The interesting thing here is that the 2 interfaces labeled as firewalls are not in the same list. Are firewall rules being applied?

You have a rule about the DHCP client not being allowed on an interface labeled for what I assume is a firewall. Why would it ever be on an L2 interface? A DHCP client would only be on the relevant L3 interface. Either way, that doesn’t seem to be present (or was omitted).

I’d be curious to see a complete (redacted) config and confirmation on which interfaces CARP is being passed through and on which VLANs.

Thanks for the feedback. My response follows:

  • I misspoke, I’m on version 6.46.4, which is the latest in the stable tree. (I copied and pasted, and it seems I copied the wrong version - not sure how).
  • Running SSH on port 22 is a HUGE security risk - it attracts thousands of break-in attempts each day. Running it on a non standard port (I don’t really run it on 21) means it’s less likely to be an attack vector (attackers have to run a full port scan on the machine to detect it, which is easy to detect and block, and also far less frequent).
  • Please explain what you mean by strong crypto. We use only SSH keys and I believe we are using only elliptic curve (Ed25519) as the generating algorithm.
  • I may install the testing version if nothing else works, but this is currently live and in production (we had some time over the weekend, which has now passed)
  • The MikroTik is being used as a switch, and not a firewall. The only IPs we need on the switch are there to allow access to the switch itself.
  • The interface lists are there for future use, and one of the techs politely decided to rearrange the cables on Sunday to make them neater. I’m still fixing up some of the fallout from that. Getting the ports back into the right VLANs was a priority, so the lists are currently a mess that will be fixed later.
  • I missed that DHCP config - it must be a legacy oversight, thanks for pointing it out. I’ve rectified it (i.e. deleted it) and tested the failover again. The situation is the same.

As to how the CARP works, it’s fairly simple.

Each firewall is configured with a CARP address and an advertising interval. The host with the lowest interval (that is, the one that advertises most frequently) is the master and has the live IP address. We currently have 2 firewalls living on switch ports 1 and 2.
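To make the election rule concrete, here is a minimal Python sketch. It assumes the usual CARP convention that each host advertises every advbase + advskew/256 seconds and that the live host with the shortest interval wins. The host names and the advbase/advskew values are hypothetical, and real CARP has further details (preemption, demotion) that this ignores.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CarpHost:
    name: str
    advbase: int  # base advertisement interval, in seconds
    advskew: int  # skew (0-254), adds advskew/256 seconds to the interval

    @property
    def interval(self) -> float:
        """Seconds between advertisements sent by this host."""
        return self.advbase + self.advskew / 256


def elect_master(hosts, alive):
    """Return the live host with the shortest advertisement interval, or None."""
    candidates = [h for h in hosts if h.name in alive]
    if not candidates:
        return None
    return min(candidates, key=lambda h: h.interval)


# Hypothetical values: FW1 is preferred, FW2 is the backup.
hosts = [CarpHost("FW1", advbase=1, advskew=0),
         CarpHost("FW2", advbase=1, advskew=100)]

master_normal = elect_master(hosts, alive={"FW1", "FW2"})
master_failover = elect_master(hosts, alive={"FW2"})
print(master_normal.name, master_failover.name)
```

Note that nothing in this election involves the switch: the firewalls agree on a new master among themselves, which is why the switch can be left with a stale view of where the shared MAC address lives.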

(NOTE: The VLANs are for incoming Internet links and are not part of the current CARP config. The inet link we are using for CARP is in the default VLAN. It will be moved to its own VLAN when everything is working predictably. I will be setting up CARP on those VLANs during the course of this week.)

When the CARP master becomes unavailable, the next-preferred machine becomes master. In this case there is only one, but the same principle applies. The CARP master can become unavailable for a number of reasons, including:

  • The machine is physically switched off
  • The machine crashes
  • The NIC crashes
  • The NIC is unplugged
  • The cable is broken
  • The machine enters manual maintenance mode (how we test)
  • The machine is rebooting
  • The machine is running an update, or is being used to test a configuration (the two most common reasons)

I’ve listed all of the above because the failover works nicely and quickly when the switch sees the port as being down (i.e. the NIC is not sending data). But that almost never happens. Most of the time the CARP failover happens while the machine is powered on and the NIC is transmitting at least a carrier signal. And as long as the port appears to be up, the switch sends data to the ‘dead’ machine.

The firewalls are connected back-to-back with an Ethernet cable plugged into dedicated NICs for data sync. This data sync carries all the stateful information needed for the CARP interface to take over seamlessly (even mid-transaction), and also acts as a heartbeat, of sorts.

I have determined the following:

  • FW2 detects when FW1 CARP goes down almost immediately (fast enough that my eyes are unable to see a delay, at least).
  • Traffic dumps show that the traffic is sent to the original port with nothing coming to the new port.
  • There is no arping for that mac address for long intervals.

I really do appreciate the help, and hope that something can be figured out.

I believe the SSH crypto comment is due to
/ip ssh
set allow-none-crypto=yes
as a previous RouterOS upgrade erroneously added this.

The switch itself will only be making ARP requests associated with its management interface. The traffic egress port is selected by the contents of the FDB (forwarding database), which is populated from the source MAC address of received packets. Unless FW2 transmits a packet with the shared MAC address when FW1 is inoperative, the switch will continue to send traffic to the port connecting with FW1 until the entry ages out, after around 5 minutes, at which point the traffic is flooded to all ports. If the switch is configured to perform independent VLAN learning (IVL) rather than shared VLAN learning (SVL), the firewalls will have to transmit a packet with the shared MAC address on all VLANs, not just one of them.
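The learning and ageing behaviour described here can be modelled in a few lines of Python. This is only a sketch of a generic learning switch, not of RouterOS internals: the port names, the client MAC address and the 300-second ageing time are assumptions chosen to match the roughly 5 minutes reported in the thread, and VLANs are ignored.

```python
AGEING_TIME = 300  # seconds; assumed, matches the ~5 minutes seen in the thread
BROADCAST = "ff:ff:ff:ff:ff:ff"
CARP_MAC = "00:00:5e:00:01:02"    # shared MAC from the arping command in the thread
CLIENT_MAC = "02:00:00:00:00:01"  # hypothetical host talking to the firewalls


class LearningSwitch:
    def __init__(self, ports, ageing_time=AGEING_TIME):
        self.ports = set(ports)
        self.ageing_time = ageing_time
        self.fdb = {}  # MAC address -> (port, time the MAC was last seen as a source)

    def receive(self, now, in_port, src_mac, dst_mac):
        """Learn the source MAC, then return the set of egress ports."""
        self.fdb[src_mac] = (in_port, now)
        entry = self.fdb.get(dst_mac)
        if entry is not None and now - entry[1] >= self.ageing_time:
            del self.fdb[dst_mac]  # entry has aged out
            entry = None
        if entry is None:
            return self.ports - {in_port}  # unknown destination: flood
        out_port = entry[0]
        return set() if out_port == in_port else {out_port}


def switch_before_failover():
    """FW1 is master and has sent a frame from the shared MAC at t=0."""
    sw = LearningSwitch(["fw1", "fw2", "lan"])
    sw.receive(0, "fw1", CARP_MAC, BROADCAST)
    return sw


# Case 1: FW2 takes over silently. Traffic keeps going to FW1 until ageing.
silent = switch_before_failover()
stale = silent.receive(60, "lan", CLIENT_MAC, CARP_MAC)
aged_out = silent.receive(310, "lan", CLIENT_MAC, CARP_MAC)

# Case 2: FW2 sends one frame from the shared MAC right after taking over.
announced = switch_before_failover()
announced.receive(21, "fw2", CARP_MAC, BROADCAST)
moved = announced.receive(22, "lan", CLIENT_MAC, CARP_MAC)

print(stale, aged_out, moved)
```

In the silent case the frame at t=60 still goes only to fw1, and only after the entry ages out is traffic flooded so that fw2 sees it. In the second case a single frame sourced from the shared MAC is enough to move the entry to fw2 immediately.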

As an aside, you can simplify the untagged VLAN configuration: if you set pvid= in /interface bridge port, you can leave out the untagged= entries in /interface bridge vlan, as they will be populated dynamically. This prevents a mismatch between the two sections causing a loss of communication.

TDW, you have helped me to a solution. Not the solution I wanted, but certainly one I can live with. Thank you for that.

Thanks for the SSH Crypto info, have ensured that crypto algorithms are required.

As to the ARP requests, I should have said “forwarded ARP requests”, which is how I expected it to work. Maybe I should have tested more :wink:

Your comment about FW2 needing to transmit the mac address was pure gold. I tested on the firewall with this:

arping -c 1 -s 00:00:5E:00:01:02 -p -P -U [GATEWAY IP ADDRESS]

and like magic I could connect. Now I just need to find a way to automate that on the firewall when it becomes the CARP master and we’re good.
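For anyone curious what such an announcement looks like on the wire, the Python sketch below builds one common form of unsolicited ARP reply. It is an illustration only: the exact frame that arping emits with those flags may differ, the IP address 192.0.2.1 is a documentation placeholder rather than the real gateway address, and the code only constructs the bytes. Actually sending them needs a raw socket or BPF device, which is OS-specific and left out.

```python
import socket
import struct

CARP_MAC = "00:00:5e:00:01:02"  # shared MAC from the arping command above
PLACEHOLDER_IP = "192.0.2.1"    # hypothetical; stands in for the shared gateway IP


def build_unsolicited_arp_reply(mac: str, ip: str) -> bytes:
    """Build a broadcast Ethernet frame carrying an unsolicited ARP reply."""
    mac_bytes = bytes.fromhex(mac.replace(":", ""))
    ip_bytes = socket.inet_aton(ip)
    broadcast = b"\xff" * 6

    # Ethernet header: destination, source, EtherType 0x0806 (ARP).
    # The source address is what the switch uses to relearn the port.
    ethernet = broadcast + mac_bytes + struct.pack("!H", 0x0806)

    # ARP header: hardware type 1 (Ethernet), protocol 0x0800 (IPv4),
    # hardware length 6, protocol length 4, opcode 2 (reply).
    arp = struct.pack("!HHBBH", 1, 0x0800, 6, 4, 2)
    arp += mac_bytes + ip_bytes  # sender MAC and IP
    arp += broadcast + ip_bytes  # target MAC and IP (same IP: gratuitous)
    return ethernet + arp


frame = build_unsolicited_arp_reply(CARP_MAC, PLACEHOLDER_IP)
print(len(frame), frame.hex())
```

As far as the switch is concerned, the ARP payload hardly matters. What updates the host table is the Ethernet source address arriving on the new master's port.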

I’d still prefer a less … invasive solution, if anyone has anything more to add.

VRRP transmits the shared MAC address from whichever device is currently master. I’ve not used CARP, but I would have expected it to do the same - maybe with a configuration option if not the default.

I do believe CARP requires that configuration to force the packet and then toggle the FDB, because the interface doesn’t actually go down. There is no automatic GARP (gratuitous ARP) like with VRRP, from what I could uncover. Another approach would be to turn MAC learning off, but that would flood traffic into both ports all the time, more like a hub.
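A rough Python model of the “learning off” idea follows. It describes a generic switch, not confirmed RouterOS or CRS326 behaviour; the port names and the client MAC address are hypothetical, and VLANs and ageing are left out. The point it illustrates is that a MAC address seen only on non-learning ports never enters the FDB, so frames addressed to it are always flooded.

```python
CARP_MAC = "00:00:5e:00:01:02"    # shared MAC from the thread
CLIENT_MAC = "02:00:00:00:00:01"  # hypothetical host behind the "lan" port


def forward(fdb, ports, no_learn_ports, in_port, src_mac, dst_mac):
    """Return the egress ports for one frame, learning only on allowed ports."""
    if in_port not in no_learn_ports:
        fdb[src_mac] = in_port
    out_port = fdb.get(dst_mac)
    if out_port is None:
        return set(ports) - {in_port}  # unknown destination: flood
    return set() if out_port == in_port else {out_port}


PORTS = ["fw1", "fw2", "lan", "spare"]
NO_LEARN = {"fw1", "fw2"}  # learning disabled on both firewall ports
fdb = {}

# FW1 (current master) talks, but its source MAC is not learned.
forward(fdb, PORTS, NO_LEARN, "fw1", CARP_MAC, CLIENT_MAC)

# Traffic towards the shared MAC is flooded, so both firewalls receive it.
to_carp = forward(fdb, PORTS, NO_LEARN, "lan", CLIENT_MAC, CARP_MAC)

# Return traffic is still switched normally, because the client was learned.
to_client = forward(fdb, PORTS, NO_LEARN, "fw2", CARP_MAC, CLIENT_MAC)

print(to_carp, to_client)
```

In this model the flooded frames reach every other port, including the unrelated "spare" port, not just the two firewalls. That is the hub-like side effect to weigh up.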

I’d argue that moving SSH to another port is security by obscurity. Your host will be detected and anything beyond the most passive detection will find the moved service. Moving it to 21 which is a common port as well only increases the odds. If 22 must be available to external users there are better techniques to slow down even a determined attacker. You could use bastion hosts that support more sophisticated controls and only those bastions are allowed to reach the router’s SSH port. The additional controls might be additional and stronger algorithm support as well as one or more additional factors to authenticate to it, possibly an attack detection and restriction tool like fail2ban. Remove the need to SSH to the outside entirely if possible. Basically there are better methods to harden your SSH service that provide realistic improvements than hoping someone doesn’t find out what port the service runs on.

Also as far as I’m aware RouterOS does not support ed25519 but we did get rsa-sha2-256 signature support in response to OpenSSH basically killing rsa-sha at this point. Small victories I suppose.

Responding to idlemind.


Security by obscurity is a term used when you rely on obscurity as your only real form of security. Changing the port is an additional layer of protection to an already hardened service. And as I already mentioned earlier in this thread, it drops attempted attacks down from literally thousands a day to a few a week. SSH is no less secure because the port changed, but it now faces fewer attacks. There is no rational argument in which that is a bad thing.


Moving it to 21 which is a common port as well only increases the odds.

In my response, which you quote, I say clearly

Running it on a non standard port (I don’t really run it on 21)

It seems you responded without actually reading what I said. While I appreciate all help and suggestions, suggestions based on half-read problems have the potential to create more problems than they solve.


If 22 must be available to external users there are better techniques to slow down even a determined attacker. You could use bastion hosts that support more sophisticated controls and only those bastions are allowed to reach the router’s SSH port. The additional controls might be additional and stronger algorithm support as well as one or more additional factors to authenticate to it, possibly an attack detection and restriction tool like fail2ban. Remove the need to SSH to the outside entirely if possible. Basically there are better methods to harden your SSH service that provide realistic improvements than hoping someone doesn’t find out what port the service runs on.

By bastions I suspect you mean firewall type machines that protect the routerboard from the Internet. Kind of like the two firewalls mentioned in my response, which you quoted. You also read the config I posted earlier and saw that the switch has no public IP addresses, and hence no public facing SSH ports.

There is also, to the best of my knowledge, no official or working fail2ban implementation for RouterOS, and even if there were it would chew up valuable resources needed by a multi-gigabit switch (and as I have stated numerous times, this is PRIMARILY a switch). HOWEVER, changing the port does increase security without softening SSH, and does NOT have any performance impact.


Also as far as I’m aware RouterOS does not support ed25519 but we did get rsa-sha2-256 signature support in response to OpenSSH basically killing rsa-sha at this point. Small victories I suppose.

You are correct, thanks for pointing that out. I will modify the policy allowing an exception for using RSA 8192 keys where ed25519 is not available.

I’d like to stress, again, that I appreciate all the help I can get for resolving my problem. However, this post was tangential and only addressed a problem that wasn’t a problem, and attempted to do so while ignoring most of the salient facts presented, finally suggesting solutions that were already in place.

I’d like to address the idea of changing the SSH port quickly.

On resource constrained devices, changing the SSH port to something above 1024 adds security without creating any security holes, assuming you update any existing security measures that may be available to accommodate the new port. A security plus with no minus.

On machines and devices that are not resource constrained and are able to run arbitrary security enhancing tools, changing the port to something above 1024 STILL offers some additional protection at no cost (update other tools as needed, of course). A plus with no minus. Additionally, you can then configure your own script, or something like fail2ban, to IMMEDIATELY take protective action against any source connecting to port 22, as there is no legitimate reason for doing so. (Someone may do it by mistake, but this will act as a rather effective lesson in being careful what you do when managing an Internet connected server.) Another plus with no (real) minus.

The only reason to run SSH on port 22 is if your ISP does some sort of traffic management prioritising port 22 traffic, or if you are unable to change it due to permission or technical limitations.

If you disagree, consider three simple questions:

  • If a serious ssh flaw is discovered, who are MORE LIKELY to be hit first - those running ssh on port 22, or those running on some other port? Who are MORE LIKELY to have more time to realise there is a problem and mitigate it - those running ssh on port 22, or those running on some other port?
  • If your password has been somehow compromised (anonymously) or is weak, when are you MORE LIKELY to be compromised: If your machine gets thousands of hits a day on port 22, or just a few a week on port xxxx?
  • When should you take attacks more seriously - when there are thousands of anonymous attacks per day, or when there are a few targeted attacks per week? Which is MORE LIKELY to need attention? Which is MORE LIKELY to even be noticed?

Since SSH is stable and secure, bugs and bad passwords are the only two viable attack vectors and your exposure to both is mitigated by changing your SSH port. So calling a changed SSH port “security by obscurity” is wrong, misleading, and potentially dangerous.

Feel free to disagree, but please motivate your argument if you do so.

That’s not a bad idea at all. Because the machines are in failover mode, there would be no performance hit to both getting the same traffic all the time. Additionally, as long as traffic from one VLAN doesn’t bleed over into another with MAC learning off, there would be no other side effects.

Idlemind, are you able to answer a few questions?

  • How do I turn MAC learning off? Is it as simple as clicking on a bridge port and changing the ‘learn’ value to no?
  • Would it then forward ALL traffic for ALL MAC addresses, or only traffic for MAC addresses not claimed by other ports?
  • How would this impact the performance of the switch? Do you know? Obviously it will place a little extra load, but would traffic still be switched in dedicated hardware, or would it have to be switched in CPU? (I’m guessing it would be in hardware)
  • Is there anything else you can think of that I should consider?

I can’t do anything on the switch during the day, which is why I’m asking. I’ll be able to test in about 5 hours, but I’d still appreciate answers to corroborate or correct anything I find.

Thanks,

Jason

EDIT: I thought I had the same issue, but it turned out to be a wrong rule on the firewall, which I discovered only when changing the switch.