Host based load balancing and failover
Posted: Wed Jul 24, 2024 10:04 am
Here is a method to provide load balancing and/or failover to servers with a MikroTik router
There are alternative (and better) methods to achieve this where possible, such as Keepalived, DNS, NLB or Failover Clustering. However, I had a niche case where none of these were suitable and I did not want to add additional software or VMs to handle it. So perhaps this will help someone who won't or can't use additional software and just wants a simple method to steer traffic via a MikroTik router to an available host.
In a nutshell, this NATs traffic to one or more hosts, using PCC (per connection classifier) to load balance and using route distance in a separate routing table to select which hosts are active. A script runs periodically to update the NAT table dynamically.
Example situation where we want to have load balancing (with failover) to SMTP and RADIUS servers
We want users to be able to send to any available SMTP server, and for whatever reason they can only use a single destination IP address. This can be a real IP (10.1.0.101) or a virtual IP such as 10.1.0.200. Those SMTP servers in turn use RADIUS for authentication and in our scenario are again limited to a single destination IP address. We want to make use of the available resources with load balancing. We also have a Pentium 133 MHz with 16 MB of RAM that is still running critical business services to this day. Middle management absolutely refuses to outlay any cash for a replacement, so we have decided to throw that into the mix as a last-resort failover if the 2 primary servers die.
Start by creating the routing tables/marks. If you have the same destination IP address/server for multiple services then you only need to use a single routing mark. Otherwise create 1 table/mark per failover group
The distance value is used to determine failover/redundancy. Hosts with the same distance value will be load balanced via PCC based on source address, so the same source host will always use the same destination server. If you don't want active-active load balancing and instead want active-backup failover, simply set a different distance value for every host. All hosts with a lower value must be unreachable before a host with a higher distance is elected. You can also create staggered load balancing groups, for example 2 hosts with distance=101 and 2 hosts with distance=102.
For RouterOS v7
Code:
/routing table add disabled=no fib name=RADIUSFailover
/routing table add disabled=no fib name=SMTPFailover
/ip route add check-gateway=ping disabled=no distance=101 dst-address=10.1.0.201/32 gateway=10.1.0.201 routing-table=RADIUSFailover
/ip route add check-gateway=ping disabled=no distance=101 dst-address=10.1.0.202/32 gateway=10.1.0.202 routing-table=RADIUSFailover
/ip route add check-gateway=ping disabled=no distance=101 dst-address=10.1.0.101/32 gateway=10.1.0.101 routing-table=SMTPFailover
/ip route add check-gateway=ping disabled=no distance=101 dst-address=10.1.0.102/32 gateway=10.1.0.102 routing-table=SMTPFailover
/ip route add check-gateway=ping disabled=no distance=102 dst-address=10.1.0.103/32 gateway=10.1.0.103 routing-table=SMTPFailover
For RouterOS v6
Code:
/ip route add check-gateway=ping disabled=no distance=101 dst-address=10.1.0.201/32 gateway=10.1.0.201 routing-mark=RADIUSFailover
/ip route add check-gateway=ping disabled=no distance=101 dst-address=10.1.0.202/32 gateway=10.1.0.202 routing-mark=RADIUSFailover
/ip route add check-gateway=ping disabled=no distance=101 dst-address=10.1.0.101/32 gateway=10.1.0.101 routing-mark=SMTPFailover
/ip route add check-gateway=ping disabled=no distance=101 dst-address=10.1.0.102/32 gateway=10.1.0.102 routing-mark=SMTPFailover
/ip route add check-gateway=ping disabled=no distance=102 dst-address=10.1.0.103/32 gateway=10.1.0.103 routing-mark=SMTPFailover
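To see which hosts the ping check currently considers alive, the routes in each failover table can be listed. This is only a quick sanity check using the table names from the example above. Routes that are in use are flagged with A (active) in the output. On RouterOS v6 the filter property should be routing-mark rather than routing-table.

```
# RouterOS v7: list the routes in each failover table
/ip route print where routing-table=RADIUSFailover
/ip route print where routing-table=SMTPFailover
```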
Then we need to add mangle rules to steer traffic to the failover group. You can use a real IP address (such as public WAN IP) but in this case we will use a single virtual IP of 10.1.0.200 for handling both SMTP and RADIUS, so we need 2 rules to separate the traffic out to different groups
Code:
/ip firewall mangle add action=mark-routing chain=prerouting comment="Failover Host Routing - RADIUS" dst-address=10.1.0.200 dst-port=1812,1813 new-routing-mark=RADIUSFailover passthrough=yes protocol=udp
/ip firewall mangle add action=mark-routing chain=prerouting comment="Failover Host Routing - SMTP" dst-address=10.1.0.200 dst-port=25 new-routing-mark=SMTPFailover passthrough=yes protocol=tcp
We may optionally need to add a source-NAT rule to handle the return traffic. Whether it is needed depends on the application. RADIUS should not care, but without the rule a client may discard the reply, because it tried to open a connection to 10.1.0.200 and instead receives a response from 10.1.0.201. The rule should not be needed if you are connecting to the public WAN IP from outside the network, as the traffic will be NAT'd back anyway.
Code:
/ip firewall nat add action=src-nat chain=srcnat comment="SrcNAT Host Failover Traffic" routing-mark=RADIUSFailover to-addresses=10.1.0.200
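The rule above only matches the RADIUSFailover mark. If SMTP clients sit on the same network as the SMTP servers and show the same symptom, a matching rule for the second group would presumably be needed too. A sketch that simply mirrors the rule above (the comment text is arbitrary):

```
# Same idea as the RADIUS rule, but matching the SMTPFailover routing mark
/ip firewall nat add action=src-nat chain=srcnat comment="SrcNAT Host Failover Traffic - SMTP" routing-mark=SMTPFailover to-addresses=10.1.0.200
```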
Finally we have the script that handles the dynamic NAT and PCC. Copy the below into a Script and then create a scheduler that runs every minute or so and executes the script (or just paste the script directly into the scheduler). Modify the top 2 lines to match the virtual IP in use, as well as the routing-mark used
Create multiple copies of this script and edit the values as necessary
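As a sketch of the scheduler step, assuming the script was saved under the hypothetical name HostFailoverRADIUS, an entry that runs it every minute could look like this:

```
# "HostFailoverRADIUS" is a placeholder; use whatever name you gave the script
/system scheduler add name=HostFailoverRADIUS-check interval=1m on-event="/system script run HostFailoverRADIUS"
```

A shorter interval means faster failover at the cost of running the script more often.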
The script relies on MikroTik's built-in ping check of hosts (check-gateway=ping). If a host becomes unreachable in the routing table, the script deletes and recreates the NAT rules. It does not check that the actual service is reachable, so if the service crashes but the host is still pingable, connectivity will be broken. It's possible to modify the script to return actual service status via telnet.
Code:
:local i 10.1.0.200 ; ### Set this as the virtual IP to be used for host failover
:local r "RADIUSFailover" ; ### Set this to the name of the routing mark/table to be used
### ---- DO NOT MODIFY BELOW THIS LINE --- ###
# Create a unique global variable for the host
:local n ("DstHost".[:convert $i to=hex])
:execute ":global $n"
# Get active destination entries in routing table, then convert to Gateway IPs in an array
:local t [/ip route find where routing-table=$r active]
:local x ; :foreach y in=$t do={:set $x ($x,[/ip route get $y gateway])}
# If there is at least 1 active host, compare/create NAT rules with PCC to load balance amongst them. Else remove rules
:if ([:len $x] > 0) do={
# If valid hosts have changed, recreate NAT rules
:if ($x != [/system script environment get [find where name="$n"] value]) do={
[/ip firewall nat remove [find where comment~"^### Failover Host Routing for $i.*"]]
:foreach v,t in=$x do={/ip firewall nat add chain=dstnat routing-mark=$r dst-address=$i per-connection-classifier=("src-address:".[:len $x]."/".[$v]) action=dst-nat to-addresses=$t comment="### Failover Host Routing for $i -> $t"};
}
} else={[/ip firewall nat remove [find where comment~"^### Failover Host Routing for $i.*"]]};
# Workaround to update IP's in dynamic global variable name
:local tempval
:foreach t in=$x do={:set tempval ("$tempval,$t")}
:if ($tempval ~ "^,") do={:set tempval [:pick $tempval 1 [:len $tempval]]}
:execute ":set $n $tempval"
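To make the PCC part concrete: with both RADIUS hosts reachable, the :foreach loop above should end up creating rules roughly equivalent to the following. This is a hand-written illustration of the expected result, not captured output, and it assumes the array keys come out as 0 and 1.

```
# Two active hosts: the classifier splits sources into 2 buckets (remainder 0 and 1)
/ip firewall nat add chain=dstnat routing-mark=RADIUSFailover dst-address=10.1.0.200 per-connection-classifier=src-address:2/0 action=dst-nat to-addresses=10.1.0.201 comment="### Failover Host Routing for 10.1.0.200 -> 10.1.0.201"
/ip firewall nat add chain=dstnat routing-mark=RADIUSFailover dst-address=10.1.0.200 per-connection-classifier=src-address:2/1 action=dst-nat to-addresses=10.1.0.202 comment="### Failover Host Routing for 10.1.0.200 -> 10.1.0.202"

# List whatever rules the script has currently created for this virtual IP
/ip firewall nat print where comment~"Failover Host Routing for 10.1.0.200"
```

If 10.1.0.202 stops answering pings, the next run should remove both rules and add a single one with src-address:1/0 pointing at 10.1.0.201.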