I am a Network Engineer managing a WISP setup in a remote Alaskan tribal village. Last week I was on-site and replaced our middle-mile Internet connection from OneWeb, which was very costly and underperforming, with 10 high-performance Starlink terminals. I have set the IP policy to public IP on all the terminals and am using a DHCP client on each of the 10 WAN interfaces on our MikroTik router, configured to automatically add the default routes, since Starlink does not offer true static IP addresses and we have no other ISP options available in this very remote area. I see it automatically set up ECMP load balancing, and that appears to be working: I can see traffic being distributed across all 10 WAN interfaces, and data usage also appears almost equal on each terminal in the Starlink portal.
The problem we are now facing is that this setup broke the secure connection between the MikroTik and our Sonar billing instance, which controls the LAN DHCP server. Our L2TP/IPsec VPN no longer works. I figure this is because traffic is routed out one WAN interface and comes back through a different WAN interface, so the source IP keeps changing.
I’ve done a bit of research on this and also watched a few YouTube videos of MikroTik engineers setting up MultiWAN scenarios using PCC with routing marks and separate routing tables for each WAN interface. The difference is that in the videos only 2-3 WAN connections were used and the ISPs were different. I have not been as successful getting this to work, as all of the IP addresses Starlink has assigned are in the same subnet and have the same gateway. I know using 10 Starlinks might seem a bit extreme but that was not my decision.
I would like to ask for some guidance in getting this setup to work and fix our secure connections so the village can begin billing customers. Our company is also willing to consider hiring a MikroTik expert to work with us on a consultancy basis to get this to work if there is someone on here available and has proven experience with this scenario. I do consider myself knowledgeable with MikroTik and have formal education in Networking but will admit this is my first attempt at setting up something of this magnitude. I am a quick learner and love all things MikroTik and Networking related so I am willing to listen and learn from anyone on here who knows more about this than I do.
I am attaching a copy of our config with sensitive info removed as well as screenshots of our System > Resources and IP > Routes for reference.
Okay you have other issues.
Your way of separating users and use of vlans and bridge is very confusing.
You have only one pool but then many addresses, what's going on?
Looks like you should have 3 vlans, subscribers group1, subscribers group2 and servers.
One bridge-lan is fine, assign the three vlans (2,4,6) to the bridge, etc.
3 pools, 3 ip dhcp servers and 3 ip dhcp server networks to go along with three IP addresses with interface being the applicable vlan!!!
Then MANGLE rules, have no clue what your mangle rules are doing???
At a minimum for PCC you will need:
- 10 rules to mark packets coming from the LAN sources that require PCC.
- 10 rules to mark connections for those packets.
Example:
Create one PCC mangle rule for each WAN connection:
/ip firewall mangle
add action=mark-connection chain=prerouting connection-mark=no-mark dst-address-type=!local in-interface-list=NEED_PCC new-connection-mark=to_wan1 passthrough=yes per-connection-classifier=both-addresses:10/0
…
…
add action=mark-connection chain=prerouting connection-mark=no-mark dst-address-type=!local in-interface-list=NEED_PCC new-connection-mark=to_wan10 passthrough=yes per-connection-classifier=both-addresses:10/9
Then an additional 10 routes for the PCC-marked traffic:
/ip route
add dst-address=0.0.0.0/0 gateway=98.97.96.1%sfp28-1-wan1 routing-table=to_wan1
…
…
add dst-address=0.0.0.0/0 gateway=98.97.96.1%sfp28-10-wan10 routing-table=to_wan10
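For the marked connections to actually follow those routes, each routing table has to exist and a second mangle rule has to turn the connection mark into a routing mark. A minimal sketch for WAN1 only, assuming RouterOS 7 and reusing the name to_wan1 for the connection mark, the routing mark and the table (the other nine would follow the same pattern). It is an illustration, not a tested configuration:

```routeros
# RouterOS 7: a routing table must be declared before routes or marks can reference it
/routing table
add name=to_wan1 fib

# Translate the connection mark set by the PCC rule into a routing mark,
# so that packets of that connection are looked up in table to_wan1
/ip firewall mangle
add action=mark-routing chain=prerouting connection-mark=to_wan1 in-interface-list=NEED_PCC new-routing-mark=to_wan1 passthrough=no
```

The mark-routing rules would normally sit after all the mark-connection rules, so a new connection is classified first and routed second.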
That's very basic. If you have traffic arriving externally on a WAN then it needs to go out the same WAN, and we will need additional mangle rules for that.
If you have some traffic originating on the router that should not be PCC'd but needs to go out a specific WAN, that needs to be communicated and accounted for.
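As a rough sketch of the "in via a WAN, out via the same WAN" rules for one uplink, assuming the interface name sfp28-1-wan1 and the to_wan1 mark and table names from the example above (both rules would be repeated per WAN):

```routeros
/ip firewall mangle
# Remember which WAN an externally initiated connection arrived on
add action=mark-connection chain=prerouting connection-mark=no-mark in-interface=sfp28-1-wan1 new-connection-mark=to_wan1 passthrough=yes
# Replies generated by the router itself leave through that same WAN
add action=mark-routing chain=output connection-mark=to_wan1 new-routing-mark=to_wan1 passthrough=no
```

Replies from a LAN server behind a dst-nat rule enter through prerouting on the LAN side, so they would also need a mark-routing rule in chain prerouting that matches connection-mark=to_wan1 on the LAN interfaces.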
Finally you have to ask yourself the question: what happens when one of the WAN connections is NOT available?
The associated LAN traffic assigned by PCC to that route will be screwed and will not pass.
You have two options… Option A: assign additional routes
FOR WAN1 - if 1 fails go to 2, then 3, then 4, then 5 etc…
FOR WAN2 - if 2 fails go to 1, then 3, then 4, then 5 etc…
Within each TABLE, the sub-routes are differentiated by distance…
Example: take the to_wan1 route, give it a distance of 2 and then add 9 more routes, so the first and last would look like:
add dst-address=0.0.0.0/0 gateway=98.97.96.1%sfp28-1-wan1 routing-table=to_wan1 distance=2
…
…
add dst-address=0.0.0.0/0 gateway=98.97.96.1%sfp28-10-wan10 routing-table=to_wan1 distance=11
The problem with this approach is that you don't spread the load out to the other WANs; you pass a failure on to the next route in the TABLE and thus to one other WAN.
The positive side is that this is much easier to configure than option B, and that's 100 route lines, easy and manageable LOL.
Option B consists of spreading out the PCCs to all the other WANs, and it's an exploding matrix considering we are starting with 10 WANs LOL.
What we are doing is taking the 1/10 share allotted to each WAN and dividing its responsibility among the 9 other WANs. Each table/each PCC gets 1/90 of the traffic, so when one WAN fails we divvy up its 1/10 share by giving the remaining WANs 1/90th each: 9/90 = 1/10.
10 wans x 9 alternative paths = 90 PCCs
AthenB PCC 90/0
AthenC
AthenD
AthenE
AthenF
AthenG
AthenH
AthenI
AthenJ PCC 90/8
…through to the last one:
JthenA PCC 90/81
JthenB
JthenC
JthenD
JthenE
JthenF
JthenG
JthenH
JthenI PCC 90/89
Thus, considering each one has two routes, that's what, approx 180 routes total…
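To make one cell of that matrix concrete, here is a sketch of the first of the 90 buckets. It assumes A is WAN1 (sfp28-1-wan1) and B is WAN2 (sfp28-2-wan2), uses AthenB as the name for the connection mark, routing mark and table, and borrows the gateway address from the earlier example; none of this has been tested on this network:

```routeros
/routing table
add name=AthenB fib

/ip firewall mangle
# Bucket 0 of 90: prefers WAN A, falls back to WAN B
add action=mark-connection chain=prerouting connection-mark=no-mark dst-address-type=!local in-interface-list=NEED_PCC new-connection-mark=AthenB passthrough=yes per-connection-classifier=both-addresses:90/0
add action=mark-routing chain=prerouting connection-mark=AthenB in-interface-list=NEED_PCC new-routing-mark=AthenB passthrough=no

/ip route
# Primary and backup route in the same table, differentiated by distance
add dst-address=0.0.0.0/0 gateway=98.97.96.1%sfp28-1-wan1 routing-table=AthenB distance=1
add dst-address=0.0.0.0/0 gateway=98.97.96.1%sfp28-2-wan2 routing-table=AthenB distance=2
```

The backup route only takes over once the primary one becomes inactive, which happens when its interface goes down or when a gateway check (for example check-gateway=ping on the route) fails.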
Thank you very much for responding. I really do appreciate it. I admit I could have done better at explaining the network setup which I will attempt to do now and I apologize in advance for the lengthy post. I tend to over explain and add extra detail in hopes of others better understanding our issues.
Our company took over management of this network about a year ago. The internal network was designed by another network engineer and I do agree and want to change the VLANs. When we first took over, an Adtran NetVanta router was in place but I switched it out for the MikroTik due to our billing platform (Sonar) having some features we needed which required the MikroTik to implement as it’s the only router platform that Sonar integrates with. Not to mention that I love MikroTik and Adtran GUI is crap.
We originally only had one WAN connection to a Satellite middle-mile provider (OneWeb) which is garbage for the price they were charging the village. Over $36,000/month for a 100Mbps connection and trying to share that with approx 100+ customers who connect via Cambium cnRanger LTE platform and many who thought they could stream Netflix and also game on that connection. Latency was very high (about 500-600ms consistently) and the link would constantly drop. The speeds we were getting when the connection was up made dial-up internet seem like broadband. I couldn’t even send a single picture via iMessage on my phone over WiFi without it dropping. Needless to say, it was terrible.
The Tribal Council decided to dump OneWeb and purchase 10 high priority Starlink subscriptions and 10 terminals on their own, thinking they would get 10x the speed by bonding them all together, but later found out that it wouldn’t work that way. They were also told by Starlink they would get static IP addresses, which they also learned wasn’t true. Now we have the situation we are in, with 10 WANs and 10 dynamic IP addresses which Starlink has said are DHCP reserved and won’t change unless they have major network maintenance where they would need to fail over to different ground stations.
For the LAN network, the previous engineer designed it as follows:
VLAN 2: Subscriber network consisting of the two /23 subnets below:
The bridge interface is assigned the sfp28-12 port which handles both of the VLANs and trunks to a port on an Adtran Netvanta switch that connects to all the servers and Cambium LTE equipment to the customers. The DHCP server is also running on this bridge interface and handles only the 10.130.2.0/23 subnet. All of that is working with customers getting leases and able to browse the internet but again I agree it could be setup better.
10.130.2.0/23 is for customer routers inside their homes and is assigned via DHCP Pool setup in Sonar which communicates with the Mikrotik to assign static leases.
10.130.4.0/23 assigned statically to customer subscriber modules (dishes) that are mounted on their homes. I chose static addresses due to limitations with the Cambium cnRanger platform. There is no option to configure external DHCP server per Cambium Support and no future plans to add that so addresses are assigned statically to the LTE SIM cards. There is another way to set this up and I have documentation from Sonar to do this which I am working on. Our village techs are not networking professionals so my goal is to simplify it so all they need to do is install the equipment, assign the equipment to the customer account in Sonar and it does everything else automatically. The less phone calls I get from the techs, the better since this is the 1st of 14 villages we are working on.
The current mangle rules were set up using Sonar documentation. These rules also use queues and address lists for the data packages that are currently offered, and they work with Sonar and a Preseem QoE appliance to control speeds and data caps, so I do understand why those are confusing. They were set up back when we still had only one WAN connection and no load balancing.
The network itself has been working, but now with the 10 Starlinks it sounds like I need to change my default routes and add the mangle rules that you described for PCC to work. A question I have: should I disable the setting in each WAN DHCP client that automatically adds the default routes, or leave it enabled and still add the default routes that you described? My gut is telling me to disable it, but I figured I would ask anyway to be safe since I am roughly 2000 miles away and any mess-up is a potential flight to the village during the winter. Safe mode is nice but sometimes I forget to turn it on.
Regarding the failover issue, I’m going to need to think about that. Before changing to the 10 Starlinks, I did have one Starlink acting as a failover for the OneWeb, and it was set up recursively using a video I found on YouTube from a MikroTik expert who calls himself “The Network Berg”. This worked surprisingly well and was easy to implement with only 2 WANs. The only problem was that the secondary connection stayed down until a failover occurred, which is not what I want in this new setup.
It’s looking like I have lots of work to do, and again I thank you very much for pointing me in the right direction. I think I have a better understanding, but I’m going to review all of this with our chief engineer and begin implementing it. I’ll let you know how it goes. If after reading this novel of a post you find more that needs to be done, do let me know.
No worries Anav, What you provided is a giant step in the right direction. I built out all the default routes and mangle rules but kept them disabled until I can have a village tech at our tower site to assist remotely should anything go wrong. I do plan to test this today to see if it all works. Another question I have that maybe you could answer. The NEED_PCC interface list that is referenced in the mangle rule. I created that list and need to know if I put all the WAN interfaces in that list? It makes sense that I would but wanted to get your thoughts.
Hi Jaysen, fair question as I had not defined what that was anywhere.
Since it was not clear to me which LAN-side entities were getting PCC'd, so to speak, I left it as a new interface list. It has nothing to do with the WANs…
So once you define what subnets will be included in the PCC you can add them to that interface list.
There may be some LAN subnets going out a totally different WAN (LTE etc.) or you want them to go out a specific WAN regardless (Starlink 5 for example), and these would NOT be included in the interface list.
If you have some users that are in a subnet that is getting PCCd but you dont want them PCC then you will need to also create a firewall address list of those users.
If you have some users that are not in a subnet that is getting PCCd but you want them PCCd then you will need to also create a firewall address list.
In other words, put all subnets that will be shared amongst the WANs in one interface list and ensure you create firewall address lists for user exceptions.
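A sketch of how that could look, with hypothetical names throughout: vlan2-subscribers stands in for whichever LAN interface carries a PCC'd subnet, NO_PCC is an invented address-list name, and 10.130.2.50 is just an example host inside the 10.130.2.0/23 subscriber range:

```routeros
/interface list
add name=NEED_PCC
/interface list member
# LAN-side interface whose traffic should be shared among the WANs (hypothetical name)
add list=NEED_PCC interface=vlan2-subscribers

/ip firewall address-list
# A user inside a PCC'd subnet that should be exempt from PCC (example address)
add list=NO_PCC address=10.130.2.50

/ip firewall mangle
# Same PCC rule as before, with the exception list excluded
add action=mark-connection chain=prerouting connection-mark=no-mark dst-address-type=!local in-interface-list=NEED_PCC src-address-list=!NO_PCC new-connection-mark=to_wan1 passthrough=yes per-connection-classifier=both-addresses:10/0
```

The opposite exception (users outside the PCC'd subnets who should still be PCC'd) would need its own rules matching a src-address-list instead of the interface list.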
Feel free to share your mangle rules and routes for viewing … use code block to keep it short. ( the black square with white square brackets on the same line as Bold and Underline etc.)
So I finally got to reading through it, and I am trying to put the bits together.
You wrote that the public IP addresses assigned by Starlink with the high priority subscription were changing but now they are not, but if I get it right, the L2TP/IPsec tunnel still keeps disconnecting (or is not connecting at all)?
Second, I cannot see any L2TP configuration in the export nor a dst-nat rule for IPsec ports, so I figure the L2TP/IPsec client is running in Alaska on some device connected to the 2216’s LAN and connecting to a server at your HQ location, is that correct?
ECMP used to be working on RouterOS 6 where a routing cache existed, so if a connection started using a particular gateway, it continued using it until the routing cache was flushed. I’m not sure whether ECMP can be used on RouterOS 7, which uses a kernel that does not support routing cache any more, when src-nat is engaged, because I don’t know how a Starlink terminal treats packets with a wrong source address. But since you encounter problems, I would first dig in this direction. The thing is that the firewall rules assign the “external” address of a src-nated connection (officially, the reply-dst-address) once for good while handling the initial packet of the connection. So if ECMP sends subsequent packets belonging to the same connection via another gateway, those packets will leave with an address that does not belong to the out-interface, so if the Starlink network is picky about it, they will not reach the destination.
If my assumptions above regarding the L2TP/IPsec topology, Starlink behavior, and ECMP behavior in ROS 7 are correct, you need to make sure that the L2TP/IPsec connection keeps using the same WAN all the time; to do that, you need to make it use a dedicated routing table that prefers a particular WAN and only uses another one if the primary one fails, so that the tunnel could re-establish.
Before proceeding, please confirm or correct my understanding.
I have built the routes and the mangle rules as you explained in your first reply. I’ve left them disabled for now until I know they are correct. I am putting them in code blocks below for you to take a look at and make any suggestions.
I created the 10 mangle rules for connection marks below. You had said I also need 10 mangle rules to mark the packets. Would you be able to provide some guidance on how to setup the packet marks? I didn’t see that in your post. Thank you kindly. The other mangle rules in there are for controlling speeds in Sonar and were created using their documentation.
The poster suggests changing passthrough=yes to passthrough=no. I also currently have only one srcnat masquerade rule to a WAN interface list that has all my WAN interfaces in it. Will that work or should I have a separate masquerade rule for each WAN interface?
Back to my question about the NEED_PCC list. It’s my understanding that PCC will solve my issue with secure connections including our L2TP/IPSEC VPN not working. Customers are currently not able to visit HTTPS sites without getting privacy warnings as they have reported to me and our VPN client on the MikroTik router is unable to connect to office VPN server in Oregon when it used to work when we only had 1 WAN connection. Am I correct in my thinking that all my LAN subnets will need PCC and therefore I should add them all to that list? I would think all customers should get PCCd since pretty much everything we do online these days is secure and most modern browsers default to it. Thoughts on this?
ECMP works fine in V7: it relies on connection tracking to store the routing decision for future packets. I’m not sure that’s the problem here. I suspect it's just that the variable speed of Starlink, as the sats move, is going to be the same across all 10 terminals (e.g. if one is slow, all will likely be slow).
The LTE being 600ms would imply the cell networks backhaul is using GEO sat, which at a full transponder would be ~50-100Mb capacity range.
Regarding the Starlink public IP addresses. We were told by Starlink that they do not offer static IP addresses. The public addresses are assigned by DHCP but they said they are static leases or “sticky” as they called it. The only time they could change is if they had to failover our connections to another ground station for whatever reason. Otherwise, they said the IP addresses won’t change. I hope that clears that up. Sorry if it was confusing when I first explained it.
The L2TP client is running directly on the MikroTik router in Alaska and it connects to another MikroTik router at our HQ in Oregon running the L2TP server. That was working great prior to changing over to the 10 Starlinks. It worked with our previous OneWeb connection and also with just 1 Starlink after we got rid of OneWeb. We had been operating on 1 Starlink for the past few months temporarily, until last week when we made the switch to 10 priority dishes, which broke it. The logs are now showing it trying to connect, but it always fails, then retries, then fails again. It never does successfully connect.
You are correct, we have no dst-nat rule for L2TP setup and currently no filter rules at this time that would block any incoming connections. I decided to keep it open for now while I am trying to get this load balancing and PCC working. Then I will go back and work on securing it. I do have a few dst-nat rules that point to some servers on the local server subnet but I would like to get rid of those once the VPN is working again.
Thanks. I was wondering about this for some time already.
That was the previous satellite operator, and I can easily imagine they haven’t got any LEO satellites that high to the north. Starlink has a few satellites on polar orbits specially to cover these areas.
Then something must have gone wrong in the process of posting the configuration, because I can see no /interface l2tp-client section there.
There are also multiple routing tables but just two static routes in the configuration.
If the L2TP client is running directly on the 2216, no dst-nat rule is necessary; it’s just that the absence of the client configuration in the export made me think it is running on another device, so I was thinking of possible topologies (an L2TP server in LAN in Alaska would require a dst-nat rule).
Sorry, I ran out of today here, so I’ll be back online in 8-10 hours from now.
I had removed some sensitive data in the config before posting it. That’s likely why it wasn’t in there. The original config contained complete address lists with customer info in them so I removed all that. Here is the most recent config with address list data removed.
OK. Now let me clarify some points that may not be obvious.
First a disclaimer - I am aware that you have inherited most of the current configuration from the previous administrator, so if I occasionally say “you use”, it is just an abbreviation of “I can see in the configuration”.
This point may seem a cosmetic one, but it needs to be clarified to avoid issues in future and to explain the background of some suggestions. @anav has given you a template configuration that uses connection marks as a basis for assignment of routing marks; the “pre-anav” configuration uses connection marks as a basis to assign packet marks to be further used to choose QoS queues. Unfortunately, Mikrotik’s wrapper to netfilter allows at most one connection mark per connection; luckily, the way the connection marks are used for QoS purposes in the pre-anav configuration is clearly a consequence of copy-pasting some template without really understanding it, and it can be redone to work the same without use of the connection marks. Another positive point is that in your particular setup, you probably don’t need to use the connection marks for routing either.
The reason why you have to set the gateway parameters of the routes in the form ip.add.re.ss%interface-name is that even if the Starlink network accepts packets sent with a “wrong” source address through a particular terminal, I suppose the bandwidth limitations apply per terminal, not per source address (but only X can confirm this). Since all leases from the same /21 use the same address of the gateway, setting the gateway parameter to IP address alone (which is what the DHCP client normally does) is not sufficient for the routing to choose a particular interface.
However, the screenshot you have posted shows that your WANs get their own addresses, and therefore also gateways, from at least two different subnets. So although the leases are “long-term stable”, we cannot risk that the gateway addresses in the manually configured routes would stop matching the real gateway addresses once the leases change, so we need to use a lease script to update the route configuration, not only in the dedicated routing tables but, while we're at it, also in the default routing table (called main). It’s RouterOS 7 so I’ll have to do some tests before giving you a script to apply - I am well aware of those 2000 miles.
So here’s how to make the DHCP clients add/modify the routes the necessary way.
First, copy-paste the following script to the command line window of the router. The exported form looks awful, but it is not invoked until you change other things in the configuration, so you can paste the creation script and then open the created script named lease-script in the GUI or command line editor to see it in a more user-friendly way.
Once the script is added and you make yourself comfortable with what it does, you can test the following steps for one of the DHCP clients (choose the N in the commands as required):
/ip/dhcp-client/set [find where interface=sfp28-N-wanN] script=lease-script add-default-route=no
/ip/dhcp-client/release [find where interface=sfp28-N-wanN]
In the last configuration you’ve posted, the statically configured routes are already present even in table main, so they will get only updated, not added (unless you remove them before). But as the dynamically added one will be removed due to the change of add-default-route to no, you’ll have to enable the disabled static route in main.
There may be a catch - you have renamed the sfp28-N interfaces to sfp28-N-wanN, and I had some fun with that when testing it here. I have created and assigned the dhcp script while the interface had a custom name, but then decided to reset the custom name back to the default one. Nevertheless, the script kept using the old custom one; deleting and recreating the dhcp client was not enough to sort that out, but disabling and re-enabling the interface did. Any questions that might arise are not to me, thank you.
If you can see the expected outcome, which is to have a route via sfp28-N-wanN with the correct gateway address in the configuration as a static one, you can do the same for one more DHCP client. My plan is to let the L2TP client use those two before eventually extending the approach to all of them.
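One way to check that outcome from the command line, sketched here for WAN 9 (the interface name sfp28-9-wan9 is only an example of the sfp28-N-wanN pattern):

```routeros
# Show the DHCP client on the test interface, including the leased address and gateway
/ip/dhcp-client/print detail where interface=sfp28-9-wan9

# List the default routes; the one via sfp28-9-wan9 should be a static route
# whose gateway matches the gateway reported by the DHCP client above
/ip/route/print detail where dst-address=0.0.0.0/0
```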
All the above is still just an intermediary step. My feeling is that with all uplinks served by the same satellite constellation, most of the outages will be affected by poor satellite visibility and will thus affect all of them, yet still we’ll have to add means to notice and stop using a broken terminal, but that’s for later.
Maybe consider adding some monitoring of the Starlink performance, which is kinda annoying since it uses gRPC, not SNMP. I know there is a Starlink plugin for Prometheus, but if you have some other NMS somewhere… I imagine there are plugins for Starlink. The terminal’s gRPC data includes stuff like max speed, # of sats, etc. If you know that, it will be easier to find where the fault lies down the road.
All the Starlinks I happen to deal with are remote ones so I never delved into monitoring of the terminal, assuming that the owner’s app shows you something. What I had in mind was the “good old” monitoring of uplink transparency all the way to internet, preventing the situation where the WAN interface is physically up so the router keeps using it although the packets sent through that interface get nowhere due to an outage further in the ISP network.
But some weeks ago I was wondering whether the location data could be retrieved from the Starlink and haven’t come across gRPC when googling. Once the 7.13 (with its capabilities to interwork between json and variables) becomes stable, I guess I’ll give it a try.
Thank you Sindy for the script. I was starting to wonder how I was going to handle changing the routes in the event our IP addresses change. I didn’t want to do it manually especially if any change occurred in the middle of the night or otherwise working with other clients and out of reach so this is very helpful. I am going to create that now.
I do apologize about the interface names. I had just added -wanN to the end of the default names but willing to rename them to simplify things if necessary. I am very flexible with any changes as is our client. They’ve lived with terrible or no internet most of their lives so they are very understanding and know I am working to make things better for them.
Good news about the satellite visibility is that this village is in an area of Alaska that is high up, and visibility should be of no concern as we were told by Starlink. The site is wide open with little to no obstructions and so far performance has been very impressive. OneWeb can’t even begin to compete and the village has made sure I knew that. Even with the fraction of a percent obstruction that the Starlink dashboard is showing on a few of the terminals, signal strength is very strong and latency is low.
I did rework some of mangle rules late last night prior to seeing your post this morning. I found this guide on the MikroTik site that seems very informative. https://mum.mikrotik.com/presentations/PL10/balancing.pdf Attached is my latest config although it sounds like I may not need all the connection marks as you mentioned so we can get rid of anything that is not needed. I did have a question about the failover section and while I did set the distances in the routes. I am starting to wonder whether I even need to do that since all the gateways are the same on most of the terminals so if it goes down then it seems like it would be down for every connection that uses the same gateway?
To your point of copying-pasting configs without fully understanding. While it made me laugh, it’s a very valid point and I do this quite a bit when I am playing around to see how things will work. I do always make sure to back up my original configs so I can recover in the case I mess up and that has gotten me in the past.
Again I appreciate you and everyone else for being patient and working with me on this.
Depending on the order of creating the DHCP client, renaming the interface it is attached to, and possibly rebooting the router in the past the result may be different. The lease script uses the interface name to find, add, and modify the routes, and on my test router it was remembering the previous name. So forcing the DHCP client to renew an address after adding the script using /interface/ethernet/disable sfp28-N-wanN ; delay 2s ; /interface/ethernet/enable sfp28-N-wanN rather than /ip/dhcp-client/release [find where interface=sfp28-N-wanN] may be a better choice after renaming the interface or if you are not certain.
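To give an idea of the mechanism only, a lease script of this kind can be quite small. The sketch below is not the script discussed in this thread; it assumes that every manually created route for an uplink carries a comment of the form via-<interface name>, which is an invented convention:

```routeros
# $bound, $interface and $"gateway-address" are supplied by the DHCP client
# when it runs the lease script
:if ($bound = 1) do={
    # Rewrite the gateway of every route tagged for this interface
    # to the freshly leased gateway, pinned to the interface
    /ip/route/set [find where comment=("via-" . $interface)] gateway=($"gateway-address" . "%" . $interface)
}
```

Because the routes are looked up through $interface, a stale interface name in the DHCP client would make the find return nothing, which matches the renaming catch described above.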
Frankly speaking the simulation at starlink.sx has scared me a lot as it is showing long (as in tens of minutes) outages in coverage for Unalakleet, so I hope they are using outdated/inaccurate data. LEO satellites are superior to geostationary ones for both delay and throughput per square mile of coverage. Far lower distance to travel so far lower delay, far lower attenuation to compensate, and far bigger area to cover by a single antenna with a given throughput. But in these areas, polar trajectories seem to be mandatory so it’s more or less a separate (from the main one covering the belt from 52°S to 52°N) Starlink constellation. And OneWeb probably didn’t get that far (yet) so geostationary is their only choice for those areas.
From my experience with other Starlink installations in “bypass” mode, the gateway is probably redundant - if you look into the ARP table, you will probably see the MAC addresses of all (currently “both”) the gateway IPs to begin with 00:00:5E:00:. So it is more likely that all your terminals lose connection than that a gateway in just one of the subnets becomes unreachable.
So I maintain my previous position that we have to deal with issues we have the capacity to deal with, i.e. a breakdown of a single terminal. The only satellite within reach gone bonkers does not fit into this category.
As for route distances - these determine the priority among routes whose dst-address and routing-table parameters are identical. If multiple such routes are eligible for being active, those with the lowest value of distance are actually made active, and if there are multiple such ones, they are used in a round-robin manner (ECMP). So as @anav probably wrote earlier - for failover and load distribution to work together, you need to have one routing table per uplink that is used for traffic that should prefer that uplink but can send the traffic via other uplinks if the preferred one becomes unusable. In the simplest case to configure, you define just one backup uplink for each preferred one, so if the preferred one dies, the backup one has to bear its full load on top of its own. In the optimized case, the load of the failed link is evenly distributed among all the remaining ones. So many more lines of configuration with much more room for mistakes, but potentially less impact on customers if a terminal eventually fails. Choose your poison.
So for starters, let me give you an example for the L2TP, which is the most wanted functionality right now I gather.
You’ll add a routing table named for-l2tp with two routes, one with distance=1 and the other one with distance=2, using the two WANs for which you have modified the DHCP client behavior. Let’s say you’ve chosen WAN 9 and WAN 10:
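A minimal sketch of that table and its two routes, assuming the interface names sfp28-9-wan9 and sfp28-10-wan10 and using 1.1.1.1 purely as a placeholder gateway to be overwritten by the lease script:

```routeros
# Declare the dedicated routing table (RouterOS 7 syntax)
/routing/table/add name=for-l2tp fib

# Preferred path for the tunnel: WAN 9
/ip/route/add dst-address=0.0.0.0/0 gateway=1.1.1.1%sfp28-9-wan9 routing-table=for-l2tp distance=1
# Fallback path, used only while the distance=1 route is inactive: WAN 10
/ip/route/add dst-address=0.0.0.0/0 gateway=1.1.1.1%sfp28-10-wan10 routing-table=for-l2tp distance=2
```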
By forcing a release followed by a re-lease (pun intended) of the DHCP address as described above, you’ll trigger the lease script that should replace the 1.1.1.1 by the correct addresses of the respective gateways in these routes.
There are multiple possible ways to make the L2TP connection use this table; since the own WAN addresses of the Mikrotik are dynamic and since you identify the VPN server by its fqdn, so I assume its IP address may also change, all these ways require that the routing table to be used is assigned using mangle rules in chain output.
So let’s make any connection from the router itself to your VPN server use this routing table. To make sure it won’t break once the server migrates to another IP address, we’ll add an address list to track the fqdn:
/ip/firewall/address-list/add list=re-vpn address=vpn.richesineng.com
It should create a dynamic item in the same address list but with the actual IP address as address. If it doesn’t, something is wrong with the DNS setting.
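A quick way to check this is to print the list; the static fqdn entry and the dynamic item holding the resolved IP address should both be listed:

```routeros
# List every entry of the re-vpn address list, static and dynamic
/ip/firewall/address-list/print where list=re-vpn
```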
Then, you add a single rule to the very top of chain output in mangle (which is easy as that chain is totally empty now):
/ip/firewall/mangle/add chain=output dst-address-list=re-vpn action=mark-routing new-routing-mark=for-l2tp passthrough=no
It seems this should be it; there is an additional issue, though. A packet sent by the router itself must first be routed using routing table main before it can get to mangle chain output, so it gets assigned some source address depending on the out-interface chosen by main. If mangle assigns a routing-mark to it, the routing is done again, but the source address remains the same. So a src-nat (or masquerade) rule must replace it by the address of the out-interface. This is OK until the uplink connected to that out-interface stops working without the interface going physically down. The connection tracking only removes the connection from its inventory if it has been src-nated using a masquerade rule and if the reply-dst-address assigned by the masquerade rule goes missing. Starlink seems to lease the addresses for 5 minutes (but that’s for CGNAT ones, the public ones may be treated in a different way); if you are happy with the L2TP connection being re-established some 5 minutes after the preferred uplink goes down, nothing else needs to be done. If not, it requires a housekeeping script that removes the address as soon as it detects the failure of the uplink. But again, step by step. Right now I’ll be glad if you make the L2TP work without fancy stuff.
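For reference, a masquerade rule of the kind this relies on might look like the sketch below. The interface list name WAN is an assumption; use whatever list actually contains the ten uplinks:

```routeros
/ip/firewall/nat
# Rewrite the source address of anything leaving via an uplink
# to the current address of that uplink's interface
add chain=srcnat out-interface-list=WAN action=masquerade
```

A single rule on the interface list covers the rerouted L2TP packets as well, since masquerade picks the address of whichever out-interface the final routing decision selected.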
If these settings don’t get the L2TP going, I’ll have to see it online.