Safe Config-as-Code deployment to MikroTik with rollback on failure

I’m working on a config-as-code setup for my MikroTik devices, and I’ve hit a recurring problem: when a new config breaks things (e.g. wrong firewall rule, bad IP, or a syntax error in the .rsc), I often lose access to the router entirely. Recovery means physically going there and factory resetting. Not fun.

I’ve been using an SSH-based provisioning script that:

  • Exports the current ROS config and saves it locally
  • Picks up my “committed” .rsc file (templated with Jinja2 + YAML secrets, so I don’t keep secrets in it)
  • Imports it to the device
  • Runs /system reset-configuration run-after-reset=...
  • Waits for the router to reboot and checks it’s alive
  • Exports the complete current config again
  • Prints a diff to show what has changed.
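
The last two steps of that list (export again, then diff) could look roughly like the sketch below. It assumes both exports are already saved as local files and that the export header (timestamp, version) is written as lines starting with #, which would otherwise show up as a change on every run. The function name config_diff is a placeholder, not part of the original script.

```shell
# Compare two saved RouterOS exports while ignoring '#' comment lines
# (assumption: the timestamp/version header is emitted as '#' comments).
# Prints a unified diff; returns 0 if identical, 1 if they differ.
config_diff() {
    cd_tmp=$(mktemp -d) || return 2
    sed '/^#/d' "$1" > "$cd_tmp/before.rsc"
    sed '/^#/d' "$2" > "$cd_tmp/after.rsc"
    diff -u "$cd_tmp/before.rsc" "$cd_tmp/after.rsc"
    cd_status=$?
    rm -rf "$cd_tmp"
    return "$cd_status"
}
```

Called as config_diff before.rsc after.rsc, the exit status tells the deployment script whether anything actually changed.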

This mostly works… until it doesn’t. If there’s a problem in the run-after-reset script, the router may never come back up.

So I’m thinking — can I make this safer?

Instead of wiping the device and hoping for the best, maybe I can do something like a “big-ass try-catch” around the whole config. The idea is:

  1. Before applying anything, I save a binary backup (latest-working).

  2. I upload the new config script (new-config.rsc).

  3. Inside that script:

    • Set a global variable like PROVISIONED=true at the end if everything works.
    • Wrap the whole config in a :do { ... } on-error={ ... } block.
    • The on-error block would run /system/backup/load name=latest-working and reboot. Finally, as a safeguard, if PROVISIONED isn’t set, I restore the backup and reboot as well, just in case some errors are not properly thrown.

Rough outline of the new-config.rsc:

:global PROVISIONED false

:do {
    # all my config here (IPs, firewall, bridge, DHCP, etc.)

    :global PROVISIONED true
} on-error={
    /log error "Provisioning failed, rolling back"
    /system backup load name=latest-working
    :delay 10
    /system reboot
}

:if ($PROVISIONED = false) do={
    /log error "Provisioning incomplete, rolling back"
    /system backup load name=latest-working
    :delay 10
    /system reboot
}

The deployment script (simplified):

ssh "$ROUTER" "/system backup save name=latest-working"
scp new-config.rsc "$ROUTER:new-config.rsc"
ssh "$ROUTER" "/system reset-configuration no-defaults=yes keep-users=yes run-after-reset=new-config.rsc"
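
The simplified script stops at the reset, while the step list earlier also waits for the router to come back. A minimal sketch of that wait, assuming the router counts as alive once an ssh command succeeds. The helper name wait_until, the timeout values and the ssh options are illustrative choices, not part of the original script.

```shell
# Poll until a probe command succeeds or a deadline passes.
# usage: wait_until TIMEOUT_SECONDS INTERVAL_SECONDS COMMAND [ARGS...]
# Only the sleeps are counted, so slow probes stretch the real deadline.
wait_until() {
    wu_timeout=$1
    wu_interval=$2
    shift 2
    wu_elapsed=0
    while ! "$@" > /dev/null 2>&1; do
        if [ "$wu_elapsed" -ge "$wu_timeout" ]; then
            return 1    # still failing after the deadline
        fi
        sleep "$wu_interval"
        wu_elapsed=$((wu_elapsed + wu_interval))
    done
    return 0
}

# Example (not run here): give the router up to 5 minutes to answer.
# wait_until 300 5 ssh -o BatchMode=yes -o ConnectTimeout=5 "$ROUTER" ':put alive'
```

A non-zero return is the signal that the router did not come back, which is the case the rest of this thread tries to recover from automatically.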

What I’m Trying to Figure Out

  • Does this approach make sense at all?
  • Are there edge cases where this might fail silently or leave the router in a half-configured state?
  • Any gotchas around using /system backup load in a script? Does it really revert everything cleanly?
  • Is it dangerous to reboot right after a restore like this?
  • Is there a better way to mark success than a global variable?
  • Any risk of import continuing past an error in the :do block?
  • Finally: what if some error in configuration halts the importing process e.g. because it’s waiting for user input?

Appreciate any feedback - I’m trying to avoid ever having to crawl into a basement with a paperclip again.

I think the main concern is that some problems can escape the :do ... on-error={} entirely: the script may not even run if it doesn’t “compile”. Your approach is no worse off than before, where the same thing could happen, so I’m just noting that your solution may not be a complete panacea.

Also, if you’re targeting recent V7 (i.e. stable), you’d likely want to use the newer :onerror err in={} do={} syntax instead of :do ... on-error={}.

Well, it should work, and it should be okay to reboot after it’s finished. But the backup should replace the entire file system, so I don’t think it can end up “incomplete”.

And since you’re restoring a backup taken on the same version, it should be okay. Where backups sometimes fail is when an older or newer backup from a different version is used (in most cases this still works, but it depends on what config was used and whether anything changed in syntax). You have a :delay between the restore and the reboot. In theory that should not be needed, since commands should be synchronous. But for safety it’s smart and can’t hurt: there are cases and past bugs where commands are not “completely” done before they return.

Well, barring some really bad future bug, I think you’re safe: it’s not going to run an on-error={} block unless there is an error. Restore does not re-format, so if there are “left over” files somehow, those are still around, but that would only affect disk space, not the config.

“Better” is pretty subjective. I suppose you could use a function instead, but that would be more confusing for a run-after-reset= script.

That would happen because some needed parameter for an “add” was missing. It would just hang and not hit on-error={}, AFAIK. Some scheme involving jobs/parse/execute might be able to add a timeout to catch the waiting-for-input case… but that just makes the script more likely to fail.

Sure. How much it helps, IDK. I’m also a believer that more lines of code mean more potential for bugs, i.e. by adding more complex schemes, you run the risk that it’s the error-checking scheme that causes a breakage.

At the end of the day, testing on the same hardware and the same version beforehand is always best. It’s better to have a simple script that’s tested than to “toss” config into a try/catch and sort out failures after the backup had to be restored.

i.e. nothing substitutes for simply testing run-after-reset=/netinstall/defconf scripts on a test device. If you don’t have the same hardware, trying CHR with the same version may work too… But I’ve never seen a failure where a device acted differently than a test device of the same make/model/version.

Thank you for such a detailed answer!

I was thinking of such an approach, but I’d need to duplicate every device I have, and the CHR won’t have the same hardware or interface names, e.g. configuring wifi would throw an error.

I hope MT will at some point introduce native, first-class support for configuration-as-code with auto-rollback.

Heh, over at reddit, giacomok shared his solution: they utilise netwatch’s down-script feature to rollback config.

So my current idea is that the very first thing the imported .rsc file contains would be something like:

/tool/netwatch/add \
    start-delay=10 \
    host=1.1.1.1 \
    down-script="/system/backup/load name=latest-working" \
    up-script=":delay 60; /tool/netwatch/remove [find host=1.1.1.1]"

So if we still don’t have reachable internet within 10s after the import process started, it should roll back to the backup. And if we do, it will wait 60s and remove itself.

Hmmm, that implies setting the netwatch startup-delay to 0 or a few seconds (while the default is five minutes).
Surely the default can be lowered, but how much (without possible adverse effects)?

Why 0? We cannot start the netwatch immediately: it would fail to reach 1.1.1.1 right away, because the device configuration is not yet imported (that takes at least a couple of seconds), so it would roll back to the backup and interrupt the configuration import.

In my example above I set it to 10, but that’s just a placeholder; most likely it needs to be longer. I will need to test how long it takes for the device to be live after calling the run-after-reset script.

@jaclaz was speaking about startup-delay, which has a default value of 5 minutes:

Netwatch - RouterOS - MikroTik Documentation

startup-delay (Default: 5m) Time to wait until starting Netwatch probe after system startup

And not start-delay. Assuming your import starts right after the router reboot, which is the case if you use /system reset-configuration, then you probably don’t need to set start-delay=10 because what you wrote (that I quoted) should not be true with the default startup-delay=5m.

Exactly. Maybe I am completely wrong, but generally speaking I tend to trust that default settings were chosen by competent people for some reason. Only when I am really convinced that, in the specific case, they chose something that makes no sense do I change them as I see fit.

Chesterton’s Fence:
https://en.wikipedia.org/wiki/G._K._Chesterton#Chesterton's_fence

Now, if the good Mikrotik guys chose such a high value as default, I have to presume that - at least on some models - there is an initial time when the Netwatch is not fully reliable (because other parts of the router are initializing, or whatever other reason).
So, one thing is reducing the startup-delay from 5 minutes to (say) 2 minutes (probably depending on the specific device boot time and complexity of the configuration), and another one is reducing it from 5 minutes to (still say) 10 seconds.

Ahh, I misread. I didn’t know there are two options: start-delay and startup-delay.

Sure, but 5 minutes? Seems like a major overshoot. From my observations, for devices like the Chateau 5G ax or CRS310-8G+2S+IN it takes around 90s to do a complete reboot + factory reset + configuration import, where importing is the shortest stage.

Nevertheless, thanks for the insights! Once I get to it, I’ll share what timings work for me.

@jaclaz !!!

Why 5 minutes?
Because if you reboot the device on my network (and also on some LTE operators in Italy) the pppoe server won’t let you log in again unless… 5 minutes have passed since the last time…
Since I’m not the only one who adopts this policy, rebooting a device and starting netwatch after 10 seconds could result in no internet, and netwatch might keep rebooting/reloading the device because it can’t “see” the internet…

So, the time for rebooting and reactivating the peripherals + 5 minutes is right and sufficient.

It’s better if you specify its use; you can get specific advice.

To mass update my clients’ CPE configurations, I rely on .rsc.

The first thing the script does is create a backup and a scheduler to reload it if the CPE isn’t rebooted within 5 minutes to complete the (re)configuration/update.
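
A minimal sketch of that backup-plus-scheduler pattern, driven from the deployment host. This is not the poster's actual script. It assumes that a scheduler entry added with interval=5m first fires about five minutes after it is added, that /system backup load runs unattended from a scheduler (it may need a password= argument depending on how the backup was saved), and that ssh returns non-zero when the router is unreachable. The names guarded_deploy and rollback-guard are made up for illustration.

```shell
# Arm a rollback timer on the router, apply a config, then disarm the timer.
# ROUTER is the ssh target, e.g. admin@192.0.2.1 (placeholder).
guarded_deploy() {
    gd_config=$1
    gd_name=$(basename "$gd_config")
    # 1. Snapshot first, so the backup itself does not contain the guard.
    ssh "$ROUTER" '/system backup save name=latest-working' || return 1
    # 2. Arm: restore the snapshot unless we get back in to cancel.
    #    Assumption: with interval=5m the first run is ~5 minutes from now.
    ssh "$ROUTER" '/system scheduler add name=rollback-guard interval=5m on-event="/system backup load name=latest-working"' || return 1
    # 3. Upload and apply the new configuration.
    scp "$gd_config" "$ROUTER:$gd_name" || return 1
    ssh "$ROUTER" "/import file-name=$gd_name" || return 1
    # 4. Disarm. If the new config locked us out, this ssh call fails,
    #    the guard stays armed, and the router restores the snapshot.
    ssh "$ROUTER" '/system scheduler remove [find name=rollback-guard]'
}
```

Any early return after step 2 leaves the guard armed on purpose, so a half-applied import is rolled back as well. This only fits an incremental /import; with /system reset-configuration the scheduler entry would be wiped along with everything else, so the guard would have to live inside the run-after-reset script.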

I'm deploying MikroTik configurations using Infrastructure-as-Code tools (Terraform/Pulumi) and facing critical issues on production systems. I'd like to discuss solutions or workarounds.

Problem 1: Lack of Atomic Transactions
When deploying configuration changes via the REST API or Terraform provider, each command executes immediately. If the connection is lost mid-deployment (e.g., due to firewall or IP address changes), the router ends up in a partial state, and I'm locked out.

What I need: Something like Debian's netplan try: apply a complete configuration script, monitor for confirmation, and automatically roll back to the previous working config if no confirmation is received within a timeout (which is somewhat answered here).

Problem 2: Non-Idempotent Behavior
RouterOS commands work interactively (only updating specified properties), which creates issues for declarative config management:
/ipv6/address add address=fd00::1/64 advertise=yes
Later, automation runs:
/ipv6/address add address=fd00::1/64
Result: advertise stays "yes" instead of resetting to default

This makes it impossible to guarantee a defined state without knowing all properties and their defaults.
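
One way to approach this from the automation side is to refuse to send a command that leaves properties implicit. A minimal sketch, assuming the tool keeps its own list of properties it wants pinned for each menu. The helper name missing_props is hypothetical, and the matching is deliberately naive: it looks for " name=" anywhere in the line, including inside quoted values.

```shell
# Print every property from the list that the command line does not set.
# usage: missing_props 'COMMAND LINE' PROPERTY...
missing_props() {
    mp_line=$1
    shift
    for mp_name in "$@"; do
        case " $mp_line " in
            *" $mp_name="*) ;;                 # property is spelled out
            *) printf '%s\n' "$mp_name" ;;     # property left implicit
        esac
    done
}

missing_props '/ipv6/address add address=fd00::1/64' address advertise
# prints: advertise
```

A deployment could abort whenever the output is non-empty, which catches the second command in the example above before it reaches the router. It does not solve the discovery problem: the property list still has to come from somewhere.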

Problem 3: No Schema Introspection
The API is continuously extended with new properties. My automation tools can't discover:

  • What properties exist for each resource
  • What the default values are
  • Whether properties have been added/changed between RouterOS versions

Questions:

  1. Is there any native support for atomic configuration updates with rollback planned?
  2. Can /console/inspect be used reliably to discover all properties and defaults for automation purposes?
  3. Are others using Terraform/Pulumi successfully? What patterns do you use to prevent lockouts?
  4. Would MikroTik consider implementing a "commit confirm" or transaction mode?

Current workarounds I'm aware of:

  • Safe Mode (only protects single commands)
  • Scheduler-based rollback scripts
  • /system reset-configuration run-after-reset= (only for full resets)

Any guidance would be appreciated!

A load of nonsense thrown out there at random.
There's so much of it, it's going to take a long time to refute it all.

Especially when it's written as if parameters and values were changing on their own, just randomly,
while it's clear that if something changes, someone has definitely updated the software automatically, in the worst possible way.

You just had to read the previous posts, or rather the last one...

I don’t understand your reaction. I asked for advice on how to build automation software that can bring a ROS router to a stable state across multiple ROS versions or different devices, and whose state may also have been set by other config sources.
I did not say anything about parameters changing on their own. I only tried to determine whether it's possible to discover all the properties of every command. I would call that a schema, so that the automation software is able to set all known properties to a defined state.
My /ipv6/address example was meant to show that if the automation software does not know about the “advertise” property, it cannot update (set) it, resulting in an undefined state.
To be more precise, the second step is not an /ipv6/address add, it’s
/ipv6/address set [find address="fd00::1/64"] address="fd00::4/64"

I don’t think @rextended’s reply was warranted. Let’s assume he simply does not understand your post.

Similar questions have been asked before, but the RouterOS philosophy is not like what you would want to see. I know that other brands have functionality like that, but RouterOS does not.

Only things I can suggest:

  • connecting by MAC address usually gives access to a badly configured router. but it offers only commandline and winbox access, not REST API. winbox uses an API as well, I think it is similar to the old-style API, but it is undocumented.
  • when you enter configuration via the commandline, you can use braces to enclose several commands (separated by semicolons or newlines) and the whole block will only be executed once the closing brace is sent. that way you can make changes where the intermediate state would mean a lost connection
  • you can use “/export verbose” (possibly combined with “terse” and “show-sensitive”) to obtain all parameter values, also those that are set to default
  • it is not generally possible to obtain default values except by trying (see if they disappear in a plain export). in some cases you can use 'set !parameter' or 'set parameter=""' to get an empty and thus default value
  • in some cases it would be possible to reset values to default by deleting the item and then re-adding it without specifying a value, but of course that is not possible and/or safe in all cases
  • I have heard about /console/inspect but AFAIK it is undocumented just like winbox API. probably it is the REST API analog
  • a safer way to make changes with rollback is already described at the start of the topic. alternatively you can also partition the router and switch partition somehow when bad things happen

Thx @pe1chl, it felt a little unwelcoming :-)

Thanks for the quick feedback. So, what if I build a program that adds a test record in every relevant section, like /ip/address add address=127.0.0.127/8 comment="my-test-record", then uses /export verbose to infer something like a schema, and then removes the test records?
That should be quite easy to create. It means we don’t always have to remove and re-add; we can safely update instead, which partly addresses the disconnect problem, and it lets the automation tool fail before applying if something new appears.
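
The inference step could start from something as small as the sketch below, which pulls the property names out of one exported line. It assumes a terse export (one full record per line) and that values containing spaces are double-quoted with backslash escapes. The function name property_names is hypothetical, and the sample line in the usage note is made up rather than captured from a router.

```shell
# Print the property names set on one terse export line, one per line.
# Text inside double quotes and text after an '=' is treated as value.
property_names() {
    printf '%s\n' "$1" | awk '{
        in_quotes = 0; in_value = 0; word = ""
        for (i = 1; i <= length($0); i++) {
            c = substr($0, i, 1)
            if (in_quotes) {
                if (c == "\\") { i++ }                  # skip escaped character
                else if (c == "\"") { in_quotes = 0 }
            } else if (c == "\"") {
                in_quotes = 1
            } else if (c == " ") {
                in_value = 0; word = ""                 # token boundary
            } else if (c == "=" && !in_value) {
                if (word != "") print word              # word before "=" is a name
                in_value = 1; word = ""
            } else if (!in_value) {
                word = word c
            }
        }
    }'
}
```

For a line such as /ip address add address=127.0.0.127/8 comment="my-test-record" disabled=no it would list address, comment and disabled. Saving that list per menu and per RouterOS version, then diffing the lists, is one way to notice when a property appears or disappears.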

As for the ambiguity of setting default values: I was in that hell years ago. But if we are able to update instead of remove/add, we could try both known possibilities and learn how each parameter behaves. These learnings could be stored in the schema.
All this sounds complex, but these days such software is easier to build with a little AI help.

I will try it, and try to make out of it a Pulumi provider — I don’t think extending the Terraform provider is a good idea.

Exactly, these are the conceptual errors of your first post.
Who set "advertise=yes" in the first place?
If it doesn't belong, explicitly remove it, without wasting time checking how it's set.
The default can change over time, just ignore it.
Then, for the list of parameters and possible values, just use [TAB] and [F1] on a test model, before "distributing commands," as is appropriate.

# before: something set a wrong IP and a wrong advertise=yes; reading the values back, one obtains:
interface=ether1 address=fd00::1/64 advertise=yes

# later: something overwrites the wrong values with the correct values
/ipv6 address set [find where address="fd00::1/64"] address=fd00::4/64 advertise=no disabled=no eui-64=no no-dad=no from-pool=""

# correct form of adding an IPv6 address, to the ether1 interface in this example:
/ipv6 address add interface=ether1 address=fd00::4/64 advertise=no disabled=no eui-64=no no-dad=no from-pool=""
# this way there is ZERO ambiguity, no matter the defaults

And speaking of the version confusion, just use one for v6 and one for v7 without creating a mixed and unpredictable network.

Who set "advertise=yes" in the first place? — In my case, I don’t know

As for [TAB] and [F1]: that is not automation at all, so I could build something that uses a tool like expect to retrieve the schema, or use my test records to get a schema.

Added some lines to the previous script.

A machine cannot do everything: you must manually test each new RouterOS version first to check what has changed in the API/CLI/REST, and test that version for a month before putting it into production...

I don't think so. Don't care about people, you'll only find the ones that matter at home.

If you need help [with what I know how to do], I'm here, anyway.