Unexplained Fetch Error

I’ve got a strange script issue I can’t figure out. I have a few thousand devices running an identical script. 99% of them are running normally, but about 20 of them are erroring out and not completing the script. These 20 devices show “Unable to connect to service” in the log, and they do not execute the remainder of the script after the point where Connect is called, as they should.

This function is part of a larger script, but the whole script is identical on all devices, and there is zero difference in the device configurations except for basic things like IP addresses & wifi settings. The specific section of code that is failing is below. Again, this is working on thousands of devices but failing on 20, and the 20 that are failing are spread across the country. They are all connecting to the same URL with the same method, yet a handful fail to connect. All devices are running 6.49.7 (this is part of an update script that will trigger them to update to 6.49.8…)

Since there’s no error handling in scripts, I don’t know where to go next on this one…


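# Fetch the URL given as the first positional argument ($1) and return the response body.
# $method is not defined here, so it is presumably passed by the caller as a named argument.
:local Connect do={
    do {
        :return ([/tool fetch mode=https http-method=$method output=user url="$1" as-value ]->"data")
    } on-error={
        :log warning "Unable to connect to service"
    }
}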

Yeah, you lose the HTTP return code in on-error={} – makes it tough…

My initial thought was that if the URL was interpolated from other variables, perhaps something needs additional escaping (like a device name etc), or some part of the URL resulted in a nothing/nil. A lot can go wrong when building a string with the types/etc… But if it's really just a constant, likely not…

My next guess would be that DNS isn’t resolving correctly.

Maybe implement a retry, perhaps with a :ping/:resolve to the host in the URL before the retry, so you’d have more data…? But yeah, at least give it another try before giving up after just one attempt… In V7, there is [:retry] for this…but that’s not in V6. Also, some logging of the URL provided might be handy, as it would help to know what it was trying to reach…
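To make the retry idea concrete, here is a rough sketch of how it might look on V6. It is untested against the original script: the function name ConnectRetry, the named arguments host and tries, and the 5 second delay are all made up for illustration. It logs the URL and a :resolve result on each failure so there is more to go on than the single warning.

```routeros
# Hypothetical retry wrapper (names and delay are assumptions, not from the original script).
# $1 = URL, method = HTTP method, host = hostname to resolve for diagnostics, tries = max attempts.
:local ConnectRetry do={
    :local attempt 0
    :local done false
    :local result ""
    :while (($attempt < [:tonum $tries]) && ($done = false)) do={
        :set attempt ($attempt + 1)
        :do {
            :set result ([/tool fetch mode=https http-method=$method output=user url=$1 as-value]->"data")
            # Only reached when the fetch did not raise an error.
            :set done true
        } on-error={
            :log warning ("Fetch attempt " . $attempt . " failed for " . $1)
            # Record whether DNS resolves at the moment of failure.
            :do {
                :log info ("DNS " . $host . " resolves to " . [:resolve $host])
            } on-error={
                :log warning ("DNS resolution failed for " . $host)
            }
            :delay 5s
        }
    }
    :return $result
}

# Example call with placeholder values:
:local data [$ConnectRetry "https://service.example.com/api" method="get" host="service.example.com" tries=2]
```

If every attempt fails, the function returns an empty string, so the caller still has to check the result before parsing it.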

You can use this example to read the fetch error:
http://forum.mikrotik.com/t/fetch-capable-of-following-redirects/151723/7

In this case the URL is the same for every single device. It’s passed as a variable because this fetch call is used about 5x throughout the rest of the script, so for future updates or internal testing the URL can be changed once for the whole script.

DNS was my first thought too. On one device I pinged the endpoint from the device and had no issue; on another I checked the DNS cache to see if the IP was already stored in the resolver, and it was…
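For anyone wanting to repeat those checks from the terminal, something along these lines should do it. The hostname service.example.com is a placeholder for the real endpoint, not the actual one.

```routeros
# "service.example.com" is a placeholder hostname.
# Ask the router's resolver for the address right now.
:put [:resolve "service.example.com"]
# Show what is currently held in the DNS cache for that name.
/ip dns cache print where name="service.example.com"
# Confirm basic reachability from the device itself.
/ping service.example.com count=3
```

Note that a successful ping only shows the host is reachable; it says nothing about whether the HTTPS request itself would succeed.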

After giving up for the night yesterday I came in this morning and saw that 14 of the 20 devices actually did complete their update process last night, so now I’ve only got 6 holdouts… I’d like to say this was probably just a transient network error on the upstream carrier(s), except there were multiple devices in multiple different parts of the country that had the issue, and it persisted for several days, while other devices on the same network never had any issues… I’m just going to monitor these remaining devices and see if they eventually pull the update on their own and if they are still a problem next week then maybe I’ll manually upgrade and reboot them.

Unfortunately, using the external fetch script/write-to-file option isn’t a good idea here. This script runs every 30 minutes and normally performs 2 fetch commands per execution, but sometimes as many as 6, and that would add up to excessive writes to the flash. We used to use a similar method but were forced to move to a RAM-only solution because we started to see a number of devices with growing numbers of bad blocks where we were doing that. Yes, some devices have a RAM disk enabled by default, but not all do. Plus I’m not sure that method would work properly with the string-to-array parsing that we do on the data returned from the fetch command. Getting it to work correctly as a function inside the script was already quite a challenge, and with the number of devices we’re supporting and the fact that they are spread across all of North America, I’m pretty cautious with all changes.

And you can’t use ramdisk (or perhaps /task) either in V7 to apply @rextended’s file approach without adding writes… (Now I’ve never run into flash issues but…I don’t provoke them either & imagine specific hardware may be better/worse with flash durability )

I’m still suspicious of DNS - either the network, TTL/cache, or Mikrotik’s implementation… I think some fallback URL (maybe hardcode IP) might be a middle ground to eliminate one failure point.

DNS was my first, and still best, hunch, but I can’t find any failure in DNS resolution to blame it on. A fallback URL is a great idea. We can’t go straight to the IP, as SSL is required both as a business requirement and as a server requirement. The server uses SNI, so there is no SSL access without a matching hostname in the request.

In our next update I’ll add some retry logic in the on-error section and have it try a 2nd time before failing, using another hardcoded URL for that attempt. There are already several valid URLs we can hit to reach either that resource or a 2nd instance of it, and two different SSL certs are used depending on which URL you use to access it, so it’s a really good idea to set up fallback logic in there.
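A minimal sketch of what that fallback might look like, assuming two equivalent endpoints. The URLs below are placeholders and the GET method is an assumption. Each URL keeps its own hostname, so the SNI requirement is still met.

```routeros
# Placeholder URLs; replace with the real primary and secondary endpoints.
:local urls {"https://primary.example.com/update";"https://secondary.example.com/update"}
:local data ""
:local ok false

:foreach u in=$urls do={
    # Skip the remaining URLs once one of them has answered.
    :if ($ok = false) do={
        :do {
            :set data ([/tool fetch mode=https http-method=get output=user url=$u as-value]->"data")
            :set ok true
        } on-error={
            :log warning ("Unable to connect to " . $u)
        }
    }
}

:if ($ok = false) do={
    :log error "All service URLs failed"
}
```

Logging the specific URL on each failure also shows whether the holdouts fail on one hostname or on both, which would help separate a DNS or certificate problem from a general connectivity one.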