I have found that if any text is converted to Unicode (UTF-8 byte escapes), it can be displayed on the terminal screen and is also recognized correctly in Telegram (this would be a great solution).
For example, we start from the text: hello my friend camión Ññ
In the web converter we have this result:
“\68\65\6C\6C\6F\20\6D\79\20\66\72\69\65\6E\64\20\63\61\6D\69\C3\B3\6E\20\C3\91\C3\B1”
If we test it in Telegram, the result is OK.
{
    :local text "\68\65\6C\6C\6F\20\6D\79\20\66\72\69\65\6E\64\20\63\61\6D\69\C3\B3\6E\20\C3\91\C3\B1"
    :local MessageText $text
    # load the sender script as a function (the script MyTGBotSendMessage must already exist)
    :local SendTelegramMessage [:parse [/system script get MyTGBotSendMessage source]]
    $SendTelegramMessage MessageText=$MessageText
    # print the same string on the terminal
    :put $text
}
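For reference, what the web converter does can be sketched outside RouterOS. The snippet below is only an illustration in Python (chosen because it is easy to run on a PC; the function name `to_routeros_escapes` is made up for this example). It encodes the text as UTF-8 and writes every byte in the `\XX` notation that RouterOS strings accept.

```python
def to_routeros_escapes(text: str) -> str:
    """Encode text as UTF-8 and write each byte as a RouterOS-style \\XX escape."""
    return "".join(f"\\{byte:02X}" for byte in text.encode("utf-8"))


if __name__ == "__main__":
    # Should print the same string the web converter produced above.
    print(to_routeros_escapes("hello my friend camión Ññ"))
```

The ASCII letters come out as one escape each, while ó, Ñ and ñ each become two escapes (`\C3\B3`, `\C3\91`, `\C3\B1`), because UTF-8 needs two bytes for them.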
Well, the issue is that INPUT to CLI/winbox/etc is ASCII only. And winbox’s terminal will strip Unicode, so you won’t see anything there; only via SSH would the UTF-8 output work.
But in a script you could use variables holding the UTF-8 escape sequences together with string interpolation, which should work for sending text out via fetch, like so:
You’d obviously have to define the set of unicode char codes (in RouterOS’s byte notation, “\xx\yy\zz”) ahead of usage, but that might work in some cases.
I suppose you could also use a function in the approach above, so that each letter could still be output as normal ASCII. Syntax is trickier with a function however:
# global flag to output UTF-8
:global "use-unicode" 1
# Tilde over lowercase N
:global tildan do={
    :global "use-unicode"
    :if ($"use-unicode" = 1) do={
        :return "\C3\B1"
    } else={
        :return "n"
    }
}
# output as unicode
:set "use-unicode" 1
:put "espa$([$tildan])ola"
# output as ascii
:set "use-unicode" 0
:put "espa$([$tildan])ola"
Writing a predefined text is fine, but the goal is to be able to extract any text from a received SMS and forward it to Telegram with its Unicode characters intact, so it can be read there. I have tried to see how this converter works [ https://r-1.ch/mikrotik-unicode-ssid-generator.php ], but my programming knowledge is very limited.
Ah, that’s slightly different: SMS uses UCS-2 encoding, not UTF-8. So it’s really not the same as the “emoji” code, which takes UTF-8.
You’re looking for a direct UCS-2/UTF-16 to UTF-8 conversion? That seems already covered by @rextended code above.
UTF-16/UCS-2 uses double-byte units to store the “popular” Unicode characters, the same format Windows (and SMS) use internally. UNIX (and JSON) etc. generally favor UTF-8, which is identical to normal ASCII for the first 128 characters, but uses byte values from the extended-ASCII range (0x80 and up) and a variable number of bytes to store everything else.
I’d think there must be some converter in the forum, but I can’t find one right away. I know I ain’t writing one, since I suspect it’s trickier than it looks.
Perhaps an example shows the problem. We’ll go with the tilde ñ.
In CP1252/Latin-1 (“extended ASCII”) that’s decimal 241; it’s one byte, as hex: F1 or as binary: 1111 0001
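In GSM7, ñ happens to be one of the few accented letters included in the 7-bit default alphabet, but most others (for example ó) are not, and a message containing any of those cannot be sent as GSM7.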
In UCS-2 which GSM can, optionally, use, everything is two bytes, so tilde-lowercase-n is just, in hex: 00F1
In UTF-8, it’s also two bytes (but other Unicode characters could be 3 or 4 bytes, while UCS-2 is always just two bytes). But it’s a more confusing C3B1 in hex when encoded as UTF-8. Since UTF-8 supports the entire Unicode range, the higher/extended ASCII codes are re-used in the encoding, so while ñ is part of extended ASCII, the extended ASCII range is “hijacked” and re-used for encoding the full set of Unicode into multiple bytes.
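To make the comparison concrete, here is a small Python sketch (not RouterOS; it only serves to show the byte values, and the helper name `hex_in` is made up). It uses Python's built-in codecs, with `utf-16-be` standing in for big-endian UCS-2, which is byte-identical for a character like ñ.

```python
def hex_in(ch: str, codec: str) -> str:
    """Return the bytes of ch in the given encoding as upper-case hex."""
    return ch.encode(codec).hex().upper()


if __name__ == "__main__":
    for codec in ("latin-1", "cp1252", "utf-16-be", "utf-8"):
        print(f"{codec:10s} {hex_in('ñ', codec)}")
    # latin-1    F1
    # cp1252     F1
    # utf-16-be  00F1
    # utf-8      C3B1
```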
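Wikipedia has a chart of the needed conversion logic:
Code point range       UTF-8 bytes (x = bits of the code point)
U+0000 to U+007F       0xxxxxxx
U+0080 to U+07FF       110xxxxx 10xxxxxx
U+0800 to U+FFFF       1110xxxx 10xxxxxx 10xxxxxx
U+10000 to U+10FFFF    11110xxx 10xxxxxx 10xxxxxx 10xxxxxx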
You’ll note your tilde case “ñ” in Latin-1 is 0x00F1, but since that is >= 0x0080, UTF-8 multi-byte encoding kicks in. Only the 128 plain ASCII characters (0x00 to 0x7F) are unchanged by UTF-8, so the single-byte Latin-1/etc codes above that are lost in UTF-8. GSM7, used in GSM PDU messages, is a 7-bit alphabet of its own, so using a character outside it, such as ó, similarly triggers a different encoding, just two-byte UCS-2 instead.
So I’d think this is possible in RouterOS script. But it needs different logic than the @rextended one. Since in UCS-2 the two bytes are the same as the Unicode code point, it’s just a matter of remapping them to the multiple bytes used by UTF-8. UTF-8 is what’s required for JSON (and display in SSH).
But the issue may be how you both identify the encoding and extract the UCS-2 text from an SMS PDU; that’s the first problem before you get to encoding as UTF-8 for use in HTTP stuff like Telegram etc.
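The remapping from two-byte UCS-2 values to UTF-8 bytes described above can be sketched as follows. This is a hand-written Python illustration of the bit math, not a RouterOS script and not anyone's posted code; the function names are made up, and the input is assumed to be big-endian UCS-2 with no surrogates.

```python
def codepoint_to_utf8(cp: int) -> bytes:
    """Encode one code point from the UCS-2 range (0..0xFFFF) as UTF-8."""
    if cp < 0x80:        # plain ASCII: one byte, unchanged
        return bytes([cp])
    if cp < 0x800:       # two bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:     # three bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    raise ValueError("code point outside the UCS-2 range")


def ucs2_to_utf8(data: bytes) -> bytes:
    """Convert big-endian UCS-2 bytes (two bytes per character) to UTF-8."""
    if len(data) % 2:
        raise ValueError("UCS-2 data must have an even number of bytes")
    out = bytearray()
    for i in range(0, len(data), 2):
        out += codepoint_to_utf8((data[i] << 8) | data[i + 1])
    return bytes(out)


if __name__ == "__main__":
    print(ucs2_to_utf8(b"\x00\xF1").hex().upper())  # C3B1   (ñ)
    print(ucs2_to_utf8(b"\x20\xAC").hex().upper())  # E282AC (€)
```

A lookup table avoids this arithmetic entirely, at the cost of only covering the characters that were put in the table.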
Logic:
UCS-2 has 65536 possible values (ignoring for now the invalid sequences), and every character is always 2 bytes.
UTF-8 does not have a fixed character length, and the Unicode code point is different from what is actually written inside the string.
For example, again the €uro sign:
€ is one character of CP1252 (Windows-1252) and of other code pages, but not all of them… (but we assume UCS-2 is in use, which certainly has that symbol)
€ (1252) = 0x80, its Unicode code point is U+20AC (0x20 0xAC), and it is actually written as 0xE2 0x82 0xAC in a UTF-8 string.
But… 0x20 0xAC is also the UCS-2 encoding for the €uro…
I have already built both tables for the characters in CP1252.
Assuming we always have a correct input value (input checks can be added later)
Someone can test whether this works as expected.
I have not tested it because I do not have time today…
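The table-based idea described above can be illustrated with a short Python sketch. This is not the RouterOS code from this thread; it just builds, with Python's `cp1252` codec, the kind of table that maps each upper CP1252 byte to its Unicode code point (which is also its UCS-2 value) and to its UTF-8 bytes. The name `build_cp1252_table` is made up for this example.

```python
def build_cp1252_table() -> dict:
    """Map each CP1252 byte 0x80..0xFF to (code point, UTF-8 bytes)."""
    table = {}
    for byte in range(0x80, 0x100):
        try:
            ch = bytes([byte]).decode("cp1252")
        except UnicodeDecodeError:
            # A few byte values (e.g. 0x81) are unassigned in CP1252.
            continue
        table[byte] = (ord(ch), ch.encode("utf-8"))
    return table


if __name__ == "__main__":
    table = build_cp1252_table()
    for byte in (0x80, 0xF1):
        cp, utf8 = table[byte]
        print(f"CP1252 0x{byte:02X} -> U+{cp:04X} -> UTF-8 {utf8.hex(' ').upper()}")
    # CP1252 0x80 -> U+20AC -> UTF-8 E2 82 AC
    # CP1252 0xF1 -> U+00F1 -> UTF-8 C3 B1
```

Anything that is not in CP1252 simply has no entry, which is the limitation of the lookup approach.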
That does work… as designed, it does not cover all of UCS-2, obviously.
I just hope @diamuxin isn’t Portuguese, Latvian, or anyone who needs cedillas, macrons, etc. - they’re not in the Latin-1 charset so they are not converted by this code. @rextended knows this, but the “lookup table method” is way easier than doing the bit-math needed to convert UCS-2 to UTF-8…
Since the OP wanted Telegram, that’s totally right. But technically that’s “urlencoded” UTF-8, which is what you need if you use HTTP with query parameters (which I think the Telegram examples use to avoid needing JSON…).
100% agree. But JSON could have UTF-8 Unicode chars inside, so the problem here still wouldn’t just go away with that (e.g. it’s not JUST some “json2array” that’s missing in scripting); there is a long list, which includes basic encoding/decoding of the RFC-ish/Unicode formats too, as shown here!
But if someone did want “raw UTF-8”, which is what would be needed inside some JSON… using “\” instead of “%” in the loop that builds the result would get you UTF-8 as a byte stream (you’d likely also need to return [:parse [return $constr]] to re-interpolate the escape sequences too).
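As a side note on the “urlencoded” form: percent-encoding every byte, as the output in this thread does, and percent-encoding only the bytes that need it both decode to the same text. A small Python sketch, using the standard `urllib.parse` module purely for illustration (the helper name `percent_encode_all` is made up):

```python
from urllib.parse import quote, unquote


def percent_encode_all(text: str) -> str:
    """Percent-encode every UTF-8 byte, including plain ASCII letters."""
    return "".join(f"%{byte:02X}" for byte in text.encode("utf-8"))


if __name__ == "__main__":
    text = "campaña y acción."
    every_byte = percent_encode_all(text)
    minimal = quote(text, safe="")  # leaves unreserved ASCII as-is
    print(every_byte)
    print(minimal)                  # campa%C3%B1a%20y%20acci%C3%B3n.
    # Both forms decode back to the same text.
    print(unquote(every_byte) == unquote(minimal) == text)  # True
```

Encoding every byte is wasteful but simpler to produce in a script, since no decision is needed per character.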
Hello, what a nice surprise to read this progress on the recoding of characters valid for Telegram, thanks to both of you for your interest.
No, Amm0 don’t worry, I’m working with the spanish language, I mainly wanted to make it compatible with “ñ” and acute accents like “á, é, í, ó, ú” (uppercase and lowercase of course), the euro currency symbol (€) is not important because I hardly receive SMS with it.
I want to start testing what @rextended has kindly suggested, but I have a question: calling the function $testUCS2toUTF8 directly on the SMS content does not convert anything. I suppose I first have to convert the normal text to UCS-2, right?
:global testUCS2toUTF8
:local content "campaña y acción."; # text extracted from test sms received
:put [$testUCS2toUTF8 $content]
Result on terminal screen: <empty>
You can’t just cut-and-paste the ñ into a string. You need to pull the raw SMS data from your modem via at-chat (or maybe /tool/sms supports raw bytes now, dunno); that’s what’s in UCS-2 format. If it’s your modem that’s giving you the XML, that’s likely UTF-8 inside, which would just need parsing to pull out the message part (another complex parsing assignment, however), then just urlencoding (e.g. the %20%CD%Ed format).
The message is extracted directly in XML from a Huawei USB modem using an API built into the device by that manufacturer.
Starting from this function (token parser library)
# TOKEN PARSER LIBRARY
# v1.0.0 pkt
# :put [($tokenParser->"getTag") source=$xml tag="SessionInfo"]
# ($tokenParser->"getBetween")
# get delimited value
# source - source string
# fromTok - (optional) text AFTER this token (or from source beginning) will be returned
# toTok - (optional) text BEFORE this token (or until source finish) will be returned
# startPos - (optional) start position (default 0 = beginning)
# returns an array with fields data and pos
#
# ($tokenParser->"getTag")
# get value for XML tag
# source - xml string
# tag - tag which value is to be returned
# startPos - (optional, text index) start position (if not specified, it will search from the beginning of the string)
# returns the tag content
#
# ($tokenParser->"getTagDetailed")
# get value and end position for XML tag
# source - xml string
# tag - tag which value is to be returned
# startPos - (optional, text index) start position (if not specified, it will search from the beginning of the string)
# returns an array with fields "data" (tag content) and "pos" (where tag ends)
#
# ($tokenParser->"getTagList")
# get a list with the content of each appearance of tag
# source - xml string
# tag - tag which value is to be returned
# returns an array with tag contents
#
# ($tokenParser->"forEachTag")
# incremental parser that calls the callback with the content of each appearance of tag
# source - xml string
# tag - tag which value is to be returned
# callback - callback (with param content) to be called for each appearance of tag
# callbackArgs - callback will be called passing this value in param args
:global tokenParser ({})
:set ($tokenParser->"getBetween") do={ # get delimited value
    # source - source string
    # fromTok - (optional) text AFTER this token (or from source beginning) will be returned
    # toTok - (optional) text BEFORE this token (or until source finish) will be returned
    # startPos - (optional) start position (default 0 = beginning)
    # returns an array with fields data and pos
    # if fromTok and/or toTok are specified and neither of them appear in source, empty string "" will be returned as data
    # based on function getBetween by CuriousKiwi, modified by pkt
    :local posStart
    :if ([:len $startPos] = 0) do={
        :set posStart -1
    } else={
        :set posStart ($startPos-1)
    }
    :local found true
    :local data
    :local resultStart
    :if ([:len $fromTok] > 0) do={
        :set resultStart [:find $source $fromTok $posStart]
        :if ([:len $resultStart] = 0) do={ # start token not found
            :set found false
            :set data ""
        } else={
            # skip over the start token itself
            :set resultStart ($resultStart + [:len $fromTok])
        }
    } else={
        :set resultStart 0
    }
    :local resultEnd
    :if ($found = true && [:len $toTok] > 0) do={
        :set resultEnd [:find $source $toTok ($resultStart-1)]
        :if ([:len $resultEnd] = 0) do={ # end token not found
            :set found false
            :set data ""
        }
    } else={
        :set resultEnd [:len $source]
    }
    :if ($found = true) do={ :set data [:pick $source $resultStart $resultEnd] }
    :return { data=$data; pos=$resultEnd }
}
:set ($tokenParser->"getTag") do={ # get value for XML tag
    # source - xml string
    # tag - tag which value is to be returned
    # startPos - (optional, text index) start position (if not specified, it will search from the beginning of the string)
    # returns the tag content
    :global tokenParser
    :return ([($tokenParser->"getBetween") source=$source fromTok=("<$tag>") toTok=("</$tag>") startPos=$startPos]->"data")
}
:set ($tokenParser->"getTagDetailed") do={ # get value and end position for XML tag
    # source - xml string
    # tag - tag which value is to be returned
    # startPos - (optional, text index) start position (if not specified, it will search from the beginning of the string)
    # returns an array with fields "data" (tag content) and "pos" (where tag ends)
    :global tokenParser
    :return [($tokenParser->"getBetween") source=$source fromTok=("<$tag>") toTok=("</$tag>") startPos=$startPos]
}
:set ($tokenParser->"getTagList") do={ # get a list with the content of each appearance of tag
    # source - xml string
    # tag - tag which value is to be returned
    # returns an array with tag contents
    :global tokenParser
    :local result ({})
    :local doneTags false
    :local startPos 0
    :do {
        :local tagContent [($tokenParser->"getTagDetailed") source=$source tag=$tag startPos=$startPos]
        :local content ($tagContent->"data")
        :if ($content != "") do={
            :set ($result->[:len $result]) $content
            # advance start pos to search for next tag
            :set startPos ($tagContent->"pos")
        } else={
            :set doneTags true
        }
    } while=($doneTags = false)
    :return $result
}
:set ($tokenParser->"forEachTag") do={ # incremental parser that calls the callback with the content of each appearance of tag
    # source - xml string
    # tag - tag which value is to be returned
    # callback - callback (with param content) to be called for each appearance of tag
    # callbackArgs - callback will be called passing this value in param args
    :global tokenParser
    :local doneTags false
    :local startPos 0
    :do {
        :local tagContent [($tokenParser->"getTagDetailed") source=$source tag=$tag startPos=$startPos]
        :local content ($tagContent->"data")
        :if ($content != "") do={
            [$callback tagContent=$content args=$callbackArgs]
            # advance start pos to search for next tag
            :set startPos ($tagContent->"pos")
        } else={
            :set doneTags true
        }
    } while=($doneTags = false)
}
Function to get a list of SMS messages
:global recvSMS do={
    :local lteIP "192.168.8.1"
    :global tokenParser
    # get SessionID and Token via LTE modem API
    :local urlSesTokInfo "http://$lteIP/api/webserver/SesTokInfo"
    :local api [/tool fetch $urlSesTokInfo output=user as-value]
    :local apiData ($api->"data")
    # parse SessionID and Token from API session data
    :local apiSessionID [($tokenParser->"getTag") source=$apiData tag="SesInfo"]
    :local apiToken [($tokenParser->"getTag") source=$apiData tag="TokInfo"]
    # header and data config
    :local apiHead "Content-Type:text/xml,Cookie: $apiSessionID,__RequestVerificationToken:$apiToken"
    :local recvData "<?xml version=\"1.0\" encoding=\"UTF-8\"?><request><PageIndex>1</PageIndex><ReadCount>20</ReadCount><BoxType>1</BoxType><SortType>0</SortType><Ascending>0</Ascending><UnreadPreferred>1</UnreadPreferred></request>"
    # recv SMS via LTE modem API with fetch
    :return [/tool fetch http-method=post http-header-field=$apiHead url="http://$lteIP/api/sms/sms-list" http-data=$recvData output=user as-value]
}
Okay, well that’s simpler than UCS-2 from SMS via AT. Your Huawei modem is doing you a favor here - most modems give you an SMS PDU that requires parsing before you even get to the UCS-2.
If you already have “campaña y acción” as text, you can use ASCIItoCP1252toURLencode directly.
If instead you have the SMS value read directly via AT commands and converted to a UCS-2 string,
“campaña y acción” = \00c\00a\00m\00p\00a\00\F1\00a\00\20\00y\00\20\00a\00c\00c\00i\00\F3\00n
(obviously I have already written the standard letters “c-a-m-p…” as plain characters)
ñ = \00\F1 ó = \00\F3
At this point you convert the UCS-2 string with testUCS2toUTF8 and pass the result to UTF8toURLencode to obtain the URL/GET/POST string for fetch.
results:
%68%65%6C%6C%6F%20%6D%79%20%66%72%69%65%6E%64%20%63%61%6D%69%C3%B3%6E%20%C3%91%C3%B1
The input string in the example is “hello my friend camión Ññ” converted to UCS-2.
Code points: ó = 00 F3, Ñ = 00 D1, ñ = 00 F1
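The whole chain (UCS-2 in, URL-encoded UTF-8 out) can be cross-checked on a PC with a few lines of Python. This is only a verification sketch using Python's built-in codecs, not the RouterOS functions named above; the function name `ucs2_hex_to_urlencoded` is made up, and the input is assumed to be big-endian UCS-2 written as a hex string, as it would typically appear in a PDU.

```python
def ucs2_hex_to_urlencoded(ucs2_hex: str) -> str:
    """Hex string of big-endian UCS-2 -> percent-encoded UTF-8 (every byte encoded)."""
    raw = bytes.fromhex(ucs2_hex)
    text = raw.decode("utf-16-be")  # same bytes as UCS-2 for these characters
    return "".join(f"%{byte:02X}" for byte in text.encode("utf-8"))


if __name__ == "__main__":
    # Build the UCS-2 hex for the example text instead of typing it by hand.
    sample_hex = "hello my friend camión Ññ".encode("utf-16-be").hex().upper()
    print(ucs2_hex_to_urlencoded(sample_hex))
    # Should match the "results:" line above.
```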
Thank you, it was a pleasure, and I also developed other useful functions along the way.
MikroTik does not decode UCS-2 SMS, but if one has the patience (next step?..) to extract the SMS PDU via AT commands
and pull the UCS-2 text message out of the PDU, it is possible to forward that message to e-mail, Twitter, etc.