diamuxin
Member
Topic Author
Posts: 337
Joined: Thu Sep 09, 2021 5:46 pm

Convert any text to UNICODE

Fri Feb 10, 2023 8:14 pm

Hello,

Is it possible to create in RouterOS a converter similar to this one?

https://r-1.ch/mikrotik-unicode-ssid-generator.php

I have found that text converted this way (to escaped UTF-8 bytes) is displayed correctly on the terminal screen and is also recognized correctly by Telegram, so it would be a great solution.

For example, we start from the text: hello my friend camión Ññ

In the web converter we have this result:
"\68\65\6C\6C\6F\20\6D\79\20\66\72\69\65\6E\64\20\63\61\6D\69\C3\B3\6E\20\C3\91\C3\B1"

If we test it in Telegram, the result is OK.

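{
# UTF-8 bytes of "hello my friend camión Ññ" written as RouterOS \XX escapes
:local text "\68\65\6C\6C\6F\20\6D\79\20\66\72\69\65\6E\64\20\63\61\6D\69\C3\B3\6E\20\C3\91\C3\B1"
:local MessageText $text
# load the Telegram sender script as a function and call it
:local SendTelegramMessage [:parse [/system script get MyTGBotSendMessage source]]
$SendTelegramMessage MessageText=$MessageText
# print the same string on the terminal
:put $text
}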

And on the terminal screen it is also OK.

thanks in advance.
 
Amm0
Forum Guru
Posts: 4240
Joined: Sun May 01, 2016 7:12 pm
Location: California

Re: Convert any text to UNICODE

Fri Feb 10, 2023 9:18 pm

Well, the issue is that INPUT to the CLI/winbox/etc. is ASCII only. And winbox's terminal will strip Unicode, so you won't see anything there; only via SSH would the UTF-8 output work.

But in a script you could store the UTF-8 escape sequences in variables and use string interpolation, which should also work for sending stuff via fetch, like so:
:global tilden "\C3\B1"
:put "espa$(tilden)ola"

:global garaisi "\C4\AB"
:put "Labr$(garaisi)t"
You'd obviously have to define the set of unicode char codes (in RouterOS's byte notation, "\xx\yy\zz") ahead of usage, but that might work in some cases.
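For anyone who wants to generate that byte notation off-router, here is a minimal sketch in Python (not RouterOS script). It assumes the web converter simply writes each UTF-8 byte of the text as a two-digit hex escape; the function name `to_routeros_escapes` is made up for illustration.

```python
def to_routeros_escapes(text: str) -> str:
    """Return the UTF-8 bytes of `text`, each written as a \\XX hex escape."""
    return "".join(f"\\{byte:02X}" for byte in text.encode("utf-8"))


# The sample sentence from the first post; paste the output into a RouterOS string.
print(to_routeros_escapes("hello my friend camión Ññ"))
```

The output can be compared with the string the web converter produced for the same sentence.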
 
Amm0
Forum Guru
Posts: 4240
Joined: Sun May 01, 2016 7:12 pm
Location: California

Re: Convert any text to UNICODE

Fri Feb 10, 2023 9:43 pm

I suppose you could also use a function in the approach above, so that each letter could still be output as normal ASCII. Syntax is trickier with a function however:
# global flag to output UTF-8
:global "use-unicode" 1

# Tilde over lowercase N
:global tildan do={
    :global "use-unicode"
    :if ($"use-unicode" = 1) do={
        :return "\C3\B1"
    } else={
        :return "n"
    }
}

# output as unicode
:set "use-unicode" 1
:put "espa$([$tildan])ola"

# output as ascii
:set "use-unicode" 0
:put "espa$([$tildan])ola"
 
diamuxin
Member
Topic Author
Posts: 337
Joined: Thu Sep 09, 2021 5:46 pm

Re: Convert any text to UNICODE

Sat Feb 11, 2023 1:33 am

Writing a predefined text is fine, but the goal is to extract any text from a received SMS and forward it via Telegram with its Unicode characters intact, so it can be read in Telegram. I have tried to see how this converter works [ https://r-1.ch/mikrotik-unicode-ssid-generator.php ], but my programming knowledge is very limited. :(

BR.
Last edited by diamuxin on Sat Feb 11, 2023 3:16 am, edited 1 time in total.
 
Amm0
Forum Guru
Posts: 4240
Joined: Sun May 01, 2016 7:12 pm
Location: California

Re: Convert any text to UNICODE

Sat Feb 11, 2023 2:09 am

Ah, that's slightly different. SMS uses UCS-2 encoding, not UTF-8. So it's really not the same as the "emoji" code, which takes UTF-8.

You're looking for a direct UCS-2/UTF-16 to UTF-8 conversion? That seems already covered by @rextended's code above.

UTF-16/UCS-2 uses two bytes to store the "popular" Unicode characters, the same format Windows (and SMS) use internally. UNIX (and JSON) etc. generally favor UTF-8, which is the same as normal ASCII for the first 128 characters but uses the byte values above 0x7F, and a variable number of bytes, to store the rest of Unicode.
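To see the difference in bytes, here is a small Python sketch (not RouterOS script). It assumes big-endian UCS-2, which for these characters is byte-for-byte the same as Python's "utf-16-be" codec; the helper name `hex_bytes` is only illustrative.

```python
def hex_bytes(text: str, encoding: str) -> str:
    """Encode `text` and return its bytes as space-separated uppercase hex."""
    return text.encode(encoding).hex(" ").upper()


# Same character, three encodings: one byte, two bytes, two (different) bytes.
for enc in ("latin-1", "utf-16-be", "utf-8"):
    print(f"{enc:10} {hex_bytes('ñ', enc)}")
```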
 
anav
Forum Guru
Posts: 21730
Joined: Sun Feb 18, 2018 11:28 pm
Location: Nova Scotia, Canada

Re: Convert any text to UNICODE

Sat Feb 11, 2023 3:29 am

WRONG solution for telegram, what MT ROS should do is parse JSON. The cheap hack of parsing text is a fragile approach.
 
Amm0
Forum Guru
Posts: 4240
Joined: Sun May 01, 2016 7:12 pm
Location: California

Re: Convert any text to UNICODE

Sat Feb 11, 2023 3:52 am

There is no JSON involved here, although reading/writing JSON has long been missing; that's a different issue.

OP is starting with a UCS-2 encoded SMS PDU; if only he were starting with JSON.
 
Amm0
Forum Guru
Posts: 4240
Joined: Sun May 01, 2016 7:12 pm
Location: California

Re: Convert any text to UNICODE

Sat Feb 11, 2023 3:57 am

I'd think there must be some converter in the forum, but I can't find one right away. I know I ain't writing one, since I suspect it's trickier than it looks.

Perhaps an example shows the problem. We'll go with the tilde ñ.
In CP1252/Latin-1 (often loosely called "extended ASCII") that's decimal 241; it's one byte, as hex: F1 or as binary: 1111 0001
In GSM7 the basic alphabet is small: it happens to include ñ, but not, for example, ó, so text with characters outside it can't be sent as GSM7.
In UCS-2, which GSM can optionally use, everything is two bytes, so tilde-lowercase-n is just, in hex: 00F1
In UTF-8, it's also two bytes (other Unicode characters can take 3 or 4 bytes, while UCS-2 is always just two bytes). But it's a more confusing C3B1 in hex when encoded as UTF-8. Since UTF-8 supports the entire Unicode range, the byte values above 0x7F are reused by the encoding: while ñ is part of extended ASCII, those byte values are "hijacked" as lead and continuation bytes for encoding the full set of Unicode into multiple bytes.

Wikipedia has a chart of the needed conversion logic:
https://en.wikipedia.org/wiki/UTF-8#Encoding
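The bit layout from that chart can be sketched in a few lines. This is Python rather than RouterOS script, the function name `utf8_from_code_point` is hypothetical, and it only handles what UCS-2 can carry (code points up to 0xFFFF, surrogates excluded).

```python
def utf8_from_code_point(cp: int) -> bytes:
    """Encode one UCS-2 code point (0x0000-0xFFFF) as UTF-8 using bit math."""
    if not 0 <= cp <= 0xFFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("only non-surrogate code points up to 0xFFFF are handled")
    if cp <= 0x7F:
        # 0xxxxxxx: plain ASCII, unchanged
        return bytes([cp])
    if cp <= 0x7FF:
        # 110xxxxx 10xxxxxx: top 5 bits, then low 6 bits
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    # 1110xxxx 10xxxxxx 10xxxxxx: top 4 bits, middle 6 bits, low 6 bits
    return bytes([0xE0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F)])


print(utf8_from_code_point(0x00F1).hex())  # ñ
print(utf8_from_code_point(0x20AC).hex())  # €
```

The two-byte branch is the one that matters for Spanish text: 0x00F1 splits into the bits 00011 and 110001, which become 0xC3 and 0xB1.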

You'll note your tilde case "ñ" in Latin-1 is 0xF1 (code point 0x00F1), and since that is >= 0x0080, multi-byte UTF-8 encoding kicks in. Only the lower 128 ASCII characters (0x00 to 0x7F) are unchanged by UTF-8, so the single-byte Latin-1/etc. values are lost in UTF-8. GSM7, used in GSM PDU messages, has its own limited alphabet, so a character outside it (like ó) similarly triggers a different encoding, just two-byte UCS-2 instead.

Columbia has a table of UCS-2 values that might also be helpful:
http://www.columbia.edu/kermit/ucs2.html

So I'd think this is possible in RouterOS script. But it needs different logic than the @rextended one. Since in UCS-2 the two bytes are the same as the Unicode code point, it's just a matter of remapping them to the multiple bytes used by UTF-8. UTF-8 is what's required for JSON (and for display in SSH).

But the issue may be how you even identify the encoding and extract the UCS-2 data from an SMS PDU; that's the first problem, before you get to encoding as UTF-8 for use in HTTP stuff like Telegram etc.
 
rextended
Forum Guru
Posts: 12522
Joined: Tue Feb 25, 2014 12:49 pm
Location: Italy

Re: Convert any text to UNICODE

Sat Feb 11, 2023 1:38 pm

Amm0 wrote:
So I'd think this is possible in RouterOS script however.
But different logic than the @rextended one.

Logic:
UCS-2 has 65536 possible values (ignoring invalid sequences for now), and each is always 2 bytes.
UTF-8 has no fixed character length, and the Unicode code point ("entry point") differs from the bytes actually written in the string.
For example, again the euro sign:
€ is one character in CP1252 (Windows-1252) and in some other code pages, but not all (here we assume UCS-2, which certainly has that symbol).
€ in CP1252 is 0x80; its Unicode code point is 0x20AC, and it is actually written as 0xE2 0x82 0xAC in a UTF-8 string.
But 0x20 0xAC is also the UCS-2 encoding of the euro sign.
I have already done both tables for the CP1252 characters.
This assumes the input value is always correct (input checks can be added later).

Can someone test whether this works as expected?
I have not tested it because I do not have time today.

Based on the already existing tables:
viewtopic.php?t=177551#p967513
In the future, a conversion function based on bit operations instead of tables can be done, when I have time.

..........
code removed, see
viewtopic.php?p=983695#p983695
..........

The string in the example is "hello my friend camión Ññ" converted to UCS-2.
Code points ("entry points"): ó = 00 F3, Ñ = 00 D1, ñ = 00 F1

Result is the string for telegram:
%68%65%6C%6C%6F%20%6D%79%20%66%72%69%65%6E%64%20%63%61%6D%69%C3%B3%6E%20%C3%91%C3%B1

Can be decoded here:
https://www.urldecoder.org/
Last edited by rextended on Sun Feb 12, 2023 7:10 am, edited 1 time in total.
 
Amm0
Forum Guru
Posts: 4240
Joined: Sun May 01, 2016 7:12 pm
Location: California

Re: Convert any text to UNICODE

Sat Feb 11, 2023 5:16 pm

That does work... as designed, it obviously does not cover all of UCS-2.

rextended wrote:
In the future, a conversion function based on bit operations instead of tables can be done, when I have time.

I just hope @diamuxin isn't Latvian or anyone else who needs macrons etc.: those are not in the Latin-1 charset, so they are not converted by this code (Portuguese cedillas are fine, since ç is in Latin-1). @rextended knows this, but the "lookup table method" is way easier than doing the bit-math needed to convert UCS-2 to UTF-8...

rextended wrote:
€ is one character of CP1252 (Windows 1252) and other, but not all...

While Windows CP1252 and ISO-8859-1 (more generally "Latin-1") are mostly the same, the euro sign € is an oddity: it's in CP1252 but is not a character in ISO-8859-1, e.g. https://en.wikipedia.org/wiki/Windows-1 ... age_layout vs https://en.wikipedia.org/wiki/ISO/IEC_8 ... age_layout.
In theory, the SMS's UCS-2 should have this encoded as 0x20AC, since that's its Unicode code point, but who knows, it could be 0x0080, since most OSes accept both for €.
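The euro oddity is easy to check with Python's built-in codecs (a quick sketch, not RouterOS script; `can_encode` is a made-up helper name). Note that Python's "iso-8859-1" codec maps byte 0x80 to the control character U+0080, not to €.

```python
def can_encode(text: str, encoding: str) -> bool:
    """Return True if every character of `text` exists in `encoding`."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


EURO = "\u20ac"  # the euro sign, code point 0x20AC

print(can_encode(EURO, "cp1252"))       # is € representable in CP1252?
print(can_encode(EURO, "iso-8859-1"))   # is € representable in ISO-8859-1?
print(EURO.encode("cp1252").hex())      # the single CP1252 byte for €
print(hex(ord(bytes([0x80]).decode("iso-8859-1"))))  # what 0x80 means in ISO-8859-1
```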

rextended wrote:
Result is the string for telegram:
%68%65%6C%6C%6F%20%6D%79%20%66%72%69%65%6E%64%20%63%61%6D%69%C3%B3%6E%20%C3%91%C3%B1

Since OP wanted Telegram, that's totally right. But technically that's "urlencoded" UTF-8, which is what you need if you use HTTP with query parameters (which I think the Telegram examples use to avoid needing JSON...).

anav wrote:
what MT ROS should do is parse JSON

100% agree. But JSON could have UTF-8 Unicode chars inside, so the problem here still wouldn't just go away with that (i.e. it's not JUST some "json2array" that's missing in scripting). They have a long list, which includes basic encoding/decoding of the RFC-ish/Unicode formats too, as shown here!

But if someone did want "raw UTF-8", which is what would be needed inside some JSON, using "\\" instead of "%" in the loop that builds the result would get you UTF-8 as a byte stream (you would likely also need to return [:parse [return $constr]] to re-interpolate the escape sequences).
 
diamuxin
Member
Topic Author
Posts: 337
Joined: Thu Sep 09, 2021 5:46 pm

Re: Convert any text to UNICODE

Sat Feb 11, 2023 6:13 pm

Hello, what a nice surprise to read about this progress on re-encoding characters for Telegram; thanks to both of you for your interest.

No, Amm0, don't worry: I'm working with the Spanish language. I mainly wanted to make it compatible with "ñ" and acute accents like "á, é, í, ó, ú" (uppercase and lowercase, of course). The euro currency symbol (€) is not important because I hardly ever receive SMS with it.

I want to start testing what @rextended has kindly suggested, but I have a question: using the function $testUCS2toUTF8 directly on the SMS content does not convert anything. I suppose I first have to convert the normal text to UCS-2, right?

1.- Received SMS

<?xml version="1.0" encoding="UTF-8"?>
<response>
	<Count>1</Count>
	<Messages>
		<Message>
			<Smstat>0</Smstat>
			<Index>40000</Index>
			<Phone>+34XXXXXXXXX</Phone>
			<Content>campaña y acción.</Content>
			<Date>2023-02-11 17:37:29</Date>
			<Sca></Sca>
			<SaveType>0</SaveType>
			<Priority>0</Priority>
			<SmsType>1</SmsType>
		</Message>
	</Messages>
</response>

Test script, from > system/script/run test1

:global testUCS2toUTF8
:local content "campaña y acción."; # text extracted from test sms received
:put [$testUCS2toUTF8 $content]
Result on terminal screen: <empty>

I'd like to help; how should I test it?
 
Amm0
Forum Guru
Posts: 4240
Joined: Sun May 01, 2016 7:12 pm
Location: California

Re: Convert any text to UNICODE

Sat Feb 11, 2023 6:53 pm

You can't just cut-and-paste the ñ into a string. You need to pull the raw SMS data from your modem via at-chat (or maybe /tool/sms supports raw bytes now, dunno); that's what is in UCS-2 format. If it's your modem that's giving you the XML, that's likely UTF-8 inside, which would just need parsing to pull out the message part (another complex parsing assignment, however) and then urlencoding (e.g. the %20%CD%ED format).

How are you getting that XML?
 
diamuxin
Member
Topic Author
Posts: 337
Joined: Thu Sep 09, 2021 5:46 pm

Re: Convert any text to UNICODE

Sat Feb 11, 2023 7:18 pm

Amm0 wrote:
How are you getting that XML?

The message is extracted directly as XML from a Huawei USB modem, using an API that the manufacturer built into the device.

Starting from this function (token parser library):

# TOKEN PARSER LIBRARY
# v1.0.0 pkt

# :put [($tokenParser->"getTag") source=$xml tag="SessionInfo"]

# ($tokenParser->"getBetween")
#  get delimited value
#    source - source string
#    fromTok - (optional) text AFTER this token (or from source beginning) will be returned
#    toTok - (optional) text BEFORE this token (or until source finish) will be returned
#    startPos - (optional) start position (default 0 = beginning)
#  returns an array with fields data and pos
# 
# ($tokenParser->"getTag")
#  get value for XML tag
#    source - xml string
#    tag - tag which value is to be returned
#    startPos - (optional, text index) start position (if not specified, it will search from the beginning of the string)
#  returns the tag content
# 
# ($tokenParser->"getTagDetailed")
#  get value and end position for XML tag
#    source - xml string
#    tag - tag which value is to be returned
#    startPos - (optional, text index) start position (if not specified, it will search from the beginning of the string)
#  returns an array with fields "data" (tag content) and "pos" (where tag ends)
# 
# ($tokenParser->"getTagList")
#  get a list with the content of each appearance of tag
#    source - xml string
#    tag - tag which value is to be returned
#  returns an array with tag contents
# 
# ($tokenParser->"forEachTag")
#  incremental parser that calls the callback with the content of each appearance of tag
#    source - xml string
#    tag - tag which value is to be returned
#    callback - callback (with param content) to be called for each appearance of tag
#    callbackArgs - callback will be called passing this value in param args


:global tokenParser ({})

:set ($tokenParser->"getBetween") do={ # get delimited value
  # source - source string
  # fromTok - (optional) text AFTER this token (or from source beginning) will be returned
  # toTok - (optional) text BEFORE this token (or until source finish) will be returned
  # startPos - (optional) start position (default 0 = beginning)
  
  # returns an array with fields data and pos
  # if fromTok and/or toTok are specified and neither of them appear in source, empty string "" will be returned as data

  # based on function getBetween by CuriousKiwi, modified by pkt

  :local posStart
  :if ([:len $startPos] = 0) do={
    :set posStart -1
  } else={
    :set posStart ($startPos-1)
  }

  :local found true
  :local data 

  :local resultStart
  :if ([:len $fromTok] > 0) do={
    :set resultStart [:find $source $fromTok $posStart]
    :if ([:len $resultStart] = 0) do={ # start token not found
      :set found false
      :set data ""
    }
    :set resultStart ($resultStart + [:len $fromTok])
  } else={
    :set resultStart 0
  }

  :local resultEnd
  :if (($found = true) && ([:len $toTok] > 0)) do={
    :set resultEnd [:find $source $toTok ($resultStart-1)]
    :if ([:len $resultEnd] = 0) do={ # end token not found
      :set found false
      :set data ""
    }
  } else={
    :set resultEnd [:len $source]
  }

  :if ($found = true) do={ :set data [:pick $source $resultStart $resultEnd] }

  :return { data=$data; pos=$resultEnd }
}

:set ($tokenParser->"getTag") do={ # get value for XML tag
  # source - xml string
  # tag - tag which value is to be returned
  # startPos - (optional, text index) start position (if not specified, it will search from the beginning of the string)

  # returns the tag content

  :global tokenParser
  :return ([($tokenParser->"getBetween") source=$source fromTok=("<$tag>") toTok=("</$tag>") startPos=$startPos]->"data")
}

:set ($tokenParser->"getTagDetailed") do={ # get value and end position for XML tag
  # source - xml string
  # tag - tag which value is to be returned
  # startPos - (optional, text index) start position (if not specified, it will search from the beginning of the string)

  # returns an array with fields "data" (tag content) and "pos" (where tag ends)

  :global tokenParser
  :return [($tokenParser->"getBetween") source=$source fromTok=("<$tag>") toTok=("</$tag>") startPos=$startPos]
}

:set ($tokenParser->"getTagList") do={ # get a list with the content of each appearance of tag
  # source - xml string
  # tag - tag which value is to be returned

  # returns an array with tag contents

  :global tokenParser

  :local result ({})
  :local doneTags false
  :local startPos 0

  :do {
    :local tagContent [($tokenParser->"getTagDetailed") source=$source tag=$tag startPos=$startPos]

    :local content ($tagContent->"data")
    :if ($content != "") do={
      :set ($result->[:len $result]) $content

      # advance start pos to search for next tag
      :set startPos ($tagContent->"pos")
    } else={
      :set doneTags true
    }
  } while=($doneTags = false)

  :return $result
}

:set ($tokenParser->"forEachTag") do={ # incremental parser that calls the callback with the content of each appearance of tag
  # source - xml string
  # tag - tag which value is to be returned
  # callback - callback (with param content) to be called for each appearance of tag
  # callbackArgs - callback will be called passing this value in param args

  :global tokenParser

  :local doneTags false
  :local startPos 0

  :do {
    :local tagContent [($tokenParser->"getTagDetailed") source=$source tag=$tag startPos=$startPos]

    :local content ($tagContent->"data")
    :if ($content != "") do={
      [$callback tagContent=$content args=$callbackArgs]

      # advance start pos to search for next tag
      :set startPos ($tagContent->"pos")
    } else={
      :set doneTags true
    }
  } while=($doneTags = false)
}



Function to get a list of SMS messages:

:global recvSMS do={
  :local lteIP "192.168.8.1"

  :global tokenParser

  # get SessionID and Token via LTE modem API
  :local urlSesTokInfo "http://$lteIP/api/webserver/SesTokInfo"
  :local api [/tool fetch $urlSesTokInfo output=user as-value]
  :local apiData  ($api->"data")

  # parse SessionID and Token from API session data 
  :local apiSessionID [($tokenParser->"getTag") source=$apiData tag="SesInfo"]
  :local apiToken [($tokenParser->"getTag") source=$apiData tag="TokInfo"]

  # header and data config
  :local apiHead "Content-Type:text/xml,Cookie: $apiSessionID,__RequestVerificationToken:$apiToken"
  :local recvData "<?xml version=\"1.0\" encoding=\"UTF-8\"?><request><PageIndex>1</PageIndex><ReadCount>20</ReadCount><BoxType>1</BoxType><SortType>0</SortType><Ascending>0</Ascending><UnreadPreferred>1</UnreadPreferred></request>"

  # recv SMS via LTE modem API with fetch
  :return [/tool fetch http-method=post http-header-field=$apiHead url="http://$lteIP/api/sms/sms-list" http-data=$recvData output=user as-value]
}

Script to extract the content of messages:

:global tokenParser
:global recvSMS
:local xmlSmsList ([$recvSMS]->"data")
:local smsList [($tokenParser->"getTagList") source=$xmlSmsList tag="Message"]
:local smsCount [:tonum [($tokenParser->"getTag") source=$xmlSmsList tag="Count"]]

:if ($smsCount > 0) do={

:foreach tagContent in=$smsList do={

  :local index [($tokenParser->"getTag") source=$tagContent tag="Index"]
  :local date [($tokenParser->"getTag") source=$tagContent tag="Date"]
  :local phone [($tokenParser->"getTag") source=$tagContent tag="Phone"]
  :local content [($tokenParser->"getTag") source=$tagContent tag="Content"]
  :local read ([($tokenParser->"getTag") source=$tagContent tag="Smstat"] = 1)

  :if ($content != "") do={
    :put "$index $read $date $phone $content"    
    /tool e-mail send to=user@mail.com subject="SMS $phone" body="$index $read $date $phone $content" 

   # Telegram Start
   :local MessageText "SMS $phone $content"
   :local SendTelegramMessage [:parse [/system script get MyTGBotSendMessage source]]
   $SendTelegramMessage MessageText=$MessageText
   # Telegram End
  }
}

}

Content of the Telegram module "MyTGBotSendMessage":

:local BotToken "XXXXXXXXXX:XXXXXXXXXX-XXXXXXXXXXXXXXXXXXXXX";
:local ChatID "XXXXXXXXX";
:local parseMode "HTML";
:local SendText $MessageText;

/tool fetch url="https://api.telegram.org/bot$BotToken/sendMessage\?chat_id=$ChatID&parse_mode=$parseMode&text=$SendText" keep-result=no;

That is the process, I hope I have explained myself well.


BR.
 
Amm0
Forum Guru
Posts: 4240
Joined: Sun May 01, 2016 7:12 pm
Location: California

Re: Convert any text to UNICODE

Sat Feb 11, 2023 7:38 pm

Okay, well that's simpler than UCS-2 from SMS via AT. Your Huawei modem is doing you a favor here: most modems give you an SMS PDU that requires parsing before you even get to the UCS-2.

Your ":local content" variable should already have UTF-8 in it (per the XML declaration <?xml version="1.0" encoding="UTF-8"?>). So you just need to use a different @rextended function on $content (e.g. [$fURLEncode $content]) before passing it along to Telegram:
viewtopic.php?p=670983&hilit=urlencode#p885685

That gets the raw UTF-8 bytes into a urlencoded string (i.e. UTF-8 that's %-encoded for use in the HTTP query string).

RouterOS only lets you handle the bytes involved in Unicode; it really doesn't have Unicode support for display/input in CLI/winbox/SSH/etc.
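As a reference for what the router-side script has to reproduce, here is the same two-step idea (pull out the Content tag, then percent-encode its UTF-8 bytes) as a Python sketch. The XML is a trimmed copy of the sample message from this thread, and the function names are made up for illustration.

```python
import xml.etree.ElementTree as ET
from urllib.parse import quote

# Trimmed copy of the modem response shown earlier in the thread.
SAMPLE_XML = """<?xml version="1.0" encoding="UTF-8"?>
<response>
  <Count>1</Count>
  <Messages>
    <Message>
      <Phone>+34XXXXXXXXX</Phone>
      <Content>campaña y acción.</Content>
    </Message>
  </Messages>
</response>"""


def sms_contents(xml_bytes: bytes) -> list:
    """Return the text of every <Content> element in the modem response."""
    root = ET.fromstring(xml_bytes)
    return [element.text or "" for element in root.iter("Content")]


def percent_encode(text: str) -> str:
    """Percent-encode the UTF-8 bytes of `text` for use in a query string."""
    return quote(text, safe="")


for content in sms_contents(SAMPLE_XML.encode("utf-8")):
    print(percent_encode(content))
```

Unreserved characters (letters, digits and a few symbols such as the dot) are left as they are; every other byte becomes %XX.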
 
diamuxin
Member
Topic Author
Posts: 337
Joined: Thu Sep 09, 2021 5:46 pm

Re: Convert any text to UNICODE

Sat Feb 11, 2023 8:20 pm

In that case, I tried with the $fURLEncode function but it doesn't work either.

# ------------------- fURLEncode ----------------------
#
:global fURLEncode do={
    :local Chars {" "="%20";"!"="%21";"#"="%23";"%"="%25";"&"="%26";"'"="%27";"("="%28";")"="%29";"*"="%2A";"+"="%2B";","="%2C";"/"="%2F";":"="%3A";";"="%3B";"<"="%3C";"="="%3D";">"="%3E";"@"="%40";"["="%5B";"]"="%5D";"^"="%5E";"`"="%60";"{"="%7B";"|"="%7C";"}"="%7D"}
    :set ($Chars->"\07") "%07"
    :set ($Chars->"\0A") "%0A"
    :set ($Chars->"\0D") "%0D"
    :set ($Chars->"\22") "%22"
    :set ($Chars->"\24") "%24"
    :set ($Chars->"\3F") "%3F"
    :set ($Chars->"\5C") "%5C"
    :local URLEncodeStr
    :local Char
    :local EncChar
    :for i from=0 to=([:len $1]-1) do={
        :set Char [:pick $1 $i]
        :set EncChar ($Chars->$Char)
        :if (any $EncChar) do={
            :set URLEncodeStr "$URLEncodeStr$EncChar"
        } else={
            :set URLEncodeStr "$URLEncodeStr$Char"
        }
    }
    :return $URLEncodeStr
}

I have modified the array to include two special characters "ñ" and "ó" but it does not work.

:local Chars {" "="%20";"!"="%21";"#"="%23";"%"="%25";"&"="%26";"'"="%27";"("="%28";")"="%29";"*"="%2A";"+"="%2B";","="%2C";"/"="%2F";":"="%3A";";"="%3B";"<"="%3C";"="="%3D";">"="%3E";"@"="%40";"["="%5B";"]"="%5D";"^"="%5E";"`"="%60";"{"="%7B";"|"="%7C";"}"="%7D";"ñ"="%C3%21";"ó"="%C3%23"}

Result:
campa%C3%21a%20y%20acci%C3%23n.
status: failed

I surrender :(

Thank you in any case.

..
 
rextended
Forum Guru
Posts: 12522
Joined: Tue Feb 25, 2014 12:49 pm
Location: Italy

Re: Convert any text to UNICODE

Sun Feb 12, 2023 4:01 am

C3 21, C3 23???
This is the correct urlencoded string:
campa%C3%B1a%20y%20acci%C3%B3n
ñ = \C3\B1 and ó = \C3\B3
%21 = ! and %23 = #, so %C3%21 and %C3%23 are not valid UTF-8 sequences.
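A quick way to check a urlencoded string before sending it is to decode the %XX bytes and see whether they form valid UTF-8. A small Python sketch (the helper name `is_valid_utf8` is hypothetical):

```python
from urllib.parse import unquote_to_bytes


def is_valid_utf8(percent_encoded: str) -> bool:
    """Undo the %XX encoding and report whether the bytes are valid UTF-8."""
    raw = unquote_to_bytes(percent_encoded)
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


print(is_valid_utf8("campa%C3%21a%20y%20acci%C3%23n."))  # the modified table's output
print(is_valid_utf8("campa%C3%B1a%20y%20acci%C3%B3n"))   # the corrected string
```

The first string fails because 0xC3 announces a two-byte character, and the byte after it must then be in the range 0x80 to 0xBF; 0x21 and 0x23 are not.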

I don't remember the 2021 version (now deleted), but I already wrote a URLencode for UTF-8 some days ago...
viewtopic.php?t=177551#p980163

If you have "campaña y acción" you can directly use ASCIItoCP1252toURLencode.

If you have the SMS value read directly via AT commands and converted to a UCS-2 string,
"campaña y acción" = \00c\00a\00m\00p\00a\00\F1\00a\00\20\00y\00\20\00a\00c\00c\00i\00\F3\00n
(obviously I have already converted the standard letters to "c-a-m-p...")
ñ = \00\F1 ó = \00\F3
At this point you convert the UCS-2 string with testUCS2toUTF8 and pass the result to UTF8toURLencode to obtain the URL/GET/POST string for fetch.
Last edited by rextended on Sun Feb 12, 2023 4:27 am, edited 1 time in total.
 
rextended
Forum Guru
Posts: 12522
Joined: Tue Feb 25, 2014 12:49 pm
Location: Italy

Re: Convert any text to UNICODE

Sun Feb 12, 2023 4:14 am

Amm0 wrote:
But if someone did want "raw UTF-8", which is what would be needed inside some JSON, using "\\" instead of "%" in the loop that builds the result would get you UTF-8 as a byte stream (you would likely also need to return [:parse [return $constr]] to re-interpolate the escape sequences).

I have already written that function: simply remove the % in the function and pass the result to:
hexstr2chrstr
viewtopic.php?p=871742#p871742

Or directly convert "on the fly" the character with hex2chr
viewtopic.php?p=871741#p871741

Or alter the table to give directly the characters instead of hex values...
from
:local CP1252toUTF8 {"00";"01";"02";.....................;"C3BD";"C3BE";"C3BF"}
to
:local CP1252toUTF8 {"\00";"\01";"\02";.....................;"\C3\BD";"\C3\BE";"\C3\BF"}

and from
        :local utf ($CP1252toUTF8->[:find $CP1252testEP [:pick $string $pos ($pos + 2)] -1])
        :local sym ""
        :if ([:len $utf] = 2) do={:set sym "%$[:pick $utf 0 2]" }
        :if ([:len $utf] = 4) do={:set sym "%$[:pick $utf 0 2]%$[:pick $utf 2 4]" }
        :if ([:len $utf] = 6) do={:set sym "%$[:pick $utf 0 2]%$[:pick $utf 2 4]%$[:pick $utf 4 6]" }
        :set constr "$constr$sym"
to
        :local utf ($CP1252toUTF8->[:find $CP1252testEP [:pick $string $pos ($pos + 2)] -1])
        :set constr "$constr$utf"
 
rextended
Forum Guru
Posts: 12522
Joined: Tue Feb 25, 2014 12:49 pm
Location: Italy

Re: Convert any text to UNICODE

Sun Feb 12, 2023 5:42 am

searchtag # rextended ucs2utf8

I have completed the function :) :) :)

Without using tables, it converts all UCS-2 characters (2-byte Unicode code points) to UTF-8:
:global UCS2toUTF8 do={
    :local numbyte2hex do={
        :local input [:tonum $1]
        :local hexchars "0123456789ABCDEF"
        :local convert [:pick $hexchars (($input >> 4) & 0xF)]
        :set convert ($convert.[:pick $hexchars ($input & 0xF)])
        :return $convert
    }

    :local charsString ""
    :for x from=0 to=15 step=1 do={ :for y from=0 to=15 step=1 do={
        :local tmpHex "$[:pick "0123456789ABCDEF" $x ($x+1)]$[:pick "0123456789ABCDEF" $y ($y+1)]"
        :set $charsString "$charsString$[[:parse "(\"\\$tmpHex\")"]]"
    } }

    :local chr2int do={:if (($1="") or ([:len $1] > 1) or ([:typeof $1] = "nothing")) do={:return -1}; :return [:find $2 $1 -1]}

    :local string $1
    :if (([:typeof $string] != "str") or ($string = "")) do={ :return "" }
    :local output ""

    :local lenstr [:len $string]
    :for pos from=0 to=($lenstr - 1) step=2 do={
       :local input (([$chr2int [:pick $string  $pos      ($pos + 1)] $charsString] * 0x100) + \
                     ([$chr2int [:pick $string ($pos + 1) ($pos + 2)] $charsString]        ))
        :local results [:toarray ""]
        :local utf   ""
        :if ($input > 0x7F) do={
            :if ($input > 0x7FF) do={
                :if ($input > 0xFFFF) do={
                    :if ($input > 0x10FFFF) do={
                        :error "UTF-8 do not have code point > of 0x10FFFF"
                    } else={
                        :error "UCS-2 do not have code point > of 0xFFFF"
# the following commented lines are not used on UCS-2
# but I have already prepared my script for future changes to work with all UNICODE code points from 0x000000 to 0x10FFFF as well...
#                        :set ($results->0) (0xF0 + ( $input >> 18        ))
#                        :set ($results->1) (0x80 + (($input >> 12) & 0x3F))
#                        :set ($results->2) (0x80 + (($input >>  6) & 0x3F))
#                        :set ($results->3) (0x80 + ( $input        & 0x3F))
                    }
                } else={
                    :set ($results->0) (0xE0 + ( $input >> 12        ))
                    :set ($results->1) (0x80 + (($input >>  6) & 0x3F))
                    :set ($results->2) (0x80 + ( $input        & 0x3F))
                }
            } else={
                :set ($results->0) (0xC0 + ($input >>    6))
                :set ($results->1) (0x80 + ($input  & 0x3F))
            }
        } else={
            :set ($results->0) $input
        }
        :foreach item in=$results do={
            :set utf "$utf%$[$numbyte2hex $item]"
        }
        :set output "$output$utf"
    }
    :return $output
}

example code

{
:local ucsreadedfromsms "\00h\00e\00l\00l\00o\00\20\00m\00y\00\20\00f\00r\00i\00e\00n\00d\00\20\00c\00a\00m\00i\00\F3\00n\00\20\00\D1\00\F1"
:put [$UCS2toUTF8 $ucsreadedfromsms]
}

results:
%68%65%6C%6C%6F%20%6D%79%20%66%72%69%65%6E%64%20%63%61%6D%69%C3%B3%6E%20%C3%91%C3%B1
The string in the example is "hello my friend camión Ññ" converted to UCS-2.
Code points ("entry points"): ó = 00 F3, Ñ = 00 D1, ñ = 00 F1

To test the result:
https://www.urldecoder.org/
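The result can also be cross-checked off-router. The Python sketch below is not a port of the RouterOS function above: it leans on the built-in "utf-16-be" codec (assumed equivalent to big-endian UCS-2 for these characters) and then writes every UTF-8 byte as %XX, which is the output format shown above. The function name is made up.

```python
def ucs2_to_percent_utf8(ucs2: bytes) -> str:
    """Decode big-endian UCS-2 bytes and write each UTF-8 byte as %XX."""
    text = ucs2.decode("utf-16-be")
    return "".join(f"%{byte:02X}" for byte in text.encode("utf-8"))


# Build the UCS-2 input from the sample sentence, then convert it back.
sample = "hello my friend camión Ññ".encode("utf-16-be")
print(ucs2_to_percent_utf8(sample))
```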


EDIT: Reformatted, fixed for non CP1252 characters.
Last edited by rextended on Sat Jul 15, 2023 2:51 am, edited 14 times in total.
 
diamuxin
Member
Topic Author
Posts: 337
Joined: Thu Sep 09, 2021 5:46 pm

Re: Convert any text to UNICODE

Sun Feb 12, 2023 3:46 pm

Considering that the SMS message is extracted from my modem in UTF-8 format (as I already mentioned in viewtopic.php?t=193491#p983556):
<?xml version="1.0" encoding="UTF-8"?>
<response>
	<Count>1</Count>
	<Messages>
		<Message>
			<Smstat>0</Smstat>
			<Index>40000</Index>
			<Phone>+34XXXXXXXXX</Phone>
			<Content>Google España G-126663 es tu código de verificación.</Content>
			<Date>2023-02-12 13:09:30</Date>
			<Sca></Sca>
			<SaveType>0</SaveType>
			<Priority>0</Priority>
			<SmsType>1</SmsType>
		</Message>
	</Messages>
</response>

I have tried the function $UTF8toURLencode

:global UTF8toURLencode do={
    :local ascii "\00\01\02\03\04\05\06\07\08\09\0A\0B\0C\0D\0E\0F\
                  \10\11\12\13\14\15\16\17\18\19\1A\1B\1C\1D\1E\1F\
                  \20\21\22\23\24\25\26\27\28\29\2A\2B\2C\2D\2E\2F\
                  \30\31\32\33\34\35\36\37\38\39\3A\3B\3C\3D\3E\3F\
                  \40\41\42\43\44\45\46\47\48\49\4A\4B\4C\4D\4E\4F\
                  \50\51\52\53\54\55\56\57\58\59\5A\5B\5C\5D\5E\5F\
                  \60\61\62\63\64\65\66\67\68\69\6A\6B\6C\6D\6E\6F\
                  \70\71\72\73\74\75\76\77\78\79\7A\7B\7C\7D\7E\7F\
                  \80\81\82\83\84\85\86\87\88\89\8A\8B\8C\8D\8E\8F\
                  \90\91\92\93\94\95\96\97\98\99\9A\9B\9C\9D\9E\9F\
                  \A0\A1\A2\A3\A4\A5\A6\A7\A8\A9\AA\AB\AC\AD\AE\AF\
                  \B0\B1\B2\B3\B4\B5\B6\B7\B8\B9\BA\BB\BC\BD\BE\BF\
                  \C0\C1\C2\C3\C4\C5\C6\C7\C8\C9\CA\CB\CC\CD\CE\CF\
                  \D0\D1\D2\D3\D4\D5\D6\D7\D8\D9\DA\DB\DC\DD\DE\DF\
                  \E0\E1\E2\E3\E4\E5\E6\E7\E8\E9\EA\EB\EC\ED\EE\EF\
                  \F0\F1\F2\F3\F4\F5\F6\F7\F8\F9\FA\FB\FC\FD\FE\FF"
    :local UTF8toURLe {"00";"01";"02";"03";"04";"05";"06";"07";"08";"09";"0A";"0B";"0C";"0D";"0E";"0F";
                       "10";"11";"12";"13";"14";"15";"16";"17";"18";"19";"1A";"1B";"1C";"1D";"1E";"1F";
                       "+";"21";"22";"23";"24";"25";"26";"27";"28";"29";"2A";"2B";"2C";"-";".";"2F";
                       "0";"1";"2";"3";"4";"5";"6";"7";"8";"9";"3A";"3B";"3C";"3D";"3E";"3F";
                       "40";"A";"B";"C";"D";"E";"F";"G";"H";"I";"J";"K";"L";"M";"N";"O";
                       "P";"Q";"R";"S";"T";"U";"V";"W";"X";"Y";"Z";"5B";"5C";"5D";"5E";"_";
                       "60";"a";"b";"c";"d";"e";"f";"g";"h";"i";"j";"k";"l";"m";"n";"o";
                       "p";"q";"r";"s";"t";"u";"v";"w";"x";"y";"z";"7B";"7C";"7D";"~";"7F";
                       "80";"81";"82";"83";"84";"85";"86";"87";"88";"89";"8A";"8B";"8C";"8D";"8E";"8F";
                       "90";"91";"92";"93";"94";"95";"96";"97";"98";"99";"9A";"9B";"9C";"9D";"9E";"9F";
                       "A0";"A1";"A2";"A3";"A4";"A5";"A6";"A7";"A8";"A9";"AA";"AB";"AC";"AD";"AE";"AF";
                       "B0";"B1";"B2";"B3";"B4";"B5";"B6";"B7";"B8";"B9";"BA";"BB";"BC";"BD";"BE";"BF";
                       "C0";"C1";"C2";"C3";"C4";"C5";"C6";"C7";"C8";"C9";"CA";"CB";"CC";"CD";"CE";"CF";
                       "D0";"D1";"D2";"D3";"D4";"D5";"D6";"D7";"D8";"D9";"DA";"DB";"DC";"DD";"DE";"DF";
                       "E0";"E1";"E2";"E3";"E4";"E5";"E6";"E7";"E8";"E9";"EA";"EB";"EC";"ED";"EE";"EF";
                       "F0";"F1";"F2";"F3";"F4";"F5";"F6";"F7";"F8";"F9";"FA";"FB";"FC";"FD";"FE";"FF"
                      }
    :local string $1
    :if (([:typeof $string] != "str") or ($string = "")) do={ :return "" }
    :local lenstr [:len $string]
    :local constr ""
    :for pos from=0 to=($lenstr - 1) do={
        :local urle ($UTF8toURLe->[:find $ascii [:pick $string $pos ($pos + 1)] -1])
        :local sym $urle
        :if ([:len $urle] = 2) do={:set sym "%$[:pick $urle 0 2]" }
        :set constr "$constr$sym"
    }
    :return $constr
}

And now I get what I need:


Result on screen:
:put "$index $read $phoneTG $date2dmy $contentTG"
40000 false %2B34XXXXXXXXX 12/02/2023 13:09:30 Google+Espa%C3%B1a+G-126663+es+tu+c%C3%B3digo+de+verificaci%C3%B3n.
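
For anyone who wants to cross-check this output off the router, here is a minimal sketch in Python (not RouterOS script) of the same rule the table in $UTF8toURLencode implements: letters, digits and "-", ".", "_", "~" pass through, a space becomes "+", and every other byte becomes %XX. The name utf8_to_urlencode is only illustrative.

```python
# Characters the RouterOS table leaves unescaped: A-Z, a-z, 0-9, "-", ".", "_", "~".
UNRESERVED = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789-._~"
)

def utf8_to_urlencode(text: str) -> str:
    # Work on the UTF-8 bytes, exactly like the RouterOS function does.
    parts = []
    for byte in text.encode("utf-8"):
        char = chr(byte)
        if char == " ":
            parts.append("+")
        elif char in UNRESERVED:
            parts.append(char)
        else:
            parts.append("%{:02X}".format(byte))
    return "".join(parts)

print(utf8_to_urlencode("+34XXXXXXXXX"))
print(utf8_to_urlencode("Google Espa\u00f1a G-126663 es tu c\u00f3digo de verificaci\u00f3n."))
```

The two printed lines should match the phone and content fields shown above.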

Result in Telegram:
Image


Thank you very, very much for your patience.

BR.
Last edited by diamuxin on Sun Feb 12, 2023 4:00 pm, edited 1 time in total.
 
rextended

Re: Convert any text to UNICODE

Sun Feb 12, 2023 3:53 pm

Thank you very, very much for your patience.
Thank you, it was a pleasure, and it also let me develop other useful functions.

MikroTik does not decode UCS-2 SMS, but if one has the patience (next step?... :roll: ) to extract the SMS PDU with AT commands,
and to extract the UCS-2 text message from the PDU, it is possible to forward that message to e-mail, Twitter, etc.
 
Amm0

Re: Convert any text to UNICODE

Sun Feb 12, 2023 6:43 pm

Very impressive. I'm glad @diamuxin didn't give up, who's gotten pretty far in what is a hard problem...
I surrender :(

I use some of your urlencode/hexstring functions all the time – I guess I never ran into the lack of "extended ASCII"/unicode in them ;)
There wasn't a UCS2 decoder in the forums AFAIK... which is kind of a problem, since Telit, Sierra, etc. modems don't have XML (or JSON) options – just the GSM-spec AT SMS commands, which expect UCS2 for unicode.
 
rextended

Re: Convert any text to UNICODE

Mon Feb 13, 2023 12:27 am

which expect UCS2 for unicode.
I have already written the CP1252 to UCS-2 function...
viewtopic.php?t=177551#p967513
ASCIItoCP1252toUNICODE already converts input text in CP1252 to 2-byte UNICODE = UCS-2.
Just remove the "0x" from the output inside the function, and you obtain the hex UCS-2 string for the text part of the SMS.
 
rextended

Re: Convert any text to UNICODE

Mon Feb 13, 2023 1:45 am

I want to write a UTF-8 to UCS-2 converter, and I am saving here in advance the algorithm I want to implement.

Some notes for UTF-8 to UCS-2...

A UTF-8 character can be 1, 2, 3 or 4 bytes.
A UCS-2 character is always 2 bytes.

Read the first byte of the UTF-8 string.

If it is < 0x80, simply add one 0x00 byte followed by the same byte, and continue with the next byte of the string.

If it is > 0x7F and < 0xC2, it is an error: add the replacement character 0xFF 0xFD to the output, and continue with the next byte of the string.

If it is > 0xEF, it is an error for UCS-2 (it is either the start of a 4-byte sequence, which encodes a code point above 0xFFFF, or an invalid byte): add the replacement character 0xFF 0xFD to the output, and continue with the next byte of the string.

If it is between 0xC2 and 0xDF, read the next byte as well.
If the next byte is outside 0x80..0xBF, it is an error: add the replacement character 0xFF 0xFD to the output, and continue from the byte just read.
Otherwise, subtract 0xC0 from the first byte and 0x80 from the second, multiply the first result by 0x40 and add the second, then write the result as a 2-byte value.
Examples:
1) UTF-8 C2 A3 (£) = ((0xC2 - 0xC0) * 0x40) + (0xA3 - 0x80) = (0x2 * 0x40) + 0x23 = 0x80 + 0x23 = 0xA3 = 0x00 0xA3
2) UTF-8 C5 A8 (Ũ) = ((0xC5 - 0xC0) * 0x40) + (0xA8 - 0x80) = (0x5 * 0x40) + 0x28 = 0x140 + 0x28 = 0x168 = 0x01 0x68
If OK, continue after the extra byte read.

If it is between 0xE0 and 0xEF, read the next TWO bytes as well.
If the first of them is outside 0x80..0xBF, it is an error: add the replacement character 0xFF 0xFD to the output, and continue from the first byte just read.
If the second of them is outside 0x80..0xBF, it is an error: add the replacement character 0xFF 0xFD to the output,
and continue from the second byte just read, SKIPPING the first one.
Otherwise, subtract 0xE0 from the first byte and 0x80 from the second and third, multiply the first result by 0x1000 and the second by 0x40, add both to the third, then write the result as a 2-byte value.
Example:
UTF-8 E2 82 AC (€) = ((0xE2 - 0xE0) * 0x1000) + ((0x82 - 0x80) * 0x40) + (0xAC - 0x80) = (0x2 * 0x1000) + (0x2 * 0x40) + 0x2C = 0x2000 + 0x80 + 0x2C = 0x20AC = 0x20 0xAC
If OK, continue after the TWO extra bytes read.
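
The notes above can be sketched in Python (not RouterOS script) to check the arithmetic and the resynchronisation rules off the router. This is only an illustration of the steps as written in these notes: the names utf8_to_ucs2 and is_continuation are made up here, and the extra checks for lead bytes 0xE0 and 0xED are not part of the notes, so they are left out.

```python
REPLACEMENT = b"\xFF\xFD"  # U+FFFD written as two big-endian bytes

def is_continuation(byte: int) -> bool:
    # Continuation bytes are 0x80..0xBF; -1 marks "past the end of the input".
    return 0x80 <= byte <= 0xBF

def utf8_to_ucs2(data: bytes) -> bytes:
    out = bytearray()
    pos = 0
    while pos < len(data):
        b1 = data[pos]
        b2 = data[pos + 1] if pos + 1 < len(data) else -1
        b3 = data[pos + 2] if pos + 2 < len(data) else -1
        if b1 < 0x80:
            # 1-byte character: prefix it with 0x00.
            out += bytes([0x00, b1])
            pos += 1
        elif b1 < 0xC2 or b1 > 0xEF:
            # Not a valid lead byte for a UCS-2 character.
            out += REPLACEMENT
            pos += 1
        elif b1 < 0xE0:
            # 2-byte character.
            if is_continuation(b2):
                value = ((b1 - 0xC0) * 0x40) + (b2 - 0x80)
                out += value.to_bytes(2, "big")
                pos += 2
            else:
                out += REPLACEMENT
                pos += 1   # continue from the byte just read
        else:
            # 3-byte character.
            if not is_continuation(b2):
                out += REPLACEMENT
                pos += 1   # continue from the first byte just read
            elif not is_continuation(b3):
                out += REPLACEMENT
                pos += 2   # skip the first extra byte, continue from the second
            else:
                value = ((b1 - 0xE0) * 0x1000) + ((b2 - 0x80) * 0x40) + (b3 - 0x80)
                out += value.to_bytes(2, "big")
                pos += 3
    return bytes(out)

for sample in (b"\xC2\xA3", b"\xC5\xA8", b"\xE2\x82\xAC", b"\x80"):
    print(sample.hex().upper(), "->", utf8_to_ucs2(sample).hex().upper())
```

For the three worked examples this prints 00A3, 0168 and 20AC, and FFFD for the stray 0x80 byte.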
Last edited by rextended on Fri Feb 17, 2023 6:50 pm, edited 2 times in total.
 
Amm0

Re: Convert any text to UNICODE

Wed Feb 15, 2023 7:51 pm

One additional note on "UNICODE", maybe others know: while RouterOS is Linux, its scripting and configuration line endings seem to be "\r\n" like Windows. UNIX uses just a single "\n"...
Never knew this until today. Not saying it's a problem, just another consideration, since converting UTF-8 to a "RouterOS string" should theoretically deal with that, perhaps by converting each "\n" to "\r\n" (if there wasn't already a "\r" preceding the "\n"). Maybe? I personally don't care, just curious, and it's a footnote here.
[u@rsc] > :put "something\nnext"
something
         next
[u@rsc] > :put "something\r\nnext"
something
next
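
The conversion floated above (put a "\r" in front of every "\n" that does not already have one) is easy to state precisely. A minimal sketch in Python, not RouterOS script, with the helper name to_crlf chosen only for illustration:

```python
import re

# Matches a line feed that is NOT already preceded by a carriage return.
BARE_LF = re.compile(r"(?<!\r)\n")

def to_crlf(text: str) -> str:
    # Replace each bare LF with CR LF; existing CR LF pairs are left alone.
    return BARE_LF.sub("\r\n", text)

print(repr(to_crlf("something\nnext")))     # 'something\r\nnext'
print(repr(to_crlf("something\r\nnext")))   # 'something\r\nnext'
```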
 
rextended

Re: Convert any text to UNICODE

Thu Feb 16, 2023 2:09 am

UTF-8 does not care at all how \r, \n, \t, etc. are used.
UTF-8 is only a way to represent the two-byte UCS-2 values with the fewest possible bytes.
UCS-2 is a way to represent ~65536 different characters instead of the standard 128 + "random" 128 other characters (based on codepages).
CPxxxx is a way to represent 128 more characters on top of the standard 128, but they are localized.
GSM-7 also has 127 characters + 1 escape character for extended localized pages of another 127 characters.

In your example, \n correctly goes to the next line....
[u@rsc] > :put "something\nnext"
something
         next

In your example, \r correctly returns to the start of the line, and \n goes to the next line.... (it is the same if the reverse, \n\r, is used)
[u@rsc] > :put "something\r\nnext"
something
next

What you do not do is use only \r: it goes back to the start, overwriting "some" with "next"
[u@rsc] > :put "something\rnext"
nextthing

In short: UNIX is wrong...
\r is Carriage Return
and \n is New Line...
These control characters were born with printers, when monitors did not exist yet...
The cursor must first be put back at the starting position, and the paper must advance by one line... \r \n...
If on a terminal you send only \n, the expected and correct behaviour is for the cursor to advance to the next line but keep the same column....

In this case, 7-bit ASCII, CP437, CP850, CP1252, UCS-2, UTF-8 and GSM-7 all have both CR (\r) and LF (\n); it is not a matter of character encoding, but of how the OS is designed.
 
rextended

Re: Convert any text to UNICODE

Fri Feb 17, 2023 8:25 pm

searchtag # rextended utf8 to ucs2, utf8 to ucs2 pdu

Without using tables, this converts one UTF-8 string to one UCS-2 string (2-byte Unicode code points); obviously only the characters from 0x0000 to 0xFFFF are supported by UCS-2.
On error or with unsupported characters, the replacement character 0xFF 0xFD is used.
:global UTF8toUCS2 do={
    :local repch "\FF\FD"
    :if ([:typeof $2] = "no-replace") do={:set repch ""}
    :local numbyte2hex do={
        :local input [:tonum $1]
        :local hexchars "0123456789ABCDEF"
        :local convert [:pick $hexchars (($input >> 4) & 0xF)]
        :set convert ($convert.[:pick $hexchars ($input & 0xF)])
        :return $convert
    }

    :local charsString ""
    :for x from=0 to=15 step=1 do={ :for y from=0 to=15 step=1 do={
        :local tmpHex "$[:pick "0123456789ABCDEF" $x ($x+1)]$[:pick "0123456789ABCDEF" $y ($y+1)]"
        :set $charsString "$charsString$[[:parse "(\"\\$tmpHex\")"]]"
    } }

    :local chr2int do={
        :if (($1="") or ([:len $1] > 1) or ([:typeof $1] = "nothing")) do={:return -1}
        :return [:find $2 $1 -1]
    }

    :local string $1
    :if (([:typeof $string] != "str") or ($string = "")) do={ :return "" }
    :local output ""

    :local lenstr [:len $string]
    :local read1; :local char1; :local char2; :local char3; :local char4; :local ucsvalue
    :local outstr ""
    :local pos 0
    :while ($pos < $lenstr) do={
        :set read1 [:pick $string $pos ($pos + 1)]
        :set char1 [$chr2int $read1 $charsString]
        :if ($char1 < 0x80) do={
            :set outstr "\00$read1"
        }
        :if ((($char1 > 0x7F) and ($char1 < 0xC2)) or ($char1 > 0xEF)) do={
            :set outstr $repch
        }
        :set char2 [$chr2int [:pick $string ($pos + 1) ($pos + 2)] $charsString]
        :if (($char1 > 0xC1) and ($char1 < 0xE0)) do={
            :if (($char2 < 0x80) or ($char2 > 0xBF)) do={
                :set outstr $repch
            } else={
                :set ucsvalue ((($char1 - 0xC0) * 0x40) + ($char2 - 0x80))
                :set outstr "$[:pick $charsString (($ucsvalue >> 8) & 0xFF)]$[:pick $charsString ($ucsvalue & 0xFF)]"
                :set pos ($pos + 1)
            }
        }
        :set char3 [$chr2int [:pick $string ($pos + 2) ($pos + 3)] $charsString]
        :if (($char1 > 0xDF) and ($char1 < 0xF0)) do={
            :if ((($char2 < 0x80) or ($char2 > 0xBF)) \
                 or ((($char1 = 0xE0) and ($char2 < 0xA0)) or (($char1 = 0xED) and ($char2 > 0x9F)))) do={
                :set outstr $repch
            } else={
                :if (($char3 < 0x80) or ($char3 > 0xBF)) do={
                    :set outstr $repch
                    :set pos ($pos + 1)
                } else={
                    :set ucsvalue ((($char1 - 0xE0) * 0x1000) + (($char2 - 0x80) * 0x40) + ($char3 - 0x80))
                    :set outstr "$[:pick $charsString (($ucsvalue >> 8) & 0xFF)]$[:pick $charsString ($ucsvalue & 0xFF)]"
                    :set pos ($pos + 2)
                }
            }
        }

# the following commented lines are not used on UCS-2
# but I have already prepared my script for future changes to work with all UNICODE code points from 0x000000 to 0x10FFFF as well...
#        :set char4 [$chr2int [:pick $string ($pos + 3) ($pos + 4)] $charsString]
#        :if (($char1 > 0xEF) and ($char1 < 0xF5)) do={
#            :if ((($char2 < 0x80) or ($char2 > 0xBF)) \
#                 or ((($char1 = 0xF0) and ($char2 < 0x90)) or (($char1 = 0xF4) and ($char2 > 0x8F)))) do={
#                :set outstr $repch
#            } else={
#                :if (($char3 < 0x80) or ($char3 > 0xBF)) do={
#                    :set outstr $repch
#                    :set pos ($pos + 1)
#                } else={
#                    :if (($char4 < 0x80) or ($char4 > 0xBF)) do={
#                        :set outstr $repch
#                        :set pos ($pos + 2)
#                    } else={
#                        :set ucsvalue ((($char1 - 0xF0) * 0x40000) + (($char2 - 0x80) * 0x1000) + \
#                                       (($char3 - 0x80) * 0x40) + ($char4 - 0x80))
#                        :set outstr "$[:pick $charsString (($ucsvalue >> 16) & 0xFF)]"
#                        :set outstr "$outstr$[:pick $charsString (($ucsvalue >> 8) & 0xFF)]$[:pick $charsString ($ucsvalue & 0xFF)]"
#                        :set pos ($pos + 3)
#                    }
#                }
#            }
#        }

        :set output "$output$outstr"
        :set pos ($pos + 1)
    }
    :return $output
}

For example, to convert the UTF-8 string "hello my friend camión Ññ" to UCS-2:

example code

{
:global testutf8 "\68\65\6C\6C\6F\20\6D\79\20\66\72\69\65\6E\64\20\63\61\6D\69\C3\B3\6E\20\C3\91\C3\B1"
:global testucs2 [$UTF8toUCS2 $testutf8]
:put $testutf8
:put $testucs2
/sys scri env pri
}

results code

hello my friend cami  n  
hello my friend cami n 
 # NAME               VALUE
 x testucs2           \00h\00e\00l\00l\00o\00 \00m\00y\00 \00f\00r\00i\00e\00n\00d\00 \00c\00a\00m\00i\00\F3\00n\00 \00\D1\00\F1
 x testutf8           hello my friend cami\C3\B3n \C3\91\C3\B1
MikroTik cannot display non-7-bit-ASCII characters on the terminal, but the correct values are present in memory.
Code points: ó = 00 F3, Ñ = 00 D1, ñ = 00 F1



And this one produces a hex string usable to build the UCS-2 PDU for sending the message as an SMS with AT commands:
:global UTF8toUCS2hexstring do={
    :local repch "FFFD"
    :if ([:typeof $2] = "no-replace") do={:set repch ""}
    :local numbyte2hex do={
        :local input [:tonum $1]
        :local hexchars "0123456789ABCDEF"
        :local convert [:pick $hexchars (($input >> 4) & 0xF)]
        :set convert ($convert.[:pick $hexchars ($input & 0xF)])
        :return $convert
    }

    :local charsString ""
    :for x from=0 to=15 step=1 do={ :for y from=0 to=15 step=1 do={
        :local tmpHex "$[:pick "0123456789ABCDEF" $x ($x+1)]$[:pick "0123456789ABCDEF" $y ($y+1)]"
        :set $charsString "$charsString$[[:parse "(\"\\$tmpHex\")"]]"
    } }

    :local chr2int do={
        :if (($1="") or ([:len $1] > 1) or ([:typeof $1] = "nothing")) do={:return -1}
        :return [:find $2 $1 -1]
    }

    :local string $1
    :if (([:typeof $string] != "str") or ($string = "")) do={ :return "" }
    :local output ""

    :local lenstr [:len $string]
    :local read1; :local char1; :local char2; :local char3; :local char4; :local ucsvalue
    :local outstr ""
    :local pos 0
    :while ($pos < $lenstr) do={
        :set read1 [:pick $string $pos ($pos + 1)]
        :set char1 [$chr2int $read1 $charsString]
        :if ($char1 < 0x80) do={
            :set outstr "00$[$numbyte2hex $char1]"
        }
        :if ((($char1 > 0x7F) and ($char1 < 0xC2)) or ($char1 > 0xEF)) do={
            :set outstr $repch
        }
        :set char2 [$chr2int [:pick $string ($pos + 1) ($pos + 2)] $charsString]
        :if (($char1 > 0xC1) and ($char1 < 0xE0)) do={
            :if (($char2 < 0x80) or ($char2 > 0xBF)) do={
                :set outstr $repch
            } else={
                :set ucsvalue ((($char1 - 0xC0) * 0x40) + ($char2 - 0x80))
                :set outstr "$[$numbyte2hex (($ucsvalue >> 8) & 0xFF)]$[$numbyte2hex ($ucsvalue & 0xFF)]"
                :set pos ($pos + 1)
            }
        }
        :set char3 [$chr2int [:pick $string ($pos + 2) ($pos + 3)] $charsString]
        :if (($char1 > 0xDF) and ($char1 < 0xF0)) do={
            :if ((($char2 < 0x80) or ($char2 > 0xBF)) \
                 or ((($char1 = 0xE0) and ($char2 < 0xA0)) or (($char1 = 0xED) and ($char2 > 0x9F)))) do={
                :set outstr $repch
            } else={
                :if (($char3 < 0x80) or ($char3 > 0xBF)) do={
                    :set outstr $repch
                    :set pos ($pos + 1)
                } else={
                    :set ucsvalue ((($char1 - 0xE0) * 0x1000) + (($char2 - 0x80) * 0x40) + ($char3 - 0x80))
                    :set outstr "$[$numbyte2hex (($ucsvalue >> 8) & 0xFF)]$[$numbyte2hex ($ucsvalue & 0xFF)]"
                    :set pos ($pos + 2)
                }
            }
        }

# the following commented lines are not used on UCS-2
# but I have already prepared my script for future changes to work with all UNICODE code points from 0x000000 to 0x10FFFF as well...
#        :set char4 [$chr2int [:pick $string ($pos + 3) ($pos + 4)] $charsString]
#        :if (($char1 > 0xEF) and ($char1 < 0xF5)) do={
#            :if ((($char2 < 0x80) or ($char2 > 0xBF)) \
#                 or ((($char1 = 0xF0) and ($char2 < 0x90)) or (($char1 = 0xF4) and ($char2 > 0x8F)))) do={
#                :set outstr $repch
#            } else={
#                :if (($char3 < 0x80) or ($char3 > 0xBF)) do={
#                    :set outstr $repch
#                    :set pos ($pos + 1)
#                } else={
#                    :if (($char4 < 0x80) or ($char4 > 0xBF)) do={
#                        :set outstr $repch
#                        :set pos ($pos + 2)
#                    } else={
#                        :set ucsvalue ((($char1 - 0xF0) * 0x40000) + (($char2 - 0x80) * 0x1000) + \
#                                       (($char3 - 0x80) * 0x40) + ($char4 - 0x80))
#                        :set outstr "$[$numbyte2hex (($ucsvalue >> 16) & 0xFF)]"
#                        :set outstr "$outstr$[$numbyte2hex (($ucsvalue >> 8) & 0xFF)]$[$numbyte2hex ($ucsvalue & 0xFF)]"
#                        :set pos ($pos + 3)
#                    }
#                }
#            }
#        }

        :set output "$output$outstr"
        :set pos ($pos + 1)
    }
    :return $output
}


In this example the UTF-8 string "hello my friend camión Ññ" is converted to UCS-2 for use in the SMS PDU "AT" command:

example code

{
:global testutf8 "\68\65\6C\6C\6F\20\6D\79\20\66\72\69\65\6E\64\20\63\61\6D\69\C3\B3\6E\20\C3\91\C3\B1"
:global testucs2 [$UTF8toUCS2hexstring $testutf8]
:put $testutf8
:put $testucs2
/sys scri env pri
}

results code

hello my friend cami  n  
 # NAME               VALUE
 x testucs2           00680065006C006C006F0020006D007900200066007200690065006E0064002000630061006D006900F3006E002000D100F1
 x testutf8           hello my friend cami\C3\B3n \C3\91\C3\B1
The string
(32)00680065006C006C006F0020006D007900200066007200690065006E0064002000630061006D006900F3006E002000D100F1
is the message encoded in UCS-2 for building the SMS PDU. The message is 25 characters long, but takes 50 bytes in the SMS because of the UCS-2 encoding.
In the PDU, 0x32 (50) must be added at the start: it is the length of the message in bytes, and for UCS-2 it is never an odd number.
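
The hex string and the 0x32 length can be cross-checked off the router. A small sketch in Python (not RouterOS script), relying on the fact that for characters up to 0xFFFF, UCS-2 has the same bytes as big-endian UTF-16; the variable names are only illustrative:

```python
text = "hello my friend cami\u00f3n \u00d1\u00f1"

# For BMP characters, UCS-2 is byte-for-byte the same as big-endian UTF-16.
ucs2 = text.encode("utf-16-be")
hex_string = ucs2.hex().upper()

# Length prefix: number of bytes in the UCS-2 text, as two hex digits.
length_prefix = "{:02X}".format(len(ucs2))

print(len(text), len(ucs2))   # 25 50
print(length_prefix)          # 32
print(hex_string)
```

The last line should be identical to the testucs2 value printed by the router.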
Last edited by rextended on Fri Mar 10, 2023 3:53 pm, edited 6 times in total.
 
Amm0

Re: Convert any text to UNICODE

Fri Feb 17, 2023 10:03 pm

belli€mo

One note, the BOM is helpful for files... see below:
# if we add the "BOM" (byte order mark), it works to a file and displays in TextEdit (Mac version of notepad.exe)
# without the \FE\FF, an exported file is unreadable (starts with \00 so unsure what to do)
:global z ("\FE\FF".[$UTF8toUCS2 ("belli"."\E2\82\AC"."mo")])
# one important benefit of UTF16/UCS2 is getting the number of *characters* not *bytes* is possible...
# so if dealing with UTF8 from JSON etc, converting to UCS2 using $UTF8toUCS2 may be helpful
:put ([:len $z]/2)
9
# should be 8 but that BOM at start need to be accounted for here...
:put (([:len $z]/2)-1)
8
# what's curious is that UCS2 prints at least the ASCII parts just fine on Mac+SSH 
:put $z
# ��belli �mo
/file print file=ucsfile
/file set ucsfile contents=$z
/system script env print where name=z
# Columns: NAME, VALUE
#  NAME  VALUE                       
# 7  z     FEFF00b00e00l00l00i AC00m00o
So for files the BOM is required at the start of the string (for output to a file; in other places the BOM is likely not helpful). I know notepad.exe also respects the BOM, so it may be useful to my colorless friend in Italy.

** I'm guessing [:put] just strips nulls (\00) before output... which actually makes UCS2 + RouterOS even more helpful. That means there is little harm in storing lower ASCII as UCS2, other than doubling the size of the underlying "str" type...
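
The BOM and character-count arithmetic above can be reproduced off the router. A minimal sketch in Python (not RouterOS script); the variable names are only illustrative:

```python
BOM_BE = b"\xFE\xFF"  # byte order mark for big-endian UTF-16/UCS-2

text = "belli\u20acmo"                 # "belli" + euro sign + "mo"
z = BOM_BE + text.encode("utf-16-be")

units_with_bom = len(z) // 2           # the BOM counts as one 2-byte unit
characters = units_with_bom - 1        # the actual text

# The generic "utf-16" codec reads the BOM to pick the byte order and drops it.
decoded = z.decode("utf-16")

print(units_with_bom, characters)      # 9 8
print(decoded == text)                 # True
```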
 
rextended

Re: Convert any text to UNICODE

Sat Feb 18, 2023 12:51 am

The BOM, Byte Order Mark for UTF-8 is 0xEF 0xBB 0xBF
viewtopic.php?t=177551#p967513

I only mentioned it in 2022 because all these functions assume that the source is already in the correct format.
UCS-2 is obsolete, so in the end it is only found in the PDU of an SMS sent with that encoding, and there it does not have a BOM.
In 7-bit ASCII the BOM is not used, nor in GSM-7 or the various CP437, 850, 1252, etc.
It is only used by UTF-8 (and others not covered here).
It is only useful if UTF-8 must be stored in a file, so that the program that opens the file recognizes that it is written in UTF-8 rather than having to use a codepage to interpret the content (other BOMs also indicate UTF-16 and other encodings, but we don't cover them here for now).

These functions that I wrote, however, were not designed to work directly on files; if they were used that way,
it would be up to the part of the script that loads or saves the file to add the BOM or remove it as needed.

There is no point in adding it if you already know the encoding of the value read.

This is an example of a file in ANSI (my PC uses CP1252 / Windows-1252), where € is 0x80, and in UTF-8 and UTF-8 with BOM, where € is 0xE2 0x82 0xAC in both.
test_cp.png
Windows Notepad simply opens the UTF-8 version correctly even if the BOM is not present.

P.S.: In UTF-16 and UCS-2 it is 0x20 0xAC, and in UTF-32 it is 0x00 0x00 0x20 0xAC
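
The byte values quoted for € can be checked with a few lines of Python (not RouterOS script). The helper name hex_bytes is only illustrative; "utf-8-sig" is Python's name for UTF-8 with a BOM:

```python
def hex_bytes(text: str, encoding: str) -> str:
    # Encode the text and show each byte as two uppercase hex digits.
    return " ".join("{:02X}".format(b) for b in text.encode(encoding))

euro = "\u20ac"  # the euro sign

for name in ("cp1252", "utf-8", "utf-8-sig", "utf-16-be", "utf-32-be"):
    print(name, hex_bytes(euro, name))
# cp1252 80
# utf-8 E2 82 AC
# utf-8-sig EF BB BF E2 82 AC
# utf-16-be 20 AC
# utf-32-be 00 00 20 AC
```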
 
Amm0

Re: Convert any text to UNICODE

Sat Feb 18, 2023 1:08 am

The BOM, Byte Order Mark for UTF-8 is 0xEF 0xBB 0xBF
viewtopic.php?t=177551#p967513
[...]
These functions that I wrote, however, were not designed to work directly on files, but if they were used,
it would be up to the part of the script that deals with loading or saving the file to apply it and remove it if needed.
Agreed - great work. It's just that you have a LOT of snippets, so it's hard to keep track ;). This thread has the current "best of" for unicode, which is why I mention the file stuff (and wanted to test semi-independently via TextEdit.app) – obviously not needed specifically for SMS.

The BOM really only comes up with UCS2 and/or UTF-16. Most "web stuff" generally assumes UTF-8 or has the encoding in the metadata, so BOMs have even less use for UTF-8.
 
rextended

Re: Convert any text to UNICODE

Sat Feb 18, 2023 1:16 am

UTF-16 has 2 BOMs, and this time it really is a Byte Order Mark, not just a code to identify UTF-8....
0xFE 0xFF and 0xFF 0xFE
If it is FEFF, the "space", for example, is 0x00 0x20 (the default if the BOM is not written)
If it is FFFE, the "space", for example, is 0x20 0x00

If in UTF-16 (or also UTF-32: 0x00 0x00 0xFE 0xFF [the default] and 0xFF 0xFE 0x00 0x00)
the Byte Order Mark is not correct, the string cannot be interpreted correctly...
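
The two byte orders and the BOM values listed above can be verified off the router with Python's standard codecs module (Python, not RouterOS script). This is only a cross-check of the byte values, not of any router behaviour:

```python
import codecs

# One space in each byte order, without any BOM.
space_be = " ".encode("utf-16-be")   # 0x00 0x20
space_le = " ".encode("utf-16-le")   # 0x20 0x00
print(space_be.hex(), space_le.hex())                # 0020 2000

# The BOMs named above, as defined by the codecs module.
print(codecs.BOM_UTF16_BE == b"\xFE\xFF")            # True
print(codecs.BOM_UTF16_LE == b"\xFF\xFE")            # True
print(codecs.BOM_UTF32_BE == b"\x00\x00\xFE\xFF")    # True
print(codecs.BOM_UTF32_LE == b"\xFF\xFE\x00\x00")    # True

# Reading big-endian bytes with the wrong byte order gives a different character.
wrong = space_be.decode("utf-16-le")
print(hex(ord(wrong)))                               # 0x2000
```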
 
diamuxin

Re: Convert any text to UNICODE

Sat Feb 18, 2023 1:23 am

Amazing!
thank you for your great work.

BR.
 
rextended

Re: Convert any text to UNICODE

Sat Feb 18, 2023 2:09 am

thank you for your great work.
Thanks to you.


What is missing right now?
Let's make a summary:

(ignoring whether the data must be stored in a file, sent on Telegram, sent via SMS or represented as a hexadecimal string for the PDU)

From ASCII-7bit (stored in one 8-bit byte) to any CPxxx: Not needed, the first 128 values are already identical in all codepages.

From ASCII-7bit (stored in one 8-bit byte) to UTF-8: Yes, just use ASCIItoCP1252toUTF8
viewtopic.php?t=177551#p967513

From ASCII-7bit (stored in one 8-bit byte) to UCS-2 (and UTF-16): Yes, just use ASCIItoCP1252toUNICODE, the first 128 values are identical.
viewtopic.php?t=177551#p967513

From ASCII-7bit (stored in one 8-bit byte) to GSM-7: No, but simply knowing the GSM-7 alphabet
viewtopic.php?p=411358#p411358
and modifying the already existing ASCIItoCP1252toUTF8 function, it takes little work: apart from renaming the function and variables appropriately, just update the internal table.


From CPxxx to ASCII-7bit (stored in one 8-bit byte): Possible, but the 128 extra characters cannot all be encoded in the 128 ASCII values.
ASCIItoCP1252toUTF8 can be used, replacing the UTF-8 table with a table where 0x00 to 0x7F are identical, but all the others are approximations of the original character.
For example from CP1252 to ASCII-7bit: "L'offerta a ½ prezzo è €25,00" => "L'offerta a 1/2 prezzo e' EUR25,00"

From CPxxx to UTF-8: Yes, for CP1252 just use ASCIItoCP1252toUTF8;
a conversion table can be created for any CP.
viewtopic.php?t=177551#p967513

From CPxxx to UCS-2 (and UTF-16): Yes, just use ASCIItoCP1252toUNICODE;
here too a conversion table can be created for any CP.
viewtopic.php?t=177551#p967513

From CPxxx to GSM-7: No, but simply knowing the GSM-7 alphabet
viewtopic.php?p=411358#p411358
as before, by modifying the already existing ASCIItoCP1252toUTF8 function.



From UTF-8 to ASCII-7bit (stored in one 8-bit byte): Possible, but like before all 65536 characters cannot be encoded in 128 values.
The part of UTF8toUCS2 that handles the first 0x7F values can be used, but what about the other ~65400 characters???? :lol:

From UTF-8 to CPxxx: Possible, but only 256 characters can be represented. In this case, however, knowing the destination CP, there is no confusion about the destination characters.
Probably reversing the table inside the ASCIItoCP1252toUTF8 function, and creating a table of pairs for each codepage wanted, does the job,
but all UTF-8 characters not used by that codepage must be represented with "?"

From UTF-8 to UCS-2: Yes, the function above :)

From UTF-8 to GSM-7: Possible, but it is again the same problem as with ASCII-7bit: only 128 characters can be represented (not strictly true, slightly more than 128), which is why UCS-2 exists for SMS.


From UCS-2 to ASCII-7bit (stored in one 8-bit byte): Possible, but like before all 65536 characters cannot be encoded in 128 values.

From UCS-2 to CPxxx: Possible, but only 256 characters can be represented. Also in this case, knowing the destination CP, there is no confusion about the destination characters.
Probably reversing the table inside the ASCIItoCP1252toUNICODE function, and creating a table of pairs for each codepage wanted, does the job,
but all UCS-2 characters not used by that codepage must be represented with "?"
viewtopic.php?t=177551#p967513

From UCS-2 to UTF-8: Yes, again the function above :)

From UCS-2 to GSM-7: Possible, but it is again the same problem every time: only 128 characters can be represented.
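
As an off-router reference for the summary above, Python's built-in codecs (Python, not RouterOS script) show both kinds of conversion: the lossless ones from CP1252 to UTF-8 and UCS-2, and the lossy ones where characters missing from the destination become "?". This is only a cross-check; it uses plain "?" replacement, not the hand-made approximations ("1/2", "EUR") described above.

```python
text = "L'offerta a \u00bd prezzo \u00e8 \u20ac25,00"

# CP1252 holds every character of this text in one byte each.
cp1252_bytes = text.encode("cp1252")

# Lossless: CP1252 -> UTF-8 and CP1252 -> UCS-2 (big-endian UTF-16 for the BMP).
as_utf8 = cp1252_bytes.decode("cp1252").encode("utf-8")
as_ucs2 = cp1252_bytes.decode("cp1252").encode("utf-16-be")

# Lossy: anything outside 7-bit ASCII becomes "?".
as_ascii = text.encode("ascii", errors="replace")

# Lossy: U+0168 is valid UCS-2 but does not exist in CP1252.
not_in_cp1252 = "\u0168".encode("cp1252", errors="replace")

print(len(text), len(cp1252_bytes), len(as_utf8), len(as_ucs2))   # 29 29 33 58
print(as_ascii.decode("ascii"))       # L'offerta a ? prezzo ? ?25,00
print(not_in_cp1252.decode("ascii"))  # ?
```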


I probably wrote something wrong with copy and paste, or for sure in the English syntax, but I think you understand well enough.
Last edited by rextended on Sat Feb 18, 2023 4:49 am, edited 1 time in total.
 
Amm0

Re: Convert any text to UNICODE

Sat Feb 18, 2023 3:39 am

In [$UTF8toUCS2] the 3rd line, did you mean:
    :if ([:tostr $2] = "no-replace") do={:set repch ""}
It currently says ":if ([:typeof $2] = "no-replace")", but "no-replace" is not a type, so that comparison can never be true.

And you could maybe enable the UTF16 code and add an "as-utf16" option in arg2 or arg3:
:if ([:tostr $2] = "no-replace" || [:tostr $3] = "no-replace") do={:set repch ""}
 :local useutf16 0
 :if ([:tostr $2] = "as-utf16" || [:tostr $3] = "as-utf16") do={:set useutf16 1}
 # [...]
 :if ($useutf16) do={
 # your commented out code to deal with the "extra" part of UCS2
 }
Just an idea.
 
rextended

Re: Convert any text to UNICODE

Sat Feb 18, 2023 4:17 am

In the UTF8toUCS2 version it is
    :local repch "\FF\FD"
    :if ([:typeof $2] = "no-replace") do={:set repch ""}
in the UTF8toUCS2hexstring version it is "FFFD" without the \

Not a typo, a "hidden" feature.... :lol:
If the 2nd parameter is "no-replace", invalid characters are not replaced with FF FD; simply nothing is added to the output for the invalid UTF-8 code.

0xFF 0xFD is the REPLACEMENT CHARACTER � for unknown or broken codes.
Last edited by rextended on Sat Feb 18, 2023 4:29 am, edited 2 times in total.
 
rextended

Re: Convert any text to UNICODE

Sat Feb 18, 2023 4:27 am

And, you can maybe enable the UTF16 code and add a "as-utf16" option in arg2 or arg3
 # your commented out code to deal with the "extra" part of UCS2
Just an idea.
I do not find any practical use for UTF-16 or UTF-32, but UTF-8 can also encode the characters that need UTF-16, and when I write a program, I prefer to include the full feature set.
(like in UCS2toUTF8, if you read it: I have also prepared UTF-16 to UTF-8, but UCS-2 is 2 bytes only and I did not enable that part because it never uses 3 bytes)
viewtopic.php?p=985104#p983695

This way I do not have to learn it again and invent the conversion procedure for both directions; I have already prepared everything...

To enable the UTF-16 parts, the BOM must also be added to the output (as written in the previous posts), the commented parts uncommented,
and EF replaced with F4 in the uncommented condition "or ($char1 > 0xEF)".

But if the function is for UCS-2, UTF-16 is useless, which is why it is only there for the future or for creating ANOTHER function...
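
For reference, here is what the 4-byte case involves, sketched in Python (not RouterOS script). decode_4byte uses the same arithmetic as the commented-out branch of UTF8toUCS2; to_surrogates is an addition for illustration only, showing how UTF-16 itself stores a code point above 0xFFFF as two 16-bit units. As far as I can read it, the commented-out branch emits the code point as three raw bytes instead, so treat the surrogate step as an assumption about what a full UTF-16 output would need.

```python
def decode_4byte(b1: int, b2: int, b3: int, b4: int) -> int:
    # Same arithmetic as the commented-out 4-byte branch.
    return (((b1 - 0xF0) * 0x40000) + ((b2 - 0x80) * 0x1000)
            + ((b3 - 0x80) * 0x40) + (b4 - 0x80))

def to_surrogates(code_point: int):
    # UTF-16 splits a code point above 0xFFFF into a high and a low surrogate.
    offset = code_point - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)

sample = "\U0001F600".encode("utf-8")      # bytes F0 9F 98 80
code_point = decode_4byte(*sample)
high, low = to_surrogates(code_point)

print(hex(code_point), hex(high), hex(low))   # 0x1f600 0xd83d 0xde00
```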
Last edited by rextended on Sat Feb 18, 2023 4:42 am, edited 1 time in total.
 
Amm0

Re: Convert any text to UNICODE

Sat Feb 18, 2023 4:42 am

LOL. I didn't lookup the preceding unicode.

But another question, in your summary...
if something is for sure "ASCII-7bit (stored in an 8 byte)", then is $ASCIItoCP1252toUNICODE needed to get to UCS-2? Since UTF-8 multi-byte sequences only start at 0x80, if the str really was just ASCII, UTF8toUCS2 should give identical output (e.g. 7-bit/"us-ascii" is still valid UTF-8 after all ;)). ¿No?
 
rextended

Re: Convert any text to UNICODE

Sat Feb 18, 2023 4:45 am

I do not understand what you mean.

If you have an ASCII-7bit string like "test", in UCS-2 (and big-endian UTF-16) it is "\00t\00e\00s\00t", and in CPxxx, ANSI, ASCII-8bit and UTF-8 it is the same: "test"
(in big-endian UTF-32 it is "\00\00\00t\00\00\00e\00\00\00s\00\00\00t")
Last edited by rextended on Sat Feb 18, 2023 4:56 am, edited 3 times in total.
 
rextended

Re: Convert any text to UNICODE

Sat Feb 18, 2023 4:47 am

if something is for sure "ASCII-7bit (stored in one 8-bit byte)", then is the $ASCIItoCP1252toUNICODE needed to get to UCS-2? ....
if you are really sure, just put \00 in front of each character...

pseudo-code of the function "ASCII-7bit (stored in one 8-bit byte)" to UCS-2 / UTF-16:
foreach character in input, add on output \00 + current character

pseudo-code of the function "ASCII-7bit (stored in one 8-bit byte)" to UTF-32:
foreach character in input, add on output \00\00\00 + current character
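
The two lines of pseudo-code above translate almost word for word. A minimal sketch in Python (not RouterOS script), assuming the input really is pure 7-bit ASCII; the function names are only illustrative:

```python
def ascii7_to_ucs2(data: bytes) -> bytes:
    # For each input byte, output 0x00 followed by that byte (big-endian UCS-2 / UTF-16).
    return b"".join(b"\x00" + bytes([b]) for b in data)

def ascii7_to_utf32(data: bytes) -> bytes:
    # For each input byte, output 0x00 0x00 0x00 followed by that byte (big-endian UTF-32).
    return b"".join(b"\x00\x00\x00" + bytes([b]) for b in data)

print(ascii7_to_ucs2(b"test"))                                   # b'\x00t\x00e\x00s\x00t'
print(ascii7_to_ucs2(b"test") == "test".encode("utf-16-be"))     # True
print(ascii7_to_utf32(b"test") == "test".encode("utf-32-be"))    # True
```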

P.S.: replaced the misspelled (stored in an 8 byte) with (stored in one 8-bit byte)
Last edited by rextended on Sat Feb 18, 2023 4:57 am, edited 1 time in total.
 
Amm0

Re: Convert any text to UNICODE

Sat Feb 18, 2023 4:56 am

7-bit ASCII = "test" or as str type "\74\65\73\74"
utf-8 = "test" or as str type "\74\65\73\74"

It's only where characters in the string to convert are at or above 0x80 (128) that CP1252 and UTF8 diverge.

Just saying your wonderful UTF8toUCS2 works just fine for "7-bit ASCII", since ALL chars are < 0x80. Since "normal" RouterOS strings only use 7 bits (e.g. without escaped hex), your UTF8toUCS2 works for those too to get UCS2.

Trying to say the UTF8toUCS2 is even MORE generic, and useful ;). Great work here!
 
rextended

Re: Convert any text to UNICODE

Sat Feb 18, 2023 4:58 am

Great work here!
Thanks!

I'm going to bed now, it is late.
Have a nice day.
 
Sertik

Re: Convert any text to UNICODE

Sat Feb 25, 2023 11:24 am

Script for sending incoming SMS to mail with full parsing

viewtopic.php?t=161931
