The reason why UTF-8 is so different, and also why it is so popular, is that it is a variable-length encoding scheme, with a minimum width of just one byte and a maximum of four bytes. In comparison, UTF-16 is also variable, but has a fixed minimum of 2 bytes in width, and UTF-32 is not variable at all: it is fixed at 4 bytes in width. The variable nature of UTF-8 means that each byte also has to carry information about whether or not it is the last byte, which is the reason why you can't just plug the binary from a 2-byte-wide UTF-8 char into UTF-32, and vice-versa.

Google Sheets: Text Character to UTF-8 Array

There is a lot to this, so I'm going to break down my overall solution (Text -> UTF-16 Unicode Decimal -> UTF-8 Decimal -> UTF-8 Binary) into smaller chunks, in case you need just one part of it.

Note: Originally, I thought Sheets was using UTF-16, but I am now aware it is UTF-32BE. This section talks about converting UTF-16 to UTF-8; my starting point for this project was to try and convert a single UTF-16 charcode/codepoint to a UTF-8 codepoint. However, the approach below still seems to work.

Luckily, the MDN page on base64 encoding and decoding had a very clean JavaScript function I could use as a base. Out of the full chain ("text -> utf-16 -> utf-8 split into bytes as decimal -> split into bytes as binary -> binary"), we can stop at the 3rd step, where we have an array of bytes as decimals, and combine them into a binary product.

Here is the relevant part of their JavaScript code, which converts a UTF-16 charcode to UTF-8:

    nChr = sDOMStr.charCodeAt(nChrIdx);
    aBytes[nIdx++] = 128 + (nChr >>> 6 & 63);
    aBytes[nIdx++] = 128 + (nChr >>> 12 & 63);
    aBytes[nIdx++] = 128 + (nChr >>> 18 & 63);
    aBytes[nIdx++] = 128 + (nChr >>> 24 & 63);

And, here it is replicated in Google Sheets, as a formula:

    =IF(UNICODE(A2)…
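The pipeline described above (text -> code point -> UTF-8 bytes as decimals -> bytes as binary strings) can be sketched end-to-end in plain JavaScript. This is my own minimal version, not the exact MDN function, and the function names are hypothetical:

```javascript
// Encode a single Unicode code point as an array of UTF-8 bytes (as decimals).
// The lead byte encodes the total width; continuation bytes start with "10".
function codePointToUtf8Bytes(cp) {
  if (cp < 0x80) return [cp]; // 1 byte: 0xxxxxxx (plain ASCII)
  if (cp < 0x800) return [192 + (cp >>> 6), 128 + (cp & 63)]; // 110xxxxx 10xxxxxx
  if (cp < 0x10000)
    return [224 + (cp >>> 12), 128 + ((cp >>> 6) & 63), 128 + (cp & 63)];
  return [
    240 + (cp >>> 18), // 4-byte lead: 11110xxx
    128 + ((cp >>> 12) & 63),
    128 + ((cp >>> 6) & 63),
    128 + (cp & 63),
  ];
}

// Text -> UTF-8 decimal bytes -> 8-character binary strings.
function strToUtf8Binary(s) {
  const bytes = [...s].flatMap((ch) => codePointToUtf8Bytes(ch.codePointAt(0)));
  return bytes.map((b) => b.toString(2).padStart(8, "0"));
}

console.log(strToUtf8Binary("A")); // ["01000001"]
console.log(strToUtf8Binary("é")); // ["11000011", "10101001"]  (0xC3 0xA9)
```

Stopping at the decimal byte array (the output of `codePointToUtf8Bytes`) is exactly the "3rd step" shortcut: a base64 encoder only needs the bytes, not their binary string rendering.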
Usually, when people talk about converting text to Base64, without even saying it, the implication is that we are going to be using UTF-8 as the encoding scheme. If you pull up various online "text to base64" converters, chances are that 99% of them use either plain ASCII or UTF-8 for the encoding/decoding of text. In fact, the default charset used in Apps Script with Utilities.base64Decode() is UTF-8, and specifying UTF-32 for decoding is not an option! In my case, I was feeding the result of UNICODE() into my base64 generator, which was causing it to spit out a base64 encoded string that could only be properly decoded with a decoder that uses UTF-32 – which most decoders do not use.

Sidenote: JavaScript suffers the same issue!

This might surprise some people, but most JS engines do not use UTF-8 as the encoding scheme for strings – they use UTF-16! Similar to how I tested Google Sheets, this is easy to verify by checking the char code of a non-ASCII character. And guess where this gets special mention… the MDN page on base64! I am not alone in realizing the UTF-* issue with Base64!
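The char-code check mentioned above can be run in any JS console. The character here is my own example; the key signal is that a code point outside the Basic Multilingual Plane occupies two UTF-16 code units (a surrogate pair):

```javascript
// JS strings expose UTF-16 code units, not UTF-8 bytes or raw code points.
const s = "💻"; // U+1F4BB, outside the BMP, so UTF-16 needs a surrogate pair
console.log(s.length);         // 2 - two UTF-16 code units, not one "character"
console.log(s.charCodeAt(0));  // 55357 (0xD83D, high surrogate)
console.log(s.charCodeAt(1));  // 56507 (0xDCBB, low surrogate)
console.log(s.codePointAt(0)); // 128187 (0x1F4BB, the actual code point)
```

If JS strings were UTF-8 internally, `charCodeAt()` would have no surrogate pairs to report; the pair of values in the 0xD800–0xDFFF range is the giveaway that the engine stores UTF-16.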
Warning: This post is going to get into the "dangerously nerdy" territory. Quick links to solution:

The issue – Google Sheets does not use UTF-8

I recently ran into a head-scratcher of an issue with Google Sheets, where a formula that used the output of the UNICODE() and/or CODE() functions was not matching the expected value. After a bit of digging, I discovered that, although this is not really documented, most of the Google Sheets functions use UTF-32 (UTF-32BE) as the encoding scheme, whereas I was expecting UTF-8. This is made even more confusing by the fact that the Apps Script part of Google Sheets has an enum specifically for the UTF-8 charset.

To show this more clearly, here is a table, directly from Google Sheets, that shows that for various input characters, the CODE() function always returns the UTF-32BE value. Please do this to show what the bytes look like (seeing the character does not help – too many other things could be fiddling with it).

Why is this a problem? Well, most of the time it is not: as long as you use the same encoding scheme on both ends (input and output), it really doesn't matter. However, when you are feeding the result of UNICODE() into something where you don't control the decoder part, it becomes an issue. To be specific, I was running into this with writing a custom Base64 encoder function (see full details on that here).
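The mismatch can be reproduced outside of Sheets. This small comparison (assuming a runtime with the standard TextEncoder, e.g. a modern browser or Node) contrasts the code point value, which is what Sheets' UNICODE()/CODE() report, with the UTF-8 bytes that a typical base64 encoder actually consumes:

```javascript
// Code point (the UTF-32BE value, as returned by Sheets' UNICODE()/CODE())
// vs. the UTF-8 byte sequence for the same character.
const ch = "é"; // U+00E9
const codePoint = ch.codePointAt(0);                 // 233
const utf8Bytes = [...new TextEncoder().encode(ch)]; // [195, 169]
console.log(codePoint, utf8Bytes);
```

For ASCII the two representations coincide (one byte, same number), which is why the discrepancy only surfaces once non-ASCII characters are fed into the encoder.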