
Unicode Character vs String

Posted: Sun Jun 14, 2015 9:30 pm
by hubris
I have been thinking about how to manage Unicode and have a design consideration I thought I would try and voice.

Although there are many ways to represent Unicode, the most common appears to be UTF8, which is a multibyte encoding using 1 to 4 bytes per code point (the original design allowed up to 6). Given that a single character may be larger than one byte, the easy distinction within the traditional system of differentiating between char and char*, for example, is no longer valid, as both a character and a character string are effectively referenced via char*.

This implies that the programmer then becomes even more responsible for maintaining the distinction and ensuring only the correct routines are called, e.g. print_char(char*) and print_string(char*). That is really not OK, because it means we cannot use overloading to create routines of the same name when the types of both are identical, and the impact on iostreams, for example, would be significant, because we could no longer use them without some form of special handling of the scenario.

Because I am using C++ I can see two possibilities:

1) typedef UTF8* unichr; and typedef UTF8* unistr; which is a small step in the right direction but leaves too many opportunities for error
2) somewhat like the C++ string class: create wrapper classes that manage all of the distinctions between Unicode character and string values. I think this can be done with reasonable efficiency, because at the moment I cannot see why there would be a need for virtual methods and hence no vtable, so there should be no space penalty, and with appropriate use of inline no speed penalty either.
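
Something along the lines of option 2, as a minimal sketch (names and layout are purely illustrative):

Code:

#include <cstdint>
#include <iostream>
#include <string>

// Distinct types for a single code point and for a string, so that overload
// resolution can tell them apart. No virtual methods, so no vtable overhead.
class unichar {
public:
    explicit unichar(char32_t cp) : cp_(cp) {}
    char32_t value() const { return cp_; }
private:
    char32_t cp_;               // one Unicode code point (not a glyph)
};

class unistring {
public:
    unistring(const char* utf8) : bytes_(utf8) {}
    const std::string& utf8() const { return bytes_; }
private:
    std::string bytes_;         // stored as UTF-8 bytes
};

// The same name can now be overloaded safely:
void print(unichar c)          { std::cout << "U+" << std::hex << c.value() << '\n'; }
void print(const unistring& s) { std::cout << s.utf8() << '\n'; }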

Obviously once this path is taken most of POSIX would require attention, but for my purposes this is not an issue because I do not intend to conform to POSIX; while recognising the downside of this choice, it frees me to make some interesting experiments.

Anyone have any thoughts on an alternative to manage this issue?

Re: Unicode Character vs String

Posted: Sun Jun 14, 2015 9:54 pm
by bluemoon
Alternatively, you could use UTF16 internally and provide facilities to convert between UTF8 (or whatever external encoding) and the system encoding (UTF16).
This way the application programmer cares less about what the internal representation is (i.e. you may also add hashing, length caching etc., or even do the conversion lazily).

How to wrap the API for strings and characters is the last thing you want to care about; indeed, I wouldn't bother to provide a function for a single character.
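
A rough sketch of that kind of boundary conversion, for illustration only (this uses std::wstring_convert and <codecvt> from C++11, which are deprecated as of C++17, so treat it as a placeholder for whatever converter the OS provides):

Code:

#include <codecvt>   // std::codecvt_utf8_utf16 (C++11, deprecated in C++17)
#include <cstddef>
#include <locale>    // std::wstring_convert
#include <string>

// The system keeps strings as UTF-16 and converts at the boundary; cached
// hashes, lengths or lazy conversion could be added inside this class.
class SysString {
public:
    explicit SysString(const std::string& utf8) {
        std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> conv;
        data_ = conv.from_bytes(utf8);       // UTF-8 -> UTF-16
    }
    std::string to_utf8() const {
        std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> conv;
        return conv.to_bytes(data_);         // UTF-16 -> UTF-8
    }
    std::size_t code_units() const { return data_.size(); }  // UTF-16 code units, not glyphs
private:
    std::u16string data_;
};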

Re: Unicode Character vs String

Posted: Sun Jun 14, 2015 10:39 pm
by Roman
char* is not a UTF-8 character; a single character is an int.

Re: Unicode Character vs String

Posted: Sun Jun 14, 2015 10:51 pm
by hubris
there is not really much point in using UTF16 because that has already been superseded by UTF4 (32 bits), so that does not solve the issue, it simply moves it. I know I could inflate and deflate the values as they traverse between the OS and external non-Unicode representations, but this seems like a kludge.

I know UTF8 is not a char*. I was referring to the fact that, as UTF8 is multibyte, there is only a restricted set of cases where a single byte will store a Unicode character. As for using an int, it suffers the same inflate and deflate issues as above. Lastly, unless you have a 64 bit environment then you cannot store every unicode character in a single integer, which brings us back to both a Unicode character and a string being larger than a single integer, hence both being accessed via a pointer, and hence being impossible to distinguish based upon primitive type.

keep on thinking.

Re: Unicode Character vs String

Posted: Mon Jun 15, 2015 1:14 am
by Octocontrabass
How does print_char handle "Q̣̀"? There is no way to represent it as a single character.

Re: Unicode Character vs String

Posted: Mon Jun 15, 2015 1:23 am
by Hellbender
hubris wrote:lastly unless you have a 64 bit environment then you cannot store every unicode character in a single integer
I don't understand this. The total size of the Unicode code space is 17 × 65536 = 1114112, well within the range of 32-bit integers. On the other hand, those Unicode "values" are not characters in the traditional sense (as in "a character is an entity that can be input by the user and drawn to the screen independently as a single unit"), but "code points" (mostly associated with "abstract characters") that might need to be combined with other code points to yield actual "printable characters" (e.g. U+0303 "combining tilde" adds a tilde above the preceding character).

That is, there is no encoding that would allow you to encode "UTF characters" (the printable glyphs) in any fixed number of bits, but you can use an integer to hold individual code points. The "length of the encoded string" (e.g. number of chars in UTF-8; number of ints in UTF-32) is not the same as the "number of code points in the string" (except for UTF-32), and neither is the same as the "number of printable glyphs in the string". So the benefit of UTF-32 over other encodings is that you can easily tell the number of code points in a string (but that might not be as useful as it sounds).

So, forget the idea that there is a basic type for "printable characters". Use "char*" to represent UTF-8 encoded (possibly null-terminated) strings. Typedef an integer as wchar_t, and use "wchar_t*" to represent UTF-32 encoded (possibly null-terminated) strings. Use "print(X)" to print null-terminated strings (both UTF-8 and UTF-32 versions), and "print(X,int)" to print non-terminated strings with an explicit length. Don't bother printing individual "char"s or "wchar_t"s. Use the functions in "wchar.h" (or "cwchar" in your case).
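
A minimal sketch of that arrangement, purely illustrative and assuming a 32-bit wchar_t (true on most Unix-like targets, not on Windows):

Code:

#include <cstddef>
#include <cstdio>

// char* carries UTF-8, wchar_t* carries UTF-32; overloads keep the names uniform.
void print(const char* utf8) {                  // null-terminated UTF-8
    std::fputs(utf8, stdout);
}
void print(const char* utf8, std::size_t len) { // non-terminated UTF-8, explicit length
    std::fwrite(utf8, 1, len, stdout);
}
void print(const wchar_t* utf32) {              // null-terminated UTF-32
    std::printf("%ls", utf32);                  // converted via the current locale
}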

Re: Unicode Character vs String

Posted: Mon Jun 15, 2015 2:10 am
by Combuster
UTF-8 decoding only matters if you're actively processing the exact content of a string, in which case you need multibyte support anyway to deal with composition, regardless of whether you use an 8-bit or 32-bit representation. For most dataflow and parsing operations this is meaningless: the app doesn't look at the string's internals, you only make copies and possibly split at 1-byte character boundaries (newlines, quotes, xml, json, nulls), which we know are in the ASCII range.
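
For example, a line splitter never has to decode anything; a minimal sketch (the function name is made up for illustration):

Code:

#include <string>
#include <vector>

// Split a UTF-8 buffer on '\n' without decoding. This is safe because every
// byte of a UTF-8 multibyte sequence has its high bit set, so an ASCII
// delimiter byte can never appear inside one.
std::vector<std::string> split_lines(const std::string& utf8) {
    std::vector<std::string> lines;
    std::size_t start = 0;
    for (std::size_t i = 0; i < utf8.size(); ++i) {
        if (utf8[i] == '\n') {
            lines.push_back(utf8.substr(start, i - start));
            start = i + 1;
        }
    }
    lines.push_back(utf8.substr(start));   // trailing piece (may be empty)
    return lines;
}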

For anything that's not among those simple cases, you need to tackle things thoroughly or you'll have shabby support at whichever level you try doing things.

In my OS all string arguments are defined to be UTF-8.

Re: Unicode Character vs String

Posted: Mon Jun 15, 2015 6:27 am
by embryo2
hubris wrote:Because I am using C++ I can see two possibilities:

1) typedef UTF8* unichr; and typedef UTF8* unistr; which is a small step in the right direction but leaves too many opportunities for error
2) somewhat like the C++ string class: create wrapper classes that manage all of the distinctions between Unicode character and string values. I think this can be done with reasonable efficiency, because at the moment I cannot see why there would be a need for virtual methods and hence no vtable, so there should be no space penalty, and with appropriate use of inline no speed penalty either.
What prevents you from using #2? Or even simpler: just use the C++ string class directly.

If you prefer something without a space penalty then a lot of work awaits you. But memory is cheap. However, you can still try the hard way.

Re: Unicode Character vs String

Posted: Mon Jun 15, 2015 9:57 am
by linguofreak
hubris wrote:there is not really much point in using UTF16 because that has already been superseded by UTF4 (32 bits), so that does not solve the issue, it simply moves it. I know I could inflate and deflate the values as they traverse between the OS and external non-Unicode representations, but this seems like a kludge.
UTF-32 (there is no UTF4) does not "supersede" UTF-16. They are both perfectly valid ways of representing Unicode with different minimum character widths.

Re: Unicode Character vs String

Posted: Mon Jun 15, 2015 11:18 am
by Brendan
Hi,
linguofreak wrote:
hubris wrote:there is not really much point in using UTF16 because that has already been superseded by UTF4 (32 bits), so that does not solve the issue, it simply moves it. I know I could inflate and deflate the values as they traverse between the OS and external non-Unicode representations, but this seems like a kludge.
UTF-32 (there is no UTF4) does not "supersede" UTF-16. They are both perfectly valid ways of representing Unicode with different minimum character widths.
UCS-2 is superseded (it is not capable of storing all Unicode code points), and virtually everything that used it has been modified/upgraded to use UTF-16 instead.

Mostly, you need two encodings: one for efficient storage and transfer (e.g. UTF-8) and one for efficient internal processing (e.g. UTF-32LE on little-endian architectures like 80x86). UTF-16 isn't ideal for either case and probably shouldn't be used (unless you have to deal with a backward compatibility mess caused by "UCS-2 upgraded to UTF-16").
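
As an illustration of that split, a minimal (non-validating) UTF-8 to UTF-32 decoder might look like the sketch below; a production decoder would also reject overlong sequences, surrogate code points and values above U+10FFFF:

Code:

#include <cstddef>
#include <string>
#include <vector>

// Decode UTF-8 (storage form) into UTF-32 code points (processing form).
// Malformed input is replaced with U+FFFD rather than handled properly.
std::vector<char32_t> decode_utf8(const std::string& in) {
    std::vector<char32_t> out;
    for (std::size_t i = 0; i < in.size(); ) {
        unsigned char b = static_cast<unsigned char>(in[i]);
        std::size_t len;
        char32_t cp;
        if      (b < 0x80)           { len = 1; cp = b; }
        else if ((b & 0xE0) == 0xC0) { len = 2; cp = b & 0x1F; }
        else if ((b & 0xF0) == 0xE0) { len = 3; cp = b & 0x0F; }
        else if ((b & 0xF8) == 0xF0) { len = 4; cp = b & 0x07; }
        else { out.push_back(0xFFFD); ++i; continue; }     // invalid lead byte

        if (i + len > in.size()) { out.push_back(0xFFFD); break; }
        bool ok = true;
        for (std::size_t j = 1; j < len; ++j) {
            unsigned char c = static_cast<unsigned char>(in[i + j]);
            if ((c & 0xC0) != 0x80) { ok = false; break; } // not a continuation byte
            cp = (cp << 6) | (c & 0x3F);
        }
        out.push_back(ok ? cp : 0xFFFD);
        i += ok ? len : 1;
    }
    return out;
}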


Cheers,

Brendan

Re: Unicode Character vs String

Posted: Mon Jun 15, 2015 11:39 pm
by hubris
Thank you all. Opinions, even when not in agreement, provide a different perspective. Brendan's last post kind of aligned with where I was going, with obviously many more questions to come. On the compatibility issue I am taking a considered stance of not conforming; I know this is likely to lead to failure, but I am predominantly doing this for learning and experimenting rather than world domination.

However, I take the point that this has to be embedded correctly from the ground up, and given my starting point that is no constraint. We will see whether suicide or success is in the future.

My next topic is how to represent Unicode/UTF-8 and the architecture required to display characters on the screen. This is not an urgent topic for me as I am far away from that position. Currently I am still battling the evils of trying to build a cross-compiler on Cygwin; one step forward and three back seems to be my current progress.