Unicode Character vs String

hubris
Member
Posts: 28
Joined: Sun May 24, 2015 12:38 am
Location: Brisbane, Australia

Unicode Character vs String

Post by hubris »

I have been thinking about how to manage Unicode and have a design consideration I thought I would try to voice.

Although there are many ways to represent Unicode, the most common appears to be UTF-8, a multibyte encoding that uses 1 to 4 bytes per code point (the original design allowed up to 6). Given that a single character may be larger than one byte, the traditional distinction between char and char* no longer holds: both a character and a character string are effectively referenced via char*.

This implies that the programmer becomes even more responsible for managing the distinction and for ensuring that only the correct routines are called, e.g. print_char(char*) and print_string(char*). That is really not OK, because we cannot use polymorphism (overloading) to create routines of the same name when both parameters have the same type. The impact on iostreams, for example, would be significant, because we could no longer use them without some form of special handling for this scenario.

Because I am using c++ I can see two possibilities:

1) typedef UTF8* unichr; and UTF8* unistr; which is a small step in the right direction but leaves too many opportunities for error
2) Somewhat like the C++ string class: create wrapper classes that manage all of the distinctions between Unicode character and string values. I think this can be done with reasonable efficiency, because I cannot see why there would be a need for virtual methods; hence no vtable, so there should be no space penalty, and with appropriate use of inline no speed penalty either.
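A minimal sketch of what option 2 could look like. The names UniChar, UniStr and describe are hypothetical, chosen only for illustration. It assumes a "character" here means a single code point stored as char32_t and that strings are held as UTF-8 bytes. Because the two wrappers are distinct types, ordinary overloading works again, and without virtual functions neither class carries a vtable pointer.

```cpp
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical wrapper for one Unicode code point
// (not necessarily one user-perceived character).
class UniChar {
public:
    constexpr explicit UniChar(char32_t cp) : cp_(cp) {}
    constexpr char32_t value() const { return cp_; }
private:
    char32_t cp_;
};

// Hypothetical wrapper for a UTF-8 encoded string; owns its bytes.
class UniStr {
public:
    explicit UniStr(std::string utf8) : bytes_(std::move(utf8)) {}
    const std::string& bytes() const { return bytes_; }
    std::size_t byte_length() const { return bytes_.size(); }
private:
    std::string bytes_;
};

// Distinct types let the compiler pick the right routine by overload.
inline const char* describe(const UniChar&) { return "code point"; }
inline const char* describe(const UniStr&)  { return "string"; }

// No virtual functions, so the wrapper is the same size as the raw value.
static_assert(sizeof(UniChar) == sizeof(char32_t),
              "UniChar has no space overhead");
```

Calling describe(UniChar(U'Q')) and describe(UniStr("abc")) then resolves to different functions at compile time, which is exactly what print_char(char*) and print_string(char*) cannot do.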

Obviously, once this path is taken most of POSIX would require attention, but for my purposes this is not an issue because I do not intend to conform to POSIX. While I recognise the downside of this choice, it frees me to make some interesting experiments.

Anyone have any thoughts on an alternative to manage this issue?
bluemoon
Member
Posts: 1761
Joined: Wed Dec 01, 2010 3:41 am
Location: Hong Kong

Re: Unicode Character vs String

Post by bluemoon »

Alternatively, you may use UTF-16 internally and provide facilities to convert between UTF-8 (or whatever encoding) and the system's UTF-16.
This way the application programmer cares less about what the internal representation is (i.e. you may also add hashing, length, etc., or even do the conversion lazily).

How to wrap the API for strings and characters is the last thing you want to care about; indeed, I wouldn't bother to provide functions for single characters.
Roman
Member
Posts: 568
Joined: Thu Mar 27, 2014 3:57 am
Location: Moscow, Russia

Re: Unicode Character vs String

Post by Roman »

char* is not a UTF-8 character; a single character is an int.
"If you don't fail at least 90 percent of the time, you're not aiming high enough."
- Alan Kay
hubris
Member
Posts: 28
Joined: Sun May 24, 2015 12:38 am
Location: Brisbane, Australia

Re: Unicode Character vs String

Post by hubris »

there is not really much point in using UTF16 because that has already been superseded by UTF4 (32 bits) so that does not solve the issue it simply moves it. I know I could inflate and deflate the values as they traverse between the OS and external non unicode representations but this seems like a kludge.

I know UTF8 is not a char*. I was referring to the fact that, as UTF8 is multibyte, there are only a restricted set of cases where a single byte will store a unicode character. As for using an int, it suffers the same inflate and deflate issues as above; lastly unless you have a 64 bit environment then you cannot store every unicode character in a single integer, which brings us back to both unicode character and string being larger than a single integer, hence both being accessed via a pointer, and hence impossible to distinguish based upon primitive type.

keep on thinking.
Octocontrabass
Member
Posts: 5588
Joined: Mon Mar 25, 2013 7:01 pm

Re: Unicode Character vs String

Post by Octocontrabass »

How does print_char handle "Q̣̀"? There is no way to represent it as a single character.
Hellbender
Member
Posts: 63
Joined: Fri May 01, 2015 2:23 am
Libera.chat IRC: Hellbender

Re: Unicode Character vs String

Post by Hellbender »

hubris wrote:lastly unless you have a 64 bit environment then you cannot store every unicode character in a single integer
I don't understand this. The total size of the Unicode code space is 17 × 65536 = 1114112, well within the range of 32-bit integers. On the other hand, those unicode "values" are not characters in the traditional sense (as in "a character is an entity that can be input by the user and drawn to the screen independently as a single unit"), but "code points" (most associated with "abstract characters") that might need to be combined with other code points to yield actual "printable characters" (e.g. U+0303 "combining tilde" adds a tilde above the preceding symbol).

That is, there is no encoding that would allow you to encode "UTF characters" (the printable glyphs) in any fixed number of bits, but you can use an integer to hold individual code points. The "length of encoded string" (e.g. number of chars in UTF-8; number of ints in UTF-32) is not the same as the "number of code points in the string" (except for UTF-32), and neither is the same as the "number of printable glyphs in the string". So the benefit of UTF-32 over other encodings is that you can easily tell the number of code points in a string (but that might not be as useful as it sounds).
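To make the three different "lengths" concrete, here is a small sketch. The helper names count_code_points and q_with_marks are made up for this example. The test string is assumed to be 'Q' followed by two combining marks (U+0323 and U+0300), which is one plausible reading of the example earlier in the thread. The counter assumes well-formed UTF-8 and does no validation.

```cpp
#include <cstddef>
#include <string>

// Count code points in a UTF-8 string by counting every byte that is not
// a continuation byte (continuation bytes have the bit pattern 10xxxxxx).
// Assumes the input is well-formed UTF-8; no validation is done.
inline std::size_t count_code_points(const std::string& utf8) {
    std::size_t n = 0;
    for (unsigned char b : utf8) {
        if ((b & 0xC0) != 0x80) ++n;
    }
    return n;
}

// 'Q' followed by U+0323 (combining dot below) and U+0300 (combining
// grave accent), written as explicit UTF-8 bytes so that the encoding
// of the source file does not matter.
inline std::string q_with_marks() { return "Q\xCC\xA3\xCC\x80"; }
```

For this string, size() reports 5 bytes and count_code_points reports 3 code points, yet it is drawn as a single glyph. Getting that last number needs grapheme cluster segmentation, which is well beyond this sketch.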

So, forget the idea that there is a basic type for "printable characters". Use "char*" to represent UTF-8 encoded (possibly null-terminated) strings. Typedef integer as wchar_t, and use "wchar_t*" to represent UTF-32 encoded (possibly null-terminated) strings. Use "print(X)" to print null-terminated strings (both UTF-8 and UTF-32 versions), and "print(X,int)" to print non-terminated strings. Don't bother printing individual "char"s or "wchar_t"s. Use functions in "wchar.h" (or "cwchar" in your case).
Hellbender OS at github.
Combuster
Member
Posts: 9301
Joined: Wed Oct 18, 2006 3:45 am
Libera.chat IRC: [com]buster
Location: On the balcony, where I can actually keep 1½m distance

Re: Unicode Character vs String

Post by Combuster »

UTF-8 decoding only matters if you're actively processing the exact content of a string, in which case you need multibyte support anyway to deal with composition, regardless of whether you use an 8-bit or 32-bit representation. For most dataflow and parsing operations this is meaningless, as the app doesn't look at the string's internals; it only makes copies and possibly splits at 1-byte character boundaries (newlines, quotes, XML, JSON, nulls), which we know are in the ASCII range.
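As an illustration of the simple case, here is a sketch of a splitter that never decodes anything. split_ascii is a made-up name, and the delimiter is assumed to be an ASCII character. It relies on the UTF-8 property that every byte of a multibyte sequence has its high bit set, so such a byte can never be mistaken for an ASCII delimiter.

```cpp
#include <string>
#include <vector>

// Split a UTF-8 string on an ASCII delimiter by scanning raw bytes.
// Safe without decoding: every byte of a multibyte UTF-8 sequence has
// its high bit set, so it can never compare equal to an ASCII byte.
inline std::vector<std::string> split_ascii(const std::string& utf8,
                                            char delim) {
    std::vector<std::string> parts;
    std::string::size_type start = 0;
    for (;;) {
        std::string::size_type pos = utf8.find(delim, start);
        if (pos == std::string::npos) {
            parts.push_back(utf8.substr(start));  // last (or only) piece
            return parts;
        }
        parts.push_back(utf8.substr(start, pos - start));
        start = pos + 1;  // skip the one-byte delimiter
    }
}
```

Each piece that comes out is still valid UTF-8 as long as the input was, because the cuts only ever land on ASCII bytes.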

For anything that's not among those simple cases, you need to tackle things thoroughly or you'll have shabby support at whichever level you try doing things.

In my OS all string arguments are defined to be UTF-8.
"Certainly avoid yourself. He is a newbie and might not realize it. You'll hate his code deeply a few years down the road." - Sortie
[ My OS ] [ VDisk/SFS ]
embryo2
Member
Posts: 397
Joined: Wed Jun 03, 2015 5:03 am

Re: Unicode Character vs String

Post by embryo2 »

hubris wrote:Because I am using c++ I can see two possibilities:

1) typedef UTF8* unichr; and UTF8* unistr; which is a small step in the right direction but leaves too many opportunities for error
2) Somewhat like the C++ string class: create wrapper classes that manage all of the distinctions between Unicode character and string values. I think this can be done with reasonable efficiency, because I cannot see why there would be a need for virtual methods; hence no vtable, so there should be no space penalty, and with appropriate use of inline no speed penalty either.
What prevents you from using #2? Or even simpler: just use the C++ string class directly.

If you prefer something without a space penalty, then a lot of work awaits you. But memory is cheap. However, you can still try the hard way.
My previous account (embryo) was accidentally deleted, so I have no chance but to use something new. But may be it was a good lesson about software reliability :)
linguofreak
Member
Posts: 510
Joined: Wed Mar 09, 2011 3:55 am

Re: Unicode Character vs String

Post by linguofreak »

hubris wrote:there is not really much point in using UTF16 because that has already been superseded by UTF4 (32 bits) so that does not solve the issue it simply moves it. I know I could inflate and deflate the values as they traverse between the OS and external non unicode representations but this seems like a kludge.
UTF-32 (there is no UTF4) does not "supersede" UTF-16. They are both perfectly valid ways of representing Unicode with different minimum character widths.
Brendan
Member
Posts: 8561
Joined: Sat Jan 15, 2005 12:00 am
Location: At his keyboard!

Re: Unicode Character vs String

Post by Brendan »

Hi,
linguofreak wrote:
hubris wrote:there is not really much point in using UTF16 because that has already been superseded by UTF4 (32 bits) so that does not solve the issue it simply moves it. I know I could inflate and deflate the values as they traverse between the OS and external non unicode representations but this seems like a kludge.
UTF-32 (there is no UTF4) does not "supersede" UTF-16. They are both perfectly valid ways of representing Unicode with different minimum character widths.
UCS-2 is superseded (not capable of storing all Unicode code-points); and virtually everything that used it has been modified/upgraded to use UTF-16 instead.

Mostly, you need 2 encodings - one for efficient storage and transfer (e.g. UTF-8) and one for efficient internal processing (e.g. UTF-32LE on little-endian architectures like 80x86). UTF-16 isn't ideal for either case and probably shouldn't be used (unless you have to deal with a backward compatibility mess caused by "UCS-2 upgraded to UTF-16").
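A minimal sketch of the boundary between the two forms, converting UTF-8 as stored or received into UTF-32 for processing. utf8_to_utf32 is a hypothetical name. The code assumes well-formed input and deliberately skips the validation a real decoder needs (overlong forms, surrogates, stray or truncated sequences). std::u32string holds code points in native byte order, which on 80x86 means little-endian.

```cpp
#include <cstddef>
#include <string>

// Decode UTF-8 (storage form) into UTF-32 (processing form).
// A minimal sketch: assumes well-formed input and does not reject
// overlong forms, surrogates, or stray continuation bytes.
inline std::u32string utf8_to_utf32(const std::string& in) {
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        unsigned char b = static_cast<unsigned char>(in[i]);
        char32_t cp;
        std::size_t extra;  // number of continuation bytes to read
        if (b < 0x80)      { cp = b;        extra = 0; }  // 0xxxxxxx
        else if (b < 0xE0) { cp = b & 0x1F; extra = 1; }  // 110xxxxx
        else if (b < 0xF0) { cp = b & 0x0F; extra = 2; }  // 1110xxxx
        else               { cp = b & 0x07; extra = 3; }  // 11110xxx
        ++i;
        // Each continuation byte contributes its low six bits.
        for (; extra > 0 && i < in.size(); --extra, ++i) {
            cp = (cp << 6) | (static_cast<unsigned char>(in[i]) & 0x3F);
        }
        out.push_back(cp);
    }
    return out;
}
```

Once the text is in this form, indexing by code point is a plain array access, although combining sequences still span several elements.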


Cheers,

Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
hubris
Member
Posts: 28
Joined: Sun May 24, 2015 12:38 am
Location: Brisbane, Australia

Re: Unicode Character vs String

Post by hubris »

Thank you all. Opinions, even when not in agreement, provide a different perspective. Brendan's last post kind of aligned with where I was going, with obviously many more questions to come. On compatibility, I am taking a considered stance of not conforming. I know this is likely to lead to failure, but I am predominantly doing this for learning and experimenting rather than world domination.

However, I take the point that this has to be embedded correctly from the ground up; given my starting point, that is no constraint. We will see whether suicide or success is in the future.

My next topic is how to represent unicode/utf8 and the architecture required to display characters on the screen. This is not an urgent topic for me as I am far away from that position. Currently I am still battling the evils of trying to build a cross-compiler on cygwin; one step forward and 3 back seems to be my current progress.