how to implement UTF-8

Posted: Sun Mar 01, 2009 8:32 am
by negcit.K
Hi, all.
In Unicode, every character has a unique code point, but there are several different standards for encoding those code points.
What I want to implement is UTF-8, which is quite popular now. But one thing I can't understand is:

In ASCII, if we call printf("%c", 0x41), it prints the character 'A'.

In UTF-8, the character '空' (U+7A7A) is encoded as three bytes: 0xE7 0xA9 0xBA. Of course, we can detect whether a byte sequence is a multi-byte UTF-8 character or not, but once we have detected it, then
char *utf_str = "空";
printf("%s", utf_str);

will it print the right character '空'?

I mean, we can detect whether the bytes form a UTF-8 character, but how does the computer know this is a UTF-8 character? How will it put '空' on the screen?
Thanks!

Re: how to implement UTF-8

Posted: Sun Mar 01, 2009 9:26 am
by xenos
That depends on the implementation of the printf function. To display UTF-8 on the screen, you need two things:
  • A font that contains character data (i.e. graphical representations) for each character you want to display.
  • A printf function that recognizes UTF-8 and looks up UTF-8 characters in the given font. If the character is found in the font, it is displayed on the screen. If it can't be found in the font, printf should output some placeholder instead, maybe a box. (My Firefox does this with the UTF-8 character in your post.)
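
The lookup-with-placeholder step could be sketched roughly as follows, assuming the character has already been decoded to a code point. The `glyph` struct, the `font` table and `find_glyph` are hypothetical names, and the table contents are made up for illustration; a real font would carry bitmap or outline data.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical glyph record: a code point plus an index into bitmap data. */
struct glyph {
    uint32_t codepoint;
    int      bitmap_index;
};

/* Illustrative font table, kept sorted by code point for binary search. */
static const struct glyph font[] = {
    { 0x0041, 1 },   /* 'A' */
    { 0x0042, 2 },   /* 'B' */
    { 0x25A1, 0 },   /* WHITE SQUARE, used here as the placeholder box */
    { 0x7A7A, 3 },   /* the character from the original post */
};

#define FONT_SIZE   (sizeof font / sizeof font[0])
#define PLACEHOLDER 0x25A1u

/* Binary search over the sorted table; returns NULL if cp has no glyph. */
static const struct glyph *search(uint32_t cp)
{
    size_t lo = 0, hi = FONT_SIZE;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (font[mid].codepoint == cp)
            return &font[mid];
        if (font[mid].codepoint < cp)
            lo = mid + 1;
        else
            hi = mid;
    }
    return NULL;
}

/* Return the glyph for cp, or the placeholder glyph if the font lacks it. */
const struct glyph *find_glyph(uint32_t cp)
{
    const struct glyph *g = search(cp);

    return g != NULL ? g : search(PLACEHOLDER);
}
```

A console driver would call something like `find_glyph()` once per decoded code point and blit whatever bitmap comes back.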

Re: how to implement UTF-8

Posted: Sun Mar 01, 2009 10:54 pm
by teraflop
Having your console driver interpret UTF-8 will probably be fairly tricky, because it's a variable-length encoding. It would be much simpler to use an underlying fixed-size encoding like UCS-2 or UCS-4, along with a separate converter to decode UTF-8 strings.
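
A separate converter of that kind might look like the sketch below. The name `utf8_to_ucs4` and the choice to substitute U+FFFD for malformed input are assumptions made for this example, not something prescribed in the thread.

```c
#include <stddef.h>
#include <stdint.h>

#define REPLACEMENT 0xFFFDu   /* U+FFFD, substituted for malformed input */

/*
 * Decode a NUL-terminated UTF-8 string into an array of 32-bit code points.
 * Writes at most cap entries to out and returns the number written.
 */
size_t utf8_to_ucs4(const char *in, uint32_t *out, size_t cap)
{
    const unsigned char *s = (const unsigned char *)in;
    size_t count = 0;

    while (*s != 0 && count < cap) {
        uint32_t cp, min;
        int extra;   /* number of continuation bytes expected */
        int ok = 1;

        if (*s < 0x80)                { cp = *s;        extra = 0;  min = 0; }
        else if ((*s & 0xE0) == 0xC0) { cp = *s & 0x1F; extra = 1;  min = 0x80; }
        else if ((*s & 0xF0) == 0xE0) { cp = *s & 0x0F; extra = 2;  min = 0x800; }
        else if ((*s & 0xF8) == 0xF0) { cp = *s & 0x07; extra = 3;  min = 0x10000; }
        else                          { cp = 0;         extra = -1; min = 0; }
        s++;

        if (extra < 0) {              /* stray continuation or invalid lead byte */
            out[count++] = REPLACEMENT;
            continue;
        }

        for (int i = 0; i < extra; i++) {
            if ((*s & 0xC0) != 0x80) {   /* also stops at the terminating NUL */
                ok = 0;
                break;
            }
            cp = (cp << 6) | (*s & 0x3F);
            s++;
        }

        /* Reject truncated, overlong, out-of-range and surrogate values. */
        if (!ok || cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            cp = REPLACEMENT;

        out[count++] = cp;
    }
    return count;
}
```

With this in place, the console code only ever sees fixed-size values; for the bytes 0xE7 0xA9 0xBA it receives the single code point U+7A7A.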

Re: how to implement UTF-8

Posted: Sun Mar 01, 2009 11:38 pm
by Solar
Use multibyte (variable-length, e.g. UTF-8) encodings for storage, and wide (fixed-length, e.g. UTF-32) encodings for internal handling.

Don't use "%s", that's for 8-bit chars. "%ls" is for wide character encodings. Check out the functions defined in <wctype.h>, <wchar.h>, and <uchar.h>.
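
As a small illustration of "%ls", here is a sketch that formats a wide string into a wide buffer with swprintf() from <wchar.h>. The helper name `format_wide` is made up for this example, and it assumes a hosted C library where wchar_t is wide enough for the characters involved (it is 32 bits on typical Linux systems, 16 bits on Windows).

```c
#include <stddef.h>
#include <wchar.h>

/*
 * Wrap a wide string in brackets using the "%ls" conversion.
 * Returns the number of wide characters written, or a negative value
 * if the result does not fit in cap wide characters.
 */
int format_wide(wchar_t *buf, size_t cap, const wchar_t *text)
{
    return swprintf(buf, cap, L"[%ls]", text);
}
```

Writing the same string to a byte stream with printf("%ls", ...) makes the C library convert it to a multibyte encoding according to the current locale, so a hosted program would normally call setlocale(LC_ALL, "") under a UTF-8 locale first. In your own kernel, that conversion is something you have to provide yourself.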

Re: how to implement UTF-8

Posted: Mon Mar 02, 2009 2:49 am
by Solar
berkus wrote:I think there's no UTF-32 technically, it should be UCS-4.
Copypaste from Wikipedia:UTF-32:
UTF-32 was originally a subset of the UCS-4 standard, but the Principles and Procedures document of JTC1/SC2/WG2 states that all future assignments of characters will be constrained to the BMP or the first 14 supplementary planes, and has removed former provisions for private-use code positions in groups 60 to 7F and in planes E0 to FF.

Accordingly UCS-4 and UTF-32 are now identical except that the UTF-32 standard has additional Unicode semantics.
Note that for many uses, UTF-16 could be absolutely sufficient.
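
To make the UTF-16 trade-off concrete: code points in the BMP fit in one 16-bit unit, while anything above U+FFFF needs a surrogate pair. A minimal encoder sketch (the function name `utf16_encode` is an assumption for this example):

```c
#include <stdint.h>

/*
 * Encode one code point as UTF-16.
 * Returns the number of 16-bit units written (1 or 2), or 0 if cp is
 * not a valid Unicode scalar value.
 */
int utf16_encode(uint32_t cp, uint16_t out[2])
{
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;                      /* out of range or a lone surrogate */

    if (cp < 0x10000) {                /* BMP: stored as-is */
        out[0] = (uint16_t)cp;
        return 1;
    }

    cp -= 0x10000;                     /* 20 bits remain */
    out[0] = (uint16_t)(0xD800 | (cp >> 10));     /* high surrogate */
    out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF));   /* low surrogate */
    return 2;
}
```

So UTF-16 is only fixed-width as long as the text stays inside the BMP; code that must handle supplementary planes still has to deal with two-unit sequences.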

Re: how to implement UTF-8

Posted: Mon Mar 02, 2009 3:08 am
by Solar
berkus wrote:So, technically there's no UTF-32, since UCS-4 pretty much covers it :P
Erm... no?!?

UCS-4 and UTF-32 are now identical except that the UTF-32 standard has additional Unicode semantics.

Re: how to implement UTF-8

Posted: Mon Mar 02, 2009 5:28 am
by jal
berkus wrote:UTF-8 is self-synchronizing, but parsing it all the time is a total waste of CPU
I don't agree. It is not that difficult to parse it, nor is it very CPU intensive: a few compares and shifts, basically. Yes, it makes life a little bit more difficult when using stuff like strlen(), but that function is overused as it is anyway...

The real challenge in supporting Unicode in general is all the combining character stuff vs. pre-combined characters. And no matter which storage mechanism you use, that's gonna be a pain anyway (and the strlen() problem isn't solved even when using UTF-32).
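
Both points can be seen with a small code point counter. This is only a sketch: `utf8_codepoints` is a made-up name, and it assumes the input is already valid UTF-8.

```c
#include <stddef.h>

/*
 * Count code points in a NUL-terminated UTF-8 string by counting every
 * byte that is not a continuation byte (10xxxxxx).
 * Assumes the input is valid UTF-8; no validation is done.
 */
size_t utf8_codepoints(const char *s)
{
    size_t n = 0;

    for (; *s != '\0'; s++)
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    return n;
}
```

For the pre-combined 'é' (U+00E9, bytes C3 A9), strlen() reports 2 and this function reports 1. For 'e' followed by the combining acute accent U+0301 (bytes 65 CC 81), strlen() reports 3 and this function reports 2, although the user still sees a single character. The second mismatch would be exactly the same in UTF-32.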

JAL

Re: how to implement UTF-8

Posted: Mon Mar 02, 2009 9:39 am
by jal
berkus wrote:Yes, but it's still a lot easier to work with UTF-16.
It really depends. 'Easier' in this case becomes easily 'sloppy', imho. All you need is a few low-level functions and you're done.
berkus wrote:True, a smarter string class in that case is going to help.
Indeed. And such a class can take care of UTF-8 anyway :).


JAL

Re: how to implement UTF-8

Posted: Mon Mar 02, 2009 2:15 pm
by jal
berkus wrote:We can argue for a long time, but a practical implementation I trust design-wise uses UTF-16 for speed reasons and I would believe UTF-8 implementation would be horrendously slow in more cases than utf-16 one.
True, one can argue a lot :). UTF-16 is faster, although one needs to consider UTF-32 if one wants to be truly compatible with all East-Asian extensions and whatnot. The slowness of UTF-8 largely depends on the characters used: in English text there's hardly a speed penalty, in e.g. French text with many accents it may be more.
berkus wrote:It's relatively easy to benchmark though.
I think it's very difficult to benchmark, as it'll depend heavily on the characters used (so the language/script used as well), and on the operations performed. But UTF-16 will be faster, there's no doubt about it.
berkus wrote:Another assumption is that glyph calculations will be a lot slower than codepoint manipulation and therefore utf-8 vs utf-16 is largely irrelevant.
I agree. There's so much going on on a modern OS, that I doubt that it'll make a large difference. Therefore I prefer UTF-8, as it is the most concise form available for most languages and scripts.
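
How much space UTF-8 takes per character can be read straight off an encoder. The sketch below uses a made-up function name, `utf8_encode`, and returns the number of bytes it wrote.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Encode one code point as UTF-8.
 * Returns the number of bytes written (1 to 4), or 0 if cp is not a
 * valid Unicode scalar value.
 */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {                              /* ASCII */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                             /* e.g. accented Latin */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                           /* rest of the BMP */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                             /* surrogates are not characters */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                         /* supplementary planes */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

'A' takes 1 byte, 'é' (U+00E9) takes 2, and '空' (U+7A7A) takes 3, whereas UTF-16 uses 2 bytes for each of them. English text is therefore half the size in UTF-8, accented Latin text is somewhere in between, and CJK text is larger in UTF-8 than in UTF-16.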


JAL

Re: how to implement UTF-8

Posted: Tue Mar 03, 2009 4:41 pm
by frazzledjazz
char and widechar, right? widechar is a word, right? gotcha so far.

sorry, the structures are not quite the same in pascal.

Re: how to implement UTF-8

Posted: Wed Mar 04, 2009 7:30 am
by jal
Yes, that's a good one everybody should read.


JAL

Re: how to implement UTF-8

Posted: Wed Mar 04, 2009 8:25 am
by AJ

Re: how to implement UTF-8

Posted: Wed Mar 04, 2009 8:54 am
by jal
AJ wrote:Nice - interesting read.
It was only quite recently I felt the need to scold someone here who claimed he didn't need Unicode since he was from the US and would never support any other language. Some people are so ignorant...


JAL

Re: how to implement UTF-8

Posted: Wed Mar 04, 2009 9:14 am
by AJ
I'm embarrassed to say that I currently only support ANSI encoding and English character sets. This situation should change in the near future, though.

IMO, even if you only plan on Americans using your OS, this argument evaporates as soon as you plan on a) Adding network or email functionality to your OS, or b) Exchanging files with any other computer.

Cheers,
Adam

Re: how to implement UTF-8

Posted: Thu Mar 05, 2009 4:47 am
by jal
AJ wrote:IMO, even if you only plan on Americans using your OS, this argument evaporates as soon as you plan on a) Adding network or email functionality to your OS, or b) Exchanging files with any other computer.
c) Defining 'American' other than 'WASP'.


JAL