how to implement UTF-8

negcit.K · Post by **negcit.K** » Sun Mar 01, 2009 8:32 am

Hi, all.
In Unicode, every character has a unique code point, but there are many different standards for encoding those code points.
What I want to implement is UTF-8, which is quite popular now. But there is one thing I can't understand:

In ASCII, printf("%c", 0x41) prints the character 'A'.

In UTF-8, the character '空' (U+7A7A) is encoded as three bytes: 0xE7 0xA9 0xBA. Of course, we can detect whether a byte sequence is a multi-byte character or not, but once we have detected it, given
char *utf_str = "空";
printf("%s", utf_str);

will it print the right character '空'?

I mean, we can detect whether a sequence is a UTF-8 character, but how does the computer know that it is a UTF-8 character? How will it put '空' on the screen?
Thanks!

xenos · Post by **xenos** » Sun Mar 01, 2009 9:26 am

That depends on the implementation of the printf function. To display UTF-8 on the screen, you need two things:

A font that contains character data (i.e. graphical representations) for each character you want to display.
A printf function that recognizes UTF-8 and looks up UTF-8 characters in the given font. If the character is found in the font, it is displayed on the screen. If it can't be found in the font, printf should output some placeholder instead, maybe a box. (My Firefox does this with the UTF-8 character in your post.)
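
A minimal sketch of the lookup-with-placeholder step described above, assuming the text has already been decoded to code points. The `struct glyph` layout, the `font` table, and the `find_glyph` function are hypothetical names for illustration, and the bitmaps are made up; a real font format will look different.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 8x8 bitmap glyph; real font formats differ. */
struct glyph {
    uint32_t codepoint;   /* Unicode code point this glyph draws */
    uint8_t  rows[8];     /* one byte per pixel row */
};

/* Tiny example font, sorted by code point (illustrative bitmaps only). */
static const struct glyph font[] = {
    { 0x0041, { 0x18, 0x24, 0x42, 0x7E, 0x42, 0x42, 0x42, 0x00 } }, /* 'A' */
    { 0x0042, { 0x7C, 0x42, 0x7C, 0x42, 0x42, 0x42, 0x7C, 0x00 } }, /* 'B' */
};

/* Placeholder box used when the font has no glyph for a code point. */
static const struct glyph missing_glyph = {
    0xFFFD, { 0x7E, 0x42, 0x42, 0x42, 0x42, 0x42, 0x7E, 0x00 }
};

/* Binary search over the sorted font table; never returns NULL. */
const struct glyph *find_glyph(uint32_t codepoint)
{
    size_t lo = 0;
    size_t hi = sizeof font / sizeof font[0];

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (font[mid].codepoint == codepoint)
            return &font[mid];
        if (font[mid].codepoint < codepoint)
            lo = mid + 1;
        else
            hi = mid;
    }
    return &missing_glyph;   /* not in the font: draw the box instead */
}
```

With this toy font, `find_glyph(0x41)` returns the 'A' entry, while `find_glyph(0x7A7A)` (the '空' from the first post) falls back to the box.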

teraflop · Post by **teraflop** » Sun Mar 01, 2009 10:54 pm

Having your console driver interpret UTF-8 will probably be fairly tricky, because it's a variable-length encoding. It would be much simpler to use an underlying fixed-size encoding like UCS-2 or UCS-4, along with a separate converter to decode UTF-8 strings.
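
A minimal sketch of such a separate converter, turning a NUL-terminated UTF-8 string into an array of 32-bit code points. The names `utf8_decode` and `utf8_to_ucs4` are made up for this example, and replacing malformed input with U+FFFD while skipping one byte is just one possible error policy.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Decode one UTF-8 sequence starting at s, with n >= 1 bytes available.
 * Stores the code point in *out and returns the number of bytes consumed.
 * On malformed input, stores U+FFFD and consumes a single byte.
 */
size_t utf8_decode(const unsigned char *s, size_t n, uint32_t *out)
{
    /* Smallest code point that may legally use a sequence of each length. */
    static const uint32_t min_cp[5] = { 0, 0, 0x80, 0x800, 0x10000 };
    uint32_t cp = 0;
    size_t len = 0, i;

    if (s[0] < 0x80)                { cp = s[0];        len = 1; }
    else if ((s[0] & 0xE0) == 0xC0) { cp = s[0] & 0x1F; len = 2; }
    else if ((s[0] & 0xF0) == 0xE0) { cp = s[0] & 0x0F; len = 3; }
    else if ((s[0] & 0xF8) == 0xF0) { cp = s[0] & 0x07; len = 4; }

    /* Invalid lead byte, stray continuation byte, or truncated sequence. */
    if (len == 0 || len > n) {
        *out = 0xFFFD;
        return 1;
    }

    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80) {      /* not a 10xxxxxx byte */
            *out = 0xFFFD;
            return 1;
        }
        cp = (cp << 6) | (s[i] & 0x3F);
    }

    /* Reject overlong forms, surrogates, and values above U+10FFFF. */
    if (cp < min_cp[len] || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
        *out = 0xFFFD;
        return 1;
    }

    *out = cp;
    return len;
}

/* Convert src into at most cap code points; returns how many were written. */
size_t utf8_to_ucs4(const char *src, uint32_t *dst, size_t cap)
{
    const unsigned char *s = (const unsigned char *)src;
    size_t n = strlen(src);
    size_t count = 0;

    while (n > 0 && count < cap) {
        size_t used = utf8_decode(s, n, &dst[count++]);
        s += used;
        n -= used;
    }
    return count;
}
```

The console driver can then work on the fixed-size `uint32_t` array and never needs to look at UTF-8 bytes itself.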

Solar · Post by **Solar** » Sun Mar 01, 2009 11:38 pm

Use multibyte (variable-length, e.g. UTF-8) encodings for storage, and wide (fixed-length, e.g. UTF-32) encodings for internal handling.
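
For the storage direction, a minimal sketch of encoding one 32-bit code point as UTF-8. The function name `utf8_encode` is an assumption for this example, not something from a standard header.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Encode the code point cp as UTF-8 into out (room for 4 bytes).
 * Returns the number of bytes written, or 0 if cp is a surrogate
 * or lies above U+10FFFF and therefore cannot be encoded.
 */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {                       /* 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                      /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp >= 0xD800 && cp <= 0xDFFF)      /* surrogates are not characters */
        return 0;
    if (cp < 0x10000) {                    /* 1110xxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                  /* 11110xxx + three 10xxxxxx */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

Encoding U+7A7A ('空') this way yields the three bytes 0xE7 0xA9 0xBA.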

Don't use "%s", that's for 8-bit chars. "%ls" is for wide character encodings. Check out the functions defined in <wctype.h>, <wchar.h>, and <uchar.h>.
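
A small sketch of the narrow/wide difference using only standard headers. The function names are made up for this example. Note that "%ls" converts through the current locale, so `print_wide` is expected to fail (negative return) when the locale cannot represent the character, for instance in the default "C" locale.

```c
#include <locale.h>
#include <stdio.h>
#include <string.h>
#include <wchar.h>

/* Number of char units in the UTF-8 (multibyte) form of U+7A7A. */
size_t narrow_units(void)
{
    return strlen("\xE7\xA9\xBA");
}

/* Number of wchar_t units in the wide form of the same character. */
size_t wide_units(void)
{
    return wcslen(L"\u7A7A");
}

/*
 * Print a wide string with "%ls". Adopts the user's locale first
 * (e.g. en_US.UTF-8); returns printf's result, or -1 if the locale
 * could not be set.
 */
int print_wide(const wchar_t *ws)
{
    if (setlocale(LC_CTYPE, "") == NULL)
        return -1;
    return printf("%ls\n", ws);
}
```

Here `narrow_units()` counts three bytes while `wide_units()` counts one wide character for the same '空'.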

Solar · Post by **Solar** » Mon Mar 02, 2009 2:49 am

berkus wrote: I think there's no UTF-32 technically, it should be UCS-4.

Copypaste from Wikipedia:UTF-32:

UTF-32 was originally a subset of the UCS-4 standard, but the Principles and Procedures document of JTC1/SC2/WG2 states that all future assignments of characters will be constrained to the BMP or the first 14 supplementary planes, and has removed former provisions for private-use code positions in groups 60 to 7F and in planes E0 to FF.

Accordingly UCS-4 and UTF-32 are now identical except that the UTF-32 standard has additional Unicode semantics.

Note that for many uses, UTF-16 could be absolutely sufficient.

Solar · Post by **Solar** » Mon Mar 02, 2009 3:08 am

berkus wrote: So, technically there's no UTF-32, since UCS-4 pretty much covers it

Erm... no?!?

UCS-4 and UTF-32 are now identical except that the UTF-32 standard has additional Unicode semantics.

jal · Post by **jal** » Mon Mar 02, 2009 5:28 am

berkus wrote: UTF-8 is self-synchronizing, but parsing it all the time is a total waste of CPU

I don't agree. It is not that difficult to parse it, nor is it very CPU intensive: a few compares and shifts, basically. Yes, it makes life a little bit more difficult when using stuff like strlen(), but that function is overused as it is anyway...
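
To illustrate the "few compares" point, a sketch of a code point counting replacement for strlen(). The name `utf8_strlen` is made up here, and the function assumes the input is already valid UTF-8.

```c
#include <stddef.h>

/*
 * Count code points in a NUL-terminated UTF-8 string by counting
 * every byte that is not a continuation byte (10xxxxxx).
 * Assumes the input is valid UTF-8; no validation is done.
 */
size_t utf8_strlen(const char *s)
{
    size_t count = 0;

    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            count++;
    }
    return count;
}
```

For the four-byte string "A空" this returns 2, where strlen() would return 4.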

The real challenge in supporting Unicode in general is all the combining character stuff vs. pre-combined characters. And no matter which storage mechanism you use, that's gonna be a pain anyway (and the strlen() problem isn't solved even when using UTF-32).
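
A small sketch of why a fixed-width encoding does not settle the length question: "é" can be stored as one code point or as two. The names below are hypothetical, and treating only U+0300..U+036F as combining is a deliberate simplification; real code would consult the Unicode character database.

```c
#include <stddef.h>
#include <stdint.h>

/* "é" as one precomposed code point ... */
const uint32_t precomposed[] = { 0x00E9 };
/* ... and as 'e' followed by COMBINING ACUTE ACCENT. */
const uint32_t decomposed[]  = { 0x0065, 0x0301 };

/* Simplification: only the Combining Diacritical Marks block counts. */
static int is_combining(uint32_t cp)
{
    return cp >= 0x0300 && cp <= 0x036F;
}

/* Count code points that start a new visible character. */
size_t count_base_chars(const uint32_t *s, size_t n)
{
    size_t count = 0;
    size_t i;

    for (i = 0; i < n; i++) {
        if (!is_combining(s[i]))
            count++;
    }
    return count;
}
```

Both arrays display as a single "é", yet they hold one and two UTF-32 units respectively; only `count_base_chars` gives 1 for both.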

JAL

jal · Post by **jal** » Mon Mar 02, 2009 9:39 am

berkus wrote: Yes, but it's still a lot easier to work with UTF-16.

It really depends. 'Easier' in this case becomes easily 'sloppy', imho. All you need is a few low-level functions and you're done.

berkus wrote: True, a smarter string class in that case is going to help.

Indeed. And such a class can take care of UTF-8 anyway :).

JAL

jal · Post by **jal** » Mon Mar 02, 2009 2:15 pm

berkus wrote: We can argue for a long time, but a practical implementation I trust design-wise uses UTF-16 for speed reasons, and I believe a UTF-8 implementation would be horrendously slow in more cases than a UTF-16 one.

True, one can argue a lot :). UTF-16 is faster, although one needs to consider UTF-32 if one wants to be truly compatible with all East-Asian extensions and whatnot. The slowness of UTF-8 largely depends on the characters used: in English text there's hardly a speed penalty, in e.g. French text with many accents it may be more.
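
How much the characters matter can be seen from the encoded size per code point, which is only a rough proxy for decoding work. A sketch with made-up function names, assuming the input is a valid code point:

```c
#include <stddef.h>
#include <stdint.h>

/* Bytes needed to encode a valid code point in UTF-8. */
size_t utf8_size(uint32_t cp)
{
    if (cp < 0x80)    return 1;   /* ASCII, e.g. English text */
    if (cp < 0x800)   return 2;   /* e.g. accented Latin letters */
    if (cp < 0x10000) return 3;   /* rest of the BMP, e.g. most CJK */
    return 4;                     /* supplementary planes */
}

/* Bytes needed to encode a valid code point in UTF-16. */
size_t utf16_size(uint32_t cp)
{
    return cp < 0x10000 ? 2 : 4;  /* one unit, or a surrogate pair */
}
```

'A' takes 1 byte in UTF-8 against 2 in UTF-16, 'é' takes 2 in both, '空' takes 3 against 2, and anything beyond the BMP takes 4 in both.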

berkus wrote: It's relatively easy to benchmark though.

I think it's very difficult to benchmark, as it'll depend heavily on the characters used (so the language/script used as well), and on the operations performed. But UTF-16 will be faster, there's no doubt about it.

berkus wrote: Another assumption is that glyph calculations will be much slower than code point manipulation, and therefore UTF-8 vs. UTF-16 is largely irrelevant.

I agree. There's so much going on on a modern OS, that I doubt that it'll make a large difference. Therefore I prefer UTF-8, as it is the most concise form available for most languages and scripts.

JAL

frazzledjazz · Post by **frazzledjazz** » Tue Mar 03, 2009 4:41 pm

char and widechar, right? widechar is a word, right? gotcha so far.

sorry, the structures are not quite the same in pascal.

jal · Post by **jal** » Wed Mar 04, 2009 7:30 am

berkus wrote: Just stumbled upon a nice writing by Joel: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

Yes, that's a good one everybody should read.

JAL

AJ · Post by **AJ** » Wed Mar 04, 2009 8:25 am

berkus wrote: Just stumbled upon a nice writing by Joel: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

Nice - interesting read.

jal · Post by **jal** » Wed Mar 04, 2009 8:54 am

AJ wrote: Nice - interesting read.

It was only quite recently I felt the need to scold someone here who claimed he didn't need Unicode since he was from the US and would never support any other language. Some people are so ignorant...

JAL

AJ · Post by **AJ** » Wed Mar 04, 2009 9:14 am

I'm embarrassed to say that I currently only support ANSI encoding and English character sets. This situation should change in the near future, though.

IMO, even if you only plan on Americans using your OS, this argument evaporates as soon as you plan on a) Adding network or email functionality to your OS, or b) Exchanging files with any other computer.

Cheers,
Adam

jal · Post by **jal** » Thu Mar 05, 2009 4:47 am

AJ wrote: IMO, even if you only plan on Americans using your OS, this argument evaporates as soon as you plan on a) Adding network or email functionality to your OS, or b) Exchanging files with any other computer.

c) Defining 'American' other than 'WASP'.

JAL

OSDev.org