how to implement UTF-8
Hi all,
In Unicode every character has a unique code point, but there are several ways to encode those code points, so there are many different standards.
What I want to implement is UTF-8, which is quite popular now. But there is one thing I can't understand:
In ASCII, printf("%c", 0x41) prints the character 'A'.
In UTF-8, the character '空' (U+7A7A) is encoded as three bytes: 0xE7 0xA9 0xBA. Of course, we can detect whether a byte belongs to a multi-byte sequence or not, but once we have detected that, then with
char *utf_str = "空";
printf("%s", utf_str);
will it print the right character '空'?
I mean, we can detect whether a character is in UTF-8 form, but how does the computer know that it is a UTF-8 character? How does it put '空' on the screen?
Thanks!
- xenos
Re: how to implement UTF-8
That depends on the implementation of the printf function. To display UTF-8 on the screen, you need two things:
- A font that contains character data (i.e. graphical representations) for each character you want to display.
- A printf function that recognizes UTF-8 and looks up each decoded character in the given font. If the character is found in the font, it is displayed on the screen. If it can't be found in the font, printf should output some placeholder instead, maybe a box. (My Firefox does this with the UTF-8 character in your post.)
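The font lookup with a placeholder fallback described above could be sketched roughly as follows. This is only an illustration: the `glyph_entry` structure, the four-entry `font_table`, and the `glyph_for` function are hypothetical names made up for the example, and a real font would carry bitmap or outline data rather than a bare index.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical font entry: a code point and the index of its bitmap. */
struct glyph_entry {
    uint32_t code_point;
    uint16_t glyph_index;
};

/* Glyph 0 is assumed to be the placeholder box. */
#define PLACEHOLDER_GLYPH 0

/* Tiny illustrative font, kept sorted by code point for binary search. */
static const struct glyph_entry font_table[] = {
    { 0x0041, 1 },  /* 'A' */
    { 0x0042, 2 },  /* 'B' */
    { 0x00E9, 3 },  /* 'é' */
    { 0x7A7A, 4 },  /* '空' */
};

/* Return the glyph index for a code point, or the placeholder
 * when the font does not contain that character. */
uint16_t glyph_for(uint32_t cp)
{
    size_t lo = 0;
    size_t hi = sizeof font_table / sizeof font_table[0];

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (font_table[mid].code_point == cp)
            return font_table[mid].glyph_index;
        if (font_table[mid].code_point < cp)
            lo = mid + 1;
        else
            hi = mid;
    }
    return PLACEHOLDER_GLYPH;
}
```

Note that the lookup works on code points, not on raw bytes, so the UTF-8 string has to be decoded before it reaches this step.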
Re: how to implement UTF-8
Having your console driver interpret UTF-8 will probably be fairly tricky, because it's a variable-length encoding. It would be much simpler to use an underlying fixed-size encoding like UCS-2 or UCS-4, along with a separate converter to decode UTF-8 strings.
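A separate converter of that kind might look something like the sketch below, which decodes one UTF-8 sequence into a 32-bit code point. The function name `utf8_decode` and its return convention (bytes consumed, 0 on error) are assumptions chosen for this example, not part of any standard API.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Decode one UTF-8 sequence from s (at most n bytes) into *out.
 * Returns the number of bytes consumed, or 0 if the input is
 * malformed or truncated. Hypothetical helper for illustration.
 */
size_t utf8_decode(const unsigned char *s, size_t n, uint32_t *out)
{
    uint32_t cp, min;
    size_t len, i;

    if (n == 0)
        return 0;

    if (s[0] < 0x80) {                    /* 0xxxxxxx: plain ASCII */
        *out = s[0];
        return 1;
    } else if ((s[0] & 0xE0) == 0xC0) {   /* 110xxxxx: 2-byte sequence */
        cp = s[0] & 0x1F; len = 2; min = 0x80;
    } else if ((s[0] & 0xF0) == 0xE0) {   /* 1110xxxx: 3-byte sequence */
        cp = s[0] & 0x0F; len = 3; min = 0x800;
    } else if ((s[0] & 0xF8) == 0xF0) {   /* 11110xxx: 4-byte sequence */
        cp = s[0] & 0x07; len = 4; min = 0x10000;
    } else {
        return 0;                         /* stray continuation or invalid lead byte */
    }

    if (n < len)
        return 0;                         /* truncated sequence */

    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;                     /* not a 10xxxxxx continuation byte */
        cp = (cp << 6) | (s[i] & 0x3F);
    }

    if (cp < min)
        return 0;                         /* overlong encoding */
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;                         /* outside Unicode or a surrogate */

    *out = cp;
    return len;
}
```

For the three bytes 0xE7 0xA9 0xBA of '空' this yields the code point U+7A7A and consumes 3 bytes. A console driver could call it in a loop and hand each code point to its fixed-size internal representation.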
Re: how to implement UTF-8
Use multibyte (variable-length, e.g. UTF-8) encodings for storage, and wide (fixed-length, e.g. UTF-32) encodings for internal handling.
Don't use "%s", that's for 8-bit chars. "%ls" is for wide character encodings. Check out the functions defined in <wctype.h>, <wchar.h>, and <uchar.h>.
Every good solution is obvious once you've found it.
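As a rough sketch of the multibyte-to-wide approach using the standard C library: `setlocale` and `mbstowcs` are standard functions, but the wrapper name `utf8_to_wide` is made up for this example, and the locale names "C.UTF-8" and "en_US.UTF-8" are assumptions that hold on many Linux systems but are not guaranteed anywhere.

```c
#include <locale.h>
#include <stdlib.h>
#include <wchar.h>

/*
 * Convert a UTF-8 string to wide characters via the C library.
 * Returns the number of wide characters written (excluding the
 * terminator), or (size_t)-1 if no UTF-8 locale could be selected
 * or the input is not valid in that locale.
 * Note: this changes the process-wide LC_CTYPE locale as a side effect.
 */
size_t utf8_to_wide(const char *src, wchar_t *dst, size_t dst_len)
{
    /* Locale names are platform-specific; these two are assumptions. */
    if (setlocale(LC_CTYPE, "C.UTF-8") == NULL &&
        setlocale(LC_CTYPE, "en_US.UTF-8") == NULL)
        return (size_t)-1;

    return mbstowcs(dst, src, dst_len);
}
```

After a successful conversion the buffer can be printed with printf("%ls\n", buf); the library then converts the wide characters back to the multibyte encoding of the current locale on output.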
Re: how to implement UTF-8
berkus wrote: I think there's no UTF-32 technically, it should be UCS-4.
Copypaste from Wikipedia:UTF-32:
UTF-32 was originally a subset of the UCS-4 standard, but the Principles and Procedures document of JTC1/SC2/WG2 states that all future assignments of characters will be constrained to the BMP or the first 14 supplementary planes, and has removed former provisions for private-use code positions in groups 60 to 7F and in planes E0 to FF.
Accordingly UCS-4 and UTF-32 are now identical except that the UTF-32 standard has additional Unicode semantics.
Note that for many uses, UTF-16 could be absolutely sufficient.
Re: how to implement UTF-8
berkus wrote: So, technically there's no UTF-32, since UCS-4 pretty much covers it
Erm... no?!?
UCS-4 and UTF-32 are now identical except that the UTF-32 standard has additional Unicode semantics.
Re: how to implement UTF-8
berkus wrote: UTF-8 is self-synchronizing, but parsing it all the time is a total waste of CPU
I don't agree. It is not that difficult to parse it, nor is it very CPU intensive: a few compares and shifts, basically. Yes, it makes life a little bit more difficult when using stuff like strlen(), but that function is overused as it is anyway...
The real challenge in supporting Unicode in general is all the combining character stuff vs. pre-combined characters. And no matter which storage mechanism you use, that's gonna be a pain anyway (and the strlen() problem isn't solved even when using UTF-32).
JAL
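To illustrate the strlen() point with a minimal sketch: counting code points in UTF-8 only needs one compare per byte, because every continuation byte has the form 10xxxxxx. The function name `utf8_code_points` is made up for this example, and it assumes well-formed input.

```c
#include <stddef.h>

/*
 * Count the code points in a NUL-terminated UTF-8 string by counting
 * every byte that is not a continuation byte (10xxxxxx).
 * Assumes the input is well-formed UTF-8; no validation is done.
 */
size_t utf8_code_points(const char *s)
{
    size_t count = 0;

    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            count++;
    }
    return count;
}
```

For the precomposed 'é' (U+00E9, bytes 0xC3 0xA9) strlen() gives 2 and this function gives 1. For the decomposed form, 'e' followed by the combining acute accent U+0301 (bytes 0x65 0xCC 0x81), strlen() gives 3 and this function gives 2, even though both render as a single character. That second mismatch is the combining-character problem, and it would be exactly the same with UTF-32.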
Re: how to implement UTF-8
berkus wrote: Yes, but it's still a lot easier to work with UTF-16.
It really depends. 'Easier' in this case easily becomes 'sloppy', imho. All you need is a few low-level functions and you're done.
"True, a smarter string class in that case is going to help."
Indeed. And such a class can take care of UTF-8 anyway :).
JAL
Re: how to implement UTF-8
berkus wrote: We can argue for a long time, but a practical implementation I trust design-wise uses UTF-16 for speed reasons, and I would believe a UTF-8 implementation would be horrendously slow in more cases than a UTF-16 one.
True, one can argue a lot :). UTF-16 is faster, although one needs to consider UTF-32 if one wants to be truly compatible with all East-Asian extensions and whatnot. The slowness of UTF-8 largely depends on the characters used: in English text there's hardly a speed penalty, in e.g. French text with many accents it may be more.
"It's relatively easy to benchmark though."
I think it's very difficult to benchmark, as it'll depend heavily on the characters used (so the language/script used as well), and on the operations performed. But UTF-16 will be faster, there's no doubt about it.
"Another assumption is that glyph calculations will be a lot slower than code point manipulation and therefore UTF-8 vs UTF-16 is largely irrelevant."
I agree. There's so much going on in a modern OS that I doubt it'll make a large difference. Therefore I prefer UTF-8, as it is the most concise form available for most languages and scripts.
JAL
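On the East-Asian extensions point, it may help to see why UTF-16 is not strictly fixed-width: code points above U+FFFF, such as those in CJK Extension B starting at U+20000, take two 16-bit units (a surrogate pair). A minimal sketch follows; the function name `utf16_encode` and its return convention are assumptions for this example.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Encode one code point as UTF-16.
 * Returns the number of 16-bit units written (1 or 2),
 * or 0 if cp is a surrogate or lies above U+10FFFF.
 */
size_t utf16_encode(uint32_t cp, uint16_t out[2])
{
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;

    if (cp < 0x10000) {                   /* BMP: a single unit */
        out[0] = (uint16_t)cp;
        return 1;
    }

    cp -= 0x10000;                        /* 20 bits remain */
    out[0] = (uint16_t)(0xD800 | (cp >> 10));    /* high surrogate: top 10 bits */
    out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF));  /* low surrogate: bottom 10 bits */
    return 2;
}
```

'空' (U+7A7A) fits in one unit, while U+20000 becomes the pair 0xD840 0xDC00. So code that indexes UTF-16 by unit has the same kind of variable-length issue as UTF-8, just less often.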
Re: how to implement UTF-8
char and widechar, right? And widechar is a word, right? Gotcha so far.
Sorry, the structures are not quite the same in Pascal.
Re: how to implement UTF-8
berkus wrote: Just stumbled upon a nice writing by Joel: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
Yes, that's a good one everybody should read.
JAL
Re: how to implement UTF-8
berkus wrote: Just stumbled upon a nice writing by Joel: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
Nice - interesting read.
Re: how to implement UTF-8
AJ wrote: Nice - interesting read.
It was only quite recently that I felt the need to scold someone here who claimed he didn't need Unicode since he was from the US and would never support any other language. Some people are so ignorant...
JAL
Re: how to implement UTF-8
I'm embarrassed to say that I currently only support ANSI encoding and English character sets. This situation should change in the near future, though.
IMO, even if you only plan on Americans using your OS, this argument evaporates as soon as you plan on a) Adding network or email functionality to your OS, or b) Exchanging files with any other computer.
Cheers,
Adam
Re: how to implement UTF-8
AJ wrote: IMO, even if you only plan on Americans using your OS, this argument evaporates as soon as you plan on a) Adding network or email functionality to your OS, or b) Exchanging files with any other computer.
c) Defining 'American' as anything other than 'WASP'.
JAL