how to implement UTF-8

negcit.K
Member
Posts: 34
Joined: Fri Dec 07, 2007 9:57 am

how to implement UTF-8

Post by negcit.K »

Hi, all.
In Unicode, every character has a unique code point, but there are several ways to encode those code points, and so several different standards.
What I want to implement is UTF-8, which is quite popular now. But there is one thing I can't understand:

In ASCII, printf("%c", 0x41) prints the character 'A'.

In UTF-8, the character '空' (U+7A7A) is encoded as three bytes: 0xE7 0xA9 0xBA. Of course, we can detect whether a byte sequence is UTF-8 or not, but once we have detected it, then with
char *utf_str = "空";
printf("%s", utf_str);

will it print the right character '空'?

I mean, we can detect that a sequence of chars is in UTF-8 form, but how does the computer know that it is a UTF-8 character? How will it put '空' on the screen?
Thanks!
xenos
Member
Posts: 1121
Joined: Thu Aug 11, 2005 11:00 pm
Libera.chat IRC: xenos1984
Location: Tartu, Estonia

Re: how to implement UTF-8

Post by xenos »

That depends on the implementation of the printf function. To display UTF-8 on the screen, you need two things:
  • A font that contains character data (i.e. graphical representations) for each character you want to display.
  • A printf function that recognizes UTF-8, decodes it, and looks up each character in the given font. If the character is found in the font, it is displayed on the screen. If it can't be found, printf should output a placeholder instead, such as a box. (My Firefox does this with the UTF-8 character in your post.)
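The font lookup with a placeholder fallback could look roughly like this. It is only a minimal sketch, not any particular console driver: `struct glyph`, `struct font`, `font_find_glyph` and the 8x16 bitmap layout are hypothetical names and choices made up for illustration, and the glyph table is assumed to be sorted by code point.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical font layout: one 8x16 monochrome bitmap per code point. */
struct glyph {
    uint32_t codepoint;   /* Unicode code point this glyph draws */
    uint8_t  rows[16];    /* one byte per pixel row */
};

struct font {
    const struct glyph *glyphs;   /* assumed sorted by codepoint, ascending */
    size_t count;                 /* number of entries in glyphs */
    const struct glyph *missing;  /* placeholder glyph, e.g. a box */
};

/* Binary search for a code point; fall back to the placeholder glyph. */
const struct glyph *font_find_glyph(const struct font *f, uint32_t cp)
{
    size_t lo = 0, hi = f->count;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (f->glyphs[mid].codepoint == cp)
            return &f->glyphs[mid];
        if (f->glyphs[mid].codepoint < cp)
            lo = mid + 1;
        else
            hi = mid;
    }
    return f->missing;
}
```

The output routine would decode the UTF-8 bytes into code points first, then call something like font_find_glyph() for each one and draw the returned bitmap.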
Programmers' Hardware Database // GitHub user: xenos1984; OS project: NOS
teraflop
Posts: 2
Joined: Mon Nov 17, 2008 12:42 am

Re: how to implement UTF-8

Post by teraflop »

Having your console driver interpret UTF-8 will probably be fairly tricky, because it's a variable-length encoding. It would be much simpler to use an underlying fixed-size encoding like UCS-2 or UCS-4, along with a separate converter to decode UTF-8 strings.
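A separate converter of that kind might look like the sketch below, which decodes one UTF-8 sequence into a 32-bit code point. The name `utf8_decode` and its signature are assumptions for illustration; it rejects truncated input, overlong forms, surrogates and values above U+10FFFF by returning 0.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Decode one UTF-8 sequence from s (at most n bytes available).
 * Stores the code point in *cp and returns the number of bytes consumed,
 * or 0 if the input is truncated or malformed.
 */
size_t utf8_decode(const unsigned char *s, size_t n, uint32_t *cp)
{
    /* Smallest code point that legitimately needs a sequence of each length. */
    static const uint32_t min_cp[5] = { 0, 0, 0x80, 0x800, 0x10000 };
    size_t len, i;
    uint32_t c;

    if (n == 0)
        return 0;
    if (s[0] < 0x80)                { len = 1; c = s[0]; }
    else if ((s[0] & 0xE0) == 0xC0) { len = 2; c = s[0] & 0x1F; }
    else if ((s[0] & 0xF0) == 0xE0) { len = 3; c = s[0] & 0x0F; }
    else if ((s[0] & 0xF8) == 0xF0) { len = 4; c = s[0] & 0x07; }
    else
        return 0;                   /* stray continuation or invalid lead byte */

    if (len > n)
        return 0;                   /* truncated sequence */
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;               /* not a continuation byte */
        c = (c << 6) | (s[i] & 0x3F);
    }
    /* Reject overlong forms, UTF-16 surrogates and values past U+10FFFF. */
    if (c < min_cp[len] || (c >= 0xD800 && c <= 0xDFFF) || c > 0x10FFFF)
        return 0;
    *cp = c;
    return len;
}
```

Calling it in a loop and advancing by the returned length turns a UTF-8 buffer into an array of fixed-size code points that the console driver can index directly. For the bytes 0xE7 0xA9 0xBA it yields U+7A7A and a length of 3.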
Solar
Member
Posts: 7615
Joined: Thu Nov 16, 2006 12:01 pm
Location: Germany

Re: how to implement UTF-8

Post by Solar »

Use multibyte (variable-length, e.g. UTF-8) encodings for storage, and wide (fixed-length, e.g. UTF-32) encodings for internal handling.

Don't use "%s"; that is for strings of 8-bit chars. "%ls" is for wide-character (wchar_t) strings. Check out the functions defined in <wctype.h>, <wchar.h>, and <uchar.h>.
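On a hosted system with a full C library, the multibyte-to-wide step plus "%ls" could be sketched as follows. `print_as_wide` and the 128-element buffer are assumptions for illustration; mbstowcs() and printf() are standard, and the conversion follows whatever encoding the current locale uses.

```c
#include <stdio.h>
#include <stdlib.h>
#include <wchar.h>

/*
 * Convert a multibyte string (in the encoding of the current locale)
 * to a wide string and print it with "%ls".
 * Returns the number of wide characters, or (size_t)-1 on failure.
 */
size_t print_as_wide(const char *mb)
{
    wchar_t wide[128];
    size_t n = mbstowcs(wide, mb, 127);   /* convert at most 127 wide chars */

    if (n == (size_t)-1)
        return n;          /* invalid sequence for this locale */
    wide[n] = L'\0';       /* mbstowcs does not terminate a full buffer */
    printf("%ls\n", wide);
    return n;
}
```

For UTF-8 input this only works as intended after something like setlocale(LC_ALL, "") from <locale.h> has selected a UTF-8 locale; in the default "C" locale, non-ASCII bytes may be rejected. In a hobby OS, these functions exist only if your own C library provides them.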
Every good solution is obvious once you've found it.
Solar
Member
Posts: 7615
Joined: Thu Nov 16, 2006 12:01 pm
Location: Germany

Re: how to implement UTF-8

Post by Solar »

berkus wrote:I think there's no UTF-32 technically, it should be UCS-4.
Copypaste from Wikipedia:UTF-32:
UTF-32 was originally a subset of the UCS-4 standard, but the Principles and Procedures document of JTC1/SC2/WG2 states that all future assignments of characters will be constrained to the BMP or the first 14 supplementary planes, and has removed former provisions for private-use code positions in groups 60 to 7F and in planes E0 to FF.

Accordingly UCS-4 and UTF-32 are now identical except that the UTF-32 standard has additional Unicode semantics.
Note that for many uses, UTF-16 could be absolutely sufficient.
Solar
Member
Posts: 7615
Joined: Thu Nov 16, 2006 12:01 pm
Location: Germany

Re: how to implement UTF-8

Post by Solar »

berkus wrote:So, technically there's no UTF-32, since UCS-4 pretty much covers it :P
Erm... no?!?

UCS-4 and UTF-32 are now identical except that the UTF-32 standard has additional Unicode semantics.
jal
Member
Posts: 1385
Joined: Wed Oct 31, 2007 9:09 am

Re: how to implement UTF-8

Post by jal »

berkus wrote:UTF-8 is self-synchronizing, but parsing it all the time is a total waste of CPU
I don't agree. It is not that difficult to parse, nor is it very CPU intensive: a few compares and shifts, basically. Yes, it makes life a little bit more difficult when using stuff like strlen(), but that function is overused as it is anyway...

The real challenge in supporting Unicode in general is all the combining character stuff vs. pre-combined characters. And no matter which storage mechanism you use, that's gonna be a pain anyway (and the strlen() problem isn't solved even when using UTF-32).
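To illustrate the "few compares" and the strlen() point, here is a minimal sketch that counts code points in a UTF-8 string by skipping continuation bytes. The name `utf8_count` is made up for illustration, and the input is assumed to be well-formed UTF-8.

```c
#include <stddef.h>

/*
 * Count code points in a NUL-terminated UTF-8 string (assumed well-formed)
 * by counting every byte that is not a continuation byte (10xxxxxx).
 */
size_t utf8_count(const char *s)
{
    size_t n = 0;

    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    }
    return n;
}
```

For "空" (bytes 0xE7 0xA9 0xBA), strlen() reports 3 while utf8_count() reports 1. For "e" followed by the combining acute accent U+0301 (bytes 0x65 0xCC 0x81), utf8_count() reports 2 although a reader sees a single character. That second mismatch is the combining-character problem, and it stays the same in UTF-16 or UTF-32.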

JAL
jal
Member
Posts: 1385
Joined: Wed Oct 31, 2007 9:09 am

Re: how to implement UTF-8

Post by jal »

berkus wrote:Yes, but it's still a lot easier to work with UTF-16.
It really depends. 'Easier' in this case easily becomes 'sloppy', imho. All you need is a few low-level functions and you're done.
berkus wrote:True, a smarter string class in that case is going to help.
Indeed. And such a class can take care of UTF-8 anyway :).
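One of those low-level functions would be an encoder that turns a code point into UTF-8 bytes. The following is a minimal sketch under that assumption; `utf8_encode` and its signature are hypothetical, not taken from any library.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Encode one code point as UTF-8 into out (room for 4 bytes).
 * Returns the number of bytes written, or 0 for an invalid code point.
 */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {                      /* 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                     /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;                         /* surrogates are not characters */
    if (cp < 0x10000) {                   /* 1110xxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                 /* 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                             /* beyond the Unicode range */
}
```

Together with a matching decoder and a code point counter, this is most of what a string class needs in order to keep UTF-8 as its storage format.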


JAL
jal
Member
Posts: 1385
Joined: Wed Oct 31, 2007 9:09 am

Re: how to implement UTF-8

Post by jal »

berkus wrote:We can argue for a long time, but a practical implementation whose design I trust uses UTF-16 for speed reasons, and I believe a UTF-8 implementation would be horrendously slow in more cases than a UTF-16 one.
True, one can argue a lot :). UTF-16 is faster, although one needs to consider UTF-32 if one wants to be truly compatible with all East-Asian extensions and whatnot. The slowness of UTF-8 largely depends on the characters used: in English text there's hardly a speed penalty, in e.g. French text with many accents it may be more.
berkus wrote:It's relatively easy to benchmark though.
I think it's very difficult to benchmark, as it'll depend heavily on the characters used (so the language/script used as well), and on the operations performed. But UTF-16 will be faster, there's no doubt about it.
berkus wrote:Another assumption is that glyph calculations will be a lot slower than code point manipulation, and therefore UTF-8 vs. UTF-16 is largely irrelevant.
I agree. There's so much going on in a modern OS that I doubt it'll make a large difference. Therefore I prefer UTF-8, as it is the most concise form available for most languages and scripts.
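The remark about East-Asian extensions comes down to UTF-16 not being fixed-width either: code points above U+FFFF, for example CJK Unified Ideographs Extension B starting at U+20000, take two 16-bit units (a surrogate pair). A minimal sketch of that encoding step, with `utf16_encode` as a hypothetical name:

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Encode one code point as UTF-16 into out (room for 2 units).
 * Returns the number of 16-bit units written, or 0 for an invalid code point.
 */
size_t utf16_encode(uint32_t cp, uint16_t out[2])
{
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;                     /* reserved for surrogates */
    if (cp > 0x10FFFF)
        return 0;                     /* beyond the Unicode range */
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;        /* BMP: one unit, value unchanged */
        return 1;
    }
    cp -= 0x10000;                    /* 20 bits remain */
    out[0] = (uint16_t)(0xD800 | (cp >> 10));    /* high surrogate: top 10 bits */
    out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF));  /* low surrogate: bottom 10 bits */
    return 2;
}
```

U+7A7A ('空') fits in one unit, while U+20000 becomes the pair 0xD840 0xDC00, so code that indexes UTF-16 strings by unit needs the same kind of care as code that indexes UTF-8 strings by byte.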


JAL
frazzledjazz
Posts: 7
Joined: Tue Mar 03, 2009 3:52 pm

Re: how to implement UTF-8

Post by frazzledjazz »

char and widechar, right? A widechar is a word (16 bits), right? Gotcha so far.

Sorry, the structures are not quite the same in Pascal.
jal
Member
Posts: 1385
Joined: Wed Oct 31, 2007 9:09 am

Re: how to implement UTF-8

Post by jal »

Yes, that's a good one everybody should read.


JAL
AJ
Member
Posts: 2646
Joined: Sun Oct 22, 2006 7:01 am
Location: Devon, UK

Re: how to implement UTF-8

Post by AJ »

jal
Member
Posts: 1385
Joined: Wed Oct 31, 2007 9:09 am

Re: how to implement UTF-8

Post by jal »

AJ wrote:Nice - interesting read.
It was only quite recently I felt the need to scold someone here who claimed he didn't need Unicode since he was from the US and would never support any other language. Some people are so ignorant...


JAL
AJ
Member
Posts: 2646
Joined: Sun Oct 22, 2006 7:01 am
Location: Devon, UK

Re: how to implement UTF-8

Post by AJ »

I'm embarrassed to say that I currently only support ANSI encoding and English character sets. This situation should change in the near future, though.

IMO, even if you only plan on Americans using your OS, this argument evaporates as soon as you plan on a) Adding network or email functionality to your OS, or b) Exchanging files with any other computer.

Cheers,
Adam
jal
Member
Posts: 1385
Joined: Wed Oct 31, 2007 9:09 am

Re: how to implement UTF-8

Post by jal »

AJ wrote:IMO, even if you only plan on Americans using your OS, this argument evaporates as soon as you plan on a) Adding network or email functionality to your OS, or b) Exchanging files with any other computer.
c) Defining 'American' other than 'WASP'.


JAL