Wide-character strings in C/C++
..
Last edited by Perica on Tue Dec 05, 2006 9:33 pm, edited 1 time in total.
Re:Wide-character strings in C/C++
Perica wrote: Q1. Wide-character strings are always stored in UTF-16, right?
No, that's implementation-defined. (And depending on the current locale, too, unless I'm mistaken - but that's run-time, not compile-time.)
Perica wrote: Q2. When C/C++ is written, it's usually written using an 8-bit character set. Wide characters are wider than 8 bits (hence the name =P), so how are wide-character strings mixed with the rest of the 8-bit characters?
C99, 5.2.1 Character sets, paragraph 2:
"In a character constant or string literal, members of the execution character set shall be represented by corresponding members of the source character set or by escape sequences consisting of the backslash \ followed by one or more characters. [...]"
The source character set is what your source is written in. If your editor and compiler support e.g. UTF-32, then UTF-32 is your source character set; otherwise you'll have to escape 'em.
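To make the escaping option concrete, here is a minimal C99 sketch (the string is only an example; the directly typed literal assumes your editor and compiler agree on the source encoding, while the second spelling uses universal character names and works from plain ASCII source):
Code:
#include <stdio.h>
#include <wchar.h>

int main(void)
{
    /* Same string twice: typed directly (needs an editor and compiler that
       agree on the source character set), and via C99 universal character
       names, which survive a plain ASCII source file. */
    wchar_t direct[]  = L"Grüße";
    wchar_t escaped[] = L"Gr\u00FC\u00DFe";   /* \u00FC = ü, \u00DF = ß */

    printf("same contents: %d\n", wcscmp(direct, escaped) == 0);
    return 0;
}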
Every good solution is obvious once you've found it.
- Colonel Kernel
Re:Wide-character strings in C/C++
Perica wrote: Q1. Wide-character strings are always stored in UTF-16, right?
Solar wrote: No, that's implementation-defined. (And depending on the current locale, too, unless I'm mistaken - but that's run-time, not compile-time.)
I don't think Unicode is ever dependent on any locale... otherwise it wouldn't be Unicode. FWIW, VC++ encodes wide-character string literals in UTF-16. I believe GCC on Linux encodes them as UTF-32 (not sure if this is true of GCC on other platforms).
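A quick way to see the difference on a given compiler is to check sizeof(wchar_t) and how many code units a non-BMP literal needs (a sketch; the \U escape is C99):
Code:
#include <stdio.h>
#include <wchar.h>

int main(void)
{
    /* Implementation-defined: typically 2 with VC++ (UTF-16 code units)
       and 4 with GCC on Linux (UTF-32). */
    printf("sizeof(wchar_t) = %u\n", (unsigned)sizeof(wchar_t));

    /* A non-BMP character (U+1D11E) needs two UTF-16 code units,
       but only one UTF-32 code unit. */
    printf("code units in L\"\\U0001D11E\": %u\n",
           (unsigned)(sizeof(L"\U0001D11E") / sizeof(wchar_t) - 1));
    return 0;
}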
Top three reasons why my OS project died:
- Too much overtime at work
- Got married
- My brain got stuck in an infinite loop while trying to design the memory manager
Re:Wide-character strings in C/C++
You can usually assume that you can get your wide strings as either UCS-2/UTF-16 or UTF-32. In the OSS community the consensus seems to be that UTF-32 is the only sane approach, but MS Windows uses UCS-2 (or UTF-16 on XP).
In Unicode you have three sets of 'characters': codepoints, graphemes, and glyphs. Codepoints are what you encode into strings. Graphemes are what you get when you process things like letters and combining characters into "user characters". Say, the letter ä could be written either as a single ä character or as a + combining umlaut. The reason the letter ä exists as a single character is to ease backwards compatibility with older character sets. Finally, glyphs are what get drawn to screen, and sometimes the same grapheme may result in several different glyphs depending on what other graphemes are around it.
Once you accept that codepoints and graphemes are distinct, you notice that as long as most of your text is ASCII, UTF-8 is going to be shortest. For most Latin scripts the difference between UTF-8 and UTF-16 is insignificant though. The nice thing about UTF-16 is that characters up to U+FFFF always take 2 bytes, while characters beyond this "basic multilingual plane" take 4 bytes - that is, the same as UTF-32.
The benefit of using UTF-32 over UTF-16 is that you can do O(1) codepoint indexing, but since you should usually do either code-unit (byte) indexing, which is O(1) anyway, or grapheme indexing (which is O(n) no matter what encoding you use), the actual difference between UTF-8/16/32 is fairly irrelevant.
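To make the indexing point concrete: byte indexing stays O(1), but counting code points in UTF-8 is a linear scan. A rough sketch (assumes valid UTF-8):
Code:
#include <stdio.h>
#include <string.h>

/* O(n) code point count for a UTF-8 string: continuation bytes look like
   10xxxxxx, so count every byte that is NOT a continuation byte. */
static size_t utf8_codepoints(const char *s)
{
    size_t n = 0;
    for (; *s; s++)
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    return n;
}

int main(void)
{
    const char *s = "na\xC3\xAFve";   /* "naïve", ï = U+00EF as two bytes */
    printf("bytes: %u, code points: %u\n",
           (unsigned)strlen(s), (unsigned)utf8_codepoints(s));
    return 0;
}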
UCS-2, which Windows prior to XP, Java, and many others use, is simply UTF-16 without the "surrogates" that give access to characters beyond U+FFFF, which means it can only reach the BMP. Most UCS-2 systems pass UTF-16 through unchanged, though.
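For reference, the surrogate scheme that plain UCS-2 lacks is just a bit of arithmetic; a sketch (the input range is not validated):
Code:
#include <stdio.h>

/* How UTF-16 reaches beyond the BMP: a code point above U+FFFF is split
   into a high/low surrogate pair. UCS-2 has no such rule, so it stops
   at U+FFFF. */
static void to_surrogates(unsigned long cp, unsigned short out[2])
{
    cp -= 0x10000;
    out[0] = (unsigned short)(0xD800 + (cp >> 10));     /* high surrogate */
    out[1] = (unsigned short)(0xDC00 + (cp & 0x3FF));   /* low surrogate  */
}

int main(void)
{
    unsigned short pair[2];
    to_surrogates(0x1D11E, pair);            /* U+1D11E, a non-BMP character */
    printf("U+1D11E -> %04X %04X\n", pair[0], pair[1]);   /* D834 DD1E */
    return 0;
}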
Depending on things including politics, your wide (L"...") strings can be anything (except, usually, UTF-8). That said, if you write your source in UTF-8, you can use perfectly normal strings for your Unicode. In any case, the choice of encoding is more or less irrelevant, since supporting all three is still the smallest part of supporting Unicode properly, even if you forget about the more esoteric grapheme-to-glyph conversion rules and bi-directional text.
Personally, I'm considering forgetting about UTF-16/UTF-32 in sources and just putting UTF-8 into normal strings.
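That is, something along these lines - a sketch assuming the source file itself is saved as UTF-8 and the compiler copies narrow literals through untouched:
Code:
#include <stdio.h>
#include <string.h>

int main(void)
{
    /* Source file saved as UTF-8: the narrow literal is then just raw
       UTF-8 bytes, and an ordinary char * carries them fine. */
    const char *s = "Schöne Grüße";

    printf("%u bytes of UTF-8\n", (unsigned)strlen(s));
    puts(s);   /* a UTF-8 terminal displays it correctly */
    return 0;
}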
Re:Wide-character strings in C/C++
Colonel Kernel wrote: I don't think Unicode is ever dependent on any locale... otherwise it wouldn't be Unicode.
The thing is that it doesn't say anywhere that "wide characters" as in wchar_t are Unicode. They could be something completely different, and that is possibly locale-dependent, but most assuredly implementation-dependent.
Colonel Kernel wrote: FWIW, VC++ encodes wide-character string literals in UTF-16. I believe GCC on Linux encodes them as UTF-32 (not sure if this is true of GCC on other platforms).
See?
Every good solution is obvious once you've found it.
- Colonel Kernel
Re:Wide-character strings in C/C++
Solar wrote: See?
See what? There's no locale dependency there...
Top three reasons why my OS project died:
- Too much overtime at work
- Got married
- My brain got stuck in an infinite loop while trying to design the memory manager
Re:Wide-character strings in C/C++
For GCC it depends on which version you are running; here is an interesting conversation on the subject..
I like the part where it says that one particular version of GCC, instead of converting the UTF-8 into the wide character set, just widens each byte! Ouch, ouch, ouch!!
Code:
matz> For string literals basically the byte values are transfered literally
matz> from the input file to the object file. (C99 is more complicated than
matz> this, but with UTF-8 files this is what it boils down to).
matz>
matz> With newer GCC versions (3.4 onward) normal string literals have their own
matz> character set, which can be chosen by the user. By default this also is
matz> UTF-8, so the behaviour is the same, i.e. UTF-8 strings are copied
matz> literally to the object file.
matz>
matz> This is different with _wide_ character literals, i.e. strings of the form
matz> L"Blabla" . With GCC 3.3 the input bytes are simply widened, i.e. such
matz> wide character literal contains the same bytes as the UTF-8 encoding,
matz> except that each byte is extended to 32 bit. With GCC 3.5 this is
matz> different. There the input characters (in UTF-8) are converted into the
matz> wide character set, which is UTF-32 (i.e. UCS-4) by default on machines
matz> where wchar is 32 bit wide (i.e. all linux platforms).
matz>
matz> Hence, with gcc 3.3 there is no recoding involved. UTF-8 strings will be
matz> placed into the object file.
matz>
matz> With newer gcc there is recoding for _wide_ string literals, from UTF-8 to
matz> UCS-4 (i.e. 32 bit per character). Without user intervention there is no
matz> recoding for normal string literals, i.e. UTF-8 strings remain such.
matz>
matz> > If I have something like
matz> >
matz> > char * lstr = "Schöne Grüße"
matz>
matz> So, given this (assuming the input is UTF-8):
matz> char *str = "Schöne Grüße";
matz> wchar_t *wstr = L"Schöne Grüße";
matz>
matz> with our compiler 'str' will be in UTF-8, 'wstr' will be in something
matz> strange (UTF-8 but each byte extended to 32 bit).
matz>
matz> with newer compilers and no special options while compiling 'str' will be
matz> in UTF-8, and 'wstr' will be in UCS-4.
Anyway..
Personally I'd go UCS-4 instead of UTF-32, but hey...
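If you want to see which of those behaviours a given compiler has, dumping the code units of a wide literal makes it obvious. A sketch (assumes the source file is UTF-8):
Code:
#include <stdio.h>
#include <wchar.h>

int main(void)
{
    /* A converting compiler prints 0xfc for 'ü'; the "widen each byte"
       behaviour described above prints the raw UTF-8 bytes 0xc3 0xbc. */
    const wchar_t *w = L"Grüße";
    while (*w)
        printf("0x%lx ", (unsigned long)*w++);
    putchar('\n');
    return 0;
}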
-- Stu --
Re:Wide-character strings in C/C++
Eh, what's the difference between UCS-4 and UTF-32? (Hint: the different names come from different standards.)
That said, the above post seems to support my view that the easiest option is to simply stuff UTF-8 into normal strings and be happy... that would mostly work even with legacy compilers that know nothing about Unicode whatsoever.
Re:Wide-character strings in C/C++
I use UTF-8 everywhere.
I only reconstruct UTF-32 values as needed for processing (such as when laying out text for display); see the sketch below.
I'd say the main reason is that it's ASCII-compatible, so in most places I just ignore the fact that there may be multibyte sequences and treat them as bytes.
I dislike the idea of UTF-16, since it is both ASCII-incompatible and still has the extra complexity of surrogate pairs - all for the sake of a _potential_, insignificant space saving under rare circumstances..
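The decode-on-demand part is only a few lines. A rough sketch (assumes well-formed UTF-8, with no checks for truncation, overlong forms, or stray continuation bytes):
Code:
#include <stdio.h>

/* Decode one UTF-8 sequence into a 32-bit code point; returns the number
   of bytes consumed. Sketch only: assumes well-formed input. */
static int utf8_decode(const unsigned char *s, unsigned long *cp)
{
    if (s[0] < 0x80) { *cp = s[0]; return 1; }
    if (s[0] < 0xE0) { *cp = ((s[0] & 0x1Ful) << 6) | (s[1] & 0x3F); return 2; }
    if (s[0] < 0xF0) { *cp = ((s[0] & 0x0Ful) << 12) | ((s[1] & 0x3F) << 6)
                             | (s[2] & 0x3F); return 3; }
    *cp = ((s[0] & 0x07ul) << 18) | ((s[1] & 0x3F) << 12)
          | ((s[2] & 0x3F) << 6) | (s[3] & 0x3F);
    return 4;
}

int main(void)
{
    /* "€1", with the euro sign spelled as explicit UTF-8 bytes (E2 82 AC). */
    const unsigned char *p = (const unsigned char *)"\xE2\x82\xAC" "1";
    unsigned long cp;
    while (*p) {
        p += utf8_decode(p, &cp);
        printf("U+%04lX ", cp);
    }
    putchar('\n');   /* prints: U+20AC U+0031 */
    return 0;
}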
Re:Wide-character strings in C/C++
Colonel Kernel wrote: See what? There's no locale dependency there...
Read again: "...that is possibly locale-dependent, but most assuredly implementation-dependent."
There simply is no rule in the standard as to what a wide character actually is.
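(The nearest thing C99 offers is the optional __STDC_ISO_10646__ macro: if an implementation defines it, wchar_t values are ISO 10646 code points; if not, all bets are off. A minimal check, as a sketch:)
Code:
#include <stdio.h>

int main(void)
{
#ifdef __STDC_ISO_10646__
    /* C99: when defined, wchar_t values are ISO 10646 (Unicode) code
       points, and the macro expands to a yyyymmL revision date. */
    printf("wchar_t holds ISO 10646 code points (revision %ld)\n",
           (long)__STDC_ISO_10646__);
#else
    printf("no promise that wchar_t has anything to do with Unicode\n");
#endif
    return 0;
}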
Every good solution is obvious once you've found it.
Re:Wide-character strings in C/C++
mystran wrote: Eh, what's the difference between UCS-4 and UTF-32?
The UTF-32 encoding is a subset of UCS-4, not the whole shebang.
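Concretely, the extra constraint UTF-32 adds is just a range check; a sketch:
Code:
#include <stdio.h>

/* UTF-32 only allows Unicode scalar values: at most U+10FFFF and never a
   surrogate. UCS-4 as originally defined is the full 31-bit ISO 10646 space. */
static int is_valid_utf32(unsigned long cp)
{
    if (cp > 0x10FFFF) return 0;
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0;   /* surrogate range */
    return 1;
}

int main(void)
{
    printf("%d %d %d\n",
           is_valid_utf32(0x41),         /* 'A'                     -> 1 */
           is_valid_utf32(0xD800),       /* surrogate               -> 0 */
           is_valid_utf32(0x7FFFFFFF));  /* UCS-4 max, not UTF-32   -> 0 */
    return 0;
}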
-- Stu --
Re:Wide-character strings in C/C++
Ah, indeed.. from http://www.unicode.org/unicode/reports/tr19/tr19-9.html:
"Over and above ISO 10646, the Unicode Standard adds a number of conformance constraints on character semantics (see The Unicode Standard, Version 3.0, Chapter 3). Declaring UTF-32 instead of UCS-4 allows implementations to explicitly commit to Unicode semantics."
You're right.