Wide-character strings in C/C++
..
Last edited by Perica on Tue Dec 05, 2006 9:33 pm, edited 1 time in total.
Re:Wide-character strings in C/C++
Perica wrote: Q1. Wide-character strings are always stored in UTF-16, right?
No, that's implementation-defined. (And depending on the current locale, too, unless I'm mistaken - but that's run-time, not compile-time.)
Perica wrote: Q2. When C/C++ is written, it's usually written using an 8-bit character set. Wide characters are wider than 8 bits (hence the name =P), so how are wide-character strings mixed with the rest of the 8-bit characters?
C99, 5.2.1 Character sets, paragraph 2:
"In a character constant or string literal, members of the execution character set shall be represented by corresponding members of the source character set or by escape sequences consisting of the backslash \ followed by one or more characters. [...]"
The source character set is what your source is written in. If your editor and compiler support e.g. UTF-32, then UTF-32 is your source character set; otherwise you'll have to escape 'em.
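To make the escaping option concrete, here is a minimal C99 sketch (the string is only an example; the directly typed literal assumes your editor and compiler agree on the source encoding, while the second spelling uses universal character names and works from plain ASCII source):
Code:
#include <stdio.h>
#include <wchar.h>

int main(void)
{
    /* Same string twice: typed directly (needs an editor and compiler that
       agree on the source character set), and via C99 universal character
       names, which survive a plain ASCII source file. */
    wchar_t direct[]  = L"Grüße";
    wchar_t escaped[] = L"Gr\u00FC\u00DFe";   /* \u00FC = ü, \u00DF = ß */

    printf("same contents: %d\n", wcscmp(direct, escaped) == 0);
    return 0;
}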
Every good solution is obvious once you've found it.
- Colonel Kernel
Re:Wide-character strings in C/C++
Perica wrote: Q1. Wide-character strings are always stored in UTF-16, right?
Solar wrote: No, that's implementation-defined. (And depending on the current locale, too, unless I'm mistaken - but that's run-time, not compile-time.)
I don't think Unicode is ever dependent on any locale... otherwise it wouldn't be Unicode. FWIW, VC++ encodes wide-character string literals in UTF-16. I believe GCC on Linux encodes them as UTF-32 (not sure if this is true of GCC on other platforms).
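A quick way to see the difference on a given compiler is to check sizeof(wchar_t) and how many code units a non-BMP literal needs (a sketch; the \U escape is C99):
Code:
#include <stdio.h>
#include <wchar.h>

int main(void)
{
    /* Implementation-defined: typically 2 with VC++ (UTF-16 code units)
       and 4 with GCC on Linux (UTF-32). */
    printf("sizeof(wchar_t) = %u\n", (unsigned)sizeof(wchar_t));

    /* A non-BMP character (U+1D11E) needs two UTF-16 code units,
       but only one UTF-32 code unit. */
    printf("code units in L\"\\U0001D11E\": %u\n",
           (unsigned)(sizeof(L"\U0001D11E") / sizeof(wchar_t) - 1));
    return 0;
}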
Top three reasons why my OS project died:
- Too much overtime at work
- Got married
- My brain got stuck in an infinite loop while trying to design the memory manager
Re:Wide-character strings in C/C++
You can usually assume that you can get your wide strings as either UCS-2/UTF-16 or UTF-32. In the OSS community the consensus seems to be that UTF-32 is the only sane approach, but MS Windows uses UCS-2 (or UTF-16 on XP).
In Unicode you have three sets of 'characters': codepoints, graphemes, and glyphs. Codepoints are what you encode into strings. Graphemes are what you get when you process things like letters and combining characters into "user characters". Say, the letter ä could be written either as a single ä character or as a + combining umlaut. The reason the letter ä exists as a single character is to ease backwards compatibility with older character sets. Finally, glyphs are what get drawn to screen, and sometimes the same grapheme may result in several different glyphs depending on what other graphemes are around it.
Once you accept that codepoints and graphemes are distinct, you notice that as long as most of your text is ASCII, UTF-8 is going to be shortest. For most Latin scripts the difference between UTF-8 and UTF-16 is insignificant though. The nice thing about UTF-16 is that characters up to U+FFFF always take 2 bytes, while characters beyond this "basic multilingual plane" take 4 bytes - that is, the same as UTF-32.
The benefit of using UTF-32 over UTF-16 is that you can do O(1) codepoint indexing, but since you should usually do either code-unit (byte) indexing, which is O(1) anyway, or grapheme indexing (which is O(n) no matter what encoding you use), the actual difference between UTF-8/16/32 is fairly irrelevant.
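To make the indexing point concrete: byte indexing stays O(1), but counting code points in UTF-8 is a linear scan. A rough sketch (assumes valid UTF-8):
Code:
#include <stdio.h>
#include <string.h>

/* O(n) code point count for a UTF-8 string: continuation bytes look like
   10xxxxxx, so count every byte that is NOT a continuation byte. */
static size_t utf8_codepoints(const char *s)
{
    size_t n = 0;
    for (; *s; s++)
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    return n;
}

int main(void)
{
    const char *s = "na\xC3\xAFve";   /* "naïve", ï = U+00EF as two bytes */
    printf("bytes: %u, code points: %u\n",
           (unsigned)strlen(s), (unsigned)utf8_codepoints(s));
    return 0;
}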
UCS-2, which Windows prior to XP, Java, and many others use, is simply UTF-16 without the "surrogates" that give access to characters beyond U+FFFF, which means it can only reach the BMP. Most UCS-2 systems pass UTF-16 through unchanged, though.
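For reference, the surrogate scheme that plain UCS-2 lacks is just a bit of arithmetic; a sketch (the input range is not validated):
Code:
#include <stdio.h>

/* How UTF-16 reaches beyond the BMP: a code point above U+FFFF is split
   into a high/low surrogate pair. UCS-2 has no such rule, so it stops
   at U+FFFF. */
static void to_surrogates(unsigned long cp, unsigned short out[2])
{
    cp -= 0x10000;
    out[0] = (unsigned short)(0xD800 + (cp >> 10));     /* high surrogate */
    out[1] = (unsigned short)(0xDC00 + (cp & 0x3FF));   /* low surrogate  */
}

int main(void)
{
    unsigned short pair[2];
    to_surrogates(0x1D11E, pair);            /* U+1D11E, a non-BMP character */
    printf("U+1D11E -> %04X %04X\n", pair[0], pair[1]);   /* D834 DD1E */
    return 0;
}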
Depending on things including politics, your wide (L"...") strings can be anything (except, usually, UTF-8). That said, if you write your source in UTF-8, you can use perfectly normal strings for your Unicode. In any case, the choice of encoding is more or less irrelevant, since supporting all three is still the smallest part of supporting Unicode properly, even if you forget about the more esoteric grapheme-to-glyph conversion rules and bi-directional text.
Personally, I'm considering forgetting about UTF-16/UTF-32 in sources and just putting UTF-8 into normal strings.
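That is, something along these lines - a sketch assuming the source file itself is saved as UTF-8 and the compiler copies narrow literals through untouched:
Code:
#include <stdio.h>
#include <string.h>

int main(void)
{
    /* Source file saved as UTF-8: the narrow literal is then just raw
       UTF-8 bytes, and an ordinary char * carries them fine. */
    const char *s = "Schöne Grüße";

    printf("%u bytes of UTF-8\n", (unsigned)strlen(s));
    puts(s);   /* a UTF-8 terminal displays it correctly */
    return 0;
}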
Re:Wide-character strings in C/C++
Colonel Kernel wrote: I don't think Unicode is ever dependent on any locale... otherwise it wouldn't be Unicode.
The thing is that it doesn't say anywhere that "wide characters" as in wchar_t are Unicode. They could be something completely different, and that is possibly locale-dependent, but most assuredly implementation-dependent.
Colonel Kernel wrote: FWIW, VC++ encodes wide-character string literals in UTF-16. I believe GCC on Linux encodes them as UTF-32 (not sure if this is true of GCC on other platforms).
See?
Every good solution is obvious once you've found it.
- Colonel Kernel
Re:Wide-character strings in C/C++
Solar wrote: See?
See what? There's no locale dependency there...
Top three reasons why my OS project died:
- Too much overtime at work
- Got married
- My brain got stuck in an infinite loop while trying to design the memory manager
Re:Wide-character strings in C/C++
For GCC it depends on which version you are running; here is an interesting conversation on the subject..
I like the part where it says that one particular version of GCC, instead of converting the UTF-8 into the wide character set, just widens each byte! Ouch, ouch, ouch!!
Code:
matz> For string literals basically the byte values are transfered literally
matz> from the input file to the object file. (C99 is more complicated than
matz> this, but with UTF-8 files this is what it boils down to).
matz>
matz> With newer GCC versions (3.4 onward) normal string literals have their own
matz> character set, which can be chosen by the user. By default this also is
matz> UTF-8, so the behaviour is the same, i.e. UTF-8 strings are copied
matz> literally to the object file.
matz>
matz> This is different with _wide_ character literals, i.e. strings of the form
matz> L"Blabla" . With GCC 3.3 the input bytes are simply widened, i.e. such
matz> wide character literal contains the same bytes as the UTF-8 encoding,
matz> except that each byte is extended to 32 bit. With GCC 3.5 this is
matz> different. There the input characters (in UTF-8) are converted into the
matz> wide character set, which is UTF-32 (i.e. UCS-4) by default on machines
matz> where wchar is 32 bit wide (i.e. all linux platforms).
matz>
matz> Hence, with gcc 3.3 there is no recoding involved. UTF-8 strings will be
matz> placed into the object file.
matz>
matz> With newer gcc there is recoding for _wide_ string literals, from UTF-8 to
matz> UCS-4 (i.e. 32 bit per character). Without user intervention there is no
matz> recoding for normal string literals, i.e. UTF-8 strings remain such.
matz>
matz> > If I have something like
matz> >
matz> > char * lstr = "Schöne Grüße"
matz>
matz> So, given this (assuming the input is UTF-8):
matz> char *str = "Schöne Grüße";
matz> wchar_t *wstr = L"Schöne Grüße";
matz>
matz> with our compiler 'str' will be in UTF-8, 'wstr' will be in something
matz> strange (UTF-8 but each byte extended to 32 bit).
matz>
matz> with newer compilers and no special options while compiling 'str' will be
matz> in UTF-8, and 'wstr' will be in UCS-4.
Anyway..
Personally I'd go UCS-4 instead of UTF-32, but hey...
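If you want to see which of those behaviours a given compiler has, dumping the code units of a wide literal makes it obvious. A sketch (assumes the source file is UTF-8):
Code:
#include <stdio.h>
#include <wchar.h>

int main(void)
{
    /* A converting compiler prints 0xfc for 'ü'; the "widen each byte"
       behaviour described above prints the raw UTF-8 bytes 0xc3 0xbc. */
    const wchar_t *w = L"Grüße";
    while (*w)
        printf("0x%lx ", (unsigned long)*w++);
    putchar('\n');
    return 0;
}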
-- Stu --
Re:Wide-character strings in C/C++
Eh, what's the difference between UCS-4 and UTF-32? (Hint: the different names come from different standards.)
That said, the above post seems to support my view that the easiest option is to simply stuff UTF-8 into normal strings and be happy... that would mostly work even with legacy compilers that know nothing about Unicode whatsoever.
Re:Wide-character strings in C/C++
I use UTF-8 everywhere.
I only reconstruct UTF-32 values as needed for processing (such as when laying out text for display); see the sketch below.
I'd say the main reason is that it's ASCII-compatible, so in most places I just ignore the fact that there may be multibyte sequences and treat them as bytes.
I dislike the idea of UTF-16, since it is both ASCII-incompatible and still has the extra complexity of surrogate pairs - all for the sake of a _potential_, insignificant space saving under rare circumstances..
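The decode-on-demand part is only a few lines. A rough sketch (assumes well-formed UTF-8, with no checks for truncation, overlong forms, or stray continuation bytes):
Code:
#include <stdio.h>

/* Decode one UTF-8 sequence into a 32-bit code point; returns the number
   of bytes consumed. Sketch only: assumes well-formed input. */
static int utf8_decode(const unsigned char *s, unsigned long *cp)
{
    if (s[0] < 0x80) { *cp = s[0]; return 1; }
    if (s[0] < 0xE0) { *cp = ((s[0] & 0x1Ful) << 6) | (s[1] & 0x3F); return 2; }
    if (s[0] < 0xF0) { *cp = ((s[0] & 0x0Ful) << 12) | ((s[1] & 0x3F) << 6)
                             | (s[2] & 0x3F); return 3; }
    *cp = ((s[0] & 0x07ul) << 18) | ((s[1] & 0x3F) << 12)
          | ((s[2] & 0x3F) << 6) | (s[3] & 0x3F);
    return 4;
}

int main(void)
{
    /* "€1", with the euro sign spelled as explicit UTF-8 bytes (E2 82 AC). */
    const unsigned char *p = (const unsigned char *)"\xE2\x82\xAC" "1";
    unsigned long cp;
    while (*p) {
        p += utf8_decode(p, &cp);
        printf("U+%04lX ", cp);
    }
    putchar('\n');   /* prints: U+20AC U+0031 */
    return 0;
}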
Re:Wide-character strings in C/C++
Colonel Kernel wrote: See what? There's no locale dependency there...
Read again: "...that is possibly locale-dependent, but most assuredly implementation-dependent."
There simply is no rule in the standard as to what a wide character actually is.
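(The nearest thing C99 offers is the optional __STDC_ISO_10646__ macro: if an implementation defines it, wchar_t values are ISO 10646 code points; if not, all bets are off. A minimal check, as a sketch:)
Code:
#include <stdio.h>

int main(void)
{
#ifdef __STDC_ISO_10646__
    /* C99: when defined, wchar_t values are ISO 10646 (Unicode) code
       points, and the macro expands to a yyyymmL revision date. */
    printf("wchar_t holds ISO 10646 code points (revision %ld)\n",
           (long)__STDC_ISO_10646__);
#else
    printf("no promise that wchar_t has anything to do with Unicode\n");
#endif
    return 0;
}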
Every good solution is obvious once you've found it.
Re:Wide-character strings in C/C++
mystran wrote: Eh, what's the difference between UCS-4 and UTF-32?
The UTF-32 encoding is a subset of UCS-4, not the whole shebang.
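Concretely, the extra constraint UTF-32 adds is just a range check; a sketch:
Code:
#include <stdio.h>

/* UTF-32 only allows Unicode scalar values: at most U+10FFFF and never a
   surrogate. UCS-4 as originally defined is the full 31-bit ISO 10646 space. */
static int is_valid_utf32(unsigned long cp)
{
    if (cp > 0x10FFFF) return 0;
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0;   /* surrogate range */
    return 1;
}

int main(void)
{
    printf("%d %d %d\n",
           is_valid_utf32(0x41),         /* 'A'                     -> 1 */
           is_valid_utf32(0xD800),       /* surrogate               -> 0 */
           is_valid_utf32(0x7FFFFFFF));  /* UCS-4 max, not UTF-32   -> 0 */
    return 0;
}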
-- Stu --
Re:Wide-character strings in C/C++
Ah, indeed.. from http://www.unicode.org/unicode/reports/tr19/tr19-9.html:
"Over and above ISO 10646, the Unicode Standard adds a number of conformance constraints on character semantics (see The Unicode Standard, Version 3.0, Chapter 3). Declaring UTF-32 instead of UCS-4 allows implementations to explicitly commit to Unicode semantics."
You're right.