How to deal with UTF-8 in C++

OSwhatever · Post by **OSwhatever** » Mon Aug 29, 2011 5:13 am

By definition, all strings must be passed as UTF-8 encoded strings to any native API in my system. The C++ standard library provides wstring, which is basically an array of wchar_t (32-bit on most Unix-like systems, but 16-bit on Windows). To get the UTF-8 string into a wstring, you can use codecvt. A process usually checks and/or modifies the string and sends it on to another process. This means there is one conversion when the process receives the string and another when it passes the string on. Also, codecvt can be cumbersome to use.
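For reference, a minimal sketch of the round trip described above. It assumes a C++11 to C++23 toolchain: `std::wstring_convert` and `std::codecvt_utf8` are standard, but deprecated since C++17. The helper names `utf8_to_utf32` and `utf32_to_utf8` are made up for this illustration, and `char32_t` is used instead of `wchar_t` so the result does not depend on the platform's `wchar_t` width.

```cpp
#include <cassert>
#include <codecvt>
#include <locale>
#include <string>

// Hypothetical helper: decode UTF-8 when the string arrives.
// Throws std::range_error if the input is not valid UTF-8.
inline std::u32string utf8_to_utf32(const std::string& utf8) {
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv;
    return conv.from_bytes(utf8);
}

// Hypothetical helper: encode back to UTF-8 before passing the string on.
inline std::string utf32_to_utf8(const std::u32string& utf32) {
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv;
    return conv.to_bytes(utf32);
}

// Example of the "convert, modify, convert back" pattern from the post:
// appends one code point to a UTF-8 string.
inline std::string append_code_point(const std::string& utf8, char32_t cp) {
    assert(cp <= 0x10FFFF);
    std::u32string wide = utf8_to_utf32(utf8);
    wide.push_back(cp);
    return utf32_to_utf8(wide);
}
```

Every call pays for two full passes over the string, which is the overhead the post is trying to avoid.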

Often you don't need the random access speed that wstring provides, so the UTF-8 could be left unconverted in a UTF-8 string class. Do you know how well UTF-8 strings are supported in C++? Do I have to write my own class, or are there any good libraries for this? What I'm after is something like std::string that can handle UTF-8 natively. wstring can still be used in those cases where random access speed is important.

How did you solve this dilemma?

Combuster · Post by **Combuster** » Mon Aug 29, 2011 7:48 am

UTF8 was designed to fit in a char *, and if you only use it for passing data verbatim and don't intend to modify or interpret any of it, and don't need to care if two encodings of the exact same text fail comparison, then there's no reason to use anything beyond cstring/string.h. The same logic holds for wchar_t * and wstring.h

In all other cases, ICU is most likely the appropriate solution.

Owen · Post by **Owen** » Mon Aug 29, 2011 11:08 am

C++0x (and C1X) adds Unicode literals; we now have the following system:

    // Element type of each literal kind (C++11 to C++17; since C++20, u8"" has element type char8_t).
    #include <cstddef>
    #include <type_traits>
    template <class T, class C, std::size_t N>
    constexpr bool elem_is(const C (&)[N]) { return std::is_same<T, C>::value; }
    static_assert(elem_is<char>("This string has unspecified encoding, and is in the 'execution character set'"), "");
    static_assert(elem_is<char>(u8"This is a UTF-8 string"), "");
    static_assert(elem_is<char16_t>(u"This is a UTF-16 string"), "");
    static_assert(elem_is<char32_t>(U"This is a UCS-4 string"), "");
    static_assert(elem_is<wchar_t>(L"This is a wide string of unspecified encoding"), "");

If "__STDC_ISO_10646__" is defined, it is a value of the form yyyymmL which specifies that a wchar_t can contain every Unicode character defined as of that date, with the value of the wchar_t being that character's Unicode scalar value. Practically, this means that the wide character set is UCS-4 (unless the date predates ~1996, in which case it may be UCS-2).

u and U are not defined to produce UTF-16/UCS-4 strings under C1X. They are under C++0x. Practically, I doubt we will ever see implementations to the contrary.
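A small sketch of how the macro can be queried from code. The function names `iso10646_version` and `describe_wchar` are invented for this example; only `__STDC_ISO_10646__` itself comes from the standard, and an implementation is free not to define it.

```cpp
#include <cassert>
#include <string>

// Returns the value of __STDC_ISO_10646__, or 0 when the implementation
// makes no promise that wchar_t values are ISO 10646 (Unicode) code points.
constexpr long iso10646_version() {
#ifdef __STDC_ISO_10646__
    return __STDC_ISO_10646__;
#else
    return 0L;
#endif
}

// Human-readable summary of what the macro tells us (hypothetical helper).
inline std::string describe_wchar() {
    assert(iso10646_version() >= 0);
    if (iso10646_version() != 0) {
        return "wchar_t holds ISO 10646 values as of " +
               std::to_string(iso10646_version());
    }
    return "wchar_t encoding is implementation-defined";
}
```

Code that depends on wchar_t being UCS-4 could refuse to build otherwise with `static_assert(iso10646_version() != 0, "...")`.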

OSwhatever · Post by **OSwhatever** » Mon Sep 05, 2011 5:10 pm

Owen wrote: C++0x (and C1X) adds Unicode literals [...]

Still, the string class isn't updated for UTF-8. Even before C++0x, string literals were often UTF-8 in practice, since the source file itself was assumed to be UTF-8. This is something I think they have missed: adding UTF-8 support to the stdlib by default.

Owen · Post by **Owen** » Tue Sep 06, 2011 1:01 am

...adding UTF-8 support in what way?

The only parts where C(++) gets involved in encoding are

Input & Output
Encoding conversion (wcstombs, etc)
Source parsing

std::string is not concerned with any of these.

OSwhatever · Post by **OSwhatever** » Tue Sep 06, 2011 2:58 pm

Owen wrote: ...adding UTF-8 support in what way?

The only parts where C(++) gets involved in encoding are

Input & Output
Encoding conversion (wcstombs, etc)
Source parsing

std::string is not concerned with any of these.

Basically, what I want is to do string manipulation directly on UTF-8 without conversion to UTF-32. An example where this is implemented is ustl (http://ustl.sourceforge.net/), where the class ustl::string uses UTF-8 directly. ICU also seems to provide support for native UTF-8 strings, so I'm probably going to port that. Check this link out: http://userguide.icu-project.org/strings. C++0x missed that, I think.
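To make "manipulation directly on UTF-8" concrete, here is a rough sketch of code-point-aware length and substring operations on a plain std::string. All function names are hypothetical and unrelated to ustl or ICU. The code assumes the input is already well-formed UTF-8, and it counts code points, not user-perceived characters.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// A byte of the form 10xxxxxx continues a multi-byte sequence.
inline bool is_continuation(unsigned char b) { return (b & 0xC0) == 0x80; }

// Number of code points in s: every byte that is not a continuation
// byte starts a new code point.
inline std::size_t utf8_length(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char b : s) {
        if (!is_continuation(b)) ++n;
    }
    return n;
}

// Byte offset at which code point number `index` starts,
// or s.size() if the string has fewer code points. Linear time.
inline std::size_t utf8_offset(const std::string& s, std::size_t index) {
    for (std::size_t i = 0; i < s.size(); ++i) {
        if (!is_continuation(static_cast<unsigned char>(s[i]))) {
            if (index == 0) return i;
            --index;
        }
    }
    return s.size();
}

// Substring of up to `count` code points starting at code point `first`.
inline std::string utf8_substr(const std::string& s, std::size_t first,
                               std::size_t count) {
    const std::size_t begin = utf8_offset(s, first);
    const std::size_t end = utf8_offset(s, first + count);
    assert(begin <= end);
    return s.substr(begin, end - begin);
}
```

Indexing is O(n) instead of O(1), which is the trade-off mentioned in the first post: no conversion cost, but no fast random access either.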

Solar · Post by **Solar** » Wed Sep 07, 2011 2:03 am

Combuster wrote:UTF8 was designed to fit in a char *, and if you only use it for passing data verbatim and don't intend to modify or interpret any of it, and don't need to care if two encodings of the exact same text fail comparison, then there's no reason to use anything beyond cstring/string.h.

In all other cases, ICU is most likely the appropriate solution.

I would like to give this a +1 bump. QFT. Note the italics, though.

However:

Combuster wrote:The same logic holds for wchar_t * and wstring.h

I disagree. The problem is that there are both 16-bit and 32-bit wchar_t definitions out there, and compiler settings for both native and externally-declared handling of the type. Playing it safe here is an ugly business.
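A quick way to see which camp a given toolchain falls into is sketched below. The function names are invented for this example; `CHAR_BIT` and `WCHAR_MAX` are standard macros.

```cpp
#include <cassert>
#include <climits>
#include <cwchar>

// Width of wchar_t in bits on this platform
// (typically 16 on Windows, 32 on Linux and macOS).
constexpr unsigned wchar_bits() {
    return static_cast<unsigned>(sizeof(wchar_t) * CHAR_BIT);
}

// True if a single wchar_t can hold any code point up to U+10FFFF.
// With a 16-bit wchar_t this is false, and text outside the Basic
// Multilingual Plane needs surrogate pairs.
constexpr bool wchar_holds_any_code_point() {
    return WCHAR_MAX >= 0x10FFFF;
}

// Runtime sanity check tying the two answers together.
inline void check_wchar_assumptions() {
    assert(wchar_holds_any_code_point() == (wchar_bits() >= 21));
}
```

Portable code that insists on one code point per wchar_t could add `static_assert(wchar_holds_any_code_point(), "...")` and fail the build elsewhere.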

My recommendation is to stay away from wchar_t and wstring. You need wide characters? That means you need full Unicode support, i.e. you need ICU. Sad that it happened this way, but that's how it is for the veteran languages that predated Unicode switching to 32 bit.

davidv1992 · Post by **davidv1992** » Wed Sep 07, 2011 6:47 am

One slight digression: how can two UTF-8 strings be equal while being encoded differently, apart from Unicode code points that just look very similar?

Owen · Post by **Owen** » Wed Sep 07, 2011 7:09 am

davidv1992 wrote: One slight digression: how can two UTF-8 strings be equal while being encoded differently, apart from Unicode code points that just look very similar?

"Unicode Normalization Forms". For example, the character å can be composed as a single scalar value (as it was in legacy encodings), or as the scalar value for LATIN LOWER CASE A plus the combining mark for whatever the dot above is called (my memory fails me).

Generally, things like file systems will define a normalization form for file names. If they don't... bad things happen. Windows prefers precomposed form (i.e. where a character can be encoded in one scalar value, that is chosen), while OS X prefers decomposed form (i.e. characters will be encoded as character + combining mark).
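The å example can be reproduced with plain std::string, as in the sketch below. The function names are made up for illustration; the byte sequences are the UTF-8 encodings of U+00E5 and of U+0061 followed by U+030A (COMBINING RING ABOVE).

```cpp
#include <cassert>
#include <string>

// U+00E5 LATIN SMALL LETTER A WITH RING ABOVE, precomposed (NFC):
// two bytes in UTF-8.
inline std::string a_ring_precomposed() { return "\xC3\xA5"; }

// U+0061 LATIN SMALL LETTER A followed by U+030A COMBINING RING ABOVE,
// decomposed (NFD): three bytes in UTF-8.
inline std::string a_ring_decomposed() { return "a\xCC\x8A"; }

// Byte-wise equality, which is all std::string (or strcmp) offers.
// It knows nothing about normalization forms.
inline bool same_bytes(const std::string& x, const std::string& y) {
    return x == y;
}

// Both strings render as the same character, yet compare unequal.
inline void demonstrate_mismatch() {
    assert(!same_bytes(a_ring_precomposed(), a_ring_decomposed()));
}
```

Comparing the two correctly requires normalizing both sides to the same form first, which is the kind of job a library such as ICU is for.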

Combuster wrote:UTF8 was designed to fit in a char *, and if you only use it for passing data verbatim and don't intend to modify or interpret any of it, and don't need to care if two encodings of the exact same text fail comparison, then there's no reason to use anything beyond cstring/string.h. The same logic holds for wchar_t * and wstring.h

In all other cases, ICU is most likely the appropriate solution.

Also note that ICU natively works in UTF-16. Trying to use anything else puts you in for a world of hurt.

Solar · Post by **Solar** » Wed Sep 07, 2011 7:17 am

davidv1992 wrote: One slight digression: how can two UTF-8 strings be equal while being encoded differently, apart from Unicode code points that just look very similar?

Well, it depends, doesn't it? You get SPACE, NO-BREAK SPACE, EN SPACE, EM SPACE, PUNCTUATION SPACE... are you checking for "equal" in the syntactical or the typographic sense?

Latin Capital Letter A with Diaeresis, or Capital Letter A / Combining Diaeresis? Looks the same to me...

OSwhatever · Post by **OSwhatever** » Wed Sep 07, 2011 9:46 am

Owen wrote: Also note that ICU natively works in UTF-16. Trying to use anything else puts you in for a world of hurt.

That's why I'm a bit reluctant towards ICU. Also it's very big, a little bit more than I ask for.

GTK+ inti library seems to work with UTF-8. GTK+ seems to be in the same situation I'm in, the API assumes UTF-8.

Solar · Post by **Solar** » Thu Sep 08, 2011 12:19 am

OSwhatever wrote:
Owen wrote: Also note that ICU natively works in UTF-16. Trying to use anything else puts you in for a world of hurt.
That's why I'm a bit reluctant towards ICU.

One, not quite true. ICU has some provisions for working with UTF-8 directly.

Two, whatever ICU uses internally doesn't matter much, now does it? Just like UTF-8, UTF-16 is a compromise between memory efficiency and performance efficiency, just biased towards cases where ASCII-7 is not enough. And it's not as if converting to or from internal format is a difficult thing:

    UnicodeString::UnicodeString( const char * codepageData, const char * codepage )
    UnicodeString::fromUTF8( const StringPiece & utf8 )
    UnicodeString::fromUTF32( const UChar32 * utf32, int32_t length )

...and several more.

OSwhatever wrote: Also it's very big, a little bit more than I ask for.

Just like with the standard C library: Sooner or later you will have to provide it at the application level, because users will expect it to be present. Why not implement / port it now instead of later, and enjoy its availability in your kernel?

OSwhatever wrote: GTK+ inti library seems to work with UTF-8. GTK+ seems to be in the same situation I'm in, the API assumes UTF-8.

It's a bit funny that you should point towards GTK+ as a solution for ICU being "very big".

OSwhatever · Post by **OSwhatever** » Thu Sep 08, 2011 6:24 pm

easl seems to be nice.

http://code.google.com/p/easl/wiki/About

If you don't need any advanced converting this might be a good candidate.

Combuster · Post by **Combuster** » Fri Sep 09, 2011 2:47 am

OSwhatever wrote:easl seems to be nice. If you don't need any advanced converting this might be a good candidate.

And it's just as broken as any other ctype/string wrapper. Heck, it even neglects to use setlocale() and still expects all the character indexing and transformation to work... And of course there's the mandatory lack of surrogate pair support.
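For readers unfamiliar with the last point, this is roughly what surrogate pair support means in a UTF-16 string class. The sketch uses invented function names and assumes well-formed input. It says nothing about how easl or ICU handle the matter internally.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

constexpr bool is_high_surrogate(char16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
constexpr bool is_low_surrogate(char16_t u)  { return u >= 0xDC00 && u <= 0xDFFF; }

// Combines a valid high/low surrogate pair into one code point
// in the range U+10000..U+10FFFF.
constexpr char32_t combine_surrogates(char16_t high, char16_t low) {
    return 0x10000 + ((static_cast<char32_t>(high) - 0xD800) << 10)
                   + (static_cast<char32_t>(low) - 0xDC00);
}

// Counts code points in a UTF-16 string; a high surrogate followed by
// a low surrogate counts once. Naive code that returns s.size() here
// is what "lack of surrogate pair support" looks like.
inline std::size_t utf16_length(const std::u16string& s) {
    std::size_t n = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        if (is_high_surrogate(s[i]) && i + 1 < s.size() &&
            is_low_surrogate(s[i + 1])) {
            ++i;  // skip the second half of the pair
        }
        ++n;
    }
    assert(n <= s.size());
    return n;
}
```

Any indexing, truncation, or case-mapping routine has to apply the same pairing rule, or it will split characters outside the Basic Multilingual Plane in half.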

OSDev.org
