How to deal with UTF-8 in C++

OSwhatever
Member
Member
Posts: 595
Joined: Mon Jul 05, 2010 4:15 pm

How to deal with UTF-8 in C++

Post by OSwhatever »

By definition, all strings must be passed as UTF-8 encoded strings to any native API in my system. The C++ standard library provides wstring, which is basically an array of wchar_t (32-bit on most Unix-like platforms, 16-bit on Windows). To get a UTF-8 string into a wstring, you can use codecvt. A process usually checks and/or modifies the string and sends it on to another process. This means there is one conversion when the process receives the string and another when it passes it on. codecvt is also cumbersome to use.

Often you don't need the random access speed that wstring provides, so the text could be left unconverted in a UTF-8 string class. Do you know how well UTF-8 strings are supported in C++? Do I have to write my own class, or are there any good libraries for this? What I'm after is something like std::string that can handle UTF-8 natively. wstring can still be used in those cases where random access speed is important.
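To make the trade-off concrete, here is a minimal sketch of walking a std::string as UTF-8 without converting it first. The names next_code_point and count_code_points are made up for illustration, and the decoder assumes mostly well-formed input: it does not reject overlong forms or encoded surrogates.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: decodes the code point that starts at s[i] and advances i
// past it. Requires i < s.size(). Truncated or invalid sequences yield U+FFFD
// and advance by one byte. Overlong forms are NOT rejected in this sketch.
inline char32_t next_code_point(const std::string& s, std::size_t& i) {
    const unsigned char lead = static_cast<unsigned char>(s[i]);
    std::size_t extra = 0;   // number of continuation bytes expected
    char32_t cp = 0;
    if (lead < 0x80)                { cp = lead;        extra = 0; }
    else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
    else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
    else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
    else { ++i; return 0xFFFD; }                  // stray continuation or invalid lead
    if (i + extra >= s.size()) { ++i; return 0xFFFD; }   // sequence is cut off
    for (std::size_t k = 1; k <= extra; ++k) {
        const unsigned char c = static_cast<unsigned char>(s[i + k]);
        if ((c & 0xC0) != 0x80) { ++i; return 0xFFFD; }  // not a continuation byte
        cp = (cp << 6) | (c & 0x3F);
    }
    i += extra + 1;
    return cp;
}

// Hypothetical helper: number of code points in s. This is O(n), which is the
// price paid for not having wstring-style random access.
inline std::size_t count_code_points(const std::string& s) {
    std::size_t n = 0;
    for (std::size_t i = 0; i < s.size(); ++n) {
        next_code_point(s, i);
    }
    return n;
}
```

Sequential access like this is cheap. What is lost compared to a fixed-width array is constant-time indexing by code point.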

How did you solve this dilemma?
Combuster
Member
Member
Posts: 9301
Joined: Wed Oct 18, 2006 3:45 am
Libera.chat IRC: [com]buster
Location: On the balcony, where I can actually keep 1½m distance
Contact:

Re: How to deal with UTF-8 in C++

Post by Combuster »

UTF-8 was designed to fit in a char *, and if you only use it for passing data verbatim and don't intend to modify or interpret any of it, and don't need to care if two encodings of the exact same text fail comparison, then there's no reason to use anything beyond cstring/string.h. The same logic holds for wchar_t * and cwchar/wchar.h.
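As a rough illustration of the "pass it verbatim" case, the sketch below treats UTF-8 as opaque bytes in a std::string. The helpers join_path and base_name are hypothetical names chosen for this example. The only assumption about the encoding is that bytes below 0x80 never appear inside a multi-byte sequence, which UTF-8 guarantees.

```cpp
#include <cassert>
#include <cstring>
#include <string>

// Hypothetical helper: joins two path-like components with '/'.
// It never looks inside the components, so UTF-8 passes through untouched.
inline std::string join_path(const std::string& dir, const std::string& name) {
    return dir + "/" + name;
}

// Hypothetical helper: returns the part after the last '/'.
// Searching for an ASCII byte is safe because such a byte cannot be part of
// a multi-byte UTF-8 sequence.
inline std::string base_name(const std::string& path) {
    const std::string::size_type pos = path.rfind('/');
    return pos == std::string::npos ? path : path.substr(pos + 1);
}
```

Note that size() and strlen() report bytes here, not characters, and that comparison is byte-wise. That is exactly the limitation described above.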

In all other cases, ICU is most likely the appropriate solution.
"Certainly avoid yourself. He is a newbie and might not realize it. You'll hate his code deeply a few years down the road." - Sortie
[ My OS ] [ VDisk/SFS ]
Owen
Member
Member
Posts: 1700
Joined: Fri Jun 13, 2008 3:21 pm
Location: Cambridge, United Kingdom
Contact:

Re: How to deal with UTF-8 in C++

Post by Owen »

C++0x (and C1X) adds Unicode literals; we now have the following system:

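Code:

    #include <type_traits>
    // Element type of a string literal: strip the reference, the array extent, then const.
    template <class T> using elem_t = typename std::remove_cv<typename std::remove_extent<typename std::remove_reference<T>::type>::type>::type;
    static_assert(std::is_same<elem_t<decltype("This string has unspecified encoding, and is in the 'execution character set'")>, char>::value, "narrow literal");
    static_assert(std::is_same<elem_t<decltype(u8"This is a UTF-8 string")>, char>::value, "UTF-8 literal (char8_t from C++20 on)");
    static_assert(std::is_same<elem_t<decltype(u"This is a UTF-16 string")>, char16_t>::value, "UTF-16 literal");
    static_assert(std::is_same<elem_t<decltype(U"This is a UCS-4 string")>, char32_t>::value, "UCS-4 literal");
    static_assert(std::is_same<elem_t<decltype(L"This is a wide string of unspecified encoding")>, wchar_t>::value, "wide literal");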
If "__STDC_ISO_10646__" is defined, it is a value of the form yyyymmL which specifies that a wchar_t can contain every Unicode character defined as of that date, with the value of the wchar_t being that character's Unicode scalar value. Practically, this means that the wide character set is UCS-4 (unless the date predates ~1996, in which case it may be UCS-2).

u and U are not defined to produce UTF-16/UCS-4 strings under C1X unless the implementation defines __STDC_UTF_16__ / __STDC_UTF_32__. They are under C++0x. Practically, I doubt we will ever see implementations that do otherwise.
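A small sketch of how code might query this at compile time. The function names wchar_is_unicode and wchar_bits are invented for the example. Whether the macro is defined depends on the toolchain and C library, so no particular result is assumed.

```cpp
#include <cassert>
#include <climits>
#include <cstddef>

// Hypothetical helper: true when the implementation promises, via
// __STDC_ISO_10646__, that wchar_t values are Unicode code points.
constexpr bool wchar_is_unicode() {
#ifdef __STDC_ISO_10646__
    return true;
#else
    return false;
#endif
}

// Hypothetical helper: storage width of wchar_t in bits (commonly 16 or 32).
constexpr std::size_t wchar_bits() {
    return sizeof(wchar_t) * CHAR_BIT;
}

// char32_t, by contrast, is always wide enough for any code point.
static_assert(sizeof(char32_t) * CHAR_BIT >= 32, "char32_t holds any code point");
```

Code that needs a guaranteed width can branch on these, or simply use char16_t / char32_t and leave wchar_t alone.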

Post by OSwhatever »

Owen wrote:C++0x (and C1X) adds Unicode literals [...]
Still, the string class isn't updated for UTF-8. Even before C++0x, string literals often ended up as UTF-8 because the source file itself was assumed to be UTF-8. This is something I think they have missed: adding UTF-8 support to the standard library by default.

Post by Owen »

...adding UTF-8 support in what way?

The only parts where C(++) gets involved in encoding are
  • Input & Output
  • Encoding conversion (wcstombs, etc)
  • Source parsing
std::string is not concerned with any of these.

Post by OSwhatever »

Owen wrote:...adding UTF-8 support in what way?

The only parts where C(++) gets involved in encoding are
  • Input & Output
  • Encoding conversion (wcstombs, etc)
  • Source parsing
std::string is not concerned with any of these.
Basically, what I want is to do string manipulation directly on UTF-8 without conversion to UTF-32. An example where this is implemented is ustl (http://ustl.sourceforge.net/), where the class ustl::string uses UTF-8 directly. ICU also seems to provide support for native UTF-8 strings, so I'm probably going to port that. Check this link out: http://userguide.icu-project.org/strings. I think C++0x missed that.
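To sketch what "manipulation directly on UTF-8" can look like, here is a prefix operation that cuts a string after a given number of code points without ever decoding to UTF-32. The names is_continuation and utf8_prefix are hypothetical, this is not how ustl or ICU implement it, and the input is assumed to be well-formed UTF-8.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// True for bytes of the form 10xxxxxx, which only ever continue a sequence.
inline bool is_continuation(char c) {
    return (static_cast<unsigned char>(c) & 0xC0) == 0x80;
}

// Hypothetical helper: returns the first max_code_points code points of s.
// It only counts lead bytes, so no decoding or conversion takes place.
inline std::string utf8_prefix(const std::string& s, std::size_t max_code_points) {
    std::size_t seen = 0;
    std::size_t i = 0;
    for (; i < s.size(); ++i) {
        if (!is_continuation(s[i])) {
            if (seen == max_code_points) break;  // i starts the next code point
            ++seen;
        }
    }
    return s.substr(0, i);
}
```

This never splits a multi-byte sequence, but it counts code points, not user-perceived characters. A base letter and its combining mark can still end up on opposite sides of the cut.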
Solar
Member
Member
Posts: 7615
Joined: Thu Nov 16, 2006 12:01 pm
Location: Germany
Contact:

Re: How to deal with UTF-8 in C++

Post by Solar »

Combuster wrote:UTF8 was designed to fit in a char *, and if you only use it for passing data verbatim and don't intend to modify or interpret any of it, and don't need to care if two encodings of the exact same text fail comparison, then there's no reason to use anything beyond cstring/string.h.

In all other cases, ICU is most likely the appropriate solution.
I would like to give this a +1 bump. QFT. Note the italics, though.

However:
Combuster wrote:The same logic holds for wchar_t * and cwchar/wchar.h.
I disagree. The problem is that there are both 16-bit and 32-bit wchar_t definitions out there, and compiler settings for both native and externally-declared handling of the type. Playing it safe here is an ugly business.

My recommendation is to stay away from wchar_t and wstring. You need wide characters? That means you need full Unicode support, i.e. you need ICU. Sad that it happened this way, but that's how it is for the veteran languages that predated Unicode switching to 32 bit.
Every good solution is obvious once you've found it.
davidv1992
Member
Member
Posts: 223
Joined: Thu Jul 05, 2007 8:58 am

Re: How to deal with UTF-8 in C++

Post by davidv1992 »

One slight digression: how can two UTF-8 strings be equal while being encoded differently, apart from Unicode code points that just look very similar?

Post by Owen »

davidv1992 wrote:One slight digression: how can two UTF-8 strings be equal while being encoded differently, apart from Unicode code points that just look very similar?
"Unicode Normalization Forms". For example, the character å can be encoded as a single scalar value, U+00E5 LATIN SMALL LETTER A WITH RING ABOVE (as it was in legacy encodings), or as U+0061 LATIN SMALL LETTER A followed by U+030A COMBINING RING ABOVE.

Generally, things like file systems will define a normalization form for file names. If they don't... bad things happen. Windows prefers precomposed form (i.e. where a character can be encoded in one scalar value, that is chosen), while OS X prefers decomposed form (i.e. characters will be encoded as character + combining mark).
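A tiny sketch of the problem in terms of bytes. The function names are made up for the example. The byte values are the UTF-8 encodings of the two forms of å described above.

```cpp
#include <cassert>
#include <string>

// "å" precomposed: U+00E5, encoded in UTF-8 as C3 A5.
inline std::string a_ring_precomposed() { return "\xC3\xA5"; }

// "å" decomposed: U+0061 'a' followed by U+030A COMBINING RING ABOVE (CC 8A).
inline std::string a_ring_decomposed() { return "a\xCC\x8A"; }

// Byte-wise equality, which is all std::string and strcmp have to offer.
inline bool same_bytes(const std::string& x, const std::string& y) {
    return x == y;
}
```

Both strings render as the same character, yet same_bytes() reports them as different. Getting a match requires normalizing both sides to the same form first, which is the kind of job a library like ICU exists for.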
Combuster wrote:UTF-8 was designed to fit in a char *, and if you only use it for passing data verbatim and don't intend to modify or interpret any of it, and don't need to care if two encodings of the exact same text fail comparison, then there's no reason to use anything beyond cstring/string.h. The same logic holds for wchar_t * and cwchar/wchar.h.

In all other cases, ICU is most likely the appropriate solution.
Also note that ICU natively works in UTF-16. Trying to use anything else puts you in for a world of hurt.

Post by Solar »

davidv1992 wrote:One slight digression: how can two UTF-8 strings be equal while being encoded differently, apart from Unicode code points that just look very similar?
Well, it depends, doesn't it? You get SPACE, NO-BREAK SPACE, EN SPACE, EM SPACE, PUNCTUATION SPACE... are you checking for "equal" in the syntactical or the typographic sense?

LATIN CAPITAL LETTER A WITH DIAERESIS, or LATIN CAPITAL LETTER A followed by COMBINING DIAERESIS? Looks the same to me...

Post by OSwhatever »

Owen wrote:Also note that ICU natively works in UTF-16. Trying to use anything else puts you in for a world of hurt.
That's why I'm a bit reluctant towards ICU. Also it's very big, a little bit more than I ask for.

GTK+ inti library seems to work with UTF-8. GTK+ seems to be in the same situation I'm in, the API assumes UTF-8.

Post by Solar »

OSwhatever wrote:
Owen wrote:Also note that ICU natively works in UTF-16. Trying to use anything else puts you in for a world of hurt.
That's why I'm a bit reluctant towards ICU.
One, not quite true. ICU has some provisions for working with UTF-8 directly.

Two, whatever ICU uses internally doesn't matter much, now does it? Just like UTF-8, UTF-16 is a compromise between memory efficiency and performance efficiency, just biased towards cases where ASCII-7 is not enough. And it's not as if converting to or from internal format is a difficult thing:

Code:

UnicodeString::UnicodeString( const char * codepageData, const char * codepage )
UnicodeString::fromUTF32( const UChar32 * utf32, int32_t length )
UnicodeString::fromUTF8( const StringPiece & utf8 )
...and several more.
OSwhatever wrote:Also it's very big, a little bit more than I ask for.
Just like with the standard C library: Sooner or later you will have to provide it at the application level, because users will expect it to be present. Why not implement / port it now instead of later, and enjoy its availability in your kernel?
OSwhatever wrote:GTK+ inti library seems to work with UTF-8. GTK+ seems to be in the same situation I'm in, the API assumes UTF-8.
It's a bit funny that you should point towards GTK+ as a solution for ICU being "very big". 8)

Post by OSwhatever »

easl seems to be nice.

http://code.google.com/p/easl/wiki/About

If you don't need any advanced conversion, this might be a good candidate.

Post by Combuster »

OSwhatever wrote:easl seems to be nice. If you don't need any advanced converting this might be a good candidate.
And it's just as broken as any other ctype/string wrapper. Heck, it even neglects to call setlocale() and still expects all the character indexing and transformation to work... And of course there's the mandatory lack of surrogate pair support.
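For reference, a minimal sketch of what surrogate pair support means on the encoding side. The name append_utf16 is hypothetical and has nothing to do with easl's actual API. The input is assumed to be a valid Unicode scalar value.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: appends code point cp to a UTF-16 string.
// Assumes cp <= 0x10FFFF and that cp is not itself in the surrogate range.
inline void append_utf16(std::u16string& out, char32_t cp) {
    if (cp < 0x10000) {
        // Basic Multilingual Plane: one 16-bit unit is enough.
        out.push_back(static_cast<char16_t>(cp));
    } else {
        // Supplementary planes: split the remaining 20 bits over two units.
        const char32_t v = cp - 0x10000;
        out.push_back(static_cast<char16_t>(0xD800 + (v >> 10)));    // high surrogate
        out.push_back(static_cast<char16_t>(0xDC00 + (v & 0x3FF)));  // low surrogate
    }
}
```

A library that indexes 16-bit units as if each one were a character will cut such pairs in half, which is the failure being complained about here.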