How to deal with UTF-8 in C++
- OSwhatever
- Member
- Posts: 595
- Joined: Mon Jul 05, 2010 4:15 pm
How to deal with UTF-8 in C++
By definition, all strings must be passed as UTF-8 encoded strings to any native API in my system. The C++ standard library provides wstring, which is basically an array of wchar_t (32-bit on most platforms, though 16-bit on Windows). To get a UTF-8 string into a wstring, you can use codecvt. A process usually checks and/or modifies the string and sends it on to another process. This means there is one conversion when the process receives the string and another when it passes it on. Also, codecvt can be cumbersome to use.
Often you don't need the random access speed provided by wstring, so the UTF-8 could be left unconverted in a UTF-8 string class. Do you know how well UTF-8 strings are supported in C++? Do I have to write my own class, or are there any good libraries for this? What I'm after is something like std::string that can handle UTF-8 natively. wstring can be used in those cases where random access speed is important.
How did you solve this dilemma?
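A minimal sketch of what working on UTF-8 in place can look like: the bytes stay in a std::string and code points are decoded only while iterating. The helper names next_code_point and decode_utf8 are hypothetical and not part of any library mentioned in this thread; for brevity the decoder does not reject overlong forms or encoded surrogates.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: decodes the code point that starts at s[i] and
// advances i past it. Malformed or truncated input yields U+FFFD and
// advances by a single byte so that iteration always makes progress.
inline char32_t next_code_point(const std::string& s, std::size_t& i) {
    const unsigned char lead = static_cast<unsigned char>(s[i]);
    std::size_t extra = 0;
    char32_t cp = 0;
    if (lead < 0x80)                { extra = 0; cp = lead; }
    else if ((lead & 0xE0) == 0xC0) { extra = 1; cp = lead & 0x1F; }
    else if ((lead & 0xF0) == 0xE0) { extra = 2; cp = lead & 0x0F; }
    else if ((lead & 0xF8) == 0xF0) { extra = 3; cp = lead & 0x07; }
    else { ++i; return 0xFFFD; }            // stray continuation byte
    if (i + extra >= s.size()) { ++i; return 0xFFFD; }  // truncated sequence
    for (std::size_t k = 1; k <= extra; ++k) {
        const unsigned char c = static_cast<unsigned char>(s[i + k]);
        if ((c & 0xC0) != 0x80) { ++i; return 0xFFFD; }
        cp = (cp << 6) | (c & 0x3F);
    }
    i += extra + 1;
    return cp;
}

// Decodes a whole UTF-8 string into code points.
inline std::vector<char32_t> decode_utf8(const std::string& s) {
    std::vector<char32_t> out;
    for (std::size_t i = 0; i < s.size();) out.push_back(next_code_point(s, i));
    return out;
}
```

Sequential access like this is linear in the byte length; only indexing by code point number needs a scan, which is the random access cost the question refers to.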
- Combuster
- Member
- Posts: 9301
- Joined: Wed Oct 18, 2006 3:45 am
- Libera.chat IRC: [com]buster
- Location: On the balcony, where I can actually keep 1½m distance
Re: How to deal with UTF-8 in C++
UTF8 was designed to fit in a char *, and if you only use it for passing data verbatim and don't intend to modify or interpret any of it, and don't need to care if two encodings of the exact same text fail comparison, then there's no reason to use anything beyond cstring/string.h. The same logic holds for wchar_t * and wchar.h
In all other cases, ICU is most likely the appropriate solution.
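To illustrate the "passing data verbatim" case, here is a small sketch that splits a path stored as UTF-8 using nothing but std::string. It relies on the property that bytes below 0x80 never occur inside a multi-byte UTF-8 sequence, so searching for an ASCII separator cannot cut a character in half. The function name last_component is made up for this example.

```cpp
#include <cstddef>
#include <string>

// Returns the part of a '/'-separated path after the last separator.
// The bytes are treated as opaque; no decoding takes place.
inline std::string last_component(const std::string& path) {
    const std::size_t pos = path.rfind('/');
    return pos == std::string::npos ? path : path.substr(pos + 1);
}
```

Note that size() on such a string reports bytes, not characters: "caf\xC3\xA9" (café) has a size of 5.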
- Owen
- Member
- Posts: 1700
- Joined: Fri Jun 13, 2008 3:21 pm
- Location: Cambridge, United Kingdom
Re: How to deal with UTF-8 in C++
C++0x (and C1X) adds Unicode literals; we now have the following system:
If "__STDC_ISO_10646__" is defined, it is a value of the form yyyymmL which specifies that a wchar_t can contain every Unicode character defined as of that date, with the value of the wchar_t being that character's Unicode scalar value. Practically, this means that the wide character set is UCS-4 (unless the date predates ~1996, in which case it may be UCS-2).
Code: Select all
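#include <type_traits>
// Element type of a string literal: strip the reference, the array extent, and the const.
template <class T> using elem = typename std::remove_cv<typename std::remove_extent<typename std::remove_reference<T>::type>::type>::type;
static_assert(std::is_same<elem<decltype("This string has unspecified encoding, and is in the 'execution character set'")>, char>::value, "");
static_assert(std::is_same<elem<decltype(u8"This is a UTF-8 string")>, char>::value, ""); // char through C++17; char8_t from C++20 on
static_assert(std::is_same<elem<decltype(u"This is a UTF-16 string")>, char16_t>::value, "");
static_assert(std::is_same<elem<decltype(U"This is a UCS-4 string")>, char32_t>::value, "");
static_assert(std::is_same<elem<decltype(L"This is a wide string of unspecified encoding")>, wchar_t>::value, "");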
u and U are not defined to produce UTF-16/UCS-4 strings under C1X. They are under C++0X. Practically, I doubt we will ever see implementations contrary.
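As a small illustration of what the three Unicode prefixes mean in C++11 and later, the sketch below counts the code units each literal needs for one character outside the Basic Multilingual Plane. The helper code_units is a hypothetical name introduced for this example.

```cpp
#include <cstddef>

// Number of code units in a string literal, excluding the terminating null.
template <class CharT, std::size_t N>
constexpr std::size_t code_units(const CharT (&)[N]) { return N - 1; }

// U+1F600 lies outside the Basic Multilingual Plane.
static_assert(code_units(u8"\U0001F600") == 4, "UTF-8 needs four bytes");
static_assert(code_units(u"\U0001F600") == 2, "UTF-16 needs a surrogate pair");
static_assert(code_units(U"\U0001F600") == 1, "UTF-32 needs one code unit");
```

The same count for an L"" literal depends on the platform's wchar_t, which is why it is left out here.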
- OSwhatever
- Member
- Posts: 595
- Joined: Mon Jul 05, 2010 4:15 pm
Re: How to deal with UTF-8 in C++
Owen wrote: C++0x (and C1X) adds Unicode literals; we now have the following system: [...]
Still, the string class isn't updated for UTF-8. Even before C++0x, string literals were often UTF-8 simply because the source file itself was assumed to be UTF-8. I think this is something they missed: adding UTF-8 support to the standard library by default.
- Owen
- Member
- Posts: 1700
- Joined: Fri Jun 13, 2008 3:21 pm
- Location: Cambridge, United Kingdom
Re: How to deal with UTF-8 in C++
...adding UTF-8 support in what way?
The only parts where C(++) gets involved in encoding are
- Input & Output
- Encoding conversion (wcstombs, etc)
- Source parsing
std::string is not concerned with any of these.
- OSwhatever
- Member
- Posts: 595
- Joined: Mon Jul 05, 2010 4:15 pm
Re: How to deal with UTF-8 in C++
Owen wrote: ...adding UTF-8 support in what way? The only parts where C(++) gets involved in encoding are Input & Output, Encoding conversion (wcstombs, etc), and Source parsing. std::string is not concerned with any of these.
Basically, what I want is to do string manipulation directly on UTF-8 without conversion to UTF-32. One example where this is implemented is uSTL (http://ustl.sourceforge.net/), where the class ustl::string uses UTF-8 directly. ICU also seems to provide support for native UTF-8 strings, so I'm probably going to port that. Check out this link: http://userguide.icu-project.org/strings. I think C++0x missed that.
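Two manipulations that can be done directly on the UTF-8 bytes, sketched under the assumption that the input is well-formed: counting code points and removing the last one. Both only need to recognise continuation bytes (bit pattern 10xxxxxx). The function names are hypothetical and are not taken from uSTL or ICU.

```cpp
#include <cstddef>
#include <string>

// True for UTF-8 continuation bytes (10xxxxxx).
inline bool is_continuation(char c) {
    return (static_cast<unsigned char>(c) & 0xC0) == 0x80;
}

// Counts code points by counting every byte that starts a sequence.
inline std::size_t count_code_points(const std::string& s) {
    std::size_t n = 0;
    for (char c : s) if (!is_continuation(c)) ++n;
    return n;
}

// Removes the last code point in place by stepping back to its lead byte.
inline void pop_back_code_point(std::string& s) {
    if (s.empty()) return;
    std::size_t i = s.size() - 1;
    while (i > 0 && is_continuation(s[i])) --i;
    s.erase(i);
}
```

A code point is not necessarily what a user perceives as one character: a base letter followed by a combining mark is two code points, so anything user-facing still needs a library that understands grapheme clusters.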
Re: How to deal with UTF-8 in C++
Combuster wrote: UTF8 was designed to fit in a char *, and if you only use it for passing data verbatim and don't intend to modify or interpret any of it, and don't need to care if two encodings of the exact same text fail comparison, then there's no reason to use anything beyond cstring/string.h. In all other cases, ICU is most likely the appropriate solution.
I would like to give this a +1 bump. QFT. Note the italics, though.
However:
Combuster wrote: The same logic holds for wchar_t * and wchar.h
I disagree. The problem is that there are both 16-bit and 32-bit wchar_t definitions out there, and compiler settings for both native and externally-declared handling of the type. Playing it safe here is an ugly business.
My recommendation is to stay away from wchar_t and wstring. You need wide characters? That means you need full Unicode support, i.e. you need ICU. Sad that it happened this way, but that's how it is for the veteran languages that predated Unicode switching to 32 bit.
Every good solution is obvious once you've found it.
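A sketch of what "playing it safe" with wchar_t tends to look like: code that has to work with both widths must branch on the range of the type. It assumes that wide strings hold UTF-32 where wchar_t is wide enough and UTF-16 otherwise, which the standard does not guarantee. The helper append_scalar is a hypothetical name.

```cpp
#include <cwchar>   // WCHAR_MAX
#include <string>

// Appends one Unicode scalar value to a wide string. Assumes cp is a valid
// scalar value (not a surrogate, not above U+10FFFF).
inline void append_scalar(std::wstring& out, char32_t cp) {
    if (WCHAR_MAX >= 0x10FFFF || cp < 0x10000) {
        // Either wchar_t can hold any scalar value, or cp fits in 16 bits.
        out.push_back(static_cast<wchar_t>(cp));
    } else {
        // 16-bit wchar_t: split into a UTF-16 surrogate pair.
        const char32_t v = cp - 0x10000;
        out.push_back(static_cast<wchar_t>(0xD800 + (v >> 10)));
        out.push_back(static_cast<wchar_t>(0xDC00 + (v & 0x3FF)));
    }
}
```

The fixed-width types char16_t and char32_t avoid this branch altogether, since their encoding does not vary by platform.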
- davidv1992
- Member
- Posts: 223
- Joined: Thu Jul 05, 2007 8:58 am
Re: How to deal with UTF-8 in C++
One slight digression: how can two UTF-8 strings be equal while being encoded differently, apart from Unicode code points that merely look very similar?
- Owen
- Member
- Posts: 1700
- Joined: Fri Jun 13, 2008 3:21 pm
- Location: Cambridge, United Kingdom
Re: How to deal with UTF-8 in C++
davidv1992 wrote: One slight digression: how can two UTF-8 strings be equal while being encoded differently, apart from Unicode code points that merely look very similar?
"Unicode Normalization Forms". For example, the character å can be encoded as a single scalar value (as it was in legacy encodings), or as the scalar value for LATIN SMALL LETTER A followed by the combining mark for the ring above (COMBINING RING ABOVE).
Generally, things like file systems will define a normalization form for file names. If they don't... bad things happen. Windows prefers precomposed form (i.e. where a character can be encoded in one scalar value, that is chosen), while OS X prefers decomposed form (i.e. characters will be encoded as character + combining mark).
Combuster wrote: UTF8 was designed to fit in a char *, and if you only use it for passing data verbatim and don't intend to modify or interpret any of it, and don't need to care if two encodings of the exact same text fail comparison, then there's no reason to use anything beyond cstring/string.h. The same logic holds for wchar_t * and wchar.h. In all other cases, ICU is most likely the appropriate solution.
Also note that ICU natively works in UTF-16. Trying to use anything else puts you in for a world of hurt.
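The å example can be written out as bytes. The two constants below are the UTF-8 encodings of the precomposed and the decomposed form; they render identically, yet a plain byte comparison reports them as different. The variable names are chosen for this sketch only.

```cpp
#include <string>

// U+00E5 LATIN SMALL LETTER A WITH RING ABOVE, precomposed form.
const std::string a_ring_composed = "\xC3\xA5";

// U+0061 LATIN SMALL LETTER A followed by U+030A COMBINING RING ABOVE,
// decomposed form.
const std::string a_ring_decomposed = "a\xCC\x8A";

// std::string compares bytes, so the two spellings are not equal.
inline bool same_bytes(const std::string& x, const std::string& y) {
    return x == y;
}
```

Making such strings compare equal requires normalizing both sides to the same form first, which is the kind of work a Unicode library is for.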
Re: How to deal with UTF-8 in C++
davidv1992 wrote: One slight digression: how can two UTF-8 strings be equal while being encoded differently, apart from Unicode code points that merely look very similar?
Well, it depends, doesn't it? You get SPACE, NO-BREAK SPACE, EN SPACE, EM SPACE, PUNCTUATION SPACE... are you checking for "equal" in the syntactical or the typographic sense?
LATIN CAPITAL LETTER A WITH DIAERESIS, or LATIN CAPITAL LETTER A plus COMBINING DIAERESIS? Looks the same to me...
- OSwhatever
- Member
- Posts: 595
- Joined: Mon Jul 05, 2010 4:15 pm
Re: How to deal with UTF-8 in C++
Owen wrote: Also note that ICU natively works in UTF-16. Trying to use anything else puts you in for a world of hurt.
That's why I'm a bit reluctant towards ICU. Also, it's very big, a little bit more than I asked for.
The GTK+ inti library seems to work with UTF-8. GTK+ seems to be in the same situation I'm in: the API assumes UTF-8.
Re: How to deal with UTF-8 in C++
Owen wrote: Also note that ICU natively works in UTF-16. Trying to use anything else puts you in for a world of hurt.
OSwhatever wrote: That's why I'm a bit reluctant towards ICU.
One, not quite true. ICU has some provisions for working with UTF-8 directly.
Two, whatever ICU uses internally doesn't matter much, now does it? Just like UTF-8, UTF-16 is a compromise between memory efficiency and performance efficiency, just biased towards cases where ASCII-7 is not enough. And it's not as if converting to or from internal format is a difficult thing:
Code: Select all
UnicodeString::UnicodeString( const char * codepageData, const char * codepage )
UnicodeString::fromUTF32( const UChar32 * utf32, int32_t length )
UnicodeString::fromUTF8( const StringPiece & utf8 )
OSwhatever wrote: Also it's very big, a little bit more than I ask for.
Just like with the standard C library: sooner or later you will have to provide it at the application level, because users will expect it to be present. Why not implement / port it now instead of later, and enjoy its availability in your kernel?
OSwhatever wrote: GTK+ inti library seems to work with UTF-8. GTK+ seems to be in the same situation I'm in, the API assumes UTF-8.
It's a bit funny that you should point towards GTK+ as a solution for ICU being "very big".
- OSwhatever
- Member
- Posts: 595
- Joined: Mon Jul 05, 2010 4:15 pm
Re: How to deal with UTF-8 in C++
easl seems to be nice.
http://code.google.com/p/easl/wiki/About
If you don't need any advanced converting this might be a good candidate.
- Combuster
- Member
- Posts: 9301
- Joined: Wed Oct 18, 2006 3:45 am
- Libera.chat IRC: [com]buster
- Location: On the balcony, where I can actually keep 1½m distance
Re: How to deal with UTF-8 in C++
OSwhatever wrote: easl seems to be nice. If you don't need any advanced converting this might be a good candidate.
And it's just as broken as any other ctype/string wrapper. Heck, it even neglects to call setlocale() and yet expects all the character indexing and transformation to work... And of course there's the mandatory lack of surrogate pair support.
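For reference, surrogate pair support in a UTF-16 string comes down to something like the following sketch. It decodes one code point at a time and substitutes U+FFFD for unpaired surrogates. The function name next_code_point_utf16 is hypothetical and does not come from easl or ICU.

```cpp
#include <cstddef>
#include <string>

// Decodes the code point that starts at s[i] and advances i past it.
inline char32_t next_code_point_utf16(const std::u16string& s, std::size_t& i) {
    const char16_t u = s[i++];
    if (u < 0xD800 || u > 0xDFFF) return u;        // ordinary BMP code unit
    if (u <= 0xDBFF && i < s.size()) {             // high surrogate
        const char16_t lo = s[i];
        if (lo >= 0xDC00 && lo <= 0xDFFF) {        // followed by a low surrogate
            ++i;
            return 0x10000 + ((static_cast<char32_t>(u) - 0xD800) << 10)
                           + (static_cast<char32_t>(lo) - 0xDC00);
        }
    }
    return 0xFFFD;                                 // unpaired surrogate
}
```

Code that indexes a UTF-16 string by code unit and treats each unit as a character silently breaks on every code point above U+FFFF, which is the failure being criticised here.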