GCC UTF16
Posted: Sun May 02, 2010 10:22 pm
by lemonyii
hello!
I'm thinking about using UTF-16 as the internal encoding of my OS, but after a long time I've found that GCC doesn't make this easy, and I hate GCC for it.
In Visual Studio we can use the prefix 'L' to declare a wide string, e.g. L"abc", and I found that GCC has a 'u' prefix that acts like 'L', e.g. u"abc".
But I still can't get it to work, even with -std=gnu99. What's more, I don't like that flag, so I hope to write a macro to replace 'u'.
I'd like to know how 'u' and 'L' work, and whether the conversion happens at compile time (then 'u' is worth considering) or at run time (forget it, I don't want such a function running in my kernel). I'm not good with the preprocessor, so I hope someone can help.
I'm using GCC 4.4.1.
Problem: how does 'u' work? And how do I write a macro that converts ASCII to UTF-16?
Thanks
Re: GCC UTF16
Posted: Mon May 03, 2010 2:01 am
by qw
Tried it, no problem.
DJGPP and MinGW both accept wide characters like L'X' (even in non-C99 mode). A wide character is 2 bytes there, so it may be UCS-2 or UTF-16; I haven't tested which.
It probably depends on the target environment your compiler was built for.
Re: GCC UTF16
Posted: Mon May 03, 2010 2:27 am
by lemonyii
During the last few hours I read and tried some more.
Yes, L"XXX" really is supported, and in a Windows environment it is 16 bits, but 32 bits on Linux.
Now I'm trying to make it 16 bits on Linux.
It seems that GCC has supported UTF-16 since 4.4, but I don't know how to use it without changing too much.
The u"xxx" prefix seems to be designed for exactly this, but it doesn't work on my machine.
Still trying.
Thanks
Re: GCC UTF16
Posted: Mon May 03, 2010 2:43 am
by Combuster
I've tried out a bunch of cross-compilers; u"..." is unsupported in 3.4.4 and 4.2.2 as well, and wide strings seem to default to 32 bits in all cases. I even tried the FreeBASIC compiler, and it can only output one wide-character standard per run (it can be changed by passing -target windows or -target linux, which in turn has a bunch of other unwanted side effects).
I suggest you stick to wchar and wctype; they are meant to cover for the variation in standards between compilers. If you really need UTF-16, you'd have to convert to it. There's no reason GCC or any other compiler should switch to that standard in the next release.
Re: GCC UTF16
Posted: Mon May 03, 2010 3:24 am
by lemonyii
Yeah...
But there's no reason to store so many zero bytes in every string, or to use UTF-8 and slow my system down, right?
How I hope I can get u"xxx" to work!
Combuster wrote: but it can be changed by passing -target windows or -target linux, which in turn has a bunch of other unwanted side effects
Yeah, that's what I care about most.
I've been searching for the L"xxx" macro prototype for a long time; I think I could write one myself or copy it from somewhere.
But the problem is, searching with 'L' as the keyword gives so many results!
UTF-16, UTF-16!
Thanks!
Re: GCC UTF16
Posted: Mon May 03, 2010 3:32 am
by qw
I'm afraid you'll have to configure your own GCC build. That's not something I can help you with, unfortunately.
EDIT: Strike that, I see you already found your answer.
Re: GCC UTF16
Posted: Mon May 03, 2010 3:37 am
by lemonyii
http://blogs.oracle.com/ezannoni/2008/0 ... n_gcc.html
This URL is why I have been searching all day.
The reason I use UTF-16 rather than UTF-8 is that English is not my native language (which is also why I search so slowly). UTF-16 will greatly speed up indexing into the font set of my future GUI and limit memory usage. Obviously, if not for historical reasons, UNIX and Linux would also have chosen UTF-16 like Windows NT, right?
So I really hope I can use UTF-16.
The URL above will keep me searching.
Help!
Thanks! Merci! 谢谢! (Thank you!)
Re: GCC UTF16
Posted: Mon May 03, 2010 3:40 am
by lemonyii
Hobbes wrote:I'm afraid you have to configure your own GCC build.
I think... that's unnecessary!
The URL above proves that GCC is not so weak.
And rebuilding GCC is a really tough thing which I have never done.
I'll keep searching.
Thanks!
Re: GCC UTF16
Posted: Mon May 03, 2010 3:47 am
by Owen
Combuster wrote:I suggest you stick to wchar and wctype - they are meant to cover for the variation in standards the compiler uses. If you really need UTF-16, you'd have to convert it. There's no reason gcc or any compiler should use that standard in the next release.
Please. C's Unicode/l10n/i18n support is about the worst implementation of it I have ever seen. It is fundamentally broken.
And all the decent internationalization libraries use UTF-16
Re: GCC UTF16
Posted: Mon May 03, 2010 4:06 am
by lemonyii
So, do you have some good advice?
Say, how to use UTF-16 in GCC?
Thanks
Re: GCC UTF16
Posted: Mon May 03, 2010 4:08 am
by JamesM
Owen wrote:Combuster wrote:I suggest you stick to wchar and wctype - they are meant to cover for the variation in standards the compiler uses. If you really need UTF-16, you'd have to convert it. There's no reason gcc or any compiler should use that standard in the next release.
Please. C's Unicode/l10n/i18n support is about the worst implementation of it I have ever seen. It is fundamentally broken.
And all the decent internationalization libraries use UTF-16
Unicode characters are 20 bits long. For future compatibility I'd use UTF32 as my internal representation. Fixed width AND can store the entire unicode character set.
That's one thumbs-up from me!
lemonyii: GCC is designed for UNIX environments that do not use UTF-16. The entire userspace was built around ASCII, and UTF-8 is fully backwards compatible with ASCII. That isn't going to change any time soon, which is why I'm not surprised there is no UTF-16 support.
To be more helpful: I would create a small script in the language of your choice that recognises a u"..." sequence and transforms it into the byte-for-byte equivalent in UTF-16. For example:
u"ab" -> "\x61\x00\x62\x00"
(that's the UTF-16LE representation: 'a' is 0x61 and 'b' is 0x62, each followed by a zero high byte)
Re: GCC UTF16
Posted: Mon May 03, 2010 5:09 am
by lemonyii
Maybe...
UTF-32 is a good choice!
Think about it: it would waste about 1 MB in my kernel once it grows to about 4 MB (several years from now), and about 20 MB in a system of about 100 MB.
That's really nothing for a 64-bit system with more than 4 GB of memory.
But it will take some time to accept this.
I'm hesitating...
Thank you all!
Re: GCC UTF16
Posted: Mon May 03, 2010 3:53 pm
by Owen
JamesM wrote:
Unicode characters are 20 bits long. For future compatibility I'd use UTF32 as my internal representation. Fixed width AND can store the entire unicode character set.
Unicode characters are of variable (potentially unbounded) length, in the form of a base character plus combining marks. In UCS-4 these are still variable length. Additionally, Unicode warrants that no character will ever be introduced which cannot be represented in UTF-16.
Unicode scalar values, however, are actually 21 bits: one Basic Multilingual Plane plus 16 "astral" planes, each plane comprising 65536 scalar values, some of which are reserved for special purposes.
As I see it, you are going to be doing one of the following:
- Binary string equality checks. Ignores character sets.
- String sorts and linguistic/uniform collations. Needs to work with full characters; whatever UTF you use, you are going to have to interpret it.
- Rendering text to a graphic. Whether using UCS-4 or UTF-16, you're going to have a big state machine here.
For the first, UTF-16 uses less memory bandwidth on average (astral-plane characters are very rare), which compensates for the extra decoding instructions and in general produces a speedup. For the others, everything is about equal, because you're going to have to reference a mass of tables anyway.
Re: GCC UTF16
Posted: Fri May 07, 2010 5:53 am
by cyr1x
Maybe you're looking for the '-fshort-wchar' compiler switch?
Re: GCC UTF16
Posted: Fri May 07, 2010 11:53 am
by lemonyii
Thank you so much, but why didn't you tell me earlier?
I had been using UTF-32 for some time, but I converted everything to UTF-16 just now.
Luckily it wasn't too much work. It's that easy, and it will save a lot of memory for me.
Thanks