Unicode
UTFs can begin with a byte order mark (BOM), which is encoded differently in UTF-8 and in UTF-16 of either endianness, so the encodings can distinguish themselves. If it is known that one of them is used, determining the character encoding is not difficult.
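For illustration, a minimal BOM-sniffing sketch in C. The byte sequences are the standard encodings of U+FEFF; detect_bom is a hypothetical helper, and BOM-less input still needs a heuristic or a default:

Code:

#include <stddef.h>
#include <string.h>

/* Return the name of the encoding indicated by a leading BOM,
 * or NULL if there is none. The UTF-32 checks must come before
 * UTF-16, because the UTF-32LE BOM (FF FE 00 00) begins with
 * the UTF-16LE BOM (FF FE). */
static const char *detect_bom(const unsigned char *buf, size_t len)
{
    if (len >= 3 && memcmp(buf, "\xEF\xBB\xBF", 3) == 0)
        return "UTF-8";
    if (len >= 4 && memcmp(buf, "\xFF\xFE\x00\x00", 4) == 0)
        return "UTF-32LE";
    if (len >= 4 && memcmp(buf, "\x00\x00\xFE\xFF", 4) == 0)
        return "UTF-32BE";
    if (len >= 2 && memcmp(buf, "\xFF\xFE", 2) == 0)
        return "UTF-16LE";
    if (len >= 2 && memcmp(buf, "\xFE\xFF", 2) == 0)
        return "UTF-16BE";
    return NULL; /* no BOM; caller needs a heuristic or a default */
}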
Unicode stores most characters used in modern languages in its first 16-bit plane (BMP). Characters beyond 0xffff are used only in ancient languages, for presentation of mathematics, or for less popular CJK ideograms.
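For example, this is how a code point beyond the BMP is represented in UTF-16: it is split into a surrogate pair. A minimal sketch, using U+1D11E (MUSICAL SYMBOL G CLEF) as the example:

Code:

#include <stdint.h>
#include <stdio.h>

/* Split a code point above U+FFFF into a UTF-16 surrogate pair. */
static void to_surrogates(uint32_t cp, uint16_t *hi, uint16_t *lo)
{
    cp -= 0x10000;                           /* 20 significant bits remain */
    *hi = (uint16_t)(0xD800 | (cp >> 10));   /* high surrogate: top 10 bits */
    *lo = (uint16_t)(0xDC00 | (cp & 0x3FF)); /* low surrogate: bottom 10 bits */
}

int main(void)
{
    uint16_t hi, lo;
    to_surrogates(0x1D11E, &hi, &lo); /* U+1D11E MUSICAL SYMBOL G CLEF */
    printf("U+1D11E -> %04X %04X\n", hi, lo); /* prints D834 DD1E */
    return 0;
}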
The decision to include many precomposed characters was made to support one-to-one conversion to and from other character encoding systems. Had it been designed without that requirement, it would probably be possible to store English or Chinese in its native writing in a 7-bit encoding.
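To make "precomposed" concrete: U+00E9 ("é") renders the same as "e" followed by the combining acute accent U+0301, but the byte sequences differ. A small sketch:

Code:

#include <stdio.h>
#include <string.h>

int main(void)
{
    /* U+00E9 LATIN SMALL LETTER E WITH ACUTE, precomposed, as UTF-8 */
    const char *nfc = "\xC3\xA9";
    /* U+0065 "e" + U+0301 COMBINING ACUTE ACCENT, decomposed, as UTF-8 */
    const char *nfd = "\x65\xCC\x81";

    /* Both render as "é", but the byte sequences differ, so a
     * naive byte comparison calls them different strings. */
    printf("%s\n", strcmp(nfc, nfd) == 0 ? "same bytes" : "different bytes");
    return 0;
}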
Heh, fun.
wchar_t is whatever width your C library supports. If it doesn't support full 32-bit Unicode, trash it and get one that does.
<wchar.h> has been part of the C language since 1995. It provides support for wide characters (e.g. UTF-32) and multibyte characters (e.g. UTF-8, UTF-16), so there's little you have to worry about. Of course, if you are still ignorant of anything but ASCII and think of "'a' <= x <= 'z'" as a valid way to test for lowercase-ness, the shame is yours, no one else's.
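For example, here is a sketch of that test done properly with the standard <wctype.h> classification functions (part of the same 1995 additions); whether non-ASCII characters are classified correctly depends on the active locale:

Code:

#include <locale.h>
#include <stdio.h>
#include <wctype.h>

int main(void)
{
    setlocale(LC_ALL, ""); /* make classification locale-aware */

    wchar_t ch = L'\u00DF'; /* U+00DF LATIN SMALL LETTER SHARP S, lowercase */

    /* The naive ASCII range test misses it... */
    printf("range test: %d\n", ch >= L'a' && ch <= L'z'); /* 0 */
    /* ...the standard classification function does not. */
    printf("iswlower:   %d\n", iswlower((wint_t)ch) != 0); /* 1 in a UTF-8 locale */
    return 0;
}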
And the thing with standards is that being an accepted standard is a virtue in itself. The world has more or less decided on Unicode, which is a good thing simply because the decision has been made. Databases around the world are transitioning to Unicode, if they haven't already long ago. Websites are transitioning to Unicode, only slower because of the thick-headedness of webmasters worldwide. But there is hope that computers will one day work with only one system, i.e. Unicode. It might not be perfect (and I resent that they removed Klingon from it), but it's the best we have.
You could come up with a "better" system, but you would still have to convince many, many people the world over to actually implement it, and until you do, your better system will have done nothing for compatibility.
Again, the acceptance is part of the value of a standard. As such, Unicode is A Good Thing (tm). And I haven't seen valid criticism of Unicode in this thread so far.
Every good solution is obvious once you've found it.
- Colonel Kernel
Solar wrote: "Databases around the world are transitioning to Unicode, if they haven't already long ago."

<sigh> I only wish this would happen faster... A large part of the project I've been working on for the past year is to provide a Unicode-enabled API to non-Unicode data sources. It's yucky.
Top three reasons why my OS project died:
- Too much overtime at work
- Got married
- My brain got stuck in an infinite loop while trying to design the memory manager
MTJM wrote: "UTFs can begin with a byte order mark (BOM), which is encoded differently in UTF-8 and in UTF-16 of either endianness, so the encodings can distinguish themselves. If it is known that one of them is used, determining the character encoding is not difficult."

UTF-8 is a byte stream, and hence not influenced by endianness. That's what's so great about UTF-8.
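To make that concrete, here is a minimal UTF-8 encoder sketch (utf8_encode is a hypothetical helper). The output is a plain byte sequence fully defined by the encoding, so little-endian and big-endian machines produce exactly the same bytes - unlike UTF-16, where the byte order of each code unit differs:

Code:

#include <stddef.h>
#include <stdint.h>

/* Encode one code point as UTF-8. Returns the number of bytes
 * written (1..4), or 0 for an invalid code point. */
static size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0; /* surrogates are not valid scalar values */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp < 0x110000) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0; /* beyond U+10FFFF */
}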
MTJM wrote: "Unicode stores most characters used in modern languages in its first 16-bit plane (BMP). Characters beyond 0xffff are used only in ancient languages, for presentation of mathematics, or for less popular CJK ideograms."

They are not called ideograms, but logographs or just characters.
MTJM wrote: "The decision to include many precomposed characters was made to support one-to-one conversion to and from other character encoding systems. Had it been designed without that requirement, it would probably be possible to store English or Chinese in its native writing in a 7-bit encoding."

That's nonsense. Chinese has an enormous number of different characters, even taking into account that many are composed of two elements. For Korean it's true, but the reason for precomposed characters is that at the time, composition software just wasn't good enough to actually compose the characters in a nice way.
JAL
Sorry to be the devil's advocate here, but I don't see much of a point for Unicode either. Sure, it's a great design that can represent a great deal of characters, but what good does that do? I only know one language, and so does a large portion of the world (and before anyone gets on to me about this, I'm from the US, so I'm not really sure how many people are bilingual, but you get the point), so why does any one computer need the capability to display any character in existence?
It would make a lot more sense to, say, define a "standard computing language" that would be supported on every operating system, and have a nice fixed-width encoding (hint: English and ASCII). Not only would this solve the character representation problem, but it would also stop a lot of the communication issues between people who don't speak the same language. If there was only one language usable on computers, those folks who didn't know it well enough to use one wouldn't be typing in the first place.
I realize this is counter-globalization or whatever you want to call it (oh no, we're excluding <country>!), but it's the most logical solution.
lollynoob wrote: "It would make a lot more sense to, say, define a 'standard computing language' that would be supported on every operating system, and have a nice fixed-width encoding (hint: English and ASCII)."

Are you going to deny computer access to anyone who doesn't speak English? Because you'd be forcing computer users to use that single language when using that OS. Windows and Linux would not have such big user bases if they were English-only.
lollynoob wrote: "but it would also stop a lot of the communication issues between people who don't speak the same language."

No, it wouldn't. It would sidetrack the issue by simply preventing these people from using their own language on the internet, stopping communication completely.
Also, speakers of other languages would just start making their own standards for their own languages, which could only be read on their own computers and OSes.
Then an OS which wants to support those languages has to implement these separate standards, and a webpage which uses both languages would need two separate, possibly incompatible encodings.
Result: Chaos.
This is why we need Unicode.
</rant>
"Sufficiently advanced stupidity is indistinguishable from malice."
@karekare0:
I see your point that people wouldn't learn English just to use a computer (that was just wishful thinking, on my part), but the whole "different standards for different countries" thing doesn't seem all that bad. Some examples:
I've never read a document in another language.
I've never read (for content) a foreign webpage.
I've never talked with anyone who didn't speak English over the internet.
Why would I (and I don't really consider myself that odd, statistically) need to be able to view non-English characters, when English is the only language I know? Any sort of non-English document would need to be translated anyways, so why do I need to see the symbols I don't understand in the first place?
- Colonel Kernel
lollynoob wrote: "so why does any one computer need the capability to display any character in existence?"

There are really big corporations that sell things in many different countries. Chances are they have a cluster of DB servers sitting somewhere, and those databases have millions of records describing product and customer names, as well as cities, regions, the names of sales people, etc. You could easily have, for example, all the sales data for the East Asia region in the same database, which would involve several different Asian languages with disjoint character sets.
lollynoob wrote: "I realize this is counter-globalization or whatever you want to call it (oh no, we're excluding <country>!), but it's the most logical solution."

No, it's not logical, because it ignores this amazing thing called Capitalism. You know, where people want to make money by selling things to as many people as possible, which usually involves doing business in countries with different languages.
Language is a vital part of culture, and is something billions of people are unwilling to give up just because you happen to live in a bubble and lack imagination.
Top three reasons why my OS project died:
- Too much overtime at work
- Got married
- My brain got stuck in an infinite loop while trying to design the memory manager
Plan 9 from Bell Labs wrote: "Unicode support: Plan 9 uses Unicode throughout the system. UTF-8 was invented by Ken Thompson to be used as the native encoding in Plan 9, and the whole system was converted to general use in 1992.[4]"
PS: I face three languages on the internet every day: my mother tongue, English, and Chinese.
lollynoob wrote: "Sorry to be the devil's advocate here, but I don't see much of a point for Unicode either. Sure, it's a great design that can represent a great deal of characters, but what good does that do? I only know one language, and so does a large portion of the world (and before anyone gets on to me about this, I'm from the US, so I'm not really sure how many people are bilingual, but you get the point), so why does any one computer need the capability to display any character in existence?"

There are a lot of languages that use characters outside of ASCII; why shouldn't it be possible to represent text in these languages in a standardized way?
There are perfectly reasonable use cases where people would like to be able to represent well above 256 different characters.
Different standards for different countries is a bad idea since it prevents interoperability across country borders.
And for symbols you don't understand? What's so problematic with the cent sign, pound sign, section sign, copyright sign, registered trademark sign?
lollynoob wrote: "I only know one language, and so does a large portion of the world (and before anyone gets on to me about this, I'm from the US, so I'm not really sure how many people are bilingual, but you get the point), [...]"

So 309-400 million people (reference: Wikipedia) are a large portion of the roughly 6.5 billion people on this planet? It might be time to come down from that trip and actually think about the things you say. You might notice here that two other languages actually have more native speakers than yours.
Perhaps it helps to try a change of perspective: what would you think, and how would you feel, if it were decided that Chinese is the "internet/computer" language and everything else is banned?
Just another example so that you actually get how this works: how would you feel if Iraq invaded the USA? Wouldn't you grab your machine gun and kill 'em, even if they said they were freeing your country?
Sorry, but the ignorance in that post just made me furious.
- Brynet-Inc
lollynoob wrote: "I've never read a document in another language. I've never read (for content) a foreign webpage. I've never talked with anyone who didn't speak English over the internet."

Sorry to be blunt, but it shows.
Conway's Law: If you have four groups working on a compiler, you'll get a 4-pass compiler.
Melvin Conway
Hi,
lollynoob wrote: "I've never read a document in another language. I've never read (for content) a foreign webpage. I've never talked with anyone who didn't speak English over the internet. Why would I (and I don't really consider myself that odd, statistically) need to be able to view non-English characters, when English is the only language I know? Any sort of non-English document would need to be translated anyways, so why do I need to see the symbols I don't understand in the first place?"

I see your point - an "English only" user would only really need English, a "Chinese only" user would only really need Chinese, an "Arabic only" user would only really need Arabic, etc.
In this case, use Unicode anyway, because once you support Unicode you save almost nothing by only supporting a few languages (and gain configuration hassles, etc.). The real killer is the font data, not Unicode itself.
It'd probably be a very good idea to separate the font data into separate files, where a user can choose to only install the fonts for the language(s) they use - it could save them a lot of disk space. Fortunately, Unicode makes this easy: it's designed in groups of codepoints, where each group corresponds to a different language (or a set of similar languages), which makes it easy to have a different font data file for each group of codepoints.
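For illustration, a minimal sketch of such a per-block font lookup. The code point ranges below follow the Unicode block definitions, but the structure, the function, and the .fnt file names are made up for this example:

Code:

#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-block font packages. The code point ranges
 * follow the Unicode block definitions; the file names are made up. */
struct font_block {
    uint32_t first, last;  /* inclusive code point range */
    const char *file;      /* font data file covering it */
};

static const struct font_block font_blocks[] = {
    { 0x0000, 0x007F, "latin-basic.fnt"  }, /* Basic Latin */
    { 0x0370, 0x03FF, "greek.fnt"        }, /* Greek and Coptic */
    { 0x0400, 0x04FF, "cyrillic.fnt"     }, /* Cyrillic */
    { 0x0600, 0x06FF, "arabic.fnt"       }, /* Arabic */
    { 0x4E00, 0x9FFF, "cjk-unified.fnt"  }, /* CJK Unified Ideographs */
};

/* Return the font file for a code point, or NULL if the user never
 * installed the package for that block. */
static const char *font_file_for(uint32_t cp)
{
    for (size_t i = 0; i < sizeof font_blocks / sizeof font_blocks[0]; i++)
        if (cp >= font_blocks[i].first && cp <= font_blocks[i].last)
            return font_blocks[i].file;
    return NULL;
}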
All OSs I know of already do this though...
Cheers,
Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
Sorry guys, I never realized it was my obligation to keep on the nice side of every foreigner I'll never meet. I'll personally be fine with supporting only ASCII in my (hobby) kernel, since I'm neither a large corporation nor a non-English-speaker. Sorry if I don't please everyone with my decisions, but only working with the language I know seems like the best route to avoiding a needless headache. ASCII works, has worked, and, with UTF-8 being backwards-compatible with it, will continue to work for a long while. Sorry if some Chinese guy doesn't like my choices, but I wouldn't be able to read his angry e-mails anyways.
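The backwards compatibility relied on here is easy to demonstrate: every byte below 0x80 means exactly the same thing in ASCII and in UTF-8. A minimal sketch (is_ascii is a hypothetical helper):

Code:

#include <stddef.h>

/* Every byte below 0x80 encodes the same character in ASCII and in
 * UTF-8, so any buffer that passes this check is also valid UTF-8
 * with identical meaning. */
static int is_ascii(const unsigned char *buf, size_t len)
{
    for (size_t i = 0; i < len; i++)
        if (buf[i] >= 0x80)
            return 0;
    return 1;
}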
"What's so problematic with the cent sign, pound sign, section sign, copyright sign, registered trademark sign?"

I don't have those keys on my keyboard, and I usually prefer typing "pounds", "section", and "copyright" anyways.