Unicode

MTJM
Posts: 2
Joined: Wed Feb 20, 2008 4:27 am
Location: Katowice, Poland

Post by MTJM »

UTFs can distinguish themselves by the byte order mark (BOM), which is represented differently in UTF-8 and in UTF-16 of either endianness. So it isn't difficult to determine the character encoding if it is known that one of them is used.
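For example, a minimal BOM sniffer could look like this (just a sketch covering the three signatures; real code would also need a fallback heuristic for BOM-less input):

Code: Select all

#include <stddef.h>

/* Encodings distinguishable by their byte order mark. */
enum bom_encoding { BOM_NONE, BOM_UTF8, BOM_UTF16_BE, BOM_UTF16_LE };

/* Look at the first bytes of a buffer:
 *   UTF-8:    EF BB BF
 *   UTF-16BE: FE FF
 *   UTF-16LE: FF FE */
enum bom_encoding detect_bom(const unsigned char *buf, size_t len)
{
    if (len >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF)
        return BOM_UTF8;
    if (len >= 2 && buf[0] == 0xFE && buf[1] == 0xFF)
        return BOM_UTF16_BE;
    if (len >= 2 && buf[0] == 0xFF && buf[1] == 0xFE)
        return BOM_UTF16_LE;
    return BOM_NONE;
}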

Unicode stores most characters used in modern languages in its first 16-bit plane, the Basic Multilingual Plane (BMP). Characters beyond 0xFFFF are used only in ancient languages, for the presentation of mathematics, or for less common CJK ideograms.
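(For the rare cases beyond the BMP, UTF-16 uses surrogate pairs; a sketch of that encoding step:)

Code: Select all

#include <stdint.h>

/* Encode a code point as UTF-16. Returns the number of 16-bit
 * units written (1 or 2), or 0 for invalid input. */
int utf16_encode(uint32_t cp, uint16_t out[2])
{
    if (cp < 0x10000) {                  /* BMP: one code unit */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                    /* reserved for surrogates */
        out[0] = (uint16_t)cp;
        return 1;
    }
    if (cp > 0x10FFFF)
        return 0;
    cp -= 0x10000;                       /* 20 bits, split 10/10 */
    out[0] = (uint16_t)(0xD800 | (cp >> 10));   /* high surrogate */
    out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF)); /* low surrogate */
    return 2;
}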

The decision to include many precomposed characters was made to support one-to-one conversion to and from other character encodings. Had it been designed without that requirement, it would probably be possible to store English, or Chinese in its native writing, in a 7-bit encoding.
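For example, "é" exists both precomposed (U+00E9, so Latin-1 round-trips one-to-one) and decomposed (U+0065 followed by combining U+0301). A tiny demonstration that the two are different byte sequences even though they are canonically equivalent:

Code: Select all

#include <stdio.h>
#include <string.h>

int main(void)
{
    const char precomposed[] = "\xC3\xA9";  /* U+00E9 in UTF-8 */
    const char decomposed[]  = "e\xCC\x81"; /* U+0065 U+0301 in UTF-8 */
    /* Both display as "é", but a byte-wise comparison differs,
     * which is why normalization (NFC/NFD) exists. */
    printf("%s\n", strcmp(precomposed, decomposed) ? "different" : "same");
    return 0;
}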
User avatar
Solar
Member
Posts: 7615
Joined: Thu Nov 16, 2006 12:01 pm
Location: Germany
Contact:

Post by Solar »

Heh, fun. :roll:

wchar_t is whatever width your C library supports. If it doesn't support full 32-bit Unicode, trash it and get one that does.

<wchar.h> has been part of the C language since 1995. It provides support for wide characters (e.g. UTF-32) and multibyte characters (e.g. UTF-8, UTF-16), so there's little you have to worry about. Of course, if you are still ignorant of anything but ASCII and think of "'a' <= x <= 'z'" as a valid way to test for lowercase-ness, the shame is yours and no one else's.
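For example (a sketch; the exact result depends on your locale and your C library's wchar_t):

Code: Select all

#include <locale.h>
#include <stdio.h>
#include <wctype.h>

int main(void)
{
    setlocale(LC_ALL, "");       /* use the user's locale */
    wint_t c = 0x00DF;           /* U+00DF 'ß': lowercase, not in a-z */
    printf("ASCII range check: %d\n", c >= L'a' && c <= L'z'); /* 0 */
    printf("iswlower():        %d\n", iswlower(c) != 0);       /* 1 */
    return 0;
}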

And the thing with standards is that being an accepted standard is a virtue in itself. The world has more or less decided on Unicode, which is a good thing simply because the decision has been made. Databases around the world are transitioning to Unicode, if they haven't already long ago. Websites are transitioning to Unicode, only more slowly because of the thick-headedness of webmasters worldwide. But there is hope that computers will one day work with only one system, i.e. Unicode. It might not be perfect (and I resent that they removed Klingon from it), but it's the best we have.

You could come up with a "better" system, but you would still have to convince many, many people the world over to actually implement it, and until you do, your better system would have done nothing for compatibility.

Again, the acceptance is part of the value of a standard. As such, Unicode is A Good Thing (tm). And I haven't seen any valid criticism of Unicode in this thread so far.
Every good solution is obvious once you've found it.
User avatar
Colonel Kernel
Member
Posts: 1437
Joined: Tue Oct 17, 2006 6:06 pm
Location: Vancouver, BC, Canada
Contact:

Post by Colonel Kernel »

Solar wrote:Databases around the world are transitioning to Unicode, if they haven't already long ago.
<sigh> I only wish this would happen faster... :P A large part of the project I've been working on for the past year is to provide a Unicode-enabled API to non-Unicode data sources. It's yucky.
Top three reasons why my OS project died:
  1. Too much overtime at work
  2. Got married
  3. My brain got stuck in an infinite loop while trying to design the memory manager
Don't let this happen to you!
jal
Member
Posts: 1385
Joined: Wed Oct 31, 2007 9:09 am

Post by jal »

MTJM wrote:UTFs can distinguish themselves by the byte order mark (BOM), which is represented differently in UTF-8 and in UTF-16 of either endianness. So it isn't difficult to determine the character encoding if it is known that one of them is used.
UTF-8 is a byte stream, and hence not influenced by endianness. That's what's so great about UTF-8.
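A sketch of a UTF-8 encoder makes this visible: the output is defined byte by byte, so host endianness never enters into it:

Code: Select all

#include <stdint.h>

/* Encode one code point as UTF-8. Returns the byte count (1-4),
 * or 0 for surrogates/out-of-range input. Same bytes, same order,
 * on any machine. */
int utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}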
Unicode stores most characters used in modern languages in its first 16-bit plane, the Basic Multilingual Plane (BMP). Characters beyond 0xFFFF are used only in ancient languages, for the presentation of mathematics, or for less common CJK ideograms.
They are not called ideograms, but logographs or just characters.
The decision to include many precomposed characters was made to support one-to-one conversion to and from other character encodings. Had it been designed without that requirement, it would probably be possible to store English, or Chinese in its native writing, in a 7-bit encoding.
That's nonsense. Chinese has an enormous number of different characters, even taking into account that many are composed of two elements. For Korean it's true, but the reason for precomposed characters there is that, at the time, composition software just wasn't good enough to actually compose the characters in a nice way.
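(Korean is a special case because the precomposed Hangul syllables are laid out algorithmically, so they can be computed rather than tabulated; a sketch of the formula from the Unicode standard:)

Code: Select all

#include <stdint.h>

/* Compose a precomposed Hangul syllable code point from its jamo
 * indices: 19 leading consonants, 21 vowels, 28 trailing consonants
 * (index 0 = no trailing consonant). */
uint32_t hangul_syllable(int lead, int vowel, int tail)
{
    return 0xAC00 + (uint32_t)((lead * 21 + vowel) * 28 + tail);
}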


JAL
User avatar
lollynoob
Member
Posts: 150
Joined: Sun Oct 14, 2007 11:49 am

Post by lollynoob »

Sorry to be the devil's advocate here, but I don't see much of a point for Unicode either. Sure, it's a great design that can represent a great deal of characters, but what good does that do? I only know one language, and so does a large portion of the world (and before anyone gets on to me about this, I'm from the US, so I'm not really sure how many people are bilingual, but you get the point), so why does any one computer need the capability to display any character in existence?

It would make a lot more sense to, say, define a "standard computing language" that would be supported on every operating system, and have a nice fixed-width encoding (hint: English and ASCII). Not only would this solve the character representation problem, but it would also stop a lot of the communication issues between people who don't speak the same language. If there was only one language usable on computers, those folks who didn't know it well enough to use one wouldn't be typing in the first place.

I realize this is counter-globalization or whatever you want to call it (oh no, we're excluding <country>!), but it's the most logical solution.
User avatar
Zenith
Member
Posts: 224
Joined: Tue Apr 10, 2007 4:42 pm

Post by Zenith »

It would make a lot more sense to, say, define a "standard computing language" that would be supported on every operating system, and have a nice fixed-width encoding (hint: English and ASCII).
Are you going to deny computer access to anyone who doesn't speak English? Because that is what you'd be doing: forcing computer users to use that single language when using that OS. Windows and Linux would not have such big user bases if they were English-only.
but it would also stop a lot of the communication issues between people who don't speak the same language.
No, it wouldn't. It would side-track the issue by just preventing these people from using their own language on the internet, stopping communication completely.

Also, speakers of other languages would just start making their own standards for their own languages, which could only be read on their own computers / OSes.

Then an OS which wants to support those languages has to implement these separate standards, and a webpage which uses both languages would need two separate, possibly incompatible encodings.

Result: Chaos.

This is why we need Unicode.

</rant> :twisted:
"Sufficiently advanced stupidity is indistinguishable from malice."
User avatar
lollynoob
Member
Posts: 150
Joined: Sun Oct 14, 2007 11:49 am

Post by lollynoob »

@karekare0:

I see your point that people wouldn't learn English just to use a computer (that was just wishful thinking on my part), but the whole "different standards for different countries" thing doesn't seem all that bad. Some examples:

I've never read a document in another language.
I've never read (for content) a foreign webpage.
I've never talked with anyone who didn't speak English over the internet.

Why would I (and I don't really consider myself that odd, statistically) need to be able to view non-English characters, when English is the only language I know? Any sort of non-English document would need to be translated anyways, so why do I need to see the symbols I don't understand in the first place?
User avatar
Colonel Kernel
Member
Posts: 1437
Joined: Tue Oct 17, 2006 6:06 pm
Location: Vancouver, BC, Canada
Contact:

Post by Colonel Kernel »

lollynoob wrote:so why does any one computer need the capability to display any character in existence?
There are really big corporations that sell things in many different countries. Chances are they have a cluster of DB servers sitting somewhere, and those databases have millions of records describing product and customer names, as well as cities, regions, the names of sales people, etc. You could easily have, for example, all the sales data for the East Asia region in the same database, which would involve several different Asian languages with disjoint character sets.
I realize this is counter-globalization or whatever you want to call it (oh no, we're excluding <country>!), but it's the most logical solution.
No, it's not logical, because it ignores this amazing thing called Capitalism. You know, where people want to make money by selling things to as many people as possible, which usually involves doing business in countries with different languages.

Language is a vital part of culture, and is something billions of people are unwilling to give up just because you happen to live in a bubble and lack imagination.
Top three reasons why my OS project died:
  1. Too much overtime at work
  2. Got married
  3. My brain got stuck in an infinite loop while trying to design the memory manager
Don't let this happen to you!
User avatar
binutils
Member
Posts: 214
Joined: Thu Apr 05, 2007 6:07 am

Post by binutils »

Plan 9 from Bell Labs wrote:Unicode support

Plan 9 uses Unicode throughout the system. UTF-8 was invented by Ken Thompson to be used as the native encoding in Plan 9, and the whole system was converted to it for general use in 1992.[4]
--
PS: I face three languages on the internet every day: my mother tongue, English, and Chinese.
skyking
Member
Posts: 174
Joined: Sun Jan 06, 2008 8:41 am

Post by skyking »

lollynoob wrote:Sorry to be the devil's advocate here, but I don't see much of a point for Unicode either. Sure, it's a great design that can represent a great deal of characters, but what good does that do? I only know one language, and so does a large portion of the world (and before anyone gets on to me about this, I'm from the US, so I'm not really sure how many people are bilingual, but you get the point), so why does any one computer need the capability to display any character in existence?
There are a lot of languages that use characters outside of ASCII, why shouldn't it be possible to represent text in these languages in a standardized way?

There are perfectly reasonable use cases where people would like to be able to represent well above 256 different characters.

Different standards for different countries is a bad idea since it prevents interoperability across country borders.

And as for symbols you don't understand: what's so problematic about the cent sign, pound sign, section sign, copyright sign, or registered trademark sign?
User avatar
bluecode
Member
Posts: 202
Joined: Wed Nov 17, 2004 12:00 am
Location: Germany
Contact:

Post by bluecode »

lollynoob wrote:I only know one language, and so does a large portion of the world (and before anyone gets on to me about this, I'm from the US, so I'm not really sure how many people are bilingual, but you get the point), [...]
So 309-400 million people (reference: Wikipedia) are a large portion of the roughly 6.5 billion people on the planet :?: It might be time to come down from that trip and actually think about the things you say. You might notice that two other languages actually have more native speakers than yours.

Perhaps it helps to just try to change perspective: what would you think, and how would you feel, if it were decided that Chinese is the "internet/computer" language and everything else is banned?
Just another example, so that you actually get how this works: how would you feel if Iraq invaded the USA? Wouldn't you grab your machine gun and kill 'em, even if they said they were freeing your country?

Sorry, but the ignorance in that post just made me furious.
User avatar
Brynet-Inc
Member
Posts: 2426
Joined: Tue Oct 17, 2006 9:29 pm
Libera.chat IRC: brynet
Location: Canada
Contact:

Post by Brynet-Inc »

That's very narrow-minded thinking, lollynoob... :roll:
Twitter: @canadianbryan. Award by smcerm, I stole it. Original was larger.
User avatar
Wave
Member
Posts: 50
Joined: Sun Jan 20, 2008 5:51 am

Post by Wave »

I've never read a document in another language.
I've never read (for content) a foreign webpage.
I've never talked with anyone who didn't speak English over the internet.
Sorry to be blunt, but it shows. :wink:
Conway's Law: If you have four groups working on a compiler, you'll get a 4-pass compiler.
Melvin Conway
User avatar
Brendan
Member
Posts: 8561
Joined: Sat Jan 15, 2005 12:00 am
Location: At his keyboard!
Contact:

Post by Brendan »

Hi,
lollynoob wrote:I've never read a document in another language.
I've never read (for content) a foreign webpage.
I've never talked with anyone who didn't speak English over the internet.

Why would I (and I don't really consider myself that odd, statistically) need to be able to view non-English characters, when English is the only language I know? Any sort of non-English document would need to be translated anyways, so why do I need to see the symbols I don't understand in the first place?
I see your point - an "English only" user would only really need English, a "Chinese only" user would only really need Chinese, an "Arabic only" user would only really need Arabic, etc.

In this case, use Unicode anyway, because once you support Unicode you save almost nothing by only supporting a few languages (and you gain configuration hassles, etc.). The real killer is the font data, not Unicode itself.

It'd probably be a very good idea to separate the font data into separate files, where a user can choose to install only the fonts for the language(s) they use - it could save them a lot of disk space. Fortunately, Unicode makes this easy - it's designed in groups of codepoints, where each group corresponds to a different language (or set of similar languages), which makes it easy to have a different font data file for each group of codepoints.
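A sketch of what the lookup could boil down to (the block ranges follow Unicode's layout, but the file names here are invented for illustration):

Code: Select all

#include <stddef.h>
#include <stdint.h>

struct font_block {
    uint32_t first, last;
    const char *file;       /* hypothetical font data file */
};

static const struct font_block blocks[] = {
    { 0x0000, 0x007F, "latin-basic.fnt" },
    { 0x0370, 0x03FF, "greek.fnt"       },
    { 0x0400, 0x04FF, "cyrillic.fnt"    },
    { 0x4E00, 0x9FFF, "cjk-unified.fnt" },
};

/* Return the font file covering a code point, or NULL if the
 * user didn't install that language's font package. */
const char *font_file_for(uint32_t cp)
{
    for (size_t i = 0; i < sizeof blocks / sizeof blocks[0]; i++)
        if (cp >= blocks[i].first && cp <= blocks[i].last)
            return blocks[i].file;
    return NULL;            /* draw a replacement glyph instead */
}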

All OSs I know of already do this though... ;)


Cheers,

Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
User avatar
lollynoob
Member
Posts: 150
Joined: Sun Oct 14, 2007 11:49 am

Post by lollynoob »

Sorry guys, I never realized it was my obligation to stay on the nice side of every foreigner I'll never meet. I'll personally be fine with supporting only ASCII in my (hobby) kernel, since I'm neither a large corporation nor a non-English-speaker. Sorry if I don't please everyone with my decisions, but working only with the language I know seems like the best route to avoiding a needless headache. ASCII works, has worked, and with UTF-8 being backwards-compatible with it, will continue to work for a long while. Sorry if some Chinese guy doesn't like my choices, but I wouldn't be able to read his angry e-mails anyways.
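(That backwards compatibility is easy to see: every byte below 0x80 means the same thing in ASCII and in UTF-8, so a pure-ASCII kernel emits valid UTF-8 for free. A sketch of the check:)

Code: Select all

#include <stddef.h>

/* A 7-bit ASCII string is already valid UTF-8: UTF-8 reserves
 * all bytes below 0x80 for ASCII, unchanged. */
int is_plain_ascii(const unsigned char *s, size_t len)
{
    for (size_t i = 0; i < len; i++)
        if (s[i] > 0x7F)
            return 0;
    return 1;
}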
What's so problematic with the cent sign, pound sign, section sign, copyright sign, registered trademark sign?
I don't have those keys on my keyboard, and I usually prefer typing "pounds", "section", and "copyright" anyways.