Unicode
UTFs can begin with a byte order mark (BOM), which is encoded differently in UTF-8 and in UTF-16 of either endianness, so the encodings can distinguish themselves. If it is known that one of them is used, determining the character encoding is not difficult.
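For illustration, a minimal BOM-sniffing sketch in C. The byte sequences are the standard encodings of U+FEFF; detect_bom is a hypothetical helper, and BOM-less input still needs a heuristic or a default:

Code:

#include <stddef.h>
#include <string.h>

/* Return the name of the encoding indicated by a leading BOM,
 * or NULL if there is none. The UTF-32 checks must come before
 * UTF-16, because the UTF-32LE BOM (FF FE 00 00) begins with
 * the UTF-16LE BOM (FF FE). */
static const char *detect_bom(const unsigned char *buf, size_t len)
{
    if (len >= 3 && memcmp(buf, "\xEF\xBB\xBF", 3) == 0)
        return "UTF-8";
    if (len >= 4 && memcmp(buf, "\xFF\xFE\x00\x00", 4) == 0)
        return "UTF-32LE";
    if (len >= 4 && memcmp(buf, "\x00\x00\xFE\xFF", 4) == 0)
        return "UTF-32BE";
    if (len >= 2 && memcmp(buf, "\xFF\xFE", 2) == 0)
        return "UTF-16LE";
    if (len >= 2 && memcmp(buf, "\xFE\xFF", 2) == 0)
        return "UTF-16BE";
    return NULL; /* no BOM; caller needs a heuristic or a default */
}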
Unicode stores most characters used in modern languages in its first 16-bit plane (BMP). Characters beyond 0xffff are used only in ancient languages, for presentation of mathematics, or for less popular CJK ideograms.
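For example, this is how a code point beyond the BMP is represented in UTF-16: it is split into a surrogate pair. A minimal sketch, using U+1D11E (MUSICAL SYMBOL G CLEF) as the example:

Code:

#include <stdint.h>
#include <stdio.h>

/* Split a code point above U+FFFF into a UTF-16 surrogate pair. */
static void to_surrogates(uint32_t cp, uint16_t *hi, uint16_t *lo)
{
    cp -= 0x10000;                           /* 20 significant bits remain */
    *hi = (uint16_t)(0xD800 | (cp >> 10));   /* high surrogate: top 10 bits */
    *lo = (uint16_t)(0xDC00 | (cp & 0x3FF)); /* low surrogate: bottom 10 bits */
}

int main(void)
{
    uint16_t hi, lo;
    to_surrogates(0x1D11E, &hi, &lo); /* U+1D11E MUSICAL SYMBOL G CLEF */
    printf("U+1D11E -> %04X %04X\n", hi, lo); /* prints D834 DD1E */
    return 0;
}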
The decision to include many precomposed characters was made to support one-to-one conversion to and from other character encoding systems. Had it been designed without that requirement, it would probably be possible to store English or Chinese in its native writing in a 7-bit encoding.
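To make "precomposed" concrete: U+00E9 ("é") renders the same as "e" followed by the combining acute accent U+0301, but the byte sequences differ. A small sketch:

Code:

#include <stdio.h>
#include <string.h>

int main(void)
{
    /* U+00E9 LATIN SMALL LETTER E WITH ACUTE, precomposed, as UTF-8 */
    const char *nfc = "\xC3\xA9";
    /* U+0065 "e" + U+0301 COMBINING ACUTE ACCENT, decomposed, as UTF-8 */
    const char *nfd = "\x65\xCC\x81";

    /* Both render as "é", but the byte sequences differ, so a
     * naive byte comparison calls them different strings. */
    printf("%s\n", strcmp(nfc, nfd) == 0 ? "same bytes" : "different bytes");
    return 0;
}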
Heh, fun.
wchar_t is whatever width your C library supports. If it doesn't support full 32-bit Unicode, trash it and get one that does.
<wchar.h> has been part of the C language since 1995. It provides support for wide characters (e.g. UTF-32) and multibyte characters (e.g. UTF-8, UTF-16), so there's little you have to worry about. Of course, if you are still ignorant of anything but ASCII and think of "'a' <= x <= 'z'" as a valid way to test for lowercase-ness, the shame is yours, no one else's.
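For example, here is a sketch of that test done properly with the standard <wctype.h> classification functions (part of the same 1995 additions); whether non-ASCII characters are classified correctly depends on the active locale:

Code:

#include <locale.h>
#include <stdio.h>
#include <wctype.h>

int main(void)
{
    setlocale(LC_ALL, ""); /* make classification locale-aware */

    wchar_t ch = L'\u00DF'; /* U+00DF LATIN SMALL LETTER SHARP S, lowercase */

    /* The naive ASCII range test misses it... */
    printf("range test: %d\n", ch >= L'a' && ch <= L'z'); /* 0 */
    /* ...the standard classification function does not. */
    printf("iswlower:   %d\n", iswlower((wint_t)ch) != 0); /* 1 in a UTF-8 locale */
    return 0;
}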
And the thing with standards is that being an accepted standard is a virtue in itself. The world has more or less decided on Unicode, which is a good thing simply because the decision has been made. Databases around the world are transitioning to Unicode, if they haven't already long ago. Websites are transitioning to Unicode, only slower because of the thick-headedness of webmasters worldwide. But there is hope that computers will one day work with only one system, i.e. Unicode. It might not be perfect (and I resent that they removed Klingon from it), but it's the best we have.
You could come up with a "better" system, but you would still have to convince many, many people the world over to actually implement it, and until you do, your better system will have done nothing for compatibility.
Again, the acceptance is part of the value of a standard. As such, Unicode is A Good Thing (tm). And I haven't seen valid criticism of Unicode in this thread so far.
Every good solution is obvious once you've found it.
- Colonel Kernel
Solar wrote: "Databases around the world are transitioning to Unicode, if they haven't already long ago."

<sigh> I only wish this would happen faster... A large part of the project I've been working on for the past year is to provide a Unicode-enabled API to non-Unicode data sources. It's yucky.
Top three reasons why my OS project died:
- Too much overtime at work
- Got married
- My brain got stuck in an infinite loop while trying to design the memory manager
MTJM wrote: "UTFs can begin with a byte order mark (BOM), which is encoded differently in UTF-8 and in UTF-16 of either endianness, so the encodings can distinguish themselves. If it is known that one of them is used, determining the character encoding is not difficult."

UTF-8 is a byte stream, and hence not influenced by endianness. That's what's so great about UTF-8.
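To make that concrete, here is a minimal UTF-8 encoder sketch (utf8_encode is a hypothetical helper). The output is a plain byte sequence fully defined by the encoding, so little-endian and big-endian machines produce exactly the same bytes - unlike UTF-16, where the byte order of each code unit differs:

Code:

#include <stddef.h>
#include <stdint.h>

/* Encode one code point as UTF-8. Returns the number of bytes
 * written (1..4), or 0 for an invalid code point. */
static size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0; /* surrogates are not valid scalar values */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp < 0x110000) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0; /* beyond U+10FFFF */
}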
MTJM wrote: "Unicode stores most characters used in modern languages in its first 16-bit plane (BMP). Characters beyond 0xffff are used only in ancient languages, for presentation of mathematics, or for less popular CJK ideograms."

They are not called ideograms, but logographs or just characters.
MTJM wrote: "The decision to include many precomposed characters was made to support one-to-one conversion to and from other character encoding systems. Had it been designed without that requirement, it would probably be possible to store English or Chinese in its native writing in a 7-bit encoding."

That's nonsense. Chinese has an enormous number of different characters, even taking into account that many are composed of two elements. For Korean it's true, but the reason for precomposed characters is that at the time, composition software just wasn't good enough to actually compose the characters in a nice way.
JAL
Sorry to be the devil's advocate here, but I don't see much of a point for Unicode either. Sure, it's a great design that can represent a great deal of characters, but what good does that do? I only know one language, and so does a large portion of the world (and before anyone gets on to me about this, I'm from the US, so I'm not really sure how many people are bilingual, but you get the point), so why does any one computer need the capability to display any character in existence?
It would make a lot more sense to, say, define a "standard computing language" that would be supported on every operating system, and have a nice fixed-width encoding (hint: English and ASCII). Not only would this solve the character representation problem, but it would also stop a lot of the communication issues between people who don't speak the same language. If there was only one language usable on computers, those folks who didn't know it well enough to use one wouldn't be typing in the first place.
I realize this is counter-globalization or whatever you want to call it (oh no, we're excluding <country>!), but it's the most logical solution.
lollynoob wrote: "It would make a lot more sense to, say, define a 'standard computing language' that would be supported on every operating system, and have a nice fixed-width encoding (hint: English and ASCII)."

Are you going to deny computer access to anyone who doesn't speak English? Because you'd be forcing computer users to use that single language when using that OS. Windows and Linux would not have such big user bases if they were English-only.
lollynoob wrote: "but it would also stop a lot of the communication issues between people who don't speak the same language."

No, it wouldn't. It would sidetrack the issue by simply preventing these people from using their own language on the internet, stopping communication completely.
Also, speakers of other languages would just start making their own standards for their own languages, which could only be read on their own computers and OSes.
Then an OS which wants to support those languages has to implement these separate standards, and a webpage which uses both languages would need two separate, possibly incompatible encodings.
Result: Chaos.
This is why we need Unicode.
</rant>
"Sufficiently advanced stupidity is indistinguishable from malice."
@karekare0:
I see your point that people wouldn't learn English just to use a computer (that was just wishful thinking, on my part), but the whole "different standards for different countries" thing doesn't seem all that bad. Some examples:
I've never read a document in another language.
I've never read (for content) a foreign webpage.
I've never talked with anyone who didn't speak English over the internet.
Why would I (and I don't really consider myself that odd, statistically) need to be able to view non-English characters, when English is the only language I know? Any sort of non-English document would need to be translated anyways, so why do I need to see the symbols I don't understand in the first place?
- Colonel Kernel
lollynoob wrote: "so why does any one computer need the capability to display any character in existence?"

There are really big corporations that sell things in many different countries. Chances are they have a cluster of DB servers sitting somewhere, and those databases have millions of records describing product and customer names, as well as cities, regions, the names of sales people, etc. You could easily have, for example, all the sales data for the East Asia region in the same database, which would involve several different Asian languages with disjoint character sets.
lollynoob wrote: "I realize this is counter-globalization or whatever you want to call it (oh no, we're excluding <country>!), but it's the most logical solution."

No, it's not logical, because it ignores this amazing thing called Capitalism. You know, where people want to make money by selling things to as many people as possible, which usually involves doing business in countries with different languages.
Language is a vital part of culture, and is something billions of people are unwilling to give up just because you happen to live in a bubble and lack imagination.
Top three reasons why my OS project died:
- Too much overtime at work
- Got married
- My brain got stuck in an infinite loop while trying to design the memory manager
Plan 9 from Bell Labs wrote: "Unicode support: Plan 9 uses Unicode throughout the system. UTF-8 was invented by Ken Thompson to be used as the native encoding in Plan 9, and the whole system was converted to general use in 1992.[4]"
PS: I face three languages on the internet every day: my mother tongue, English, and Chinese.
lollynoob wrote: "Sorry to be the devil's advocate here, but I don't see much of a point for Unicode either. Sure, it's a great design that can represent a great deal of characters, but what good does that do? I only know one language, and so does a large portion of the world (and before anyone gets on to me about this, I'm from the US, so I'm not really sure how many people are bilingual, but you get the point), so why does any one computer need the capability to display any character in existence?"

There are a lot of languages that use characters outside of ASCII; why shouldn't it be possible to represent text in these languages in a standardized way?
There are perfectly reasonable use cases where people would like to be able to represent well above 256 different characters.
Different standards for different countries is a bad idea since it prevents interoperability across country borders.
And for symbols you don't understand? What's so problematic with the cent sign, pound sign, section sign, copyright sign, registered trademark sign?
lollynoob wrote: "I only know one language, and so does a large portion of the world (and before anyone gets on to me about this, I'm from the US, so I'm not really sure how many people are bilingual, but you get the point), [...]"

So 309-400 million people (reference: Wikipedia) are a large portion of the roughly 6.5 billion people on this planet? It might be time to come down from that trip and actually think about the things you say. You might notice here that two other languages actually have more native speakers than yours.
Perhaps it helps to try a change of perspective: what would you think, and how would you feel, if it were decided that Chinese is the "internet/computer" language and everything else is banned?
Just another example so that you actually get how this works: how would you feel if Iraq invaded the USA? Wouldn't you grab your machine gun and kill 'em, even if they said they were freeing your country?
Sorry, but the ignorance in that post just made me furious.
- Brynet-Inc
lollynoob wrote: "I've never read a document in another language. I've never read (for content) a foreign webpage. I've never talked with anyone who didn't speak English over the internet."

Sorry to be blunt, but it shows.
Conway's Law: If you have four groups working on a compiler, you'll get a 4-pass compiler.
Melvin Conway
Hi,
lollynoob wrote: "I've never read a document in another language. I've never read (for content) a foreign webpage. I've never talked with anyone who didn't speak English over the internet. Why would I (and I don't really consider myself that odd, statistically) need to be able to view non-English characters, when English is the only language I know? Any sort of non-English document would need to be translated anyways, so why do I need to see the symbols I don't understand in the first place?"

I see your point - an "English only" user would only really need English, a "Chinese only" user would only really need Chinese, an "Arabic only" user would only really need Arabic, etc.
In this case, use Unicode anyway, because once you support Unicode you save almost nothing by only supporting a few languages (and gain configuration hassles, etc.). The real killer is the font data, not Unicode itself.
It'd probably be a very good idea to separate the font data into separate files, where a user can choose to only install the fonts for the language(s) they use - it could save them a lot of disk space. Fortunately, Unicode makes this easy: it's designed in groups of codepoints, where each group corresponds to a different language (or a set of similar languages), which makes it easy to have a different font data file for each group of codepoints.
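For illustration, a minimal sketch of such a per-block font lookup. The code point ranges below follow the Unicode block definitions, but the structure, the function, and the .fnt file names are made up for this example:

Code:

#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-block font packages. The code point ranges
 * follow the Unicode block definitions; the file names are made up. */
struct font_block {
    uint32_t first, last;  /* inclusive code point range */
    const char *file;      /* font data file covering it */
};

static const struct font_block font_blocks[] = {
    { 0x0000, 0x007F, "latin-basic.fnt"  }, /* Basic Latin */
    { 0x0370, 0x03FF, "greek.fnt"        }, /* Greek and Coptic */
    { 0x0400, 0x04FF, "cyrillic.fnt"     }, /* Cyrillic */
    { 0x0600, 0x06FF, "arabic.fnt"       }, /* Arabic */
    { 0x4E00, 0x9FFF, "cjk-unified.fnt"  }, /* CJK Unified Ideographs */
};

/* Return the font file for a code point, or NULL if the user never
 * installed the package for that block. */
static const char *font_file_for(uint32_t cp)
{
    for (size_t i = 0; i < sizeof font_blocks / sizeof font_blocks[0]; i++)
        if (cp >= font_blocks[i].first && cp <= font_blocks[i].last)
            return font_blocks[i].file;
    return NULL;
}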
All OSs I know of already do this though...
Cheers,
Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
Sorry guys, I never realized it was my obligation to keep on the nice side of every foreigner I'll never meet. I'll personally be fine with supporting only ASCII in my (hobby) kernel, since I'm neither a large corporation nor a non-English-speaker. Sorry if I don't please everyone with my decisions, but only working with the language I know seems like the best route to avoiding a needless headache. ASCII works, has worked, and, with UTF-8 being backwards-compatible with it, will continue to work for a long while. Sorry if some Chinese guy doesn't like my choices, but I wouldn't be able to read his angry e-mails anyways.
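The backwards compatibility relied on here is easy to demonstrate: every byte below 0x80 means exactly the same thing in ASCII and in UTF-8. A minimal sketch (is_ascii is a hypothetical helper):

Code:

#include <stddef.h>

/* Every byte below 0x80 encodes the same character in ASCII and in
 * UTF-8, so any buffer that passes this check is also valid UTF-8
 * with identical meaning. */
static int is_ascii(const unsigned char *buf, size_t len)
{
    for (size_t i = 0; i < len; i++)
        if (buf[i] >= 0x80)
            return 0;
    return 1;
}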
"What's so problematic with the cent sign, pound sign, section sign, copyright sign, registered trademark sign?"

I don't have those keys on my keyboard, and I usually prefer typing "pounds", "section", and "copyright" anyways.