
Unicode

Posted: Fri Jan 18, 2008 4:15 am
by mathematician
With at least three different ways of encoding Unicode characters (UTF-8, UTF-16 and UTF-32), it seems to me that the standard is that there is no standard, which seems to defeat the object of the exercise.

Posted: Fri Jan 18, 2008 5:01 am
by Combuster
Troll?

Posted: Fri Jan 18, 2008 5:15 am
by Korona
Unicode is a standard that assigns character codes to characters. UTF-8, UTF-16LE/BE and UTF-32LE/BE are just encodings for Unicode. UTF-8 is great when you want to store characters on a hard disk or send them over a network, but if you want to apply operations to them, variable-length characters soon become inconvenient. You could use UTF-8 for storing characters on disk and UTF-32LE for manipulating them in main memory.
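
For example, here is a rough sketch of that split (the function name is just illustrative, and validation of overlong forms and surrogate values is omitted): decode UTF-8 bytes from disk or the network into 32-bit code points before manipulating them.

Code:
#include <stdint.h>
#include <stddef.h>

/* Sketch: decode a UTF-8 byte string into 32-bit code points for in-memory
 * work. Does not reject overlong forms or surrogate values, which a real
 * decoder must do. Returns the number of code points written to dst, or
 * (size_t)-1 on a malformed sequence. dst must have room for len values. */
size_t utf8_to_utf32(const uint8_t *src, size_t len, uint32_t *dst)
{
    size_t out = 0;
    size_t i = 0;

    while (i < len) {
        uint8_t b = src[i];
        uint32_t cp;
        size_t extra;

        if (b < 0x80)      { cp = b;        extra = 0; }  /* plain ASCII        */
        else if (b < 0xC0) { return (size_t)-1; }         /* stray continuation */
        else if (b < 0xE0) { cp = b & 0x1F; extra = 1; }  /* 2-byte sequence    */
        else if (b < 0xF0) { cp = b & 0x0F; extra = 2; }  /* 3-byte sequence    */
        else if (b < 0xF8) { cp = b & 0x07; extra = 3; }  /* 4-byte sequence    */
        else               { return (size_t)-1; }         /* invalid lead byte  */

        if (i + extra >= len)
            return (size_t)-1;                            /* truncated sequence */

        for (size_t k = 1; k <= extra; k++) {
            if ((src[i + k] & 0xC0) != 0x80)
                return (size_t)-1;                        /* bad continuation   */
            cp = (cp << 6) | (src[i + k] & 0x3F);
        }

        dst[out++] = cp;
        i += 1 + extra;
    }
    return out;
}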

Posted: Fri Jan 18, 2008 1:58 pm
by mathematician
Korona wrote:Unicode is a standard that assigns character codes to characters. UTF-8, UTF-16LE/BE and UTF-32LE/BE are just encodings for Unicode. UTF-8 is great when you want to store characters on a hard disk or send them over a network, but if you want to apply operations to them, variable-length characters soon become inconvenient. You could use UTF-8 for storing characters on disk and UTF-32LE for manipulating them in main memory.
As I understand the point of Unicode, it is so that computers around the world can talk to one another in a way they couldn't with DOS code pages. However, if you have a Windows program which outputs text using the UTF-16 encoding, and a computer somewhere else is expecting to read text encoded as UTF-8, the point of the exercise seems to be defeated.

Posted: Fri Jan 18, 2008 4:09 pm
by bewing
I basically agree with you, fellow mathematician. :wink:

ASCII-7 was originally developed with the intent of just being a US standard (as is clearly implied by the name) -- back when every bit was precious. When the PC-XT/MSDOS was created, the industry had one, and only one, opportunity to completely re-engineer the character encoding to meet international and programming needs. IBM completely failed to do this.

For programming purposes, anything except fixed-width characters (in terms of storage, of course) ends up being so ugly as to be ridiculous. Of course, for those of us who choke at the idea of wasting half your computer's memory storing zeroes as every other byte of character data, there would be other problems.

But UTF8 is just a really ugly patch on top of ASCII -- and ASCII was probably a mistake in the first place. Adding more UTFs does not make Unicode any better. As you say, it just devolves the "standard" further.

The question is: is it possible at this stage to drop Unicode/ASCII and come up with something better? I would like to believe that it is, yes. I do not want to implement something as poorly designed as UTF8 in my kernel.

As I understand it, there are fewer than 70K languages on the planet, and the number is dropping by 3K a year. Hopefully, within a century or two, this means that the pictographic languages will either die or be simplified -- to the point where we can encode everything/anything in 16 bits -- and with 256GB of RAM on the system, even I will be content to waste half of it. :D

Posted: Fri Jan 18, 2008 4:39 pm
by crazygray1
Different languages are a good thing, don't let them die. Though I do see your point.

Posted: Fri Jan 18, 2008 8:59 pm
by Brendan
Hi,
bewing wrote:The question is: is it possible at this stage to drop Unicode/ASCII and come up with something better? I would like to believe that it is, yes. I do not want to implement something as poorly designed as UTF8 in my kernel.
You'd end up with something similar to Unicode anyway. You'd need to assign a value to each possible character, and you'd need some control codes (space, tab, newline, backspace, "right to left" text direction, "left to right" text direction, etc). It'd make sense to group these into sets of characters (e.g. one range of values for English characters, one range of values for Russian characters, one range of values for Chinese characters, etc) so that it's easy to load font data for a range of characters into memory on demand.

Once you've got a standard that says which value represents which character, you'd need to encode it somehow. There are too many characters for 16 bits, so 32-bit integers seem like an obvious choice.

Unfortunately different systems have different endianness, and working with little endian integers on a big endian machine (or the opposite) is painful - it's better to convert input data into the computer's native endianness. It'd also be good if you could have a generic 32-bit encoding where the first 32-bit value tells you whether it's big endian or little endian. So now we've got a standard to map characters to integers, a generic 32-bit encoding, a 32-bit big endian encoding and a 32-bit little endian encoding.
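
A minimal sketch of that "first 32-bit value tells you the byte order" idea, assuming the stream begins with a byte order mark (U+FEFF); the names are just illustrative:

Code:
#include <stdint.h>

/* Peek at the first four bytes of a UTF-32 stream to decide its byte order,
 * assuming it starts with a BOM (U+FEFF). Returns 1 for little endian,
 * 0 for big endian, -1 if there is no recognisable BOM (caller picks a default). */
int utf32_bom_is_little_endian(const uint8_t buf[4])
{
    if (buf[0] == 0xFF && buf[1] == 0xFE && buf[2] == 0x00 && buf[3] == 0x00)
        return 1;                      /* U+FEFF stored little endian */
    if (buf[0] == 0x00 && buf[1] == 0x00 && buf[2] == 0xFE && buf[3] == 0xFF)
        return 0;                      /* U+FEFF stored big endian */
    return -1;
}

/* Swap one 32-bit unit when the stream's byte order differs from the host's. */
uint32_t swap32(uint32_t v)
{
    return (v >> 24) | ((v >> 8) & 0x0000FF00u) |
           ((v << 8) & 0x00FF0000u) | (v << 24);
}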

Now there's a problem. A lot of things were/are designed for 7-bit ASCII and use an 8-bit "char", and don't work with 32-bit characters (most internet protocols, file systems, compilers, etc). To get around that, it'd be nice to have an 8-bit encoding that is compatible with software designed for strings of 8-bit characters, so that Unicode data can pass through one of these pieces of software without becoming trashed. Encoding it as a string of 8-bit pieces also makes it easier for computers to communicate, as there is no endianness problem (perfect for something like HTTP, for example).
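
This is essentially what UTF-8 does. A rough sketch of the encoding side (illustrative name, no rejection of surrogate values): plain ASCII comes out as a single unchanged byte and no byte is ever zero, which is why the result survives software built around 8-bit "char" strings.

Code:
#include <stdint.h>
#include <stddef.h>

/* Sketch: encode one code point as 1-4 UTF-8 bytes, returning how many bytes
 * were written (0 if the value is outside the Unicode range). */
size_t utf8_encode(uint32_t cp, uint8_t out[4])
{
    if (cp < 0x80) {                         /* 0xxxxxxx - same byte as ASCII */
        out[0] = (uint8_t)cp;
        return 1;
    } else if (cp < 0x800) {                 /* 110xxxxx 10xxxxxx */
        out[0] = 0xC0 | (uint8_t)(cp >> 6);
        out[1] = 0x80 | (uint8_t)(cp & 0x3F);
        return 2;
    } else if (cp < 0x10000) {               /* 1110xxxx 10xxxxxx 10xxxxxx */
        out[0] = 0xE0 | (uint8_t)(cp >> 12);
        out[1] = 0x80 | (uint8_t)((cp >> 6) & 0x3F);
        out[2] = 0x80 | (uint8_t)(cp & 0x3F);
        return 3;
    } else if (cp < 0x110000) {              /* 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
        out[0] = 0xF0 | (uint8_t)(cp >> 18);
        out[1] = 0x80 | (uint8_t)((cp >> 12) & 0x3F);
        out[2] = 0x80 | (uint8_t)((cp >> 6) & 0x3F);
        out[3] = 0x80 | (uint8_t)(cp & 0x3F);
        return 4;
    }
    return 0;
}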

Of those 32-bit integers, the characters that are used most often only need 16 bits, so a generic 16-bit encoding would be a good compromise between efficiency and hassle. A 16-bit encoding would also have endianness problems, so you'd also want a little endian encoding and a big endian encoding.

So now we've got a standard to map characters to integers (Unicode), a generic 32-bit encoding (UTF-32), a 32-bit big endian encoding (UTF-32BE), a 32-bit little endian encoding (UTF-32LE), a generic 16-bit encoding (UTF-16), a 16-bit big endian encoding (UTF-16BE), a 16-bit little endian encoding (UTF-16LE) and an 8-bit encoding (UTF-8).

How is software meant to handle all these different encodings? It's relatively simple to convert from one encoding to another. For example, a normal application (e.g. a text editor) could convert data in any encoding into UTF-32BE, then process UTF-32BE only, then convert the output to any other encoding when it's finished.
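
For instance, a rough sketch of one such conversion, UTF-16LE bytes to host-order UTF-32 code points (illustrative name, minimal validation):

Code:
#include <stdint.h>
#include <stddef.h>

/* Sketch: convert UTF-16LE bytes into host-order UTF-32 code points.
 * 'len' is in bytes and should be even. Returns the number of code points
 * written, or (size_t)-1 on a broken surrogate pair. */
size_t utf16le_to_utf32(const uint8_t *src, size_t len, uint32_t *dst)
{
    size_t out = 0;
    size_t i = 0;

    while (i + 1 < len) {
        uint32_t w = (uint32_t)src[i] | ((uint32_t)src[i + 1] << 8);
        i += 2;

        if (w >= 0xD800 && w <= 0xDBFF) {             /* lead surrogate */
            if (i + 1 >= len)
                return (size_t)-1;                    /* truncated pair */
            uint32_t lo = (uint32_t)src[i] | ((uint32_t)src[i + 1] << 8);
            if (lo < 0xDC00 || lo > 0xDFFF)
                return (size_t)-1;                    /* not a trail surrogate */
            i += 2;
            w = 0x10000 + ((w - 0xD800) << 10) + (lo - 0xDC00);
        } else if (w >= 0xDC00 && w <= 0xDFFF) {
            return (size_t)-1;                        /* stray trail surrogate */
        }

        dst[out++] = w;
    }
    return out;
}
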
bewing wrote:As I understand it, there are fewer than 70K languages on the planet, and the number is dropping by 3K a year. Hopefully, within a century or two, this means that the pictographic languages will either die or be simplified -- to the point where we can encode everything/anything in 16 bits -- and with 256GB of RAM on the system, even I will be content to waste half of it. :D
If you look at the numbers, it's more likely that in a few hundred years we'll all be using a pictographic language, especially when people realise that the English language is absolute crap (too many inconsistencies and double meanings) and is also inefficient (e.g. something that takes 100 characters in English might only take 10 characters in Japanese).

So, do you want a standard way of representing characters that can be used for everything, or a standard way of representing current characters that's broken because historians can't use it? In several hundred years' time, it'd be nice to be able to store things like the American constitution on a computer even if nobody uses the language it's written in anymore, so that American school kids can look it up on the internet and laugh at the funny squiggly characters (before downloading the International Chinese version of it that they can actually read). :P


Cheers,

Brendan

Posted: Fri Jan 18, 2008 9:39 pm
by Alboin
Brendan wrote:If you look at the numbers, it's more likely that in a few hundred years we'll all be using a pictographic language, especially when people realise that the English language is absolute crap (too many inconsistencies and double meanings) and is also inefficient (e.g. something that takes 100 characters in English might only take 10 characters in Japanese).

So, do you want a standard way of representing characters that can be used for everything, or a standard way of representing current characters that's broken because historians can't use it? In several hundred years' time, it'd be nice to be able to store things like the American constitution on a computer even if nobody uses the language it's written in anymore, so that American school kids can look it up on the internet and laugh at the funny squiggly characters (before downloading the International Chinese version of it that they can actually read). :P
The idea of having a language of several thousand characters with often similar looks freakishly frightens me.

Posted: Fri Jan 18, 2008 10:06 pm
by Brendan
Hi,
Alboin wrote:The idea of having a language of several thousand characters with often similar looks freakishly frightens me.
How do you feel about the idea of having a language of several thousand words with often similar looks?

How about a sentence like "Might might wind faster than wind, but when the bandage is wound you'll need to remove your dress so I can dress your wound."?

Note: might != might, wind != wind, wound != wound, and dress != dress.

For English, we use context and grammar to distinguish between identical words. A person who doesn't understand a pictographic language couldn't use context and grammar to distinguish between similar looking characters, and similar looking characters might seem much more awkward than they really are until you learn the language....


Cheers,

Brendan

Posted: Sat Jan 19, 2008 2:36 pm
by bewing
I had the language argument on slashdot, and my position is that overall numbers are not the issue. The issue is which language is easiest for a non-native speaker to learn. Ask a multi-lingual non-native speaker which was their easiest language to learn (as I have done), and I am certain that you will hear that English is far easier to learn than any other -- especially if you include reading/writing the language.
But in any case, if you do go by numbers, almost all computers are little endian at this point, so I'd start by trashing both the big-endian formats. And there are only 8 languages (with 7 alphabets/pictographic sets as I recall) that are commonly used. All the rest are marginal, and certain to decrease, and probably die, over time. If you look forward and cut your supported character set down to these, you are starting to get awfully close to that 16-bit limit. It is not necessary to encode the alphabets of dead languages in current character sets forever, just so ancient documents can be conveniently read. You can have some 16-bit flags at the beginnings of files to select an "Ancient Dead Language" font.

Posted: Sun Jan 20, 2008 2:32 am
by Brendan
Hi,
bewing wrote:You can have some 16-bit flags at the beginnings of files to select an "Ancient Dead Language" font.
That doesn't work for mixed-language files. For example, consider the sentence "The old dead guy said '<foo>'", where <foo> is ancient Greek or something.

You could have a special "change fonts" token but this isn't so good either, because how a value is interpreted would depend on previous state, which makes processing characters more complex.

The nice thing about Unicode is that commonly used characters are encoded as 16-bit values. Less frequently used characters take more than 16 bits (a surrogate pair of two 16-bit words), but there is no "I'm currently using font <foo>" state to keep track of, *and* you can look at any 16-bit value and know whether it's a whole character, the first word of a pair or the second word.
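
A small sketch of that last property (illustrative names): any 16-bit unit tells you on its own what role it plays, with no font or state to track.

Code:
#include <stdint.h>

enum utf16_unit_kind { UNIT_WHOLE_CHAR, UNIT_LEAD_SURROGATE, UNIT_TRAIL_SURROGATE };

/* Classify a single UTF-16 code unit in isolation. */
enum utf16_unit_kind classify_utf16_unit(uint16_t w)
{
    if (w >= 0xD800 && w <= 0xDBFF) return UNIT_LEAD_SURROGATE;   /* first word of a pair   */
    if (w >= 0xDC00 && w <= 0xDFFF) return UNIT_TRAIL_SURROGATE;  /* second word of a pair  */
    return UNIT_WHOLE_CHAR;                                       /* complete BMP character */
}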

I guess what I'm saying is that if anyone is intending to replace Unicode, they should find something that covers at least as many characters and languages as Unicode, while also having encodings that are at least as good as Unicode's encodings. Solutions that don't meet these minimum requirements just aren't solutions IMHO. Even if you can find something that actually is better than Unicode, I doubt it'd be worth the hassle trying to replace an established standard with something slightly different.


Cheers,

Brendan

Posted: Tue Feb 05, 2008 9:41 am
by jal
mathematician wrote:However, if you have a Windows program which outputs text using the UTF-16 encoding, and a computer somewhere else is expecting to read text encoded as UTF-8, the point of the exercise seems to be defeated.
No, it is not. You are mixing up the meaning of character values with the encoding of those values. Value 65 will always be capital A in Unicode, and value 234 will always be lower-case e with circumflex. In contrast, when using code pages, value 234 may be e with circumflex in one code page (e.g. ISO-8859-1, Latin 1) but e with ogonek in another (e.g. ISO-8859-2), and Cyrillic small letter hard sign in yet another (e.g. ISO-8859-5). In that case you need to map stuff, and you are lost if you don't know which code page was used.

On the other hand, UTF-8, UTF-16 and UTF-32 (or UCS-4) are encodings, i.e. these standards define how a value of 65 or 234 is represented in binary. It is fairly trivial to support all three of these, and converting from one to the other is also trivial. That's a totally different situation from code pages, which need extensive mapping plus knowledge of which code page is used (in contrast, a random string of characters can easily be analysed to see whether it's UTF-8, 16, 32 or something else).
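
For instance, the same two character values laid out in three encodings (UTF-16/32 shown little endian):

Code:
#include <stdint.h>

/* Value 65 (capital A) and value 234 (e with circumflex, U+00EA):
 * the character code stays the same, only the byte layout differs. */
const uint8_t a_utf8[]         = { 0x41 };                    /* same single byte as ASCII */
const uint8_t e_circ_utf8[]    = { 0xC3, 0xAA };              /* two bytes                 */
const uint8_t e_circ_utf16le[] = { 0xEA, 0x00 };              /* one 16-bit unit           */
const uint8_t e_circ_utf32le[] = { 0xEA, 0x00, 0x00, 0x00 };  /* one 32-bit unit           */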


JAL

Posted: Tue Feb 05, 2008 10:11 am
by jal
bewing wrote:When the PC-XT/MSDOS was created, the industry had one, and only one, opportunity to completely re-engineer the character encoding to meet international and programming needs. IBM completely failed to do this.
In those days, every bit was still precious. Anyone suggesting back then to encode each character with 16 bits would have been ridiculed and then shot.
bewing wrote:For programming purposes, anything except fixed-width characters (in terms of storage, of course) ends up being so ugly as to be ridiculous.
I find this supposition ridiculous. When displaying text, one has to follow strings sequentially anyway, so it doesn't matter whether a variable-length encoding is used. And how often do you want to index strings of text anyway? For text manipulation, it really doesn't matter whether you use fixed-length characters or not (especially not when, like UTF-8, one can search in both directions reliably).
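
A small sketch of that bidirectional property (illustrative name): finding the start of the character an offset falls inside only means skipping backwards over continuation bytes.

Code:
#include <stdint.h>
#include <stddef.h>

/* Step backwards from 'pos' over UTF-8 continuation bytes (10xxxxxx) until a
 * lead byte is reached; that is the start of the character 'pos' falls inside. */
size_t utf8_char_start(const uint8_t *s, size_t pos)
{
    while (pos > 0 && (s[pos] & 0xC0) == 0x80)
        pos--;
    return pos;
}
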
bewing wrote:Of course, for those of us who choke at the idea of wasting half your computer's memory storing zeroes as every other byte of character data, there would be other problems.
UTF-16 is also a variable-length encoding, although many seem to forget that. To have a fixed-length encoding, use UTF-32, wasting 3 out of every 4 bytes for Western text.
bewing wrote:But UTF8 is just a really ugly patch on top of ASCII -- and ASCII was probably a mistake in the first place. Adding more UTFs does not make Unicode any better. As you say, it just devolves the "standard" further.
Bullshit. UTF8 is not an 'ugly patch', it's a decent encoding. And what's wrong with ASCII? It's just a collection of the most-used US characters, in a certain, not-entirely-random order. And there's no need for more UTFs; we have all the UTFs we need.
bewing wrote:The question is: is it possible at this stage to drop Unicode/ASCII and come up with something better? I would like to believe that it is, yes.
Unicode is not perfect in a few respects. In particular, having so many precomposed characters (e.g. Hangul, but also Western accented characters) was a bad decision, though understandable from a technical and pragmatic point of view. But the fact remains: there are tens of thousands of different characters out there, and you have to describe them somehow. Do you have a better idea than assigning each a unique identifier?
bewing wrote:I do not want to implement something as poorly designed as UTF8 in my kernel.
UTF-8 is not poorly designed. It is a perfectly valid variable-length encoding. I dare you to come up with something better that is at the same time compact enough not to be a huge space waster.
bewing wrote:As I understand it, there are fewer than 70K languages on the planet, and the number is dropping by 3K a year.
There are about 6,900 languages; you're off by a factor of 10. And the number is dropping by about 25 a year; you're off by a factor of 100 there. Please get your facts straight if you don't know what you're talking about!
bewing wrote:Hopefully, within a century or two, this means that the pictographic languages will either die or be simplified -- to the point where we can encode everything/anything in 16 bits
You are so thoroughly misguided, it's hard to even know where to start to debunk this. Ok, I'll try:
1) The most extensively used pictographic writing system is of course Chinese/Japanese. Although most of the characters are within the 16-bit range, do you really think they will give up on the other characters in 200 years?
2) Most code points above 0xFFFF already belong to dead languages and historic scripts.
3) Most endangered/dying languages do not have a script.

That's it for now, next time please refrain from making statements about things you have little knowledge about!


JAL

Posted: Tue Feb 05, 2008 10:15 am
by jal
bewing wrote:And there are only 8 languages (with 7 alphabets/pictographic sets as I recall) that are commonly used. All the rest are marginal, and certain to decrease, and probably die, over time.
This, again, is utter bullshit of course. In Europe alone, there are more than 10 languages with more than 20 million speakers each. Again, please refrain from blabbering on about things you obviously have not the slightest clue about.


JAL

Posted: Thu Feb 07, 2008 3:27 pm
by skyking
bewing wrote:I basically agree with you, fellow mathematician. :wink:

ASCII-7 was originally developed with the intent of just being a US standard (as is clearly implied by the name) -- back when every bit was precious. When the PC-XT/MSDOS was created, the industry had one, and only one, opportunity to completely re-engineer the character encoding to meet international and programming needs. IBM completely failed to do this.

For programming purposes, anything except fixed-width characters (in terms of storage, of course) ends up being so ugly as to be ridiculous. Of course, for those of us who choke at the idea of wasting half your computer's memory storing zeroes as every other byte of character data, there would be other problems.

But UTF8 is just a really ugly patch on top of ASCII -- and ASCII was probably a mistake in the first place. Adding more UTFs does not make Unicode any better. As you say, it just devolves the "standard" further.

The question is: is it possible at this stage to drop Unicode/ASCII and come up with something better? I would like to believe that it is, yes. I do not want to implement something as poorly designed as UTF8 in my kernel.

As I understand it, there are fewer than 70K languages on the planet, and the number is dropping by 3K a year. Hopefully, within a century or two, this means that the pictographic languages will either die or be simplified -- to the point where we can encode everything/anything in 16 bits -- and with 256GB of RAM on the system, even I will be content to waste half of it. :D
Well, we are almost there - the Unicode BMP is enough for most situations. Usually wchar_t is only 16 bits, which somewhat reflects this (if we encode the text using UTF-16, a wchar_t does not always represent a character).

So if you are satisfied with only 16 bits, why not simply use the layout of these code points (0-65535, excluding the surrogate range) as specified by Unicode?

As for encoding Unicode data using UTF-8, it addresses an important point that people seem to ignore: data transmission, as well as some storage devices, is octet oriented. UTF-8 has no need to specify endianness. And as for variable-length encoding, I assume that you dismiss all kinds of compressed data formats with the same argument :wink:

OTOH, what need does your kernel have to process Japanese text?