bewing wrote: When the PC-XT/MSDOS was created, the industry had one, and only one, opportunity to completely re-engineer the character encoding to meet international and programming needs. IBM completely failed to do this.
In those days, every bit was still precious. Anyone who had suggested encoding each character with 16 bits back then would have been ridiculed and then shot.
For programming purposes, anything except fixed-width characters (in terms of storage, of course) ends up being so ugly as to be ridiculous.
I find this supposition ridiculous. When displaying text, you have to walk the string sequentially anyway, so it doesn't matter whether a variable-length encoding is used. And how often do you actually need to index into a string by character position? For text manipulation it really doesn't matter whether characters are fixed length, especially when, as with UTF-8, you can search reliably in both directions.
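To illustrate that last point (a minimal sketch, not from any particular codebase; the helper name utf8_prev is mine): every UTF-8 continuation byte matches 10xxxxxx, so from any character boundary you can step backwards to the previous one without decoding anything.

    #include <stddef.h>

    /* UTF-8 is self-synchronizing: continuation bytes all look like
     * 10xxxxxx (0x80..0xBF), so stepping back over them always lands
     * on the start of a character. */
    static size_t utf8_prev(const unsigned char *s, size_t pos)
    {
        do {
            pos--;                          /* back up at least one byte */
        } while (pos > 0 && (s[pos] & 0xC0) == 0x80);
        return pos;                         /* start of the previous character */
    }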
Of course, for those of us who choke at the idea of wasting half your computer's memory storing zeroes as every other byte of character data, there would be other problems.
UTF-16 is also a variable-length encoding, though many seem to forget that: anything outside the Basic Multilingual Plane takes a surrogate pair. For a truly fixed-length encoding you need UTF-32, wasting three bytes out of every four on Western text.
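For the record, here is roughly what each encoding form costs per code point (a sketch; the helper names are mine, and the thresholds come straight from the encoding definitions):

    /* Bytes needed to store one code point cp in each encoding form. */
    static int utf8_len(unsigned cp)  { return cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4; }
    static int utf16_len(unsigned cp) { return cp < 0x10000 ? 2 : 4; }  /* 4 = surrogate pair */
    static int utf32_len(unsigned cp) { (void)cp; return 4; }           /* always 4: the only fixed-width form */

For plain ASCII text, UTF-8 costs 1 byte per character where UTF-32 costs 4 -- those are the three wasted bytes.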
But UTF8 is just a really ugly patch on top of ASCII -- and ASCII was probably a mistake in the first place. Adding more UTFs does not make Unicode any better. As you say, it just devolves the "standard" further.
Bullshit. UTF-8 is not an 'ugly patch', it's a decent encoding. And what's wrong with ASCII? It's just a collection of the most-used US characters, in a deliberate, not-entirely-random order. And there's no need for more UTFs; we have all the UTFs we need.
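That order was deliberate, by the way. Two classic consequences (nothing kernel-specific, just illustration):

    /* Upper- and lowercase letters differ only in bit 0x20, and the
     * digits are contiguous starting at '0'. */
    char ascii_tolower(char c) { return (c >= 'A' && c <= 'Z') ? c | 0x20 : c; }
    int  digit_value(char c)   { return c - '0'; }   /* valid for '0'..'9' */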
The question is: is it possible at this stage to drop Unicode/ASCII and come up with something better? I would like to believe that it is, yes.
Unicode is imperfect in a few respects. In particular, having so many precomposed characters (e.g. Hangul, but also Western accented characters) was a bad decision, though an understandable one from a technical and pragmatic point of view. But the fact remains: there are tens of thousands of different characters out there, and you have to describe them somehow. Do you have a better idea than assigning each one a unique identifier?
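To show what I mean by precomposed (the byte values are easy to verify against the Unicode tables): 'é' can be stored either as one precomposed code point or as a base letter plus a combining accent, and the two byte sequences compare unequal even though they render identically.

    /* Two encodings of the same visible character, in UTF-8: */
    const unsigned char precomposed[] = { 0xC3, 0xA9 };        /* U+00E9 LATIN SMALL LETTER E WITH ACUTE */
    const unsigned char decomposed[]  = { 0x65, 0xCC, 0x81 };  /* U+0065 'e' + U+0301 COMBINING ACUTE ACCENT */
    /* memcmp() says they differ -- hence the need for normalization forms. */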
I do not want to implement something as poorly designed as UTF8 in my kernel.
UTF-8 is not poorly designed. It is a perfectly valid variable-length encoding. I dare you to come up with something better that is at the same time compact enough not to be a huge waste of space.
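To back that up, here is essentially the entire encoder, give or take validity checks for surrogates and out-of-range values (a sketch, not production code):

    /* Encode one code point as UTF-8; returns the number of bytes written. */
    static int utf8_encode(unsigned cp, unsigned char *out)
    {
        if (cp < 0x80)    { out[0] = cp; return 1; }            /* ASCII stays ASCII */
        if (cp < 0x800)   { out[0] = 0xC0 | (cp >> 6);
                            out[1] = 0x80 | (cp & 0x3F); return 2; }
        if (cp < 0x10000) { out[0] = 0xE0 | (cp >> 12);
                            out[1] = 0x80 | ((cp >> 6) & 0x3F);
                            out[2] = 0x80 | (cp & 0x3F); return 3; }
        out[0] = 0xF0 | (cp >> 18);
        out[1] = 0x80 | ((cp >> 12) & 0x3F);
        out[2] = 0x80 | ((cp >> 6) & 0x3F);
        out[3] = 0x80 | (cp & 0x3F);
        return 4;
    }

ASCII stays ASCII, no encoded byte is ever zero, and any single byte tells you whether you are at the start of a character. That is not an ugly patch.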
As I understand it, there are fewer than 70K languages on the planet, and the number is dropping by 3K a year.
There are about 6,900 languages, so you're off by a factor of 10. And the number is dropping by about 25 a year, so there you're off by a factor of 100. Please get your facts straight if you don't know what you're talking about!
Hopefully, within a century or two, this means that the pictographic languages will either die out or be simplified -- to the point where we can encode everything/anything in 16 bits.
You are so thoroughly misguided, it's hard to even know where to start to debunk this. Ok, I'll try:
1) The most extensively used pictographic writing system is of course Chinese/Japanese. Although most of the characters are within the 16-bit range, do you really think they will give up on the other characters in 200 years?
2) Most code points above 0xffff already belong to dead scripts (see the sketch after this list for what "above 0xffff" costs in UTF-16).
3) Most endangered/dying languages do not have a script.
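To make point 2 concrete (a sketch; the conversion is straight from the UTF-16 definition): everything above 0xffff falls outside the Basic Multilingual Plane and needs a surrogate pair in UTF-16.

    /* UTF-16 representation of a supplementary code point (cp > 0xFFFF). */
    static void utf16_surrogates(unsigned cp, unsigned short *hi, unsigned short *lo)
    {
        cp -= 0x10000;                                   /* 20 bits remain */
        *hi = (unsigned short)(0xD800 | (cp >> 10));     /* high surrogate */
        *lo = (unsigned short)(0xDC00 | (cp & 0x3FF));   /* low surrogate */
    }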
That's it for now. Next time, please refrain from making statements about things you have little knowledge of!
JAL