OSDev.org

Posted: **Wed Jun 01, 2011 12:09 pm**

berkus wrote:Hereby I declare the trollfest open!

Flamefest was already starting so who bothers.

Posted: **Wed Jun 01, 2011 4:34 pm**

The trollfest imo started before I even mentioned "in rdos' world"...

Anyway,

berkus wrote:- can you figure that out for ASCIIZ string?

No, you didn't say how many newlines and other non-character bytes there are.

- can you figure that out of wchar_t UCS-2 string?

No, I have not been informed of the number of byte pairs that are classified as combining characters, since they form one character with the preceding byte pair. (Now, what was your point again?)

Posted: **Wed Jun 01, 2011 7:59 pm**

berkus wrote:My point was that I consider myself knowledgeable in character representations and I find it highly amusing when amateurs start arguing about it without knowing a half-bit on the subject. And you just got some of rdos' points for distinguishing between characters and other code points. (Knowing what some combining pairs are is important for rendering - i.e. you will have issues with calculating string's bounding rect without knowing this, but it also may be important to know where the code point boundaries are or how many of them are present.) Also reducing to ASCII only means you're leaving out about 90% of the Earth population based on language-specific glyphs only (and even more if you start counting service code points, non-breakable zero-width spaces or accents and the such). Not that it would matter for a hobby OS but this is a very bad excuse to me.

As you said, that project won't ever have 10% of the userbase.. there is absolutely nothing wrong with supporting only ASCII and letting the rest of the world suffer, they already do.

Let's call it the anti-"chicken scratch" revolution!

Posted: **Wed Jun 01, 2011 8:31 pm**

Just to put my own two cents in. Unless I have gravely misunderstood how UTF-8 works, it is the best format for almost every application (although, a little inefficient if the character set is non-European)

As berkus pointed out in his question, it is difficult to split a string between characters (not bytes), but in most cases (especially the example of splitting paths), you have determined the location of the '/' character beforehand, and can use the byte location instead of character position.

UTF-8 also has the advantage that all extended (non-ASCII) characters are sequences of byte values >= 128, so they won't unintentionally match ASCII characters.
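
A minimal C++ sketch of that point, assuming the path is well-formed UTF-8: since every byte of a multi-byte sequence is 0x80 or above, a plain byte search for '/' (0x2F) can never land inside a character. The name `split_path` is made up for illustration.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Split a UTF-8 path on '/' by scanning bytes only.
// No decoding is needed: the byte 0x2F never occurs inside a
// multi-byte UTF-8 sequence, so every match is a real '/'.
std::vector<std::string> split_path(const std::string& path) {
    std::vector<std::string> parts;
    std::size_t start = 0;
    while (start < path.size()) {
        std::size_t slash = path.find('/', start);
        if (slash == std::string::npos) slash = path.size();
        if (slash > start)  // skip empty components such as a leading '/'
            parts.push_back(path.substr(start, slash - start));
        start = slash + 1;
    }
    return parts;
}
```

Each component comes back as the original bytes, so non-ASCII names pass through untouched.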

Sure, if you want to get the first 'n' characters (for printing), you will have a bit more of an issue, but that should only really need to be done for presentation (where you already need to parse the string anyway)
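
For the "first n characters" case, a rough C++ sketch, assuming well-formed UTF-8 and taking "character" to mean code point (combining sequences, as raised earlier in the thread, would need more than this). `utf8_prefix_bytes` is a hypothetical helper name.

```cpp
#include <cstddef>
#include <string>

// Return how many bytes the first n code points of a UTF-8 string occupy
// (or the whole length if the string holds fewer than n code points).
std::size_t utf8_prefix_bytes(const std::string& s, std::size_t n) {
    std::size_t i = 0;
    while (i < s.size() && n > 0) {
        ++i;  // consume the lead byte
        // consume continuation bytes, which always look like 10xxxxxx
        while (i < s.size() &&
               (static_cast<unsigned char>(s[i]) & 0xC0) == 0x80)
            ++i;
        --n;
    }
    return i;
}
```

Something like `s.substr(0, utf8_prefix_bytes(s, n))` would then cut the string on a code point boundary.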

Posted: **Thu Jun 02, 2011 3:03 am**

And you just got some of rdos' points for distinguishing between characters and other code points.

TBH, I almost didn't want to enter the discussion because I technically had to choose his side and didn't want to boost his ego, but that wouldn't be fair for a balanced discussion.

Anyway, I think it's safe to say that 16-bit-only unicode would nowadays qualify as a design error.

Posted: **Thu Jun 02, 2011 1:58 pm**

The future will probably be something new with morpheme codes - that's what I'm working on. Two-byte code to identify the language followed by a string of two-byte codes representing semantic components of words. It's the most sensible way to go in if you want the machine to understand efficiently all the text it's working with.

Posted: **Thu Jun 02, 2011 4:37 pm**

UTF8 is really work of a genius.

ASCII compatibility
You can use many of the old standard C functions
Byte order insensitive
Resynchronization possible
You can keep the old APIs intact

UTF8 is here to stay and most people writing OSes today will choose UTF8 for obvious reasons.

Only drawback is that random indexing characters is slow. You can always internally convert to UTF32 in that case. I haven't encountered any case where I have to do this though. Perhaps if you write word processors/editors or something similar.
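
The "resynchronization possible" item in the list above can be illustrated with a short C++ sketch, assuming well-formed UTF-8: from any byte offset, stepping back over continuation bytes finds the start of the enclosing code point. The function names here are invented for the example.

```cpp
#include <cstddef>
#include <string>

// True for UTF-8 continuation bytes, which always look like 10xxxxxx.
inline bool is_continuation(char c) {
    return (static_cast<unsigned char>(c) & 0xC0) == 0x80;
}

// Move pos back to the first byte of the code point that contains it.
// An offset at or past the end is clamped to s.size().
std::size_t utf8_sync_back(const std::string& s, std::size_t pos) {
    if (pos >= s.size()) return s.size();
    while (pos > 0 && is_continuation(s[pos])) --pos;
    return pos;
}
```

In valid UTF-8 the loop steps back at most three bytes, so a reader dropped into the middle of a stream recovers quickly.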

Posted: **Thu Jun 02, 2011 5:58 pm**

OSwhatever wrote:UTF8 is really work of a genius.

ASCII compatibility

Granted

You can use many of the old standard C functions

This pretends that the wide string versions don't exist

Byte order insensitive

Byte order marks, or external signaling

Resynchronization possible

This presupposes that resynchronization is an issue

You can keep the old APIs intact

You're designing a new OS. What old APIs?

UTF8 is here to stay and most people writing OSes today will choose UTF8 for obvious reasons.

UTF-8 is a fine on-disk format for many things. However, when actually processing text as text (rather than as an opaque blob of characters), UTF-16 or UCS-4 are more useful (and UCS-4 has massive memory and memory bandwidth overhead, so UTF-16 is preferable).

Only drawback is that random indexing characters is slow. You can always internally convert to UTF32 in that case. I haven't encountered any case where I have to do this though. Perhaps if you write word processors/editors or something similar.

Not just random indexing. Any processing which involves converting the characters to scalar values. UTF-8 decoding code is very branchy; UTF-16 code is much less so.
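
To make the "branchy" remark concrete, here is a rough C++ sketch of decoding one code point from UTF-8. It is an illustration only: `decode_utf8` is a hypothetical name, it handles the one- to four-byte forms, and it does not reject overlong encodings or encoded surrogates as a strict decoder would.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Decode the code point starting at s[i] and advance i past it.
// Returns U+FFFD on a malformed lead byte or a missing continuation byte.
char32_t decode_utf8(const std::string& s, std::size_t& i) {
    assert(i < s.size());
    unsigned char b0 = static_cast<unsigned char>(s[i++]);
    if (b0 < 0x80) return b0;                        // 1 byte: ASCII

    int extra;
    char32_t cp;
    if ((b0 & 0xE0) == 0xC0)      { extra = 1; cp = b0 & 0x1F; }  // 2 bytes
    else if ((b0 & 0xF0) == 0xE0) { extra = 2; cp = b0 & 0x0F; }  // 3 bytes
    else if ((b0 & 0xF8) == 0xF0) { extra = 3; cp = b0 & 0x07; }  // 4 bytes
    else return 0xFFFD;                              // stray continuation etc.

    for (; extra > 0; --extra) {
        if (i >= s.size()) return 0xFFFD;
        unsigned char b = static_cast<unsigned char>(s[i]);
        if ((b & 0xC0) != 0x80) return 0xFFFD;       // not 10xxxxxx
        cp = (cp << 6) | (b & 0x3F);
        ++i;
    }
    return cp;
}
```

The lead byte alone selects among four cases before any continuation byte is examined.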

Also: C's wide string support is abysmal. I look forward to the basic but decent support provided by C1X. One of C's wide char support's biggest problems is that it requires UCS-4 (A wchar_t must be one character). Yes, Microsoft's implementation is non-conformant; this is a historical accident (It was rendered non-conformant by a Unicode update!)

Also, note the difference between UCS-2 and UTF-16. The former is strictly two bytes per character, and encodes only the Basic Multilingual Plane. The latter uses either two or four bytes per character, with a surrogate pair for anything outside that plane.
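
The UCS-2 versus UTF-16 distinction shows up directly in a decoder. The following C++ sketch (with the made-up name `decode_utf16`) returns single units unchanged, which is all UCS-2 can express, and combines a high and a low surrogate into one code point above U+FFFF.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Decode the code point starting at s[i] and advance i past it.
// Returns U+FFFD for an unpaired surrogate.
char32_t decode_utf16(const std::u16string& s, std::size_t& i) {
    assert(i < s.size());
    char32_t hi = s[i++];
    if (hi < 0xD800 || hi > 0xDFFF) return hi;   // BMP: one 16-bit unit

    if (hi <= 0xDBFF && i < s.size()) {          // high surrogate, needs a partner
        char32_t lo = s[i];
        if (lo >= 0xDC00 && lo <= 0xDFFF) {      // low surrogate
            ++i;
            return 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00);
        }
    }
    return 0xFFFD;                               // unpaired surrogate
}
```

Code that treats such a string as UCS-2 would see the two surrogate units as two separate "characters".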

Posted: **Fri Jun 03, 2011 5:01 am**

Thomas wrote:
Happy to note that there are a lot of OpenVMS deployments in Europe!

---Thomas

That system is not made for humans!

I remember having great difficulty understanding the FS layout and logical names... but yeah, it's very stable and still alive, contrary to what was predicted years ago.

Posted: **Sat Jun 04, 2011 10:43 am**

OSwhatever wrote:Only drawback is that random indexing characters is slow. You can always internally convert to UTF32 in that case. I haven't encountered any case where I have to do this though. Perhaps if you write word processors/editors or something similar.

That can easily be solved in a C++ string class. Just allocate an extra index array for the string when the user tries to index. At the first access the index vector is created, and on subsequent accesses it works just as fast as UCS-4.
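
One possible reading of this proposal, sketched in C++ purely for illustration: the class name `IndexedUtf8`, its members, and the fixed 32-bit offsets are assumptions, not rdos's actual code. It assumes well-formed UTF-8 shorter than 4 GiB and indexes code points rather than user-perceived characters.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

class IndexedUtf8 {
public:
    explicit IndexedUtf8(std::string bytes) : bytes_(std::move(bytes)) {}

    // Number of code points; builds the offset index on first use.
    std::size_t length() const { build(); return offsets_.size() - 1; }

    // Bytes of the i-th code point, O(1) once the index exists.
    std::string at(std::size_t i) const {
        build();
        assert(i + 1 < offsets_.size());
        return bytes_.substr(offsets_[i], offsets_[i + 1] - offsets_[i]);
    }

private:
    // Record the byte offset of every lead byte, plus an end sentinel.
    void build() const {
        if (!offsets_.empty()) return;  // already built (sentinel is always there)
        for (std::size_t i = 0; i < bytes_.size(); ++i)
            if ((static_cast<unsigned char>(bytes_[i]) & 0xC0) != 0x80)
                offsets_.push_back(static_cast<std::uint32_t>(i));
        offsets_.push_back(static_cast<std::uint32_t>(bytes_.size()));
    }

    std::string bytes_;
    mutable std::vector<std::uint32_t> offsets_;  // empty until first indexing
};
```

Strings that are never indexed pay nothing beyond an empty vector; the first call to `length()` or `at()` does one linear pass.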

Posted: **Sat Jun 04, 2011 12:46 pm**

berkus wrote:
rdos wrote:That can easily be solved in a C++ string class. Just allocate an extra index array for the string when the user tries to index. At the first access the index vector is created, and on subsequent accesses it works just as fast as UCS-4.
You're kidding me.

That was my reaction, too, at first, but if it'd be done well, it would have smaller memory footprint and it would take less time to "convert" from UTF-8 (as you're just storing offsets).
</rant>
Also, wouldn't it be nice if we could write u8"abcd ăâ€îo§ș"s in C++, and get a nice std::string or an equivalent (with complete UTF-8 capabilities)?

Posted: **Sat Jun 04, 2011 1:05 pm**

UTF-8 looks like the best format in most cases to me, mostly for the reasons OSwhatever gave. In most string processing you have to go through character by character anyway, and the convenience of being 100% compatible with ASCII is enormous. There's no need for byte order marks, and with UTF-16 you lose that but keep the problems of variable-size characters.

Of course, convert to a wider, fixed-width format internally when it's more convenient (which appears to be rarely), but there's no need to make that the default.

Posted: **Sat Jun 04, 2011 1:49 pm**

berkus wrote:
eddyb wrote:
berkus wrote:You're kidding me.
That was my reaction, too, at first, but if it'd be done well, it would have smaller memory footprint and it would take less time to "convert" from UTF-8 (as you're just storing offsets).
No, it will take a lot more (4 bytes + 1 to 6 bytes per character, this is 5-10 bytes per character, compared to plain UCS-4 with fixed 4 bytes per character), it will also be a lot slower (more potentially disjoint memory accesses). I double dare you to invent a "well" implementation of what rdos proposed to be at least on par with plain UCS-4 in speed/space.

(Remember that C strings are potentially allowed to be up to 2^(address space size) bytes in size, so on x86-64 your "indexed UTF-8" will take 9-14 bytes per character).

Oops... I see now what I've underestimated... You're right, and there are ways to minimize the space used, but they still require some kind of encoding, potentially making it even slower than decoding UTF-8. /facepalm

Posted: **Sat Jun 04, 2011 11:49 pm**

berkus wrote:No, it will take a lot more (4 bytes + 1 to 6 bytes per character, this is 5-10 bytes per character, compared to plain UCS-4 with fixed 4 bytes per character), it will also be a lot slower (more potentially disjoint memory accesses). I double dare you to invent a "well" implementation of what rdos proposed to be at least on par with plain UCS-4 in speed/space.

You forgot that most software is not interested in exact character marks, and will not do character indexing.

Besides, indexes could use 1, 2 or 4 bytes depending on string size. That means 2 to 7 bytes per character for strings up to 256 bytes. For a typical English-language string (up to 256 bytes) it means 2 bytes per character compared to 4 for UCS-4.
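
The space argument on both sides can be put into numbers with a small C++ sketch. The helper names are hypothetical, and the model is deliberately simple: one offset per code point, no allocator or object overhead.

```cpp
#include <cstddef>

// Smallest offset width in bytes (1, 2, 4 or 8) that can address every
// byte position of a string with the given byte length.
constexpr std::size_t index_width(std::size_t byte_len) {
    return byte_len <= 0x100 ? 1
         : byte_len <= 0x10000 ? 2
         : byte_len <= 0x100000000ULL ? 4 : 8;
}

// Total footprint of UTF-8 data plus one offset per code point.
constexpr std::size_t indexed_utf8_bytes(std::size_t byte_len,
                                         std::size_t code_points) {
    return byte_len + code_points * index_width(byte_len);
}

// Footprint of the same text stored as UCS-4.
constexpr std::size_t ucs4_bytes(std::size_t code_points) {
    return 4 * code_points;
}
```

Under this model, 100 ASCII characters cost 200 bytes indexed versus 400 in UCS-4, while 100 three-byte characters (300 bytes of UTF-8, so 2-byte offsets) cost 500 bytes versus 400. Which side wins depends on the text.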

Posted: **Sun Jun 05, 2011 6:34 am**

berkus wrote:You forget that for typical English you do not need a separate index at all. So your code becomes a convoluted mess of flags, indexes and field bit widths just to "save" on those pesky UCS-4 extra bytes. Good luck!

Whether it is a mess or not makes little difference, as the mess is hidden away in the class library. Buffer allocation, string concatenation and reference counting are also messes hidden away in the class library so users don't need to bother with those details. The same could be done for character referencing. Even better, if UCS-4 proves to be better, it could be used internally, while hiding this (awful) detail inside the class library, and never importing/exporting UCS-4 strings. But I doubt this is beneficial, as all incoming strings would then need to be converted to UCS-4. With a UTF-8 internal representation, conversion would only be done when character indexing is needed.

windows research kernel
