Re: windows research kernel
Posted: Wed Jun 01, 2011 12:09 pm
berkus wrote: Hereby I declare the trollfest open!
Flamefest was already starting so who bothers.
berkus wrote: - can you figure that out for ASCIIZ string?
No, you didn't say how many newlines and other non-character bytes there are.
berkus wrote: - can you figure that out of wchar_t UCS-2 string?
No, I have not been informed of the number of bytepairs that are classified as combining characters, since they form one character with the following bytepair. (Now, what was your point again?)
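To make the distinction above concrete, here is a minimal C++ sketch of counting code units versus counting "characters" in a UTF-16 string. The helper names (`code_units`, `is_combining_mark`, `base_characters`) are made up for illustration, and the combining test only covers the Combining Diacritical Marks block U+0300..U+036F, which is an assumption to keep the example short; real text needs the full Unicode tables. Note that in Unicode a combining mark follows the base character it modifies.

```cpp
#include <cstddef>
#include <string>

// Number of UTF-16 code units, i.e. what a wcslen-style function reports.
std::size_t code_units(const std::u16string& s) { return s.size(); }

// Hypothetical helper: true only for U+0300..U+036F (Combining Diacritical
// Marks). This single block is a simplification, not a full Unicode check.
bool is_combining_mark(char16_t u) { return u >= 0x0300 && u <= 0x036F; }

// Count code units that are not combining marks under that assumption.
// Surrogate pairs are ignored here, as they would be in plain UCS-2.
std::size_t base_characters(const std::u16string& s) {
    std::size_t n = 0;
    for (char16_t u : s)
        if (!is_combining_mark(u)) ++n;
    return n;
}
```

For the string "e" + U+0301 + "x", the two counts differ: three code units, but only two characters a user would see.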
berkus wrote: My point was that I consider myself knowledgeable in character representations and I find it highly amusing when amateurs start arguing about it without knowing a half-bit on the subject. And you just got some of rdos' points for distinguishing between characters and other code points. (Knowing what some combining pairs are is important for rendering, i.e. you will have issues with calculating a string's bounding rect without knowing this, but it also may be important to know where the code point boundaries are or how many of them are present.) Also reducing to ASCII only means you're leaving out about 90% of the Earth population based on language-specific glyphs only (and even more if you start counting service code points, non-breakable zero-width spaces or accents and such). Not that it would matter for a hobby OS but this is a very bad excuse to me.
As you said, that project won't ever have 10% of the userbase. There is absolutely nothing wrong with supporting only ASCII and letting the rest of the world suffer, they already do.
berkus wrote: And you just got some of rdos' points for distinguishing between characters and other code points.
TBH, I almost didn't want to enter the discussion because I technically had to choose his side and didn't want to boost his ego, but that wouldn't be fair for a balanced discussion.
OSwhatever wrote: UTF8 is really work of a genius.
OSwhatever wrote: ASCII compatibility
Granted.
OSwhatever wrote: You can use many of the old standard C functions
This pretends that the wide string versions don't exist.
OSwhatever wrote: Byte order insensitive
Byte order marks, or external signaling.
OSwhatever wrote: Resynchronization possible
This presupposes that resynchronization is an issue.
OSwhatever wrote: You can keep the old APIs intact
You're designing a new OS. What old APIs?
OSwhatever wrote: UTF8 is here to stay and most people writing OSes today will choose UTF8 for obvious reasons.
UTF-8 is a fine on-disk format for many things. However, when actually processing it as text (rather than as an opaque blob of characters), UTF-16 or UCS-4 is more useful (and UCS-4 has massive memory and memory bandwidth overhead, so UTF-16 is preferable).
OSwhatever wrote: Only drawback is that random indexing characters is slow. You can always internally convert to UTF32 in that case. I haven't encountered any case where I have to do this though. Perhaps if you write word processors/editors or something similar.
Not just random indexing. Any processing which involves converting the characters to scalar values. UTF-8 decoding code is very branchy; UTF-16 code is much less so.
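The "branchy" point is easier to see side by side. Below is a rough C++ sketch of decoding one scalar value from each encoding. The function names `decode_utf8` and `decode_utf16` are made up for this example, and both assume well-formed input with no validation at all, so treat it as an illustration of the control flow rather than a usable decoder.

```cpp
#include <cstddef>
#include <cstdint>

// Decode one scalar value from well-formed UTF-8, advancing i.
// The lead byte selects one of four paths; validation is deliberately omitted.
char32_t decode_utf8(const unsigned char* s, std::size_t& i) {
    unsigned char b = s[i++];
    if (b < 0x80) return b;                        // 1-byte sequence (ASCII)
    if (b < 0xE0) {                                // 2-byte sequence
        char32_t c = (b & 0x1Fu) << 6;
        return c | (s[i++] & 0x3Fu);
    }
    if (b < 0xF0) {                                // 3-byte sequence
        char32_t c = (b & 0x0Fu) << 12;
        c |= (s[i++] & 0x3Fu) << 6;
        return c | (s[i++] & 0x3Fu);
    }
    char32_t c = (b & 0x07u) << 18;                // 4-byte sequence
    c |= (s[i++] & 0x3Fu) << 12;
    c |= (s[i++] & 0x3Fu) << 6;
    return c | (s[i++] & 0x3Fu);
}

// Decode one scalar value from well-formed UTF-16, advancing i.
// The only branch is the high-surrogate test.
char32_t decode_utf16(const char16_t* s, std::size_t& i) {
    char32_t u = s[i++];
    if (u >= 0xD800 && u <= 0xDBFF) {
        char32_t lo = s[i++];
        return 0x10000 + ((u - 0xD800) << 10) + (lo - 0xDC00);
    }
    return u;
}
```

In the UTF-16 version the surrogate branch is rarely taken for most text, while the UTF-8 version takes a different path depending on the script being decoded.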
OSwhatever wrote: Only drawback is that random indexing characters is slow. You can always internally convert to UTF32 in that case. I haven't encountered any case where I have to do this though. Perhaps if you write word processors/editors or something similar.
That can easily be solved in a C++ string class. Just allocate an extra index array for the string when the user tries to index. At the first access the index vector is created, and on subsequent accesses it works just as fast as UCS-4.
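For reference, one way the lazily indexed string described above might look in C++. This is only a sketch: the class name `IndexedUtf8` and its members are invented here, it assumes well-formed UTF-8, it stores one `std::size_t` offset per code point, and it is not thread-safe.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical class: UTF-8 storage plus an offset table built on first use.
class IndexedUtf8 {
public:
    explicit IndexedUtf8(std::string bytes) : bytes_(std::move(bytes)) {}

    // Number of code points (builds the index on first call).
    std::size_t length() const { build(); return offsets_.size(); }

    // Byte offset at which code point n starts; n must be < length().
    std::size_t offset_of(std::size_t n) const { build(); return offsets_[n]; }

    // The UTF-8 bytes of code point n; n must be < length().
    std::string at(std::size_t n) const {
        build();
        std::size_t begin = offsets_[n];
        std::size_t end = (n + 1 < offsets_.size()) ? offsets_[n + 1]
                                                    : bytes_.size();
        return bytes_.substr(begin, end - begin);
    }

private:
    // Scan once and record where each code point starts.
    void build() const {
        if (built_) return;
        for (std::size_t i = 0; i < bytes_.size(); ++i) {
            // Continuation bytes look like 10xxxxxx; any other byte
            // starts a new code point.
            if ((static_cast<unsigned char>(bytes_[i]) & 0xC0) != 0x80)
                offsets_.push_back(i);
        }
        built_ = true;
    }

    std::string bytes_;
    mutable std::vector<std::size_t> offsets_;
    mutable bool built_ = false;
};
```

Strings that are never indexed pay nothing beyond the UTF-8 bytes; the first call to `length()`, `offset_of()` or `at()` pays for one linear scan and for the offset table.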
rdos wrote: That can easily be solved in a C++ string class. Just allocate an extra index array for the string when the user tries to index. At the first access the index vector is created, and on subsequent accesses it works just as fast as UCS-4.
berkus wrote: You're kidding me.
That was my reaction, too, at first, but if it were done well, it would have a smaller memory footprint and it would take less time to "convert" from UTF-8 (as you're just storing offsets).
berkus wrote: You're kidding me.
eddyb wrote: That was my reaction, too, at first, but if it'd be done well, it would have smaller memory footprint and it would take less time to "convert" from UTF-8 (as you're just storing offsets).
berkus wrote: No, it will take a lot more (4 bytes + 1 to 6 bytes per character, this is 5-10 bytes per character, compared to plain UCS-4 with fixed 4 bytes per character), it will also be a lot slower (more potentially disjoint memory accesses). I double dare you to invent a "well" implementation of what rdos proposed to be at least on par with plain UCS-4 in speed/space. (Remember that C strings are potentially allowed to be up to 2^(address space size) bytes in size, so on x86-64 your "indexed UTF-8" will take 9-14 bytes per character).
Oops... I see now what I've underestimated... You're right, and there are ways to minimize the space used, but they still require some kind of encoding, potentially making it even slower than decoding UTF-8. /facepalm
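The per-character figures quoted above can be checked with a few lines of C++. The function names and parameters below are invented for illustration; the sketch assumes one offset per code point and uses the thread's 1 to 6 bytes for a UTF-8 sequence (the original UTF-8 definition; RFC 3629 later capped it at 4).

```cpp
#include <cstddef>

// Bytes used by plain UCS-4: a fixed 4 bytes per code point.
constexpr std::size_t ucs4_bytes(std::size_t code_points) {
    return 4 * code_points;
}

// Bytes used by UTF-8 text plus one offset per code point.
// utf8_bytes is the encoded length of the text; offset_width is the size
// of each index entry (4 or 8 bytes in the numbers from the thread).
constexpr std::size_t indexed_utf8_bytes(std::size_t code_points,
                                         std::size_t utf8_bytes,
                                         std::size_t offset_width) {
    return utf8_bytes + offset_width * code_points;
}

// One character with 4-byte offsets: 5 to 10 bytes.
static_assert(indexed_utf8_bytes(1, 1, 4) == 5, "shortest sequence");
static_assert(indexed_utf8_bytes(1, 6, 4) == 10, "longest sequence");
// One character with 8-byte offsets (x86-64): 9 to 14 bytes.
static_assert(indexed_utf8_bytes(1, 1, 8) == 9, "shortest sequence");
static_assert(indexed_utf8_bytes(1, 6, 8) == 14, "longest sequence");
```

Against the fixed 4 bytes of `ucs4_bytes(1)`, the indexed form is larger in every case once the index has been built.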
berkus wrote: No, it will take a lot more (4 bytes + 1 to 6 bytes per character, this is 5-10 bytes per character, compared to plain UCS-4 with fixed 4 bytes per character), it will also be a lot slower (more potentially disjoint memory accesses). I double dare you to invent a "well" implementation of what rdos proposed to be at least on par with plain UCS-4 in speed/space.
You forgot that most software is not interested in exact character marks, and will not do character indexing.
berkus wrote: You forget that for typical english you do not need a separate index at all. So your code becomes a convoluted mess of flags, indexes and field bit widths just to "save" on those pesky UCS-4 extra bytes. Good luck!
Whether it is a mess or not makes little difference, as the mess is hidden away in the class library. Buffer allocation, string concat and reference counting are also a mess hidden away in the class library so users don't need to bother with those details. The same could be done for character referencing. Even better, if UCS-4 proves to be better, it could be used internally, while hiding this (awful) detail inside the class library, and never importing/exporting UCS-4 strings. But I doubt this is beneficial, as all incoming strings would then need to be converted to UCS-4. With the UTF-8 internal representation, conversion would only be done when character indexing is needed.