windows research kernel

Discussions on more advanced topics such as monolithic vs micro-kernels, transactional memory models, and paging vs segmentation should go here. Use this forum to expand and improve the wiki!
qw
Member
Posts: 792
Joined: Mon Jan 26, 2009 2:48 am

Re: windows research kernel

Post by qw »

berkus wrote:Hereby I declare the trollfest open!
The flamefest was already starting anyway, so who cares.
Combuster
Member
Posts: 9301
Joined: Wed Oct 18, 2006 3:45 am
Libera.chat IRC: [com]buster
Location: On the balcony, where I can actually keep 1½m distance
Contact:

Re: windows research kernel

Post by Combuster »

The trollfest imo started before I even mentioned "in rdos' world"...

Anyway,
berkus wrote:- can you figure that out for ASCIIZ string?
No, you didn't say how many newlines and other non-character bytes there are. :mrgreen:
- can you figure that out of wchar_t UCS-2 string?
No, I have not been informed of the number of bytepairs that are classified as combining characters, since they form one character with the following bytepair. (now, what was your point again? :twisted:)
"Certainly avoid yourself. He is a newbie and might not realize it. You'll hate his code deeply a few years down the road." - Sortie
[ My OS ] [ VDisk/SFS ]
Brynet-Inc
Member
Posts: 2426
Joined: Tue Oct 17, 2006 9:29 pm
Libera.chat IRC: brynet
Location: Canada
Contact:

Re: windows research kernel

Post by Brynet-Inc »

berkus wrote:My point was that I consider myself knowledgeable in character representations and I find it highly amusing when amateurs start arguing about it without knowing a half-bit on the subject. And you just got some of rdos' points for distinguishing between characters and other code points. (Knowing what some combining pairs are is important for rendering - i.e. you will have issues with calculating string's bounding rect without knowing this, but it also may be important to know where the code point boundaries are or how many of them are present.) Also reducing to ASCII only means you're leaving out about 90% of the Earth population based on language-specific glyphs only (and even more if you start counting service code points, non-breakable zero-width spaces or accents and the such). Not that it would matter for a hobby OS but this is a very bad excuse to me.
As you said, that project won't ever have 10% of the userbase... there is absolutely nothing wrong with supporting only ASCII and letting the rest of the world suffer; they already do.

Let's call it the anti-"chicken scratch" revolution!
Twitter: @canadianbryan. Award by smcerm, I stole it. Original was larger.
thepowersgang
Member
Posts: 734
Joined: Tue Dec 25, 2007 6:03 am
Libera.chat IRC: thePowersGang
Location: Perth, Western Australia
Contact:

Re: windows research kernel

Post by thepowersgang »

Just to put my own two cents in: unless I have gravely misunderstood how UTF-8 works, it is the best format for almost every application (although a little inefficient if the character set is non-European).

As berkus pointed out in his question, it is difficult to split a string between characters (not bytes), but in most cases (especially the example of splitting paths) you have already determined the location of the '/' character, and can use the byte offset instead of the character position.

UTF-8 also has the advantage that all extended (non-ASCII) characters are encoded as sequences of byte values of 0x80 and above, so they can never unintentionally match an ASCII character.

Sure, if you want to get the first 'n' characters (for printing) you will have a bit more of an issue, but that should only really be needed for presentation (where you already have to parse the string anyway).
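The byte-level path splitting described above can be sketched in a few lines. Because every byte of a UTF-8 multi-byte sequence has its high bit set (0x80 and above), the ASCII '/' (0x2F) can never occur inside an encoded non-ASCII character, so a plain byte search is safe. A minimal illustration (the function name is my own, not from any particular library):

```c
#include <string.h>

/* Return the final component of a UTF-8 path.
 * A plain byte search with strrchr is safe here: every byte of a
 * UTF-8 multi-byte sequence is >= 0x80, so the ASCII separator
 * '/' (0x2F) can never appear inside an encoded character. */
const char *utf8_basename(const char *path)
{
    const char *slash = strrchr(path, '/');
    return slash ? slash + 1 : path;
}
```

For example, utf8_basename("/usr/h\xc3\xa9llo/file.txt") yields "file.txt" even though the path contains a two-byte sequence for 'é'; no character-level decoding is needed.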
Kernel Development, It's the brain surgery of programming.
Acess2 OS (c) | Tifflin OS (rust) | mrustc - Rust compiler
Currently Working on: mrustc
Combuster
Member
Posts: 9301
Joined: Wed Oct 18, 2006 3:45 am
Libera.chat IRC: [com]buster
Location: On the balcony, where I can actually keep 1½m distance
Contact:

Re: windows research kernel

Post by Combuster »

berkus wrote:And you just got some of rdos' points for distinguishing between characters and other code points.
TBH, I almost didn't want to enter the discussion because I technically had to choose his side and didn't want to boost his ego, but that wouldn't be fair for a balanced discussion.

Anyway, I think it's safe to say that 16-bit-only unicode would nowadays qualify as a design error.
"Certainly avoid yourself. He is a newbie and might not realize it. You'll hate his code deeply a few years down the road." - Sortie
[ My OS ] [ VDisk/SFS ]
DavidCooper
Member
Posts: 1150
Joined: Wed Oct 27, 2010 4:53 pm
Location: Scotland

Re: windows research kernel

Post by DavidCooper »

The future will probably be something new with morpheme codes - that's what I'm working on: a two-byte code to identify the language, followed by a string of two-byte codes representing the semantic components of words. It's the most sensible way to go if you want the machine to understand efficiently all the text it's working with.
Help the people of Laos by liking - https://www.facebook.com/TheSBInitiative/?ref=py_c

MSB-OS: http://www.magicschoolbook.com/computing/os-project - direct machine code programming
OSwhatever
Member
Posts: 595
Joined: Mon Jul 05, 2010 4:15 pm

Re: windows research kernel

Post by OSwhatever »

UTF-8 is really the work of a genius.

- ASCII compatibility
- You can use many of the old standard C functions
- Byte order insensitive
- Resynchronization possible
- You can keep the old APIs intact

UTF-8 is here to stay, and most people writing OSes today will choose UTF-8 for obvious reasons.

The only drawback is that randomly indexing characters is slow. You can always convert to UTF-32 internally in that case. I haven't encountered any case where I have to do this, though. Perhaps if you write word processors/editors or something similar.
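To make the "random indexing is slow" point concrete: finding the n-th character in UTF-8 requires a linear scan, because each code point takes one to four bytes. Continuation bytes all match the bit pattern 10xxxxxx, so counting code points amounts to skipping them. A minimal sketch (the function name is illustrative, not from any library):

```c
#include <stddef.h>

/* Count code points in a NUL-terminated UTF-8 string.
 * UTF-8 continuation bytes have the form 10xxxxxx; every byte
 * that does NOT match that pattern starts a new code point.
 * The scan is O(n) in the byte length, which is exactly why
 * random character indexing in UTF-8 is slow. */
size_t utf8_length(const char *s)
{
    size_t n = 0;
    for (; *s; s++)
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    return n;
}
```

This same skip-continuation-bytes test is also what makes resynchronization possible: from any byte you can scan backwards or forwards to the nearest code point boundary.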
Owen
Member
Posts: 1700
Joined: Fri Jun 13, 2008 3:21 pm
Location: Cambridge, United Kingdom
Contact:

Re: windows research kernel

Post by Owen »

OSwhatever wrote:UTF-8 is really the work of a genius.

OSwhatever wrote:ASCII compatibility
Granted.
OSwhatever wrote:You can use many of the old standard C functions
This pretends that the wide-string versions don't exist.
OSwhatever wrote:Byte order insensitive
Byte order marks, or external signalling.
OSwhatever wrote:Resynchronization possible
This presupposes that resynchronization is an issue.
OSwhatever wrote:You can keep the old APIs intact
You're designing a new OS. What old APIs?
OSwhatever wrote:UTF-8 is here to stay and most people writing OSes today will choose UTF-8 for obvious reasons.
UTF-8 is a fine on-disk format for many things. However, when actually processing text (rather than passing it around as an opaque blob of characters), UTF-16 or UCS-4 is more useful (and UCS-4 has massive memory and memory-bandwidth overhead, so UTF-16 is preferable).
OSwhatever wrote:The only drawback is that randomly indexing characters is slow. You can always convert to UTF-32 internally in that case. I haven't encountered any case where I have to do this, though. Perhaps if you write word processors/editors or something similar.
Not just random indexing: any processing which involves converting the characters to scalar values. UTF-8 decoding code is very branchy; UTF-16 code is much less so.

Also: C's wide-string support is abysmal. I look forward to the basic but decent support provided by C1X. One of the biggest problems with C's wide-character support is that it requires UCS-4 (a wchar_t must hold one character). Yes, Microsoft's implementation is non-conformant; this is a historical accident (it was rendered non-conformant by a Unicode update!).

Also, note the difference between UCS-2 and UTF-16. The former is strictly two bytes per character and encodes only the basic multilingual plane; the latter is either two or four bytes per character.
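The UCS-2/UTF-16 distinction comes down to surrogate pairs: code points beyond U+FFFF are represented as two 16-bit units. A hedged sketch of the decoding step (the function name is my own, not from any library):

```c
#include <stdint.h>

/* Decode one Unicode scalar value from a UTF-16 unit sequence.
 * UCS-2 stops at the BMP (one unit per character); UTF-16 adds
 * surrogate pairs: a high surrogate (0xD800-0xDBFF) followed by
 * a low surrogate (0xDC00-0xDFFF) encodes U+10000..U+10FFFF. */
uint32_t utf16_decode(const uint16_t *u, int *units_used)
{
    if (u[0] >= 0xD800 && u[0] <= 0xDBFF) {
        *units_used = 2;
        return 0x10000u
             + (((uint32_t)(u[0] - 0xD800) << 10) | (u[1] - 0xDC00));
    }
    *units_used = 1;   /* BMP character: identical to UCS-2 */
    return u[0];
}
```

Note the single branch per character here, versus the three-way lead-byte dispatch UTF-8 decoding needs; that is the "much less branchy" point above.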
UX
Posts: 15
Joined: Sun May 08, 2011 2:21 pm

Re: windows research kernel

Post by UX »

Thomas wrote:
Happy to note that there are a lot of OpenVMS deployments in Europe!

---Thomas

That system is not made for humans! :)
I remember having great difficulty understanding the FS layout and logical names... but yeah, it's very stable and still alive, contrary to what was predicted years ago.
rdos
Member
Posts: 3276
Joined: Wed Oct 01, 2008 1:55 pm

Re: windows research kernel

Post by rdos »

OSwhatever wrote:Only drawback is that random indexing characters is slow. You can always internally convert to UTF32 in that case. I haven't encountered any case where I have to do this though. Perhaps if you write word processors/editors or something similar.
That can easily be solved in a C++ string class: just allocate an extra index array for the string when the user tries to index it. On the first access the index vector is created, and subsequent accesses work just as fast as UCS-4.
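A rough sketch of the lazily built index rdos describes, assuming a plain array holding the byte offset of every code point, built on first indexed access (all names are hypothetical, not from any real library):

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical sketch of rdos' scheme: keep the string as UTF-8,
 * and on the first indexed access build an array of the byte
 * offset of every code point, so later accesses are O(1). */
struct u8str {
    const char *bytes;     /* the UTF-8 data */
    size_t     *offsets;   /* built lazily; NULL until first index */
    size_t      ncp;       /* number of code points */
};

static void u8str_build_index(struct u8str *s)
{
    size_t len = strlen(s->bytes), n = 0;
    s->offsets = malloc((len ? len : 1) * sizeof *s->offsets);
    for (size_t i = 0; i < len; i++)
        if (((unsigned char)s->bytes[i] & 0xC0) != 0x80)
            s->offsets[n++] = i;   /* byte where a code point starts */
    s->ncp = n;
}

/* Byte offset of the i-th code point; builds the index on first use. */
size_t u8str_offset(struct u8str *s, size_t i)
{
    if (!s->offsets)
        u8str_build_index(s);
    return s->offsets[i];
}
```

As berkus points out further down the thread, the offset array costs up to a machine word per code point on top of the UTF-8 bytes themselves, so this is a time/space trade-off rather than a free win.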
eddyb
Member
Posts: 248
Joined: Fri Aug 01, 2008 7:52 am

Re: windows research kernel

Post by eddyb »

berkus wrote:
rdos wrote:That can easily be solved in a C++ string class. Just allocate an extra index array with for string when the user tries to index. At the first access the index vector is created, and on subsequent accesses it works just as fast as UCS-4.
You're kidding me.
That was my reaction too, at first, but if it were done well, it would have a smaller memory footprint and take less time to "convert" from UTF-8 (as you're just storing offsets).
</rant>
Also, wouldn't it be nice if we could write u8"abcd ăâ€îo§ș"s in C++ and get a nice std::string or an equivalent (with complete UTF-8 capabilities)?
Rusky
Member
Posts: 792
Joined: Wed Jan 06, 2010 7:07 pm

Re: windows research kernel

Post by Rusky »

UTF-8 looks like the best format in most cases to me, mostly for the reasons OSwhatever gave. In most string processing you have to go through character by character anyway, and the convenience of being 100% compatible with ASCII is enormous. There's no need for byte order marks; with UTF-16 you lose those advantages but keep the problems of variable-size characters.

Of course, convert to a wider, fixed-width format internally when it's more convenient (which appears to be rarely), but there's no need to make that the default.
eddyb
Member
Member
Posts: 248
Joined: Fri Aug 01, 2008 7:52 am

Re: windows research kernel

Post by eddyb »

berkus wrote:
eddyb wrote:
berkus wrote:You're kidding me.
That was my reaction, too, at first, but if it'd be done well, it would have smaller memory footprint and it would take less time to "convert" from UTF-8 (as you're just storing offsets).
No, it will take a lot more (4 bytes + 1 to 6 bytes per character, this is 5-10 bytes per character, compared to plain UCS-4 with fixed 4 bytes per character), it will also be a lot slower (more potentially disjoint memory accesses). I double dare you to invent a "well" implementation of what rdos proposed to be at least on par with plain UCS-4 in speed/space.

(Remember that C strings are potentially allowed to be up to 2^(address space size) bytes in size, so on x86-64 your "indexed UTF-8" will take 9-14 bytes per character).
Oops... I see now what I've underestimated... You're right, and there are ways to minimize the space used, but they still require some kind of encoding, potentially making it even slower than decoding UTF-8. /facepalm
rdos
Member
Posts: 3276
Joined: Wed Oct 01, 2008 1:55 pm

Re: windows research kernel

Post by rdos »

berkus wrote:No, it will take a lot more (4 bytes + 1 to 6 bytes per character, this is 5-10 bytes per character, compared to plain UCS-4 with fixed 4 bytes per character), it will also be a lot slower (more potentially disjoint memory accesses). I double dare you to invent a "well" implementation of what rdos proposed to be at least on par with plain UCS-4 in speed/space.
You forgot that most software is not interested in exact character boundaries, and will not do character indexing.

Besides, the indexes could use 1, 2 or 4 bytes depending on string size. That means 2 to 7 bytes per character for strings up to 256 bytes. For a typical English-language string (up to 256 bytes) it means 2 bytes per character, compared to 4 for UCS-4.
rdos
Member
Posts: 3276
Joined: Wed Oct 01, 2008 1:55 pm

Re: windows research kernel

Post by rdos »

berkus wrote:You forget that for typical english you do not need a separate index at all. So your code becomes a convoluted mess of flags, indexes and field bit widths just to "save" on those pesky UCS-4 extra bytes. Good luck!
Whether it is a mess or not makes little difference, as the mess is hidden away in the class library. Buffer allocation, string concatenation and reference counting are also messes hidden away in the class library so users don't need to bother with those details. The same could be done for character referencing. Even better, if UCS-4 proves to be better, it could be used internally, while hiding this (awful) detail inside the class library and never importing/exporting UCS-4 strings. But I doubt this is beneficial, as all incoming strings would then need to be converted to UCS-4. With the internal UTF-8 representation, conversion would only be done when character indexing is needed.