berkus wrote:Hereby I declare the trollfest open!
The flamefest was already starting, so who bothers.
windows research kernel
Re: windows research kernel
- Combuster
- Member
- Posts: 9301
- Joined: Wed Oct 18, 2006 3:45 am
- Libera.chat IRC: [com]buster
- Location: On the balcony, where I can actually keep 1½m distance
- Contact:
Re: windows research kernel
The trollfest imo started before I even mentioned "in rdos' world"...
Anyway,
berkus wrote:- can you figure that out for an ASCIIZ string?
No, you didn't say how many newlines and other non-character bytes there are.
berkus wrote:- can you figure that out for a wchar_t UCS-2 string?
No, I have not been informed of the number of byte pairs that are classified as combining characters, since they form one character with the following byte pair. (Now, what was your point again?)
- Brynet-Inc
- Member
- Posts: 2426
- Joined: Tue Oct 17, 2006 9:29 pm
- Libera.chat IRC: brynet
- Location: Canada
- Contact:
Re: windows research kernel
berkus wrote:My point was that I consider myself knowledgeable in character representations and I find it highly amusing when amateurs start arguing about it without knowing a half-bit on the subject. And you just got some of rdos' points for distinguishing between characters and other code points. (Knowing what some combining pairs are is important for rendering - i.e. you will have issues with calculating a string's bounding rect without knowing this, but it also may be important to know where the code point boundaries are or how many of them are present.) Also reducing to ASCII only means you're leaving out about 90% of the Earth's population based on language-specific glyphs only (and even more if you start counting service code points, non-breakable zero-width spaces, accents and the such). Not that it would matter for a hobby OS, but this is a very bad excuse to me.
As you said, that project won't ever have 10% of the userbase.. there is absolutely nothing wrong with supporting only ASCII and letting the rest of the world suffer, they already do.
Let's call it the anti-"chicken scratch" revolution!
- thepowersgang
- Member
- Posts: 734
- Joined: Tue Dec 25, 2007 6:03 am
- Libera.chat IRC: thePowersGang
- Location: Perth, Western Australia
- Contact:
Re: windows research kernel
Just to put my own two cents in: unless I have gravely misunderstood how UTF-8 works, it is the best format for almost every application (although a little inefficient for non-European character sets).
As berkus pointed out in his question, it is difficult to split a string between characters (rather than bytes), but in most cases (especially the example of splitting paths) you have determined the location of the '/' character beforehand, and can use the byte location instead of the character position.
UTF-8 also has the advantage that all extended (non-ASCII) characters are encoded as sequences of byte values above 127, so they won't unintentionally match other characters.
Sure, if you want to get the first 'n' characters (for printing) you will have a bit more of an issue, but that should only really be needed for presentation (where you already need to parse the string anyway).
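The path-splitting point above can be sketched in a few lines. This is a hypothetical helper (`split_path` is not from any poster's code), relying only on the fact that every byte of a multi-byte UTF-8 sequence has its high bit set, so a plain byte search for the ASCII '/' (0x2F) can never land in the middle of a non-ASCII character:

```cpp
#include <string>
#include <utility>

// Split a UTF-8 path at the last '/' byte. Safe without any decoding,
// because continuation and lead bytes of multi-byte sequences are all
// in the range 0x80-0xFF and can never equal '/' (0x2F).
std::pair<std::string, std::string> split_path(const std::string &path) {
    std::string::size_type pos = path.rfind('/');
    if (pos == std::string::npos)
        return {"", path};                      // no directory component
    return {path.substr(0, pos), path.substr(pos + 1)};
}
```

The same property makes byte-wise `strchr`, `strtok` on ASCII delimiters, and substring search for ASCII needles work unmodified on UTF-8 data.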
Kernel Development, It's the brain surgery of programming.
Acess2 OS (c) | Tifflin OS (rust) | mrustc - Rust compiler
Currently Working on: mrustc
- Combuster
- Member
- Posts: 9301
- Joined: Wed Oct 18, 2006 3:45 am
- Libera.chat IRC: [com]buster
- Location: On the balcony, where I can actually keep 1½m distance
- Contact:
Re: windows research kernel
berkus wrote:And you just got some of rdos' points for distinguishing between characters and other code points.
TBH, I almost didn't want to enter the discussion because I technically had to choose his side and didn't want to boost his ego, but that wouldn't be fair for a balanced discussion.
Anyway, I think it's safe to say that 16-bit-only unicode would nowadays qualify as a design error.
- DavidCooper
- Member
- Posts: 1150
- Joined: Wed Oct 27, 2010 4:53 pm
- Location: Scotland
Re: windows research kernel
The future will probably be something new with morpheme codes - that's what I'm working on: a two-byte code to identify the language, followed by a string of two-byte codes representing the semantic components of words. It's the most sensible way to go if you want the machine to efficiently understand all the text it's working with.
Help the people of Laos by liking - https://www.facebook.com/TheSBInitiative/?ref=py_c
MSB-OS: http://www.magicschoolbook.com/computing/os-project - direct machine code programming
- OSwhatever
- Member
- Posts: 595
- Joined: Mon Jul 05, 2010 4:15 pm
Re: windows research kernel
UTF-8 is really the work of a genius.
ASCII compatibility
You can use many of the old standard C functions
Byte order insensitive
Resynchronization possible
You can keep the old APIs intact
UTF8 is here to stay and most people writing OSes today will choose UTF8 for obvious reasons.
The only drawback is that random character indexing is slow. You can always convert internally to UTF-32 in that case. I haven't encountered any case where I had to do this, though. Perhaps if you write word processors/editors or something similar.
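The "convert internally to UTF-32" fallback mentioned above is short to sketch. A minimal decoder, assuming well-formed input (a real one would validate continuation bytes and reject overlong sequences); `utf8_to_utf32` is a hypothetical name, not a standard function:

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Convert a (valid) UTF-8 string into a vector of UTF-32 code points.
// The lead byte determines the sequence length; each continuation byte
// (10xxxxxx) contributes 6 more bits.
std::vector<uint32_t> utf8_to_utf32(const std::string &s) {
    std::vector<uint32_t> out;
    for (std::size_t i = 0; i < s.size(); ) {
        unsigned char b = s[i];
        uint32_t cp;
        std::size_t len;
        if      (b < 0x80) { cp = b;        len = 1; } // 0xxxxxxx: plain ASCII
        else if (b < 0xE0) { cp = b & 0x1F; len = 2; } // 110xxxxx
        else if (b < 0xF0) { cp = b & 0x0F; len = 3; } // 1110xxxx
        else               { cp = b & 0x07; len = 4; } // 11110xxx
        for (std::size_t j = 1; j < len; ++j)
            cp = (cp << 6) | (static_cast<unsigned char>(s[i + j]) & 0x3F);
        out.push_back(cp);
        i += len;
    }
    return out;
}
```

Once decoded, indexing the vector is O(1), which is exactly the trade-off discussed in this thread.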
- Owen
- Member
- Posts: 1700
- Joined: Fri Jun 13, 2008 3:21 pm
- Location: Cambridge, United Kingdom
- Contact:
Re: windows research kernel
OSwhatever wrote:UTF8 is really work of a genius.
ASCII compatibility
Granted.
OSwhatever wrote:You can use many of the old standard C functions
This pretends that the wide string versions don't exist.
OSwhatever wrote:Byte order insensitive
Byte order marks, or external signalling.
OSwhatever wrote:Resynchronization possible
This presupposes that resynchronization is an issue.
OSwhatever wrote:You can keep the old APIs intact
You're designing a new OS. What old APIs?
OSwhatever wrote:UTF8 is here to stay and most people writing OSes today will choose UTF8 for obvious reasons.
UTF-8 is a fine on-disk format for many things. However, when actually processing as text (rather than as an opaque blob of characters), UTF-16 or UCS-4 is more useful (and UCS-4 has massive memory and memory-bandwidth overhead, so UTF-16 is preferable).
OSwhatever wrote:Only drawback is that random indexing characters is slow. You can always internally convert to UTF32 in that case. I haven't encountered any case where I have to do this though. Perhaps if you write word processors/editors or something similar.
Not just random indexing. Any processing which involves converting the characters to scalar values. UTF-8 decoding code is very branchy; UTF-16 code is much less so.
Also: C's wide string support is abysmal. I look forward to the basic but decent support provided by C1X. One of C's wide char support's biggest problems is that it requires UCS-4 (A wchar_t must be one character). Yes, Microsoft's implementation is non-conformant; this is a historical accident (It was rendered non-conformant by a Unicode update!)
Also, note the difference between UCS-2 and UTF-16. The former is strictly two bytes per character, and encodes the basic multilingual plane. The second is either two or four bytes per character.
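The UCS-2/UTF-16 difference above comes down to surrogate pairs: code points beyond U+FFFF are split across two 16-bit units, which UCS-2 simply cannot represent. A minimal sketch of the scheme (the function name is hypothetical):

```cpp
#include <cstdint>
#include <utility>

// Encode a code point above U+FFFF as a UTF-16 surrogate pair.
// After subtracting 0x10000, 20 bits remain: the top 10 go into a
// high surrogate (0xD800-0xDBFF), the bottom 10 into a low surrogate
// (0xDC00-0xDFFF).
std::pair<uint16_t, uint16_t> to_surrogate_pair(uint32_t cp) {
    uint32_t v = cp - 0x10000;
    uint16_t hi = static_cast<uint16_t>(0xD800 | (v >> 10));   // top 10 bits
    uint16_t lo = static_cast<uint16_t>(0xDC00 | (v & 0x3FF)); // bottom 10 bits
    return {hi, lo};
}
```

This is also why "a wchar_t must be one character" broke Microsoft's 16-bit wchar_t once Unicode grew past the basic multilingual plane.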
Re: windows research kernel
Thomas wrote:
Happy to note that there are a lot of OpenVMS deployments in Europe!
---Thomas
That system is not made for humans!
I remember having great difficulty understanding the FS layout and logical names... but yeah, it's very stable and still alive, contrary to what was predicted years ago.
Re: windows research kernel
OSwhatever wrote:Only drawback is that random indexing characters is slow. You can always internally convert to UTF32 in that case. I haven't encountered any case where I have to do this though. Perhaps if you write word processors/editors or something similar.
That can easily be solved in a C++ string class. Just allocate an extra index array for the string when the user tries to index. At the first access the index vector is created, and on subsequent accesses indexing is just as fast as UCS-4.
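A hypothetical sketch of the lazily-indexed string rdos describes (class and member names are invented for illustration; input is assumed to be valid UTF-8):

```cpp
#include <cstddef>
#include <string>
#include <vector>

// UTF-8 string that builds a per-character byte-offset index on the
// first indexed access; subsequent accesses are O(1) like UCS-4.
class IndexedUtf8 {
    std::string bytes_;                  // UTF-8 data
    mutable std::vector<std::size_t> index_; // byte offset of each character
    void build_index() const {
        for (std::size_t i = 0; i < bytes_.size(); ) {
            index_.push_back(i);
            unsigned char b = bytes_[i];
            // The lead byte encodes the sequence length (valid input assumed).
            if      (b < 0x80) i += 1;
            else if (b < 0xE0) i += 2;
            else if (b < 0xF0) i += 3;
            else               i += 4;
        }
    }
    void ensure_index() const {
        if (index_.empty() && !bytes_.empty()) build_index();
    }
public:
    explicit IndexedUtf8(std::string s) : bytes_(std::move(s)) {}
    std::size_t length() const { ensure_index(); return index_.size(); }
    // Return the UTF-8 bytes of the n-th character.
    std::string operator[](std::size_t n) const {
        ensure_index();
        std::size_t start = index_[n];
        std::size_t end = (n + 1 < index_.size()) ? index_[n + 1] : bytes_.size();
        return bytes_.substr(start, end - start);
    }
};
```

Strings that are never indexed never pay for the index, which is the selling point of the scheme (and the target of berkus' space objection below).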
Re: windows research kernel
rdos wrote:That can easily be solved in a C++ string class. Just allocate an extra index array with for string when the user tries to index. At the first access the index vector is created, and on subsequent accesses it works just as fast as UCS-4.
berkus wrote:You're kidding me.
That was my reaction, too, at first, but if it were done well, it would have a smaller memory footprint and it would take less time to "convert" from UTF-8 (as you're just storing offsets).
</rant>
Also, wouldn't it be nice if we could do u8"abcd ăâ€îo§ș"s in C++, and get a nice std::string or an equivalent (with complete UTF-8 capabilities)?
Re: windows research kernel
UTF-8 looks like the best format in most cases to me, mostly for the reasons OSwhatever gave. In most string processing you have to go through character by character anyway, and the convenience of being 100% compatible with ASCII is enormous. UTF-8 needs no byte order marks; with UTF-16 you lose that advantage yet keep the problems of variable-size characters.
Of course, convert to a wider, fixed-width format internally when it's more convenient (which appears to be rarely), but there's no need to make that the default.
Re: windows research kernel
berkus wrote:No, it will take a lot more (4 bytes + 1 to 6 bytes per character, this is 5-10 bytes per character, compared to plain UCS-4 with fixed 4 bytes per character), it will also be a lot slower (more potentially disjoint memory accesses). I double dare you to invent a "well" implementation of what rdos proposed to be at least on par with plain UCS-4 in speed/space.
Oops... I see now what I've underestimated... You're right, and there are ways to minimize the space used, but they still require some kind of encoding, potentially making it even slower than decoding UTF-8. /facepalm
(Remember that C strings are potentially allowed to be up to 2^(address space size) bytes in size, so on x86-64 your "indexed UTF-8" will take 9-14 bytes per character).
Re: windows research kernel
berkus wrote:No, it will take a lot more (4 bytes + 1 to 6 bytes per character, this is 5-10 bytes per character, compared to plain UCS-4 with fixed 4 bytes per character), it will also be a lot slower (more potentially disjoint memory accesses). I double dare you to invent a "well" implementation of what rdos proposed to be at least on par with plain UCS-4 in speed/space.
You forgot that most software is not interested in exact character marks, and will not do character indexing.
Besides, indexes could use 1, 2 or 4 bytes depending on string size. That means 2 to 7 bytes per character for strings up to 256 bytes. For a typical English-language string (up to 256 bytes) it means 2 bytes per character, compared to 4 for UCS-4.
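The arithmetic behind that claim can be made concrete. A hypothetical helper (not from rdos' actual code) picking the index width from the string's byte length, on the assumption that index entries are byte offsets into the string:

```cpp
#include <cstddef>

// Smallest integer width (in bytes) that can hold any byte offset
// into a string of the given size.
std::size_t index_bytes_per_char(std::size_t string_bytes) {
    if (string_bytes <= 0xFF)   return 1; // offsets fit in one byte
    if (string_bytes <= 0xFFFF) return 2; // offsets fit in two bytes
    return 4;                             // fall back to 32-bit offsets
}
```

For a 200-byte ASCII string this gives 1 UTF-8 byte + 1 index byte = 2 bytes per character, versus a flat 4 bytes per character for UCS-4; berkus' 9-14 byte figure assumes full pointer-width offsets on every string.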
Re: windows research kernel
berkus wrote:You forget that for typical english you do not need a separate index at all. So your code becomes a convoluted mess of flags, indexes and field bit widths just to "save" on those pesky UCS-4 extra bytes. Good luck!
Whether it is a mess or not makes little difference, as the mess is hidden away in the class library. Buffer allocation, string concatenation and reference counting are also messes hidden away in the class library so users don't need to bother with those details. The same could be done for character referencing. Even better, if UCS-4 proves to be better, it could be used internally, hiding this (awful) detail inside the class library and never importing/exporting UCS-4 strings. But I doubt this is beneficial, as all incoming strings would then need to be converted to UCS-4. With an internal UTF-8 representation, conversion would only be done when character indexing is needed.