windows research kernel

gravaera
Member
Posts: 737
Joined: Tue Jun 02, 2009 4:35 pm
Location: Supporting the cause: Use \tabs to indent code. NOT \x20 spaces.

Re: windows research kernel

Post by gravaera »

rdos wrote:
Owen wrote:Also: Standardizing on UTF-8 presupposes that it is a better character set than UTF-16. I would disagree. (UTF-16 is far simpler to decode)
UTF-8 is superior for most languages, simply because it preserves ASCII compatibility, which UTF-16 does not. That it is harder to decode for odd characters is not an issue.
But then, MS is known for breaking standards and compatibility.
You know that UTF-8 is Unicode, right? Unicode is a character standard; UTF-8, UTF-16, and the fixed-width UTF-32 are all encodings of it. Whatever Unicode provides, UTF-8 provides, since UTF-8 is nothing but an encoding format for Unicode strings.

Also, MS does not "break" anything in its implementation of Unicode support; they just use UTF-16, and that is all. They didn't do anything underhanded.
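To make the "UTF-8 is just an encoding of Unicode" point concrete, here is a minimal sketch of the encoding rules, not any particular library's implementation. It assumes valid scalar values; a real encoder would also reject the surrogate range U+D800–U+DFFF.

```c
#include <stddef.h>

/* Encode one Unicode code point as UTF-8.
 * Returns the number of bytes written (1-4), or 0 for input
 * beyond U+10FFFF. Continuation bytes all look like 10xxxxxx. */
static size_t utf8_encode(unsigned long cp, unsigned char out[4])
{
    if (cp < 0x80) {                    /* 7 bits: plain ASCII byte */
        out[0] = (unsigned char)cp;
        return 1;
    } else if (cp < 0x800) {            /* 11 bits: 2 bytes */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    } else if (cp < 0x10000) {          /* 16 bits: 3 bytes */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    } else if (cp < 0x110000) {         /* 21 bits: 4 bytes */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                           /* not a valid Unicode code point */
}
```

Note how ASCII passes through unchanged (the 1-byte case), which is exactly the compatibility property argued about above.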
17:56 < sortie> Paging is called paging because you need to draw it on pages in your notebook to succeed at it.
Solar
Member
Posts: 7615
Joined: Thu Nov 16, 2006 12:01 pm
Location: Germany
Contact:

Re: windows research kernel

Post by Solar »

gravaera wrote:Also, MS does not "break" anything in its implementation of Unicode support; they just use UTF-16, and that is all. They didn't do anything underhanded.
On Windows, a wide string is UCS-2; on all other platforms (that I know of) it's UTF-32. And the Windows API does not support UTF-8 at all (AFAIK); you have to use third-party libs like ICU for that.

However, you can't really blame them. When they implemented their UCS-2 wide strings, UCS-2 was enough to encode all code points defined at that time.

They could've implemented UTF-8 somewhere down the road, though. Then again, their API is bulky enough as it is already.
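The platform difference Solar describes is visible directly in `sizeof(wchar_t)`. This sketch just demonstrates that fact; the concrete sizes in the comments are the usual ones, not a guarantee of the C standard, which leaves `wchar_t` implementation-defined.

```c
#include <stddef.h>
#include <wchar.h>

/* Why "wide string" means different things per platform: wchar_t holds
 * a 2-byte UTF-16/UCS-2 code unit on Windows but a 4-byte UTF-32 code
 * point on most Unix-like systems. Portable code therefore cannot
 * assume one wchar_t is one character, nor that wide strings have the
 * same byte size everywhere. */
static size_t wide_string_bytes(void)
{
    /* 3 code units for "abc" plus the terminating NUL:
     * typically 8 bytes on Windows, 16 on Linux/glibc. */
    return sizeof(L"abc");
}
```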
Every good solution is obvious once you've found it.
Thomas
Member
Posts: 281
Joined: Thu Jun 04, 2009 11:12 pm

Re: windows research kernel

Post by Thomas »

Hi,
Just an observation, maybe not very accurate, but why is MacOS way more popular in the US than in Europe? I mean, here it's more exotic than... ugh... OpenVMS.
Yes, they make expensive hardware, but man, Switzerland, Sweden, Finland, Holland, Germany... all have a much higher standard of living than the US.
Happy to note that there are a lot of OpenVMS deployments in Europe!

---Thomas
rdos
Member
Posts: 3276
Joined: Wed Oct 01, 2008 1:55 pm

Re: windows research kernel

Post by rdos »

gravaera wrote:You know that UTF-8 is Unicode, right? Unicode is a character standard; UTF-8, UTF-16, and the fixed-width UTF-32 are all encodings of it. Whatever Unicode provides, UTF-8 provides, since UTF-8 is nothing but an encoding format for Unicode strings.
The issue is that UTF-16 and UCS-2 are bulky representations that are not compatible with simple text editors. Most strings in a program can be represented with a single byte per character, so there is no reason to start using 16-bit wide characters. It only adds problems, especially if there is an option to use either 8- or 16-bit characters.

Besides, there are only a very few places in the OS where UTF-8 strings actually need to be decoded and rendered: the graphics API, for drawing text strings and for getting string metrics. No other place cares whether special encodings are used. Switching to 16-bit wide characters, or supporting both types, requires changes in many places. I like to use the "char" type for character strings and to assume char is a byte. I do not like custom-defined types for strings.
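A decode step of the kind rdos describes for a text-drawing path could look like the following. This is a minimal sketch, not RDOS's actual code; it also skips the validation (overlong forms, surrogates) that a robust decoder needs.

```c
#include <stddef.h>

/* Decode the next code point from a NUL-terminated UTF-8 string.
 * Advances *s past the sequence and returns the code point, or
 * 0xFFFD (the replacement character) for a malformed lead byte or
 * a broken continuation byte. */
static unsigned long utf8_next(const unsigned char **s)
{
    const unsigned char *p = *s;
    unsigned long cp;
    size_t extra, i;

    if (p[0] < 0x80)                { cp = p[0];        extra = 0; }
    else if ((p[0] & 0xE0) == 0xC0) { cp = p[0] & 0x1F; extra = 1; }
    else if ((p[0] & 0xF0) == 0xE0) { cp = p[0] & 0x0F; extra = 2; }
    else if ((p[0] & 0xF8) == 0xF0) { cp = p[0] & 0x07; extra = 3; }
    else { *s = p + 1; return 0xFFFD; }  /* stray continuation byte */

    for (i = 1; i <= extra; i++) {
        if ((p[i] & 0xC0) != 0x80) { *s = p + i; return 0xFFFD; }
        cp = (cp << 6) | (p[i] & 0x3F);  /* append 6 payload bits */
    }
    *s = p + extra + 1;
    return cp;
}
```

Everything outside such a rendering path can treat the string as an opaque run of non-zero bytes, which is rdos's argument.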
rdos
Member
Posts: 3276
Joined: Wed Oct 01, 2008 1:55 pm

Re: windows research kernel

Post by rdos »

berkus wrote:You do understand that processing strings with fixed 2-byte character sizes is simpler than processing strings with variable 1-6 bytes per character?
No, should I? String processing does not need to care about the multi-byte sequences used in UTF-8. The null terminator is still 0, and 0 never appears inside a multi-byte sequence. The only place the sequences need to be taken into account is when displaying strings.

EDIT: You could also provide a function "RealLen" that returns the number of characters in a string, but I don't expect it to be used a lot.
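The "RealLen" idea is cheap to implement because UTF-8 continuation bytes are self-identifying. A sketch, assuming well-formed input (the name `real_len` is just this post's suggestion, not a standard function):

```c
#include <stddef.h>

/* Count characters (code points) in a NUL-terminated UTF-8 string.
 * Continuation bytes have the bit pattern 10xxxxxx, so counting only
 * the bytes that are NOT continuations counts one per character. */
static size_t real_len(const char *s)
{
    size_t n = 0;
    for (; *s; s++)
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    return n;
}
```

This is O(n) in the byte length, which is the cost berkus is pointing at; for fixed-width UCS-2 the same count is just the unit count.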
Combuster
Member
Posts: 9301
Joined: Wed Oct 18, 2006 3:45 am
Libera.chat IRC: [com]buster
Location: On the balcony, where I can actually keep 1½m distance
Contact:

Re: windows research kernel

Post by Combuster »

berkus wrote:You do understand that processing strings with fixed 2-byte character sizes is simpler than processing strings with variable 1-6 bytes per character?
You do understand that in rdos' world 2-byte characters are never smaller in storage :mrgreen:
"Certainly avoid yourself. He is a newbie and might not realize it. You'll hate his code deeply a few years down the road." - Sortie
[ My OS ] [ VDisk/SFS ]
rdos
Member
Posts: 3276
Joined: Wed Oct 01, 2008 1:55 pm

Re: windows research kernel

Post by rdos »

If you speak Chinese they would be, but not for English or Swedish. :mrgreen:
qw
Member
Posts: 792
Joined: Mon Jan 26, 2009 2:48 am

Re: windows research kernel

Post by qw »

OSwhatever wrote:I never understood the hatred towards Microsoft among many software developers.
This explains it all.
Solar
Member
Posts: 7615
Joined: Thu Nov 16, 2006 12:01 pm
Location: Germany
Contact:

Re: windows research kernel

Post by Solar »

rdos wrote:The issue is that UTF-16 and UCS-2 are bulky representations of characters that are not compatible with simple text-editors.
Any text editor not capable of handling at least UTF-8 and UTF-16 LE/BE should be ditched. ("Simple"? As in, "even VIM can do Unicode but I cannot be arsed to bother about it"?)
Most strings in a program can be represented with a single byte per character...
No, they can't. You've got a customer in IJmuiden? No? You're lucky. A Danish person sees quite a difference between "gǿr" (barks) and "gør" (does), and good luck finding the former in an 8-bit encoding. (Danish people have become quite used to not using ǿ because of this, but that's no excuse.)

I won't even get into China, Japan, or any other language that has > 256 glyphs.
Switching to 16-bit wide characters, or supporting both types, requires changes in many places.
No, because today any software engineer worth his paycheck writes his software Unicode-aware from the beginning. There is no "changing" involved other than patching stupid, brain-dead software that still hasn't been patched, twenty years after Unicode 1.0.
I like to use "char" type for character strings, and to assume char is a byte.
Uh-huh. Easy for you, hard for others. Good design is the other way around.
I do not like custom defined types for strings.
Most languages, including C++, Java, C#, Perl and Python, have native string types that are quite capable of handling UTF-16/UCS-2.

Sorry for the aggressive tone, but I'm fed up with stuff like this being impressed on the next generation of developers. Unicode is here to stay, has been around for most (if not all) of our development lives, and even half-baked software should be able to handle it.

There shouldn't be a "Hello World" or I/O how-to on the web that does not handle wide strings.
Last edited by Solar on Wed Jun 01, 2011 6:31 am, edited 1 time in total.
Every good solution is obvious once you've found it.
qw
Member
Posts: 792
Joined: Mon Jan 26, 2009 2:48 am

Re: windows research kernel

Post by qw »

I never had any problems with A and W functions. In fact I thought it was a pretty transparent and compatible way to introduce Unicode, especially when writing for multiple platforms (Win9x and WinNT).

Of course replacing char with TCHAR and "X" with TEXT("X") in a large project is a PITA but then again, replacing char with wchar_t and "X" with L"X" is just as much work, and IMNSHO multibyte encodings are an even greater PITA.

At least with the Windows API, you know that the encoding of WCHAR is UCS-2; ISO kept the semantics of wchar_t intentionally vague.
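The TCHAR mechanism qw describes boils down to a handful of conditional typedefs and macros. Here is a portable sketch of the idea; `UNICODE_BUILD` is a hypothetical stand-in for the `_UNICODE` macro that Windows' `<tchar.h>` keys off, and the real header maps many more names (`_tcscpy`, `_tprintf`, ...).

```c
#include <string.h>
#include <wchar.h>

/* One source tree, two builds: with UNICODE_BUILD defined, TCHAR is a
 * wide character and TEXT("X") becomes L"X" (the W/WinNT path); without
 * it, everything stays narrow char (the A/Win9x path). */
#ifdef UNICODE_BUILD
typedef wchar_t TCHAR;
#define TEXT(s) L##s
#define tcslen  wcslen
#else
typedef char TCHAR;
#define TEXT(s) s
#define tcslen  strlen
#endif
```

The transparency qw praises comes from the fact that application code only ever writes `TCHAR`, `TEXT(...)`, and `tcslen`, and the preprocessor picks the encoding at build time.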
rdos
Member
Posts: 3276
Joined: Wed Oct 01, 2008 1:55 pm

Re: windows research kernel

Post by rdos »

Solar wrote:Any text editor not capable of handling at least UTF-8 and UTF-16 LE/BE should be ditched. ("Simple"? As in, "even VIM can do Unicode but I cannot be arsed to bother about it"?)
That would exclude all text editors running in text mode, as the character generator in a typical PC can only handle an 8-bit character set. I prefer to edit code in text mode, especially since my favorite editor runs in text mode. RDOS also has a set of utility programs that run in text mode. This is especially useful since some configurations have video cards that might not work with the standard VESA driver.
Solar wrote:No, because today any software engineer worth his paycheck writes his software Unicode-aware from the beginning. There is no "changing" involved other than patching stupid, brain-dead software that still hasn't been patched, twenty years after Unicode 1.0.
Our terminal software is shared between very old systems from the 80s, with LCDs driven by 8-bit character generators, and more modern PC platforms. This software has no idea about Unicode or 16-bit wide characters because most of it was written before Unicode existed. And if I converted it to 16-bit Unicode, the old systems would break. The best route therefore is to go with UTF-8: that won't break the old software, and it allows the use of larger character sets.
Solar
Member
Posts: 7615
Joined: Thu Nov 16, 2006 12:01 pm
Location: Germany
Contact:

Re: windows research kernel

Post by Solar »

rdos wrote:
Solar wrote:Any text editor not capable of handling at least UTF-8 and UTF-16 LE/BE should be ditched. ("Simple"? As in, "even VIM can do Unicode but I cannot be arsed to bother about it"?)
That would exclude all text editors running in text mode, as the character generator in a typical PC can only handle an 8-bit character set.
So be it. It's been some time since I last used an editor in text mode, so I can't really comment on how VIM works under those conditions, but I stand by my statement: If it cannot handle Unicode, it's not fit for use. (That'd be text mode, in this case, that's not fit for use.)
Our terminal software is shared between very old systems from the 80s, with LCDs driven by 8-bit character generators, and more modern PC platforms. This software has no idea about Unicode or 16-bit wide characters because most of it was written before Unicode existed. And if I converted it to 16-bit Unicode, the old systems would break. The best route therefore is to go with UTF-8: that won't break the old software, and it allows the use of larger character sets.
So it's a limitation of your system. That doesn't mean that UTF-16 et al. are to be condemned.
Every good solution is obvious once you've found it.
rdos
Member
Posts: 3276
Joined: Wed Oct 01, 2008 1:55 pm

Re: windows research kernel

Post by rdos »

I've put up multilingual questionnaires on the Internet, and those used UTF-8. It worked just fine, and it worked just fine to exchange translations over email using UTF-8. UTF-8 is not a proprietary format that only MS can handle, but something that most browsers can handle. Even the ones that run in text mode would work, provided the texts are in English.

The above is the primary reason why I think UTF-8 is adequate, and that there is no reason to bother with UTF-16 or UCS-2.
rdos
Member
Posts: 3276
Joined: Wed Oct 01, 2008 1:55 pm

Re: windows research kernel

Post by rdos »

berkus wrote:A little quiz for rdos: I have a string of exactly 136 bytes in size. How many characters are there?
Why do you want to know?

You don't need it for:
- Concat
- Copy
- printf/sprintf
- scanf
berkus wrote: - can you figure that out for an ASCIIZ string?
- can you figure that out for a wchar_t UCS-2 string?
- can you figure that out for a UTF-8 string?
Yes to all, but UTF-8 requires you to go through the string.
berkus wrote:Can you split this string from character 15 to character 25? In UCS-2? In ASCIIZ? Now with the same speed in UTF-8?
Why would you want to do that??
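One wrinkle in berkus's quiz that the thread glosses over: UTF-16 is only fixed-width inside the Basic Multilingual Plane, so "bytes / 2" over-counts strings containing code points above U+FFFF. A sketch of a surrogate-aware length, assuming well-formed input:

```c
#include <stddef.h>
#include <stdint.h>

/* Character (code point) count of a NUL-terminated UTF-16 string.
 * Code points above U+FFFF are stored as a surrogate pair: a high
 * surrogate (0xD800-0xDBFF) followed by a low one (0xDC00-0xDFFF),
 * which must be counted as a single character. */
static size_t utf16_len(const uint16_t *s)
{
    size_t n = 0;
    for (; *s; s++) {
        /* on a valid pair, skip the low half so the pair counts once */
        if (*s >= 0xD800 && *s <= 0xDBFF &&
            s[1] >= 0xDC00 && s[1] <= 0xDFFF)
            s++;
        n++;
    }
    return n;
}
```

So the honest answer to the quiz is that both UTF-8 and UTF-16 need a pass over the string; only UCS-2 and UTF-32 give character count by arithmetic alone.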
Brynet-Inc
Member
Posts: 2426
Joined: Tue Oct 17, 2006 9:29 pm
Libera.chat IRC: brynet
Location: Canada
Contact:

Re: windows research kernel

Post by Brynet-Inc »

ASCII ftw.
Twitter: @canadianbryan. Award by smcerm, I stole it. Original was larger.