OSDev.org

Posted: **Wed Jun 01, 2011 12:15 am**

rdos wrote:
Owen wrote:Also: Standardizing on UTF-8 presupposes that it is a better character set than UTF-16. I would disagree. (UTF-16 is far simpler to decode)
UTF-8 is superior for most languages, simply because it preserves ASCII compatibility, which Unicode and UTF-16 do not. Whether it is harder to decode for odd characters is not an issue.
But then MS are known to break standards and compatibility.

You know that UTF-8 is Unicode? Unicode is a glyph standard. UTF-8, UTF-16 and then the uncompressed UTF-32 are all compressions of Unicode. Whatever Unicode provides, UTF-8 provides, since UTF-8 is nothing but a compression format for Unicode strings.

Also, MS does not "break" anything in its implementation of Unicode support, they just use UTF-16, and that is all. They didn't do anything askance.

Posted: **Wed Jun 01, 2011 1:23 am**

gravaera wrote:Also, MS does not "break" anything in its implementation of Unicode support, they just use UTF-16, and that is all. They didn't do anything askance.

On Windows, a wide string is UCS-2; on all other platforms (that I know about) it's UTF-32. And the Windows API does not support UTF-8 at all (AFAIK); you have to use third-party libs like ICU for that.

However, you can't really blame them. When they implemented their UCS-2 wide strings, UCS-2 was enough to encode all code points defined at that time.
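
To make the UCS-2 limitation concrete, here is a minimal sketch of how UTF-16 encodes a code point. Anything below U+10000 fits in one 16-bit unit (which is all UCS-2 can express), while anything above needs a surrogate pair. The name `utf16_encode` is a hypothetical helper for illustration, not part of any Windows API.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode one Unicode code point as UTF-16.
 * Returns the number of 16-bit units written to out (1 or 2), or 0 if
 * cp is not encodable (a surrogate value, or above U+10FFFF). */
static size_t utf16_encode(uint32_t cp, uint16_t out[2])
{
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;                       /* reserved for surrogates */
    if (cp > 0x10FFFF)
        return 0;                       /* outside the Unicode range */
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;          /* one unit: same as UCS-2 */
        return 1;
    }
    cp -= 0x10000;                      /* 20 bits remain */
    out[0] = (uint16_t)(0xD800 | (cp >> 10));    /* high surrogate */
    out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF));  /* low surrogate */
    return 2;
}
```

For example, U+20AC comes out as the single unit 0x20AC, while U+1F600 comes out as the pair 0xD83D, 0xDE00, which a pure UCS-2 consumer would see as two separate "characters".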

They could've implemented UTF-8 somewhere down the road, though. Then again, their API is bulky enough as it is already.

Posted: **Wed Jun 01, 2011 1:57 am**

Hi,

Just an observation... maybe not very accurate... but why is MacOS way more popular in the US than in Europe? I mean... here it's more exotic than... uugghh... OpenVMS.
Yes, they make expensive hardware, but man... Switzerland, Sweden, Finland, Holland, Germany... all have a much higher standard than the US.

Happy to note that there are a lot of OpenVMS deployments in Europe!

---Thomas

Posted: **Wed Jun 01, 2011 1:59 am**

gravaera wrote:You know that UTF-8 is Unicode? Unicode is a glyph standard. UTF-8, UTF-16 and then the uncompressed UTF-32 are all compressions of Unicode. Whatever Unicode provides, UTF-8 provides, since UTF-8 is nothing but a compression format for Unicode strings.

The issue is that UTF-16 and UCS-2 are bulky representations of characters that are not compatible with simple text-editors. Most strings in a program can be represented with a single byte per character, so there is no reason to start using 16-bit wide characters. It only adds problems, especially if there is an option to use either 8-bit or 16-bit characters.

Besides, it is only in very few places in the OS that UTF-8 strings actually need to be decoded and rendered. Those are in the graphics API for writing text strings and for getting string metrics. All other places don't care whether special encodings are used. Switching to 16-bit wide characters or supporting both types, requires changes in many instances. I like to use "char" type for character strings, and to assume char is a byte. I do not like custom defined types for strings.

Posted: **Wed Jun 01, 2011 2:19 am**

berkus wrote:You do understand that processing strings with fixed 2-byte character sizes is simpler than processing strings with variable 1-6 bytes per character?

No, should I? String processing does not need to care about the special escape characters used in UTF-8. The null terminator is still 0, and it is not used in the escape sequences. The only time escape characters need to be taken into account is when displaying strings.

EDIT: You could also provide a function "RealLen" that returns number of characters in the string, but I don't expect it to be used a lot.
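
A minimal sketch of what such a "RealLen" could look like, assuming the input is valid, NUL-terminated UTF-8. The name `utf8_real_len` is made up here for illustration. It counts code points by skipping continuation bytes (those of the form 10xxxxxx); note that code points are not always the same thing as user-visible characters once combining marks are involved.

```c
#include <stddef.h>

/* Hypothetical "RealLen": number of code points in a valid,
 * NUL-terminated UTF-8 string. Every code point has exactly one byte
 * that is not a continuation byte (10xxxxxx), so count those. */
static size_t utf8_real_len(const char *s)
{
    size_t n = 0;
    for (; *s != '\0'; ++s) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            ++n;
    }
    return n;
}
```

For "g\xC3\xB8r" (the UTF-8 bytes of "gør"), `strlen` reports 4 bytes while `utf8_real_len` reports 3 code points. Plain copy and concatenation can keep using the byte length.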

Posted: **Wed Jun 01, 2011 2:50 am**

berkus wrote:You do understand that processing strings with fixed 2-byte character sizes is simpler than processing strings with variable 1-6 bytes per character?

You do understand that in rdos' world 2-byte characters are never smaller in storage.

Posted: **Wed Jun 01, 2011 3:59 am**

If you speak Chinese they would be, but not for English or Swedish.
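
The storage trade-off can be checked per code point. The sketch below assumes valid Unicode scalar values as input; `utf8_bytes` and `utf16_bytes` are hypothetical helper names introduced only for this comparison.

```c
#include <stddef.h>
#include <stdint.h>

/* Bytes needed to store one code point in UTF-8.
 * Assumes cp is a valid Unicode scalar value (<= U+10FFFF). */
static size_t utf8_bytes(uint32_t cp)
{
    if (cp < 0x80)    return 1;   /* ASCII */
    if (cp < 0x800)   return 2;   /* e.g. Latin letters with diacritics */
    if (cp < 0x10000) return 3;   /* rest of the BMP, incl. most CJK */
    return 4;                     /* supplementary planes */
}

/* Bytes needed to store one code point in UTF-16. */
static size_t utf16_bytes(uint32_t cp)
{
    return cp < 0x10000 ? 2 : 4;  /* one unit, or a surrogate pair */
}
```

So 'A' (U+0041) costs 1 byte in UTF-8 versus 2 in UTF-16, 'ö' (U+00F6) costs 2 in both, and '中' (U+4E2D) costs 3 in UTF-8 versus 2 in UTF-16.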

Posted: **Wed Jun 01, 2011 4:07 am**

OSwhatever wrote:I never understood the hatred towards Microsoft among many software developers.

This explains it all.

Posted: **Wed Jun 01, 2011 5:22 am**

rdos wrote:The issue is that UTF-16 and UCS-2 are bulky representations of characters that are not compatible with simple text-editors.

Any text editor not capable of handling at least UTF-8 and UTF-16 LE/BE should be ditched. ("Simple"? As in, "even VIM can do Unicode but I cannot be arsed to bother about it"?)

Most strings in a program can be represented with a single byte...

No they can't. You've got a customer in Ĳmuiden? No? You're lucky. A Danish person sees quite some difference between "gǿr" (barks) and "gør" (does), and good luck finding the former in an 8-bit encoding. (Danish people have become quite used to not using ǿ because of this, but that's not an excuse.)

I won't even get into China, Japan, or any other language that has > 256 glyphs.

Switching to 16-bit wide characters or supporting both types, requires changes in many instances.

No, because today any software engineer worth his paycheck writes his software Unicode-aware from the beginning. There is no "changing" involved other than in patching stupid, brain-dead software that still wasn't patched, twenty years after Unicode 1.0.

I like to use "char" type for character strings, and to assume char is a byte.

Uh-huh. Easy for you, hard for others. Good design is the other way around.

I do not like custom defined types for strings.

Most languages, including C++, Java, C#, Perl and Python, have native string types that are quite capable of handling UTF-16/UCS-2.

Sorry for the aggressive tone, but I'm fed up with stuff like this being impressed on the next generation of developers. Unicode is here to stay, has been around for most (if not all) of our development lives, and any half-baked software should be able to handle it.

There shouldn't be a "Hello World" or I/O how-to on the web that does not handle wide strings.

Posted: **Wed Jun 01, 2011 6:30 am**

I never had any problems with A and W functions. In fact I thought it was a pretty transparent and compatible way to introduce Unicode, especially when writing for multiple platforms (Win9x and WinNT).

Of course replacing char with TCHAR and "X" with TEXT("X") in a large project is a PITA but then again, replacing char with wchar_t and "X" with L"X" is just as much work, and IMNSHO multibyte encodings are an even greater PITA.

At least with the Windows API you know that the encoding of WCHAR is UCS-2. ISO kept the semantics of wchar_t intentionally vague.

Posted: **Wed Jun 01, 2011 6:31 am**

Solar wrote:Any text editor not capable of handling at least UTF-8 and UTF-16 LE/BE should be ditched. ("Simple"? As in, "even VIM can do Unicode but I cannot be arsed to bother about it"?)

That would exclude all text editors running in text-mode, as the character generator in a typical PC can only handle an 8-bit character set. I prefer to edit code in text mode, especially since my favorite editor runs in text-mode. RDOS also has a set of utility programs that run in text-mode. This is especially useful when some configurations have video cards that might not work with the standard VESA driver.

Solar wrote:No, because today any software engineer worth his paycheck writes his software Unicode-aware from the beginning. There is no "changing" involved other than in patching stupid, brain-dead software that still wasn't patched, twenty years after Unicode 1.0.

Our terminal software is shared between very old systems from the 80s, with LCDs with 8-bit character generators, and more modern PC platforms. This software has no idea about Unicode or 16-bit wide characters because it was mostly written before Unicode. And if I converted it to 16-bit Unicode, the old systems would break. The best route therefore is to go UTF-8. That won't break the old software, and it would allow the use of larger character sets.

Posted: **Wed Jun 01, 2011 6:36 am**

rdos wrote:
Solar wrote:Any text editor not capable of handling at least UTF-8 and UTF-16 LE/BE should be ditched. ("Simple"? As in, "even VIM can do Unicode but I cannot be arsed to bother about it"?)
That would exclude all text editors running in text-mode, as the character generator in a typical PC can only handle an 8-bit character set.

So be it. It's been some time since I last used an editor in text mode, so I can't really comment on how VIM works under those conditions, but I stand by my statement: If it cannot handle Unicode, it's not fit for use. (That'd be text mode, in this case, that's not fit for use.)

Our terminal software is shared between very old systems from the 80s, with LCDs with 8-bit character generators, and more modern PC platforms. This software has no idea about Unicode or 16-bit wide characters because it was mostly written before Unicode. And if I converted it to 16-bit Unicode, the old systems would break. The best route therefore is to go UTF-8. That won't break the old software, and it would allow the use of larger character sets.

So it's a limitation of your system. That doesn't mean that UTF-16 et al. are to be condemned.

Posted: **Wed Jun 01, 2011 6:45 am**

I've put up multilingual questionnaires on the Internet, and those used UTF-8. It worked just fine. And it worked just fine to exchange translations over email using UTF-8. UTF-8 is not a proprietary format that only MS can handle, but something that most browsers can handle. Even the ones that run in text-mode would work, provided the texts are in English.

The above is the primary reason why I think UTF-8 is adequate, and that there is no reason to bother with UTF-16 or UCS-2.

Posted: **Wed Jun 01, 2011 9:03 am**

berkus wrote:A little quiz for rdos: I have a string of exactly 136 bytes in size. How many characters are there?

Why do you want to know?

You don't need it for:
- Concat
- Copy
- printf/sprintf
- scanf

berkus wrote:
- can you figure that out for an ASCIIZ string?
- can you figure that out for a wchar_t UCS-2 string?
- can you figure that out for a UTF-8 string?

Yes to all, but UTF-8 requires you to go through the string.

berkus wrote:Can you split this string from character 15 to character 25? In UCS-2? In ASCIIZ? Now with the same speed in UTF-8?

Why would you want to do that??
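
For reference, the operation being argued about can be sketched as follows, assuming valid NUL-terminated UTF-8. The names `utf8_advance` and `utf8_slice` are hypothetical. The point of the sketch is the cost: finding character N means walking the string from the start, whereas in a fixed-width encoding it is a single index calculation.

```c
#include <stddef.h>

/* Hypothetical helper: pointer to the start of code point number
 * `index` (0-based) in valid UTF-8, or to the terminator if the
 * string is shorter. Linear in the number of bytes skipped. */
static const char *utf8_advance(const char *s, size_t index)
{
    while (*s != '\0' && index > 0) {
        ++s;                                        /* past the lead byte */
        while (((unsigned char)*s & 0xC0) == 0x80)
            ++s;                                    /* past continuation bytes */
        --index;
    }
    return s;
}

/* Hypothetical helper: locate code points [first, last) in s.
 * Stores the start of the range in *start and returns its length
 * in bytes (0 if the range is empty or past the end). */
static size_t utf8_slice(const char *s, size_t first, size_t last,
                         const char **start)
{
    const char *b = utf8_advance(s, first);
    const char *e = (last > first) ? utf8_advance(b, last - first) : b;
    *start = b;
    return (size_t)(e - b);
}
```

Splitting "from character 15 to character 25" would then be `utf8_slice(s, 15, 25, &p)` followed by a `memcpy` of the returned byte count.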

Posted: **Wed Jun 01, 2011 11:21 am**

ASCII ftw.

windows research kernel
