internationalisation

df
Member
Posts: 1076
Joined: Fri Oct 22, 2004 11:00 pm

internationalisation

Post by df »

has anyone thought of internationalisation in their kernel?? I've looked at UTF-32 and it doesn't fix any of the problems of UTF-16 and UTF-8.

UTF-8 is great if you're only using english.. as soon as you use another language, the encoded size blows right out and it becomes worthless. utf-16/32 still can't even handle chinese properly... and the encoding can blow out there too...

i've got my own internal handling in my os.. and I only have what I need (ie: so problems still exist! haha)

basically it encodes each character as a 32bit number and uses a set of tags for definition...

but like anything, it breaks outside of itself. eg: the bios only supports 8bit ascii... so you need your own fonts and display/output system..

then there is mapping the internal representation onto font glyphs.. mmmm... (i haven't got that far....)

my keyboard handler directly encodes it into this format right at the base level.. (key, keyboard flags (caps/shift/ctrl/etc), lang id, sublang id).

it's a lot of work tho, each language needs its own sort routine. i disregarded language to language mapping (since you can't map english to chinese, why even try!).

Code:

// (a chunk of my header file)
// a string must be enclosed in start/end tags.
// after each start tag comes the language id / sub-language id pair.
// a sub-language id of zero is the default variant of the language.
// the lang tag always takes a PAIR (lang, sublang), never a single value.
// each character is 32 bits wide.
// eg:  <start><eng,australian>......<eng,uk>?<eng,us>aluminum<end>
// all tags have a value from 0 to 255.
// all characters start at 256 and go up to 2^32 - 1.
// gives us a working dictionary range of 4294967040 individual characters.

// undefined behaviour.. when an unknown char in the input stream is not in the dictionary,
// replace with space? (current implementation does this..)

// mapping character set maps to font glyphs?? truetype?

#define DF32_TAG_END            0x00000000
#define DF32_TAG_START          0x00000001
#define DF32_TAG_SETLANG        0x00000002

#define DF32_LANG_EN            0x00000001
#define DF32_LANG_EN_UK         0x00000000
#define DF32_LANG_EN_AU         0x00000001
#define DF32_LANG_EN_US         0x00000002
#define DF32_LANG_JP            0x00000002
#define DF32_LANG_JP_KANA       0x00000000
#define DF32_LANG_JP_HIRAGANA   0x00000001
#define DF32_LANG_JP_KATAKANA   0x00000002

#define DF32_BASE               0x100
eg:

Code:

0x00000001 //start
0x00000002 // set lang
    0x00000001 // ENG
    0x00000000 // UK (default)
.... // string (32bit characters)
0x00000000 // end
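For illustration, the tagged layout above could be built and walked roughly like this in C. This is only a sketch, not the actual implementation: the helper names df32_encode_ascii and df32_char_count are made up, and storing a character as DF32_BASE plus its ASCII code is an assumption (the post does not say how characters are numbered).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DF32_TAG_END     0x00000000u
#define DF32_TAG_START   0x00000001u
#define DF32_TAG_SETLANG 0x00000002u
#define DF32_LANG_EN     0x00000001u
#define DF32_LANG_EN_UK  0x00000000u
#define DF32_BASE        0x100u

/* Hypothetical helper: wrap 7-bit ASCII text in start/setlang/end tags.
 * Assumption: a character is stored as DF32_BASE + its ASCII code.
 * Returns the number of 32-bit words written, or 0 if the buffer is too
 * small or the text contains a non-ASCII byte. */
size_t df32_encode_ascii(const char *text, uint32_t lang, uint32_t sublang,
                         uint32_t *out, size_t cap)
{
    size_t n = 0, len = 0;
    assert(text != NULL && out != NULL);
    while (text[len] != '\0')
        len++;
    if (cap < len + 5)              /* start, setlang, lang, sublang, end */
        return 0;
    out[n++] = DF32_TAG_START;
    out[n++] = DF32_TAG_SETLANG;
    out[n++] = lang;
    out[n++] = sublang;
    for (size_t i = 0; i < len; i++) {
        unsigned char c = (unsigned char)text[i];
        if (c > 0x7F)
            return 0;
        out[n++] = DF32_BASE + c;
    }
    out[n++] = DF32_TAG_END;
    return n;
}

/* Hypothetical helper: count the characters (values >= DF32_BASE) in a
 * tagged string, skipping the (lang, sublang) pair after each set-lang tag.
 * Returns (size_t)-1 if the string is malformed. */
size_t df32_char_count(const uint32_t *s, size_t words)
{
    size_t i = 1, count = 0;
    assert(s != NULL);
    if (words == 0 || s[0] != DF32_TAG_START)
        return (size_t)-1;
    while (i < words) {
        uint32_t w = s[i];
        if (w == DF32_TAG_END)
            return count;
        if (w == DF32_TAG_SETLANG) {
            if (i + 2 >= words)     /* pair is cut off */
                return (size_t)-1;
            i += 3;                 /* tag + lang + sublang */
            continue;
        }
        if (w < DF32_BASE)          /* unknown tag */
            return (size_t)-1;
        count++;
        i++;
    }
    return (size_t)-1;              /* ran out of words before the end tag */
}
```

Encoding "hi" with (DF32_LANG_EN, DF32_LANG_EN_UK) takes seven 32-bit words for two characters, which shows the overhead of the tags on short strings.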
anyway, just another useless thing I'm wasting my time on...

a language mapping file (character set) also maps localisation (currency sign, clock / date display, etc). mmm...

how has everyone else done internationalisation??
-- Stu --
Pype.Clicker
Member
Posts: 5964
Joined: Wed Oct 18, 2006 2:31 am
Location: In a galaxy, far, far away

Re:internationalisation

Post by Pype.Clicker »

personally (but i might be breaking rulez like "optimization too early is the root of all evil"), i would rather go for a more compact representation.

First, i would assume that the output system has a state machine able to remember the last command output. Then i would use single-character escapes rather than 32-bit characters ...

for instance it could look like

Code:

Text txt=new TextWindow;
txt.setup("iso-8859-2");
txt << "Hello World";

If you setup "Chinese", you'll switch to the appropriate decoding.

at any time, you can use *named* characters rather than encoded, with things like HTML encoding: "Parlez-vous le fran\e&ccedil\eais ?"

As an option, you could first define a custom mapping with "\e#<code>&name\e" or "\e#short-code=international long code\e".
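A rough C sketch of how the named-character escapes could be decoded, covering only the "\e&name\e" form. Everything here is an assumption made for illustration: the function names, the two-entry name table, and the choice to treat plain bytes as code points equal to their byte value (ASCII/Latin-1).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ESC 0x1B   /* the "\e" character used in the post */

/* Tiny illustrative table of named characters (hypothetical contents). */
struct named_char { const char *name; uint32_t code; };
static const struct named_char NAMED[] = {
    { "ccedil", 0x00E7 },   /* c with cedilla */
    { "eacute", 0x00E9 },   /* e with acute accent */
};

/* Look up a name of the given length; returns 0 if it is unknown. */
static uint32_t lookup_named(const char *name, size_t len)
{
    for (size_t i = 0; i < sizeof NAMED / sizeof NAMED[0]; i++) {
        if (strlen(NAMED[i].name) == len &&
            memcmp(NAMED[i].name, name, len) == 0)
            return NAMED[i].code;
    }
    return 0;
}

/* Decode `in` into code points. Plain bytes pass through unchanged;
 * ESC '&' name ESC is replaced by the named code point.
 * Returns the number of code points written, or (size_t)-1 on a malformed
 * escape, an unknown name, or a full output buffer. */
size_t decode_escaped(const char *in, uint32_t *out, size_t cap)
{
    size_t n = 0;
    assert(in != NULL && out != NULL);
    while (*in != '\0') {
        uint32_t cp;
        if ((unsigned char)*in == ESC) {
            const char *name, *end;
            if (in[1] != '&')
                return (size_t)-1;
            name = in + 2;
            end = strchr(name, ESC);        /* closing escape */
            if (end == NULL)
                return (size_t)-1;
            cp = lookup_named(name, (size_t)(end - name));
            if (cp == 0)
                return (size_t)-1;
            in = end + 1;
        } else {
            cp = (unsigned char)*in++;
        }
        if (n == cap)
            return (size_t)-1;
        out[n++] = cp;
    }
    return n;
}
```

The custom-mapping escapes ("\e#...") would need the decoder to carry a mutable table as state, which this sketch leaves out.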

I haven't checked chinese/japanese languages yet, but i guess that despite the large amount of symbols, a single text block will use rather few characters, so setting up a "chunk" of symbols before encoding the block, especially if you can invoke pre-defined mappings ...

I dunno if it'd be efficient or useful ... that's just what came to my mind the first time i had to decode html files:
"why can't the HTML file program the decoder and tell it 'when you see character number 129, display 'é', aka &eacute;' ?"
Tim

Re:internationalisation

Post by Tim »

I chose UCS-2 (Unicode with two bytes per character) because it avoids the use of character sets and encodings, yet it still allows the use of most of the characters anyone could want to use.

However, despite the fact that the Mobius kernel and libraries only support Unicode, right now it's not actually output anywhere. The best you can hope for is to display the set of Unicode characters supported by IBM code page 437 (which happens to be the code page supported by the fonts I have). But I will put Freetype into the GUI soon.
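As a sketch of what displaying only the CP437 subset can look like, the function below maps a UCS-2 character to a code page 437 byte and falls back to '?' for anything else. It is not taken from Mobius: the name ucs2_to_cp437 and the fallback character are assumptions, and the table lists only a handful of the accented letters that CP437 contains.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* A few of the non-ASCII characters that code page 437 can display.
 * The table is deliberately partial; a real one covers all 256 bytes. */
struct cp437_entry { uint16_t ucs2; uint8_t cp437; };
static const struct cp437_entry CP437_EXTRA[] = {
    { 0x00C7, 0x80 },   /* C with cedilla */
    { 0x00FC, 0x81 },   /* u with diaeresis */
    { 0x00E9, 0x82 },   /* e with acute */
    { 0x00E4, 0x84 },   /* a with diaeresis */
    { 0x00E7, 0x87 },   /* c with cedilla */
    { 0x00C4, 0x8E },   /* A with diaeresis */
    { 0x00F6, 0x94 },   /* o with diaeresis */
    { 0x00D6, 0x99 },   /* O with diaeresis */
};

/* Map one UCS-2 character to a CP437 byte, or '?' if it is not in the
 * (partial) table. Printable ASCII maps to itself. */
uint8_t ucs2_to_cp437(uint16_t c)
{
    if (c >= 0x20 && c <= 0x7E)
        return (uint8_t)c;
    for (size_t i = 0; i < sizeof CP437_EXTRA / sizeof CP437_EXTRA[0]; i++) {
        if (CP437_EXTRA[i].ucs2 == c)
            return CP437_EXTRA[i].cp437;
    }
    return (uint8_t)'?';
}
```

A linear search is fine for a table this small; a full mapping would more likely use a sorted table or a direct lookup per Unicode block.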

Re:internationalisation

Post by df »

will i support anything besides english? doubtful.. i'm just thinking big hahah :) once I start thinking about something, it just builds and builds...

do i need each character to be 32bit? nope. do I even need utf8? no...

grand ideas hahah :)
-- Stu --

Re:internationalisation

Post by Pype.Clicker »

my very own impression is that internationalization has little to do with the kernel: Kernel messages can safely be in english as gurus are the only people who read them. What's the need for "erreur de segmentation" ? Mr AnyKey will not be able to understand it better than "segmentation fault" or "fatal system error #0D" ...

As for filenames, is it really necessary that the kernel *recognize* them ? can't they just be some meta data "long display name" ?

Isn't internationalization a typical user-library problem coupled with GUI management ?

Re:internationalisation

Post by df »

well, it's got to be 'in kernel' enough that strings are understood and not mangled by some routine :)

i originally started my internationalisation from my filesystem stuff: a need to support unicode-type filenames.

in kernel, you don't really pass around raw strings to the task manager etc? so yeah it's not really an in-kernel thing.

but my input + output mechanisms will need to deal with them.
-- Stu --
Tim

Re:internationalisation

Post by Tim »

A quick survey reveals that 15 of the Mobius syscalls use at least one string parameter, mostly names of files or other objects. Internally it shouldn't matter how the kernel represents strings, or indeed how user code represents strings, but I need to be able to pass strings for file names etc. without losing any information.

There are plenty of Unicode file systems around (except ext2, which doesn't seem to define any character set), and since the kernel supports files, it must support Unicode. Also the various system services -- not in the kernel, but not UI -- rely on Unicode for their strings.

Really it's just a matter of remembering to write wchar_t instead of char, and to use mbstowcs/wcstombs when translating to and from external data (e.g. ext2 :/ ).
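A minimal sketch of that boundary conversion using the standard C functions. The wrapper names name_to_wide and name_to_external are hypothetical, and which multibyte encoding mbstowcs/wcstombs expect depends on the current locale (set with setlocale), so treat this as an outline rather than how Mobius does it.

```c
#include <assert.h>
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical wrapper: convert an external multibyte name to wchar_t.
 * Returns the number of wide characters (excluding the terminator), or
 * (size_t)-1 if the input is invalid or the buffer is too small. */
size_t name_to_wide(const char *external, wchar_t *wide, size_t cap)
{
    size_t n;
    assert(external != NULL && wide != NULL);
    if (cap == 0)
        return (size_t)-1;
    n = mbstowcs(wide, external, cap);
    if (n == (size_t)-1)    /* invalid multibyte sequence */
        return (size_t)-1;
    if (n == cap)           /* filled the buffer: no room for L'\0' */
        return (size_t)-1;
    return n;
}

/* Hypothetical wrapper: convert a wide name back to the external form.
 * Returns the number of bytes (excluding the terminator), or (size_t)-1
 * if a character cannot be represented or the buffer is too small. */
size_t name_to_external(const wchar_t *wide, char *external, size_t cap)
{
    size_t n;
    assert(wide != NULL && external != NULL);
    if (cap == 0)
        return (size_t)-1;
    n = wcstombs(external, wide, cap);
    if (n == (size_t)-1)    /* character not representable in the locale */
        return (size_t)-1;
    if (n == cap)           /* filled the buffer: no room for '\0' */
        return (size_t)-1;
    return n;
}
```

Treating a completely filled buffer as an error avoids silently passing truncated, unterminated names on to the rest of the system.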
mystran

Re:internationalisation

Post by mystran »

UTF-8 is quite fine as long as most of the characters of a given piece of text are within the 7bit ascii.

I live in Finland here, and while the local standard is pretty much Latin1 (ISO-8859-1), utf-8 is not terribly worse, as only ä and ö (and Ä and Ö) are commonly needed outside the 7bit ascii (in case someone doesn't have a font that displays those, they are a and o with two dots on top of them).

It's pretty much the same thing with the rest of Europe, possibly excluding languages with a lot of accents and stuff.

I don't think it's a good idea to try to use UTF-8 internally though, since the variable width of characters makes some operations painful. I'd use some form of widechar and provide means of importing and exporting UTF-8.
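The import side might look something like the following C sketch, which decodes UTF-8 into fixed-width 32-bit code points. The function names are invented for this example; the checks for overlong forms, surrogates and values above U+10FFFF follow the UTF-8 definition.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decode one UTF-8 sequence starting at s, with len bytes available.
 * Stores the code point in *cp and returns the number of bytes consumed,
 * or 0 if the input is invalid or truncated. */
size_t utf8_decode_one(const unsigned char *s, size_t len, uint32_t *cp)
{
    size_t need;        /* continuation bytes expected */
    uint32_t c, min;    /* min = smallest value allowed for this length */
    assert(s != NULL && cp != NULL);
    if (len == 0)
        return 0;
    if (s[0] < 0x80) {
        *cp = s[0];
        return 1;
    } else if ((s[0] & 0xE0) == 0xC0) {
        need = 1; c = s[0] & 0x1F; min = 0x80;
    } else if ((s[0] & 0xF0) == 0xE0) {
        need = 2; c = s[0] & 0x0F; min = 0x800;
    } else if ((s[0] & 0xF8) == 0xF0) {
        need = 3; c = s[0] & 0x07; min = 0x10000;
    } else {
        return 0;       /* stray continuation byte or invalid lead byte */
    }
    if (len < need + 1)
        return 0;
    for (size_t i = 1; i <= need; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;
        c = (c << 6) | (s[i] & 0x3F);
    }
    /* Reject overlong forms, surrogates and out-of-range values. */
    if (c < min || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
        return 0;
    *cp = c;
    return need + 1;
}

/* Import a UTF-8 buffer into an array of 32-bit code points.
 * Returns the number of code points, or (size_t)-1 on invalid input or
 * a full output buffer. */
size_t utf8_import(const char *utf8, size_t len, uint32_t *out, size_t cap)
{
    size_t n = 0, i = 0;
    assert(utf8 != NULL && out != NULL);
    while (i < len) {
        uint32_t cp;
        size_t used = utf8_decode_one((const unsigned char *)utf8 + i,
                                      len - i, &cp);
        if (used == 0 || n == cap)
            return (size_t)-1;
        out[n++] = cp;
        i += used;
    }
    return n;
}
```

Once the text is in this form, indexing the n-th character is a plain array access, which is the operation that is painful on raw UTF-8.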

Of course UTF-8 takes quite a lot more space if you're writing in a language that doesn't use 7bit ascii at all, and requires 2 or 3 bytes all the time. Then again, the use of UTF-8 doesn't interfere considerably with compression, so the only drawback I see is the need to parse it when reading.
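To make the space cost concrete, here is a small C sketch of the export direction. The name utf8_encode is made up for this example; the byte layout is the standard UTF-8 one. The return value is the encoded size: 1 byte for ASCII, 2 for letters like ä, 3 for most CJK characters, and 4 for code points above U+FFFF.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode one code point as UTF-8 into out (room for 4 bytes).
 * Returns the number of bytes written, or 0 if cp is a surrogate or
 * lies above U+10FFFF. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    assert(out != NULL);
    if (cp < 0x80) {                        /* 7-bit ASCII: 1 byte */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                       /* 2 bytes */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                     /* 3 bytes */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                       /* surrogates are not characters */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                   /* 4 bytes */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```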

As for what to use internally, most systems that do internationalization use 16bit "wchars". AFAIK ucs-2 is the most commonly used format for those. If that doesn't solve your problem, you probably need to support several encodings.