Implementing non-English language in OS

Discussions on more advanced topics such as monolithic vs micro-kernels, transactional memory models, and paging vs segmentation should go here. Use this forum to expand and improve the wiki!
User avatar
osdever
Member
Member
Posts: 492
Joined: Fri Apr 03, 2015 9:41 am
Contact:

Implementing non-English language in OS

Post by osdever »

My friend tells me that my OS is more interested by Russians, so I need to implement the Russian language in it. I don't agree - who need Russian in alpha-stage OS? So, the question is - who I need to listen - me or my friend?
Developing U365.
Source:
only testing: http://gitlab.com/bps-projs/U365/tree/testing

OSDev newbies can copy any code from my repositories, just leave a notice that this code was written by U365 development team, not by you.
User avatar
iansjack
Member
Member
Posts: 4685
Joined: Sat Mar 31, 2012 3:07 am
Location: Chichester, UK

Re: Implementing non-English language in OS

Post by iansjack »

Listen to yourself.
User avatar
osdever
Member
Member
Posts: 492
Joined: Fri Apr 03, 2015 9:41 am
Contact:

Re: Implementing non-English language in OS

Post by osdever »

OK, thanks.
Developing U365.
Source:
only testing: http://gitlab.com/bps-projs/U365/tree/testing

OSDev newbies can copy any code from my repositories, just leave a notice that this code was written by U365 development team, not by you.
User avatar
Combuster
Member
Member
Posts: 9301
Joined: Wed Oct 18, 2006 3:45 am
Libera.chat IRC: [com]buster
Location: On the balcony, where I can actually keep 1½m distance
Contact:

Re: Implementing non-English language in OS

Post by Combuster »

It can be a pain to introduce internationalisation later in development if precautions aren't taken early, and the alpha stage might be a good moment to try it out. But in the end it's just as iansjack says: Your project, your rules.

Psst, you can also look at it differently: it's worth several invisible tokens to beat the master on a niche subject and be really bilingual - who here can say they have better user support than for instance Brendan's OS? ; ) - of course, only if you're interested
"Certainly avoid yourself. He is a newbie and might not realize it. You'll hate his code deeply a few years down the road." - Sortie
[ My OS ] [ VDisk/SFS ]
User avatar
osdever
Member
Member
Posts: 492
Joined: Fri Apr 03, 2015 9:41 am
Contact:

Re: Implementing non-English language in OS

Post by osdever »

I have a UTF-8 and Russian fonts now, but the second isn't used, and the first is very glitchy.
Developing U365.
Source:
only testing: http://gitlab.com/bps-projs/U365/tree/testing

OSDev newbies can copy any code from my repositories, just leave a notice that this code was written by U365 development team, not by you.
dseller
Member
Member
Posts: 84
Joined: Thu Jul 03, 2014 5:18 am
Location: The Netherlands
Contact:

Re: Implementing non-English language in OS

Post by dseller »

I think if you intend to support foreign character sets and localization, you should at least keep it in mind while designing your code. Maybe make a framework to support this stuff and then just only write an English implementation for now? This way you can always implement Russian when you feel like it.
Kevin
Member
Member
Posts: 1071
Joined: Sun Feb 01, 2009 6:11 am
Location: Germany
Contact:

Re: Implementing non-English language in OS

Post by Kevin »

dseller wrote:I think if you intend to support foreign character sets and localization, you should at least keep it in mind while designing your code. Maybe make a framework to support this stuff and then just only write an English implementation for now? This way you can always implement Russian when you feel like it.
If you're not really careful, the problem with that is that you'll only notice the problems in your framework when it's too late. Implementing only one language means that you're prone to write your code as if other languages were just English with different words. But they aren't.
Developer of tyndur - community OS of Lowlevel (German)
User avatar
osdever
Member
Member
Posts: 492
Joined: Fri Apr 03, 2015 9:41 am
Contact:

Re: Implementing non-English language in OS

Post by osdever »

dseller wrote:I think if you intend to support foreign character sets and localization, you should at least keep it in mind while designing your code. Maybe make a framework to support this stuff and then just only write an English implementation for now? This way you can always implement Russian when you feel like it.
It's done a long time ago.
Developing U365.
Source:
only testing: http://gitlab.com/bps-projs/U365/tree/testing

OSDev newbies can copy any code from my repositories, just leave a notice that this code was written by U365 development team, not by you.
User avatar
Candy
Member
Member
Posts: 3882
Joined: Tue Oct 17, 2006 11:33 pm
Location: Eindhoven

Re: Implementing non-English language in OS

Post by Candy »

I picked the name for my OS project to be specifically impossible to write until UTF8 support was properly added. It's Rødvin. That said, I've also implemented UTF8 support & drawing them, so that everything will be supporting any non-English UTF8 language.
User avatar
Rusky
Member
Member
Posts: 792
Joined: Wed Jan 06, 2010 7:07 pm

Re: Implementing non-English language in OS

Post by Rusky »

There is more to non-English languages than just rendering UTF-8 glyphs- layout and shaping can get pretty complex.
User avatar
Solar
Member
Member
Posts: 7615
Joined: Thu Nov 16, 2006 12:01 pm
Location: Germany
Contact:

Re: Implementing non-English language in OS

Post by Solar »

Right-to-left and bidirectional writing. Characters that need to be larger than latin glyps. Different digit separators, date string formats, currency formats. Glyphs being digits but not being in the 0-9 range. Strings taking much more space than in English. Combining characters. The list goes on.
Every good solution is obvious once you've found it.
onlyonemac
Member
Member
Posts: 1146
Joined: Sat Mar 01, 2014 2:59 pm

Re: Implementing non-English language in OS

Post by onlyonemac »

Rusky wrote:There is more to non-English languages than just rendering UTF-8 glyphs- layout and shaping can get pretty complex.
And of course with multi-language support your OS also needs to have a proper localisation framework.
When you start writing an OS you do the minimum possible to get the x86 processor in a usable state, then you try to get as far away from it as possible.

Syntax checkup:
Wrong: OS's, IRQ's, zero'ing
Right: OSes, IRQs, zeroing
User avatar
bellezzasolo
Member
Member
Posts: 110
Joined: Sun Feb 20, 2011 2:01 pm

Re: Implementing non-English language in OS

Post by bellezzasolo »

From my experience, I used UTF-16 as the internal character format, as that eased many of the issues. I used a simple Unicode bitmap font, freely available (I converted it to C format with a utility). Then, you want to load language packs. My approach was to place one in the initrd, and the kernel used that. The format doesn't have to be complex, you could just have a list of strings, each separated by a newline to begin with (you will probably want to move onto things like XML later though).

If you're mainly aiming at supporting Russian, RTL isn't so important. But, since I was attempting Hebrew, this was the nightmare bit. Enjoy!
Needless to say, this is all needing a graphics mode.

Of course, there is a difference between a language pack and a good language pack. Generally, you will want the strings to be complete sentences so you don't get a complete grammatical screw up.

Hope this helps.
Whoever said you can't do OS development on Windows?
https://github.com/ChaiSoft/ChaiOS
User avatar
~
Member
Member
Posts: 1226
Joined: Tue Mar 06, 2007 11:17 am
Libera.chat IRC: ArcheFire

Re: Implementing non-English language in OS

Post by ~ »

You can encode all languages efficiently in UTF-8, and even SQLite3 supports it fully (that's why there are so many end-user programs that make internal use of SQLite3).

You could port SQLite3 to your OS to add indexing, query and even "registry" capabilities for installed programs, configuration values, etc.

I prefer to use UTF-8. It's widely supported in the Web and that's why I should learn to encode it with my own code. UTF-16 as well.


I would recommend you to use a database engine like SQLite3 and make a database that contains all words in all languages. Then put all synonyms, antonyms, paronyms, combinations, etc., in one same row in all languages. It will help you greatly in searches and indexing (will help you create automatic translations even for the GUI of your programs, to search more efficiently the documentation and code, etc.). It will make possible to search in one language, or search a word, and find what you searched in all languages and in all related variants of the word, and for all of its synonyms, antonyms, etc:

This is a simple table definition for that:

Code: Select all

CREATE TABLE multilanguage_words(wordlist TEXT DEFAULT "", dictionary_definition TEXT DEFAULT "", rowid INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL);
pragma encoding="UTF-8";

.mode tabs
.import multilanguage_words.txt multilanguage_words



This is sample text of how to index all words in all languages in one same row (text line). Note how each word has a header between || which contains the language ID and the classification of the word (synonym, antonym, name, etc...). The numbers like 50 are the percentage of positive or negative emotions for each word by default that I felt when I wrote the text, but could be updated with A.I., and the last word like "skill" is an attempt to classify the words from the most basic existent human concept, to the most complex and emotive/subjective one, but is optional... all parameters after the language ID are meant to be optional and parsed for their presence:

Code: Select all

{:synonym:es:50:skill}programador+{:synonym:en}programmer+{:synonym:fr}programmeur+{:synonym:it}programmatore+{:synonym:eo}programisto	{:synonym:es}Persona que diseña los procedimientos a seguir por un dispositivo automatizado.
{:synonym:es:80:quality}listo+{:synonym:en:70}smart+{:homonym:es}listo,{:synonym:es}sagaz,{:synonym:es}astuto,{:synonym:es}ducho,{:synonym:es}despabilado,{:synonym:es}avivado,{:synonym:es}avezado+{:typo:es}avesado+{:synonym:es}avispado,{:synonym:es}perspicaz,{:synonym:es:75}vivo	{:synonym:es}Persona con gran agudeza y agilidad mental y práctica.
{:synonym:es:60:status}listo+{:synonym:es}ready,{:synonym:es}preparado+{:synonym:es}prepared,{:synonym:es}dispuesto+{:synonym:en}willing,{:synonym:es}complaciente	{:synonym:es}Estado de espera y disposición para llevar a cabo una tarea.
{:synonym:es:50:concept}palabra,{:synonym:es}fonema,{:synonym:es}vocablo,{:synonym:es}término,{:synonym:es}verbo,{:synonym:es}dicción,{:synonym:es}expresión,{:synonym:es}lengua,{:synonym:es}lenguaje,{:synonym:es}habla,{:synonym:es}promesa,{:synonym:es}pacto,{:synonym:es}oferta,{:synonym:es}juramento,{:synonym:es}ofrecimiento,{:synonym:es}compromiso	{:synonym:es}Elemento de todo lenguaje que comunica ideas, intenciones y acciones.
{:synonym:es:-60:action}desaparecer+{:synonym:es:-35:action}desaparecerse,{:synonym:es}esfumar+{:synonym:es}esfumarse,{:synonym:es}retirar+{:synonym:es}retirarse	{:synonym:es}Alejar algo de nuestra percepción de modo que no se pueda encontrar.
{:name:es:0:male}Rodolfo+{name:en:0:male}Rudolph
{:synonym:*:medication}Panadol+{:synonym:*:medication}Paracetamol
{:synonym:en:65}conversely+{:synonym:es:65}al contrario de+{:synonym:es:65}a diferencia de
{:synonym:es}amalgama+{:synonym:en}amalgam+{:synonym:es}amalgamation
{:synonym:es}natural+{:synonym:en}natural,{:synonym:es}sincero+{:synonym:en}sincere,{:synonym:es}espontáneo+{:synonym:en}spontaneous,{:synonym:es}genuino+{:synonym:en}genuine
{:synonym:es}calibración+{:synonym:en}calibration+{:synonym:es}calibrar+{:synonym:en}calibrate,{:synonym:es}equilibrio+{:synonym:es}equilibrar+{:synonym:en}equilibrium+{:synonym:en}equilibrate,{:synonym:es}balance+{:synonym:es}balancear+{:synonym:en}balance
{:surname:en}Sonnenreich+{:typo:en}Sonnereich
{:synonym:es}correspondiente+{:synonym:es}coincidente+{:synonym:es}concordante+{:synonym:pt}concorda+{:synonym:pt}concorde+{:synonym:es}+que concuerde+{:synonym:es}acierto+{:synonym:es}coincidencias+{:synonym:es}concordancias+{:synonym:es}acierto+{:synonym:es}aciertos+{:synonym:pt}de acordo+{:synonym:pt}concerta+{:synonym:pt}concertar+{:synonym:pt}concerte
{:synonym:es}endurar+{:synonym:es:0:verb}endurecer+{:synonym:es:0:verb}endurezco+{:synonym:es:0:verb}endureces+{:synonym:es:0:verb}endurece+{:synonym:es:0:verb}endurecemos+{:synonym:es:0:verb}endurecéis+{:typo:es:0:verb}endureceis+{:synonym:es:0:verb}endurecen+{:synonym:en:0:verb}harden+{:synonym:en}hard+{:synonym:en:0:verb}make hard+{:synonym:en:0:verb}to make hard+{:synonym:en:0:verb}make it hard+{:synonym:en:0:verb}making it hard{:synonym:es}endura+{:synonym:es}endurece+{:synonym:es}durar
{:name:*:0:font-face}Calibri
{:synonym:en}keep+{:synonym:en}keeping+{:synonym:en}kept+{:synonym:es:0:verb}mantener+{synonym:es}mantén+{:typo:es}manten+{:synonym:es}mantengo+{:synonym:es}mantienes+{:synonym:es}mantiene+{:synonym:es:0:verb}mantienen+{:synonym:es}mantenemos+{:synonym:es}mantenéis+{:typo:es:verb}manteneis+{:synonym:es}mantén
{:word:es}tu+{:word-plural:es}tus+{:word:en}your
{:name:en:50:organism}eye+{:name-plural:en:50:organism}eyes+{:name:es:50:organism}ojo+{:name-plural:es:50:organism}ojos
{:name:es:100:math}Álgebra+{:name:en:100:math}Algebra
{:name:es:100:math}Aritmética+{:name:en:100:math}Arithmetic
{:name:es:100:math}Cálculo+{:name:en:100:math}Calculus
{:name:en:100:artificial-intelligence}Situation Calculus+{:name:es:100:artificial-intelligence}Cálculo Situacional
{:name:en}dynamical domain+{:name:en}dominio dinámico+{:name:en}dynamical domains+{:name:en}dominios dinámicos
{:name:en}vedic math+{:name:es}matemática védica
{:name:en}Pizza Hut
{:name:en}Toto's Pizza
{:word:es}tan+{:word:en}as+{:word:es}tanto
{:word:es}también+{:typo:es}tambien+{:chat:es}tmb+{:word:en}as well+{:word:en}as well as
{:synonym:en:0:verb}close+{:antonym:en:0:verb}open+{:synonym:en}closed+{:synonym:es}cerrado+{:antonym:es}abierto+{:synonym-plural:es}cerrados+{:antonym-plural:es}abiertos
{:word:en}and+{:word:es}y+{:word:pt}e+{:word:fr}et
{:word:en}of+{:word:es}de
{:word:en}you+{:word:es}tú
{:word:en}state+{:word:es}estado+{:word-plural:en}states+{:word-plural:es}estados
{:name:en}day+{:name:es}día+{:name-plural:en}days+{:name-plural:es}días
{:synonym:en}ensure+{:synonym:en}make sure+{:synonym:en}making sure+{:synonym:es}asegurándose+{:synonym:es}asegurar+{:synonym:es}asegurarse
{:synonym:en}high+{:synonym-male:es}alto+{:synonym-female:es}alta+{:synonym-plural-male:es}altos+{:synonym-plural-female:es}altas
{:synonym:en}level+{:synonym-plural-male:en}levels+{:synonym-male:es}nivel+{:synonym-plural:es}niveles+{:synonym:es}nivelación+{:synonym-plural:es}nivelaciones
{:synonym-male:en}channel+{:synonym-plural-male:en}channels+{:synonym-male:es}canal+{:synonym-plural-male:es}canales
{:word:en}the+{:word-male:es}el+{:word-female:es}la+{:word-female:es}las+{:word:es}lo+{:word-plural:es}los
{:word:en}to+{:word:es}para+{:word:es}a
{:synonym:en:0:verb}deepen+{:synonym:es:0:verb}profundizar
{:synonym:en}least+{:synonym:es}menos+{:synonym:es}menor


Remember that the effect of relating the same word and its synonyms/antonyms/etc., in one same row/record/register in all existing languages (including typos, abbreviations and phrases) makes you find and search more in terms of the core concepts of the words, more than search, find and process for a specific word itself.

It's a very good basic A.I. filter for understanding and processing natural language but it needs a massive database containing ALL existing words in human kind related (one same word in all languages===one database record).

Why hasn't even Google released such a vital language database?
YouTube:
http://youtube.com/@AltComp126

My x86 emulator/kernel project and software tools/documentation:
http://master.dl.sourceforge.net/projec ... 7z?viasf=1
User avatar
max
Member
Member
Posts: 616
Joined: Mon Mar 05, 2012 11:23 am
Libera.chat IRC: maxdev
Location: Germany
Contact:

Re: Implementing non-English language in OS

Post by max »

~ wrote:Then put all synonyms, antonyms, paronyms, combinations, etc., in one same row in all languages. It will help you greatly in searches and indexing (will help you create automatic translations even for the GUI of your programs, to search more efficiently the documentation and code, etc.). It will make possible to search in one language, or search a word, and find what you searched in all languages and in all related variants of the word, and for all of its synonyms, antonyms, etc:

This is a simple table definition for that:

Code: Select all

CREATE TABLE multilanguage_words(wordlist TEXT DEFAULT "", dictionary_definition TEXT DEFAULT "", rowid INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL);
pragma encoding="UTF-8";

.mode tabs
.import multilanguage_words.txt multilanguage_words

[...]

Why hasn't even Google released such a vital language database?
This database structure is terrible. Putting all translations to one word in one row is exactly how you should not do it. A database must be properly normalized so you can effectively work with it, index it and search through it.

You need more than a massive database to do natural language processing. Why should Google release a giant file that is basically only a dictionary? Google has it's AI that properly translates from/to a lot of languages and always learns new stuff. Natural languages are very complex, and the algorithms to process them are as well.
Post Reply