Re: Implementing non-English language in OS
Posted: Sun May 15, 2016 9:06 am
There are things in A.I. that need to be specifically taught to the machine: concepts and information that a human has made sure are optimal and are what he or she would actually use. Maybe that's what publicly available A.I. is lacking (getting people to give it all the data clearly, in the format it expects, instead of having it try to learn and deduce everything on its own, being a machine). It would probably result in genuinely more intelligent machines, and in programs that no longer talk and sound as if they were reading spam. But it really needs people to massively describe every human aspect in a form usable by an A.I., and to make sure that each major knowledge component, like this special multi-language database, is as simple as possible and formatted in a standard way for all A.I.s.

max wrote:
This database structure is terrible. Putting all translations of one word in one row is exactly how you should not do it. A database must be properly normalized so you can work with it, index it and search through it effectively.

~ wrote:
Then put all synonyms, antonyms, paronyms, combinations, etc., in one and the same row, in all languages. It will help you greatly with searches and indexing (it will let you create automatic translations even for the GUI of your programs, search documentation and code more efficiently, etc.). It will make it possible to search in one language, or search for a word, and find what you searched for in all languages and in all related variants of the word, and for all of its synonyms, antonyms, etc.:
This is a simple table definition for that: [...]

Code:
CREATE TABLE multilanguage_words(
  wordlist              TEXT DEFAULT '',
  dictionary_definition TEXT DEFAULT '',
  rowid                 INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL
);

-- Must be issued while the database file is still empty to take effect:
PRAGMA encoding = "UTF-8";

-- sqlite3 shell commands (not SQL): load the tab-separated word file.
.mode tabs
.import multilanguage_words.txt multilanguage_words
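For illustration, here is what a couple of lines of multilanguage_words.txt might look like. This assumes one plausible layout (the vocabulary entries are invented): tab-separated columns matching the table, language headers marked with "}", and direct synonyms grouped with "+" as described further below.

Code:
EN}juice+drink ES}jugo+zumo+bebida FR}jus<TAB>A liquid extracted from fruits or vegetables.
EN}word ES}palabra+vocablo FR}mot<TAB>A distinct unit of language.

Here <TAB> stands for a literal tab character separating the two columns. The lines only carry two of the three columns; recent versions of the sqlite3 shell fill the missing trailing rowid with NULL, which SQLite then replaces with the next auto-incremented ID.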
Why hasn't even Google released such a vital language database?
You need more than a massive database to do natural language processing. Why should Google release a giant file that is basically only a dictionary? Google has its AI, which properly translates to and from a lot of languages and is always learning new things. Natural languages are very complex, and the algorithms to process them are as well.
Such a database can immediately help you detect the language of a document without making mistakes. In any case, the brain also contains all words in one form or another, so the computer should have them too in order to get nearer to our intellectual/verbal capability. You'll see what happens when I manage to isolate another fundamental component for filtering natural language and use this database and other simple components together.
Don't think that the header format used for each word isn't well thought out. That format won't let you confuse the header with the word, and querying the database with a LIKE '}' || word or LIKE '}' || word || '+' pattern means you effectively find only actual words instead of metadata. Then you can count the occurrences of words, count how many times each LANGUAGE ID appears among the words present, and even detect the percentage of each language in the text.
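As a sketch of that counting idea, assume a hypothetical doc_words(word) table holding the tokens of the document being analyzed, and two-letter language IDs like EN and ES inside the headers (both are assumptions, not something fixed by the table above):

Code:
-- Count how many of the document's words match under each language header.
-- Dividing each count by the total number of words approximates the
-- percentage of each language present in the document.
SELECT 'EN' AS language, COUNT(*) AS hits
  FROM doc_words d
  JOIN multilanguage_words m ON m.wordlist LIKE '%EN}' || d.word || '%'
UNION ALL
SELECT 'ES' AS language, COUNT(*) AS hits
  FROM doc_words d
  JOIN multilanguage_words m ON m.wordlist LIKE '%ES}' || d.word || '%';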
I still work based on the simplest concepts, and it's obvious that such a database could help you find many more things with a single search. For example, you search for "juice", and documents that don't contain that word but do contain "drink" or "drinks" will still be found, because those are synonyms.
ALSO GROUP DIRECT SYNONYMS WITH +
It's a very good language filter: if you have all the words for a same concept in all languages (plus related variants and antonyms/synonyms), and you know which language every word belongs to, you stop looking at the natural language processing problem upside down. You filter it and end up with a sort of ID for a single unique concept in return for a word, and if you do that for all languages, it is much easier to start from there and look at a document from many different perspectives, both programmed and deduced with A.I.
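A minimal sketch of that word-to-concept step, using the two LIKE patterns mentioned above (:word is a bound parameter, and how the end of a word is delimited depends on the exact field layout, which is an assumption here):

Code:
-- The rowid of the matching row acts as the language-independent concept ID.
SELECT rowid
  FROM multilanguage_words
 WHERE wordlist LIKE '%}' || :word || '+%'  -- word followed by a direct synonym
    OR wordlist LIKE '%}' || :word;         -- word at the very end of the list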
Rather than hearing complaints about the structure of a database, I'd like to hear a better and more complete implementation than the one I wrote above for such a language simplification engine, one that lets you center on concepts no matter the words or language used in a program, a text search or an indexation. If there is no such explanation but only the complaint, it might just be a sign that the other person wants that resource done immediately and tested in real programs, maybe with a more personal style, which is probably still undefined and which their brain is trying to formulate, at least in the deepest background.
It's a very good structure. If you look, it even has an automatic, auto-incrementing ID field at the very end of the table's records. That means you can add values to the table at any time, in any order, conveniently, without having to specify an ID that is dummy anyway but that helps with internal indexing at runtime. That field no longer gets in the way of the actual data: since it's at the end of the records, it doesn't need to fill a void in a CSV file.
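For example, a new row can be added without ever mentioning the ID, and SQLite assigns it automatically (the entry shown is hypothetical):

Code:
-- The column list names only the real data; rowid is filled in by
-- AUTOINCREMENT, so the trailing ID never has to appear in the data itself.
INSERT INTO multilanguage_words(wordlist, dictionary_definition)
VALUES ('EN}juice+drink ES}jugo+zumo',
        'A liquid extracted from fruits or vegetables.');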
The effect of putting all words in all languages in a single row is that you look at language from the perspective of its language-independent concepts.
But you have to make sure that you have absolutely all existing words on record: one word with its synonyms, antonyms, male/female counterparts, etc. For a human, it would also help with learning any other language. But this database structure is ideal for being able to find practically all words for a single-word concept.
It might be bad for, say, a default standardized generic case among thousands of database tables, but it serves well a particular database that doesn't seem to exist, and whose absence makes computers much dumber in terms of human language. Remember that you have to effectively communicate even the most obvious human tidbit of reasoning and knowledge to the machine with software. I still can't believe that there isn't a properly formatted, publicly available database that contains all human language, classified by the same word with its variants and antonyms, etc., in a single row, for being able to access concepts.
But it would be an incredibly valuable product, because it would be of human grade, and it would be about something that will never disappear or become nonstandard, given that it is purely human.
As you can see, when you record absolutely everything that exists about some piece of knowledge or human capability, like human language, in a format that is usable by a machine and that is also human-readable and easy to understand and update/maintain, you actually gain the capability to implement that human-grade skill or sense; you get to synthesize it and make it useful in practice in a machine.
I guess that it must be a joke. Human language doesn't have a rigid structure, so a database structure must be chosen that will always detect concepts, in any order and at any level of informality. "Shouldn't be done this way" according to what objective? If you want to translate human language into words, you certainly need all existing words to detect a concept. In this way, no matter what language or synonym is used, you will always find what is being referenced and will always understand what is being said, at least at the word level.
Then you can start classifying what was said or written by giving more importance to the most unique, distinctive words, like names. The things mentioned first are also the ones that define the topic of a document the most, and so are the ones at the end, though perhaps to a lesser degree or from a different perspective.
Then you can classify the rest of the words by their number of occurrences. Then use verbs to try to detect tasks, that is, what a document describes to be done, whether its content is practical or theoretical.
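As a sketch of that occurrence-based classification, again using the hypothetical doc_words(word) table from the earlier example:

Code:
-- Rank the concepts present in a document by how often they occur.
-- Every synonym and translation of a concept counts toward the same rowid.
SELECT m.rowid AS concept_id, COUNT(*) AS occurrences
  FROM doc_words d
  JOIN multilanguage_words m ON m.wordlist LIKE '%}' || d.word || '%'
 GROUP BY m.rowid
 ORDER BY occurrences DESC;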
This is the most concise way to turn a word into a concept in programming. It would really produce dramatically more results in searches, for example.
With this, you now only need a query of the form:

Code:
SELECT wordlist
  FROM multilanguage_words
 WHERE wordlist LIKE '%}' || :your_word || '%';
Now, with a single searched or translated word, you only need to run that search and parse the headers and the words you got back.