Na'vi frequency dictionary

Started by Herwìna, March 28, 2013, 05:26:32 AM


Taronyu Leleioae

No matter what you do, unless you really are an expert programmer, you won't be able to come up with a near-perfect solution.  There are, by quick count, approximately 25 infixes on their own, PLUS combinations with moods.  (I'll let you figure out how many permutations you end up with...)  You would literally have to create a data table with every known infix combination for every verb for it to search and reference.  And then you'd have to manually add to that list the verbs with special contractions.

I agree wholeheartedly that you won't get a perfect frequency analysis, but does it really matter if you can have, say, 95% accuracy (as an estimate)?  The reality is that the original request was for the top 400.  If you go to 500 and look at a data pool of 600, you should be covered if you take the raw data, trust in the Force as Obi-Wan suggests, and don't use the computer to do your sorting...  Realistically, wouldn't the most common forms be either the bare root or, more likely, <ol>, <am>, <ay>, <iv>?  The other problem is that the one verb you chose, tswayon, was never revisited.  So you have an <ay> sequence inside the verb itself.

Thus I think having lists for the sake of lists isn't productive.  I think it's better to go with a rough count and let human eyes do the best they can.  So if we want 500, we look at the entire results stack and quickly try to pull out, visually, the verbs with more than, say... 10 or 20 occurrences and look at their frequency.  But I have a better, if manual, solution to this...  It will take human eyes and more work, but I can't see why we couldn't do it...  Again, it's just to get a solid understanding...

What I wish we could do is have the words already cross-referenced with the proper dictionary meanings, parts of speech, etc., so we could quickly export that into whatever format is needed.  But here's what we can do.  We take that data and dump it into Excel as a text import using UTF-8.  Then... we simply add another column to the spreadsheet which lists the root word.  If there are 1,850 or so words listed, then we probably need to go through 3,000 words quickly.  It means having to recognize each word; some we'll just need to look up.  Then you can take your root word list plus the frequency count and merge the data into one.  Then you'll have the root word data.

But I do think it's valuable to know what the frequency of some of the most common verbs actually is.  I.e. lu, lolu, lamu, layu, livu...  It wouldn't surprise me at all to see common verbs such as lu, yom, hahaw near the top of the list...

The interesting question with a data file will be trying to merge it (either in Excel, Access, or custom-written code) with the dictionary database file.  With a merge and then a re-sort, we (meaning LN as a community) could have an effective data file to help learn and teach with, because you could view it with all the terms, definitions, parts of speech, etc...
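
A minimal sketch of that merge step, assuming the counts have been exported to a UTF-8 CSV with a hand-filled "root" column and that a dictionary export with "root", "definition" and "pos" columns exists (all file and column names here are made up for illustration):

import pandas as pd

# Inflected-word counts with a manually added "root" column.
counts = pd.read_csv("word_counts.csv", encoding="utf-8")

# Collapse all inflected forms onto their assigned root word.
root_counts = counts.groupby("root", as_index=False)["count"].sum()

# Attach definition and part of speech from the dictionary export.
dictionary = pd.read_csv("dictionary.csv", encoding="utf-8")
merged = root_counts.merge(dictionary, on="root", how="left")

# Sort by frequency and write the combined list back out.
merged.sort_values("count", ascending=False).to_csv(
    "root_frequency.csv", index=False, encoding="utf-8")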

Tirea Aean

Are we just talking most used VERBS or most used WORDS?

Taronyu Leleioae

#22
The goal is words.  But the discussion is how to handle frequency with all the infixes, prefixes and suffixes.  We'll obviously need to look at quite a few fì-[nouns] as well as possible plurals.  It's not going to be perfect.

The bigger problem I actually see is with case endings.  We'll need to manually review and strip off words with -l, -ìl, -t, -ti, -it, -ri, -ìri, -r, -ru, -ur, -ä, -yä endings, while realizing there are plenty of words that end in such letters normally...  pamrel, rumut, ìlä, nari/menari...

I'm guessing we'll end up with a list of about 3,000 words or more, and we'll have to manually verify and pull all the roots out page by page.  That will need a group effort if we want to get this done in a reasonable time frame...
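
A rough sketch of how that case-ending pass could be semi-automated while avoiding the false positives mentioned above: only strip an ending when the stripped form is itself in the dictionary.  The word list "navi_roots.txt" (one root per line) is an assumption, not an existing file.

# Longest endings first, so -ìri is tried before -ri, -ti before -t, etc.
CASE_ENDINGS = ["ìri", "ìl", "ti", "it", "ri", "ru", "ur", "yä",
                "l", "t", "r", "ä"]

with open("navi_roots.txt", encoding="utf-8") as f:
    ROOTS = {line.strip().lower() for line in f if line.strip()}

def strip_case(word):
    w = word.lower()
    if w in ROOTS:                # pamrel, rumut, ìlä... stay as they are
        return w
    for ending in CASE_ENDINGS:
        if w.endswith(ending) and w[:-len(ending)] in ROOTS:
            return w[:-len(ending)]
    return w                      # unknown form: leave it for manual review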

Ftiafpi

Quote from: Taronyu Leleioae on May 06, 2013, 07:54:27 PM
The goal is words.  But the discussion is how to handle frequency with all the infixes, prefixes and suffixes.  We'll obviously need to look at quite a few fì-[nouns] as well as possible plurals.  It's not going to be perfect.

The bigger problem I actually see is with case endings.  We'll need to manually review and strip off words with -l, -ìl, -t, -ti, -it, -ri, -ìri, -r, -ru, -ur, -ä, -yä endings, while realizing there are plenty of words that end in such letters normally...  pamrel, rumut, ìlä, nari/menari...

I'm guessing we'll end up with a list of about 3,000 words or more, and we'll have to manually verify and pull all the roots out page by page.  That will need a group effort if we want to get this done in a reasonable time frame...

Well, we can probably assume that oel and ngati are numbers 1 and 2 among the most commonly used words with suffixes. We could automatically lump those into oe and nga respectively. Other than that, I see nothing wrong with manual parsing of the words.
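
A minimal sketch of that automatic lumping, applied before the manual pass; only the inflected pronoun forms already mentioned in this thread are listed, and the real table would be extended by hand:

PRONOUN_MAP = {
    "oel": "oe", "oet": "oe", "oeti": "oe",
    "ngati": "nga",
}

def normalize(word):
    # Fold the highest-frequency inflected pronouns onto their roots;
    # everything else is passed through unchanged for manual parsing.
    return PRONOUN_MAP.get(word.lower(), word)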

Tìtstewan

Quote from: Blue Elf on May 06, 2013, 02:35:02 PM
- ask Markì to get data out of forum database into plain text file(s).
So, I sent Markì my question about it today. Now I just have to wait for an answer. :)

Quote from: Tirea Aean on May 06, 2013, 07:43:14 PM
Are we just talking most used VERBS or most used WORDS?
Don't worry, the thing with tswayon was just an example to explain what I mean. :-[


-| Na'vi Vocab + Audio | Na'viteri as one HTML file | FAQ | Useful Links for Beginners |-
-| Kem si fu kem rä'ä si, ke lu tìfmi. |-

Blue Elf

Quote from: Tìtstewan on May 06, 2013, 05:20:35 PM
Your idea is quite good, but I see some problems:
Ma Eywa, how can I explain it... Most people (including me) would want a list of the most used root words. (Why root words? Because root words are learned first. Once a beginner has learned them, he can work with them: he can use a root word with prefixes, infixes and suffixes.) If I had a word file from Markì and analyzed every word with a program, what would the result be? I would have the 500 most used words, sure, but those words would be "wrong", and the statistics incorrect and effectively "faked". Why? Because the program would count words like kameie or faysawtute etc., but these are not root words! In this little example the root words would be kame and tawtute. So what's needed is a program that counts root words, and I believe such a program doesn't exist. I attached an example of what I mean by this problem.

I really would like to have such a program, which could give me the non-root words, because that would be really simple to compile. The results of the counting would then have to be corrected (removing the prefixes, infixes and suffixes, lenition letters, wrong words and/or non-Na'vi words). So, I will write Markì a PM to ask him for the word data of the three sub-boards.

Well, I think you didn't read everything :)
Counting words "as they are" is the first step - it preprocesses the data for further analysis. We must start with the inflected words, because from an inflected form we can find the root word.
Based on the counts of inflected words, the manual search for roots can go faster, I agree. The problem with infixes is that some verbs in root form look like they contain an infix even though they don't, like zerok or zamunge. The results of the infix analysis have to be compared with the dictionary. I suggested an algorithm for this somewhere, but I don't remember where.... Thinking....

So the resulting table would look like this:

inflected_form  count  root           count
tswayon         10     tswayon        10
tswayayon       5      tswayon        5
sum=15
...
oe              500    oe             500
oel             250    oe             250
oet             300    oe             300
oeti            330    oe             330
sum=1380
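
A rough sketch of how a table like the one above could be produced: count the inflected forms first, then try removing the common infixes and accept the result only if it is a dictionary verb, so that entries like zerok or zamunge are never "de-infixed". The infix list is only the handful named or implied in this thread, and "navi_roots.txt" is the same assumed word-list file as in the earlier sketch.

from collections import Counter
import re

INFIXES = ["ol", "am", "ay", "iv", "er"]      # illustrative subset only

with open("navi_roots.txt", encoding="utf-8") as f:
    ROOTS = {line.strip().lower() for line in f if line.strip()}

def find_root(word):
    if word in ROOTS:
        return word                           # zerok, zamunge stop here
    for infix in INFIXES:
        for m in re.finditer(infix, word):
            candidate = word[:m.start()] + word[m.end():]
            if candidate in ROOTS:
                return candidate              # tswayayon -> tswayon
    return word                               # unresolved: manual review

counts = Counter(["tswayon"] * 10 + ["tswayayon"] * 5)   # stand-in data
root_counts = Counter()
for form, n in counts.items():
    root_counts[find_root(form)] += n         # the "root / count" columns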

Oe lu skxawng skxakep. Slä oe nerume mi.
"Oe tasyätxaw ulte koren za'u oehu" (Limonádový Joe)


Ezy Ryder

Maybe make vectors or something, one per word, each containing every conjugation, declension or whatever of that word; then just count how many times each element in the table occurs in reliable texts, somehow associate the sum of the occurrences with the vector, and after doing that for every vector, sort the results by the sums in decreasing order.
I haven't done anything programming-related in a while, so sorry if the terminology or the whole idea is just wrong.
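
That idea in sketch form: one list ("vector") of known forms per root, counted over the text and ranked by the summed totals. The form lists below are placeholders built only from examples used in this thread.

from collections import Counter

FORMS = {
    "oe": ["oe", "oel", "oet", "oeti"],
    "tswayon": ["tswayon", "tswayayon"],
}

def rank_roots(text):
    tokens = Counter(text.lower().split())
    # Sum the counts of every known form of each root, then sort
    # the roots by that total in decreasing order.
    totals = {root: sum(tokens[form] for form in forms)
              for root, forms in FORMS.items()}
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)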

Tìtstewan

Quote from: Blue Elf on May 07, 2013, 01:33:49 AM
Quote from: Tìtstewan on May 06, 2013, 05:20:35 PM
Your idea is quite good, but I see some problems:
Ma Eywa, how can I explain it... Most people (including me) would want a list of the most used root words. (Why root words? Because root words are learned first. Once a beginner has learned them, he can work with them: he can use a root word with prefixes, infixes and suffixes.) If I had a word file from Markì and analyzed every word with a program, what would the result be? I would have the 500 most used words, sure, but those words would be "wrong", and the statistics incorrect and effectively "faked". Why? Because the program would count words like kameie or faysawtute etc., but these are not root words! In this little example the root words would be kame and tawtute. So what's needed is a program that counts root words, and I believe such a program doesn't exist. I attached an example of what I mean by this problem.

I really would like to have such a program, which could give me the non-root words, because that would be really simple to compile. The results of the counting would then have to be corrected (removing the prefixes, infixes and suffixes, lenition letters, wrong words and/or non-Na'vi words). So, I will write Markì a PM to ask him for the word data of the three sub-boards.

Well, I think you didn't read everything :)
Counting words "as they are" is the first step - it preprocesses the data for further analysis. We must start with the inflected words, because from an inflected form we can find the root word.
Oops! That slipped my mind. :-[ Well, right! I'm just waiting for Markì's answer. :D

Quote from: Blue Elf on May 07, 2013, 01:33:49 AM
Based on the counts of inflected words, the manual search for roots can go faster, I agree. The problem with infixes is that some verbs in root form look like they contain an infix even though they don't, like zerok or zamunge. The results of the infix analysis have to be compared with the dictionary. I suggested an algorithm for this somewhere, but I don't remember where.... Thinking....

So the resulting table would look like this:

inflected_form  count  root           count
tswayon         10     tswayon        10
tswayayon       5      tswayon        5
sum=15
...
oe              500    oe             500
oel             250    oe             250
oet             300    oe             300
oeti            330    oe             330
sum=1380
Yeah, that's just how I imagined it too!
Let's wait until I've got the data sources. ;) Then the next steps will follow.

-| Na'vi Vocab + Audio | Na'viteri as one HTML file | FAQ | Useful Links for Beginners |-
-| Kem si fu kem rä'ä si, ke lu tìfmi. |-

Tìtstewan

#28
Update!

I got the files for these threads! I have them as .csv, and I'm currently creating an .html version to 'remove' the stuff in Ayoengl nume Na'viti slä ayoeng plltxe nì'ìnglìsì mì fratseng...

And it's quite a lot of data! :o

What should I do with the quotes? Take them too, or not?

-| Na'vi Vocab + Audio | Na'viteri as one HTML file | FAQ | Useful Links for Beginners |-
-| Kem si fu kem rä'ä si, ke lu tìfmi. |-

Blue Elf

Quote from: Tìtstewan on May 07, 2013, 12:55:39 PM
Update!

I got the files for these threads! I have them as .csv, and I'm currently creating an .html version to 'remove' the stuff in Ayoengl nume Na'viti slä ayoeng plltxe nì'ìnglìsì mì fratseng...

And it's quite a lot of data! :o

What should I do with the quotes? Take them too, or not?
It seems the data are in UTF-8, but the file contents are not recognized as UTF-8. I'd remove the quotes, as they only repeat the previous conversation.
For analysis, plain text is better than HTML.
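
A minimal sketch of forcing the encoding when reading the export, so that ì and ä survive (the file name is illustrative):

# Read the exported text with an explicit encoding; undecodable bytes
# are replaced rather than crashing the script.
with open("ninavi_niaw.txt", encoding="utf-8", errors="replace") as f:
    text = f.read()

print(text[:200])   # ì and ä should now display correctly
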
Oe lu skxawng skxakep. Slä oe nerume mi.
"Oe tasyätxaw ulte koren za'u oehu" (Limonádový Joe)


Tìtstewan

Everything is good! I was able to fix that and to remove the HTML code like <br />.
Now it looks like this:
Quote from: just a little bit...fìtseng tìkenong nìNa'vi nì'aw lu: [url=http://forum.learnnavi.org/index.php?topic=175.msg2194#msg2194]
Ngengaru Ätxäle Ayoeyä
[/url] Eywa ngahu!" "Eywa Ngahu!" "Eywa ngahu
Ngayäl eltu si leiu oeti.  ;D" "[quote author=Brainiac link=topic=170.msg2204#msg2204 date=1261655341]
Eywa ngahu
Ngayäl eltu si leiu oeti.  ;D
[/quote][/td][/tr][/table]

Ngenga, san Ngayäl eltu si leiu oeti sìk, fmawn lì'u täftxu renuti kayaryu?

What do we do with the content of the [Quote] and [Spoiler] tags?


Edit:
A little statistic of the overall word counts:

File: ninavi_niaw
Words: 287946

File: pamrel_ninavi_niaw
Words: 153287

File: pamtseo_ninavi_niaw
Words: 101762


-| Na'vi Vocab + Audio | Na'viteri as one HTML file | FAQ | Useful Links for Beginners |-
-| Kem si fu kem rä'ä si, ke lu tìfmi. |-

Blue Elf

I would ignore what is inside quotes (they just repeat previous text). Not sure about spoilers; probably include their content (except for any quotes inside them...).
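
A rough sketch of that cleaning pass, under those assumptions: nested [quote]...[/quote] blocks are removed entirely, spoiler tags are unwrapped so their content stays, and any leftover BBCode tags such as [url=...] or [/td] are stripped.

import re

# Matches an innermost [quote ...]...[/quote] block (no nested quote inside).
QUOTE = re.compile(r"\[quote[^\]]*\](?:(?!\[quote).)*?\[/quote\]",
                   re.IGNORECASE | re.DOTALL)

def clean(text):
    prev = None
    while prev != text:              # repeat so nested quotes vanish too
        prev, text = text, QUOTE.sub(" ", text)
    # Keep spoiler contents, drop only the tags themselves.
    text = re.sub(r"\[/?spoiler[^\]]*\]", " ", text, flags=re.IGNORECASE)
    # Strip any remaining BBCode tag.
    return re.sub(r"\[[^\]]+\]", " ", text)
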
Oe lu skxawng skxakep. Slä oe nerume mi.
"Oe tasyätxaw ulte koren za'u oehu" (Limonádový Joe)


Tìtstewan

I will delete the quotes. The spoilers... hmm... it depends on what they contain.
On Thursday, I will start with the 'cleaning'.

Ma Blue Elf, are you able to write a little script to analyze and count the words?

-| Na'vi Vocab + Audio | Na'viteri as one HTML file | FAQ | Useful Links for Beginners |-
-| Kem si fu kem rä'ä si, ke lu tìfmi. |-

Tirea Aean

Quote from: Tìtstewan on May 07, 2013, 02:33:57 PM
I will delete the quotes. The spoilers... hmm... it depends on what they contain.
On Thursday, I will start with the 'cleaning'.

Ma Blue Elf, are you able to write a little script to analyze and count the words?

If not, I can :)

Blue Elf

If you give me a piece of one file, I can try tomorrow. For development and testing, one page of one thread would be sufficient.
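
A minimal sketch of such a counting script, assuming the cleaned plain-text dump from above; the token pattern keeps ì, ä and the apostrophe so words like nì'aw are not split apart.

from collections import Counter
import re
import sys

TOKEN = re.compile(r"[a-zäì']+", re.IGNORECASE)

def count_words(path):
    with open(path, encoding="utf-8") as f:
        return Counter(t.lower() for t in TOKEN.findall(f.read()))

if __name__ == "__main__":
    # Usage: python count_words.py ninavi_niaw.txt
    for word, n in count_words(sys.argv[1]).most_common(500):
        print(f"{n:6d}  {word}")
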
Oe lu skxawng skxakep. Slä oe nerume mi.
"Oe tasyätxaw ulte koren za'u oehu" (Limonádový Joe)


Tìtstewan

#35
I've attached it as a .txt file.

Quote from: Tirea Aean on May 07, 2013, 02:36:42 PM
If not, I can :)
Writing a program? ???

-| Na'vi Vocab + Audio | Na'viteri as one HTML file | FAQ | Useful Links for Beginners |-
-| Kem si fu kem rä'ä si, ke lu tìfmi. |-

Toruk Makto

I will leave the original CSV files in the docs download directory in case anyone needs to pull a fresh copy. Also, since it's a fairly simple SQL query to create the files in the first place, I can refresh them as needed in the future if you want to pull a later sample.

Lì'fyari leNa'vi 'Rrtamì, vay set 'almong a fra'u zera'u ta ngrrpongu
Na'vi Dictionary: http://files.learnnavi.org/dicts/NaviDictionary.pdf

Blue Elf

Hmmm, that text contains mostly English, some German and some Na'vi... OK, I'll manually take some of the Na'vi parts. Some texts are even colorized; that will be an interesting job :)
Oe lu skxawng skxakep. Slä oe nerume mi.
"Oe tasyätxaw ulte koren za'u oehu" (Limonádový Joe)


Tìtstewan

#38
Quote from: Toruk Makto on May 07, 2013, 02:50:42 PM
I will leave the original CSV files in the docs download directory in case anyone needs to pull a fresh copy. Also, since it's a fairly simple SQL query to create the files in the first place, I can refresh them as needed in the future if you want to pull a later sample.
That's a really good idea!



Quote from: Blue Elf on May 07, 2013, 02:52:28 PM
Hmmm, that text contains mostly English, some German and some Na'vi... OK, I'll manually take some of the Na'vi parts. Some texts are even colorized; that will be an interesting job :)
Indeed, it contains lots of English, but a lot of Na'vi, too.

Here I've attached the other two .txt files without the HTML code.

-| Na'vi Vocab + Audio | Na'viteri as one HTML file | FAQ | Useful Links for Beginners |-
-| Kem si fu kem rä'ä si, ke lu tìfmi. |-

Blue Elf

Ngaytxoa ma eylan, I started the work, but got stuck a little on removing the quotes. As I'll be travelling to Berlin, I won't continue until next weekend. So if someone can be faster... feel free to continue :)
As I look at the data, there will still be a lot of manual work - removing English words, removing various hyphens and infix markers, repairing typos and incorrect words... Not as easy as I thought.
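
Part of that manual pass could be pre-sorted automatically: anything not found in a Na'vi word list goes to a review file, which removes most of the English and German and leaves inflected forms, typos and hyphenated fragments for the human step. A sketch only, reusing the assumed "navi_roots.txt" word list from the earlier sketches.

import re

with open("navi_roots.txt", encoding="utf-8") as f:
    KNOWN = {line.strip().lower() for line in f if line.strip()}

def split_counts(counts):                 # counts: {word: n}
    known, review = {}, {}
    for word, n in counts.items():
        w = re.sub(r"[-<>]", "", word)    # drop hyphens and infix markers
        bucket = known if w in KNOWN else review
        bucket[w] = bucket.get(w, 0) + n
    return known, review
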
Oe lu skxawng skxakep. Slä oe nerume mi.
"Oe tasyätxaw ulte koren za'u oehu" (Limonádový Joe)