Na'vi frequency dictionary

Started by Herwìna, March 28, 2013, 05:26:32 AM


Herwìna

Could someone generate a Na'vi frequency dictionary? 400 most common words, excluding adpositions, prefixes, infixes etc.

Doable y/n?

ETA: Not sure if words like ayoeng should be counted? It has the prefix ay-, but it also has a dictionary definition of its own.
Siyevop nga nìzawnong ayukmì, vaykrr oengeyä mefya'o ultxaräpun fìtsap nìmun.

Oe zawng
nga zawng
nìwotx awnga zawng
fte oeti zeykivawng

Ngal yamom fì'ut srak?!

Tsyalatun te Eyktan Txuratu'itan

Quote from: Herwìna on March 28, 2013, 05:26:32 AM
Could someone generate a Na'vi frequency dictionary? 400 most common words, excluding adpositions, prefixes, infixes etc.

Doable y/n?

ETA: Not sure if words like ayoeng should be counted? It has the prefix ay-, but it also has a dictionary definition of its own.
If I have some free time, I'll look into what's required to do this, using the nìNa'vi nì'aw section of the forums. I will probably deliberately make it run slowly, as there's little point in rapidly firing lots of requests at the forums for a project like this (not only that, but I do wish to keep using my 'net connection!). One issue with scripting it is dealing with all the affixes... but as I have a regex-based Perl vrrtep, that's not going to be difficult to sort out.

What is difficult to resolve is that some words are spelt the same, but are different root words.  For example: kilvan { (kan): vtrm. aim, infix(es): ilv | (kilvan): n. river }.  So, unless you augment it with something which tries to parse a sentence to work out if kilvan is a noun or a verb, you could mis-attribute such words.

Nevertheless, building a database of Na'vi-word-to-list-of-URLs sounds like an interesting project in itself... :)
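
For what it's worth, here is a minimal sketch of that word-to-URL index, written in C# to match the analyzer offered later in the thread rather than the Perl tool mentioned above. The topic URLs, the one-second pause, and the letter class for Na'vi words are placeholders and assumptions, not anything agreed here:

[code]
// Sketch only: build a Na'vi-word -> list-of-URLs index from a few forum
// pages, pausing between requests so the forum is not hammered. Raw HTML
// will also yield English words and tag names, so the result still needs
// the kind of filtering discussed later in this thread.
using System;
using System.Collections.Generic;
using System.Net.Http;
using System.Text.RegularExpressions;
using System.Threading.Tasks;

class WordIndexSketch
{
    // Rough letter class for Na'vi orthography (ì, ä, and the glottal stop ').
    static readonly Regex Word = new Regex(@"[a-zäì']+", RegexOptions.IgnoreCase);

    static async Task Main()
    {
        var pages = new[]
        {
            // Hypothetical topic URLs; replace with real nìNa'vi nì'aw threads.
            "https://forum.learnnavi.org/index.php?topic=11111.0",
            "https://forum.learnnavi.org/index.php?topic=22222.0",
        };

        var index = new Dictionary<string, List<string>>();
        using var http = new HttpClient();

        foreach (var url in pages)
        {
            string html = await http.GetStringAsync(url);
            foreach (Match m in Word.Matches(html.ToLowerInvariant()))
            {
                if (!index.TryGetValue(m.Value, out var urls))
                    index[m.Value] = urls = new List<string>();
                if (!urls.Contains(url))
                    urls.Add(url);
            }
            await Task.Delay(TimeSpan.FromSeconds(1)); // deliberately slow, per the post above
        }

        Console.WriteLine($"{index.Count} distinct word forms indexed.");
    }
}
[/code]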

Taronyu Leleioae

Quote from: Tsyalatun te Eyktan Txuratu'itan on April 17, 2013, 01:27:41 PM
What is difficult to resolve is that some words are spelt the same, but are different root words.  For example: kilvan { (kan): vtrm. aim, infix(es): ilv | (kilvan): n. river }.  So, unless you augment it with something which tries to parse a sentence to work out if kilvan is a noun or a verb, you could mis-attribute such words.

In the worst case, if the goal is to identify the top 500 or so, you could output the results to XML or even plain text, and we could help you manually read through the list (say 700 or even 1000 words) and identify those not in root form. The most common will be inflected forms such as the past "lolu" and the future "layu", for example. If it's in a table of some sort, an extra field could be used to note non-root verbs, or to flag common words carrying the plural or fì-/tsa- as essentially duplicates. Let the coding do the frequency analysis and sorting. Then let human eyes check the words to verify what they are. I can help you with this if it's the easier solution.
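
One possible shape for that review table, sketched in C#: a tab-separated dump with a blank root column for human annotators to fill in from Excel. The 600-row buffer and the column names are only illustrative assumptions:

[code]
// Sketch: write the top N word forms (plus a buffer) to a tab-separated file
// with an empty "root" column, ready to be opened in Excel and filled in by
// human reviewers. "counts" is assumed to come from an earlier counting step.
using System.Collections.Generic;
using System.IO;
using System.Linq;

static class ReviewExport
{
    public static void WriteReviewTable(IDictionary<string, int> counts,
                                        string path, int take = 600)
    {
        var rows = counts
            .OrderByDescending(kv => kv.Value)
            .Take(take)
            .Select(kv => $"{kv.Key}\t{kv.Value}\t");  // word, frequency, root left blank

        File.WriteAllLines(path, new[] { "word\tcount\troot" }.Concat(rows));
    }
}
[/code]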

Ftiafpi

I definitely support this and would gladly help parse verb roots out of a list. This would be an extremely useful list.

Taronyu Leleioae

Nudging this thread. I think there is real value in finding out what the top 400 words are (go for 500, since the list will shrink during cleanup), especially from a beginner's point of view...

Ftiafpi

Quote from: Taronyu Leleioae on April 28, 2013, 07:14:53 PM
Nudging this thread. I think there is real value in finding out what the top 400 words are (go for 500, since the list will shrink during cleanup), especially from a beginner's point of view...

Additional nudge. I lack the technical expertise or I would do this.

`Eylan Ayfalulukanä

A list of words ordered by frequency of use would be very useful. But from what I understand, the code to do it is not trivial.

If you just want a digest of all words, inflected and uninflected, that is not too difficult. Where the challenge lies is extracting the root from those inflected words. Infixes make it even harder, as they are in the middle of the word, and don't always 'stand out' from a coder's perspective.

There are two ways to do this, I think. One is to prepare a list of all words used in nìNa'vi nì'aw for, say, the last 5 months, and then let a human de-inflect the words to root form. That way, it would only have to be done once. The second way is to prepare some sort of mapping that looks at a word and compares it to common inflected versions of that word. That can build a base and inflected count with less work.

In either case, you don't want to go back too far, as usage has changed as we have gained new vocabulary. You also want to limit the infixes counted to the most common ones; the counting system could stop and ask for clarification on the rarely used ones.
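
That mapping idea could start as nothing more than a hand-maintained lookup table of frequent inflected forms. A small C# sketch, using only the example forms mentioned elsewhere in this thread (the class and method names are made up):

[code]
// Sketch: fold counts of known inflected forms into their roots via a lookup
// table; anything not in the table is left as-is for human review later.
using System.Collections.Generic;

static class RootFolding
{
    static readonly Dictionary<string, string> KnownForms = new Dictionary<string, string>
    {
        ["lolu"] = "lu",  // <ol> perfective
        ["layu"] = "lu",  // <ay> future
        ["lamu"] = "lu",  // <am> past
        // ...extended as reviewers spot other frequent forms
    };

    public static Dictionary<string, int> Fold(IDictionary<string, int> counts)
    {
        var folded = new Dictionary<string, int>();
        foreach (var pair in counts)
        {
            var key = KnownForms.TryGetValue(pair.Key, out var root) ? root : pair.Key;
            folded[key] = folded.TryGetValue(key, out var c) ? c + pair.Value : pair.Value;
        }
        return folded;
    }
}
[/code]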

Yawey ngahu!
pamrel si ro [email protected]

Taronyu Leleioae

Quote from: Taronyu Leleioae on April 17, 2013, 08:27:48 PM
Let the coding do the frequency analysis and sorting.  Then let human eyes check the words to verify what they are.  I can help you with this if it's the easier solution.

Agreed on the coding issue of actually identifying roots. That's why both Ftiafpi and I offered to help. Give us, say, an extra 100 words above the target goal. Then we just export to, say, Excel, and sort away, reading down the page and identifying any common inflected forms to pull out just the root word; lolu > lu being highly expected, for example.

To make it much easier, just have the code do the raw count, sorted by frequency. We're not looking for exact statistics, just the frequencies in descending order...


Ftiafpi

Quote from: Taronyu Leleioae on April 28, 2013, 09:35:08 PM
Quote from: Taronyu Leleioae on April 17, 2013, 08:27:48 PM
Let the coding do the frequency analysis and sorting.  Then let human eyes check the words to verify what they are.  I can help you with this if it's the easier solution.

Agreed on the coding issue of actually identifying roots. That's why both Ftiafpi and I offered to help. Give us, say, an extra 100 words above the target goal. Then we just export to, say, Excel, and sort away, reading down the page and identifying any common inflected forms to pull out just the root word; lolu > lu being highly expected, for example.

To make it much easier, just have the code do the raw count, sorted by frequency. We're not looking for exact statistics, just the frequencies in descending order...

Yep, that's about it. I'm just not good with the programming.

Tirea Aean

I actually think it would be useful to know the 500 most used WORDS, not just the 500 most used ROOTS. That would also give some idea of which infixes are used most, as well as which words. This wouldn't be too hard.

What is the source of the data? Just the single /ninavi-niaw board? I can figure this out given enough time. >:D

Taronyu Leleioae

Knowing the top words is the goal. I see the potential for a two-part process: then export that list into a separate Memrise course for beginners, to make using Memrise a little less overwhelming. Maybe, for this Memrise course, we should include the most common verbs with their infixes included, i.e. lu, lolu, lamu, if they show up in the frequency count?   :-\

Ftiafpi

No reason we can't have both, but I personally want the root words as I feel that's more useful to beginner and intermediate learners.

Nowfaleena

As a beginner, I would love to have a list of the top 500 most-used words in any form! I have been reading the posts in Na'vi and then using the regular dictionary to translate them to try to learn, but a smaller dictionary with the most commonly used words would make this a lot easier!

Palulukan Maktoyu

Fkol syaw oeru Palulukan Maktoyu Ta'lengean

Twitter: https://twitter.com/navi_wotd

Tìtstewan

Kaltxì ayngru

I would also like a list of the most used Na'vi words. :D
So I took the liberty of creating a 'prototype file' for collecting the words, to work out the most used Na'vi words. If anyone is already creating such a file, let me know; otherwise I will start collecting Na'vi words next week. ;)

A little preview of the file is attached.

Eywa ayngahu!

-| Na'vi Vocab + Audio | Na'viteri as one HTML file | FAQ | Useful Links for Beginners |-
-| Kem si fu kem rä'ä si, ke lu tìfmi. |-

Ftiafpi


Blue Elf

Quote from: Tìtstewan on May 05, 2013, 01:43:39 PM
Kaltxì ayngru

I would also like a list of the most used Na'vi words. :D
So I took the liberty of creating a 'prototype file' for collecting the words, to work out the most used Na'vi words. If anyone is already creating such a file, let me know; otherwise I will start collecting Na'vi words next week. ;)

A little preview of the file is attached.

Eywa ayngahu!
How do you plan to collect the data? Manually? That would take bloody ages! IMHO this is work for a computer. I would do it like this:
- ask Markì to get the data out of the forum database into plain text file(s). We probably want to analyse the NìNa'vi nì'aw threads. All threads, or only from some point onwards?
- then it shouldn't be hard to write a program which reads the files, splits out the words, and counts them. At this stage we don't care about affixes; all words are taken as-is. The result goes into another database (or a text file, but a DB is probably better for further analysis). Now we know the frequencies of all words, inflected forms included. It would also be useful to mark the word type (noun, verb, adjective...); this is manual work.
- the next step is to write code which, based on word type, tries to strip words down to their roots and count them again. This is a harder task, although Tirea Aean may already have it solved (vrrtepcli v2.0 was able to analyze a sentence, but I don't remember how deeply).

If we can get the thread texts as text files, I can try to write an analyzer (in C#) to support this project...
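
A minimal sketch of such an analyzer, assuming the thread texts arrive as plain .txt files in one folder; the folder layout, the letter class, and the 600-row cut-off are assumptions, and no affix handling is attempted at this stage:

[code]
// Sketch: read every .txt file in a folder, pull out word-shaped tokens,
// count them as-is (no affix handling yet), and print the counts in
// descending order for the later manual steps.
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text.RegularExpressions;

class RawFrequencyCount
{
    // Rough letter class for Na'vi orthography (ì, ä, and the glottal stop ').
    static readonly Regex Word = new Regex(@"[a-zäì']+", RegexOptions.IgnoreCase);

    static void Main(string[] args)
    {
        string folder = args.Length > 0 ? args[0] : ".";
        var counts = new Dictionary<string, int>();

        foreach (var file in Directory.EnumerateFiles(folder, "*.txt"))
            foreach (Match m in Word.Matches(File.ReadAllText(file).ToLowerInvariant()))
                counts[m.Value] = counts.TryGetValue(m.Value, out var c) ? c + 1 : 1;

        foreach (var kv in counts.OrderByDescending(kv => kv.Value).Take(600))
            Console.WriteLine($"{kv.Value}\t{kv.Key}");
    }
}
[/code]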
Oe lu skxawng skxakep. Slä oe nerume mi.
"Oe tasyätxaw ulte koren za'u oehu" (Limonádový Joe)


Tìtstewan

Quote from: Blue Elf on May 06, 2013, 02:35:02 PM
How do you plan to collect the data? Manually? That would take bloody ages!
That is quite a good question. My brain is working on an idea for this. :) I guess manual work would kill me... :-X

Quote from: Blue Elf on May 06, 2013, 02:35:02 PM
IMHO this is work for a computer. I would do it like this:
- ask Markì to get the data out of the forum database into plain text file(s). We probably want to analyse the NìNa'vi nì'aw threads. All threads, or only from some point onwards?
- then it shouldn't be hard to write a program which reads the files, splits out the words, and counts them. At this stage we don't care about affixes; all words are taken as-is. The result goes into another database (or a text file, but a DB is probably better for further analysis). Now we know the frequencies of all words, inflected forms included. It would also be useful to mark the word type (noun, verb, adjective...); this is manual work.
- the next step is to write code which, based on word type, tries to strip words down to their roots and count them again. This is a harder task, although Tirea Aean may already have it solved (vrrtepcli v2.0 was able to analyze a sentence, but I don't remember how deeply).
If we can get the thread texts as text files, I can try to write an analyzer (in C#) to support this project...
Well, first, I would use these boards:
- nìNa'vi nì'aw
- Pamrel nìNa'vi nì'aw
- Pamtseo nìNa'vi Nì'aw

(They are mostly in Na'vi. :))

Your idea is quite good, but I see some problems.
Ma Eywa, how can I explain it... Most people (me included) would want a list of the most used root words. (Why root words? Because root words are learned first. Once a beginner has learned them, he can work with them and combine a root word with prefixes, infixes and suffixes.) If I had a word file from Markì and analysed every word with a program, what would the result be? I would have the 500 most used words, yes, but those words would be "wrong", and the statistic incorrect and in a sense "faked". Why? Because the program would count words like kameie or faysawtute, but these are not root words! In this little example the root words would be kame and tawtute. So what is needed is a program which counts root words, and I believe such a program doesn't exist. I have attached an example of what I mean.

I would still like a program that simply gives me the inflected word counts, because that would be really simple to compile. The result of the counting then has to be corrected (prefixes, infixes and suffixes removed, lenition undone, wrong words and/or non-Na'vi words discarded). So I will send Markì a PM to ask him for the word data of the three sub-boards.

-| Na'vi Vocab + Audio | Na'viteri as one HTML file | FAQ | Useful Links for Beginners |-
-| Kem si fu kem rä'ä si, ke lu tìfmi. |-

Taronyu Leleioae

There seem to be two directions of thought going on here...

What I think we need is a proper subset of the dictionary (with database definitions) to create a less intimidating Memrise course with the core words. This could also be published as a PDF or other document if desired. Plus, it would help to know what the top 500 or so words really are, infixes and all. Thus straight code to analyze the words would be most helpful, but the results will likely be slightly contaminated by spoiler comments. Just set the output to the top 600 and manually weed the list down...

There has been discussion about root words vs. the most common inflected forms. I prefer the root words. I don't agree that a statistic would be incorrect or wrong; they are simply the words in use (with infixes). But they are not the top 500 root words. Thus we really need the results for the top... 600? Then do a manual sort in a format that can be imported into Excel, preferably with a frequency/occurrence field or column added to the table.

Don't worry about having the code do the stripping of infixes, etc. Human eyes can do it faster and recognize the roots. Then, once we get a sorted list, we can export it to suit the various needs... Ftiafpi and I can quickly turn around a raw data dump...


Quote
Well, first, I would use these boards:
- nìNa'vi nì'aw
- Pamrel nìNa'vi nì'aw
- Pamtseo nìNa'vi Nì'aw

It would be nice if the frequency analyzer could be coded to ignore anything inside spoiler tags (as well as the word "spoiler" itself!). This would cut down on English or other languages from those threads being included in the frequency counts...
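
If the exported text still carries the forum's BBCode, spoiler blocks could be dropped with one regex before counting. A sketch, assuming spoilers appear as [spoiler]...[/spoiler] or [spoiler=Title]...[/spoiler] in the dump; if the dump is rendered HTML or plain text instead, the pattern would need to change:

[code]
// Sketch: strip spoiler blocks from a post's text before it reaches the word
// counter, so their (mostly English) contents don't pollute the counts.
using System.Text.RegularExpressions;

static class SpoilerFilter
{
    static readonly Regex SpoilerBlock = new Regex(
        @"\[spoiler(=[^\]]*)?\].*?\[/spoiler\]",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    public static string Strip(string text) => SpoilerBlock.Replace(text, " ");
}
[/code]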

Tìtstewan

Quote from: Taronyu Leleioae on May 06, 2013, 05:38:17 PM
I don't agree that a statistic would be incorrect or wrong; they are simply the words in use (with infixes).
How will the program 'see' that these 8 words below are all forms of the root word tswayon?
tswamayon
tswìmayon
tswìyayon
tswayayon
tswolayon
tswerayon
tswayeion
tswayängon


This is what I mean by the risk of a 'faked' statistic caused by infixes etc. when creating a root-word list.
If we create the same statistic with the infixes etc. left in, we will get a different result from the root-word list.
I see two lists coming out of this: one with the top 500/600/etc. most used Na'vi root words, and another with the top 1000/1500/etc. most used Na'vi word forms (with all their infixes, suffixes, etc.).
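
For what it's worth, a crude way for a program to 'see' this is to try deleting each known infix from a word and check whether what is left is a dictionary root. A C# sketch; the infix and root lists here are tiny illustrative samples, and false matches (a genuine "ay" inside an unrelated word, say) mean the output would still need human checking:

[code]
// Sketch: remove each known infix in turn; if the remainder is a known root,
// report it. All eight tswayon forms listed above collapse to "tswayon" this
// way, but the guess is only as good as the infix and root lists.
using System;
using System.Collections.Generic;

class InfixGuess
{
    static readonly string[] Infixes = { "ìm", "ìy", "am", "ay", "ol", "er", "ei", "äng" };
    static readonly HashSet<string> Roots = new HashSet<string> { "tswayon", "lu", "kame" };

    static string GuessRoot(string word)
    {
        if (Roots.Contains(word)) return word;
        foreach (var inf in Infixes)
        {
            int i = word.IndexOf(inf, StringComparison.Ordinal);
            while (i >= 0)
            {
                var candidate = word.Remove(i, inf.Length);
                if (Roots.Contains(candidate)) return candidate;
                i = word.IndexOf(inf, i + 1, StringComparison.Ordinal);
            }
        }
        return word; // unknown: leave for human review
    }

    static void Main()
    {
        foreach (var w in new[] { "tswamayon", "tswolayon", "tswayeion", "tswayängon" })
            Console.WriteLine($"{w} -> {GuessRoot(w)}");  // each prints "... -> tswayon"
    }
}
[/code]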

-| Na'vi Vocab + Audio | Na'viteri as one HTML file | FAQ | Useful Links for Beginners |-
-| Kem si fu kem rä'ä si, ke lu tìfmi. |-