I've managed to throw something together as a proof of concept. It's a short script (weighing in at 44 lines, comments included!) that reads a "dictionary" file which "explains" to it the parts each word consists of. The file looks like this:
tslolam~tslam,-ol-
fìtseng~fì-,tseng
pefnekelku~pe-,fne-,kelku
peseng~pe-,tseng
kaltxì~kaltxì
frapoya!~fra-,po,-ya
lu~lu
txana~txan
krr~krr
fwa~fwa
tuteo~tute,-o
pamrel~pamrel
si~ignore
’uor~'u,-o,-ur
fìtsteng.~fì-,tseng
fwa~fwa
tìrey~tìrey
fìtsenge~fì-,tseng
za’u~za'u
a~a
fì’u~fì-,'u
oeru~oe,-ru
teya~teya
seiyi~-ei-
nìngay.~nìngay
srake~srake
kop~kop
som~som
fìtxan~fìtxan
ro~ro
aysenge~ay+,tseng
tok~tok
ayngal?~ay+,nga,-l
tewti.~tewti
fìtseng,~fì-,tseng
som~som
fìtxan~fìtxan
fko~fko
ke~ke
tsängun~tsun,-äng-
kem~kem
sivi~-iv-
ke’ur~ke'u,-r
stum.~stum
nì’aw~nì'aw
tìhuseyn~heyn,tì-us-
ke~ke
ftue~ftue
taluna~taluna
pay~pay
asyä’ä~syä'ä
ta~ta
tokx~tokx
za’u…~za'u
–~ignore
(~ignore
roll~ignore
eyes~ignore
oeru~oe,-ru
txoa~txoa
livu~lu,-iv-
fpi~fpi
This way, not only can the roots be counted, but affixes can be counted as well.
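In Python, loading a file in that word~components format boils down to just a few lines. Here's a rough sketch of the idea rather than the script verbatim (the file and variable names are placeholders):

# Load the dictionary: each line is "word~component,component,...".
dictionary = {}
with open("dictionary.txt", encoding="utf-8") as f:
    for line in f:
        line = line.strip()
        if "~" in line:
            word, _, parts = line.partition("~")
            dictionary[word] = parts.split(",")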
Then you give it a text file. I copied a bit from nìNa’vi nì’aw:
Kaltxì frapoya!
lu txana krr fwa tuteo pamrel si ’uor fìtsteng. Fwa tìrey ne’ìm fìtsenge za’u a fì’u oeru teya seiyi nìngay.
Srake lu kop som fìtxan ro aysenge a tok ayngal? – tewti. Fìtseng, lu som fìtxan fwa fko ke tsängun kem sivi ke’ur stum. Nì’aw tìhuseyn lu ke ftue taluna pay asyä’ä ta tokx za’u… ( Roll Eyes oeru txoa livu fpi lì’u slä ke omum oel fya’ot a oel plltxe tsat nìketeng…)
Tse, ma Taronyu, tsamun nga hivahaw tsatì’i’a srak?
Ìlä fìskxom zerok oel tìrolti a fkol rol mì rel arusikx a syaw san« Tsray »sìk, srake ayngal tsat omum? ’Uo na fì’u:
Ma prrnen, hivahaw
nìmwey hivahaw
tìrey lu ngim sì tìyawn lu teya hu ral
krr layu kalin ngafpi
’änsyema kifkeyti tsive’a
lu krr fte tsive’a fte kivame
hufwa tìvawm za’u sì kä
fya’o a *hufwevil ayutralti ’ärip
fya’o a prrnesyul ’ong
It "looks" at each word and checks if it's in the dictionary. If it is, it adds +1 to each of that word's components under the dictionary entry. If it isn't, it prints that word to an "Unlisted Words" file as a blank dictionary entry, like so:
ne’ìm~
lì’u~
slä~
omum~
oel~
fya’ot~
oel~
plltxe~
tsat~
nìketeng…)~
tse,~
ma~
taronyu,~
tsamun~
nga~
hivahaw~
tsatì’i’a~
srak?~
ìlä~
fìskxom~
zerok~
oel~
tìrolti~
fkol~
rol~
mì~
rel~
arusikx~
syaw~
san«~
tsray~
»sìk,~
ayngal~
tsat~
omum?~
’uo~
na~
fì’u:~
ma~
prrnen,~
hivahaw~
nìmwey~
hivahaw~
ngim~
sì~
tìyawn~
hu~
ral~
layu~
kalin~
ngafpi~
’änsyema~
kifkeyti~
tsive’a~
fte~
tsive’a~
fte~
kivame~
hufwa~
tìvawm~
sì~
kä~
fya’o~
*hufwevil~
ayutralti~
’ärip~
fya’o~
prrnesyul~
’ong~
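In Python, that counting pass is roughly the following. Again, this is a sketch of the idea, not the script verbatim; it assumes the dictionary was loaded as in the earlier snippet, and the file names are placeholders:

# Start every known component at 0 so unused ones still appear in the totals.
counts = {part: 0 for parts in dictionary.values() for part in parts}

with open("text.txt", encoding="utf-8") as f:
    words = f.read().lower().split()   # lowercased, split on whitespace

with open("unlisted.txt", "w", encoding="utf-8") as out:
    for word in words:
        if word in dictionary:
            for part in dictionary[word]:
                counts[part] += 1      # +1 per component of the word
        else:
            out.write(word + "~\n")    # blank entry, ready to fill in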
After reading the entire file, the script writes the final counts to a comma-separated values (CSV) file:
krr,3
kaltxì,1
tewti,1
pe-,0
a,7
ke,3
srake,2
fne-,0
fìtxan,2
fko,1
za'u,3
ignore,5
teya,2
fì-,4
fpi,1
kop,1
-ya,1
heyn,1
kem,1
pay,1
nìngay,1
tì-us-,1
po,1
fra-,1
fwa,3
pamrel,1
'u,2
-ur,1
-ru,2
-o,2
txoa,1
syä'ä,1
-l,1
tslam,0
-ei-,1
ftue,1
taluna,1
som,2
ta,1
tok,1
tseng,4
lu,8
nga,1
-r,1
ro,1
-iv-,2
txan,1
tute,1
stum,1
nì'aw,1
kelku,0
tsun,1
-ol-,0
ay+,2
oe,2
-äng-,1
ke'u,1
tokx,1
tìrey,2
This can easily be loaded into a spreadsheet program, sorted, and used however you please.
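The CSV-writing step itself is nothing more than a loop over the totals (again a sketch, assuming the counts variable from the snippet above):

# Write the final component counts as comma-separated values.
with open("counts.csv", "w", encoding="utf-8") as f:
    for part, count in counts.items():
        f.write(f"{part},{count}\n")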
After processing a batch of text, you would go through the Unlisted Words file, fill in the components for each word, copy and paste the completed entries into the dictionary file, and re-parse the text until no unlisted words remain.
Please keep in mind that this script is not polished; I only threw it together to show that this is possible. It does the bare minimum and could be improved to deal with the quirks of parsing text like this.
For example, you'll notice that some punctuation ends up "stuck" to words in the Unlisted Words entries. This could be fixed by giving the script a list of characters to strip when examining a word, so that something like srak? would be treated as srak, without the question mark.
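Something along these lines would do it; note that the apostrophe (’) must stay off the strip list, since it's part of Na'vi words like ’u:

# Characters to strip from the edges of a word before the lookup.
PUNCTUATION = ".,!?:;«»()–…"

def clean(word):
    return word.strip(PUNCTUATION)

print(clean("srak?"))   # -> srak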
As for recognizing si-verbs, names, and other multi-word phrases and constructions, I could add a "pre-parser" that scans entire sentences for known multi-word phrases before examining words one by one. And if you're worried about usernames and the like being counted as regular words, you can always scan the text yourself and delete them by hand before handing it over to the script.
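Such a pre-parser could be as simple as fusing known phrases into single tokens before the text is split into words. A sketch, with a made-up token for the si-verb pamrel si:

# Fuse known multi-word phrases into single tokens.
PHRASES = {
    "pamrel si": "pamrel_si",   # hypothetical token for a si-verb
}

def preparse(text):
    for phrase, token in PHRASES.items():
        text = text.replace(phrase, token)
    return text

print(preparse("tuteo pamrel si ’uor"))   # -> tuteo pamrel_si ’uor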
As for things like BBCode and HTML tags (and their contents), I imagine they could easily be taken care of with regex matching; I work with programs that are "smart" enough to recognize patterns like this. I could write a script that removes either just the tags themselves, or the tags and everything in between them, and the same approach would work for both BBCode and HTML.
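For instance (a sketch; the patterns are deliberately simple and would need tuning for real forum markup):

import re

# Remove just the tags themselves, keeping the text between them:
def strip_tags(text):
    text = re.sub(r"\[/?\w+[^\]]*\]", "", text)   # BBCode: [b], [/url], ...
    return re.sub(r"</?\w+[^>]*>", "", text)      # HTML: <b>, </a>, ...

# ...or remove the tags and everything in between them:
def strip_tagged_blocks(text):
    text = re.sub(r"\[(\w+)[^\]]*\].*?\[/\1\]", "", text, flags=re.DOTALL)
    return re.sub(r"<(\w+)[^>]*>.*?</\1>", "", text, flags=re.DOTALL)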
I would also like to read the earlier discussion about scripts, and in particular what scripts "can't handle"; I may be able to find a way to make it work.