Another word/character frequency analysis

Started by Futurulus, August 15, 2011, 01:22:17 AM


Futurulus

It's been done before, but not for a while, and I thought people might be interested in seeing the state of Na'vi in the nìNa'vi nì'aw forum lately.

I'm working on a computer program to let me play with the corpus more easily, but I do have two very preliminary results.  Here they are:


Source was, as I mentioned above, the complete contents of the nìNa'vi nì'aw forums, minus a select few stickied threads containing lots of non-Na'vi text.

Of course, these lists have problems. The main ones I can think of: 1) the character frequency list doesn't collapse digraphs, and 2) the word list doesn't merge derived forms with their root words (the four forms of oe are the only example in the top 20).  Also, since I didn't do much scouring of the corpus by hand, the data is a bit noisy, but there's enough of it that I'm not too worried:

Number of posts: 3831
Number of words: 147806
Number of characters: 860859

The first non-Na'vi word that crept onto the list is (hrh...) "PM", in 86th place with 244 uses, followed closely by (no surprises here) "the", at number 98 with 209 uses.  ("spoiler" had a nice showing, with 152 :-[ hrh...I'll have to fix that at some point.  "hrh" itself, which I would totally count as a Na'vi word, was--alas--only used 113 times.)

I've attached the full list, which contains all characters and all words used at least twice.  When I have more news, I'll tell you all.  Source code for the program I wrote is available upon request.
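
For anyone who wants to try something similar, here is a minimal sketch of how word and character counts like these could be produced. It is not the actual program: the function name `count_corpus`, the tokenizing regex, and the `min_count` parameter are all assumptions made for illustration, and the input is assumed to be a list of post texts already pulled out of the forum.

```python
import re
from collections import Counter

# Assumed tokenizer: a word is a run of letters and apostrophes, so that
# forms like nì'aw or 'u stay in one piece. The real program may differ.
WORD_RE = re.compile(r"(?:[^\W\d_]|')+")


def count_corpus(posts, min_count=2):
    """Return (frequent_words, char_counts) for an iterable of post strings."""
    words = Counter()
    chars = Counter()
    for post in posts:
        # Fold case and typographic apostrophes so variants count together.
        text = post.lower().replace("\u2019", "'")
        for token in WORD_RE.findall(text):
            # Skip tokens that are only apostrophes (stray quote marks).
            if any(ch.isalpha() for ch in token):
                words[token] += 1
        # Raw character counts; digraphs such as "ng" are NOT collapsed here.
        chars.update(ch for ch in text if not ch.isspace())
    # Keep only words used at least min_count times, most frequent first.
    frequent = [(w, n) for w, n in words.most_common() if n >= min_count]
    return frequent, chars


if __name__ == "__main__":
    sample = ["Oel ngati kameie", "oel ngati kameie, ma tsmukan", "nì'aw oel"]
    frequent, chars = count_corpus(sample)
    print(frequent)
    print(chars.most_common(5))
```

Note that this sketch has both of the problems mentioned above: it counts raw characters rather than Na'vi letters, and it treats every inflected form as a separate word.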

ta Futurulu


edit -- other interesting results (copied from later posts):

'Oma Tirea

Interesting...

How about Na'vi letter frequency?  That would be really telling :)

[img]http://swokaikran.skxawng.lu/sigbar/nwotd.php?p=2b[/img]

ÌTXTSTXRR!!

Srake serar le'Ìnglìsìa lì'fyayä aylì'ut?  Nari si älofoniru rutxe!!

Ftxavanga Txe′lan


Futurulus

Nìprrte', ma Txe'lan!

Quote from: 'Oma Tirea on August 16, 2011, 04:02:50 AM
Interesting...

How about Na'vi letter frequency?  That would be really telling :)

Here you go, ma 'Oma Tirea:
(Never forget the humble tìftang!  I wrote a feature to filter out words with letters not in the Na'vi alphabet and realized that nì'aw, number 22 on the list, was getting rejected!)
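
A rough sketch of what such a filter might look like, with the tìftang included. The names `NAVI_CHARS` and `is_possibly_navi` are made up for this example, and the check is only character-level, so it is an assumption about the approach rather than the real feature.

```python
import unicodedata

# Every character that can appear in a Na'vi word, including g and x
# (which only occur inside ng, kx, px, tx) and the tìftang (').
NAVI_CHARS = set("aäeiìoufhklmngprstvwxyz'")


def is_possibly_navi(word):
    """True if the word uses only characters found in the Na'vi alphabet."""
    # Compose accents (a + combining diaeresis -> ä) and fold case.
    w = unicodedata.normalize("NFC", word).lower()
    # Treat typographic apostrophes and primes as the plain tìftang.
    w = w.replace("\u2019", "'").replace("\u2032", "'")
    return bool(w) and set(w) <= NAVI_CHARS


if __name__ == "__main__":
    for word in ["nì'aw", "oe", "spoiled", "because", "the"]:
        print(word, is_possibly_navi(word))
```

Leaving `'` out of `NAVI_CHARS` reproduces the bug: nì'aw gets rejected. A check like this can only catch words that are definitely not Na'vi (anything with b, c, d, j or q, for example); English words such as "the" still pass, because every one of their letters also exists in Na'vi.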

The new counts.txt includes this letter table, as well as a list of definitely non-Na'vi words that the program filtered out.

A way to count words based on root forms is in the works, but is quite a bit more difficult.  I also just started working with Tirea Aean on getting his vrrtepCLI to detect affixes -- hopefully if we can get that working, I can reuse a bunch of the code.

'Oma Tirea

Quote from: Futurulus on August 20, 2011, 12:09:45 AM
Quote from: 'Oma Tirea on August 16, 2011, 04:02:50 AM
Interesting...

How about Na'vi letter frequency?  That would be really telling :)
Here you go, ma 'Oma Tirea:

Irayo nìtxan, +1.  However 'RR is missing.


Futurulus

Quote from: 'Oma Tirea on August 20, 2011, 12:21:33 AM
Irayo nìtxan, +1.  However 'RR is missing.

oo, good catch.  RR clocks in at number 27, and drops R from 8th place to 12th.
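
The reason R drops when RR gets its own row can be sketched like this: split each word into Na'vi letters, always preferring a two-character letter when one matches. This is a simple greedy approach assumed for illustration, not necessarily what the program does, and the names `DIGRAPHS`, `SINGLES`, `navi_letters` and `letter_counts` are hypothetical.

```python
from collections import Counter

# Na'vi letters written with two characters.
DIGRAPHS = ("aw", "ay", "ew", "ey", "kx", "ll", "ng", "px", "rr", "ts", "tx")
# Na'vi letters written with one character (g and x never stand alone).
SINGLES = set("'aäeiìoufhklmnprstvwyz")


def navi_letters(word):
    """Yield the Na'vi letters of a word, greedily preferring digraphs."""
    w = word.lower()
    i = 0
    while i < len(w):
        pair = w[i:i + 2]
        if pair in DIGRAPHS:
            yield pair
            i += 2
        else:
            # Characters outside the alphabet are silently skipped.
            if w[i] in SINGLES:
                yield w[i]
            i += 1


def letter_counts(words):
    """Count Na'vi letters over an iterable of words."""
    counts = Counter()
    for word in words:
        counts.update(navi_letters(word))
    return counts


if __name__ == "__main__":
    print(list(navi_letters("nìprrte'")))
    print(letter_counts(["prrte'", "rutxe"]))
```

With this, the two r characters in prrte' count once toward RR and not at all toward R, which is the kind of shift described above. The greedy match is only an approximation: it will mis-split a word wherever two separate letters happen to sit next to each other and look like a digraph (an a followed by a y belonging to the next syllable, say), so a careful count would need syllable or dictionary information.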

Tirea Aean

Quote from: Futurulus on August 20, 2011, 12:48:02 AM
Quote from: 'Oma Tirea on August 20, 2011, 12:21:33 AM
Irayo nìtxan, +1.  However 'RR is missing.
oo, good catch.  RR clocks in at number 27, and drops R from 8th place to 12th.

And based on this data, here is a graph. Just cuz I like graphs. ;)



.xls attached.

Kamean

Tse'a ngal ke'ut a krr fra'uti kame.


'Oma Tirea
