I've managed to throw something together as a proof of concept. It's a short script (weighing in at 44 lines, comments included!) that reads a "dictionary" file which "explains" to it the parts each word consists of. The file looks like this:
tslolam~tslam,-ol-
fìtseng~fì-,tseng
pefnekelku~pe-,fne-,kelku
peseng~pe-,tseng
kaltxì~kaltxì
frapoya!~fra-,po,-ya
lu~lu
txana~txan
krr~krr
fwa~fwa
tuteo~tute,-o
pamrel~pamrel
si~ignore
’uor~'u,-o,-ur
fìtsteng.~fì-,tseng
fwa~fwa
tìrey~tìrey
fìtsenge~fì-,tseng
za’u~za'u
a~a
fì’u~fì-,'u
oeru~oe,-ru
teya~teya
seiyi~-ei-
nìngay.~nìngay
srake~srake
kop~kop
som~som
fìtxan~fìtxan
ro~ro
aysenge~ay+,tseng
tok~tok
ayngal?~ay+,nga,-l
tewti.~tewti
fìtseng,~fì-,tseng
som~som
fìtxan~fìtxan
fko~fko
ke~ke
tsängun~tsun,-äng-
kem~kem
sivi~-iv-
ke’ur~ke'u,-r
stum.~stum
nì’aw~nì'aw
tìhuseyn~heyn,tì-us-
ke~ke
ftue~ftue
taluna~taluna
pay~pay
asyä’ä~syä'ä
ta~ta
tokx~tokx
za’u…~za'u
–~ignore
(~ignore
roll~ignore
eyes~ignore
oeru~oe,-ru
txoa~txoa
livu~lu,-iv-
fpi~fpi
This way, not only can the roots be counted, but affixes can be counted as well.
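In Python, loading a file in that word~components format boils down to just a few lines. Here's a rough sketch of the idea rather than the script verbatim (the file and variable names are placeholders):

# Load the dictionary: each line is "word~component,component,...".
dictionary = {}
with open("dictionary.txt", encoding="utf-8") as f:
    for line in f:
        line = line.strip()
        if "~" in line:
            word, _, parts = line.partition("~")
            dictionary[word] = parts.split(",")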
Then you give it a text file. I copied a bit from nìNa’vi nì’aw:
Kaltxì frapoya!
lu txana krr fwa tuteo pamrel si ’uor fìtsteng. Fwa tìrey ne’ìm fìtsenge za’u a fì’u oeru teya seiyi nìngay.
Srake lu kop som fìtxan ro aysenge a tok ayngal? – tewti. Fìtseng, lu som fìtxan fwa fko ke tsängun kem sivi ke’ur stum. Nì’aw tìhuseyn lu ke ftue taluna pay asyä’ä ta tokx za’u… ( Roll Eyes oeru txoa livu fpi lì’u slä ke omum oel fya’ot a oel plltxe tsat nìketeng…)
Tse, ma Taronyu, tsamun nga hivahaw tsatì’i’a srak?
Ìlä fìskxom zerok oel tìrolti a fkol rol mì rel arusikx a syaw san« Tsray »sìk, srake ayngal tsat omum? ’Uo na fì’u:
Ma prrnen, hivahaw
nìmwey hivahaw
tìrey lu ngim sì tìyawn lu teya hu ral
krr layu kalin ngafpi
’änsyema kifkeyti tsive’a
lu krr fte tsive’a fte kivame
hufwa tìvawm za’u sì kä
fya’o a *hufwevil ayutralti ’ärip
fya’o a prrnesyul ’ong
It "looks" at each word and checks if it's in the dictionary. If it is, it adds +1 to each of that word's components under the dictionary entry. If it isn't, it prints that word to an "Unlisted Words" file as a blank dictionary entry, like so:
ne’ìm~
lì’u~
slä~
omum~
oel~
fya’ot~
oel~
plltxe~
tsat~
nìketeng…)~
tse,~
ma~
taronyu,~
tsamun~
nga~
hivahaw~
tsatì’i’a~
srak?~
ìlä~
fìskxom~
zerok~
oel~
tìrolti~
fkol~
rol~
mì~
rel~
arusikx~
syaw~
san«~
tsray~
»sìk,~
ayngal~
tsat~
omum?~
’uo~
na~
fì’u:~
ma~
prrnen,~
hivahaw~
nìmwey~
hivahaw~
ngim~
sì~
tìyawn~
hu~
ral~
layu~
kalin~
ngafpi~
’änsyema~
kifkeyti~
tsive’a~
fte~
tsive’a~
fte~
kivame~
hufwa~
tìvawm~
sì~
kä~
fya’o~
*hufwevil~
ayutralti~
’ärip~
fya’o~
prrnesyul~
’ong~
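In Python, that counting pass is roughly the following. Again, this is a sketch of the idea, not the script verbatim; it assumes the dictionary was loaded as in the earlier snippet, and the file names are placeholders:

# Start every known component at 0 so unused ones still appear in the totals.
counts = {part: 0 for parts in dictionary.values() for part in parts}

with open("text.txt", encoding="utf-8") as f:
    words = f.read().lower().split()   # lowercased, split on whitespace

with open("unlisted.txt", "w", encoding="utf-8") as out:
    for word in words:
        if word in dictionary:
            for part in dictionary[word]:
                counts[part] += 1      # +1 per component of the word
        else:
            out.write(word + "~\n")    # blank entry, ready to fill in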
After reading the entire file, the script writes the final counts to a comma-separated values (CSV) file:
krr,3
kaltxì,1
tewti,1
pe-,0
a,7
ke,3
srake,2
fne-,0
fìtxan,2
fko,1
za'u,3
ignore,5
teya,2
fì-,4
fpi,1
kop,1
-ya,1
heyn,1
kem,1
pay,1
nìngay,1
tì-us-,1
po,1
fra-,1
fwa,3
pamrel,1
'u,2
-ur,1
-ru,2
-o,2
txoa,1
syä'ä,1
-l,1
tslam,0
-ei-,1
ftue,1
taluna,1
som,2
ta,1
tok,1
tseng,4
lu,8
nga,1
-r,1
ro,1
-iv-,2
txan,1
tute,1
stum,1
nì'aw,1
kelku,0
tsun,1
-ol-,0
ay+,2
oe,2
-äng-,1
ke'u,1
tokx,1
tìrey,2
This can easily be loaded into a spreadsheet program, sorted, and used however you please.
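The CSV-writing step itself is nothing more than a loop over the totals (again a sketch, assuming the counts variable from the snippet above):

# Write the final component counts as comma-separated values.
with open("counts.csv", "w", encoding="utf-8") as f:
    for part, count in counts.items():
        f.write(f"{part},{count}\n")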
After processing a batch of text, you would go through the Unlisted Words file, fill in the components for each word, copy and paste the completed entries into the dictionary file, and re-parse the text until no unlisted words remain.
Please keep in mind that this script is not polished; I only threw it together to show that this is possible. It does the bare minimum and could be improved to deal with the quirks of parsing text like this.
For example, you'll notice that some punctuation ends up "stuck" to words in the Unlisted Words entries. This could be fixed by giving the script a list of characters to strip when examining a word, so that something like srak? would be treated as srak, without the question mark.
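Something along these lines would do it; note that the apostrophe (’) must stay off the strip list, since it's part of Na'vi words like ’u:

# Characters to strip from the edges of a word before the lookup.
PUNCTUATION = ".,!?:;«»()–…"

def clean(word):
    return word.strip(PUNCTUATION)

print(clean("srak?"))   # -> srak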
As for recognizing si-verbs, names, and other multi-word phrases and constructions, I could add a "pre-parser" that scans entire sentences for known multi-word phrases before examining words one by one. And if you're worried about usernames and the like being counted as regular words, you can always scan the text yourself and delete them by hand before handing it over to the script.
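Such a pre-parser could be as simple as fusing known phrases into single tokens before the text is split into words. A sketch, with a made-up token for the si-verb pamrel si:

# Fuse known multi-word phrases into single tokens.
PHRASES = {
    "pamrel si": "pamrel_si",   # hypothetical token for a si-verb
}

def preparse(text):
    for phrase, token in PHRASES.items():
        text = text.replace(phrase, token)
    return text

print(preparse("tuteo pamrel si ’uor"))   # -> tuteo pamrel_si ’uor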
As for things like BBCode and HTML tags (and their contents), I imagine they could easily be taken care of with regex matching; I work with programs that are "smart" enough to recognize patterns like this. I could write a script that removes either just the tags themselves, or the tags and everything in between them, and the same approach would work for both BBCode and HTML.
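For instance (a sketch; the patterns are deliberately simple and would need tuning for real forum markup):

import re

# Remove just the tags themselves, keeping the text between them:
def strip_tags(text):
    text = re.sub(r"\[/?\w+[^\]]*\]", "", text)   # BBCode: [b], [/url], ...
    return re.sub(r"</?\w+[^>]*>", "", text)      # HTML: <b>, </a>, ...

# ...or remove the tags and everything in between them:
def strip_tagged_blocks(text):
    text = re.sub(r"\[(\w+)[^\]]*\].*?\[/\1\]", "", text, flags=re.DOTALL)
    return re.sub(r"<(\w+)[^>]*>.*?</\1>", "", text, flags=re.DOTALL)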
I would also like to read the earlier discussion about scripts, and in particular what scripts "can't handle"; I may be able to find a way to make it work.