The Na'vi Librarian's Lament, or Collation Algorithm Problems Unique to Na'vi

Yawne Zize’ite · June 16, 2012, 07:56:48 PM

First, this could imaginably go in a lot of places; Intermediate seems the least bad, as it's not an active project and is a bit complex, but if this should be elsewhere please move it.

One of my projects the last time Na'vi piqued my interest was a collation algorithm. As collation algorithms go, it seemed to do its job; there was a place for all 33 letters and every letter had its place, including the position of ʼ at the beginning of the alphabet instead of the end (which is unique to Na'vi among all languages using the Latin alphabet!) and the diphthongs.

The diphthongs are the problem. Being a collation algorithm, it's stupid, and would interpret any sequence of a/e + w/y as a diphthong+vowel sequence and collate accordingly: yayayr, yaymak, yayo, etc, when the correct collation is most likely yayayr, yayo, yaymak; a+consonant should come before all words with ay. As far as I know, the problem of an ambiguous digraph that affects collation order is unique to Naʼvi, although I suspect someone who knows a Central European language could produce an example - in fact, I would be very grateful for such an example, since that would point the way towards a solution. Slovak, maybe? (Downthread, Blue Elf notes that this is a problem in Czech and Slovak, which treat "ch" as one letter for collation but have a few compound words with c+h.)

This is already beyond what the Unicode Collation Algorithm can handle, as far as I know (which means that correct Naʼvi sorting is unlikely to ever be possible without special-purpose programs), but if one were writing a collation program from scratch it doesn't sound hard to add in an extra rule: treat a/e + w/y + a following vowel or pseudovowel as (vowel) + (semivowel) + (vowel) not (diphthong) + (vowel). Then this hypothetical program encounters <tswayon>, promptly parses it as <tswa.yon>, and files it before a hypothetical word *<tswazu>. Oops.
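The extra rule can be sketched in a few lines of Python. This is only an illustration, not a real collator: the letter order in ALPHABET is an assumption (the thread itself only pins down ʼ first and ll after l), and the names letters and sort_key are made up for the example. It splits a word into alphabet letters, reads a/e + w/y as vowel + semivowel whenever a vowel or pseudovowel follows, and reproduces the tswayon misfiling.

```python
import unicodedata

# Assumed letter order (33 letters); only "ʼ first" and "ll after l" come from the thread.
ALPHABET = "' a aw ay ä e ew ey f h i ì k kx l ll m n ng o p px r rr s t tx ts u v w y z".split()
RANK = {letter: i for i, letter in enumerate(ALPHABET)}
VOWELS = set("aäeiìou")
PSEUDOVOWELS = ("ll", "rr")
DIGRAPHS = ("kx", "px", "tx", "ts", "ng", "ll", "rr")

def starts_nucleus(s, i):
    """True if a vowel or a pseudovowel (ll, rr) begins at position i."""
    return s[i:i + 1] in VOWELS or s[i:i + 2] in PSEUDOVOWELS

def letters(word):
    """Split a word into alphabet letters using the naive a/e + w/y rule."""
    s = unicodedata.normalize("NFC", word.lower()).replace("ʼ", "'")
    out, i = [], 0
    while i < len(s):
        two = s[i:i + 2]
        if len(two) == 2 and two[0] in "ae" and two[1] in "wy":
            if starts_nucleus(s, i + 2):
                out.append(s[i])   # plain vowel; the w/y is read as a consonant next
                i += 1
            else:
                out.append(two)    # diphthong letter
                i += 2
        elif two in DIGRAPHS:
            out.append(two)
            i += 2
        else:
            out.append(s[i])
            i += 1
    return out

def sort_key(word):
    # Raises KeyError on characters outside the assumed alphabet.
    return [RANK[ch] for ch in letters(word)]

print(letters("yayo"), letters("yaymak"))
# ['y', 'a', 'y', 'o'] ['y', 'ay', 'm', 'a', 'k']
print(sorted(["yaymak", "yayo", "yayayr"], key=sort_key))
# ['yayayr', 'yayo', 'yaymak']
print(letters("tswayon"), sorted(["tswazu", "tswayon"], key=sort_key))
# ['ts', 'w', 'a', 'y', 'o', 'n'] ['tswayon', 'tswazu']
```

The last line is the "oops": the rule splits the diphthong in tswayon, so it lands before *tswazu.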

This is where I'm out of ideas. In a somewhat similar extreme case (German collation of foreign names with ä, ö, ü diaeresis instead of umlaut, which would have different sorting rules) the Unicode Consortium recommended using the nonprinting character Combining Grapheme Joiner to separate the base letter and diacritical mark to break the normal order. (Edit: this is also recommended for Czech and Slovak c+h combinations.) The most Unicode-compliant solution I can think of to Naʼvi sorting is to use some preprocessing program to add CGJ between a/e and w/y to break up all sequences of a/e + w/y + vowel except for any words found on a specific exception list, then run the preprocessed list through the Unicode Collation Algorithm (with Naʼvi tailoring). I'm sure someone who doesn't have a headache can think of a better idea, but that someone is not me. Help, please?
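The preprocessing step described above might look roughly like this in Python. It is a sketch under assumptions: the exception list holds only tswayon, the name add_cgj is invented for the example, and the collation step itself (a UCA implementation with a Naʼvi tailoring) is left out because no such tailoring is specified here.

```python
import re

CGJ = "\u034f"  # U+034F COMBINING GRAPHEME JOINER

# Hypothetical exception list: words where a/e + w/y + vowel really is diphthong + vowel.
DIPHTHONG_BEFORE_VOWEL = {"tswayon"}

# a or e, followed (lookahead only) by w/y and then a vowel or a pseudovowel (ll, rr).
_BREAKABLE = re.compile(r"([ae])(?=[wy](?:[aäeiìou]|ll|rr))", re.IGNORECASE)

def add_cgj(word):
    """Insert CGJ between a/e and w/y so a tailored collator would not see a diphthong."""
    if word.lower() in DIPHTHONG_BEFORE_VOWEL:
        return word  # leave the diphthong intact
    return _BREAKABLE.sub(lambda m: m.group(1) + CGJ, word)

for w in ["yayo", "yaymak", "yayayr", "tswayon"]:
    print(w, ascii(add_cgj(w)))
# yayo 'ya\u034fyo'
# yaymak 'yaymak'
# yayayr 'ya\u034fyayr'
# tswayon 'tswayon'
```

A whole-word exception list is the crudest possible design; a word containing both kinds of sequence would need per-position marking instead.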

Edited to replace the incorrect example with a correct one and note some new information.

Plumps · June 17, 2012, 08:36:02 AM

I'm not altogether sure whether I see the problem... :-\

What you're describing is more a problem for syllabification than sorting.

Quote from: Yawne Zize'ite on June 16, 2012, 07:56:48 PM
This is already beyond what the Unicode Collation Algorithm can handle, as far as I know (which means that correct Na'vi sorting is unlikely to ever be possible without special-purpose programs), but if one were writing a collation program from scratch it doesn't sound hard to add in an extra rule: treat a/e + w/y + a following vowel as (vowel) + (semivowel) + (vowel) not (diphthong) + (vowel). Then this hypothetical program encounters <tswayon>, promptly parses it as <tswa.yon>, and files it before a hypothetical word *<tswazu>. Oops.

I can't give you a solution for the diphthong problem, however I can give you an example of German where there can be the same difference:

‹eu› is very often treated as a diphthong, like the oy in English destroy.
However, it can happen, especially with the prefixes be- and ge-, that e and u come together without forming a diphthong, e.g. beurteilen (assess), which comes from urteilen (judge) – so there you'd pronounce it be.ur.tei.len. Nevertheless, it is not treated differently in lexical sorting. Beule (bump; spoken with a diphthong) still comes before beurteilen.

‹äu› has the same pronunciation as ‹eu›. Although ä is a letter on its own, the trema is ignored in lexical sorting. So, although in our reciting of the alphabet we list ä, ö, and ü after z, there is no separate section in the Duden after Z, as in Swedish with å, ä, and ö. Bäurisch (rustic; from Bauer, farmer or pawn) comes after Baureihe (production series) and before Bauruine (abandoned construction).
This of course does not happen in Na'vi

Nevertheless, for me the tswayon – *tswazu example is totally okay, as long as ay, ey, aw, and ew are not listed separately in the dictionary before, which I think we don't do. The entries for ey would be pretty thin (i.e. eyawr, eyk, eyktan, Eywa, Eywa'eveng), and for the others there would be even fewer.

Quote from: Yawne Zize'ite on June 16, 2012, 07:56:48 PM
The diphthongs are the problem. Being a collation algorithm, it's stupid, and would interpret any sequence of a/e + w/y as a diphthong+vowel sequence and collate accordingly: 'ewan, 'ewll, *'ewo, etc, when the correct collation is most likely 'ewan, *'ewo, 'ewll; e+consonant should come before all words with ew. As far as I know, the problem of an ambiguous digraph that affects collation order is unique to Na'vi, although I suspect someone who knows a Central European language could produce an example - in fact, I would be very grateful for such an example, since that would point the way towards a solution. Slovak, maybe?

According to the Alphabet that Frommer released two years ago, ll comes after l, so in my opinion, the sorting of 'ewan, 'ewll, and *'ewo is completely correct.

Yawne Zize’ite · June 17, 2012, 10:26:20 AM

The diphthongs are the crux of the problem. If they are treated as letters in their own right, ʼewan is spelled ʼ-e-w-a-n, *ʼewo would be spelled ʼ-e-w-o, but ʼewll is spelled ʼ-ew-ll and should be alphabetized after all words whose second letter is e. (A rough German equivalent would be the older spelling of ä ö ü as ae oe ue.)

Right now the dictionary does not treat the diphthongs as independent letters (the only words beginning with aw- and ay- are plural pronouns, but ey- indisputably has eyk, eyktan, Eywa, and Eywaʼeveng), probably because correctly sorting the diphthongs is very difficult.

So that's the dilemma: reject aw, ay, ew, and ey as letters of the alphabet, at least for collation purposes (Spanish used to collate ch and ll as separate letters but that was officially changed in 1994 under pressure from international standardization bodies), or keep them as full letters of the alphabet and figure out some way to disambiguate ew (the letter) from e-w (the letter e next to the letter w). I've read that Czech and Slovak have the same problem with the digraph ch in a few compound words.

Plumps · June 17, 2012, 11:58:38 AM

Right, now I understand, what you mean ... I think it should be handled as-is. If a dictionary has included IPA then one will know from the syllabification, whether these are diphthongs or not.

Quote from: Yawne Zize'ite on June 17, 2012, 10:26:20 AM
The diphthongs are the crux of the problem. If they are treated as letters in their own right, ʼewan is spelled ʼ-e-w-a-n, *ʼewo would be spelled ʼ-e-w-o, but ʼewll is spelled ʼ-ew-ll and should be alphabetized after all words whose second letter is e. (A rough German equivalent would be the older spelling of ä ö ü as ae oe ue.)

And that's not correct, I'm sorry. It is '-e-w-ll as well, because ll cannot be an onset of a syllable. Thus syllablewise it has to be 'e.wll

Yawne Zize’ite · June 17, 2012, 01:08:01 PM

Quote from: Plumps on June 17, 2012, 11:58:38 AM
Right, now I understand, what you mean ... I think it should be handled as-is. If a dictionary has included IPA then one will know from the syllabification, whether these are diphthongs or not.

So you're for counting aw, ay, ew, and ey as letters for teaching, but not for collation, similar to the Spanish example and for the same reason (difficulties with computerization)?

Quote
Quote from: Yawne Zize'ite on June 17, 2012, 10:26:20 AM
The diphthongs are the crux of the problem. If they are treated as letters in their own right, ʼewan is spelled ʼ-e-w-a-n, *ʼewo would be spelled ʼ-e-w-o, but ʼewll is spelled ʼ-ew-ll and should be alphabetized after all words whose second letter is e. (A rough German equivalent would be the older spelling of ä ö ü as ae oe ue.)
And that's not correct, I'm sorry. It is '-e-w-ll as well, because ll cannot be an onset of a syllable. Thus syllablewise it has to be 'e.wll

Oops!

A better example is yayo y-a-y-o vs. yaymak y-ay-m-a-k, which the dictionary has reversed. (I don't put much faith in the ordering of the current edition of the dictionary, since it has severe problems with words containing ì.)

Plumps · June 17, 2012, 01:29:33 PM

Quote from: Yawne Zize'ite on June 17, 2012, 01:08:01 PM
Quote from: Plumps on June 17, 2012, 11:58:38 AM
Right, now I understand, what you mean ... I think it should be handled as-is. If a dictionary has included IPA then one will know from the syllabification, whether these are diphthongs or not.

So you're for counting aw, ay, ew, and ey as letters for teaching, but not for collation, similar to the Spanish example and for the same reason (difficulties with computerization)?

Yes, I think so :-\

Since I don't print out my dictionaries but use them either online or in pdf form, I search for the word and don't pay that much attention to the sorting. To me this will only be of importance once a Na'vi Dictionary goes into print (I'm still hoping).

But I'm interested in what the others think.

Quote from: Yawne Zize'ite on June 17, 2012, 01:08:01 PM
Oops! A better example is yayo y-a-y-o vs. yaymak y-ay-m-a-k, which the dictionary has reversed. (I don't put much faith in the ordering of the current edition of the dictionary, since it has severe problems with words containing ì.)

Yes, that's true and I noticed that as well when I copy-&-pasted from it to use it for a project of mine.

Yawne Zize’ite · June 17, 2012, 02:08:16 PM

Yawo y-a-w-o and the yawn- group y-aw-n-* are the other ones I distinctly recall as being reversed in the dictionary.

I'd really like to find out what people who actually use these things think should be, rather than is; right now I'm afraid that too much of Naʼvi alphabetization and collation is being driven by the language masters all using US keyboards with US copies of Excel set up to use a near-default collation since English doesn't accord alphabetic status to any of its digraphs. (English collation is generally simple, probably because any attempt to treat digraphs as letters would run headlong into the disastrous spelling system.) Once there is some consensus on what should be, then someone (maybe me) can get to work on making it happen.

Blue Elf · June 18, 2012, 01:07:03 AM

Quote from: Yawne Zize'ite on June 17, 2012, 10:26:20 AM
So that's the dilemma: reject aw, ay, ew, and ey as letters of the alphabet, at least for collation purposes (Spanish used to collate ch and ll as separate letters but that was officially changed in 1994 under pressure from international standardization bodies), or keep them as full letters of the alphabet and figure out some way to disambiguate ew (the letter) from e-w (the letter e next to the letter w). I've read that Czech and Slovak have the same problem with the digraph ch in a few compound words.

Yes, that's true. In Czech, "ch" is located between "h" and "i" and is treated as a single letter (which annoys me a little; I'd prefer to have "ch-" words located under "c"). There are a few words where "ch" works like two distinct letters (see Wikipedia; another example is báchamr). But these words appear to me to be slang or the jargon of a specific profession.

Seze Mune · June 18, 2012, 09:25:39 AM

This subject is a little deep for my language skillset, but if it helps at all, I use the dictionary the way Plumps does. I ride carelessly over the details and set my crosshairs on whole words regardless of their relative order in the dictionary.

`Eylan Ayfalulukanä · June 18, 2012, 09:22:36 PM

The order of letters in the Na'vi alphabet can be totally arbitrary, as the language does not exist in a real historic sense. We really don't know how a native Na'vi would order the alphabet, and what rules they would use in doing so.

But since this is a constructed language, and its creator speaks primarily American English, it is not surprising that the alphabet that he adopted is not much different than the English alphabet. There is absolutely no reason (other than learnability, probably) that the Na'vi alphabet has the order it has. This issue is even less important when you remember that Na'vi is primarily a spoken language. The only writing systems that exist for it are K. Pawl's adaptation of the Latin alphabet, and a couple of fan-created symbol sets that have become more or less 'standardized'.

So, rather than trying to fit Na'vi into a mold, it should be left to be its own 'animal'. As I see it (and the symbol set creators feel the same way), there should be (at minimum) a symbol for each letter, including what we call diphthongs and digraphs. A single symbol for a letter is very codifiable in Unicode, and there is a more-or-less agreed-on order for the symbols.

The dictionary is generated in LaTeX, which is far more flexible in terms of making an unusual letter order 'work'. It should be possible to have a separate entry, in the right order, for all the diphthongs and digraphs. If we do not want a unique symbol set, there are enough left-over letters in Latin alphabets and in IPA to create a complete and unambiguous Na'vi alphabet using all unique-but-known symbols.

If one were to teach Na'vi completely from scratch, with no reference to the Latin alphabet, they would be best to use one of the fan-created symbol sets and teach it as a unique alphabet. If this were adopted, in place of the Latin alphabet and all its problems, a collation order would be much easier to do.

As the maintainer of a conlang dictionary (for Dothraki), I would love to do the same thing there. But like Na'vi, Dothraki is a spoken-only language, and a symbol set is kind of meaningless at the moment. But there are digraphs (no diphthongs!) that have the same issues as the ones in Na'vi. And like Na'vi, Dothraki is represented for learning by a Latin alphabet. The most common digraphs in the language are ch, sh, th and zh. It would be really nice to break these out as separate letters, but this won't happen until I am much better at LaTeX. Dothraki's creator (David Peterson) is well-known for some of the beautiful symbol sets he has created. But nothing exists for Dothraki, as there is no precedent (and a much smaller fan base) for such a symbol set.

To sum this all up, I am happy with Na'vi as it is now, and the limitations in the ability to collate its words. But long-term, being able to separate all the digraphs and diphthongs into separate unique letters (with K. Pawl's collation order) would be a good thing.

wm.annis · June 18, 2012, 09:44:00 PM

You might find this perl library interesting: http://search.cpan.org/~sburke/Sort-ArbBiLex-4.01/ArbBiLex.pm

It is designed to allow you to define the sort order of both characters and digraphs.
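For anyone who doesn't read Perl, the general idea (an ordered table of letters, some of them digraphs, matched longest-first) can be approximated in Python. This is a rough analogue written for illustration, not ArbBiLex's actual interface; make_sort_key and the Na'vi letter order given to it are assumptions.

```python
import re

# Assumed Na'vi letter order, digraphs included.
NAVI_ORDER = "' a aw ay ä e ew ey f h i ì k kx l ll m n ng o p px r rr s t tx ts u v w y z".split()

def make_sort_key(letters_in_order):
    """Build a sort key function from an ordered list of letters and digraphs."""
    rank = {letter: i for i, letter in enumerate(letters_in_order)}
    # Try longer letters first so "ts" is preferred over "t" followed by "s".
    longest_first = sorted(rank, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(l) for l in longest_first))

    def key(word):
        text = word.lower().replace("ʼ", "'")
        # Characters outside the table are silently skipped.
        return [rank[m.group()] for m in pattern.finditer(text)]

    return key

navi_key = make_sort_key(NAVI_ORDER)
print(sorted(["yayo", "yaymak", "yayayr"], key=navi_key))
# ['yayayr', 'yaymak', 'yayo']
```

Because matching is greedy, every a/e + w/y is read as a diphthong, so yayo is keyed as y-ay-o and still lands after yaymak. A table alone does not settle the ambiguous cases.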

Yawne Zize’ite · June 18, 2012, 11:15:26 PM

I'm taking an etic view of the matter, not an emic view; a native Naʼvi speaker would order the alphabet however the ʼRrtan linguist who designed it told her to! I'm not being facetious; the current order of the English alphabet and other Latin-based alphabets can be traced directly back to one of the two alphabetical orders in use in Ugarit roughly 3,300 years ago, which, as far as anyone knows, was arbitrary then. (The order of a native script will differ, but if human history is a guide a highly successful native script would be a defective syllabary.)

Uniltìrantokx te Skxawng has a similar philosophy regarding the desirability of using only single letters, no digraphs. I used to strongly feel that single letters were better, and I still like them better from an esthetic point of view, but I've come around to finding unambiguous digraphs often better than the alternatives. I wouldn't have even started this topic if Naʼvi didn't have a lot of ambiguous digraphs. Many Central European languages use digraphs as part of their alphabets and treat them as full letters; they take up one square in crossword puzzles, are included whole in abbreviations, and are independent headings. So the technology is there to treat them as single letters.

(Anyone can come up with a quick Naʼvi alphabet without digraphs: my attempt is ʼ a ă á æ e ĕ é f h i ì k k̓ l ĺ m n g o p p̓ r ŕ s t t̕ c u v w y z, or if I get to pick the alphabetical order too a ă á æ c e ĕ é f g h i ì k k̓ l ĺ m n o p p̓ r ŕ s t t̕ u v w y z ʼ. No attempt has been made to choose diacritics for their sorting order, and I used Americanist rather than Ethiopianist letters for the ejectives.)

The last time I asked about this I assumed there was some way to change sort order in LaTeX, but I never found what it was. How do you generate that Dothraki dictionary? Is there a SQL backend?

A unique symbol set is a curse as much as it is a blessing. I'm primarily a student of Quenya, as my avatar indicates, so I've given some thought to tengwar collation. One, tengwar aren't encoded in Unicode, at all, so actually writing them requires using specialized fonts using one of three incompatible standards and the Private Use Area. Two, no one has thought about how to do tengwar collation (as far as I can tell), so there isn't even a definitive order of the alphabet past the most-used 36 letters. Three, no one has thought about how to order the diacritics either. The current codepoint-based order is acceptable to me, in that nothing is indisputably out of place. I'd sort the extended grades differently (they're currently between vilya and rómen; I'd place extended-3 between hwesta and anto, and extended-4 between unque and númen), but there simply is no standard. There isn't even a standard for tengwar spelling. (For that matter, there's not a firm standard for Quenya spelling in the Latin alphabet!)

I am again reminded that I should look at Hungarian, which is likely the only language with collation problems as severe as Naʼvi, and for the same reason: ambiguous digraphs. (Hungarian allows ambiguous sequences involving digraphs; házszám could be *házs-zám or the correct ház-szám.)

ArbBiLex looks like it can do as much as a basic UCA plugin; that is, it's trivial to enter everything except the digraph breaking rules, and that step would require an extra program to break digraphs.

Yawne Zize’ite · June 25, 2012, 01:09:44 AM

So I have done a little research into Hungarian sorting, which has exactly the same list of problems as Na'vi sorting (digraphs considered single letters and ambiguous multigraphs which must be parsed word by word). The answer is that for general purposes Hungarians have learned to tolerate mildly incorrect automatic sorting. For a dictionary sorting should be done properly, and it's possible using a UCA-based collator and Combining Grapheme Joiner. However, Tuiq uses a Perl script instead, and I don't know Perl and the overall system well enough to suggest a workable way to sort properly. (My guess is that you're using straight ASCII codepoint ordering, since { comes after z. That leaves three options: refile all incorrect digraphs manually; switch to a UCA collator like Unicode::Collate and add Naʼvi-specific tailoring; or insert { after a/e+w/y, except before vowels, tswayon excepted.)
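The third option can be tried out with a few lines of Python standing in for the Perl script. This is a sketch under assumptions: brace_key and the hand-marked exception table are hypothetical, and sorting is by plain codepoint, which is only a guess about the existing setup.

```python
import re

# Hypothetical hand-marked exceptions: diphthong + vowel words, with the marker already placed.
MARKED_BY_HAND = {"tswayon": "tsway{on"}

# a/e + w/y that is NOT followed by a vowel or a pseudovowel (ll, rr), i.e. a true diphthong.
_DIPHTHONG = re.compile(r"([ae][wy])(?![aäeiìou]|ll|rr)")

def brace_key(word):
    """Sort key for plain codepoint ordering: append '{' to each diphthong."""
    w = word.lower()
    if w in MARKED_BY_HAND:
        return MARKED_BY_HAND[w]
    return _DIPHTHONG.sub(r"\1{", w)

print([brace_key(w) for w in ["yaymak", "yayo", "yayayr"]])
# ['yay{mak', 'yayo', 'yayay{r']
print(sorted(["yaymak", "yayo", "yayayr"], key=brace_key))
# ['yayayr', 'yayo', 'yaymak']
print(sorted(["tswazu", "tswayon"], key=brace_key))
# ['tswayon', 'tswazu']
```

In this sketch the marker puts yayo ahead of yaymak, but tswayon still sorts before *tswazu, because the y is compared against the z before the { is ever reached. Moving a diphthong past a + z would need a key that replaces the whole diphthong, not one that appends to it.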