Eana Eltu: Translator, Dictionary, API and putxìng.

baritone · November 05, 2013, 12:46:33 PM

I've been wanting to write a program that would do something like vrrtepcli.
I've asked Tirea Aean about his plans about vrrtepkli via facebook, but has not yet received a response.

`Eylan Ayfalulukanä · November 05, 2013, 05:09:04 PM

I believe VrrtepCLI is open source. Grab it and run with it!

Ulte ma Tuiq, woulr ee.net somehow run under Linux or *nix?

Tuiq · November 05, 2013, 07:41:09 PM

It would be rewritten in ASP.NET MVC 3/4, which is fully/partially supported by recent Mono versions. So it can continue being hosted on LN.org. You wouldn't be able to tell a difference between Perl and ASP.NET, I hope, except for a few neat changes. I've written about them extensively in another thread of mine I believe, but to summarize it:

Unified login
Your login for this forum is also your login for EE.NET. You cannot create accounts, forget passwords or anything on EE.NET, that will be this forum's job then.

Inheritance and named fields
Instead of having \type with countless args, the new system will have word types that may inherit from each other (base < word < verb < transitional verb). It sounds a bit odd, but it should come easy. Let's assume we have a root word type called root, which has the fields ("localized" [T], "native", "ipa"), where [T] marks that this field may be changed by translators. We can then inherit from root, for example cw2 (connected word? I'm sure you know what it means) with fields ("localized" [T], "native", "ipa", "localized cw1" [T], "native cw1", "localized cw2" [T], "native cw2"). Because every cw2 is also a root word, we can transform cw2(localized: "together", native: "'awsiteng", ipa: "Paw.si.[textprimstress] tEN", localized cw1: "one", cw1: "'aw", localized cw2: "same, equal", cw2: "teng") into a root(cw2(localized: "together", native: "'awsiteng", ipa: "Paw.si.[textprimstress]"). As you see, what made cw "unique" was simply dropped.

This is a big step upwards. Because, right now, it looks like this:
\cw{'awsiteng}{Paw.si.\textprimstress tEN}{together}{'aw}{one}{teng}{same, equal}

The order of the arguments isn't necessarily even the same, which I'll show with a little bit of code:

Code Select


  if ($type eq 'pcw') {
   $mI = 4; $sI = 10;
  }
  elsif ($type eq 'affixN') {
   $wtI = 4; $mI = 3; $sI = 8;
  }
  elsif ($type eq 'alloffixN') {
   $wtI = 5; $mI = 4; $sI = 9;
  }
  elsif ($type eq 'markerN' || $type eq 'derivingaffixN') {
   $mI = 3; $sI = 7;
  }
  elsif ($type eq 'infixN') {
   $mI = 3; $pI = 4; $sI = 8;
  }
  elsif ($type eq 'infixcwN') {
   $mI = 3; $pI = 4; $sI = 10;
  }

With named fields (and inheritance), this becomes a non-issue. The code above searches for the shorthand, position, meaning and word type of a word. Since it differs from word to word, every possible exporter has to deal with these exceptions to "interpret" the data. This won't be an issue anymore, writing exporters becomes easier and they can become more powerful.

BBCodes
As mentioned above ([textprimstress]), the data itself will not be tied to LaTeX anymore. Instead, BBCodes will be used which replace this functionality. BBCodes themselves do nothing, it's up to the exporter how they handle the BBCodes. If a BBCode is used that is not supported by a used exporter, a warning will be generated that can be seen in some sort of log in EE.NET, visible to translators/dictionary maintainers. The idea, however, is that every dictionary could define BBCodes as they please, without changing any code. To do this, we'll have to use some sort of script (see below). Primitive, simple search-and-replace might be possible without scripts. I'll have to see what BBCodes will be required to see how powerful I have to design them. If you could compile a list of all LaTeX commands used that are not layout relevant (i.e. \usepackage, \documentclass and stuff are clearly not going to be BBCodes) along with what BBCode you would imagine for them (for example, [bf]STRING[/bf] => {\textbf STRING}), that would be extremely useful.

Exporters and scripting
Independent of LaTeX, we could export to more formats - for example, HTML (and therefore, I believe, mobi, epub, that sort of thing) or even wiki markup/forum BBCode. By using BBCodes and inheritance, exporters can be written very easily.

These exporters will be written as C# plugins. However, no compiling will be necessary as these plugins will be compiled on-demand on the LN.org webserver (where they'll also be executed in a sandboxed environment). These will be written, most likely, once and then never touched again. I will make sure to write plugins for the most common ones, such as LaTeX, CSV, jMemorize and SQL. They can then be adapted by the dictionary maintainers as they please. I will try to get the API as simple as possible, so you need very, very little programming knowledge to get going.

These exporters will define how they interpret BBCode (LaTeX treats [bf] differently than HTML, and CSV/SQL just ignores it altogether), how the file itself is generated and even if a re-generation is required. The interface, more or less, would likely look like this, but I haven't begun yet and I can't make promises. Just to give you an idea, if you're interested unhide the spoiler:

Spoiler

Code Select


public class Exporter
{ 
  // A function called when the BBCode class should be filled with input, i.e. before a preview or an output file is generated.
  // If the class itself is kept in cache, this function might only be called once for multiple GeneratePreview/GenerateFiles.
  void PopulateBBCode(BBCode bbcode);

  // Escapes a string, i.e. makes sure nothing will hurt any compilers whatsoever
  string Escape(string input);

  // Quotes a string, usually it will be something like " + Escape + "
  string Quote(string input);

  // Request to generate `word' in `language'. Word and variables have already been set to the specific language we are requesting and BBCode has already been replaced. Used for the web interface. Optional.
  void GeneratePreview(Word word, Output output, Variables variables);

  // Request to generate the whole file for `language'. Words implements IList. Optional.
  void GenerateFile(Words word, Language language, Output output, Variables variables);

  // Requests to generate the whole file in a generic way (for example, SQL, which has multiple languages). Optional.
  void GenerateFile(Words word);
}

That would be the Exporter interface, +/-. For example, a csv that would list only verbs could look like this, in real C# this time:

Code Select


public class CsvVerbs : Exporter
{
  // Fields that we will care about
  private const string[] Fields =
  { 
    "localized", // English, Deutsch, Français
    "native", // Na'vi, Dothrakivakinomurder
    "ipa" // self explanatory
  }
  
  string Escape(string input)
  {
    return string.Replace("\"", "\\") 
  }
  
  string Quote(string input)
  {
    return string.Format(@"""{0}""", Escape(input));
  }
  
  void PopulateBBCode()
  {
    // We don't need any kind of BBCode. Strip all BBCodes and surpress warnings.
    BBCode.StripAll(true);
    // Alternatively, for example, we could do
    // BBCode.Replace(tag: "bf", replacement: @"{\textbf {0}}", single: false)
    // or more fancy
    // BBCode.Replace("reverse", text => return string.Join("", text.Reverse()));
    // or less fancy
    // BBCode.Replace("textprimstress", @"\textprimstress", true)
  }

  // Simplified. Does all the quote-and-merge magic.
  protected string CreateLine(IEnumerable<string> columns)
  {
    return string.Join(";", columns.Select(c => Quote(c)));
  }
  
  // Creates a simple line for this CSV
  protected string CreateLine(Word word)
  {
    // Simply get every field and map it to word[f]. 
    return CreateLine(Fields.Select(f => word[f]));
  }
  
  // Generate a simple line.
  void GeneratePreview(Word word, Language language, Output output, Variables variables);
  {
    // We're text-only, so there's not exactly much to do.
    output.WriteLine(CreateLine(word));
  }

  // Generate the whole file.
  void GenerateFile(Words word, Language language, Output output, Variables variables);
  {
    // SetFilename(string suffix, string extension
    output.SetFilename('verbs', 'csv');
    // Create the header line, e.g. "Deutsch"; "Na'vi"; "Infixposition"
    // Thankfully, we will require/assume that for every field there exists for words, there is also a corresponding variable name.
    // Therefore, this kind of works.
    output.WriteLine(CreateLine(Fields.Select(f => variables[f])));
    
    // Don't sort the words, although we could.
    foreach (Word w in words)
      output.WriteLine(CreateLine(w));
    
    // Done.
  }
}

Note that this one is much more powerful than required; the fields that can be translated could, for example, be passed as an argument. Each exporter could have such arguments, mostly "Has language-specific implementation", "Has generic implementation", "Supports Previews" (LaTeX, for example, would likely not support live-previews, unless browsers can somehow render LaTeX) and perhaps a third one for parameters. That way, you could use the same plugin, but with different arguments, to create different output. Have a CSV for verbs, one for adjectives, another one for infix positions - the possibilities are quite endless. Since EE.NET is .NET and so is the exporter, it's guaranteed to be interoperable.

Likely, the way dictionaries are created will change, however. Instead of the website creating them, they will be done (again) by a backend server. Since we're installing Mono anyway, let that thing be C# too.

That's just a bit on the technical side, i.e. what I have planned. I haven't written a single line of code as of yet, but I've made some thoughts. When I come around and have the whole Exporter business implemented and port over Dothraki (I'll go on a limb and assume that `Eylan Ayfalulukanä is a bit more open for experiments and that kind of stuff) we'll quickly see if it works out or not. I'm a lazy person, so a complicated exporter is out of question. Although the whole thing will (likely) be OpenSource, I want it to be simple enough so you guys can tweak existing exporters and perhaps even create new ones based on older ones. Of course I'm not going anywhere, but less work is always nice.

That was a bit in-depth about how the new generation of stuff would come. I have to say, I'm a bit excited. It would get rid of the current layout and the current files (most likely), by replacing it with better working stuff. Instead of files, variables would be used (which could, in turn, include other variables). As example, the current first page

Code Select

\maketitle
__INTRO_TEXT3__

\noindent{\bf __TITLE_ABBREVIATIONS__}
\begin{multicols}{2}{\noindent
-- = __ABB_MORPHEME_BOUNDARY__\\
+ = __ABB_LENITE_BOUNDARY__\\
$< >$ = __ABB_INFIX_MORPHEME__\\
___ADJ_SHORT___ = __ABB_ADJECTIVE__\\
___ADP_SHORT___ = __ABB_ADPOS_AFFIX__\\
___ADV_SHORT___ = __ABB_ADVERB__\\
___CONJ_SHORT___ = __ABB_CONJUNCTION__\\
___CW_SHORT___ = __ABB_COMPOUND_WORD__\\
___DEM_SHORT___ = __ABB_DEMONSTRATIVE__\\
___INTJ_SHORT___ = __ABB_INTERJECTION__\\
___INTER_SHORT___ = __ABB_INTERROGATIVE__\\
___N_SHORT___ = __ABB_NOUN__\\
___NFP_SHORT___ = __ABB_NFP__\\
___NUM_SHORT___ = __ABB_NUMBER__\\
___OFP_SHORT___ = __ABB_OFP__\\
___PART_SHORT___ = __ABB_PARTICLE__\\
___PN_SHORT___ = __ABB_PRONOUN__\\
___PROPN_SHORT___ = __ABB_PROPER_NOUN__\\
___V_SHORT___ = __ABB_VERB__\\ 
___SVIN_SHORT___ = __ABB_SVIN__\\
___VTR_SHORT___ = __ABB_VTR__ \\
___VIN_SHORT___ = __ABB_VIN__\\
___VTRM_SHORT___ = __ABB_VTRM__\\
___VIM_SHORT___ = __ABB_VIM__\\
___SBD___ = __ABB_SBD__
}
\end{multicols}

\section*{__SOURCES__}
\begin{packed_enum}
\item PF = __SO_FROMMER_CANON__
\item M = __SO_MOVIE_SCRIPT__
\item JC = __SO_CAMERON__
\item G = __SO_AVATAR_GAMES__
\item ASG = __SO_SURVIVAL_GUIDE__
\item PND = __PANDORAPEDIA__
\end{packed_enum}

\newpage

could look like this afterwards, as a variable:

Code Select



[title]
[author]
[date]

[begin]

%{intro text}

[noindent][bf]%{title: abbreviations}[/bf][/noindent]
[multicols=2]
[*]-- = %{abbreviation: morpheme boundary}
[*]+ = %{abbreviation: lenite boundary}
[*]« » = %{abbreviation: infix morpheme}
[/multicols]

[section]%{title: sources}[/section]
[packed enum]
[*]PF = __SO_FROMMER_CANON__
[*]M = __SO_MOVIE_SCRIPT__
[*]JC = __SO_CAMERON__
[*]G = __SO_AVATAR_GAMES__
[*]ASG = __SO_SURVIVAL_GUIDE__
[*]PND = __PANDORAPEDIA__
[/packed enum]

[newpage]

The names are horrible and I'm sure the thirty-different-types-of-enum are going to kill me, but we will see. In the worst case, there will be [item][/item] and [multicolrow][/multicolrow] or something like that, but I think in the long term we'll manage.

Edit: Another use, which came just to my mind, that would be possible with C# scripting would be dynamic linking. As soon as a dictionary is deemed "ready to publish" (i.e. included in the SQL/other things), it could be included in those files, as well as automatically linked in the intro:

Code Select


// variables['additional dictionaries'] is "Additional dictionaries are {0} as well as other, unfinished ones or something." as example
output.WriteLine(variables['additional dictionaries'], string.Join(", ", Languages.Available.Skip(1).Select(l => BBCode.Format("[url=http://dicts.learnnavi.org/something/NaviDictionary_{0}.pdf]{1}[/url]", l.LanguageCode, variables[l.LanguageCode]))))

Which would always be up-to-date, without somebody manually maintaining the list. Probably doesn't happen that often, but I would use it.

Toruk Makto · November 06, 2013, 08:08:09 AM

At least in the English version, I have always been under the impression there are way too many variables. The text and info at the beginning of the document almost never changes and could be maintained by a text editor. The layout and order of the document likewise will not change very much if at all. The numeric system is unlikely to change anymore. The only dynamic content are the sections with the word lists and phrases. We shouldn't go overboard making the system too flexible, if you know what I mean.

-M.

Tuiq · November 06, 2013, 10:31:42 AM

You have to think of variables as things that can/need to be translated. While the intro text does not change at all, translators need to be able to change it. If you hardcode it into the files, other dictionaries would be English, too - which is obviously not exactly wanted.

However, they haven't been used too properly for some things, I'm afraid, The abbreviations ones haven't been used at all, instead every word has 'vtr.' written in it. This has to change, of course - while the variables for all these names will stay, I would say that the approach should be changed. Variables should, if anything, be used more often. For example, instead of having \word{...}{...}{adj.}, I would propose \adj{...}{...}, which then uses the %{abbreviation adjective} variable. This way, dictionary maintainers won't have to write it a dozen time and translators don't need to copy it three thousand times.

Of course, this means you should be able to bind parameters to word templates, i.e. say adjective = word(_, _, "%{abbreviation adjective}", _, _). Alternatively, of course, you could simply use a script which does that, i.e. in your CreateLine function (or whatever you'll call it), you would replace @{type} manually with %{@{type}}. Hm.

baritone · November 06, 2013, 10:33:55 AM

Make a versatile design to export the dictionary data not only in latex, but also in other formats - that's right. But I've heard a lot that the compatibility of the .NET and Mono is very relative. In order not to run into problems, it is desirable to design the system in mono from the beginning.

Tuiq · November 06, 2013, 10:47:22 AM

It is not completely true. Mono does not completely support .NET, that is true, and even ASP.NET's support is partial. However, I'll design and test the system on Windows with MS' CLR. If there are problems on the server, we can always find workarounds.

Mono is just another CLR for the same CLI, more or less. I'm sure as hell not going to develop this with MonoDevelop.

`Eylan Ayfalulukanä · November 06, 2013, 09:26:04 PM

Another interesting proposal would be to limit the dictionary 'system' to just the various word lists. The master editor(s), and the editors for each language, would create the intro and end matter, and 'join' them to the word lists in the manner most appropriate to their situation. This might make Tuiq's job a bit easier, and should create only minimal extra work (mainly at first) for editirs of different languages.

Dothraki at this time, uses only one word type. The valyrian languages as currently proposed, use seven that are common between them. The primary difference between the word types isn't words as much as supporting matter-- the Valyrian dictionary is designed from the get-go to support, additional comments, examples, and references to other documents. I would in the long run, also add these to the Dothraki dictionary. It seems to me that a system should be developed that would allow these fields to be added (and sometimes removed) as needed. Space for all possible variables would be maintained in the database, but would not show up in the dictionary unless that variable was used. This simplifies thing such that a given field in the database can always be counted on to be definition, part of speech, IPA, etc. The dictionary would be UTF-8 or better and all tolls to work with it should be UTF-8. Then, Tuiq and ourselves would have a dictionary framework that could embrace hundreds of languages, with few fundamental changes.

Tuiq · November 06, 2013, 10:03:54 PM

While this is a valid approach, it is a very limited one, too. The Na'vi dictionary in its very first form already exceeded this simple layout by having appendices, which would not be possible with a simple merging.

I think that the C# exporter plugin way is the best way to do it. There are several reasons for it:

You can define completely odd and unfamiliar layouts, have different sortings, selections and formatting.
Based on the words, you can render them completely different. For example, while you could have a "comment" field on your root word type, you would only render this in the NaviExtendedDictionary.pdf. In the normal PDF, it would be completely ignored.
It gives you the whole power of C#, an established and quite well engineered language (and framework) to play with. With LINQ in special, dictionaries become very powerful things.
sandboxing run-time code in C# seems to be doable (i.e. plugins as proposed) and somewhat fairly easily. Of course, if we could drop the security concerns, this would be a no-brainer and a very easy thing to do. Because we cannot necessarily trust dictionary maintainers (never, ever trust any client/third party in a server-client relation) this would be necessary.

I've started reading myself into ASP.NET and have done some neat progress. I'll start by seeing how I can hook up mwForum's current database to ASP.NET's authentication - once this is done, applying for SMF would be no problem at all.

After the authentication is done, I'll start modelling the database structure (i.e. create the model part of the MVC system). Once I think the MVC is fleshed out, I'll post it as an update before I continue working as to see where possible problems could arise and to get some more input in.

Tuiq · November 07, 2013, 08:27:20 PM

And done. After some heavy struggling with MySQL, MembershipProvider and RoleProvider, I can say with confidence that this isn't going to be an issue. Logins from this forum will work flawlessly. I'll have to come up with a role system, but that shouldn't be too difficult. The question is whether I'll incorporate that into the dictionary itself (say, Dictionary.GetMaintainers()) or another table (similar to `dictTutors`). I think the latter is a go, with a system set up like `userId`, `dictionaryId`.

Now, moving on to design the dictionaries.

Tuiq · November 08, 2013, 01:49:12 AM

Alright, the data model was a bit tricky but I am quite confident I've got it nailed down.

I'm quite aware that this won't help anyone who isn't really into these kind of graphics, but it will help the explanation. Let's check out each component.

Dictionary
Dictionaries are the core. Each EE.NET system can have multiple dictionaries. Dictionaries are either a root dictionary (i.e. have no parent) or are inherited from another parent.

Inherited dictionaries (ParentID/Parent) copy all settings, exporters and words from their root dictionary (which can be its parent or even a further ancestor). If you create a child dictionary, it will be an exact copy of its parent dictionary. However, child dictionaries can change certain things, mainly variables (not included in the picture yet, as they are very simple). This is pretty much the current system, but better. Where we currently have a slave-master system (translators <-> maintainers), this will allow for more flexibility.

Example: Assume wanted to make a Swiss German dictionary, but I'm lazy. Very. So there could be periods where the dictionary would not be updated at all. In the current system, this would either mean that the dictionary would contain more and more information in English (not exactly something we want) or it wouldn't be updated at all.

In the new system, we could attach the Swiss German dictionary to the German one (i.e. Swiss German < German < English/Main). This way, the "holes" would be filled with German. Since it is possible to hook dictionaries up and down (and, in theory, even to separate or couple them to another project... although I doubt that makes a lot of sense).

This works for all languages that are roughly the same and get neglected. It's not the best solution, but it is one that would currently be impossible.

Root dictionaries are dictionaries that have no parent (for example, Na'vi or Dothraki). All translations are based on a root dictionary. Root dictionaries have a special function; only root dictionaries have templates (all children use the root's templates). Maintainers of root dictionaries can write exporter scripts, create, update and delete templates and variables or set some basic configuration.

The other fields are fairly self-explanatory, BaseFilename is used to give the exporters a hint to unified filenames (NaviDictionary.pdf, NaviDictionary.sql, NaviDictionary_de.jm), LanguageCode should be fairly obvious. I'm not sure if I need to call a Name field in the dictionaries themselves, I think "DictionaryName @{LanguageCode}" should be pretty self-explanatory.

Moving on!

Templates
Currently called "types", "word types" or "word templates", EE.NET templates will replace the current ones. They are way more powerful than the current word types as you will see in a second. You can think of Templates as "tables". Dictionaries have several tables (Templates), which have columns (Fields) and rows (Words), with each cell being an Entry.

Each template belongs to a root dictionary (as said before, all dictionaries inheriting directly or indirectly from the root will inherit the templates) and can itself be inherited too, as I've written before (think of the "word comment" template of the same as a "word", but with another field for "comment"). Perhaps multiple inheritance would make sense, but I'm not sure how layout plugins would deal with this. In the case of comments, it would make sense, but otherwise not too much. I'll have to think about it a bit.

What this means is that you can have "shared" data between templates. A child template inherits all its fields from its parent template. However, it is possible that a child "overwrites" fields (see below) by simply adding a field with the same name.

Templates are what will be used by the exporters to identify a word. A word itself will be not much more than a bunch of associated data ("english: brain; navi: eltu; ipa: oÔôîÎi"). Which fields are set is determined by the template used and the template themselves tell the exporters what the words are. Exporters that expect, for example, a word of type "basic" (english, navi) can expect that "verb" has a field called "english" and a field called "navi", because it inherits from basic.

This might sound all a bit complicated, but I guarantee that once it's set up, you'll kind of love it. Even if you won't touch it in like forever.

Field
Fields are one property of a template (and therefore, many words). They're the columns of our imaginary table.

So a Field defines which fields a word has (uhhh that's not helpful). As you can see, each field has certain properties.

Name: Unique name that will be displayed ("Human Language", "Na'vi", "IPA", "Infix Position")
Translatable: Whether this field can be changed or not by translators. For example, the IPA and Na'vi part can currently not be changed by translators. These would be set to "false".
FixedValue: A string. If set, the field becomes untranslatable and this value is used instead (think of "hard coding" it, setting it as a constant). This value only makes real sense with inheritance.

Word
A word (row) could be seen as an "instantiation" of a template. The template is the cast, the word is the ingot. Filled with Entries. There's not much to say about it, data-wise, as it's just the glue that keeps stuff together.

Entry
Now it's getting interesting. We've had the tables, the rows, the columns, it's time to talk about the cells.

Entries the final translations. These are the current things you're already filling out, the text inputs basically. These are the things that will be translated and, very likely, the stuff that will keep you going the most. The most important fields are Value, which is the translated (or original, if attached to a root dictionary) 'values' for that field ("eana", "eltu", "oÔôîÎi", "only used on Wednesdays").

Variables
Not in the picture, as they're not part of the data design per se. They're very much like the current variables. Instead of translating the master's variables, you will be shown your parent's and the root's translation (unless your parent IS the root dictionary, in which case this would be obsolete).

Enough talk, fancy pictures.
I think I've gone into detail about the dictionary inheritance, so let's take a closer look for templates.

In this example, I will use a few templates. A checkbox indicates if translators can change it, text left is the field name, text on the right is the fixed value (if any).

The top-left is another illustration of of what happens.

So, right. We have "word" as our root word, hence all fields are considered "new".

As a maintainer, you can set the following fields on a "word": Localized, Na'vi, Word Type, Block.

As a translator, you can change the following fields on a "word": Localized, Word Type.

"verb" inherits from word. It doesn't add any new words, but overwrites "type". Because this is a fixed value, neither maintainers nor translators can change this value.

As a maintainer, you can set the following fields on a "verb": Localized, Na'vi, Block.

As a translator, you can change the following fields on a "word": Localized.

"affix" is the same as verb, so we'll skip it.

"cw" is more interesting. "cw" also inherits from "word", but adds 4 new fields: Localized P1, Na'vi 1, Localized P2, Na'vi 2.

As a maintainer, you can change the following fields on a "cw": Localized, Na'vi, Word Type, Block, Localized P1, Na'vi 1, Localized P2, Na'vi 2.

As translator, you can change the following fields on a "cw": Localized, Word Type, Localized P1, Localized P2.

If we assume now that our SQL exporter exports all "word" types, then it would simply ignore the additions made by cw. cw is a word in its core, so it can be treated as such. However, a word may not be treated as a cw, because Localized/Na'vi parts are missing.

Now, of course, this is where multiple inheritance could be useful. For example, "infix cw" could have both fields from "infix" and "cw", therefore it would be listed in (if that ever exists) a cw dictionary just as it is listed in the infix dictionary. However, each template may only have 1 parent template right now.

Alternatively, you could always slap more fields onto your base classes and never use them. For example, you could add a "comment" field to your word class. Your exporter checks if "comment" isn't empty. In case it's empty, it does nothing, otherwise it puts "<comment>@{comment}</comment>" into the output.

I think the last solution is what we really need, since multiple inheritance is a pain to explain and understand, and quite difficult to store.

With that being said, the model is "complete" I believe. The database as such is written, now it's time for some serious controller adding.

If you want to drop by and see how far I've come so far, feel free to drop by in the LearnNa'vi IRC channel and give me a shout (i.e. say Tuiq), I'll set you up with a link.

Tuiq · November 10, 2013, 07:48:12 PM

After a few changes, I think the model can work. However, I will impose a few limitations first:

You cannot edit an existing word's class (id, template, type) i.e. creating a verb, then changing its type to "verb comment".
You cannot remove or edit an existing class' fields. i.e. removing a column (for example, "comment"), changing its properties (to fixedvalue/comment) will not work. You can, however, rename them.

These limitations will be a bit tricky to implement and therefore will not be provided at the start. But we'll get there. I promise.

As a status report, things are going quite well I think. I've managed to get EntityFramework to do my bidding, as does the MySQL connection/membership/rolemanager. I'll keep on developing for MS SQL and Microsoft's .NET CLR for now, however, because MySQL's development tools are just plain broken (and painful to use) and Visual Studio's debugger for the IIS is just too nice.

However, I still believe that we can simply port it over later. Unless there's some evil bug hidden that the Mono website doesn't list.

`Eylan Ayfalulukanä · November 10, 2013, 10:05:10 PM

My head is hurting....!

Is Marki involved with this at all?

I get this in part, but a lot of this still doesn't make sense. I am also wondering if the limitations will prove to be a brick wall. One possibility: Can a word be deleted and recreated with the new/correct properties? This is probably a bigger deal for Na'vi than Dothraki, and I don't know yet how this will work with Valyrian.

And is this extensible to other language projects that might be radically different, and able to be set up on another server (read: Is this something that folks at the Language Creation Society might find useful?).

I may just drop by on your IRC channel, if I can keep it up reliably.

Tuiq · November 10, 2013, 10:06:25 PM

Not yet. I'm still waiting for a green light, so to say, but since I have little better to do, I've already started.

While it does sound extremely complicated, I promise it won't be. It's a powerful tool that is quite complex under the hood, but the interface will be very simple. Simpler than the current EE. I promise.

`Eylan Ayfalulukanä · November 10, 2013, 10:10:11 PM

Will the information in the previous EE have to be re-entered from scratch?

Tuiq · November 10, 2013, 10:22:11 PM

I plan on converting over as much information as I can. However, I could still need a helping hand in grabbing the information together. Useful, if possible, would be:

• For every word type in existence, a description which parameter does what.

Spoiler

Example:

Code Select


word  \par\textbf{#1}: [\textipa{#2}] $_{#5}$ #3 \textit{#4}
#1: Na'vi
#2: IPA
#3: Your Language
#4: (comment? description? additional information?)
#5: Source

If somebody could supply me with such a list of (all still used types), that would be amazing and help me a lot in getting a quick test version as soon as possible. Here's the list of all currently used types:

Code Select


id  format  argc  changeable 
word  \par\textbf{#1}: [\textipa{#2}] $_{#5}$ #3 \textit{#4} 5  3,4
lenite  \par\textbf{#1}: [\textipa{#2}] $_{#6}$ #3 \textit{#4} __LENITE_TEXT__ 6  3,4
cw  \par\textbf{#1}: [\textipa{#2}] $_{#9}$ #3 \textit{#4} __CW_TEXT__ 9  3,4,6,8
cww  \par\textbf{#1}: [\textipa{#2}] $_{#7}$ #3 \textit{#4} __CWW_TEXT__ 7  3,4,6
loan  \par\textbf{#1}: [\textipa{#2}] $_{#6}$ #3 \textit{#4} __LOAN_TEXT__ 7  3,4,7
note  \par\textbf{#1}: [\textipa{#2}] $_{#8}$ #3 \textit{#4} (#5 \textbf{#6} \textit{#7}) 8  3,4,5,7
derive  \par\textbf{#1}: [\textipa{#2}] $_{#9}$ #3 \textit{#4} (__DERIVE_TEXT__) 9  3,4,6,8
derives  \par\textbf{#1}: [\textipa{#2}] $_{#7}$ #3 \textit{#4} (__DERIVES_TEXT__) 7  3,4,6
derivingaffix  \par\textbf{#1}: [\textipa{#2}] $_{#4}$ #3 __DERIVINGAFFIX_TEXT__ 4  3
infixN  \par\textbf{#1}: [\textipa{#2}] $_{#5}$ #3 __VERBAL_INFIX_IN_POS__ #4 ({\sc #8}): \textbf{#6} \textit{#7} 8  3,7
infixcwN  \par\textbf{#1}: [\textipa{#2}] $_{#7}$ #3 __VERBAL_INFIX_IN_POS_S4_DERIVED_FROM__ \textbf{#5} + \textbf{#6}: \textbf{#8} \textit{#9} 10  3,9
affixN  \par\textbf{#1}: [\textipa{#2}] $_{#5}$ #3 __ADP_FOR__ #4 ({\sc #8}): \textbf{#6} \textit{#7} 8  3,4,7
alloffixN  \par\textbf{#1}: [\textipa{#2}] $_{#6}$ #4 __SUFFIX_FOR_NOUNS_ENDING_IN_S5_ALLOMORPH_OF__ \textbf{#3}) ({\sc #9}): \textbf{#7} \textit{#8} 9  4,5,8
derivingaffixN  \par\textbf{#1}: [\textipa{#2}] $_{#4}$ #3 __DERIVING_AFFIX__ \textbf{#5} \textit{#6} 7  3,6
markerN  \par\textbf{#1}: [\textipa{#2}] $_{#4}$ #3: \textbf{#5} \textit{#6} 7  3,6
liu  \par\textbf{#1}: [\textipa{#2}] $_{#5}$ #3 #4 5  3,4

Don't worry about the last two rows and the design. I just need to know which number is called what. Because the new EE system uses names, not numbers, this will be used to map the existing database into a new one in a first step. In a second step, I will then prettify said database to make full use of the new system.

• A list of all current LaTeX used in word definitions and how to replace it with BBCode.

Spoiler

For example:

Code Select


\textprimstress => [textprimstress]
\textcorner => [textcorner]
\bf{some text here} => [bf]some text here[/bf]

The first one is the most important, as it will allow me to convert the data from the old system into the new one for a few basic, similar steps. After we've done that first import, we can later tune it properly to make use of EE.NET's new features.

The second one is less important and will only become relevant once we start "rendering" dictionaries. Until then, it will just display the LaTeX in the browser, which you have to parse in your head, I guess.

Edit: For Dothraki, it would be a lot easier I guess - I'll just need "\par\textbf{#1}: [\textipa{#2}] $_{#5}$ #3 \textit{#4}" for the first part. I guess it's "Dothraki, IPA, English, Source, Part of Speech"?

`Eylan Ayfalulukanä · November 11, 2013, 04:23:20 AM

Quote from: Tuiq on November 10, 2013, 10:22:11 PM
I plan on converting over as much information as I can. However, I could still need a helping hand in grabbing the information together. Useful, if possible, would be:

• For every word type in existence, a description which parameter does what.
Spoiler
Example:

Code Select Expand
word \par\textbf{#1}: [\textipa{#2}] $_{#5}$ #3 \textit{#4} #1: Na'vi #2: IPA #3: Your Language #4: (comment? description? additional information?) #5: Source

If somebody could supply me with such a list of (all still used types), that would be amazing and help me a lot in getting a quick test version as soon as possible. Here's the list of all currently used types:

Code Select Expand
id format argc changeable word \par\textbf{#1}: [\textipa{#2}] $_{#5}$ #3 \textit{#4} 5 3,4 lenite \par\textbf{#1}: [\textipa{#2}] $_{#6}$ #3 \textit{#4} __LENITE_TEXT__ 6 3,4 cw \par\textbf{#1}: [\textipa{#2}] $_{#9}$ #3 \textit{#4} __CW_TEXT__ 9 3,4,6,8 cww \par\textbf{#1}: [\textipa{#2}] $_{#7}$ #3 \textit{#4} __CWW_TEXT__ 7 3,4,6 loan \par\textbf{#1}: [\textipa{#2}] $_{#6}$ #3 \textit{#4} __LOAN_TEXT__ 7 3,4,7 note \par\textbf{#1}: [\textipa{#2}] $_{#8}$ #3 \textit{#4} (#5 \textbf{#6} \textit{#7}) 8 3,4,5,7 derive \par\textbf{#1}: [\textipa{#2}] $_{#9}$ #3 \textit{#4} (__DERIVE_TEXT__) 9 3,4,6,8 derives \par\textbf{#1}: [\textipa{#2}] $_{#7}$ #3 \textit{#4} (__DERIVES_TEXT__) 7 3,4,6 derivingaffix \par\textbf{#1}: [\textipa{#2}] $_{#4}$ #3 __DERIVINGAFFIX_TEXT__ 4 3 infixN \par\textbf{#1}: [\textipa{#2}] $_{#5}$ #3 __VERBAL_INFIX_IN_POS__ #4 ({\sc #8}): \textbf{#6} \textit{#7} 8 3,7 infixcwN \par\textbf{#1}: [\textipa{#2}] $_{#7}$ #3 __VERBAL_INFIX_IN_POS_S4_DERIVED_FROM__ \textbf{#5} + \textbf{#6}: \textbf{#8} \textit{#9} 10 3,9 affixN \par\textbf{#1}: [\textipa{#2}] $_{#5}$ #3 __ADP_FOR__ #4 ({\sc #8}): \textbf{#6} \textit{#7} 8 3,4,7 alloffixN \par\textbf{#1}: [\textipa{#2}] $_{#6}$ #4 __SUFFIX_FOR_NOUNS_ENDING_IN_S5_ALLOMORPH_OF__ \textbf{#3}) ({\sc #9}): \textbf{#7} \textit{#8} 9 4,5,8 derivingaffixN \par\textbf{#1}: [\textipa{#2}] $_{#4}$ #3 __DERIVING_AFFIX__ \textbf{#5} \textit{#6} 7 3,6 markerN \par\textbf{#1}: [\textipa{#2}] $_{#4}$ #3: \textbf{#5} \textit{#6} 7 3,6 liu \par\textbf{#1}: [\textipa{#2}] $_{#5}$ #3 #4 5 3,4

Don't worry about the last two rows and the design. I just need to know which number is called what. Because the new EE system uses names, not numbers, this will be used to map the existing database into a new one in a first step. In a second step, I will then prettify said database to make full use of the new system.

• A list of all current LaTeX used in word definitions and how to replace it with BBCode.
Spoiler

For example:
Code Select Expand
\textprimstress => [textprimstress] \textcorner => [textcorner] \bf{some text here} => [bf]some text here[/bf]

The first one is the most important, as it will allow me to convert the data from the old system into the new one for a few basic, similar steps. After we've done that first import, we can later tune it properly to make use of EE.NET's new features.

The second one is less important and will only become relevant once we start "rendering" dictionaries. Until then, it will just display the LaTeX in the browser, which you have to parse in your head, I guess.

Edit: For Dothraki, it would be a lot easier I guess - I'll just need "\par\textbf{#1}: [\textipa{#2}] $_{#5}$ #3 \textit{#4}" for the first part. I guess it's "Dothraki, IPA, English, Source, Part of Speech"?

Dothraki is pretty easy, there is only one type:

#1 Dothraki
#2 IPA
#3 Part of Speech
#4 definition
#5 source
And the comment field is used extensively.

Both of the Valyrian languages have several types, but those are not as an immediate concern, as there are some fundamental problems that need to be fixed with these dictionaries.

I would really like Dothraki to have some added fields for root form, example sentence and canon/blog citations, but not all words will need any or all of these. I suspect Naʼvi could benefit from these as well, especially seeing that the Naʼvi dictionary is 'official'.

As far as LaTeX, there is of course boldface and italic. There is a special LaTeX form for the tie bar in the IPA, and also for the 'dental' symbol used under some consonants. Valyrian may use a couple more, plus it uses the stress marks you have already included in your example. Most importantly for Valyrian would be the ability to enter vowels with macrons, such as Ā ā Ē ē Ī ī Ō ō Ū ū Ȳ ȳ, without having to use BBcode. This is important for making the databases easily searchable on these letters. Is there really specialized BBcode for IPA? Where do you look for it?

Tuiq · November 11, 2013, 04:49:03 AM

Quote from: `Eylan Ayfalulukanä on November 11, 2013, 04:23:20 AMDothraki is pretty easy, there is only one type:

#1 Dothraki
#2 IPA
#3 Part of Speech
#4 definition
#5 source
And the comment field is used extensively.

Thanks, I'll abuse Dothraki as test dictionary then. If only I could export the current data from the database (cough cough Marki phpMyAdmin's export is broken cough) I could start writing a conversion tool. But it doesn't matter too much right now.

I'll add the following word class for Dothraki then: word with the fields Dothraki, IPA, Part of Speech, Definition, Source, Comment.

Quote from: `Eylan Ayfalulukanä on November 11, 2013, 04:23:20 AMI would really like Dothraki to have some added fields for root form, example sentence and canon/blog citations, but not all words will need any or all of these. I suspect Naʼvi could benefit from these as well, especially seeing that the Naʼvi dictionary is 'official'.

You'll be able to do that with the new word class system. For example, would you like to extend word with a root form field? If you could tell me which other classes you would need (i.e. "I would like something like the normal words, let's call it blog cited. It should add two new fields, 'citation source' and 'citation link' as well as 'date'.") I could get going on that.

Quote from: `Eylan Ayfalulukanä on November 11, 2013, 04:23:20 AMAs far as LaTeX, there is of course boldface and italic. There is a special LaTeX form for the tie bar in the IPA, and also for the 'dental' symbol used under some consonants. Valyrian may use a couple more, plus it uses the stress marks you have already included in your example. Most importantly for Valyrian would be the ability to enter vowels with macrons, such as Ā ā Ē ē Ī ī Ō ō Ū ū Ȳ ȳ, without having to use BBcode. This is important for making the databases easily searchable on these letters. Is there really specialized BBcode for IPA? Where do you look for it?

BBCode is generic. We're writing our own BBCode system, which means we can define what BBCodes there will be. Personally, I would prefer it if we could use Unicode wherever possible (for example, with macarons - as you just did). The "searching is not possible" is, I would guess, for the same reason that the Russian one isn't working, because of an incompatible LaTeX renderer. However, with the new system, we could very easily make HTML dictionaries - those would be searchable by normal browsers with CTRL-F. I'm not too sure how well IPA is handled by browsers, however. But we'll find a way, I hope.

In that context, is it possible to have the IPA completely in some sort of unicode? i.e. is there an unicode command for it? I found this list to be exactly what we might be looking for. For example, \textprimstress would be "ˈ". If we could write, in the IPA in Unicode, that would make things infinitely much easier. For example, {\textprimstress Pa.Paw} could become {ˈʔa.ʔaw}. However, I recognize this is a pain in the arse to write (although we could have a nice little browser help for that? A list of IPA symbols you could click to insert said IPA at the current location? Kind of like the ä and ì buttons here?). Perhaps I'll just have to find a write to decode TIPA into Unicode IPA for HTML. It doesn't seem to be impossible, at least. This list of IPA unicode looks quite complete to me. But I'm no linguist, so I won't make any statements on that.

So, to be precise: You could write the data in Unicode (as you do now), and if necessary the exporter would (let's assume this would be necessary) replace ä with 'a to make it searchable. Or it doesn't. Important is that, if your data is in unicode, we can do whatever we please with it. It's the easiest thing to handle and having all data consistently in Unicode wouldn't be a bad thing.

baritone · November 11, 2013, 11:05:41 AM

I have compiled a dictionary for vrrtepcli and add it into the source sode, but it was a great job. If the data transfer from the Russian part of the dictionary into NaviData.sql can not be fixed soon, I'm ready to make corrections in EE database and vrrtepcli dictionary simultaneously. But it would be better if the data transfer will be repaired.

Tuiq · November 11, 2013, 11:29:56 AM

In all likelihood, whatever is currently breaking Russian's SQL will not be fixed until the new EE system is there. It's simply too much effort and not worth it while we're working on a successor anyway.