"Dictionary" Generator

Started by Tìtstewan, March 15, 2016, 03:50:19 PM


Tìtstewan

Kaltxì ma smuk,

A few days back, I created a PHP-based script that uses NaviData.sql to generate a dictionary page; an example can be seen on the LN Vocab page. :)

It is just a single PHP file, which you can find in my GitHub repository:
https://github.com/Titstewan/DictionaryGenerator

This generator creates an HTML page as output, but by changing the following line,
echo '<span style="font-weight: bold; margin-left: -0.7em;">', $data2['navi'], '</span> [', $data2['ipa'] ,'] <em>', $data2['partOfSpeech'],'</em> ', $data2['localized'], '<br />';
one can make it generate XML output instead:
echo '<entry><word>', $data2['navi'],'</word><pro>[', $data2['ipa'] ,']</pro><source>PF</source><pos>', $data2['partOfSpeech'],'</pos><def>', $data2['localized'], '</def></entry>';
Just put that echo in a foreach loop as follows:
foreach ($vocab as $data2)
{
// echo the stuff
echo '<entry><word>', $data2['navi'],'</word><pro>[', $data2['ipa'] ,']</pro><source>PF</source><pos>', $data2['partOfSpeech'],'</pos><def>', $data2['localized'], '</def></entry>';
}
echo '</dictionary> -->';
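One caveat worth adding to the XML variant: if a field value ever contains &, < or >, the generated XML becomes invalid. A minimal sketch of an escaping helper (xmlEntry() is a hypothetical function, not part of the original script; the field names follow the echo line quoted above):

```php
<?php
// Sketch: escape field values before emitting them as XML.
// xmlEntry() is a hypothetical helper; $data2's keys mirror the echo
// line from the generator above.
function xmlEntry(array $data2)
{
    $esc = function ($s) {
        // ENT_XML1 escapes &, <, > (and quotes) for XML output
        return htmlspecialchars($s, ENT_XML1 | ENT_QUOTES, 'UTF-8');
    };
    return '<entry><word>' . $esc($data2['navi'])
         . '</word><pro>[' . $esc($data2['ipa'])
         . ']</pro><source>PF</source><pos>' . $esc($data2['partOfSpeech'])
         . '</pos><def>' . $esc($data2['localized']) . '</def></entry>';
}
```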


To run this, you will need MySQLi enabled. The script was written in a PHP 5.5 environment; it may also work under older PHP versions, but I haven't tested that.
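For readers who want to see the shape of such a generator without downloading the repository, here is a minimal, hedged sketch. formatEntry() is a hypothetical helper, and the credentials and table name are placeholders, not the real NaviData layout; only the field names follow the echo line quoted above:

```php
<?php
// Sketch: the formatting half of such a generator, separated from the
// database half so it can be tested on its own.
function formatEntry(array $row)
{
    return '<span style="font-weight: bold; margin-left: -0.7em;">' . $row['navi']
         . '</span> [' . $row['ipa'] . '] <em>' . $row['partOfSpeech'] . '</em> '
         . $row['localized'] . '<br />';
}

// Database half (placeholder credentials and table name; needs MySQLi):
// $db = new mysqli('localhost', 'user', 'password', 'navidata');
// $db->set_charset('utf8');  // without this, Cyrillic etc. can come out as '?'
// foreach ($db->query('SELECT navi, ipa, partOfSpeech, localized FROM words') as $row) {
//     echo formatEntry($row), "\n";
// }
```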

If you have problems getting this running, just let me know. :)

-| Na'vi Vocab + Audio | Na'viteri as one HTML file | FAQ | Useful Links for Beginners |-
-| Kem si fu kem rä'ä si, ke lu tìfmi. |-

Wllìm

Nice, this should be able to replace my manual find-and-replace process for updating the grammar tools on my website. +1 :D

(I'll still need to apply manual fixes to the infix positions, though, and remove derived verbs...  :( I've been thinking for a long time about making a special dictionary that contains - instead of the meanings - grammatical information about words...)

Tìtstewan

I am glad you like it! :)
The EE database also contains the infix information, like '<1><2>ak<3>u. Just add the variable $data2['infixes'] to the main echo line. :)

EDIT: One can create a function that scans the $data2['infixes'] variable for <1>, <2> and <3> and replaces them with the corresponding infixes (using an array). That way, one could generate a list of related verb forms with their infixes.
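To make that EDIT concrete, here is a minimal sketch of such a replacement function. applyInfixes() is a hypothetical helper; the template format follows the <1><2><3> examples in this thread, and the chosen infixes ('eyk', 'er') are just sample input:

```php
<?php
// Sketch: replace the <1>, <2>, <3> placeholders in an EE-style infix
// template with concrete infixes, using an array (as suggested above).
// $infixes is [position1, position2, position3]; '' means "no infix here".
function applyInfixes($template, array $infixes)
{
    return str_replace(['<1>', '<2>', '<3>'], $infixes, $template);
}

echo applyInfixes('t<1><2>ar<3>on', ['eyk', 'er', '']); // teykeraron
```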


Wllìm

Well, I know that there is an infixes column, but it contains mistakes, as it is filled in by a script (see this post). The IPA data is correct, but it is a lot of work to parse. Some other information is missing as well... I think that while the current dictionary is great for humans, its data is not easy to use for grammar analyzers and the like ;)

Tìtstewan

Ah, yes... I forgot about that post. :-[ So, nevermind. :)
Yeah, the issue with Eana Eltu... We would need to create a completely new environment that
A) actually supports UTF-8 characters (use my script and switch to Russian, and you'll get a lot of question marks),
B) offers more flexibility, like adding sentences,
C) gets rid of that LaTeX part, which is indeed powerful but has problems with character encoding (why on earth don't they add full UTF-8 support, and in all packages?),
and some other stuff I forget...

Also (no joke), I started messing around with a fresh SMF installation, trying to create a "dictionary system" modification. But the thing is, I am not a PHP dev, and therefore I have to dig A LOT through various documentation.
See the attachment; this is what I've got so far...


Wllìm

That looks good. I see both advantages and disadvantages to having the dictionary integrated with the forum software. The advantage is that an integrated dictionary looks nice on the forum; the disadvantage is that it may be harder to develop and maintain...

About LaTeX: if one compiles with XeLaTeX instead of PDFLaTeX, one gets full UTF-8 support. It should even work with languages with complicated scripts like Chinese (I've never tried it, though - I don't speak Chinese ;D). If you have questions about LaTeX: I use it often, so just ask :D
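For reference, a minimal XeLaTeX document along those lines. This is a generic sketch, not taken from EE itself; the font choice (Charis SIL, which covers IPA) is an assumption:

```latex
% Compile with xelatex, not pdflatex.
\documentclass{article}
\usepackage{fontspec}      % native UTF-8 input and OpenType font access
\setmainfont{Charis SIL}   % any IPA-capable font; an assumption here
\begin{document}
kaltxì -- [kal.ˈtʼɪ] -- no escape codes needed
\end{document}
```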

I think that it would be best to have the database decoupled from whatever program is used to produce the output. So if you want a PDF, you could use some program that invokes XeLaTeX; if you want HTML, you can use something like your PHP script; and so on :)

Sentences would be great! Also maybe the Frequency Dictionary could be integrated... And maybe etymology information for each word? Okay, I'm getting a bit too enthusiastic here ;D

I think I am going to develop some prototype this weekend... :-\

Tìtstewan

That mod is definitely not supposed to be installed on *this* forum. Eana Eltu is also "hacked" forum software; that way one does not have to build a permission, login, or session system as well - only the dictionary part has to be done. I just picked SMF because I mostly understand how it works. The EE forum is based on Perl, and I know absolutely nothing about Perl.

LaTeX is interesting and for some things very powerful (I actually use it to create a new Na'vi reference, Horen amuve). It works well, but TeXStudio, for example, shows errors about font faces because of the IPA material, and I had difficulties with \href{}{} because it still used T1 or OT1 encoding. I "fixed" an apostrophe by writing \%E2\%80\%99 ...which is not how I want to deal with links that contain non-ASCII characters. O___o

There should be one "database to rule them all" that could be the source for all other DBs, including a LaTeX version that generates the PDF (which has to keep its current form for various reasons, btw). Also, I (and surely some other people) would really love to see the dictionary that Plumps has created get automated. And finally, what I would wish to add is a kind of word management system for the LEP, because I fear the LEP word list is becoming bigger and more complicated.

I am not sure what you mean by "develop some prototype", but if you are considering creating such a system, I'll let you know that there was a group in the past that planned to build something like that, but stopped development. I would suggest teaming up and developing such a system together, because I doubt that a single person can create such a thing alone (I am not saying it is impossible, btw :)). So, some of us have GitHub and such... shall we create a new thread about it?


Tìtstewan

That dictionary generator could totally fit in the "tools" area. :)


`Eylan Ayfalulukanä

Attempts to get EE to work well beyond Na'vi and Dothraki have not been successful. High Valyrian, another language that lurks on this server and is not talked about much, uses some special and somewhat unusual diacritics on vowels. They give EE fits because it uses PDFTeX. You can get PDFTeX to generate the correct characters, but the escape codes end up in the database and make anything that parses the database choke when it encounters them.

I'd like to see something developed that would work with a lot of different languages. It should be easy to add and edit word entries, be flexible in its formatting, produce a nice-looking dictionary, and use a database that can be universally understood. You can come close to all of these at once, but perfecting them will take some real work. I am sure there are commercial programs out there that do this, and it is understandable why they are not inexpensive.

Yawey ngahu!
pamrel si ro [email protected]

Tìtstewan

Quote from: `Eylan Ayfalulukanä on March 21, 2016, 04:03:33 PM
I am sure there are commercial programs out there that will do this, and it would be understood why they are not inexpensive.
I haven't found a web application that covers all the stuff we will need. Perhaps it just hasn't been written yet. That's why I'll try to create a dictionary system that uses utf8mb4. I am very worried about how to convert the database contents into a PDF file that has to look very, very close to the current Na'vi dictionary...


Tìtstewan

So, yeah...
I simplified all the code of the generator. I still wonder why on Earth I didn't do it earlier... :-[


Tirea Aean

Quote from: Wllìm on March 17, 2016, 12:41:58 PM
Well, I know that there is an infixes column, but it contains mistakes as it is filled by a script (see this post). The IPA data is correct, but it is a lot of work to parse it. Also some other information is missing... I think while the current dictionary is great for humans, it is not easy to use the data for grammar analyzers and so on ;)

I've had no problems with our database in Fwew or Vrrtep thus far. Then again, these use plaintext dumps of the tables.

I have created an automated "Fwew data file manufacturing/editing suite". It is a set of scripts that I run on every PDF update; they do the following to get the update out for Fwew:


  • Download NaviData.sql
  • Drop the local database and replace it with what's in NaviData.sql
  • SELECT the tables into outfiles
  • Run a shell script to fix the broken infix location data of compound verbs
  • Probably some other stuff I forgot to list here
  • scp the data files to the dictionarydata folder on my website

I just got done (mostly) with getting infix parsing to work in Fwew. What I ended up doing:

Since the <1><2><3> "infixes" field of the data file is now reliable thanks to my hax.sh from step 4 above...


  • Make a regex string for each infix position, like "(äp)?(eyk)?"
  • Grab the infix location data from the infixes field of the data file
  • Replace the things that look like <1> with such strings; this becomes the regex to match the input against
  • Compile the regex and call a match-all-string function to see which infixes the word has
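The steps above can be sketched in PHP to match the rest of this thread's examples (Fwew itself is a separate program, so this is an illustration, not its actual code). The infix lists below are deliberately incomplete samples, only a few per position, and infixRegex() is a hypothetical helper:

```php
<?php
// Sketch of the steps above: turn an EE-style infix template into a regex,
// then match a conjugated verb against it to recover the infixes.
// Only a handful of infixes per position are listed; the real lists are longer.
function infixRegex($template)
{
    $pos1 = '(äp|eyk)?';
    $pos2 = '(am|ìm|ìy|ay|er|ol|iv)?';
    $pos3 = '(ei|äng|ats|uy)?';
    $pattern = str_replace(['<1>', '<2>', '<3>'], [$pos1, $pos2, $pos3], $template);
    return '/^' . $pattern . '$/u';   // /u: treat pattern and subject as UTF-8
}

preg_match(infixRegex('t<1><2>ar<3>on'), 'teykeraron', $m);
// $m[1] = 'eyk', $m[2] = 'er'  (position 3 left empty)
```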

IPA just works. Unless you're on Windows.


About EE: indeed, I do remember the previous efforts to replace Eana Eltu. For some reason it never took off.

I really don't see why it would be so impossible to do this. EE is literally just a pile of Perl hax. We're probably just overthinking it.

The mentioned effort was actually in support of redoing EVERYTHING: a brand-new database layout, a brand-new PDF generator, AND a brand-new graphical interface for users to edit the dictionary. Yeah, that's a lot of work. But it can be done. A few people on GitHub and a lot of dedication can pull it off, even in such a way that the PDF output looks identical. (We would need to study the source of the PDF to know exactly how to reproduce its layout and style in order to make the new output identical.)

Yeah. It would be cool to have a universal dictionary system with full UTF-8 support and all that stuff and whatnot.



Yawne Zize’ite

Is using PDFTeX an absolute necessity? XeLaTeX and LuaLaTeX both natively use UTF-8.

Tìtstewan

As far as I know, yes - it is what's needed to create the PDF file.
