Eana Eltu: Translator, Dictionary, API and putxìng.

Tuiq · September 10, 2010, 09:29:16 AM

http://www.youtube.com/watch?v=dR3ccmWmLhk&autoplay=1

And the last one is just something that needed to be done.

Seze · September 10, 2010, 08:41:26 PM

I found an error in the new version of the SQL file. The `id` field is set to Auto_Increment in the create table statement for the `metaWords` table.

Tuiq · September 11, 2010, 03:51:36 AM

Fixed it in the Repository, will fix it on the server ASAP.

Tuiq · September 13, 2010, 12:50:15 PM

The SQL file is working again. Changed char(40) back to int(11). The IDs are shifted left by 2 bits; the real ID can be recovered with >> 2.

Seze · September 13, 2010, 11:04:58 PM

Quote from: Tuiq on September 13, 2010, 12:50:15 PM
The SQL file is working again. Changed char(40) back to int(11). The IDs are shifted left by 2 bits; the real ID can be recovered with >> 2.

Just to make sure I'm on the same page with this: the "real id" gives you duplicates for words like "fitseng" and "fitsenge", where both words share the same base word. So if we wanted the mutation number, we would shift right by 2 places, shift back left by 2 (zeroing out the lowest 2 bits), and then compare that number with the original ID; the difference between them would be the mutation id?

Example:

Fitseng (364) -> RealID = 91, MutationID = 0;
Fitsenge (365) -> RealID = 91, MutationID = 1;

Tuiq · September 14, 2010, 04:14:50 AM

To get the mutation number, use the bitwise AND operator (&) with 3.

realId = (ID >> 2);
mutation = (ID & 3);

364 >> 2 = 91
364 & 3 = 0
365 >> 2 = 91
365 & 3 = 1

The mutations are alternations of the same word. This is only used with things like srak(e) as of now.

By using this shifted system you can get the real ID and re-merge these words if you please.
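
A minimal Python sketch of the decoding described above, assuming the exported ID keeps the mutation number in its two lowest bits. The function names `decode_id` and `merge_by_real_id` are illustrative and not part of Eana Eltu.

```python
from collections import defaultdict


def decode_id(shifted_id):
    """Split an exported ID into (real_id, mutation)."""
    real_id = shifted_id >> 2   # drop the two lowest bits
    mutation = shifted_id & 3   # keep only the two lowest bits
    return real_id, mutation


def merge_by_real_id(rows):
    """Group (shifted_id, word) pairs by real ID to re-merge alternations."""
    merged = defaultdict(list)
    for shifted_id, word in rows:
        real_id, mutation = decode_id(shifted_id)
        merged[real_id].append((mutation, word))
    return dict(merged)


# The two example IDs from this thread.
print(decode_id(364))  # (91, 0)
print(decode_id(365))  # (91, 1)

# Shifting down and back up, then subtracting, gives the same mutation number.
assert 365 - ((365 >> 2) << 2) == 365 & 3

# Both entries end up under real ID 91.
print(merge_by_real_id([(364, "fitseng"), (365, "fitsenge")]))
```

The shift-and-subtract approach asked about earlier gives the same result as `& 3`; the AND is simply the shorter way to write it.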

Seze · September 14, 2010, 10:08:38 AM

Irayo for clearing that up for me. Now you've got me pondering ways to use that information in the mobile app...

Tuiq · September 17, 2010, 04:06:43 AM

Free (once again). The IRC bot is now again capable of !lnav, !nav{LANG}, !lnav{LANG}, !sent{LANG}. Evil old %SpeakNavi::LANGUAGE removed.

In other news, some numbers are broken - I can't tell which numbers though. I just know from past experiments that certain numbers are recognized incorrectly and that the fix would be annoying but pretty simple.

Tuiq · October 20, 2010, 12:16:45 PM

The IRC bot is down and will stay that way. The server has reached its performance limits, which means that I'll have to move all of EE quite soon.

Tuiq · February 06, 2012, 08:11:45 AM

The generated SQL has been changed. In addition to the words, there are now two new tables containing information about infixes and affixes.

Please note that every entry that has position set to NULL is an affix, whereas every entry with a position is an infix.
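
A small sketch of how the NULL rule above could be applied when reading the data. The table name `fixes`, the column names, and the two placeholder rows are assumptions made for illustration; only the rule itself (no position means affix, a position means infix) comes from the post.

```python
import sqlite3

# Hypothetical table and column names; only the NULL-position rule is from the post.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fixes (name TEXT, position INTEGER)")
conn.executemany(
    "INSERT INTO fixes (name, position) VALUES (?, ?)",
    [("placeholder_affix", None), ("placeholder_infix", 1)],  # made-up rows
)


def classify(position):
    """An entry without a position is an affix; one with a position is an infix."""
    return "affix" if position is None else "infix"


rows = conn.execute("SELECT name, position FROM fixes ORDER BY name").fetchall()
kinds = {name: classify(position) for name, position in rows}
print(kinds)
```

Testing against `None` rather than truthiness matters here, since a position of 0 would still count as an infix.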

Ftiafpi · February 12, 2012, 10:23:26 PM

Quote from: Tuiq on September 10, 2010, 09:29:16 AM
http://www.youtube.com/watch?v=dR3ccmWmLhk&autoplay=1

And the last one is just something that needed to be done.

So, I noticed this was in my "new replies to your posts" area and stumbled upon this video that I apparently never watched. Such a great song I've never heard, irayo!

Tsyalatun te Eyktan Txuratu'itan · March 01, 2013, 05:24:05 AM

Ok, so, I've arranged for Tirea's vrrtepcli/jmemrise dictionary data to be automatically generated from the NaviDATA.sql dump file every six hours. Tirea has always versioned the resulting data using the version number found in the current dictionary PDF file.

The problem I've come across is that the two sets of data are not synchronised - the SQL dump file this morning is two hours after the dictionary update. A similar problem happened yesterday with the 12.891 dictionary version. The SQL dump file was missing the changes to tìftang si.

I see three solutions to this:

1) replace the vrrtepcli/jmemrise dictionary version with an md5sum or sha1sum of the NaviDATA.sql file, thereby making the "version" information useless to the general audience - no one will be able to tell whether their dictionary data is up to date or not.
2) some way of versioning the dictionary data in the SQL database such that the base data version for all dictionaries can be tracked.
3) move the vrrtepcli/jmemrise dictionary data generation onto eanaeltu.

Any thoughts?

Tuiq · March 01, 2013, 05:49:17 AM

The laziest solution would be to have Eana Eltu generate the files you want (3). The mechanism that does that, along with the example plugins, is available on my GitHub. It requires a bit of Perl and it's a bit hackish, but the whole system might be majorly revamped (or rather, rebooted) soonish. At least, there are plans for EE2, including a concept, and I'm talking to some dictionary guys about their ideas for the system (of course, everybody is invited to the discussion if you want anything specific). I've started coding it, but as it isn't a real project of mine, I don't know if it's ever going to be finished. Until I have something to show, I'll keep it "private".

As for the version, I would suggest using an HTTP HEAD request to see when the file was last modified. The system is set up to use cronjobs though, which means that the files should be generated every day at the same time, give or take an hour. So at least on EE's side, this should be constant, unless the generation messes up for some reason, which wouldn't be noticed as there's no real event log or anything. It would just not update (and Last-Modified would stay the same). I don't really know how you could effectively version the database without causing a massive overhead in the SQL. Of course, EE saves the last-modified time for each word (and each translation thereof) as a UNIX timestamp, which could be included in the SQL, but it isn't required for most cases and would bloat the file for no good reason. Also, there might be some privacy issues.
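
The HEAD check suggested above could look roughly like this in Python. It is a sketch using only the standard library; the function names are illustrative, and the URL of the dump file is deliberately left for the caller to supply.

```python
from email.utils import parsedate_to_datetime
from urllib.request import Request, urlopen


def parse_last_modified(header_value):
    """Turn an HTTP Last-Modified header value into a datetime."""
    return parsedate_to_datetime(header_value)


def fetch_last_modified(url):
    """Send a HEAD request and return the parsed Last-Modified time, or None."""
    request = Request(url, method="HEAD")
    with urlopen(request, timeout=10) as response:
        value = response.headers.get("Last-Modified")
    return parse_last_modified(value) if value else None


def is_newer(remote_header, last_seen_header):
    """True if the remote file changed after the copy we last processed."""
    return parse_last_modified(remote_header) > parse_last_modified(last_seen_header)
```

A consumer would store the header value it last processed and regenerate only when `is_newer` reports a change, which avoids downloading the whole dump just to compare it.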

Edit: To clarify, while I'm not actively doing anything on this anymore (besides a fix or two if required), you can - if you want to - write the plugin(s) and send them to me. I'll include them in the repository and upload/enable them on the server, assuming they aren't malicious and don't require any fancy third-party stuff. In the latter case (or if they are CPU/memory intensive), you'll need to talk to the LN.org guys (first).

Tsyalatun te Eyktan Txuratu'itan · March 01, 2013, 06:25:33 AM

At the moment, my script is a shell script - but it calls out to some of Tirea's original scripts, which are a mixture of shell and Python.

Can we fix the time at which the NaviData.sql file is generated so that it's independent of local timezones and daylight savings etc? With all the problems of daylight savings being up to governments to decide, and the desync'd nature of when daylight savings start and end around the world, having data exported internationally depending on daylight savings is far from a good idea.

Depending on the version of cron, CRON_TZ=UTC in the appropriate crontab file will set the timezone to be independent of local daylight savings. I can then arrange for my scripts to run maybe 5 or 10 minutes after that, which should reduce the possibility of this issue (it certainly would close the window down to something less likely to cause a problem.)

However, for the time being, I think I'm going to do the simple fix: make it run the script more regularly (once every hour at 10 past the hour) and use wget -N to grab the files. Any change to either the dictionary or SQL data will trigger a regeneration of the vrrtepcli dictionary data. As for the dictionary version, I think we'll just have to accept that it's meaningless - I'll append a suffix of a truncated sha1 hash of the SQL data - git style.
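
The "git style" suffix mentioned above might be built like this. This is only a sketch: the hyphen separator and the 7-character length (git's usual short-hash length) are assumptions, and `versioned_name` is an illustrative name rather than anything in the existing scripts.

```python
import hashlib


def versioned_name(dictionary_version, sql_bytes, length=7):
    """Append a truncated SHA-1 of the SQL dump to the dictionary version."""
    digest = hashlib.sha1(sql_bytes).hexdigest()
    return f"{dictionary_version}-{digest[:length]}"


# Stand-in bytes; a real run would read the downloaded dump file instead.
sample_dump = b"-- contents of the SQL dump would go here\n"
print(versioned_name("12.891", sample_dump))
```

Two dumps with the same dictionary version but different contents then get different suffixes, so a change in the SQL data is visible even when the PDF version number has not moved.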

Tuiq · March 01, 2013, 06:30:04 AM

Quote from: Tsyalatun te Eyktan Txuratu'itan on March 01, 2013, 06:25:33 AM
Can we fix the time at which the NaviData.sql file is generated so that it's independent of local timezones and daylight savings etc? With all the problems of daylight savings being up to governments to decide, and the desync'd nature of when daylight savings start and end around the world, having data exported internationally depending on daylight savings is far from a good idea.

Depending on the version of cron, CRON_TZ=UTC in the appropriate crontab file will set the timezone to be independent of local daylight savings. I can then arrange for my scripts to run maybe 5 or 10 minutes after that, which should reduce the possibility of this issue (it certainly would close the window down to something less likely to cause a problem.)

You can never completely fix it though. All output files are created in a chain; if one takes a bit longer, it throws off everything afterwards. Anyway, your partner in this issue is TorukMakto/Marki/whatevertheycallthemselvesthismonth. I don't do the cronjobs nor would I have access to them, I guess.

Toruk Makto · March 01, 2013, 12:57:07 PM

Actually, my nick has been settled for a little while now.

The navidata.sql file is generated at 00:02 every day. The server time is set to shift to daylight savings time automatically, so I don't know of an easy way to keep the cron from changing too...

Tuiq · March 01, 2013, 12:58:19 PM

Quote from: Toruk Makto on March 01, 2013, 12:57:07 PM
Actually, my nick has been settled for a little while now.

The navidata.sql file is generated at 00:02 every day. The server time is set to shift to daylight savings time automatically, so I don't know of an easy way to keep the cron from changing too...

More than two changes result in a permanent black mark in my nick book, I'm sorry.

Also, he provided a way to have it independent of the DST, assuming you are using crontab or similar. Which timezone that 00:02 is in would be helpful too, I guess.

Toruk Makto · March 01, 2013, 01:01:25 PM

I guess I'll have to live with your black mark then.

FreeBSD doesn't support CRON_TZ, so that won't work. The server is in the C[SD]T time zone.

Tuiq · March 01, 2013, 01:03:16 PM

Hm. I guess the only real "solution" then is to have it scheduled on winter time - so it's executed an hour later.

That is, unless we start adding hooks (i.e. "Once we generated the stuff, we call your site over HTTP") - but that wouldn't be worth the effort, I believe.

Tsyalatun te Eyktan Txuratu'itan · March 01, 2013, 01:23:52 PM

Quote from: Toruk Makto on March 01, 2013, 01:01:25 PM
I guess I'll have to live with your black mark then.

FreeBSD doesn't support CRON_TZ, so that won't work. The server is in the C[SD]T time zone.

Well, there's another solution: as my crond does support it, I can set the crontab to operate in your timezone, so the two will change daylight savings together. I just need to work out what TZ= string corresponds with C[SD]T... TZ=CST6CDT maybe?

$ zdump -v CST6CDT
...
CST6CDT Sun Mar 11 07:59:59 2012 UTC = Sun Mar 11 01:59:59 2012 CST isdst=0 gmtoff=-21600
CST6CDT Sun Mar 11 08:00:00 2012 UTC = Sun Mar 11 03:00:00 2012 CDT isdst=1 gmtoff=-18000
CST6CDT Sun Nov 4 06:59:59 2012 UTC = Sun Nov 4 01:59:59 2012 CDT isdst=1 gmtoff=-18000
CST6CDT Sun Nov 4 07:00:00 2012 UTC = Sun Nov 4 01:00:00 2012 CST isdst=0 gmtoff=-21600
CST6CDT Sun Mar 10 07:59:59 2013 UTC = Sun Mar 10 01:59:59 2013 CST isdst=0 gmtoff=-21600
CST6CDT Sun Mar 10 08:00:00 2013 UTC = Sun Mar 10 03:00:00 2013 CDT isdst=1 gmtoff=-18000
CST6CDT Sun Nov 3 06:59:59 2013 UTC = Sun Nov 3 01:59:59 2013 CDT isdst=1 gmtoff=-18000
CST6CDT Sun Nov 3 07:00:00 2013 UTC = Sun Nov 3 01:00:00 2013 CST isdst=0 gmtoff=-21600

Do these look like reasonable DST transitions for your server's timezone?

Edit: *pondering*. No, I don't think CST6CDT is the right zonefile.

$ date; TZ=CST6CDT date
Fri Mar 1 19:33:10 GMT 2013
Fri Mar 1 13:33:10 CST 2013

So that indicates that my CST6CDT says that CST is 6 hours behind GMT. So, if the file is generated at 00:02 CST, that would be 06:02 GMT. But, the current file reports this in the HTTP headers:

Date: Fri, 01 Mar 2013 19:30:06 GMT
Last-Modified: Fri, 01 Mar 2013 09:00:02 GMT

which is about three hours later...

Confused.
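
As a cross-check of the arithmetic, here is a short Python sketch that uses the gmtoff values from the zdump output (-21600 seconds for CST, -18000 for CDT) as fixed offsets. It assumes the 00:02 generation time is in the server's local zone; the April date is just an arbitrary day on which daylight saving is in effect.

```python
from datetime import datetime, timedelta, timezone

# Offsets taken from the gmtoff values in the zdump output.
CST = timezone(timedelta(seconds=-21600), "CST")
CDT = timezone(timedelta(seconds=-18000), "CDT")


def to_utc(local_time):
    """Convert a timezone-aware local time to UTC."""
    return local_time.astimezone(timezone.utc)


# 00:02 local time on 1 March 2013, while the zone is on standard time.
generated_cst = datetime(2013, 3, 1, 0, 2, tzinfo=CST)
print(to_utc(generated_cst).strftime("%H:%M"))  # 06:02

# The same wall-clock time once daylight saving is in effect.
generated_cdt = datetime(2013, 4, 1, 0, 2, tzinfo=CDT)
print(to_utc(generated_cdt).strftime("%H:%M"))  # 05:02

# Gap between the expected UTC time and the observed Last-Modified header.
last_modified = datetime(2013, 3, 1, 9, 0, 2, tzinfo=timezone.utc)
print(last_modified - to_utc(generated_cst))  # 2:58:02
```

Under these assumptions the observed Last-Modified time does not line up with a 00:02 start in US Central time, which would be consistent with the file being written late in the generation chain or with the 00:02 referring to a different zone.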