Translating Taronyu's dictionary

Started by Tuiq, July 29, 2010, 04:25:55 AM


marcin1509

Maybe we should create one app that brings together all the translations in the different languages.
I might try to write it in Java + SQLite. Maybe after that I'll try to write it in another technology.

Tìtstewan

If I'm not mistaken, there is an app in development that supports, or is planned to support, multiple languages. *looks at Tirea Aean* :P

-| Na'vi Vocab + Audio | Na'viteri as one HTML file | FAQ | Useful Links for Beginners |-
-| Kem si fu kem rä'ä si, ke lu tìfmi. |-

marcin1509

I meant a Windows application, not a mobile (Android) one. Maybe a universal app for Windows Mobile and the Windows 10 desktop.

Tirea Aean

Quote from: marcin1509 on June 30, 2016, 09:20:50 AM
I meant a Windows application, not a mobile (Android) one. Maybe a universal app for Windows Mobile and the Windows 10 desktop.

I've already been working on this for years. :D

Which reminds me, I need to get around to compiling it for Windows (it's already compiled for Linux). See my GitHub, http://github.com/tirea - specifically the Fwew and possibly the vrrtepcli projects. The only issue with these for Windows users is that no one likes to use cmd.exe as an interface on Windows. HRH So if you pull this off with a nice GUI, I might just stop cross-compiling my apps ;)

Tirea Aean

Not to resurrect the dead, but...

[screenshot: SELECT from localizedWords showing rows of question marks]

I'm just not sure what to make of the Russian portion of our database. It's been a pile of question marks in localizedWords for a couple of years now. What happened, and how can this be fixed?

[screenshot: SELECT from localizedInfixes showing correct Cyrillic]

As you can see, the localizedInfixes table looks okay, but localizedWords is just not working out for our Russian-speaking friends. This bug is present in production, so the exported .sql file contains the corrupted data, and the projects that use it (such as vrrtepcli and fwew) inherit the problem as well.

Tìtstewan

Apparently there was an issue when saving Russian characters to the database. I have the same problem when using the database for the dictionary generator.
Code (php)
<?php
// Select your language. The following are available (according to EE's NaviData.sql):
// eng, de, pl, est, hu, sv, nl and ru (ru = Russian will show no words, only ???, so don't use it)
$lang = "'eng'";
?>


Tuiq

#126
I don't think the issue occurs when saving into the database; the problem is the encoding of the characters. While latin1 works fine for English, German and various other not-too-far-from-the-Latin-alphabet languages, Russian with its Cyrillic alphabet was never part of latin1, and that results in the question marks you see.
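Here's a minimal sketch (not the actual EE code) of what that looks like: anything latin1 can't represent gets replaced with '?'.

Code (php)
<?php
// Minimal sketch: latin1 (ISO-8859-1) has a slot for ü but none for
// Cyrillic, so every unmappable character becomes '?'.
$german  = 'berühren'; // ü exists in latin1, so this survives
$russian = 'трогать';  // Cyrillic is not part of latin1

echo mb_convert_encoding($german, 'ISO-8859-1', 'UTF-8'), "\n";  // berühren (as latin1 bytes)
echo mb_convert_encoding($russian, 'ISO-8859-1', 'UTF-8'), "\n"; // ???????
?>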

EE's main database contains the proper data (see the attachment), so it seems the encoding isn't being written correctly into the exported files (the TSV shows the same issue). This is kind of weird... I think I'm exporting explicitly as UTF-8, as you can see from the German words ('berühren'). I'm not entirely sure why it doesn't work for the Russian ones. Looking at it, there seem to be other characters that aren't properly encoded either.

So, errr... it's been a few (... 3? 4? 5? 6?) years since I've actually messed around on the server, or the database for that matter. I can't tell you more than that something about the encoding seems wrong when exporting the files, even though it's correct in the web interface. So I suppose I would start by investigating where the mistake in the export script lies... Is the encoding for the output stream not properly set, or even ignored? Is the file okay, but the webserver serves it wrong? Is the data somehow read corrupted from the database?
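If anyone wants to check the exported file itself in the meantime, a quick diagnostic sketch (not part of any existing script; assumes NaviData.sql sits next to it):

Code (php)
<?php
// Diagnostic sketch: is the exported file valid UTF-8, and are the '?'
// placeholders already baked into the bytes on disk?
$sql = file_get_contents('NaviData.sql');
var_dump(mb_check_encoding($sql, 'UTF-8'));    // false => encoding broken on disk
var_dump((bool) preg_match('/\?{2,}/', $sql)); // true  => runs of '?' baked in
?>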

I think by now I've lost all access to EE/the database, so if I'm to take a look at that, I'll need the credentials, URIs and all that again.

Edit: Okay, the fact that words don't work but infixes do is odd. Like, really odd. I suspect that something in SpeakNavi isn't dealing with UTF-8 properly, but then again, why would only THAT be affected?
Eana Eltu: PDF/TSV/jMemorize

Tirea Aean

Those screenshots I posted were from the server's MySQL via SSH. I wasn't sure of the source of the bug. I ran mysql with the --default-character-set=utf8 parameter. I didn't find anything wrong with other languages when selecting stuff from the table. ¯\_(ツ)_/¯

Tuiq

Using another client displays it correctly. In all likelihood, the font you're using doesn't support Cyrillic characters?

Some other languages seem to be broken, too. For example, VALUES ('8','pl','M?otog?ów','rz.') seems a bit borked - I don't expect it to have a question mark in the middle of the word... twice.

Tirea Aean

But the Cyrillic chars in the localizedInfixes table query look fine. (See the other screenshot up there.)

Tìtstewan

For me, the question marks are in the NaviData.sql file I use for the generator.

EDIT: And I use mysqli_set_charset($db_link, 'utf8'); in my script.
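For reference, here's the shape of that connection setup (a sketch only: the credentials and database name are placeholders, and the lc = 'ru' filter just matches the language codes used in the data):

Code (php)
<?php
// Sketch of a generator-style connection (hypothetical credentials).
$db_link = mysqli_connect('localhost', 'user', 'password', 'navidata');

// Without this, mysqli may negotiate latin1 with the server, and Cyrillic
// rows come back as '?' even though the stored data is fine.
mysqli_set_charset($db_link, 'utf8');

$result = mysqli_query($db_link, "SELECT * FROM localizedWords WHERE lc = 'ru'");
while ($row = mysqli_fetch_assoc($result)) {
    print_r($row); // should show Cyrillic, not ???????
}
?>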


Tuiq

Right, so let's clear up some misunderstandings first...

There are four data sets:

1. The original database, which also happens to be MySQL. Here, everything is OK.
2. Because the original dataset was meant for LaTeX, it's rather ugly. Therefore, the script reads the data, transforms it into a nicer form, and makes it available for exporters.
3. The SQL exporter takes the refined data and writes it out in SQL format (see the sketch after this list). We know that this SQL is wrong, as it omits certain characters... in certain blocks.
4. You're loading that SQL into a MySQL database again.
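To make step 3 concrete, here's a hedged sketch of the exporter's job - this is not the real exporter, and the row is the Polish example from above with the 'ł' characters presumably restored:

Code (php)
<?php
// Hedged sketch of step 3: take refined rows and write SQL. If the
// strings in $rows are proper UTF-8 and are written out as-is, the file
// stays UTF-8; if a latin1 conversion sneaks in anywhere along the way,
// '?' placeholders get baked into the file.
$rows = [
    ['8', 'pl', 'Młotogłów', 'rz.'], // the borked Polish row from above, presumably
];
$out = fopen('NaviData.sql', 'w');
foreach ($rows as $r) {
    $vals = implode("','", array_map('addslashes', $r));
    fwrite($out, "INSERT INTO localizedWords VALUES ('$vals');\n");
}
fclose($out);
?>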

So either step 2 or step 3 is borking. It can't be step 1, because the data is fine in my database - see the example (dumped as JSON):

[
    {
        "id": "1",
        "arg1": null,
        "arg2": null,
        "arg3": "\u043f\u0435\u0440.",
        "arg4": "\u0442\u0440\u043e\u0433\u0430\u0442\u044c",
        "arg5": null,
        "arg6": null,
        "arg7": null,
        "arg8": null,
        "arg9": null,
        "arg10": null,
        "odd": ""
    },
    {
        "id": "2",
        "arg1": null,
        "arg2": null,
        "arg3": "\u0441\u0443\u0449.",
        "arg4": "\u043c\u043e\u043b\u043e\u0442\u043e\u0433\u043b\u0430\u0432 (\u0436\u0438\u0432\u043e\u0442\u043d\u043e\u0435)",
        "arg5": null,
        "arg6": null,
        "arg7": null,
        "arg8": null,
        "arg9": null,
        "arg10": null,
        "odd": "",
        "lc": "ru"
    },

It's encoded for JSON, but if you evaluate it, it's the proper stuff (трогать 'to touch', молотоглав (животное) 'hammerhead (animal)'). So step 1 isn't breaking. Either 2 or 3 is acting up, and I'm not sure which one it is yet. I'll need access to the current scripts that do the exporting, so I can see what's going on.
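You can verify those escapes yourself; json_decode turns them back into real Cyrillic:

Code (php)
<?php
// The \uXXXX escapes above decode back to real Cyrillic:
echo json_decode('"\u0442\u0440\u043e\u0433\u0430\u0442\u044c"'), "\n"; // трогать
?>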

Technically, I would say #2 is broken, because as far as I can tell, the SQL exporter sets the output encoding properly and everything. However, the fact that the infixes work but the words don't sounds fishy... which again makes it more likely that 2 is the broken one.

Seriously, I should just rewrite this stuff in C# already.

Tuiq

It was MySQL's fault. I'm not entirely sure why, but I'm not going to question it, nor am I going to go deeper into this issue.

All content should now be available and properly UTF-8 encoded. Russian was just the most obvious case; there were other entries in the .sql that weren't properly encoded either. Those should be fixed now, too. The infixes worked because they're loaded through another DB wrapper, which was already forcing MySQL to send its data as UTF-8 or something... something I've now had to do in the other wrapper myself.
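For the record, "forcing MySQL to send UTF-8" usually comes down to something like this (a sketch only; the actual EE wrappers aren't shown here, and the DSN values are placeholders):

Code (php)
<?php
// Sketch: two equivalent ways to force the connection to UTF-8.
// 1. Declare the charset in the PDO DSN:
$pdo = new PDO('mysql:host=localhost;dbname=navidata;charset=utf8', 'user', 'password');
// 2. Or issue SET NAMES on connect, which is presumably what the wrapper
//    used by the infixes was already doing:
$pdo->exec("SET NAMES utf8");
?>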

Tirea Aean

Ma Tuiq, thank you for all your time investigating and fixing this!

eejmensenikbenhet

I'm having trouble saving the Dutch dict... I just translated the Mo'ara definition in 13.41 and entered the new changelog line before clicking "Create", and now it gives me a wall of red error text.

I've included the complete log in the attachments.

EDIT: Also, it seems a lot of the variables have been changed... Complete translations have vanished? The intro text, daytime definitions and more...

eejmensenikbenhet

#135
(Excuse the double-post)
Upon further inspection, it seems that it has cut off every translation containing an accented vowel: ì/ä/ë/ó.
Those are used both in Na'vi and in Dutch, but I've never had any trouble with them before.

EDIT: It seems that every time I try to compile the document, it cuts off all translations containing an accented vowel... Luckily, I saved the 13.332 version on my laptop, so I can copy and edit the intro texts and such, but I'm hoping this didn't affect any of the word translations.

Tuiq

Hello!

Runaway argument?
{säpllhrr). Dank aan Elf! \item {\bf 13.301} - Tikfouten/formatterin\ETC.
! File ended while scanning use of \textbf .


=> 13.31 - Onnozele tikfouten van \textbf{pllhrr} en \textbf{säpllhrr). Dank aan Elf!

In CHANGELOG, you've used the wrong bracket: ')' instead of '}'. (The line is Dutch for "Silly typos of pllhrr and säpllhrr. Thanks to Elf!")

Fixed it. Dictionary compiles again.

eejmensenikbenhet

#137
Ah, so it does, thanks; that was fairly stupid of me...
Now there's still the problem with the accented vowels: it cuts off a lot of text, which in turn leaves open \textbf{ braces that turn the entire document bold.

EDIT: I'm working around it now using LaTeX accented vowels.
\`{i} and \'{o} seem to do the trick for now...

Tuiq

It's not your fault; if anything, the system should catch those mistakes and not let you save them in the first place. Or, even better, not allow you to make them at all - e.g. by using Markdown or something instead of LaTeX.
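For illustration, a minimal sketch of such a check (a hypothetical validator, not anything that exists in EE): reject a line whose braces don't balance before it ever reaches LaTeX.

Code (php)
<?php
// Hypothetical validator sketch: reject input whose braces don't balance.
function braces_balanced(string $line): bool {
    $depth = 0;
    foreach (str_split($line) as $ch) {
        if ($ch === '{') $depth++;
        if ($ch === '}') $depth--;
        if ($depth < 0) return false; // a '}' before its '{'
    }
    return $depth === 0;
}

var_dump(braces_balanced('\textbf{säpllhrr). Dank aan Elf!')); // false: the typo above
var_dump(braces_balanced('\textbf{säpllhrr}. Dank aan Elf!')); // true: the fixed line
?>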

What's an accented vowel for you? ì? ò? ö? Technically, as long as it is UTF-8, LaTeX should eat it... but it was always a bit nitpicky.

eejmensenikbenhet

Up until now I've had no problems using tremas (ä in Na'vi; ë, ï in Dutch) or accents (ì in Na'vi, ó in Dutch), but all of a sudden it stopped working. Oh well, it's fixed now.
Time for the next issue: it doesn't seem to accept guillemets in some cases («these infix markers»). In the regular intro text it's fine, but in the intros of the other dicts (NL-Na'vi, Categorised and Concise) it removes all of the text following them. I can replace them with \guillemotleft and \guillemotright, but that seems foolish considering they work fine in the original intro text.