Eana Eltu: Translator, Dictionary, API and putxìng.

Started by Tuiq, January 07, 2010, 04:20:17 PM

`Eylan Ayfalulukanä

Quote from: baritone on November 02, 2013, 12:32:35 AM
Quote from: Tuiq on November 01, 2013, 04:12:19 PM
However, UTF8 itself *should* cover everything.
There is no UTF-8 in pdflatex in the teTeX distribution. It is an 8-bit application in every part. Unicode support appeared only in XeTeX, which is included in TeX Live.

P.S. Unicode support in (pdf)latex is like that of the Linux text console in UTF mode: pdflatex translates UTF-8 into an 8-bit font encoding and works with bytes thereafter. And with xelatex, the babel package should be replaced by polyglossia in order to avoid translating Unicode into 8-bit font encodings.

Ah, this makes sense! We are finally getting somewhere in understanding what is not working with the macrons, and probably other things as well.

The escape code for macrons is more complex than just an apostrophe. I'll have to dig through my notes to see what it is.

Yawey ngahu!
pamrel si ro [email protected]

Tuiq

Quote from: baritone on November 02, 2013, 12:32:35 AM
Quote from: Tuiq on November 01, 2013, 04:12:19 PM
However, UTF8 itself *should* cover everything.
There is no UTF-8 in pdflatex in the teTeX distribution. It is an 8-bit application in every part. Unicode support appeared only in XeTeX, which is included in TeX Live.

P.S. Unicode support in (pdf)latex is like that of the Linux text console in UTF mode: pdflatex translates UTF-8 into an 8-bit font encoding and works with bytes thereafter. And with xelatex, the babel package should be replaced by polyglossia in order to avoid translating Unicode into 8-bit font encodings.

Well, I throw a UTF-8 encoded file at it and it seems to compile fine - at least the dictionaries became searchable for ä and ì after the transition, if I remember correctly. Or something else forced us to. I can't remember, to be honest, but we had to switch from a "normal" encoding to UTF-8 some time ago.
Eana Eltu: PDF/TSV/jMemorize

baritone

Quote from: Tuiq on November 02, 2013, 05:44:56 AM
Quote from: baritone on November 02, 2013, 12:32:35 AM
Quote from: Tuiq on November 01, 2013, 04:12:19 PM
However, UTF8 itself *should* cover everything.
There is no UTF-8 in pdflatex in the teTeX distribution. It is an 8-bit application in every part. Unicode support appeared only in XeTeX, which is included in TeX Live.

P.S. Unicode support in (pdf)latex is like that of the Linux text console in UTF mode: pdflatex translates UTF-8 into an 8-bit font encoding and works with bytes thereafter. And with xelatex, the babel package should be replaced by polyglossia in order to avoid translating Unicode into 8-bit font encodings.

Well, I throw a UTF-8 encoded file at it and it seems to compile fine - at least the dictionaries became searchable for ä and ì after the transition, if I remember correctly. Or something else forced us to. I can't remember, to be honest, but we had to switch from a "normal" encoding to UTF-8 some time ago.
Please download the dictionary here and search for the string "n`ıyawr". The search will find the word "nìyawr".

Tuiq

"n`ıyawr"

Alright. The Russian dictionary is weird. But I really recall that we had the same issues in the normal dictionaries, before we switched to UTF-8.

baritone

Quote from: Tuiq on November 02, 2013, 11:30:37 PM
"n`ıyawr"

Alright. The Russian dictionary is weird. But I really recall that we had the same issues in the normal dictionaries, before we switched to UTF-8.
The Russian dictionary uses UTF-8 input encoding, just like the other languages. But the switch to Unicode cannot be complete for Russian while pdflatex is used, because pdflatex uses the 8-bit font encodings T2[ABCD], which do not contain the ä and ì characters.

Here is a sample dic.pdf file, compiled with xelatex using Unicode fonts, and the nfssfont.pdf file. You can search for "nì'aw" in dic.pdf in the normal way. In nfssfont.pdf you can see that the ä and ì symbols are not in the T2A encoding, which is why words containing ä and ì cannot be searched for in the Russian dictionary compiled with pdflatex on the LN server.

P.S. If my writing in English is not clear, please tell me about it.

baritone

Sorry, maybe I wrote something wrong?
I meant that even though the Russian dictionary input file is encoded in UTF-8, when pdflatex compiles this text into the PDF file, all symbols in the input file are converted into the 8-bit T2A font encoding and treated as 8-bit text. In this conversion, the ä and ì characters are each replaced by two characters: ¨ + a and ` + ı, respectively.

P.S. It does not matter here that the output PDF file is encoded in UTF-8, since the text went through a UTF-8 to T2A to UTF-8 conversion when the PDF file was created.
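
The mismatch can be illustrated with a minimal Python sketch. It assumes only what was quoted earlier in the thread: that the pdflatex-built PDF yields a spacing grave accent followed by a dotless ı where the source had ì. The variable names are hypothetical, and nothing here reads an actual PDF.

```python
import unicodedata

# What the dictionary source contains: one precomposed code point for ì.
source_word = "n\u00ecyawr"      # nìyawr

# What a search in the pdflatex-built PDF matches (as quoted in the thread):
# a spacing grave accent followed by a dotless i.
pdf_word = "n\u0060\u0131yawr"   # n`ıyawr

# Show the code points involved.
for ch in (source_word[1], pdf_word[1], pdf_word[2]):
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")

# The strings differ in length and content, so a plain text search fails.
print(len(source_word), len(pdf_word))
print(source_word == pdf_word)

# Normalization does not rescue the search: U+0060 is a spacing character,
# not a combining mark, so NFC will not compose it with the following letter.
print(unicodedata.normalize("NFC", pdf_word) == source_word)
```

Both comparisons print False, which matches the behaviour described above: the reader has to type the two-character form to find the word.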

Tuiq

No, it's understandable. To compensate for its lack of UTF-8 support, kind of, it falls back to escape sequences or something along those lines. I feel that this is, as I've said, out of my reach. You would have to convince Markì that using xetex A) is possible (god knows what this niche OS supports) and B) doesn't change anything about the current design/look.

Toruk Makto

There is nothing niche about FreeBSD. I will take a look at xetex when I get time to do so.

Lì'fyari leNa'vi 'Rrtamì, vay set 'almong a fra'u zera'u ta ngrrpongu
Na'vi Dictionary: http://files.learnnavi.org/dicts/NaviDictionary.pdf

baritone

Thank you!
Of course, I will be happy to share any information and experiences that I have about this.

Tuiq

That reminds me of two things.

One, perhaps it will work again with xetex/whatever we'll use.

Two, the whole dictionary process likely won't work properly, because I rely on finding pdflatex error messages to check whether the generation was successful. That means I'll have to adapt a few lines for the new generation tool. If I remember correctly, the failure would be a false positive though, i.e. the generation would always be displayed as a success.
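
A check that depends less on one engine's wording could look like the following Python sketch. It is not the Eana Eltu code (which is Perl). It assumes the build tool exits with a non-zero status on failure and that error lines in the output start with "!", which is TeX's usual error marker. The function names are hypothetical.

```python
import subprocess


def log_has_tex_error(log_text):
    """Return True if any line starts with '!', TeX's usual error marker."""
    return any(line.startswith("!") for line in log_text.splitlines())


def build_succeeded(command):
    """Run a build command; succeed only on exit status 0 and a clean log."""
    result = subprocess.run(
        command, capture_output=True, text=True, errors="replace"
    )
    return result.returncode == 0 and not log_has_tex_error(result.stdout)


# Hypothetical usage (the file name is an assumption, not from the thread):
#   build_succeeded(["xelatex", "-interaction=nonstopmode",
#                    "-halt-on-error", "dictionary.tex"])
```

Checking the exit status first means an unexpected message format fails the build instead of passing silently, which is the opposite of the false positive described above.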

baritone

texlive with xelatex can be installed from a binary distribution on many BSD systems, as has been written here.

And there is another question: what kind of encoding is used for the Russian dictionary translation in NaviData.sql? I cannot read it.

`Eylan Ayfalulukanä


Yawey ngahu!
pamrel si ro [email protected]

Tuiq

Quote from: baritone on November 04, 2013, 01:51:24 PM
texlive with xelatex can be installed from a binary distribution on many BSD systems, as has been written here.

And there is another question: what kind of encoding is used for the Russian dictionary translation in NaviData.sql? I cannot read it.

It should be UTF-8, as is the whole dictionary. However, I'll go out on a limb and assume that "VALUES ('4','ru','??\\\'?????','???.')" is not valid UTF-8. I think those are question marks.
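
One way such question marks can arise is sketched below in Python. This is only an illustration under an assumption: that the dump was written through a character set that cannot represent Cyrillic. The sample word is made up and is not taken from NaviData.sql.

```python
# A made-up Cyrillic sample; it is not taken from NaviData.sql.
russian = "привет"

# Encoding to a charset without Cyrillic, with replacement enabled,
# turns every letter into a literal question mark (byte 0x3F).
lossy = russian.encode("latin-1", errors="replace")
print(lossy)

# The result decodes without error, but the original letters are gone.
print(lossy.decode("utf-8"))

# A UTF-8 round trip, by contrast, preserves the text.
print(russian.encode("utf-8").decode("utf-8") == russian)
```

If that is what happened, the question marks in the file are real characters and the Russian text cannot be recovered from the dump itself, so it would have to be exported again.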

Quote from: `Eylan Ayfalulukanä on November 04, 2013, 04:02:33 PM
What is the STDOUT problem??

A backend side problem that shouldn't really bother anyone not involved with it... well, that's me and the LN.org guys.

Edit: While infixes work flawlessly, everything that goes through SpeakNavi.pm first does not. This is a real issue, as SpeakNavi is quite a big obstacle right now. It wasn't designed for third-party languages, nor was it ever really intended for UTF-8, let alone Cyrillic.

At this point, I would really like to rewrite most/all of the code. If possible, I would also like to use something other than Perl, preferably ASP.NET/Mono. But I don't know if that could be arranged.

baritone

Quote from: Tuiq on November 04, 2013, 06:54:29 PM
Quote from: baritone on November 04, 2013, 01:51:24 PM
And there is another question: what kind of encoding is used for the Russian dictionary translation in NaviData.sql? I cannot read it.

It should be UTF-8, as is the whole dictionary. However, I'll go out on a limb and assume that "VALUES ('4','ru','??\\\'?????','???.')" is not valid UTF-8. I think those are question marks.
Edit: While infixes work flawlessly, everything that goes through SpeakNavi.pm first does not. This is a real issue, as SpeakNavi is quite a big obstacle right now. It wasn't designed for third-party languages, nor was it ever really intended for UTF-8, let alone Cyrillic.

At this point, I would really like to rewrite most/all of the code. If possible, I would also like to use something other than Perl, preferably ASP.NET/Mono. But I don't know if that could be arranged.
This is sad. I'll try to pull the dictionary data for vrrtepcli from the Russian language translation page.

Tuiq

Quote from: baritone on November 04, 2013, 08:57:41 PM
Quote from: Tuiq on November 04, 2013, 06:54:29 PM
Quote from: baritone on November 04, 2013, 01:51:24 PM
And there is another question: what kind of encoding is used for the Russian dictionary translation in NaviData.sql? I cannot read it.

It should be UTF-8, as is the whole dictionary. However, I'll go out on a limb and assume that "VALUES ('4','ru','??\\\'?????','???.')" is not valid UTF-8. I think those are question marks.
Edit: While infixes work flawlessly, everything that goes through SpeakNavi.pm first does not. This is a real issue, as SpeakNavi is quite a big obstacle right now. It wasn't designed for third-party languages, nor was it ever really intended for UTF-8, let alone Cyrillic.

At this point, I would really like to rewrite most/all of the code. If possible, I would also like to use something other than Perl, preferably ASP.NET/Mono. But I don't know if that could be arranged.
This is sad. I'll try to pull the dictionary data for vrrtepcli from the Russian language translation page.

Please, please don't. It's an unreliable data source and *very* expensive on the server side. It generates huge amounts of data, most of which you will never really use. On top of that, if you require IDs at all, you cannot get them from the translation site.

`Eylan Ayfalulukanä

I am wondering what will become of VrrtepCLI without Tirea Aean...

Tuiq

Don't look at me, it's been a very long time since I've written a vocabulary trainer and if anything, I'd much rather push EE.NET forward.