Updating Mobile App Database Schema

Seze · September 08, 2010, 09:57:55 PM

I started working today on a new version of the database that will more resemble the schema of the main EanaEltu SQL. The new version will allow for localization and infix positions. Once I get this done, the sqlite file will be renamed from "dictionary.sqlite" to "database.sqlite". This way I won't break any old versions of Apps that expect the old schema.

EDIT:
Typo fixed "database.sql" --> "database.sqlite"

omängum fra'uti · September 09, 2010, 12:33:57 AM

I'd be more in favor of keeping sqlite and renaming it from dictionary to something else - while most people won't see it, .sql makes me think of a series of SQL commands rather than a binary database.

Feel free to post your work in progress if you want feedback or review. Otherwise, let me know the changes and I can update my nightly dictionary update to create it along with the update to the old. (That also allows a single version file to be used for both if they are updated at the same time.)

On the back end updating, at some point I want to stabilize the IDs so they can easily refer to external resources (Images, sound clips, etc).

Seze · September 09, 2010, 02:15:22 AM

Quote from: omängum fra'uti on September 09, 2010, 12:33:57 AM
I'd be more in favor of keeping sqlite and renaming it from dictionary to something else - while most people won't see it, .sql makes me think of a series of SQL commands rather than a binary database.

Feel free to post your work in progress if you want feedback or review. Otherwise, let me know the changes and I can update my nightly dictionary update to create it along with the update to the old. (That also allows a single version file to be used for both if they are updated at the same time.)

On the back end updating, at some point I want to stabilize the IDs so they can easily refer to external resources (Images, sound clips, etc).

That would be a typo on my part, I meant "database.sqlite", oel lu skxawngit... I am using the IDs from EanaEltu. They appear to be stable and do not change other than new ones are added when new words are added. TireaAean has offered to start recording audio samples that we can use in the App. I gave him the current list of words with their corresponding IDs to use when creating the clips. Each audio clip will be identified only by the ID number (1.mp3). That should help prevent some issues that might pop up with special chars in filenames.

As of now, I have a script that will run automatically, once I add it to my crontab, that will download the SQL file and update a MySQL database, then it will duplicate everything from the EanaEltu database into seperate tables modifying and adding data as needed for the Mobile App. Next (this is the part I still need to write) it will dump the MySQL tables that the Mobile App needs to CSV files that will then be used to create the SQLite database for the App. The database version won't change unless the SQL file has been updated. I am thinking about changing the version from just an arbitrary int that increments each time the database is updated to the last_modified unix timestamp of the EanaEltu SQL file. That would give us both a hard link to SQL version and the exact time it was updated.

The schema is almost identical to the EanaEltu database, except that I added a field called "alpha" to both the metaWords and localizedWords tables that acts just like the "alpha" field in the current database. I also re-added the version table. I need to re-add the android_metadata and the fancy_parts_of_speech tables. Can you send me the code that dynamically creates the fancy_parts_of_speech table?

Well, thats all I've gotten done at this point. I was hoping to have more done, but I wasted a few hours debugging a UTF-8/latin1 bug...

Muzer · September 09, 2010, 04:41:55 PM

They are not stable. They usually are, but there are a few things that will cause them to change - deleting words being one of them.

If you want a stable ID, you'll have to generate one youself currently, I'm afraid. I suggest identifying by the Na'vi, or a checksum of the Na'vi if you're worried about special characters.

Tuiq · September 09, 2010, 05:15:13 PM

Currently it works this way:

Internally, the IDs are stable. The exported, however, are not. They are increasing and change as soon as there is a gap in the internal database (for example, Taronyu deleted #42, yet it is exported in the SQL.)

If you want to associate entries with sound files I'd recommend going by IPA. Because if the IPA changes, your voice file has to change as well.

As soon as all words are loaded, the list is iterated. Order is undefined. If a word has multiple definitions (srak(e)), there are two entries and therefore two different ids, one for srak, one for srake. If a word becomes or looses such a bracket addition or just gets deleted, all IDs shift.

The SQL file is updated every day at about 3:15 CEST/CET. The progress takes some time, so I'd rather go for 2:00 GMT. (4:00 CEST/CET). The file is updated anyway, even if there was nothing changed.

If you really need unique IDs I'd recommend you change the behaviour EE handles them. In SpeakNavi.pm::OpenMySQLDatabase the IDs are assigned. If you use SHA1/MD5 instead of a number, you could use the format

Code Select

internalID-Variation

For example, srak had 413-0 and srake 413-1. Most words would just have ID-0. This way the IDs would be unique and stable, no matter what.

omängum fra'uti · September 10, 2010, 12:42:51 AM

Quote from: Muzer on September 09, 2010, 04:41:55 PM
They are not stable. They usually are, but there are a few things that will cause them to change - deleting words being one of them.

If you want a stable ID, you'll have to generate one youself currently, I'm afraid. I suggest identifying by the Na'vi, or a checksum of the Na'vi if you're worried about special characters.

Quote from: Ilisaqpuq on September 09, 2010, 05:14:28 PM
I suggest just standardizing the Na'vi word in question (lower case, degrade special characters, etc) and using that as an ID, so it's more readable and useful than ffe7c1a83369 or 28847 as references.

Both those have one major flaw... Words that are spelled the same but pronounced differently... Currently only tute/tute. Degrading special characters could make that worse depending on how it's done. So something more will need to be done to handle cases like that. This is especially true for pronunciation samples where it is significantly different.

As far as creating the fancy parts of speech, I do most of the schema conversion in SQL, so it is done in SQL... It does the brute force approach.

Spoiler

Code Select

INSERT
INTO   fancy_parts_of_speech
       (
              part_of_speech,
              description
       )
SELECT part_of_speech,
       REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(part_of_speech, '^,^', '^, ^'),'^adj.^','Adjective'),'^adp.^','Adposition'),'^adv.^','Adverb'),'^conj.^','Conjunction'),'^dem.^','Demonstrative'),'^inter.^','Interrogative'),'^intj.^','Interjection'),'^n.^','Noun'),'^num.^','Number'),'^part.^','Participle'),'^phrase^','Phrase'),'^pn.^','Pronoun'),'^prefix^','Prefix'),'^root^','Root'),'^v.^','Verb'),'^vin.^','Intransitive Verb'),'^vtr.^','Transitive Verb'),'^svin.^','Stative Intransitive Verb'),'^prop.n.^','Proper Noun')
FROM   (SELECT  part_of_speech
       FROM     entries
       GROUP BY part_of_speech
       );

Tuiq · September 10, 2010, 06:41:30 AM

Seriously - it would be way easier to modify EE directly. I'll announce it in the EE thread and update it most likely tomorrow.

You will have to change the table layout; id won't be an integer anymore, it will be a CHAR(40). Like always, I really recommend to execute THE WHOLE SQL and not just the truncate/insert part because I do not guarantee the database layout to stay stable.

Edit: It's updated. The next generated SQL will use a SHA1 hash of the internal ID rather than an incrementing number. The new ID will be stable and unique.

Seze · September 11, 2010, 12:08:57 AM

I have the first beta of the new SQLite database ready. You can download it here. Let me know if there is anything that needs changed or added.

This version has all currently available languages in it. The database file size has grown considerably because of it (~0.8 MB). While that size isn't too bad, I am thinking about possibly outputting separate SQLite databases for the different languages. Curious what everyone's thoughts are on that. I also just noticed that the parts of speech are english only right now (pulled from the metaWords table), so at some point I'll need to localize all the parts of speech from the localizedWords table instead.

Tuiq · September 11, 2010, 03:29:40 AM

Did you already try to compress the database? If I remember correctly I can bring my SQL down to about 140 kB using zlib.

Seze · September 11, 2010, 06:54:48 AM

Quote from: Tuiq on September 11, 2010, 03:29:40 AM
Did you already try to compress the database? If I remember correctly I can bring my SQL down to about 140 kB using zlib.

Never thought about compressing it. I guess the next question would be if I did compress it somehow, would it even be possible to decompress the SQLite database on a mobile phone? I will have to look into that...

omängum fra'uti · September 11, 2010, 12:53:15 PM

In the case of Android, compressed is no problem, the Android API includes zlib compression as a standard.

However, to be used the DB must be stored uncompressed, and that IMO is the main concern here. Mobile devices tend to have limited storage resources.

I did a little experiment, and over half the size of the DB is taken up by the hash keys. When I took them out and replaced them by integer keys, the size dropped drastically to around .35mb. So I propose the following idea... For the mobile form of the DB keep the ID as an integer, but have that integer come from an external DB which maps the short ID to the full hash ID. That way the hash still provides a high level of stability, but the mobile DB is kept compact. (And the Android implementation still requires an integer ID column.)

Seze, my thought is just a simple ID mapping table rather than an additional column in your mysql database... The reason is simple updates. Keep the fields in your mysql database identical to the dump from EE, and just re-load the data when you dump. Then do something like the following...

Schema:

CREATE TABLE idMap(
intKey MEDIUMINT NOT NULL AUTO_INCREMENT,
id CHAR(40) NOT NULL,
PRIMARY KEY (intkey), INDEX (id)
);

Query after updating the rest of the data:

INSERT INTO idMap(id) SELECT id FROM metaWords LEFT JOIN idMap USING (id) WHERE intKey IS NULL;

This will handily create a new unique integer key for any added row. Then when exporting to a sqlite DB, just join with the idMap table and return the integer key rather than the hash key for the dumped data.

Seze · October 02, 2010, 02:22:16 AM

I've been playing around with using the new database schema in the iPhone version of the mobile app and I've hit a small performance issue. Scrolling performance in the dictionary has become rather jittery using the new database layout. The app loads data from the database when the table cell is about to appear on screen. This is supposed to be the optimal way to load data from SQLite. The problem seems to be with SQLite not being able to get the data out of the database fast enough due to the join between the metaWords and localizedWords tables. Anyone have any suggestions on how to remove a join between the two tables and keep the ability to localize the database? One idea would be a separate database for each locale, or separate tables in the database for each locale.

Example Query from the iPhone Version:

Spoiler

omängum fra'uti · October 02, 2010, 11:15:44 AM

Adding an index might help. I'll play around with it after this weekend and see if I can improve performance. Otherwise, seperate DB is my vote.

Seze · October 02, 2010, 01:04:57 PM

I just anded an "entries" table to the database which has pretty much all of the data in it. When I switched the App over to use that table instead of "metaWords" and "localizedWords" the scrolling performance came back to normal. So it would appear that joins are not good for SQLite on embedded devices. The current version of "database.sqlite" just has the extra "entries" table in it. I left the other tables in so others can test with it for now. I'm slightly leaning on just separate tables for each locale. Separate tables shouldn't take up any more space than using the "localizedWords" table which already has all the locales in it.

omängum fra'uti · October 03, 2010, 02:50:14 AM

Joins are bad period without proper indexes. GIve me a copy of the one with seperate tables for words vs localized definitions and I'll see what I can do about performance.

Seze · October 03, 2010, 12:37:05 PM

Quote from: omängum fra'uti on October 03, 2010, 02:50:14 AM
Joins are bad period without proper indexes. GIve me a copy of the one with seperate tables for words vs localized definitions and I'll see what I can do about performance.

Database

That database has 3 main tables, "localizedWords" and "metaWords" plus the new table "entries" so you can test using joins and without joins.

omängum fra'uti · October 15, 2010, 04:05:44 AM

I was just noticing something that could probably be done a little better...

The changelog has tons of data the mobile app doesn't care about. It really only needs three things...

1. The ID
2. The revision it first appeared in
3. The revision it last changed in

At most one row then would be needed for each entry, and it could be indexed on the ID. In fact, I'd dare say that the two columns could even be added into metaWords and just queried with the rest of the data and acted upon then.