This self-contained tool enables incremental synchronization of your data from Google AppEngine's Datastore to a local SQLite 3 database. It will automatically convert your model definitions into an SQL schema, and download the model data in parallel using the new remote_api.
Synchronization is currently unidirectional (i.e. Datastore to SQLite, not the other way around) due to a lack of need, and also at least one technical limitation.
IntroductionThere are still many data processing tasks that aren't possible on AppEngine, to which a traditional SQL database is ideally suited. These include driving a full text indexer, running reports that aggregate complex statistics in multiple dimensions from your data, or asking ad-hoc questions that your application's counters weren't designed to support.
This utility solves these problems, while enabling many more applications to integrate with your data with minimal time and effort. An SQL schema is automatically generated by introspecting the model classes defined in your application's source code, into which new and updated entities are automatically fetched with a single command.
The database may be used for running interactive queries using one of the many free GUI management tools, or even hooking up to Microsoft Excel via ODBC. Additionally, by configuring a cron job it is possible to mirror your entire Datastore to an external location, for consumption by external software (for example, full text indexing, automated reporting and analysis, etc.)
How This WorksThe magic __key__ property is used to search for new immutable entities, while any Model subclasses that define at least one DateTimeProperty with auto_now=True will be automatically re-fetched every time they are modified. This means that you gain completely hands-free replication to an effortlessly accessible SQL database simply by including a single DateTimeProperty in each of your mutable models.
Model → SQL Schema MappingAll identifiers are mapped like so: ModelClassName → model_class_name, and ModelClassName.propertyName → model_class_name_property_name.
The current code generates the following SQL objects:
One TABLE for each Model subclass, containing a unique integer (id) and string-encoded Datastore Key. One TABLE for each property of each Model, containing an integer (class_name_id) and the actual value column (value). One VIEW for each Model, defined as a SELECT which joins together all the other tables into a single combined "SQL like" table. This is the most useful object. This design was chosen to make it trivial to add and drop columns on large data sets, without having to wait for large ALTER TABLE commands to run. It also makes it easier to define custom VIEWs on just the subset of your data you need.
SQL Type Datastore Native Type TEXT Key, ByteString, !IM, User, GeoPt, PhoneNumber, !Email, etc. (all simple text fields) TIMESTAMP datetime, date, time INTEGER int, long, Rating REAL float
ExampleThe following model:
careOf = db.StringProperty()
line1 = db.StringProperty()
line2 = db.StringProperty()
lastUpdated = db.DateTimeProperty(auto_now=True)Would be mapped to:
CREATE TABLE address_entity(id INTEGER PRIMARY KEY, ds_key TEXT NOT NULL);
CREATE TABLE address_care_of(address_entity_id INT NOT NULL, value TEXT);
CREATE TABLE address_line1(address_entity_id INT NOT NULL, value TEXT);
CREATE TABLE address_line2(address_entity_id INT NOT NULL, value TEXT);
CREATE TABLE address_last_updated(address_entity_id INT NOT NULL, value TIMESTAMP);
CREATE UNIQUE INDEX address_entity_ds_key_index
CREATE UNIQUE INDEX address_care_of_address_entity_id_index
CREATE UNIQUE INDEX address_line1_address_entity_id_index
CREATE UNIQUE INDEX address_line2_address_entity_id_index
CREATE UNIQUE INDEX address_last_updated_address_entity_id_index
CREATE VIEW address AS
address_entity.id AS __id,
address_entity.ds_key AS __ds_key,
address_care_of.value AS care_of,
address_line1.value AS line1,
address_line2.value AS line2,
address_last_updated.value AS last_updated
LEFT JOIN address_care_of ON (address_care_of.address_entity_id = address_entity.id)
LEFT JOIN address_line1 ON (address_line1.address_entity_id = address_entity.id)
LEFT JOIN address_line2 ON (address_line2.address_entity_id = address_entity.id)
LEFT JOIN address_last_updated ON (address_last_updated.address_entity_id = address_entity.id)
-- Example query:
SELECT COUNT(*) FROM address WHERE line1 LIKE '1 Infinite Loop%'UsageUsage: dsync.py [options]
-a AppEngine application name, e.g. "shell"
-e Developer's e-mail address (default: prompt)
-p Developer's password (default: prompt)
-r remote_api path on server (default "/remote_api")
-L Prepend extra path to module search path
-m Load Model classes from module
-d Local database path (default "./models.sqlite3")
-x Exclude the given Model class
-N Number of fetch worker threads (default: 5)
-C Number of entities to fetch per request (default: 50)
-v Verbose/debug output
--batch Fail rather than prompt for a password if none is
provided on the command line.
Path to AppEngine SDK (default: search).
Arrange for the given system command to be executed
after new or updated entities whose class matches
ModelGlob fetched (glob may contain * and ?).
Arrange for the given SQL statement to be executed
after new or updated entities whose class matches
ModelGlob fetched (glob may contain * and ?).
--help This message.
Synchronize the database schema to match the loaded Model
Start fetching data from Datastore to the database,
synchronizing its schema if necessary.
List properties and tables that no longer have associated
definitions in the loaded Model classes.
Remove properties and tables from local database that no
longer have associated definitions in the loaded Model
classes. Check "orphaned" output first to verify this
won't result in data loss!
Backup "myapp.appspot.com"'s datastore to $HOME/myapp.db
except for RemoteUrlCacheEntry:
dsync.py -L $HOME/src -m myapp.models -m myapp.counters \
-d $HOME/myapp.db -x RemoteUrlCacheEntry \
-a myapp -e email@example.com -p 1234 \
fetchKnown IssuesThis code represents about 10 hours worth of work; it's still at a very early stage. Please see the issue tracker for a more definitive list of issues, but most importantly right now:
With high worker thread counts, SQLite locks timeout. Not well tested. Only supports SQLite; PostgreSQL support would be great for very large data sets. I will be productionizing the code in the coming months, but if you come across a major issue, please tell me about it.
ConfigurationLocal SystemBefore using the utility, you must have Python 2.5 and the Google AppEngine SDK installed on your working machine, along with a copy of your site's source code. The source code is currently necessary for introspecting your application's Datastore schema.
The utility searches for the SDK in the default locations for Windows and OS X. If you have changed the install path for the SDK, you will need to specify it using a command line parameter.
After completing these steps, you must configure a remote_api URI for your site as shown below.
Application ChangesSimply add the following to app.yaml in your project directory:
- url: /remote_api
login: adminIt is recommended that you use the URI shown above, however you can specify an alternative using a command-line parameter if you desire. This enables the remote_api handler built into AppEngine, which you can read more about here.
Creating A Sync ScriptIt is recommended that you create a batch file or UNIX shell script to save the settings used for launching the utility. This helps reduce the potential for human error cropping up (for example, missing a library path).
The main tasks required for this are picking a stable location for your site's source code, say, $HOME/src/my-site (or %USERPROFILE%\my-site for Windows users), and figuring out which modules in your source code contain Datastore model definitions. This will be easier if you use only a single file for declaring models.
Here is an example sync script written in bash. It will prompt for firstname.lastname@example.org's password at startup, or you can hard-code it using the -p option.
APPNAME=mysite # mysite.appspot.com
-L $SOURCE_DIR \
-m my_site.models \ # Equivalent to $SOURCEDIR/my_site/models.py
-m my_site.counter \
-a $APPNAME \
-e $EMAIL \
-N 5 \ # Start 5 worker threads.
--trigger-cmd="$TRIG" \ # Trigger indexing when a new member joins.
fetchStart Syncing$ ./my-sync-script.sh
INFO:root:Session done: 1 added, 0 updated.
INFO:root:Room done: 1 added, 0 updated.
INFO:root:Invitation done: 2 added, 0 updated.
INFO:root:Video done: 5 added, 0 updated.
INFO:root:Item done: 5 added, 0 updated.
INFO:root:Member done: 2 added, 0 updated.
INFO:root:grand total: 1.73s for 16 added, 0 updated in 13 models.
INFO:root:grand total: 0.97s for 0 added, 0 updated in 13 models.If all has gone well, you should now have a local SQLite copy of your Datastore!
Run Some Queries$ sqlite3 $HOME/data/my-site.sqlite3
SQLite version 3.5.7
Enter ".help" for instructions
sqlite> .mode column
sqlite> SELECT COUNT(*) FROM session;
sqlite> SELECT session_id, display_name, last_seen, last_address, expiry_time, first_seen FROM session;
session_id display_name last_seen last_address expiry_time first_seen
-------------------------------- ------------ ------------------- -------------- ----------- -------------------
0ed373e4a697d2fed5b7e6b00f462423 Anonymous 2009-02-24 15:02:45 188.8.131.52 2009-02-24 12:16:03Feed A Starving Hacker!