Moving data between environments

Overview

Moving data, whether to other environments or to development machines, has traditionally been a problem for us. It is no longer a technical problem.

The Jenkins job which moves the data is scheduled to run automatically against staging every Sunday. This way staging data stays close to production. The job can also be run manually against preview or staging, or just to create a clean dump shared on S3.
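For the curious, the weekly trigger is just a Jenkins "Build periodically" cron spec on the job. The actual time lives in the job config; a Sunday schedule would look something like this (the time here is hypothetical, and the `H` token spreads the start minute so scheduled jobs don't all pile up at once):

```text
# Build periodically -> Schedule (hypothetical time; check the real job config)
# minute hour day-of-month month day-of-week (0 = Sunday)
H 4 * * 0
```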

Check it out.

Moving data from production to preview or staging

How do I do it?

Moving data is easy as.

  • Visit our continuous integration server
  • Pick a TARGET (i.e. the database or S3 bucket to fill with new data)
    • Valid targets are preview, staging, s3
  • Find the “Clean and apply database dump - TARGET” job
  • Click “Build Now”
  • Wait for like 25 minutes
    • (Buy more biscuits if there are none left)
  • You should now see a sanitized version of production on the target stage and a new dump in the S3 bucket.
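If you'd rather not click through the UI, Jenkins jobs can also be triggered over its remote access API. This is a hedged sketch: the server URL is an assumption (substitute your own instance), and the exact job name should be checked against Jenkins.

```shell
#!/bin/sh
# Sketch: trigger the job from the command line rather than the Jenkins UI.
# The server URL is an assumption -- substitute your own Jenkins instance.
JENKINS_URL="${JENKINS_URL:-https://ci.example.com}"

build_url() {
  # Jenkins encodes spaces in job names as %20
  echo "${JENKINS_URL}/job/Clean%20and%20apply%20database%20dump%20-%20${1}/build"
}

# To actually fire a build you need an API token:
#   curl -X POST -u "$USER:$API_TOKEN" "$(build_url staging)"
build_url staging
```

The `curl` line is the standard Jenkins remote-trigger shape; you still need valid credentials and (for parameterised jobs) `buildWithParameters` instead of `build`.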

What is actually happening?

It’s a bit of a fafflafel, but here’s what happens, step by step, whenever you click “Build Now” (as seen above).

  • The status page for the target stage (if preview or staging) is requested and the API’s current Alembic version is stored for later use.
  • A new Docker container is run on Jenkins from a Postgres base image. This is where we clean up the production data.
  • Jenkins checks our database-backups S3 bucket for the latest production database dump (a new one is created every morning) and downloads it.
  • The dump is then decrypted with GPG using the private key in the credentials repo, unzipped, and then imported into the Docker container.
  • The database in the container is then cleaned using a series of SQL queries that purge personal and confidential data. This step also generates some pseudorandomly varied data to replace part of what was removed, giving test environments a more realistic, diverse dataset.
  • The cleaned database is then queried for its alembic version. This is stored for later.
  • Jenkins then uses pg_dump to create a dump of this cleaned database, zips it, and then saves it to disk.
  • If the TARGET is s3 then the script skips all the following steps up until the point where it uploads to S3.
  • If the TARGET is preview or staging, Jenkins targets the appropriate Cloud Foundry space, fetches the credentials for the Postgres service backing that stage’s database, creates an SSH tunnel to it, and applies the cleaned dump with psql. The dumps are created with the --clean flag, so they drop existing database objects before recreating them. The cleaned dump is then synced to S3 in the same way as mentioned above and deleted from Jenkins.
  • There is a chance that the target stage has more recent migrations than production. Jenkins compares the Alembic versions stored in the previous steps and, if they do not match, runs the database-migration-paas job against the target stage.
  • Next, as there is new data in the target stage’s database, the services and briefs need to be reindexed. Jenkins uses the existing index-<briefs | services>-<preview | staging> jobs for this.
  • Because new indexes have been created, the aliases in use need to be moved. Currently we have only one alias per framework type, i.e. g-cloud and briefs-digital-outcomes-and-specialists. Jenkins uses the update-<preview | staging>-index-alias job to point each alias at the new index.
  • Jenkins then syncs the zipped dump with an S3 bucket in the digitalmarketplace-development account, deletes its local dump, and we’re pretty much done. The sync process removes any old files so the directory doesn’t become massive.
  • The final step is to clean up. The Docker container and its volume are destroyed, and Jenkins logs out from Cloud Foundry.
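The steps above can be sketched as a shell script. Treat this as a hedged, dry-run sketch only: the `run` helper prints each command instead of executing it, and the bucket names, file names, `trigger-job` helper, and database identifiers are assumptions standing in for the real Jenkins job configuration.

```shell
#!/bin/sh
# Dry-run sketch of the "Clean and apply database dump - TARGET" job.
# run() prints each command instead of executing it, so the sequence can be
# read (or run) safely without Jenkins. Names below are assumptions.
run() { echo "+ $*"; }

dry_run() {
  TARGET="$1"
  STAMP="${STAMP:-$(date +%F)}"

  # 1. Fetch and decrypt the latest nightly production dump
  run aws s3 cp s3://database-backups/latest-production.sql.gz.gpg .
  run gpg --decrypt --output production.sql.gz latest-production.sql.gz.gpg
  run gunzip production.sql.gz

  # 2. Load it into a throwaway Postgres container, then purge/pseudonymise
  run docker run -d --name dump-cleaner postgres
  run psql -h localhost -f production.sql digitalmarketplace
  run psql -h localhost -f clean.sql digitalmarketplace

  # 3. Dump the cleaned database; --clean makes the dump drop objects on restore
  run pg_dump --clean -f "cleaned-${TARGET}-${STAMP}.sql" digitalmarketplace

  if [ "$TARGET" != s3 ]; then
    # 4. Apply to the stage over an SSH tunnel, then migrate/reindex as needed
    run cf target -s "$TARGET"
    run psql -f "cleaned-${TARGET}-${STAMP}.sql" "${TARGET}-database"
    run trigger-job database-migration-paas      # hypothetical helper; only runs
    run trigger-job "index-services-${TARGET}"   # if the alembic versions differ
  fi

  # 5. Share the zipped dump and tidy up
  run gzip "cleaned-${TARGET}-${STAMP}.sql"
  run aws s3 sync . s3://digitalmarketplace-cleaned-db-dumps/ --delete
  run docker rm -f -v dump-cleaner
  run cf logout
}
```

Calling `dry_run staging` just prints the command sequence; the real job wires in credentials, tunnels, and error handling that this sketch leaves out.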

After all of this is complete, we should have filled one of our non-production databases with cleaned data and saved the dump in a shared place where other developers can access it. Pretty rad.

Importing data for developers

So it’s all well and good wiping out everything in the preview setup, but what about when we want to blow up everything in our own databases? How do we do that?

Luckily, this need has been anticipated and it’s actually fairly simple.

  • Visit the “digitalmarketplace-cleaned-db-dumps” bucket in S3.
  • Download cleaned-<stage>-<date>.sql.gz
    • This was created the last time data was moved, so if you want a more up-to-date version run the “Clean and apply database dump” job with TARGET set to s3.
  • gzcat ./cleaned-<stage>-<date>.sql.gz | psql -f - digitalmarketplace (adjust for the target database name and the specific path of the downloaded file)
  • You may need to run migrations locally. If so, run make run-migrations from your local api repo.
  • Reindex the new data so it appears in search results using digitalmarketplace-scripts/scripts/index-to-search-service.py.
  • Boom. Done.
    • Have one of those biscuits you bought earlier.
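The developer steps above roll up into a small script. Again a hedged sketch: the database name and repo paths are assumptions to adjust for your setup, and with DRY_RUN=1 it only prints what it would do.

```shell
#!/bin/sh
# Sketch of the local-import steps. The database name and repo paths are
# assumptions; set DRY_RUN=1 to print the plan instead of executing it.
maybe() { if [ "${DRY_RUN:-}" = 1 ]; then echo "+ $*"; else "$@"; fi; }

import_dump() {
  dump="$1"                              # e.g. cleaned-staging-<date>.sql.gz
  db="${DB:-digitalmarketplace}"

  maybe aws s3 cp "s3://digitalmarketplace-cleaned-db-dumps/${dump}" .
  # gunzip -c is the portable spelling of gzcat
  maybe sh -c "gunzip -c ./${dump} | psql ${db}"
  maybe make -C ../digitalmarketplace-api run-migrations
  # see the script's --help for the arguments it needs
  maybe python ../digitalmarketplace-scripts/scripts/index-to-search-service.py
}
```

Run it with DRY_RUN=1 first to sanity-check the plan before letting it loose on your local database.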