Database backups

How database backups are created

Backups are generated by a Jenkins job at 4am every morning. A dump is taken from the production database, which is then gzipped, encrypted and uploaded to a replicated S3 bucket, digitalmarketplace-database-backups, for storage. This is done in the following steps:

  • Jenkins creates a unique file name for the dump in the format <stage>-yyyyMMddHHmm.sql.gz.gpg
  • Jenkins deploys a worker app, db-backup, to the PaaS using the deploy-db-backup-app Makefile target in the digitalmarketplace-aws repo. This app has the scripts required to create and upload the dump baked into its Docker image (see Digital Marketplace Docker Hub).
  • The following environment variables are required by the app manifest:
    - DUMP_FILE_NAME - the unique filename generated by Jenkins (passed as an argument to deploy-db-backup-app)
    - S3_POST_URL_DATA - a signed URL and extra data for POSTing the dump to S3 (explained below)
    - RECIPIENT - used for encryption with GPG; it identifies which public key to encrypt with
  • An additional variable, PUBKEY (the public key used for encryption), is set after the app has spun up.
  • The db-backup app then starts a task container which executes create-db-dump.sh. This container has its own disk and memory quotas, needed to handle the large file size.
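
Under the hood that final step is an ordinary Cloud Foundry task with its own quotas. A minimal sketch of the equivalent CF CLI (v7) invocation, with assumed quota values:

    # Run the backup script as a one-off task with dedicated disk and memory
    cf run-task db-backup --command "./create-db-dump.sh" -k 4G -m 2G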

What the backup script does

The create-db-dump.sh script first imports PUBKEY into GPG2. It then connects to the database instance in the PaaS and uses pg_dump to create a plaintext dump with no owner and no access control list. The dump is streamed to gzip and then straight to GPG2 for encryption before being written to disk.
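
In outline, the pipeline looks something like this (a sketch rather than the actual script; the connection string variable is an assumption):

    # Import the public key so GPG can encrypt for it
    echo "$PUBKEY" | gpg2 --import

    # Stream the dump through gzip and straight into GPG, so the
    # plaintext is never written to disk
    pg_dump "$DATABASE_URL" --no-owner --no-acl \
      | gzip \
      | gpg2 --encrypt --recipient "$RECIPIENT" --trust-model always \
      > "$DUMP_FILE_NAME"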

Next, a Python script, upload-dump-to-s3.py, is executed to upload the dump to S3. It uses S3_POST_URL_DATA (the signed S3 URL data generated earlier) and will return an error if the upload fails.

Next, Jenkins checks that the new encrypted dump in S3 can be decrypted, to ensure that the private key used to decrypt the dumps is the correct counterpart of the public key used to encrypt them. Without this check, if the private key were rotated but the public key wasn’t for some reason, we wouldn’t find out until it was too late. Jenkins uses a script called check-db-dump-is-decryptable.sh.

The decrypt script downloads the new dump from S3, then decrypts the private GPG key from the credentials repo and imports it. GPG then executes a --list-packets command on the dump. We don’t actually care about the packets, but the command needs the correct private key to succeed, which lets us test decryption without actually having to decrypt the dump. Finally, the script deletes the secret key as well as the downloaded dump.
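
The core of the check looks roughly like this (a sketch, not the actual script; it assumes the dump and the decrypted secret key are already on disk, and the file names are placeholders):

    # Import the private key, prove the dump can be read with it, then clean up
    gpg2 --import <path-to-unencrypted-secret-key>
    gpg2 --list-packets <downloaded-dump>.sql.gz.gpg > /dev/null
    gpg2 --batch --yes --delete-secret-keys 'Digital Marketplace DB backups'
    rm <downloaded-dump>.sql.gz.gpg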

Finally, Jenkins alerts Slack with either a success or failure message and deletes the db-backup app from the PaaS.

The S3 buckets

The bucket where gzipped and encrypted dumps are stored in the first instance is called digitalmarketplace-database-backups and is in the Digital Marketplace Backups AWS account.

This bucket has cross-region replication enabled and will replicate all new objects to another bucket, digitalmarketplace-cross-region-database-backups, in the eu-west-2 (London) region.
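
To confirm that replication is still configured, the rules can be inspected with the AWS CLI (assuming credentials for the backups account):

    aws s3api get-bucket-replication --bucket digitalmarketplace-database-backups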

The buckets are accessible to one group and the Jenkins role. The group is called ‘backups’ and contains the users currently in the production_infrastructure group, which means that users on 2nd line support as well as permanent admins are able to GET the backup files. The Jenkins role only has permission to PUT and GET on the bucket, to prevent deletion of dumps.

The buckets sit in the digitalmarketplace-backups account which can only be accessed using a password reset.

The backups in digitalmarketplace-database-backups and digitalmarketplace-cross-region-database-backups are retained for 180 and 7 days respectively.
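
Retention like this is normally enforced by S3 lifecycle rules, which can likewise be checked from the CLI:

    aws s3api get-bucket-lifecycle-configuration --bucket digitalmarketplace-database-backups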

Signed S3 URLs

The ‘S3_POST_URL_DATA’ is generated by a script in the AWS repo called generate-s3-post-url-data.py. It needs to be executed by an AWS entity with the correct rights to upload to the S3 Jenkins bucket; in our case this is the Jenkins role assumed by the Jenkins server. The signed URL can then be used by an entity with no permissions on the bucket.
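
For illustration, a presigned POST consists of a URL plus form fields (policy, signature and so on) that must be sent alongside the file. Consuming it might look like this with curl; the exact field names come from the generated policy and are assumptions here:

    # The file field must come last; the earlier fields carry the
    # policy and signature produced by generate-s3-post-url-data.py
    curl --fail "$S3_URL" \
        -F key="$DUMP_FILE_NAME" \
        -F policy="$POLICY" \
        -F x-amz-signature="$SIGNATURE" \
        -F file=@"$DUMP_FILE_NAME"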

GPG

The dumps are encrypted with GPG2. The public and private keys are kept in the digitalmarketplace-credentials repo: the private key is encrypted with SOPS in the usual way, while the public key is unencrypted. The private key has a passphrase which is required to use it; this is also in the credentials repo and is also encrypted with SOPS.

The keys use RSA 4096.
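
If an equivalent key pair ever had to be regenerated, modern GnuPG can do so in a few lines (a sketch only; follow the process in the credentials repo for real rotations):

    # Generate an RSA 4096 key pair and export both halves
    gpg2 --quick-generate-key 'Digital Marketplace DB backups' rsa4096
    gpg2 --armor --export 'Digital Marketplace DB backups' > public.key
    gpg2 --armor --export-secret-keys 'Digital Marketplace DB backups' > secret.key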

Restoring from a backup

There is no automatic process for restoring the production database from one of the dumps. If we’re in a situation where that needs to happen it’s probably quite serious, and the restore should be done manually. The steps will be similar to the below:

  • Alert the team on the #dm-release Slack channel, and grab the deploy gorilla.

  • Disable writes to the database, either by putting the site into Maintenance mode (preferred option) or stopping the API app in PaaS (only as a last resort). Consider disabling smoke/smoulder tests.

  • Download the required database dump file (usually the most recent) from the digitalmarketplace-database-backups S3 bucket. Members of the admins group and the production_infrastructure group will be able to do this via the GUI console by assuming the backups role in the digitalmarketplace-backups account. If no admin is available, you can ssh into Jenkins and download it via the CLI. The command will be:

    aws s3 cp s3://digitalmarketplace-database-backups/<filename-on-s3> <path-to-local-destination>
    
  • The dump will need to be decrypted and unzipped, using GPG and gzip. (Mac users: install via brew install gpg).

  • Use SOPS to decrypt the secret key from digitalmarketplace-credentials/gpg/database-backups/secret.key.enc and save it to a file:

    ./sops-wrapper -d gpg/database-backups/secret.key.enc > <path-to-unencrypted-secret-key>
    
  • Import the secret key:

    gpg --import <path-to-unencrypted-secret-key>
    
  • You will be asked for a passphrase for the secret key. It can be decrypted with SOPS from digitalmarketplace-credentials/gpg/database-backups/secret-key-passphrase.txt.enc.
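
    Using the same sops-wrapper as above, from the credentials repo:

    ./sops-wrapper -d gpg/database-backups/secret-key-passphrase.txt.enc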

  • Decrypt the database dump file and unzip. You’ll be asked for the passphrase again:

    gpg --decrypt <path-to-encrypted-dump> | gunzip > <path-to-local-destination>
    
  • Delete the secret key from your GPG keyring:

    gpg --delete-secret-keys 'Digital Marketplace DB backups'
    
  • Delete the decrypted secret key file:

    rm <path-to-unencrypted-secret-key>
    
  • The PaaS database has an event trigger which invokes a function to reassign the owner of an object. The function checks whether the current user has a particular role; if that role doesn’t exist in the database it causes an error, so removing the trigger prevents the function from being executed. The trigger has already been removed from the cleaned database dump that is used by developers, but it is still present in the production dump.

  • Open the database dump file (use an editor that can handle an extremely large file, such as vi) and remove the following lines:

    CREATE EVENT TRIGGER reassign_owned ON ddl_command_end
    EXECUTE PROCEDURE public.reassign_owned();
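
    If hand-editing a very large file is impractical, the same two lines can be stripped with GNU sed instead (a sketch; test the match on a copy first - BSD/macOS sed does not support the +1 address):

    sed -i '/CREATE EVENT TRIGGER reassign_owned/,+1d' <path-to-decrypted-dump-file>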
    
  • Ensure that you’re logged in to Cloud Foundry and are in the production space (if that’s where you’re restoring to):

    cf target -s production
    
  • Use the populate-paas-db.sh script in the digitalmarketplace-aws repo to import the decrypted database dump to the database. PaaS will prompt for a password (this will have been copied to your clipboard by the script - it’s as easy as Ctrl+V!):

    ./digitalmarketplace-aws/scripts/populate-paas-db.sh <path-to-decrypted-dump-file>
    
  • The restore script will take at least 15 minutes to run. If you need a cup of tea, now is the time.

  • Test that the data has restored correctly (https://dm-api-production.cloudapps.digital should respond even during maintenance mode).
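
    For a quick smoke check from the command line (-i prints the response status):

    curl -si https://dm-api-production.cloudapps.digital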

  • Delete the unencrypted database dump file from your local machine.

  • Let stakeholders know that the restore has been completed.

  • Ensure the team has a plan for reconciling any lost data, and how this will be communicated to users.

  • Toggle maintenance mode to ‘recovery’ to restore access to the API apps only.

  • Re-sync the Elasticsearch indices for services and briefs, using the Jenkins catchup jobs.

  • Toggle maintenance mode to ‘live’ to restore access to the Frontend apps.