Decoupling database migrations from server startup: why and how

If you’re using a schema management tool like Django ORM or Alembic, you need to run the migration at some point. And it’s tempting to run the migration as part of application startup—when you’re using Docker, for instance, you’ll have an entrypoint that will first run the migration and then start the application.

But coupling migrations with server startup can lead to quite a few problems, from database corruption to downtime, especially as you start to scale and get more users.

So in this article I’ll explain why coupling is a problem, and what you can do instead.

Migrations on application startup are a bad idea

When you’re developing a new application, doing the database migration on application startup is natural. As you’re developing new features, you make some changes to your schema and code, restart your test server, and immediately get the latest schema and code in sync.

But development environments are different than production, and that can lead to problems.

Problem #1: Parallel migrations lead to breakage

In dev environments you typically spin up one server, but in production you’re likely spinning up more than one. So now instead of one process doing schema migration, you have multiple processes trying to do multiple identical schema migrations at the same time.

Depending on your database, the migration tool you’re using, and the kind of migration you’re doing, parallel schema upgrades might break your database in a variety of ways.

You don’t want a broken database!

Problem #2: Mental coupling leads to downtime

If you always do schema upgrades as part of the application startup you also end up mentally coupling schema migrations and code upgrades. In particular, you’ll starting assuming that you only ever have new code running with the latest schema.

Why is that assumption a problem? From most to least common:

  1. Sometimes you need to rollback a broken code upgrade. If you assume you always have new code with a new schema, you can end up in a situation where your new code is broken, but you can’t easily rollback to older code because you’ve done an irreversible schema change.
  2. To minimize downtime on upgrades, you want to have a brief moment where both old and new versions of your application are running in parallel. If your schema migration breaks old code, you can’t do that.
  3. To catch bugs in new code, you might want to do a canary deploy. That is, upgrade only one or two of your many processes and see if they break.

The alternative: decoupling schema migrations from code upgrades

You want schema migration to happen separately than code upgrades:

  1. So you can run a single migration, without worries about parallelism.
  2. Optionally, to reduce downtime risks, so you can run old code with the new schema.

The first requirement can be addressed by making the schema migration a different command—we’ll talk about implementing that for Docker later on. Or you can use a migration tool that uses locking to ensure parallelism doesn’t happen, e.g. Flyway.

Zero-downtime schema migrations

The second requirement requires more work, and so you might choose not to do it for applications with less onerous uptime requirements. Essentially, you want to structure your schema migrations to be purely additive, at least in the short term.

For example, instead of changing a column type, you add a new column, and use triggers to make sure the two variants stay in sync. Then once you’re sure you’ll never have to go back to the old code, you run a further schema migration (again, without having to upgrade the code) to clear out the no longer needed column.

The general process is:

  1. Migrate from schema S to schema S+1, with only additive changes.
  2. Over time upgrade some of your processes from application version V to V+1.
  3. Eventually everything is on V+1, and you don’t ever expect to rollback to V.
  4. Finally, migrate from schema S+1 to S+2, and now you can do destructive schema changes to anything that V+1 no longer uses.

And of course you still need to do some work to deal with database locks and the like. Here’s an overview for PostgreSQL, as one starting point.

Decoupling with Docker

If you’re packaging your application with Docker, you don’t need to build separate images for the schema migration and application. Instead, you can build a single image that:

  1. By default runs the application.
  2. Optionally does the schema migration before application startup, which is useful for development and testing with the image.
  3. Can be told to run the migration separately from application startup.

Note: Outside any specific best practice being demonstrated, the Dockerfiles in this article are not examples of best practices, since the added complexity would obscure the main point of the article.

Python on Docker Production Handbook Need to ship quickly, and don’t have time to figure out every detail on your own? Read the concise, action-oriented Python on Docker Production Handbook.

First, create a Dockerfile that by default runs your entrypoint script:

FROM python:3.9-slim-bullseye
COPY . .

ENTRYPOINT []
# By default run entrypoint.sh, but if command-line arguments
# are given run those instead:
CMD ["./entrypoint.sh"]

The entrypoint script only runs the database migration if you set an environment variable:

#!/bin/bash
set -euo pipefail

if [ -v DEV_MODE ]; then
    python migrate_schema.py
fi

exec python server.py

So now by default the migration doesn’t happen:

$ docker run --rm migration-demo
Running server...

If you set the DEV_MODE environment variable you get both migration and application startup:

$ docker run --rm -e DEV_MODE=1 migration-demo
Migrating schema...
...done!
Running server...

And you can run migrations manually, separate from running the server:

$ docker run --rm migration-demo python migrate_schema.py
Migrating schema...
...done!
$

Dev is not production

In this case we were focusing on database migrations, but there’s a broader point: the configuration and usage patterns that work well in a development and test environment aren’t always those that work well in production.

Be on the lookout for other situations where you need to differentiate between the two.