How to implement a “dry run mode” for data imports in Django

2022-10-13

In data import processes it’s often useful to have a “dry run” mode that runs through the process but doesn’t actually save the data. This lets you check for validity and gather statistics, such as how many records already exist in the database. In this post, we’ll look at how to implement a dry run mode in Django by using a database transaction and rolling it back.

Example use case

Take these models:

from django.db import models


class Author(models.Model):
    name = models.TextField()

    class Meta:
        constraints = [
            models.UniqueConstraint(
                name="%(app_label)s_%(class)s_name_unique",
                fields=["name"],
            )
        ]


class Book(models.Model):
    title = models.TextField()
    author = models.ForeignKey(Author, on_delete=models.DO_NOTHING)

    class Meta:
        constraints = [
            models.UniqueConstraint(
                name="%(app_label)s_%(class)s_title_author_unique",
                fields=["title", "author"],
            )
        ]

Imagine you have the mission to write a management command to import data from a CSV into Author and Book. The CSV lists book titles alongside author names:

Title,Author
The Very Hungry Caterpillar,Eric Carle
A Message to Garcia,Elbert Hubbard
To Kill a Mockingbird,Harper Lee

The import process should avoid creating duplicate authors and books that already exist in the database. After it’s done, it should list the number of books and authors imported. In dry run mode, the command should still output the counts, but not actually add anything to the database.

Alrighty then.

Let’s jump ahead a bit. Here’s a first version of the import process, without a dry run mode:

import argparse
import csv
from contextlib import closing
from io import TextIOWrapper
from typing import Any

from django.core.management.base import BaseCommand
from django.db.transaction import atomic

from example.core.models import Author, Book


class Command(BaseCommand):
    help = "Import books from a CSV"

    def add_arguments(self, parser: argparse.ArgumentParser) -> None:
        parser.add_argument("file", type=argparse.FileType())

    def handle(self, *args: Any, file: TextIOWrapper, **kwargs: Any) -> None:
        books_created = 0
        authors_created = 0

        with closing(file), atomic():
            csvfile = csv.reader(file)

            header = next(csvfile)
            if header != ["Title", "Author"]:
                self.stdout.write("Unexpected header row, should be “Title,Author”")
                raise SystemExit(1)

            for title, author_name in csvfile:
                author, created = Author.objects.get_or_create(name=author_name)
                if created:
                    authors_created += 1

                book, created = Book.objects.get_or_create(
                    title=title,
                    author=author,
                )
                if created:
                    books_created += 1

        self.stdout.write(
            f"Created {books_created} books and {authors_created} authors."
        )

Some quick notes:

argparse.FileType conveniently opens the file for you, or fails with a nice error message if the path does not exist.
closing() closes the file after the with block is complete. It’s always a good idea to close files after you’re done with them, as operating systems enforce a limit on the number of open files.
atomic() wraps the import process in a transaction, so that if an error occurs, data will not be partially imported. We’ll be looking more at this in the following sections.
Django’s ORM method get_or_create() prevents the creation of duplicate rows.
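
If you want to poke at the file-handling part without Django or a real file, here is a minimal sketch using only the standard library. The in-memory StringIO is an assumption standing in for the file that argparse.FileType would open, and the rows are borrowed from the sample CSV above.

```python
import csv
from contextlib import closing
from io import StringIO

# In-memory stand-in for the file object argparse.FileType would provide.
file = StringIO(
    "Title,Author\n"
    "The Very Hungry Caterpillar,Eric Carle\n"
    "To Kill a Mockingbird,Harper Lee\n"
)

with closing(file):
    csvfile = csv.reader(file)
    header = next(csvfile)  # the first row is the header
    rows = list(csvfile)  # the remaining rows are [title, author] pairs

print(header)  # ['Title', 'Author']
print(len(rows))  # 2
print(file.closed)  # True, because closing() called file.close() on exit
```

The same closing() call works for any object with a close() method, which is why it pairs nicely with the already-open file from argparse.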

Cool beans. Now let’s add a dry run mode!

Default on or default off?

There’s the question of whether to enable dry run mode by default, or provide it as an optional extra. The answer is the old chestnut “it depends”.

It may be that you want users to always use “dry run” and inspect the results before committing to the import. In this case, it’s best to default dry run mode on, with a flag to actually write to the database.

But in other cases, the dry run mode may only be useful occasionally, such as for infrequent large data imports. In this case, you can make dry run optional, with an extra flag.

For our example, let’s go with making dry run mode the default.

You can add a --write flag to the command with an extra add_argument():

def add_arguments(self, parser: argparse.ArgumentParser) -> None:
    parser.add_argument(
        "--write",
        action="store_true",
        help="Actually edit the database",
    )
    parser.add_argument("file", type=argparse.FileType())

The parsed arguments will then include a boolean called write that is True only if the flag is provided. You can accept this in the handle() signature like so:

def handle(self, *args: Any, file: TextIOWrapper, write: bool, **kwargs: Any) -> None:
    ...
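
To see how a store_true flag behaves on its own, here is a small sketch with a stand-alone argparse parser, outside of Django. The file name books.csv is a placeholder, and the file argument is a plain string here (rather than argparse.FileType()) so that nothing actually gets opened.

```python
import argparse

# Stand-alone parser mirroring the command's arguments.
parser = argparse.ArgumentParser()
parser.add_argument(
    "--write",
    action="store_true",
    help="Actually edit the database",
)
parser.add_argument("file")  # plain string, to avoid opening a real file

dry = parser.parse_args(["books.csv"])
wet = parser.parse_args(["--write", "books.csv"])

print(dry.write)  # False: flag omitted, so dry run mode
print(wet.write)  # True: flag passed, so changes would be committed
```

Because store_true defaults to False, forgetting the flag leaves you in the safe dry run mode.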

Now let’s see how to implement the actual dry run mode.

Dry run by rollback

A database transaction groups together several changes into an all-or-nothing atomic operation. If the transaction commits, all those changes are applied at once. If the transaction rolls back, all those changes are discarded.

You can take advantage of this behaviour to implement a dry run mode. When performing a dry run, wrap a transaction around the changes, and roll it back.
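
The all-or-nothing behaviour is easy to see outside of Django too. Here is a minimal sketch using Python’s built-in sqlite3 module with an in-memory database. The author table and the RuntimeError are made up for illustration; a sqlite3 connection used as a context manager commits on success and rolls back when the block raises.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE author (name TEXT UNIQUE)")
conn.commit()

# This block exits normally, so its transaction is committed.
with conn:
    conn.execute("INSERT INTO author (name) VALUES (?)", ("Eric Carle",))

# This block raises, so its transaction is rolled back.
try:
    with conn:
        conn.execute("INSERT INTO author (name) VALUES (?)", ("Harper Lee",))
        raise RuntimeError("abort the transaction")
except RuntimeError:
    pass

names = [row[0] for row in conn.execute("SELECT name FROM author ORDER BY name")]
print(names)  # ['Eric Carle']
```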

The example above already wraps the process in a transaction with Django’s atomic(). This context manager starts a transaction when entered. If the block exits without an exception being raised, it commits the transaction. If an exception is raised, the transaction rolls back.

So, to implement dry run mode, you can raise an exception, which gets atomic() to roll back the transaction. Then you can catch the exception outside of the atomic() block, in order to continue normal processing. A simple way of adapting the previous example to use this technique would look like:

try:
    with closing(file), atomic():
        csvfile = csv.reader(file)

        # The rest of the import process
        ...

        if not write:  # dry run mode
            raise DoRollback()
except DoRollback:
    pass

Where DoRollback is a custom exception class. The code raises the exception when dry run mode is active (that is, when write is False), then catches it outside of the atomic() block.

This is a perfectly serviceable technique, but it tangles up the code a bit. It adds several extra lines to the import process, and another level of indentation. Let’s tidy it up by extracting the atomic() with rollback into a separate context manager.

contextlib.contextmanager() is a decorator that makes it easy to create a context manager. Rather than writing a context manager class, you write a generator function with a single yield. On entry, the function runs up until the yield. On exit, code will continue at the yield, raising the current exception if the block raised one.
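
To make that control flow concrete, here is a tiny sketch that records the order of events. The traced() function and the events list are hypothetical names invented for this illustration.

```python
from contextlib import contextmanager

events = []


@contextmanager
def traced():
    events.append("enter")  # runs on entry, up until the yield
    try:
        yield
    except ValueError:
        # The exception from the with block is re-raised at the yield.
        # Catching it here means the with statement exits normally.
        events.append("caught")
    finally:
        events.append("exit")


with traced():
    events.append("body")
    raise ValueError("boom")

print(events)  # ['enter', 'body', 'caught', 'exit']
```

Note that the ValueError never escapes the with statement, since the generator caught it and then finished normally.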

You can extract the atomic()-with-rollback pattern into a @contextmanager function like so:

from collections.abc import Generator
from contextlib import contextmanager


class DoRollback(Exception):
    pass


@contextmanager
def rollback_atomic() -> Generator[None, None, None]:
    try:
        with atomic():
            yield
            raise DoRollback()
    except DoRollback:
        pass

You can then use this context manager to create a “dry run block” wherever needed:

with rollback_atomic():
    # Make changes to the database that will always be rolled back
    ...

Sweet as.
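
If you’d like to try the pattern without a Django project, here is a rough analogue built on the standard library’s sqlite3 module instead of atomic(). The rollback_sqlite() and count_books() helpers and the book table are hypothetical, made up for this sketch. The key point it shows is that counters updated inside the block survive, while the database changes do not.

```python
import sqlite3
from collections.abc import Generator
from contextlib import contextmanager


class DoRollback(Exception):
    pass


@contextmanager
def rollback_sqlite(conn: sqlite3.Connection) -> Generator[None, None, None]:
    # Hypothetical sqlite3 stand-in for rollback_atomic().
    try:
        with conn:  # commits on success, rolls back on exception
            yield
            raise DoRollback()  # always force the rollback
    except DoRollback:
        pass


def count_books(conn: sqlite3.Connection) -> int:
    return conn.execute("SELECT COUNT(*) FROM book").fetchone()[0]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE book (title TEXT)")
conn.commit()

books_created = 0
with rollback_sqlite(conn):
    conn.execute("INSERT INTO book (title) VALUES (?)", ("To Kill a Mockingbird",))
    books_created += 1
    inside = count_books(conn)  # the row is visible inside the transaction

after = count_books(conn)  # the row is gone after the rollback
print(books_created, inside, after)  # 1 1 0
```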

Okay, so how about in the example command? The changes should be committed if write is True, otherwise it should roll back. You can encode this logic by picking the appropriate context manager before the with statement:

if write:
    atomic_context = atomic()
else:
    atomic_context = rollback_atomic()

with closing(file), atomic_context:
    ...

Nice.

Putting it all together, the command with dry-run-by-default looks like:

import argparse
import csv
from collections.abc import Generator
from contextlib import contextmanager, closing
from io import TextIOWrapper
from typing import Any

from django.core.management.base import BaseCommand
from django.db.transaction import atomic

from example.core.models import Author, Book


class Command(BaseCommand):
    help = "Import books from a CSV"

    def add_arguments(self, parser: argparse.ArgumentParser) -> None:
        parser.add_argument(
            "--write",
            action="store_true",
            default=False,
            help="Actually edit the database",
        )
        parser.add_argument("file", type=argparse.FileType())

    def handle(
        self, *args: Any, file: TextIOWrapper, write: bool, **kwargs: Any
    ) -> None:
        if not write:
            self.stdout.write("In dry run mode (--write not passed)")

        books_created = 0
        authors_created = 0

        if write:
            atomic_context = atomic()
        else:
            atomic_context = rollback_atomic()

        with closing(file), atomic_context:
            csvfile = csv.reader(file)

            header = next(csvfile)
            if header != ["Title", "Author"]:
                self.stdout.write("Unexpected header row, should be “Title,Author”")
                raise SystemExit(1)

            for title, author_name in csvfile:
                author, created = Author.objects.get_or_create(name=author_name)
                if created:
                    authors_created += 1

                book, created = Book.objects.get_or_create(
                    title=title,
                    author=author,
                )
                if created:
                    books_created += 1

        if write:
            prefix = "Created"
        else:
            prefix = "Would create"
        self.stdout.write(
            f"{prefix} {books_created} books and {authors_created} authors."
        )


class DoRollback(Exception):
    pass


@contextmanager
def rollback_atomic() -> Generator[None, None, None]:
    try:
        with atomic():
            yield
            raise DoRollback()
    except DoRollback:
        pass

Brilliant.

Fin

May your dry run processes be ever easier,

—Adam
