Django: Sanitize incoming HTML fragments with nh3

Artist’s impression of some chemical reaction involving ammonia.

Django projects fairly often need to store and serve arbitrary HTML fragments. These often come from forms with rich text editors (built on HTML’s contenteditable attribute).

It’s insecure to trust user-generated HTML fragments since they can contain naughty content like:

<script src=https://example.com/evil.js></script>

A page containing this content would execute the arbitrary code in evil.js, possibly stealing user details. This technique is a Cross-Site Scripting (XSS) attack. Whilst a strong Content Security Policy can reduce the possible effects of arbitrary content, it’s still best to “sanitize” incoming HTML fragments, allowing only safe content into your database. This way, there’s no chance of future changes allowing XSS attacks through.

For years, the Django community has relied on the Bleach package for HTML sanitization, either directly or via django-bleach. But in January this year, Will Kahn-Greene, the Bleach maintainer, announced it was deprecated. This move is due to the underlying HTML parser package, html5lib, going unmaintained.

Since 2021, there has been a new package for the task, nh3, created and maintained by Messense Lv. Playing off of “bleach”, it is named after the chemical formula for ammonia, and Ammonia is also the name of its underlying HTML parser package. Both are built in Rust, making nh3 about 20 times faster than the pure-Python Bleach. Ammonia copies a lot from Bleach, even parts of the API, hence its similar name.

Let’s look at how to use nh3 for HTML sanitization in Django forms. You can adapt this approach to other situations, such as in DRF serializers.

Make a little custom form field

It doesn’t take much to integrate nh3 into a form field. Here’s a forms.CharField subclass that sanitizes incoming HTML using nh3.clean():

import nh3
from django import forms


class HtmlSanitizedCharField(forms.CharField):
    def to_python(self, value):
        value = super().to_python(value)
        if value not in self.empty_values:
            value = nh3.clean(value)
        return value

Use it like a normal CharField:

class CommentForm(forms.Form):
    comment = HtmlSanitizedCharField()

Only allowed HTML tags or attributes will appear in the output:

In [2]: form = CommentForm({"comment":"<strong>hi</strong> <script src=evil.com></script>"})

In [3]: form.is_valid()
Out[3]: True

In [4]: form.cleaned_data['comment']
Out[4]: '<strong>hi</strong> '

Here are some quick tests to add to your project to ensure the sanitization continues to work:

from django.test import SimpleTestCase
from example.forms import HtmlSanitizedCharField


class HtmlSanitizedCharFieldTests(SimpleTestCase):
    field = HtmlSanitizedCharField()

    def test_empty(self):
        result = self.field.to_python("")
        assert result == ""

    def test_allowed_html(self):
        result = self.field.to_python("<strong>Arm</strong>")
        assert result == "<strong>Arm</strong>"

    def test_naughty_html(self):
        result = self.field.to_python(
            "<script src=example.com/evil.js></script><strong>Arm</strong>"
        )
        assert result == "<strong>Arm</strong>"

Customize cleaning with nh3.clean() arguments

The nh3 defaults (from Ammonia) are generally safe, allowing only general content tags and attributes, but they are still pretty wide, allowing 75 different tags. Allowing some tags on this default list, such as <article>, may lead to surprising results. In the worst case, an attacker may be able to craft official-looking content with evil instructions to users.

Typically, incoming HTML fragments come from a rich text editor on your site. In this case, use the arguments of nh3.clean() to limit allowed tags and attributes to only those your editor will create. This way, there won’t be any chance of surprise content.

For example, say your text editor only outputs <a>, <em>, <p>, or <strong> tags with certain attributes. You could modify the field to call nh3.clean() with its tags and attributes arguments like so:

import nh3
from django import forms


class HtmlSanitizedCharField(forms.CharField):
    def to_python(self, value):
        value = super().to_python(value)
        if value not in self.empty_values:
            value = nh3.clean(
                value,
                # Allow only tags and attributes from our rich text editor
                tags={
                    "a",
                    "em",
                    "p",
                    "strong",
                },
                attributes={
                    "a": {
                        "href",
                    },
                },
            )
        return value

Then the form would strip other tags, like <article>:

In [2]: form = CommentForm({"comment":"<article><em>hello</em></article>"})

In [3]: form.is_valid()
Out[3]: True

In [4]: form.cleaned_data['comment']
Out[4]: '<em>hello</em>'

nh3.clean() has several more arguments. See its documentation for full details. If you’re converting from Bleach, see Daniel Roy Greenfeld’s post for a set of arguments that match what Bleach does.

Make model forms automatically use the form field

Most Django forms use the ModelForm shortcut to generate their form fields automatically. You can switch fields to use HtmlSanitizedCharField by adding explicit field definitions, but this is tiresome across a whole project. Instead, switch the appropriate model fields to a subclass that overrides models.Field.formfield() to return HtmlSanitizedCharField. For example:

from django.db import models

from example import forms


class HtmlSanitizedTextField(models.TextField):
    def formfield(self, form_class=forms.HtmlSanitizedCharField, **kwargs):
        return super().formfield(form_class=form_class, **kwargs)

Then, swap model fields to use it:

class Comment(models.Model):
    comment = HtmlSanitizedTextField()

This change will require a database migration, but that migration won’t run any SQL because the underlying database definition will be the same.

Then, model forms built from those fields will automatically use the forms.HtmlSanitizedCharField class:

In [3]: class CommentForm(forms.ModelForm):
   ...:     class Meta:
   ...:         model = Comment
   ...:         fields = ("comment",)
   ...:

In [4]: CommentForm().fields['comment']
Out[4]: <example.forms.HtmlSanitizedCharField at 0x1069f4100>

Again, here’s a quick test to verify this behaviour:

from django.test import SimpleTestCase
from example import forms
from example.models import HtmlSanitizedTextField


class HtmlSanitizedTextFieldTests(SimpleTestCase):
    def test_formfield(self):
        field = HtmlSanitizedTextField()
        form_field = field.formfield()
        assert isinstance(form_field, forms.HtmlSanitizedCharField)

This protection only applies to forms

We only covered a sanitizing form field in this post. Data that enters your system from other sources, such as an API endpoint or remote import task, won’t be protected.

Fix this by adding calls to nh3.clean() in the appropriate places. Or you can consider making an advanced model field that calls nh3.clean() before saving to the database. The django-nh3 package (alpha release at time of writing) may help as it provides such a field.

Fin

Thanks to Messense for creating and maintaining nh3. And thanks to Rupert Baker of SharedGoals for asking me to look at migrating from Bleach to nh3.

May all your HTML fragments be sanitized,

Adam

