Using Markdown in Django

Manage Content in Django Sites


As developers, we rely on static analysis tools to check, lint and transform our code. We use these tools to help us be more productive and produce better code. However, when we write content using markdown the tools at our disposal are scarce.

In this article we describe how we developed a Markdown extension to address challenges in managing content using Markdown in Django sites.

Do you think they had a linter?
Photo by mali maeder from Pexels

The Problem

Like every website, we have different types of (mostly) static content in places like our home page, FAQ section and "About" page. For a very long time, we managed all of this content directly in Django templates.

When we finally decided it was time to move this content out of templates and into the database, we thought it best to use Markdown. It's safer to produce HTML from Markdown, it provides a certain level of control and uniformity, and it is easier for non-technical users to handle. As we progressed with the move, we noticed we were missing a few things:

Internal Links

Links to internal pages can get broken when the URL changes. In Django templates and views we use reverse and {% url %}, but this is not available in plain Markdown.

Copy Between Environments

Absolute internal links cannot be copied between environments. This can be resolved using relative links, but there is no way to enforce this out of the box.

Invalid Links

Invalid links can harm user experience and cause the user to question the reliability of the entire content. This is not unique to Markdown, but HTML templates are maintained by developers who know a thing or two about URLs. Markdown documents, on the other hand, are intended for non-technical writers.

Prior Work

When I was researching this issue I searched for Python linters, Markdown preprocessors and extensions to help produce better Markdown. I found very few results. One approach that stood out was to use Django templates to produce Markdown documents.

Preprocess Markdown using Django Template

Using Django templates, you can use template tags such as url to reverse URL names, as well as conditions, variables, date formats and all the other Django template features. This approach essentially uses Django templates as a preprocessor for Markdown documents.

I personally felt like this may not be the best solution for non-technical writers. In addition, I was worried that providing access to Django template tags might be dangerous.


Using Markdown

With a better understanding of the problem, we were ready to dig a bit deeper into Markdown in Python.

Converting Markdown to HTML

To start using Markdown in Python, install the markdown package:

$ pip install markdown
Collecting markdown
Installing collected packages: markdown
Successfully installed markdown-3.2.1

Next, create a Markdown object and use the method convert to turn some Markdown into HTML:

>>> import markdown
>>> md = markdown.Markdown()
>>> md.convert("My name is **Haki**")
<p>My name is <strong>Haki</strong></p>

You can now use this HTML snippet in your template.

Using Markdown Extensions

The basic Markdown processor provides the essentials for producing HTML content. For more "exotic" options, the Python markdown package includes some built-in extensions. A popular extension is the "extra" extension that adds, among other things, support for fenced code blocks:

>>> import markdown
>>> md = markdown.Markdown(extensions=['extra'])
>>> md.convert("""```python
... print('this is Python code!')
... ```""")
<pre><code class="python">print(\'this is Python code!\')\n</code></pre>

To extend Markdown with our unique Django capabilities, we are going to develop an extension of our own.

If you look at the source, you'll see that to convert Markdown to HTML, the markdown package uses different processors. One type of processor is an inline processor. Inline processors match specific inline patterns such as links, backticks, bold text and underlined text, and convert them to HTML.

The main purpose of our Markdown extension is to validate and transform links. So, the inline processor we are most interested in is the LinkInlineProcessor. This processor takes markdown in the form of [Haki's website](https://hakibenita.com), parses it and returns a tuple containing the link and the text.
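To get a feel for what an inline link pattern does, here is a deliberately simplified sketch that uses only the standard library. The names SIMPLE_LINK_RE and find_links are made up for illustration and are not part of the markdown package. The real LinkInlineProcessor also handles titles, nested brackets and escaped characters, which this sketch ignores.

```python
import re
from typing import List, Tuple

# Simplified pattern for [text](href).
# Assumption: no nested brackets in the text and no parentheses in the href.
SIMPLE_LINK_RE = re.compile(r'\[([^\]]*)\]\(([^)]*)\)')


def find_links(text: str) -> List[Tuple[str, str]]:
    """Return a (text, href) pair for every simple inline link in the input."""
    return [(m.group(1), m.group(2)) for m in SIMPLE_LINK_RE.finditer(text)]


links = find_links("Visit [Haki's website](https://hakibenita.com) or go [home](home#top)")
print(links)
```

The href captured here is the value our extension will later validate and transform.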

To extend the functionality, we extend LinkInlineProcessor and create a markdown.Extension that uses it to handle links:

import markdown
from markdown.inlinepatterns import LinkInlineProcessor, LINK_RE


def get_site_domain() -> str:
    # TODO: Get your site domain here
    return 'example.com'


def clean_link(href: str, site_domain: str) -> str:
    # TODO: This is where the magic happens!
    return href


class DjangoLinkInlineProcessor(LinkInlineProcessor):
    def getLink(self, data, index):
        href, title, index, handled = super().getLink(data, index)
        site_domain = get_site_domain()
        href = clean_link(href, site_domain)
        return href, title, index, handled


class DjangoUrlExtension(markdown.Extension):
    def extendMarkdown(self, md, *args, **kwargs):
        md.inlinePatterns.register(DjangoLinkInlineProcessor(LINK_RE, md), 'link', 160)

Let's break it down:

  • The extension DjangoUrlExtension registers an inline link processor called DjangoLinkInlineProcessor. This processor will replace any other existing link processor.
  • The inline processor DjangoLinkInlineProcessor extends the built-in LinkInlineProcessor, and calls the function clean_link on every link it processes.
  • The function clean_link receives a link and a domain, and returns a transformed link. This is where we are going to plug in our implementation.

How to get the site domain

To identify links to your own site you must know the domain of your site. If you are using Django's sites framework you can use it to get the current domain.

I did not include this in my implementation because we don't use the sites framework. Instead, we set a variable in Django settings.

Another way to get the current domain is from an HttpRequest object. If content is only edited in your own site, you can try to plug the site domain from the request object. This may require some changes to the implementation.

To use the extension, add it when you initialize a new Markdown instance:

>>> md = markdown.Markdown(extensions=[DjangoUrlExtension()])
>>> md.convert("[haki's site](https://hakibenita.com)")
<p><a href="https://hakibenita.com">haki\'s site</a></p>

Great, the extension is being used and we are ready for the interesting part!


Now that we got the extension to call clean_link on all links, we can implement our validation and transformation logic.

To get the ball rolling, we'll start with a simple validation. mailto links are useful for opening the user's email client with a predefined recipient address, subject and even message body.

A common mailto link can look like this:

<a href="mailto:support@service.com?subject=I need help!">Help!</a>

This link will open your email client set to compose a new email to "support@service.com" with subject line "I need help!".

mailto links do not have to include an email address. If you look at the "share" buttons at the bottom of this article, you'll find a mailto link that looks like this:

<a
  href="mailto:?subject=Django Markdown by Haki Benita&body=http://hakibenita.com/django-markdown"
  title="Email">
  Share via Email
</a>

This mailto link does not include a recipient, just a subject line and message body.
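If you want to inspect the parts of a mailto link yourself, the standard library can split it for you. This is only a small exploratory sketch, and describe_mailto is a hypothetical helper name that is not used by the extension.

```python
from urllib.parse import parse_qs, urlsplit


def describe_mailto(href: str) -> dict:
    """Split a mailto link into its recipient and its query parameters."""
    parts = urlsplit(href)
    return {
        'recipient': parts.path,          # empty string when no address is given
        'params': parse_qs(parts.query),  # for example subject and body
    }


with_recipient = describe_mailto('mailto:support@service.com?subject=I need help!')
without_recipient = describe_mailto(
    'mailto:?subject=Hello&body=http://hakibenita.com/django-markdown'
)
print(with_recipient)
print(without_recipient)
```

The first link has a recipient and a subject, the second has an empty recipient with a subject and a body.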

Now that we have a good understanding of what mailto links look like, we can add the first validation to the clean_link function:

from typing import Optional
import re

from django.core.exceptions import ValidationError
from django.core.validators import EmailValidator


class Error(Exception):
    pass


class InvalidMarkdown(Error):
    def __init__(self, error: str, value: Optional[str] = None) -> None:
        self.error = error
        self.value = value

    def __str__(self) -> str:
        if self.value is None:
            return self.error
        return f'{self.error} "{self.value}"'


def clean_link(href: str, site_domain: str) -> str:
    if href.startswith('mailto:'):
        email_match = re.match('^(mailto:)?([^?]*)', href)
        if not email_match:
            raise InvalidMarkdown('Invalid mailto link', value=href)

        email = email_match.group(2)
        if email:
            try:
                EmailValidator()(email)
            except ValidationError:
                raise InvalidMarkdown('Invalid email address', value=email)

        return href

    # More validations to come...

    return href

To validate a mailto link we added the following code to clean_link:

  • Check if the link starts with mailto: to identify relevant links.
  • Split the link to its components using a regular expression.
  • Yank the actual email address from the mailto link, and validate it using Django's EmailValidator.
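The splitting step can be tried in isolation, without Django. The sketch below reuses the same regular expression; extract_email is a hypothetical helper name. One thing worth noticing is that both groups in the pattern can match an empty string, so as far as I can tell the pattern matches any input and the "Invalid mailto link" branch acts as a safety net rather than a check that is expected to fire.

```python
import re

MAILTO_RE = re.compile(r'^(mailto:)?([^?]*)')


def extract_email(href: str) -> str:
    """Return the address part of a mailto link, or '' when there is none."""
    match = MAILTO_RE.match(href)
    # Both groups may be empty, so the pattern matches every string.
    assert match is not None
    return match.group(2)


print(extract_email('mailto:support@service.com?subject=I need help!'))
print(repr(extract_email('mailto:?subject=I need help!')))
print(extract_email('mailto:invalidemail?subject=I need help!'))
```

Only a non-empty result is passed on to the email validator, which is why a mailto link without a recipient is accepted.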

Notice that we also added a new type of exception called InvalidMarkdown. We defined our own custom Exception type to distinguish it from other errors raised by markdown itself.

Custom error class

I wrote about custom error classes in the past, why they are useful and when you should use them.
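One benefit of a dedicated exception is that calling code can catch it and show the message to the writer, while letting unexpected errors bubble up. The following is a minimal sketch of that idea. The function check_content and the stand-in fake_convert are hypothetical; in a real project you would pass md.convert instead.

```python
from typing import Callable, Optional


class Error(Exception):
    pass


class InvalidMarkdown(Error):
    def __init__(self, error: str, value: Optional[str] = None) -> None:
        self.error = error
        self.value = value

    def __str__(self) -> str:
        if self.value is None:
            return self.error
        return f'{self.error} "{self.value}"'


def check_content(convert: Callable[[str], str], text: str) -> Optional[str]:
    """Return a message for the writer if the content is invalid, else None."""
    try:
        convert(text)
    except InvalidMarkdown as e:
        return str(e)
    return None


def fake_convert(text: str) -> str:
    # Hypothetical stand-in for md.convert that rejects one known-bad link.
    if 'mailto:invalidemail' in text:
        raise InvalidMarkdown('Invalid email address', value='invalidemail')
    return text


print(check_content(fake_convert, '[Help](mailto:invalidemail)'))
```

Any other exception type is not caught here, so real bugs are not mistaken for content problems.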

Before we move on, let's add some tests and see this in action:

>>> md = markdown.Markdown(extensions=[DjangoUrlExtension()])
>>> md.convert("[Help](mailto:support@service.com?subject=I need help!)")
'<p><a href="mailto:support@service.com?subject=I need help!">Help</a></p>'

>>> md.convert("[Help](mailto:?subject=I need help!)")
<p><a href="mailto:?subject=I need help!">Help</a></p>

>>> md.convert("[Help](mailto:invalidemail?subject=I need help!)")
InvalidMarkdown: Invalid email address "invalidemail"

Great! Worked as expected.

Now that we got our toes wet with mailto links, we can handle other types of links:

External Links

  • Links outside our Django app.
  • Must contain a scheme: either http or https.
  • Ideally, we also want to make sure these links are not broken, but we won't do that now.

Internal Links

  • Links to pages inside our Django app.
  • Link must be relative: this will allow us to move content between environments.
  • Use Django's URL names instead of a URL path: this will allow us to safely move views around without worrying about broken links in markdown content.
  • Links may contain query parameters (?) and a fragment (#).

SEO

From an SEO standpoint, public URLs should not change. When they do, you should handle it properly with redirects, otherwise you might get penalized by search engines.

With this list of requirements we can start working.

Resolving URL Names

To link to internal pages we want writers to provide a URL name, not a URL path. For example, say we have this URL configuration:

from django.urls import path
from app.views import home

urlpatterns = [
    path('', home, name='home'),
]

The URL of this page is https://example.com/, its URL path is / and its URL name is home. We want to use the URL name home in our Markdown links, like this:

Go back to [homepage](home)

This should render to:

<p>Go back to <a href="/">homepage</a></p>

We also want to support query params and hash:

Go back to [homepage](home#top)
Go back to [homepage](home?utm_source=faq)

This should render to the following HTML:

<p>Go back to <a href="/#top">homepage</a></p>
<p>Go back to <a href="/?utm_source=faq">homepage</a></p>

Using URL names, if we change the URL path, the links in the content will not be broken. To check if the href provided by the writer is a valid url_name, we can try to reverse it:

>>> from django.urls import reverse
>>> reverse('home')
'/'

The URL name "home" points to the url path "/". When there is no match, an exception is raised:

>>> from django.urls import reverse
>>> reverse('foo')
NoReverseMatch: Reverse for 'foo' not found.
'foo' is not a valid view function or pattern name.

Before we move forward, what happens when the URL name includes query params or a hash?

>>> from django.urls import reverse
>>> reverse('home#top')
NoReverseMatch: Reverse for 'home#top' not found.
'home#top' is not a valid view function or pattern name.

>>> reverse('home?utm_source=faq')
NoReverseMatch: Reverse for 'home?utm_source=faq' not found.
'home?utm_source=faq' is not a valid view function or pattern name.

This makes sense because query parameters and hash are not part of the URL name.

To use reverse and support query params and hashes, we first need to clean the value. Then, check that it is a valid URL name and return the URL path including the query params and hash, if provided:

import re
from django.urls import NoReverseMatch, reverse

def clean_link(href: str, site_domain: str) -> str:
    # ... Same as before ...

    # Remove fragments or query params before trying to match the URL name.
    href_parts = re.search(r'#|\?', href)
    if href_parts:
        start_ix = href_parts.start()
        url_name, url_extra = href[:start_ix], href[start_ix:]
    else:
        url_name, url_extra = href, ''

    try:
        url = reverse(url_name)
    except NoReverseMatch:
        pass
    else:
        return url + url_extra

    return href

This snippet uses a regular expression to split href at the first occurrence of either ? or #. The part before it is treated as the URL name, and the rest is appended to the reversed URL path as is.
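To see the split on its own, here is the same logic pulled out into a small standalone function. The name split_url_name is hypothetical and only used for this illustration.

```python
import re
from typing import Tuple


def split_url_name(href: str) -> Tuple[str, str]:
    """Split href at the first '?' or '#' into (url_name, url_extra)."""
    match = re.search(r'#|\?', href)
    if match is None:
        return href, ''
    start_ix = match.start()
    return href[:start_ix], href[start_ix:]


print(split_url_name('home'))                     # ('home', '')
print(split_url_name('home#top'))                 # ('home', '#top')
print(split_url_name('home?utm_source=faq#top'))  # ('home', '?utm_source=faq#top')
```

Because the split happens at the first match, a query string followed by a fragment stays together in the second part.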

Make sure that it works:

>>> md = markdown.Markdown(extensions=[DjangoUrlExtension()])
>>> md.convert("Go back to [homepage](home)")
<p>Go back to <a href="/">homepage</a></p>

>>> md.convert("Go back to [homepage](home#top)")
<p>Go back to <a href="/#top">homepage</a></p>

>>> md.convert("Go back to [homepage](home?utm_source=faq)")
<p>Go back to <a href="/?utm_source=faq">homepage</a></p>

>>> md.convert("Go back to [homepage](home?utm_source=faq#top)")
<p>Go back to <a href="/?utm_source=faq#top">homepage</a></p>

Amazing! Writers can now use URL names in Markdown. They can also include query parameters and a fragment to be added to the URL.

To handle external links properly we want to check two things:

  1. External links always provide a scheme, either http: or https:.
  2. Prevent absolute links to our own site. Internal links should use URL names.

So far, we handled URL names and mailto links. If we passed these two checks it means href is a URL. Let's start by checking if the link is to our own site:

from urllib.parse import urlparse

def clean_link(href: str, site_domain: str) -> str:
    parsed_url = urlparse(href)
    if parsed_url.netloc == site_domain:
        # TODO: URL is internal.
        ...

The function urlparse returns a named tuple that contains the different parts of the URL. If the netloc property equals the site_domain, the link is really an internal link.
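A quick look at what urlparse returns for a few inputs can help when reasoning about this check. The URLs below are made-up examples. Note that netloc includes the port when one is given, so if your environments use ports or subdomains you may need to adjust the comparison. That is an observation about urlparse, not something the implementation in this article handles.

```python
from urllib.parse import urlparse

internal = urlparse('https://example.com/faq?utm_source=home#top')
print(internal.scheme, internal.netloc, internal.path)  # https example.com /faq

# Without a scheme nothing is recognized as a host:
# netloc is empty and the whole value ends up in path.
no_scheme = urlparse('example.com/faq')
print(repr(no_scheme.netloc), no_scheme.path)

# netloc keeps the port, hostname does not.
with_port = urlparse('http://example.com:8000/faq')
print(with_port.netloc, with_port.hostname)
```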

If the URL is in fact internal, we need to fail. But keep in mind that writers are not necessarily technical people, so we want to help them out a bit and provide a useful error message. We require that internal links use a URL name and not a URL path, so it's best to let writers know what the URL name is for the path they provided.

To get the URL name of a URL path, Django provides a function called resolve:

>>> from django.urls import resolve
>>> resolve('/')
ResolverMatch(
    func=app.views.home,
    args=(),
    kwargs={},
    url_name=home,
    app_names=[],
    namespaces=[],
    route=,
)
>>> resolve('/').url_name
'home'

When a match is found, resolve returns a ResolverMatch object that contains, among other information, the URL name. When a match is not found, it raises an error:

>>> resolve('/foo')
Resolver404: {'tried': [[<URLPattern '' [name='home']>]], 'path': 'foo'}

This is actually what Django does under the hood to determine which view function to execute when a new request comes in.

To provide writers with better error messages we can use the URL name from the ResolverMatch object:

from urllib.parse import urlparse

from django.urls import Resolver404, resolve

def clean_link(href: str, site_domain: str) -> str:
    # ...

    parsed_url = urlparse(href)
    if parsed_url.netloc == site_domain:
        try:
            resolver_match = resolve(parsed_url.path)
        except Resolver404:
            raise InvalidMarkdown(
                "Should not use absolute links to the current site.\n"
                "We couldn't find a match to this URL. Are you sure it exists?",
                value=href,
            )
        else:
            raise InvalidMarkdown(
                "Should not use absolute links to the current site.\n"
                'Try using the url name "{}".'.format(resolver_match.url_name),
                value=href,
            )

    return href

When we identify that the link is internal, we handle two cases:

  • We don't recognize the URL: The url is most likely incorrect. Ask the writer to check the URL for mistakes.
  • We recognize the URL: The url is correct so tell the writer what URL name to use instead.

Let's see it in action:

>>> clean_link('https://example.com/', 'example.com')
InvalidMarkdown: Should not use absolute links to the current site.
Try using the url name "home". "https://example.com/"

>>> clean_link('https://example.com/foo', 'example.com')
InvalidMarkdown: Should not use absolute links to the current site.
We couldn't find a match to this URL. Are you sure it exists? "https://example.com/foo"

>>> clean_link('https://external.com', 'example.com')
'https://external.com'

Nice! External links are accepted and internal links are rejected with a helpful message.

Requiring Scheme

The last thing we want to do is to make sure external links include a scheme, either http: or https:. Let's add that last piece to the function clean_link:

def clean_link(href: str, site_domain: str) -> str:
    # ...
    parsed_url = urlparse(href)

    # ...
    if parsed_url.scheme not in ('http', 'https'):
        raise InvalidMarkdown(
            'Must provide an absolute URL '
            '(be sure to include https:// or http://)',
            href,
        )

    return href

Using the parsed URL we can easily check the scheme. Let's make sure it's working:

>>> clean_link('external.com', 'example.com')
InvalidMarkdown: Must provide an absolute URL (be sure to include https:// or http://) "external.com"

We provided the function with a link that has no scheme, and it failed with a helpful message. Cool!
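The scheme check also covers a few less obvious inputs. The sketch below isolates the condition in a hypothetical helper called has_web_scheme so it can be tried without Django. The example URLs are made up.

```python
from urllib.parse import urlparse


def has_web_scheme(href: str) -> bool:
    """Return True when href has an explicit http or https scheme."""
    return urlparse(href).scheme in ('http', 'https')


print(has_web_scheme('https://external.com'))     # True
print(has_web_scheme('external.com'))             # False, no scheme at all
print(has_web_scheme('//external.com'))           # False, protocol-relative link
print(has_web_scheme('ftp://external.com/file'))  # False, scheme is not http(s)
```

Since urlparse lowercases the scheme, a link such as HTTP://external.com passes the check as well.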

Putting it All Together

This is the complete code for the clean_link function:

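import re
from urllib.parse import urlparse

from django.core.exceptions import ValidationError
from django.core.validators import EmailValidator
from django.urls import NoReverseMatch, Resolver404, resolve, reverse


# InvalidMarkdown is the custom exception defined earlier in the article.
def clean_link(href: str, site_domain: str) -> str: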
    if href.startswith('mailto:'):
        email_match = re.match(r'^(mailto:)?([^?]*)', href)
        if not email_match:
            raise InvalidMarkdown('Invalid mailto link', value=href)

        email = email_match.groups()[-1]
        if email:
            try:
                EmailValidator()(email)
            except ValidationError:
                raise InvalidMarkdown('Invalid email address', value=email)

        return href

    # Remove fragments or query params before trying to match the url name
    href_parts = re.search(r'#|\?', href)
    if href_parts:
        start_ix = href_parts.start()
        url_name, url_extra = href[:start_ix], href[start_ix:]
    else:
        url_name, url_extra = href, ''

    try:
        url = reverse(url_name)
    except NoReverseMatch:
        pass
    else:
        return url + url_extra

    parsed_url = urlparse(href)

    if parsed_url.netloc == site_domain:
        try:
            resolver_match = resolve(parsed_url.path)
        except Resolver404:
            raise InvalidMarkdown(
                "Should not use absolute links to the current site.\n"
                "We couldn't find a match to this URL. Are you sure it exists?",
                value=href,
            )
        else:
            raise InvalidMarkdown(
                "Should not use absolute links to the current site.\n"
                'Try using the url name "{}".'.format(resolver_match.url_name),
                value=href,
            )

    if parsed_url.scheme not in ('http', 'https'):
        raise InvalidMarkdown(
            'Must provide an absolute URL '
            '(be sure to include https:// or http://)',
            href,
        )

    return href

To get a sense of what a real use case for all of these features looks like, take a look at the following content:

# How to Get Started?

Download the [mobile app](https://some-app-store.com/our-app) and log in to your account.
If you don't have an account yet, [sign up now](signup?utm_source=getting_started).
For more information about pricing, check our [pricing plans](home#pricing-plans)

This will produce the following HTML:

<h1>How to Get Started?</h1>
<p>Download the <a href="https://some-app-store.com/our-app">mobile app</a> and log in to your account.
If you don't have an account yet, <a href="/signup/?utm_source=getting_started">sign up now</a>.
For more information about pricing, check our <a href="/#pricing-plans">pricing plans</a></p>

Nice!

Conclusion

We now have a pretty sweet extension that can validate and transform links in Markdown documents! It is now much easier to move documents between environments and keep our content tidy and most importantly, correct and up to date!

Source

The full source code can be found in this gist.

Taking it Further

The capabilities described in this article worked well for us, but you might want to adjust them to fit your own needs.

If you need some ideas, then in addition to this extension we also created a markdown Preprocessor that lets writers use constants in Markdown. For example, we defined a constant called SUPPORT_EMAIL, and we use it like this:

Contact our support at [$SUPPORT_EMAIL](mailto:$SUPPORT_EMAIL)

The preprocessor will replace the string $SUPPORT_EMAIL with the text we defined, and only then render the Markdown.
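The article does not include the preprocessor itself, so the following is not our implementation, just a minimal sketch of the replacement step using the standard library. It assumes constants are written as $NAME, and the CONSTANTS mapping and replace_constants function are hypothetical names. In the markdown package this kind of logic would typically live in the run method of a Preprocessor subclass, which receives the document as a list of lines.

```python
from string import Template

# Hypothetical constants; in a real project these might come from settings.
CONSTANTS = {'SUPPORT_EMAIL': 'support@service.com'}


def replace_constants(text: str) -> str:
    """Replace $NAME placeholders with their values, leaving unknown names as-is."""
    return Template(text).safe_substitute(CONSTANTS)


source = 'Contact our support at [$SUPPORT_EMAIL](mailto:$SUPPORT_EMAIL)'
print(replace_constants(source))
```

Using safe_substitute means a stray dollar sign or an unknown name in the content does not raise an error. Whether unknown constants should instead be reported to the writer is a design choice left open here.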



