Getting Started With Property-Based Testing in Python With Hypothesis and Pytest

This tutorial is a gentle guide to property-based testing. Property-based testing is a testing philosophy, a way of approaching testing, much like unit testing is a testing philosophy in which you write tests that verify individual components of your code.

By going through this tutorial, you will:

learn what property-based testing is;
understand the key benefits of using property-based testing;
see how to create property-based tests with Hypothesis;
attempt a small challenge to understand how to write good property-based tests; and
explore several situations in which you can use property-based testing with zero overhead.

What is Property-Based Testing?

In the most common types of testing, you write a test by running your code and then checking if the result you got matches the reference result you expected. This is in contrast with property-based testing, where you write tests that check that the results satisfy certain properties. This shift in perspective makes property-based testing (with Hypothesis) a great tool for a variety of scenarios, like fuzzing or testing roundtripping.

In this tutorial, we will be learning about the concepts behind property-based testing, and then we will put those concepts to practice. In order to do that, we will use three tools: Python, pytest, and Hypothesis.

Python will be the programming language in which we will write both our functions that need testing and our tests.
pytest will be the testing framework.
Hypothesis will be the framework that will enable property-based testing.

Both Python and pytest are simple enough that, even if you are not a Python programmer or a pytest user, you should be able to follow along and get benefits from learning about property-based testing.

Setting up your environment to follow along

If you want to follow along with this tutorial and run the snippets of code and the tests yourself – which is highly recommended – here is how you set up your environment.

Installing Python and pip

Start by making sure you have a recent version of Python installed. Head to the Python downloads page and grab the most recent version for yourself. Then, make sure your Python installation also has pip installed. pip is the package installer for Python and you can check if you have it on your machine by running the following command:

python -m pip --version

(This assumes python is the command to run Python on your machine.) If pip is not installed, follow pip's installation instructions.

Installing pytest and Hypothesis

pytest, the Python testing framework, and Hypothesis, the property-based testing framework, are easy to install after you have pip. All you have to do is run this command:

python -m pip install pytest hypothesis --upgrade

This tells pip to install pytest and Hypothesis and additionally it tells pip to update to newer versions if any of the packages are already installed.

To make sure pytest has been properly installed, you can run the following command:

> python -m pytest --version
pytest 7.2.0

The output on your machine may show a different version, depending on the exact version of pytest you have installed.

To ensure Hypothesis has been installed correctly, you have to open your Python REPL by running the following:

python

and then, within the REPL, type import hypothesis. If Hypothesis was properly installed, it should look like nothing happened. Immediately after, you can check for the version you have installed with hypothesis.__version__. Thus, your REPL session would look something like this:

>>> import hypothesis
>>> hypothesis.__version__
'6.60.0'

Your first property-based test

In this section, we will write our very first property-based test for a small function. This will show how to write basic tests with Hypothesis.

The function to test

Suppose we implemented a function gcd(n, m) that computes the greatest common divisor of two integers. (The greatest common divisor of n and m is the largest integer d that divides evenly into n and m.) What’s more, suppose that our implementation handles positive and negative integers. Here is what this implementation could look like:

def gcd(n, m):
    """Compute the GCD of two integers by Euclid's algorithm."""

    n, m = abs(n), abs(m)
    n, m = min(n, m), max(n, m)  # Sort their absolute values.
    while m % n:         # While `n` doesn't divide into `m`:
        n, m = m % n, n  # update the values of `n` and `m`.
    return n

If you save that into a file, say gcd.py, and then run it with:

> python -i gcd.py

you will enter an interactive REPL with your function already defined. This allows you to play with it a bit:

> python -i gcd.py
>>> gcd(15, 6)
3
>>> gcd(15, 5)
5
>>> gcd(-9, 15)
3

Now that the function is running and looks about right, we will test it with Hypothesis.

The property test

A property-based test isn’t wildly different from a standard (pytest) test, but there are some key differences. For example, instead of writing inputs to the function gcd, we let Hypothesis generate arbitrary inputs. Then, instead of hardcoding the expected outputs, we write assertions that ensure that the solution satisfies the properties that it should satisfy.

Thus, to write a property-based test, you need to determine the properties that your answer should satisfy.

Thankfully for us, we already know the properties that the result of gcd must satisfy:

“[…] the greatest common divisor (GCD) of two or more integers […] is the largest positive integer that divides each of the integers.”

So, from that Wikipedia quote, we know that if d is the result of gcd(n, m), then:

d is positive;
d divides n;
d divides m; and
no other number larger than d divides both n and m.

To turn these properties into a test, we start by writing the signature of a test_ function that accepts the same inputs as the function gcd:

def test_gcd(n, m):
    ...

(The prefix test_ is not significant for Hypothesis. We are using Hypothesis with pytest and pytest looks for functions that start with test_, so that is why our function is called test_gcd.)

The arguments n and m, which are also the arguments of gcd, will be filled in by Hypothesis. For now, we will just assume that they are available.

Assuming n and m are available, the test has to start by calling gcd with n and m and saving the result. Only after getting that answer can we check it against the four properties listed above.

Taking the four properties into account, our test function could look like this:

def test_gcd(n, m):
    d = gcd(n, m)

    assert d > 0  # 1) `d` is positive
    assert n % d == 0  # 2) `d` divides `n`
    assert m % d == 0  # 3) `d` divides `m`

    # 4) no other number larger than `d` divides both `n` and `m`
    for i in range(d + 1, min(abs(n), abs(m)) + 1):
        assert (n % i) or (m % i)

Go ahead and put this test function next to the function gcd in the file gcd.py. Typically, tests live in a different file from the code being tested but this is such a small example that we can have everything in the same file.

Plugging in Hypothesis

We have written the test function but we still haven’t used Hypothesis to power the test. Let’s go ahead and use Hypothesis’ magic to generate a bunch of arguments n and m for our function gcd. In order to do that, we need to figure out which inputs our function gcd should handle.

For our function gcd, the valid inputs are all integers, so we need to tell Hypothesis to generate integers and feed them into test_gcd. To do that, we need to import a couple of things:

from hypothesis import given, strategies as st

given is what we will use to tell Hypothesis that a test function needs to be given data. The submodule strategies is the module that contains lots of tools that know how to generate data.

With these two imports, we can annotate our test:

from hypothesis import given, strategies as st

def gcd(n, m):
    ...

@given(st.integers(), st.integers())
def test_gcd(n, m):
    d = gcd(n, m)
    # ...

You can read the decorator @given(st.integers(), st.integers()) as “the test function needs to be given one integer, and then another integer”. To run the test, you can just use pytest:

> pytest gcd.py

(Note: depending on your operating system and the way you have things configured, pytest may not end up in your path, and the command pytest gcd.py may not work. If that is the case for you, you can use the command python -m pytest gcd.py instead.)

As soon as you do so, Hypothesis will scream an error message at you, saying that you got a ZeroDivisionError. Let us try to understand what Hypothesis is telling us by looking at the bottom of the output of running the tests:

...
gcd.py:8: ZeroDivisionError
--------------------------------- Hypothesis ----------------------------------
Falsifying example: test_gcd(
    m=0, n=0,
)
=========================== short test summary info ===========================
FAILED gcd.py::test_gcd - ZeroDivisionError: integer division or modulo by zero
============================== 1 failed in 0.67s ==============================

This shows that the tests failed with a ZeroDivisionError, and the line that reads “Falsifying example: …” contains information about the test case that blew our test up. In our case, this was n = 0 and m = 0. So, Hypothesis is telling us that when the arguments are both zero, our function fails because it raises a ZeroDivisionError.

The problem lies in the usage of the modulo operator %, which does not accept a right argument of zero. The right argument of % is zero if n is zero, in which case the result should be m. Adding an if statement is a possible fix for this:

def gcd(n, m):
    """Compute the GCD of two integers by Euclid's algorithm."""

    n, m = abs(n), abs(m)
    n, m = min(n, m), max(n, m)  # Sort their absolute values.

    if not n:
        return m

    while m % n:         # While `n` doesn't divide into `m`:
        n, m = m % n, n  # update the values of `n` and `m`.
    return n

However, Hypothesis still won’t be happy. If you run your test again, with pytest gcd.py, you get this output:

> pytest gcd.py
...

FAILED gcd.py::test_gcd - assert 0 > 0

This time, the issue is with the very first property that should be satisfied. We know this because Hypothesis tells us which assertion failed while also telling us which arguments led to that failure. In fact, if we look further up the output, this is what we see:

n = 0, m = 0   <-- Hypothesis tells you what arguments broke the test

    @example(0, 0)
    @given(st.integers().filter(lambda n: n != 0), st.integers())
    def test_gcd(n, m):
        d = gcd(n, m)

>       assert d > 0  # 1) `d` is positive
E       assert 0 > 0
E       Falsifying explicit example: test_gcd(
E           n=0, m=0,
E       )

gcd.py:23: AssertionError
===================== short test summary info =====================
FAILED gcd.py::test_gcd - assert 0 > 0

This time, the issue isn’t really our fault. Under the definition quoted above, the greatest common divisor is not defined when both arguments are zero: every positive integer divides zero, so there is no largest common divisor. It is therefore ok for our function to not know how to handle this case. Thankfully, Hypothesis lets us customise the strategies used to generate arguments. In particular, we can say that we only want to generate integers between a minimum and a maximum value.

The code below changes the test so that it only runs with integers between 1 and 100 for the first argument (n) and between -500 and 500 for the second argument (m):

@given(
    st.integers(min_value=1, max_value=100),
    st.integers(min_value=-500, max_value=500),
)
def test_gcd(n, m):
    d = gcd(n, m)
    # ...

That is it! This was your very first property-based test.
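
For readers who want everything in one place, here is a sketch of what the complete gcd.py could look like at this point. It combines the fixed gcd with the bounded strategies shown above. One detail is an assumption about intent rather than something stated in the text: property 4 is checked over absolute values with an inclusive upper bound, so that the loop still does useful work when m is negative.

```python
from hypothesis import given, strategies as st


def gcd(n, m):
    """Compute the GCD of two integers by Euclid's algorithm."""
    n, m = abs(n), abs(m)
    n, m = min(n, m), max(n, m)  # Sort their absolute values.

    if not n:  # If the smaller value is 0, return the other; avoids `m % 0`.
        return m

    while m % n:         # While `n` doesn't divide into `m`:
        n, m = m % n, n  # update the values of `n` and `m`.
    return n


@given(
    st.integers(min_value=1, max_value=100),
    st.integers(min_value=-500, max_value=500),
)
def test_gcd(n, m):
    d = gcd(n, m)

    assert d > 0       # 1) `d` is positive
    assert n % d == 0  # 2) `d` divides `n`
    assert m % d == 0  # 3) `d` divides `m`

    # 4) no number larger than `d` divides both `n` and `m`.
    # A common divisor of two non-zero numbers cannot exceed the smaller
    # absolute value. When `m` is 0 the range is empty and `d` equals `n`.
    for i in range(d + 1, min(abs(n), abs(m)) + 1):
        assert (n % i) or (m % i)
```

Running pytest gcd.py (or python -m pytest gcd.py) on a file like this should report a single passing test.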

Why bother with Property-Based Testing?

To write good property-based tests you need to analyse your problem carefully to be able to write down all the properties that are relevant. This may look quite cumbersome. However, using a tool like Hypothesis has very practical benefits:

Hypothesis can generate dozens or hundreds of tests for you, while you would typically only write a couple of them;
tests you write by hand will typically only cover the edge cases you have already thought of, whereas Hypothesis will not have that bias; and
thinking about your solution to figure out its properties can give you deeper insights into the problem, leading to even better solutions.

These are just some of the advantages of using property-based testing.

Using Hypothesis for free

There are some scenarios in which you can use property-based testing essentially for free (that is, without needing to spend your precious brain power), because you barely need to think about properties. Let’s look at three such scenarios.

Testing Roundtripping

Hypothesis is a great tool to test roundtripping. For example, the built-in functions int and str in Python should roundtrip. That is, if x is an integer, then int(str(x)) should still be x. In other words, converting x to a string and then to an integer again should not change its value.

We can write a simple property-based test for this, leveraging the fact that Hypothesis generates dozens of tests for us. Save this in a Python file:

from hypothesis import given, strategies as st

@given(st.integers())
def test_int_str_roundtripping(x):
    assert x == int(str(x))

Now, run this file with pytest. Your test should pass!
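
The same pattern applies to any pair of encode/decode functions. As a further sketch, the test below checks that json.dumps and json.loads from the standard library roundtrip lists of integers. The test name and the choice of JSON are illustrative assumptions, not part of the original example.

```python
import json

from hypothesis import given, strategies as st


@given(st.lists(st.integers()))
def test_json_roundtripping(xs):
    # Encoding to a JSON string and decoding it again should give back
    # a list equal to the one we started with.
    assert json.loads(json.dumps(xs)) == xs
```

Note that the strategy matters here: floats, for instance, would need more care because NaN is not equal to itself.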

Fuzzing

Did you notice that, in our gcd example above, the very first time we ran Hypothesis we got a ZeroDivisionError? The test failed, not because of an assert, but simply because our function crashed.

Hypothesis can be used for tests like this. You do not need to write a single property because you are just using Hypothesis to see if your function can deal with different inputs. Of course, even a buggy function can pass a fuzzing test like this, but this helps catch some types of bugs in your code.
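
A minimal sketch of such a fuzzing test for gcd could look like the following. The test body contains no assertions at all; the test name is an illustrative choice, and the gcd shown is the fixed version from earlier in this tutorial.

```python
from hypothesis import given, strategies as st


def gcd(n, m):
    """Compute the GCD of two integers by Euclid's algorithm."""
    n, m = abs(n), abs(m)
    n, m = min(n, m), max(n, m)  # Sort their absolute values.
    if not n:
        return m
    while m % n:
        n, m = m % n, n
    return n


@given(st.integers(), st.integers())
def test_gcd_does_not_crash(n, m):
    # No assertions: the test only fails if `gcd` raises an exception.
    gcd(n, m)
```

If you remove the `if not n` guard from gcd, this test should fail with the same ZeroDivisionError seen before.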

Comparing against a gold standard

Sometimes, you want to test a function f that computes something that could be computed by some other function f_alternative. You know this other function is correct (that is why you call it a “gold standard”), but you cannot use it in production because it is very slow, or it consumes a lot of resources, or for some other combination of reasons.

Provided it is ok to use the function f_alternative in a testing environment, a suitable test would be something like the following:

@given(...)
def test_f(...):
    assert f(...) == f_alternative(...)

When possible, this type of test is very powerful because it directly tests if your solution is correct for a series of different arguments.

For example, if you refactored an old piece of code, perhaps to simplify its logic or to make it more performant, Hypothesis will give you confidence that your new function will work as it should.
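
To make this concrete, here is a sketch that compares the gcd function from earlier against a slow, brute-force reference. The helper gcd_brute_force is hypothetical and written only for this illustration; the input bounds are arbitrary and just keep the brute-force search short. Both functions return 0 for two zero arguments, which is a convention rather than something the definition quoted earlier covers.

```python
from hypothesis import given, strategies as st


def gcd(n, m):
    """Fast implementation under test (Euclid's algorithm)."""
    n, m = abs(n), abs(m)
    n, m = min(n, m), max(n, m)
    if not n:
        return m
    while m % n:
        n, m = m % n, n
    return n


def gcd_brute_force(n, m):
    """Slow reference implementation (hypothetical gold standard)."""
    n, m = abs(n), abs(m)
    if n == 0 or m == 0:
        return max(n, m)  # Everything divides 0, so the other value wins.
    # Try every candidate from the largest possible one downwards;
    # the loop always returns because 1 divides every integer.
    for d in range(min(n, m), 0, -1):
        if n % d == 0 and m % d == 0:
            return d


@given(
    st.integers(min_value=-200, max_value=200),
    st.integers(min_value=-200, max_value=200),
)
def test_gcd_matches_reference(n, m):
    assert gcd(n, m) == gcd_brute_force(n, m)
```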

The importance of property completeness

In this section you will learn about the importance of being thorough when listing the properties that are relevant. To illustrate the point, we will reason about property-based tests for a function called my_sort, which is your implementation of a sorting function that accepts lists of integers.

The results are sorted

When thinking about the properties that the result of my_sort satisfies, you come up with the obvious thing: the result of my_sort must be sorted.

So, you set out to assert this property is satisfied:

@given(...)
def test_my_sort(int_list):
    result = my_sort(int_list)
    for a, b in zip(result, result[1:]):
        assert a <= b

Now, the only thing missing is the appropriate strategy to generate lists of integers. Thankfully, Hypothesis knows a strategy to generate lists, which is called lists. All you need to do is give it a strategy that generates the elements of the list.

from hypothesis import given, strategies as st


@given(st.lists(st.integers()))
def test_my_sort(int_list):
    result = my_sort(int_list)
    for a, b in zip(result, result[1:]):
        assert a <= b

Now that the test has been written, here is a challenge. Copy this code into a file called my_sort.py. Between the import and the test, define a function my_sort that is wrong (that is, write a function that does not sort lists of integers) and yet passes the test if you run it with pytest my_sort.py. (Keep reading when you are ready for spoilers.)

Notice that the only property that we are testing is “all elements of the result are sorted”, so we can return whatever result we want, as long as it is sorted. Here is my fake implementation of my_sort:

def my_sort(int_list):
    return []

This passes our property test and yet is clearly wrong because we always return an empty list. So, are we missing a property? Perhaps.

The lengths are the same

We can try to add another obvious property, which is that the input and the output should have the same length. This means that our test becomes:

@given(st.lists(st.integers()))
def test_my_sort(int_list):
    result = my_sort(int_list)

    assert len(result) == len(int_list)

    for a, b in zip(result, result[1:]):
        assert a <= b

Now that the test has been improved, here is a challenge. Write a new version of my_sort that passes this test and is still wrong. (Keep reading when you are ready for spoilers.)

Notice that we are only testing for the length of the result and whether or not its elements are sorted, but we don’t test which elements are contained in the result. Thus, this fake implementation of my_sort would work:

def my_sort(int_list):
    return list(range(len(int_list)))

Use the right numbers

To fix this, we can add the obvious property that the result should only contain numbers from the original list. With sets, this is easy to test:

@given(st.lists(st.integers()))
def test_my_sort(int_list):
    result = my_sort(int_list)

    assert len(result) == len(int_list)  # Should have same length.

    assert set(result) <= set(int_list)  # Should use numbers from input.

    for a, b in zip(result, result[1:]):  # Result is actually sorted.
        assert a <= b

Now that our test has been improved, I have yet another challenge. Can you write a fake version of my_sort that passes this test? (Keep reading when you are ready for spoilers.)

Here is a fake version of my_sort that passes the test above:

def my_sort(int_list):
    if not int_list:
        return []
    return len(int_list) * [int_list[0]]

The issue here is that we were not precise enough with our new property. In fact, set(result) <= set(int_list) ensures that we only use numbers that were available in the original list, but it doesn’t ensure that we use all of them. What is more, we can’t fix it by simply replacing the <= with ==. Can you see why? I will give you a hint. If you just replace the <= with a ==, so that the test becomes:

@given(st.lists(st.integers()))
def test_my_sort(int_list):
    result = my_sort(int_list)

    assert len(result) == len(int_list)  # Should have same length.

    assert set(result) == set(int_list)  # Same numbers as input.

    for a, b in zip(result, result[1:]):  # Result is actually sorted.
        assert a <= b

then you can write this passing version of my_sort that is still wrong:

def my_sort(int_list):
    if not int_list:
        return []

    s = sorted(set(int_list))
    return s + [s[-1]] * (len(int_list) - len(s))

This version is wrong because it reuses the largest element of the original list without respecting the number of times each integer should be used. For example, for the input list [1, 1, 2, 2, 3, 3] the result should be unchanged, whereas this version of my_sort returns [1, 2, 3, 3, 3, 3].
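
The short script below replays that example by hand. The function is the wrong implementation from above, renamed fake_sort for this sketch, and the built-in sorted is used only to show what the right answer would be.

```python
def fake_sort(int_list):
    """The wrong implementation from above, renamed for this sketch."""
    if not int_list:
        return []
    s = sorted(set(int_list))
    return s + [s[-1]] * (len(int_list) - len(s))


original = [1, 1, 2, 2, 3, 3]
result = fake_sort(original)

print(result)                        # [1, 2, 3, 3, 3, 3]
print(len(result) == len(original))  # True: same length
print(set(result) == set(original))  # True: same set of values
print(result == sorted(original))    # False: not the sorted input
```

The length check and the set check both pass, yet the output is not the sorted input, which is why the test needs a stronger property.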

The final test

A test that is correct and complete would have to take into account how many times each number appears in the original list, which is something the built-in set is not prepared to do. Instead, one could use the collections.Counter from the standard library:

from collections import Counter


@given(st.lists(st.integers()))
def test_my_sort(int_list):
    result = my_sort(int_list)

    assert len(result) == len(int_list)  # Should have same length.

    assert Counter(result) == Counter(int_list)  # Same numbers, same counts.

    for a, b in zip(result, result[1:]):  # Result is actually sorted.
        assert a <= b

At this point, your test function test_my_sort is complete and it is no longer possible to fool the test! That is, the only way to satisfy all three properties for every input is for my_sort to be a real sorting function.
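
As a sanity check, the sketch below bundles the three assertions of the final test into a plain helper and runs every fake implementation from this section against one hand-picked list. The helper satisfies_sort_properties and the fake_ names are hypothetical, introduced only for this illustration, and a single sample input is of course far weaker evidence than letting Hypothesis generate many.

```python
from collections import Counter


def satisfies_sort_properties(int_list, result):
    """Return True if `result` passes the three checks of the final test."""
    same_length = len(result) == len(int_list)
    same_counts = Counter(result) == Counter(int_list)
    is_sorted = all(a <= b for a, b in zip(result, result[1:]))
    return same_length and same_counts and is_sorted


def fake_empty(int_list):
    return []


def fake_range(int_list):
    return list(range(len(int_list)))


def fake_repeat_first(int_list):
    return len(int_list) * [int_list[0]] if int_list else []


def fake_pad_with_max(int_list):
    if not int_list:
        return []
    s = sorted(set(int_list))
    return s + [s[-1]] * (len(int_list) - len(s))


sample = [3, 1, 2, 3, 1]
for fake in (fake_empty, fake_range, fake_repeat_first, fake_pad_with_max):
    # Each of the fake implementations is rejected: prints False four times.
    print(fake.__name__, satisfies_sort_properties(sample, fake(sample)))

# A real sorting function is accepted: prints True.
print(satisfies_sort_properties(sample, sorted(sample)))
```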

Use properties and specific examples

This section showed that the properties that you test should be well thought-through and you should strive to come up with a set of properties that are as specific as possible. When in doubt, it is better to have properties that may look redundant over having too few.

Another strategy that you can follow to help mitigate the danger of having come up with an insufficient set of properties is to mix property-based testing with other forms of testing, which is perfectly reasonable.

For example, on top of having the property-based test test_my_sort, you could add the following test:

def test_my_sort_specific_examples():
    assert my_sort([]) == []
    assert my_sort(list(range(10)[::-1])) == list(range(10))
    assert my_sort([42, 73, 0, 16, 10]) == [0, 10, 16, 42, 73]
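
Hypothesis also offers a way to mix the two styles inside a single test: the example decorator pins specific inputs that are always tried in addition to the generated ones. The sketch below assumes a stand-in my_sort that simply calls the built-in sorted so that the snippet runs on its own; swap in your own implementation.

```python
from collections import Counter

from hypothesis import example, given, strategies as st


def my_sort(int_list):
    """Stand-in so the sketch runs; replace with your own implementation."""
    return sorted(int_list)


@given(st.lists(st.integers()))
@example([])                   # Always try the empty list.
@example([42, 73, 0, 16, 10])  # Always try this hand-picked list.
def test_my_sort(int_list):
    result = my_sort(int_list)

    assert Counter(result) == Counter(int_list)  # Same numbers, same counts.
    for a, b in zip(result, result[1:]):         # Result is actually sorted.
        assert a <= b
```

Keep in mind that example only pins inputs, not expected outputs, so it complements a test with hardcoded results like the one above rather than replacing it.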

Conclusion

This article covered two examples of functions to which we added property-based tests. We only covered the basics of using Hypothesis to run property-based tests but, more importantly, we covered the fundamental concepts that enable a developer to reason about and write complete property-based tests.

Property-based testing isn’t a one-size-fits-all solution that means you will never have to write any other type of test, but it does have characteristics that you should take advantage of whenever possible. In particular, we saw that property-based testing with Hypothesis was beneficial in that:

Hypothesis can generate dozens or hundreds of tests for you, while you would typically only write a couple of them;
tests you write by hand will typically only cover the edge cases you have already thought of, whereas Hypothesis will not have that bias; and
thinking about your solution to figure out its properties can give you deeper insights into the problem, leading to even better solutions.

This article also went over a couple of common gotchas when writing property-based tests and listed scenarios in which property-based testing can be used with no overhead.

If you are interested in learning more about Hypothesis and property-based testing, we recommend you take a look at the Hypothesis docs and, in particular, at the page “What you can generate and how”.

5 thoughts on “Getting Started With Property-Based Testing in Python With Hypothesis and Pytest”

Ivelin Ivanov says:

February 21, 2023 at 11:07 pm

Awesome intro to property based testing for Python. Thank you, Dan and Rodrigo!

Layco says:

June 2, 2023 at 11:52 am

Greeting! Unfortunately, I don’t understand due to translation difficulties. PyCharm writes error messages and does not run the codes. The installation was done fine, check ok. I created a virtual environment.
I would like a single good, usable, complete code, an example of what to write in gcd.py and what in test_gcd.py, which the development environment runs without errors. Thanks!

Ivy says:

June 19, 2023 at 3:14 pm

Thanks for article!

Wiktor says:

August 22, 2023 at 2:57 pm

“it is better to have properties that may look redundant over having too few”
Isn’t it the case with:
assert len(result) == len(int_list)
and:
assert Counter(result) == Counter(int_list)
?
I mean: is it possible to satisfy the second condition without satisfying the first ?

Toni Akinjiola says:

March 2, 2024 at 3:23 am

Yes. One case could be if result = [0,1], int_list = [0,1,1], and the implementation of Counter returns unique count.