How and why I use pytest's xfail

Although I was initially skeptical, several years ago I was convinced that xfail is quite a useful tool for managing an evolving codebase. What first convinced me was J. Brock Mendel's[1] suggestion to add an xfail test as part of a bug report. I wasn't crazy about the idea, because it felt a bit like committing a bunch of commented-out code: why store broken tests directly in the repository?

The philosophy I've settled on, partially thanks to Brock's answer, is that xfailing tests provide several benefits:

  1. They serve as an acceptance criterion for the deficiency they're marking.
  2. They provide evidence that your tests will properly serve as regression tests.
  3. They can be used as a sentinel for when unrelated behavior changes fix known bugs — and they give you regression tests for those bugs for free!

I will justify these points one at a time, but first I should back up and clarify what it means to mark a test as xfail.

xfail vs skip

Despite the pytest documentation containing a very clear explanation of when to use pytest.mark.skip and when to use pytest.mark.xfail, I still see a lot of confusion on this point (it's almost as if people don't sit down and read the entirety of a library's documentation before they start using it…). It's understandable: both are broadly used to mark tests that won't succeed. The main difference is that with skip you don't run the test at all, whereas with xfail you still run the test and record whether it failed (XFAIL) or unexpectedly passed (XPASS). So the way I use them is:

Use skip for tests that aren't supposed to work: if you have a test for Windows-specific functionality, or tests that only work on Python 3.8+, conditionally skip them when running the test suite in environments where they aren't supported. This means that you should essentially always be using skip with a condition[2][3], because otherwise it makes no sense to include the test at all: you don't intend to change the software to make the test pass, and you never actually run the test, so it's just dead code.
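As a minimal sketch of what this looks like in practice (the test names here are hypothetical, but the skip conditions are real), conditional skips read something like:

import sys

import pytest


@pytest.mark.skipif(sys.platform != "win32", reason="winreg is only available on Windows")
def test_windows_registry():
    # This module only exists on Windows, so the test can't even import
    # its dependency elsewhere; the skipif condition says exactly that.
    import winreg

    assert hasattr(winreg, "HKEY_LOCAL_MACHINE")


@pytest.mark.skipif(sys.version_info < (3, 8), reason="math.isqrt requires Python 3.8+")
def test_integer_square_root():
    import math

    assert math.isqrt(17) == 4

In both cases the condition documents exactly where the test is expected to work, and the test simply isn't run anywhere else.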

Use xfail for tests that should pass, but currently don't. A high quality bug report has basically the same features as a good test: it contains a self-contained, minimal reproduction of a particular failure mode in a desired behavior of the software. If you've gone through all the trouble of crafting a minimal bug report, you can often simply take your example reproducer and add it to the test suite, marked as pytest.mark.xfail. Similarly, you could write the tests for a feature you'd like before writing the code that implements it, TDD-style, marking the failing tests as xfail until the feature is implemented.
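For the TDD-style case, a sketch might look something like this (slugify and mypackage are hypothetical names, standing in for a feature that hasn't been written yet):

import pytest


@pytest.mark.xfail(reason="Unicode transliteration not implemented yet")
def test_slugify_handles_unicode():
    # This test is written and checked in before the feature exists; it
    # XFAILs until the transliteration feature actually lands.
    from mypackage.text import slugify

    assert slugify("Héllo Wörld") == "hello-world"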

Making failing to fail a failure

By default, when you mark a test as xfail, the test will be run as part of the test suite and the outcome will be recorded as either XFAIL or XPASS, but the test suite itself will pass no matter which of the two outcomes occurs. In my opinion, this makes xfail little more than a resource-hungry version of skip, because when running hundreds or thousands of tests, most people won't notice that a few of them XPASSed, much less track them down. What gives xfail its super-powers is the fact that you can make it strict, meaning that any XPASSing test will cause the test suite to fail. You can either do this per-test with pytest.mark.xfail(strict=True), or you can set it globally in setup.cfg or one of the other global configuration locations:

[tool:pytest]
xfail_strict = true
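If your project configures pytest through pyproject.toml rather than setup.cfg, the equivalent setting is:

[tool.pytest.ini_options]
xfail_strict = true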

This immediately makes xfail more useful, because it is enforcing that you've written a test that fails in the current state of the world. If you've written the test correctly, it should start passing when the bug is fixed, and you can be much more confident that your test is actually testing the behavior you're intending to test.

That said, this is not perfect — xfail only measures that a test fails, not necessarily why. For example, imagine we have the following broken function and test:

import math

import pytest


def is_perfect_square(n: int) -> bool:
    """Determine if any int i exists such that i × i = n."""
    s = math.sqrt(n)
    return s == int(s)

@pytest.mark.xfail(
    reason="Bug #11493: Negative values not supported",
    strict=True
)
def test_negative():
    # When called with a negative value for n, this test raises ValueError!
    v = -4
    assert not is_perfect_square(w)

This test will XFAIL just fine before you fix the underlying code, but if you change the implementation to:

def is_perfect_square(n: int) -> bool:
    """Determine if any int i exists such that i × i = n."""
    if n < 0:
        return False  # No negative numbers are perfect squares

    s = math.sqrt(n)
    return s == int(s)

Your test suite should start failing, because you haven't removed the xfail decorator, but if you were to run the tests at this point, you'd find that they still pass! Eagle-eyed readers will notice that the last line of test_negative() makes an assertion about the undefined name w rather than v, so the test fails with a NameError before is_perfect_square is ever called. It turns out our test was failing for the wrong reason all along! You can mitigate this somewhat with the raises parameter, by specifying that the test must fail by raising ValueError:

@pytest.mark.xfail(strict=True, raises=ValueError)
def test_negative():
    # When called with a negative value for n, this test raises ValueError!
    # See bug #11493
    v = -4
    assert not is_perfect_square(w)

This would have solved our specific issue, but it's a fairly blunt instrument. It's easy to imagine a situation where an assertion fails for the wrong reason but still raises the expected error class (particularly if that error class is AssertionError). Still, a strict xfailing test is better than the alternative, which gives you no insight into whether your test was ever failing in the first place.
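As a contrived sketch of that failure mode, imagine a hypothetical perfect_squares helper built on top of is_perfect_square, with a known bug and an xfailing test whose expected value contains its own mistake:

def perfect_squares(limit: int) -> list:
    # Hypothetical helper with a known bug: it accidentally omits 0.
    return [n for n in range(1, limit) if is_perfect_square(n)]


@pytest.mark.xfail(strict=True, raises=AssertionError, reason="0 is missing from the results")
def test_perfect_squares_includes_zero():
    # The expected list below has a typo of its own (25 is not below 10),
    # so even after the missing-zero bug is fixed, the assertion keeps
    # failing with AssertionError and the strict xfail never complains.
    assert perfect_squares(10) == [0, 1, 4, 9, 25]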

Reasons to use xfail

As promised, we now come back to look at my three purported benefits of xfail in more detail.

They serve as acceptance criteria

Once you have decided that a feature should be implemented or a bug should be fixed, part of the design process is deciding when the ticket can be closed. As mentioned earlier, these acceptance criteria often take the form of minimal examples demonstrating where the current behavior falls short, along with suggestions for what the desired behavior should be, so why not simply add these to the test suite as automated acceptance criteria? Rather than try to explain to new contributors how to fix the bug, you can say, "When you remove the xfail marker on these tests and the test suite still passes, the bug is fixed!"

This also allows you to divide the work between people with specialized knowledge: a user can do an excellent job of designing the acceptance criteria without knowing anything about how to actually fix your library or application, while a maintainer or subject-matter expert may understand the codebase well enough to fix bugs without having much insight into how end users actually use the feature. Using xfail lets you accept the contribution of regression tests before the feature is implemented or the bug is fixed.

They provide evidence that your tests work

As we saw earlier, although it's not perfect, the fact that a test fails before the bug is fixed (possibly even constrained to raising the right kind of error) and stops failing after the bug is fixed is pretty reasonable evidence that the test will work well as a regression test. Since you aren't putting the xfail marker back on, you can think of it as working like a ratchet: once the test starts passing, it has to keep passing.

They serve as a sentinel for when you accidentally fix a bug

Particularly in larger codebases, not everyone keeps every known bug at the front of their mind, nor does everyone necessarily realize when a change to one behavior might fix an already-reported bug elsewhere. With strict xfail, any time you fix a known bug that has an xfailing test, you'll be notified immediately, because the test will start XPASSing. And the best part is that the only action you need to take is to remove the marker, and you immediately get a regression test for the bug you just fixed by accident!
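Even without strict mode, you can at least make XFAIL and XPASS outcomes visible by adding them to pytest's short test summary:

pytest -rxX

With strict mode on, of course, there is nothing to spot manually: an XPASS fails the test run on its own.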

I don't know that accidentally fixing a bug is an amazingly common experience, but I have definitely seen, more than once, tickets closed because the original bug report is no longer reproducible, or because the bug was fixed by some unrelated change or a dependency update. I would not be surprised if occasionally a bug is accidentally fixed, and then later changes accidentally break it again because no regression tests were ever added. Having the xfail there not only puts everyone on notice that a given bug has been fixed, it also prevents that sort of thing from happening.

Parting thoughts

Perhaps it is idealistic to imagine a future where every bug report comes with a well-crafted xfailing test suite that the maintainer immediately accepts, and every time you try to fix one bug you accidentally close out 3 other tickets, but I do feel that there is a strong case for using xfail in this way.

I often think of these things through the lens of open source development (you may have seen this bias in my assumptions earlier in the post), so I will also note that "I would be happy to accept an xfailing test for this" is a great way to signal to your users that you actually intend (or at least hope) to fix their bug.

Note

I have also covered similar material in a short 5-minute talk that I've given a few times. A recording of one version of the talk is available from PyGotham 2019.

Footnotes

[1]Brock is a co-maintainer of dateutil and, since the post in question was written, has become a maintainer of pandas as well.
[2]There is a case to be made for using pytest.mark.skip for resource-intensive tests that just happen to be failing at the moment, so that you don't bog down your test suite with expensive but known-failing tests. If you are concerned about this, I would be inclined to instead add a custom pytest.mark.heavy marker and configure your test runner to skip, by default, anything marked as both heavy and xfail (a minimal conftest.py sketch of this appears after these footnotes).
[3]One other situation where you may want to use skip instead of xfail is when it's very difficult to boil the failure condition down to a simple conditional. For example, you have a test that is flaky, or that fails on a very specific combination of platforms and Python versions, and it would be more trouble than it's worth to work out the exact conditions under which the test fails just to mark it as xfail. As an expedient, an unconditional skip with a comment explaining why would make sense here, though arguably there's a testing strategy whereby you automatically re-run flaky tests enough times that at least one failure is nearly guaranteed.
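For what it's worth, a minimal conftest.py sketch of the approach described in footnote [2], assuming a project-defined heavy marker, might look like this:

import pytest


def pytest_configure(config):
    # Register the custom marker so pytest doesn't warn about it.
    config.addinivalue_line("markers", "heavy: resource-intensive test")


def pytest_collection_modifyitems(config, items):
    # Skip anything that is both resource-intensive and currently expected
    # to fail, so routine runs aren't bogged down by expensive failures.
    skip_heavy_xfail = pytest.mark.skip(reason="heavy test that is currently expected to fail")
    for item in items:
        if item.get_closest_marker("heavy") and item.get_closest_marker("xfail"):
            item.add_marker(skip_heavy_xfail)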