Simon Willison’s Weblog

Subscribe

Finding and reporting an asyncio bug in Python 3.10

9th October 2021

I found a bug in Python 3.10 today! Some notes on how I found it and my process for handling it once I figured out what was going on.

Testing Datasette against Python 3.10

I finally got around to attempting to upgrade Datasette to the just-released Python 3.10 this morning. I started by adding "3.10" to the matrix of Python versions that Datasette is tested against by GitHub Actions:

strategy:
  matrix:
    python-version: ["3.6", "3.7", "3.8", "3.9", "3.10"]

That line previously looked like this:

python-version: [3.6, 3.7, 3.8, 3.9]

I switched to using quoted strings here based on a tip from Jeff Triplett, who pointed out that 3.10 is actually a floating point number in YAML syntax which will be treated as referring to Python 3.1 instead!

This one-line change in a pull request was all it took to run the tests against the new version of Python... and they failed, spectacularly.

Annoyingly the root error wasn’t displayed directly in the tests, because it was triggered and then caught by Datasette’s own error handling and turned into a 500 HTTP error. So, I needed to run Python 3.10 locally and it the debugger.

Running Python 3.10 locally

I occasionally use pyenv to manage multiple Python versions on my machine. I have this installed through Homebrew.

I had to run brew upgrade pyenv first, because my installed version didn’t know about Python 3.10. Then I ran this:

pyenv install 3.10.0

Which gave me a Python 3.10 installation on my Mac in ~/.pyenv/versions/3.10.0/bin/python.

Finally, I created a new virtual environment using this version of Python.

~/.pyenv/versions/3.10.0/bin/python \
  -m venv /tmp/py310
# Now activate that virtual environment:
source /tmp/py310/bin/activate

I put this in my /tmp directory so I don’t have to remember to clean it up later.

Having done this, I installed my Datasette test dependencies into the new environment and used pytest to run my tests:

% cd ~/Dropbox/Development/datasette
% pip install -e '.[test]'

Running pytest -x --pdb stopped at the first failing test and dropped me into a debugger where I could finally access the full traceback, which looked something like this:

  File "datasette/datasette/views/base.py", line 122, in dispatch_request
    await self.ds.refresh_schemas()
  File "datasette/datasette/app.py", line 344, in refresh_schemas
    await self._refresh_schemas()
  File "datasette/datasette/app.py", line 349, in _refresh_schemas
    await init_internal_db(internal_db)
  File "datasette/datasette/utils/internal_db.py", line 5, in init_internal_db
    await db.execute_write(
  File "datasette/datasette/database.py", line 102, in execute_write
    return await self.execute_write_fn(_inner, block=block)
  File "datasette/datasette/database.py", line 113, in execute_write_fn
    reply_queue = janus.Queue()
  File "janus/__init__.py", line 39, in __init__
    self._async_not_empty = asyncio.Condition(self._async_mutex)
  File "lib/python3.10/asyncio/locks.py", line 234, in __init__
    raise ValueError("loop argument must agree with lock")
ValueError: loop argument must agree with lock

So, something is going long with asyncio.Condition(), which is being called by janus.Queue().

Some background on Janus

The Janus library referenced here describes itself as a “Thread-safe asyncio-aware queue for Python”.

I use it in Datasette for the write queue—a mechanism where writes can be safely made to a SQLite database by first queueing them up in memory.

Janus provides a mechanism that lets asyncio event tasks send and receive messages to Python threads, and vice-versa, via a queue class that is modelled on Python’s own standard library queues. It’s really neat.

Tracking the investigation in issue comments

Any time I’m investigating a bug I make sure that there’s an associated GitHub issue or pull request, to give me somewhere to build up detailed notes of my investigation.

I used my PR #1481 for this. Now that I had an error message—“loop argument must agree with lock”—I did some initial Google searches and found to my surprise that it had very few existing mentions—my first clue that this might be related to the new Python 3.10 release itself.

I tracked down the relevant source code in both Python and the Janus library based on the traceback and linked to them from the issue comments. I pasted in copies of the relevant code too, since GitHub only magically embeds code from the same repository, not code from other repositories. You can see those comments here.

I always like to include copies of the code most likely to be the cause of a bug in an issue comment, to save myself from having to dig around for it again later on and to act as a historical record once the bug has been fixed.

I also started an issue in the Janus repo, with the title Error with Python 3.10: “ValueError: loop argument must agree with lock”. I linked this to my own issue and posted relevant research in both places.

The two lines of code in Janus that caused the bug where these ones (note my added comment):

self._async_mutex = asyncio.Lock()
# "loop argument must agree with lock" exception is raised here:
self._async_not_empty = asyncio.Condition(self._async_mutex)

This is where I began to suspect a Python bug. The above code can be simplified to this:

asyncio.Condition(asyncio.Lock())

This is a documented way of using conditions in Python: you can instantiate conditions with an optional lock, which allows that lock to be shared by more than one condition.

So I tried that in the Python 3.10 interpreter... and it didn’t throw the error:

% ~/.pyenv/versions/3.10.0/bin/python
Python 3.10.0 (default, Oct  7 2021, 13:45:58) [Clang 12.0.0 (clang-1200.0.32.29)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import asyncio
>>> print(asyncio.Condition(asyncio.Lock()))
<asyncio.locks.Condition object at 0x1062521d0 [unlocked]>

From browsing the source code I could see that there was something going on here involving the event loop—so on a hunch I tried running that code in an event loop instead, like so:

>>> async def example():
...     print(asyncio.Condition(asyncio.Lock()))
... 
>>> asyncio.run(example())
Traceback (most recent call last):
  ...
  File "<stdin>", line 2, in example
  File "/Users/simon/.pyenv/versions/3.10.0/lib/python3.10/asyncio/locks.py", line 234, in __init__
    raise ValueError("loop argument must agree with lock")
ValueError: loop argument must agree with lock

There’s the exception!

So it looked like this might be a genuine Python 3.10 bug. After all, I knew that this code worked in prior versions of Python (since Janus and Datasette worked fine there). In fact I could confirm it like so:

~ % python3.9
Python 3.9.7 (default, Sep  3 2021, 12:45:31) 
[Clang 12.0.0 (clang-1200.0.32.29)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import asyncio
>>> async def example():
...     print(asyncio.Condition(asyncio.Lock()))
... 
>>> asyncio.run(example())
<asyncio.locks.Condition object at 0x1049aab50 [unlocked]>

Reporting the bug to the Python issue tracker

Python bugs are tracked on bugs.python.org—thankfully they support sign in with GitHub (and Google and even OpenID) so I signed in with my GitHub account to report the bug.

I filed Issue 45416 entitled:

“loop argument must agree with lock” instantiating asyncio.Condition

And included a brief description with my minimal Python script that demonstrated the bug.

Then I added a comment pointing back to one of the related GitHub issues where I had been figuring this out, and another comment that linked to the most relevant test in Python’s own test suite, which I found using GitHub search for the term Condition.

Any time I file an issue against a project—my own or someone else’s—I always like to do a bit of extra research and link to both the likely code at fault and the likely test.

As a maintainer, this saves me a bunch of time—if the issue links straight to the code and tests I don’t have to spend any time digging around looking for them.

My hope is that this saves time for the people I am reporting the bug to as well, which increases the chance that they’ll dig into it and help me find a fix!

Then I tweeted about it. I have a bunch of followers with relevant experience, so I thought it was time to get a few more eyes on the problem.

I included a screenshot of my steps-to-reproduce script to save anyone who might be casually interested from having to click through to the issue itself.

Łukasz Langa to the rescue

Łukasz is the “CPython Developer in Residence”, which means he is paid by the Python Software Foundation to work on Python full-time.

Less than an hour after I filed the issue he posted a comment that both confirmed the bug and provided two short-term workarounds for it!

I really liked his first workaround, which looks like this:

>>> l = asyncio.Lock()
>>> getattr(l, '_get_loop', lambda: None)()
<_UnixSelectorEventLoop running=True closed=False debug=False>

It works by calling the lock._get_loop() method, but only if it exists—so even though that’s an undocumented internal method this should be safe even when run against future Python versions that might remove the method.

You can see that method here—it safely populates the self._loop property, which helps work around the bug.

Submitting a PR to Janus

The best place to apply this workaround is to Janus—so I submitted a PR there which adds the workaround and updates their CI configuration to test against Python 3.10: Janus PR #359. The GitHub Actions tests for that PR are now passing against Python 3.10.

Their README links to a gitter chat so I dropped a link to my PR in there too. Hopefully that can be merged soon!

What I should have done

I’m feeling pretty good about this situation—the bug is set to be fixed, we have a great workaround thanks to Łukasz and I’m optimistic that the upstream library from Datasette will land the fix soon.

If only I’d done this a few weeks earlier!

The Python 3.10 release schedule shipped the first alpha a year ago, in October 2020. It totalled 7 alphas, 4 betas and 2 release candidates before the final 3.10 release on Monday 4th October 2021.

If I’d taken the time to get Datasette’s test suite running against Python 3.10 for any of those pre-releases, I could have helped spot this bug before it landed in the final release!

So lesson learned: Python has alphas and betas and RCs for a reason. For future Pythons I’m going to pay attention to testing against these for my projects before the final release lands.