Unravelling generator expressions

In this post on Python's syntactic sugar, I want to try to tackle generator expressions. If you look at the language definition for generator expressions you will see that it says, "[a] generator expression yields a new generator object" for what is specified (which is essentially a compact for loop with an expression for the body). So what does that look like if you take away the Python "magic" and unravel it down to its core Python semantics?

The bytecode

Let's take the following example:

def spam():
    return (b for b in a)
Example generator expression

The bytecode for this is:

  1           0 LOAD_CONST               1 (<code object <genexpr> at 0x10076b500, file "<stdin>", line 1>)
              2 LOAD_CONST               2 ('spam.<locals>.<genexpr>')
              4 MAKE_FUNCTION            0
              6 LOAD_GLOBAL              0 (a)
              8 GET_ITER
             10 CALL_FUNCTION            1
             12 RETURN_VALUE

Disassembly of <code object <genexpr> at 0x10076b500, file "<stdin>", line 1>:
  1           0 LOAD_FAST                0 (.0)
        >>    2 FOR_ITER                10 (to 14)
              4 STORE_FAST               1 (b)
              6 LOAD_FAST                1 (b)
              8 YIELD_VALUE
             10 POP_TOP
             12 JUMP_ABSOLUTE            2
        >>   14 LOAD_CONST               0 (None)
             16 RETURN_VALUE
Bytecode for the example generator expression

You may notice a couple of things that are interesting about this:

  1. The generator expression is very much just a for loop in a generator.
  2. The generator expression is stored as a constant in the function.
  3. a gets explicitly passed into the generator expression.
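
A listing like the one above can be reproduced with the standard library's dis module. The following is a minimal sketch; the exact opcodes and offsets differ between CPython versions, so the output will probably not match the listing above byte for byte. The helper name find_genexpr_code is made up for illustration.

```python
import dis
import types


def spam():
    return (b for b in a)


def find_genexpr_code(func):
    """Return the nested <genexpr> code object stored in func's constants."""
    for const in func.__code__.co_consts:
        if isinstance(const, types.CodeType) and const.co_name == "<genexpr>":
            return const
    return None


genexpr_code = find_genexpr_code(spam)

# The generator expression's only argument is the implicit ".0" seen in the
# listing: the iterator over the leftmost iterable.
print(genexpr_code.co_varnames)

# dis.dis() recurses into nested code objects, so this prints both listings.
dis.dis(spam)
```

Note that spam() is never called here, so the undefined global a is never looked up.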

The semantics

The explicit passing of a is the surprising bit in how generator expressions work, but it actually makes sense when you read the explanation as to why this occurs:

... the iterable expression in the leftmost for clause is immediately evaluated, so that an error produced by it will be emitted at the point where the generator expression is defined, rather than at the point where the first value is retrieved. Subsequent for clauses and any filter condition in the leftmost for clause cannot be evaluated in the enclosing scope as they may depend on the values obtained from the leftmost iterable.

So by passing a in, the code for a is evaluated at the time of creation of the generator expression, not at the time of execution. That way, if there's an error in that part of the code, the traceback points to where the generator expression was defined rather than to wherever it happened to be run. Since subsequent for clauses in the generator expression may rely on the loop variable from the first clause, you can't eagerly evaluate any other part of the expression.
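
The difference in timing can be observed directly. The sketch below uses 1 / 0 as a deliberately failing iterable expression; the function names and the events list are only there for illustration.

```python
events = []


def leftmost_fails():
    # 1 / 0 is the leftmost iterable, so it is evaluated as soon as the
    # generator expression is created.
    return (x for x in 1 / 0)


def later_clause_fails():
    # Here 1 / 0 sits in the second for clause, so it is only evaluated
    # once the generator starts running.
    return (x for x in [1] for y in 1 / 0)


try:
    leftmost_fails()
except ZeroDivisionError:
    events.append("raised at definition")

gen = later_clause_fails()  # no exception yet
try:
    next(gen)
except ZeroDivisionError:
    events.append("raised at first next()")

print(events)
```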

The unravelling

There are a couple of details that are required to make unravelling a generator expression successful, so I'm going to build up a running example to cover all the various cases.

With only one for loop

Let's start with (c for b in a) where c can be some expression. To unravel this we need to make a generator which takes in a as an argument to guarantee it is eagerly evaluated where the generator expression is defined.

def _gen_exp(_leftmost_iter):
    for b in _leftmost_iter:
        yield c
        
_gen_exp(iter(a))
Unravelling (c for b in a)

We end up with a generator function which takes a single argument for the leftmost iterator. We call iter() outside of the generator to control for scoping. We should also technically unravel the for loop, but for readability I'm going to leave it in place. Now let's see what this looks like in some code that would use the generator expression:

def spam(a, b):
    func(arg=(str(b) for b in a))
Example of using a generator expression

This would then unravel to:

def spam(a, b):
    def _gen_exp(_leftmost_iterator):
        for b in _leftmost_iterator:
            yield str(b)
    
    func(arg=_gen_exp(iter(a)))
Unravelling the generator expression usage example
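
As a quick sanity check, the two versions can be run side by side. In this sketch func is a stand-in that simply builds a list, and both functions are changed to return the result along with the parameter b, so it is visible that the loop variable b does not clobber it. The names spam_sugar and spam_unravelled are hypothetical.

```python
def func(arg):
    # Stand-in for whatever consumes the generator; here it just builds a list.
    return list(arg)


def spam_sugar(a, b):
    return func(arg=(str(b) for b in a)), b


def spam_unravelled(a, b):
    def _gen_exp(_leftmost_iterator):
        # This b is local to _gen_exp, so the parameter b is left alone.
        for b in _leftmost_iterator:
            yield str(b)

    return func(arg=_gen_exp(iter(a))), b


print(spam_sugar([1, 2], "outer"))       # (['1', '2'], 'outer')
print(spam_unravelled([1, 2], "outer"))  # (['1', '2'], 'outer')
```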

With multiple for loops

Now let's toss in another for loop: (e for b in a for d in c). This unravels to:

def _gen_expr(_leftmost_iterator):
    for b in _leftmost_iterator:
        for d in c:
            yield e
(e for b in a for d in c) unravelled

Since only the leftmost iterable is evaluated eagerly we can rely on the scoping rules for closures to get all of the other variables from the call site implicitly (this is where Python's simple namespace system comes in handy).

Putting this into an example like:

def spam():
    x = range(2)
    y = range(3)
    return ((a, b) for a in x for b in y)
Example using multiple for loops in a generator expression

leads to an unravelling of:

def spam():
    x = range(2)
    y = range(3)
    
    def _gen_exp(_leftmost_iterable):
        for a in _leftmost_iterable:
            for b in y:
                yield (a, b)
                
    return _gen_exp(iter(x))
Unravelling of a generator expression with multiple for loops

The generator expression needs x passed in because it's the leftmost iterable, but everything else is captured by the closure.
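
One way to see the difference between the two is to rebind x and y after the generator has been created but before it is consumed. This is only a sketch with hypothetical function names: rebinding x should have no effect, while rebinding y should, in both the original and the unravelled form.

```python
def spam_sugar():
    x = range(2)
    y = range(3)
    gen = ((a, b) for a in x for b in y)
    x = range(100)  # ignored: iter(x) was already taken when gen was created
    y = range(1)    # honoured: y is only looked up once the loops start running
    return list(gen)


def spam_unravelled():
    x = range(2)
    y = range(3)

    def _gen_exp(_leftmost_iterable):
        for a in _leftmost_iterable:
            for b in y:  # y comes from the closure, looked up when this runs
                yield (a, b)

    gen = _gen_exp(iter(x))
    x = range(100)
    y = range(1)
    return list(gen)


print(spam_sugar())       # [(0, 0), (1, 0)]
print(spam_unravelled())  # [(0, 0), (1, 0)]
```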

Assignment expressions

Let's make life complicated and throw in an assignment expression:

def spam():
    list(b := a for a in range(2))
    return b
Example of a generator expression with an assignment expression

The bytecode for this becomes:

  2           0 LOAD_GLOBAL              0 (list)
              2 LOAD_CLOSURE             0 (b)
              4 BUILD_TUPLE              1
              6 LOAD_CONST               1 (<code object <genexpr> at 0x1008393a0, file "<stdin>", line 2>)
              8 LOAD_CONST               2 ('spam.<locals>.<genexpr>')
             10 MAKE_FUNCTION            8 (closure)
             12 LOAD_GLOBAL              1 (range)
             14 LOAD_CONST               3 (2)
             16 CALL_FUNCTION            1
             18 GET_ITER
             20 CALL_FUNCTION            1
             22 CALL_FUNCTION            1
             24 POP_TOP

  3          26 LOAD_DEREF               0 (b)
             28 RETURN_VALUE

Disassembly of <code object <genexpr> at 0x1008393a0, file "<stdin>", line 2>:
  2           0 LOAD_FAST                0 (.0)
        >>    2 FOR_ITER                14 (to 18)
              4 STORE_FAST               1 (a)
              6 LOAD_FAST                1 (a)
              8 DUP_TOP
             10 STORE_DEREF              0 (b)
             12 YIELD_VALUE
             14 POP_TOP
             16 JUMP_ABSOLUTE            2
        >>   18 LOAD_CONST               0 (None)
             20 RETURN_VALUE
Bytecode for example of generator expression with an assignment expression

The key thing to notice is the various *_DEREF opcodes which are what CPython uses to load/store nonlocal variables.

Now we could just add a nonlocal statement to our unravelled generator expression and assume we are done, but there is one issue to watch out for: is the variable bound anywhere in the enclosing scope? If the compiler finds no binding for the variable in an enclosing function scope when it analyses the scope containing the nonlocal, Python will raise an exception: SyntaxError: no binding for nonlocal 'b' found.
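
The failure happens at compile time, which can be checked without running the function at all. This sketch compiles a naive unravelling in which spam never binds b; the source string and the error_message variable are only for illustration.

```python
source = """
def spam():
    def _gen_expr(_leftmost_iterable):
        nonlocal b
        for a in _leftmost_iterable:
            yield (b := a)

    list(_gen_expr(iter(range(2))))
    return b
"""

try:
    compile(source, "<example>", "exec")
except SyntaxError as exc:
    error_message = exc.msg
else:
    error_message = None

print(error_message)
```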

Python gets to take a shortcut when it comes to a generator expression with an assignment expression and simply considers the nonlocal as implicit without regard to whether the variable was already defined. But we don't get to cheat, and that means we may have to define the variable with a dummy value to make the CPython compiler happy.

But we also have to deal with the cases where the generator expression is never run, or runs but never sets b (i.e. the iterable yields no items). In the example that would raise UnboundLocalError: local variable 'b' referenced before assignment. To replicate that we need to delete b if it never gets set.
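
To make the empty case concrete, here is a small sketch in which the example is changed to take the iterable as a parameter (an assumption for illustration; the original hard-codes range(2)). The exact wording of the exception message varies between Python versions, so only the exception type is checked.

```python
def spam(iterable):
    list((b := a) for a in iterable)
    return b


print(spam(range(2)))  # 1, the last value assigned to b

try:
    spam([])
except UnboundLocalError:
    # The loop body never ran, so b was never bound in spam().
    print("b was never bound")
```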

What all of this means is our example unravels to:

def spam():
    b = _PLACEHOLDER
    
    def _gen_expr(_leftmost_iterable):
        nonlocal b
        for a in _leftmost_iterable:
            yield (b := a)
            
    list(_gen_expr(iter(range(2))))
    if b is _PLACEHOLDER:
        del b
    return b
Unravelling of generator expression example with an assignment expression

But remember, we only want to do any of this nonlocal work if there are assignment expressions to worry about.

The best laid plans ...

I actually wrote this entire post thinking I had solved the unravelling of generator expressions, and then I realized assignment expressions thwarted me in another way. Consider the following example:

def spam():
    return ((b := x for x in range(5)), b)
Example where the result of an assignment expression is relied upon in the same statement

If you run that example you end up with UnboundLocalError: local variable 'b' referenced before assignment. Now let's unravel this:

def spam():
    b = _PLACEHOLDER
    def _gen_expr(_leftmost_iterable):
        nonlocal b
        for x in _leftmost_iterable:
            yield (b := x)
            
    return _gen_expr(iter(range(5))), b
Unravelling of the assignment expression reliance example

Unfortunately calling this function succeeds. And since del is a statement there's no way to insert ourselves into that expression to prevent b from being resolved. But luckily Guido told me of a trick to still get the UnboundLocalError: put the initial assignment inside an if False block. If we do if False: b = _PLACEHOLDER it essentially tricks the compiler into doing what we want: b is treated as a local variable of the function, so UnboundLocalError is raised when the variable isn't set, while nonlocal still allows for assigning to the variable.

def spam():
    if False:
        b = _PLACEHOLDER
    def _gen_expr(_leftmost_iterable):
        nonlocal b
        for x in _leftmost_iterable:
            yield (b := x)
            
    return _gen_expr(iter(range(5))), b
Triggering UnboundLocalError
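
The three variants can be compared in one runnable sketch. Here _PLACEHOLDER is assumed to be a plain sentinel object, and the function names and the outcome helper are hypothetical; the assignment expression is parenthesized only for readability.

```python
_PLACEHOLDER = object()  # assumed sentinel; any unique object would do


def spam_sugar():
    return (((b := x) for x in range(5)), b)


def spam_placeholder():
    b = _PLACEHOLDER  # b is really bound, so reading it below succeeds

    def _gen_expr(_leftmost_iterable):
        nonlocal b
        for x in _leftmost_iterable:
            yield (b := x)

    return _gen_expr(iter(range(5))), b


def spam_if_false():
    if False:
        b = _PLACEHOLDER  # never executed; only marks b as local to this function

    def _gen_expr(_leftmost_iterable):
        nonlocal b
        for x in _leftmost_iterable:
            yield (b := x)

    return _gen_expr(iter(range(5))), b


def outcome(func):
    """Report whether calling func returned or raised UnboundLocalError."""
    try:
        func()
    except UnboundLocalError:
        return "UnboundLocalError"
    return "returned"


outcomes = {f.__name__: outcome(f) for f in (spam_sugar, spam_placeholder, spam_if_false)}
print(outcomes)
```

Only the if False variant matches the behaviour of the original generator expression.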

Taking our earlier unravelling that was successful, it changes into the following:

def spam():
    if False:
        b = _PLACEHOLDER
    
    def _gen_expr(_leftmost_iterable):
        nonlocal b
        for a in _leftmost_iterable:
            yield (b := a)
            
    list(_gen_expr(iter(range(2))))
    if b is _PLACEHOLDER:
        del b
    return b
Using if False with our earlier assignment expression example
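
To check that this form behaves like the original for empty and non-empty input alike, the sketch below takes the iterable as a parameter (an assumption for illustration). It leaves out the placeholder check, since the assignment under if False never runs, and it uses None in place of _PLACEHOLDER. The names spam_sugar, spam_unravelled and outcome are hypothetical.

```python
def spam_sugar(iterable):
    list((b := a) for a in iterable)
    return b


def spam_unravelled(iterable):
    if False:
        b = None  # never runs; it only tells the compiler that b is local here

    def _gen_expr(_leftmost_iterable):
        nonlocal b
        for a in _leftmost_iterable:
            yield (b := a)

    list(_gen_expr(iter(iterable)))
    return b


def outcome(func, iterable):
    """Return func's result, or the string 'UnboundLocalError' if it raised."""
    try:
        return func(iterable)
    except UnboundLocalError:
        return "UnboundLocalError"


results = [
    (outcome(spam_sugar, data), outcome(spam_unravelled, data))
    for data in ([], [10], range(2))
]
print(results)
```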

Aside: what came first, the expression or the comprehension?

If you have not been programming in Python for more than 15 years you may think generator expressions came first, then list comprehensions. But actually it's the other way around: list comprehensions were introduced in Python 2.0 and generator expressions came in Python 2.4. This is because generators were introduced in Python 2.2 (thanks to inspiration from Icon), and so the possibility of even having generator expressions didn't exist when list comprehensions came into existence (thanks to inspiration from Haskell).

Acknowledgements

Thanks to Guido for pointing out the if False trick to make unexecuted assignment expressions work. This was not part of the original post and so I initially thought the unravelling had failed.

Thanks to Serhiy Storchaka for pointing out that the scoping was off if you didn't call iter() outside the generator and you needed to unravel the for loops.