|
|
Subscribe / Log in / New account

An overview of structural pattern matching for Python

Did you know...?

LWN.net is a subscriber-supported publication; we rely on subscribers to keep the entire operation going. Please help out by buying a subscription and keeping LWN on the net.

By Jake Edge
May 3, 2022
PyCon

Python's match statement, which provides a long-sought C-like switch statement—though it is far more than that—has now been part of the language for more than six months. One of the authors of the series of Python Enhancement Proposals (PEPs) that described the feature, Brandt Bucher, came to PyCon 2022 in Salt Lake City, Utah to talk about the feature. He gave an overview of its history, some of its many-faceted abilities, a bit about how it was implemented, and some thoughts on its future, in a presentation on April 29, which was the first day of talks for the conference.

Bucher said that he was studying computer engineering when he encountered Python, which made him realize that he liked developing software a lot more than he liked designing hardware. He got involved in Python core development during that time and has been a core developer for the language for nearly two years at this point. He is now working for Microsoft on the Faster CPython team, though his biggest project to date has been the work on shepherding and implementing structural pattern matching (i.e. match), much of which was done while he was working for Research Affiliates.

History

The genesis of the PEP was a "nerd sniping email on a Wednesday during COVID" from Guido van Rossum to Bucher to see if he wanted to be part of the effort to try to add a match statement to the language. Bucher had been looking for a project, so he was definitely interested and joined the team of six authors working on what became PEP 622 ("Structural Pattern Matching"). It was, Bucher said, "a monster": a huge PEP that was not written with a clear audience in mind, which contained lots of extra bells and whistles that bloated the PEP (and feature).

The Python steering council diligently read the PEP, he said, and asked the authors to rewrite it so that the council could render a judgment on the feature. That resulted in three new PEPs to fully describe it:

  • PEP 634: "Structural Pattern Matching: Specification"
  • PEP 635: "Structural Pattern Matching: Motivation and Rationale"
  • PEP 636: "Structural Pattern Matching: Tutorial"

Making that change improved the PEP, by splitting it into three pieces, each of which was aimed at a different audience. Bucher said that while PEP 634 is important, he hoped that the only people who read it were those who were implementing the feature. The other two are less dry and more approachable, he said; PEP 635 was specifically targeted at making the case to the council and other core developers. PEP 636 is one that everyone should read; it shows how to use match by walking through an example of a text-based adventure game.

The PEP authors used a dedicated GitHub repository for collaborating on the feature. That repository still exists and it makes a good record of the project going forward, he said. The decisions that were made and the reasons behind them are contained in the pull requests, issues, and so forth. During the process, several medium-sized applications using match were developed; as the feature changed during the process, that allowed the team to judge if a change truly improved the feature or not.

Not a switch statement

Bucher stressed, as he did multiple times in the discussions along the way, that the feature is not a switch statement. It looks a lot like a switch statement and can behave like one, but there is a lot more to it. By seeing it as only that, much of the power of structural pattern matching is lost. His goal in the talk was to convince attendees that the match feature is one that has far more uses and that it deserves a place in their code.

[Brandt Bucher talk]

Structural pattern matching combines two features that Python developers are already familiar with: control flow and destructuring. Control flow is "just branching; we do this all the time". He showed two simple if-elif-else code snippets, one of which looked at the value of a particular object and the other that looked at the "shape" of the object by looking at its length. The control flow was changed based on tests of those two characteristics of the object.

Destructuring is simply pulling data out of an object, which can be done in various ways. One of his tests was accessing meal[0], which uses sequence unpacking, but there are other destructuring operations, such as extracting the key-value pairs from a dictionary or accessing arbitrary attributes on a Python object.

So structural pattern matching is allowing the programmer to branch while they destructure or destructure while they branch. The feature allows a "very powerful declarative programming style that creates a new way of viewing a lot of the same problems that we have been struggling with for a very long time". As he pointed out, though, he was roughly one-third of the way through the talk and had not actually shown a match statement, so he was about to change that.

He put up a slide with a simple example of the feature:

    match meal:
        case entrée, side:
	    ...
The syntax is to have match followed by an expression to evaluate. Then there can be one or more case statements, each of which has a pattern following the word case He wanted to emphasize that "entrée, side" is a pattern and not an expression. Instead of telling Python to create a tuple of two elements, the code is saying how it wants the language to pull apart a sequence of length two should it encounter one in meal. It is sort of like the pattern is the left-hand side of an assignment statement, Bucher said.

To clarify what the code is doing, he put up the equivalent code from Python 3.9:

    if isinstance(meal, Sequence) and len(meal) == 2:
	entrée, side = meal
	...
As with the match version, the code matches a sequence of length two and pulls out the two elements of the sequence into entrée and side. The match pattern used is a "sequence pattern" that will match any sequence of, in this instance, two elements. Inside the sequence pattern are two "capture patterns" that specify the variable names to destructure a matching sequence into. Sequence patterns can use either parentheses or square brackets if desired, so the following two patterns are equivalent to the original one above:
    case (entrée, side):
    case [entrée, side]:

The "_" is used as a wildcard pattern, which is like a capture pattern, in that it will match anything, but it does not bind any value, unlike a capture pattern. So, a pattern of "[_, side]" would match a two-element sequence but only bind side as if only meal[1] were extracted in the equivalent Python 3.9 code. A common idiom is to use "case _:" as the last case entry to serve as a catch-all. Each case is tried in turn; if none of the previous ones match, the catch-all will match and can, for example, raise an exception.

The other pattern type that Bucher described was value patterns, which just look like a literal value:

    case ["Spam", side]:
	...
    # equivalent to:
    if isinstance(meal, Sequence) and len(meal) == 2 and meal[0] == "Spam":
	side = meal[1]
	...
Beyond just strings, value patterns support byte strings, integers, floats, complex numbers, booleans, and None; value patterns effectively just use the equality operator. There are other pattern types, but he wanted to focus on these in the talk; he suggested reading PEP 636 to find out more. He did quickly show a few more examples, without dwelling on them, just to give attendees a taste of the other matching abilities:
    case ["Spam" | "eggs", side]:
    case ["Spam", side] if not self.has_tried(side):
    case {"entrée": "Spam", "side": side}:
    case {"meal": ["Spam", side]}:
    case Meal(Food("Spam"), Food(side)):
The last one is a class pattern, which he finds to be the most exciting pattern type. The example will match a Meal object that contains two Food objects; it is not actually creating any instances of those classes, simply instructing the language how to pull one apart if it encounters it. The case compiles to a bunch of isinstance() checks and attribute accesses.

Class patterns are what led to his "big 'a-ha' moment" with the feature. He was writing code for a red-black tree, which is a self-balancing binary tree; it is a data structure that he never fully understood, even though he had implemented one before. Implementing the operations for the tree (e.g. insertion, rebalancing, deletion, etc.) using class patterns "really really clarified how the logic worked" because it allowed him to see "the shape of the data" and the pieces he was working with as they were being rearranged.

"We didn't invent really any of this", Bucher said. Structural pattern matching has been around, mostly in functional programming languages, for more than 50 years. He showed a short function to recursively calculate the factorial of a number using match:

    def f(n: int) -> int:
        match n:
             case 0 | 1:
                 return 1
             case _:
                 return n * f(n - 1)
He showed the equivalent programs in Rust and Scala, using their match features, which look quite similar the one above. The PEP authors definitely looked at the other languages with an eye to making Python programmers able to read (and, perhaps write) structural pattern matching code for other languages—and vice versa.

Implementation

He turned to the implementation of the feature, which he said took him around nine months, all of it during COVID lockdown. The match feature was able to take advantage of a new capability that comes with the PEG parser that is now used for CPython. The "match" and "case" strings are "soft keywords"; the parser is able to recognize when they are being used as keywords versus when they are used as regular identifiers. He cannot take credit for that part of the feature as Van Rossum implemented it as one of the first things that was done.

Soft keywords allowed the feature to be added without interfering with existing code that uses those names as identifiers, which is likely in plenty of code in the ecosystem. So they were able to add a big new feature, with a lot of surface area, that completely maintained backward compatibility. He showed an example of converting some code that uses those names, which nicely demonstrates that:

    ...
    match = re.match(PATTERN, "...")
    if match is not None:
	case, status = match
	if status == "closed":
	    ...

    # could become

    ...
    match = re.match(PATTERN, "...")
    match match:
	case case, "closed":
	    ...
It looks kind of odd, but it works, he said.

He described a bit about the structural pattern matching compiler, which he likes to call the "SPaM compiler" for brevity's sake. He showed and compared the bytecode output for both versions of the first example above, starting with the Python 3.9 version. Even though both versions do the same thing, the version using match was noticeably shorter because it takes advantage of a new opcode (MATCH_SEQUENCE). That opcode was implemented by Mark Shannon and it does the same isinstance() check as the Python 3.9 code, but effectively does so as quickly as is possible by checking a flag on the CPython object and it is implemented in the CPython runtime. The bytecode also uses a new GET_LEN opcode, rather than pushing the name "len" onto the stack and then calling it; it just directly checks the length of the sequence in the CPython object.

That example shows the advantage of having dedicated syntax in the language for the match feature, Bucher said. The Python compiler has a lot of extra knowledge about exactly what you are trying to do within each case, so it can generate more efficient code. "This is the sort of thing that you only get from native syntax"; other techniques like transpiling to Python or using a Python package to achieve a similar effect cannot be as fast.

Future plans

He finished the talk by looking at some features that he is working on for the future; he has mostly working code for them, but they are not ready to be added quite yet. If he finds the time to finish them up, they may become part of Python 3.12 (which is due in October 2023). "The future is faster."

One area that he is "really excited about" is improved control flow. He gave an example:

    match meal:
	case ["Spam", side]:
	    print("Yay, Spam!")
	case ["eggs", side]:
	    print("Oh, eggs?")
	case [_, side]:
	    print("Hm, something else.")
He noted that the existing implementation steps through the cases in a fairly simplistic way. In particular, it checks whether meal is a sequence three times, checks the length of the meal sequence three times, and assigns side three times, but none of those is really necessary to do more than once. So the new code will recognize that it only needs to check if meal is a sequence with a length of two and, if so, assign meal[1] to side and then branch based on the value of meal[0].

This optimization uses a technique known as decision trees, which has a lot prior art and research done on it for languages like OCaml. It allows merging overlapping checks for adjacent patterns. "It is not too difficult to do; Python throws a couple of curve balls your way that make it non-trivial", but he has a 98% working version of it. It can apply to many different kinds of patterns beyond just the one he showed.

Something that falls out of the decision-tree work is improved reachability checks. If a particular case cannot be reached because what it would match is already matched by previous entries, CPython can detect that and give the developer a warning. For example:

    for number in range(100):
	match number % 5, number % 3:
	    case _, 0: print("Spam!")
	    case 0, _: print("Eggs?")
	    case 0, 0: print("Spam and eggs.")
	    case _, _: print(number)
With that code, the third case is not reachable because one of the two before it will catch that situation, but it may not be obvious when looking at the code. It is difficult to determine that with the current CPython compiler, but it is trivial to detect using the decision-tree analysis and to give a warning. The solution in this case is simple: just move the third case up to the top.

That completed his talk, but he could not resist a joke at the end: "I heard there was great hiking in Salt Lake City but I did not realize it was inside the convention center." That was met with laughter and applause from the large, standing-room-only crowd of attendees. It was indeed something like a ten-minute walk to most of the track rooms from the keynotes and expo floor.

But, after two years of virtual PyCons, everyone seemed happy to return to something approaching normalcy. The vaccination and mask requirements helped to give attendees some level of comfort with maintaining their health at the event as well. Overall, this year's return to an in-person PyCon seemed like it was a rousing success.

[I would like to thank LWN subscribers for supporting my trip to Salt Lake City for PyCon.]


Index entries for this article
ConferencePyCon/2022
Pythonmatch statement
PythonPython Enhancement Proposals (PEP)/PEP 634


(Log in to post comments)

An overview of structural pattern matching for Python

Posted May 4, 2022 0:41 UTC (Wed) by tialaramex (subscriber, #21167) [Link]

Structural pattern matching is really nice.

I wondered why they went with a statement, because expression match feels nicer to me. But PEP 635 says they explicitly considered this question and decided against an expression, and they know much more about Python and about Python programmers than I ever will.

However it is painful to see, also in PEP 635, how problematic new keywords are for Python. If the language wants to go around adding keywords (and it seems it does based on this) it needs to find a way to make that less painful, it got lucky for match but it can't count on always getting lucky.

An overview of structural pattern matching for Python

Posted May 4, 2022 6:56 UTC (Wed) by epa (subscriber, #39769) [Link]

I don’t think there is any language that can easily add new keywords. You always end up with ugliness like _bool. Perl started out with an explicit & prefix for subroutine calls but soon dropped it because everyone prefers to see a bare word, even though it will make it hard to add new builtin functions or keywords later.

All in all, I favour the approach of adding the new keyword in a new version of the language, and let everyone change their code where it clashes, and if you don’t want to do that then don’t upgrade.

The context-aware parser in Python is great as a way to do a smooth upgrade, but I would still appreciate a warning if I have used keywords like match and case as variable names. I would still want to change them everywhere to avoid confusing humans, even if the parser can cope.

(Stroustrup wrote about how C’s ‘bool’ as a header file #define does not really solve any of the keyword clash problems and it would have been better to just add the keyword properly. Anyone have a link to that?)

An overview of structural pattern matching for Python

Posted May 4, 2022 7:12 UTC (Wed) by NYKevin (subscriber, #129325) [Link]

> The context-aware parser in Python is great as a way to do a smooth upgrade, but I would still appreciate a warning if I have used keywords like match and case as variable names. I would still want to change them everywhere to avoid confusing humans, even if the parser can cope.

You can probably hack that together with ast.walk() and look for ast.Name instances with a "bad" name. I'm pretty sure that's, like, four lines of code. See https://docs.python.org/3/library/ast.html for documentation.

An overview of structural pattern matching for Python

Posted May 4, 2022 7:43 UTC (Wed) by mpr22 (subscriber, #60784) [Link]

> I don’t think there is any language that can easily add new keywords.

The Edition mechanism in Rust, with its "raw identifier" syntax, looks like the best approach I've seen so far.

An overview of structural pattern matching for Python

Posted May 4, 2022 21:53 UTC (Wed) by tialaramex (subscriber, #21167) [Link]

Right now, Rust's approach lets them smoothly add a keyword in the medium term. This is how Rust added, for example, async.

Suppose Rust decided it must have a "new" keyword. Rust 2024 edition could reserve this keyword, and people writing Rust 2024 would be obliged to write r#new when they meant the identifier named "new" while all existing code still compiles. They could then (any time after Rust 2024 "ships") use the reserved keyword for whatever purpose they had in mind in Rust 2024. Since lots of things in Rust are named new e.g. Vec::new() is a very common function call, and would need to be re-written Vec::r#new() this is awkward but it would work.

However, if things were more urgent, say the "new" feature was so critical that otherwise conservative languages like C all rush to provide it, and Rust must do likewise, raw identifiers aren't enough. As I understand it, the intent in that case is to ship an entire feature to provide this capability, named raw keywords. k#new would be the (awkwardly spelled) new keyword immediately, and then Rust 2024 would bless the actual name "new" for it and do all the tidying up in a few years.

The entire namespace where "raw keywords" could live is reserved, so like __Bool in C this is merely awkward spelling. If anything the biggest problem is picking opportunities to use it.

An overview of structural pattern matching for Python

Posted May 4, 2022 9:29 UTC (Wed) by Wol (subscriber, #4433) [Link]

> I don’t think there is any language that can easily add new keywords. You always end up with ugliness like _bool. Perl started out with an explicit & prefix for subroutine calls but soon dropped it because everyone prefers to see a bare word, even though it will make it hard to add new builtin functions or keywords later.

DataBASIC (depending on your variant) uses REM to mean four different things. Different compilers interpret it differently.

I always used to use it as a variable, but it was also a function (REMainder). Then someone added using it as a statement (REMark). And you could also use it as a label ...

Which variants of REM your compiler accepted depended on the compiler. I think modern versions accept every variation except maybe the variable. Which is a pain, because one of the biggest implementations of its time used it that way ...

Cheers,
Wol

An overview of structural pattern matching for Python

Posted May 4, 2022 18:21 UTC (Wed) by jezuch (subscriber, #52988) [Link]

> I don’t think there is any language that can easily add new keywords.

There's Java :) Kind of, at least. They've recently added a couple of keywords by using a trick: they're context-dependent. See for example "yield" in the new enhanced switch. The latest proposals of pattern-matching switch go even further, adding "when" keyword as a case guard.

The current design is here: https://openjdk.java.net/jeps/427 As a bonus you can compare it with what is described in the article ;)

An overview of structural pattern matching for Python

Posted May 5, 2022 15:22 UTC (Thu) by kokada (guest, #92849) [Link]

Well, from the looks of the text it does looks to me that `match` in Python s also context-dependent. The text even gave an example on how you could have both a variable called `match` and a `match` statement one after another (with the amusing resulting `match match:`).

An overview of structural pattern matching for Python

Posted May 7, 2022 1:48 UTC (Sat) by smitty_one_each (subscriber, #28989) [Link]

Right. "Soft" keywords, due to the PEG parser.

An overview of structural pattern matching for Python

Posted May 4, 2022 18:33 UTC (Wed) by bartoc (subscriber, #124262) [Link]

Algol 68, and Oberon-7 (among other versions) put keywords in a totally separate namespace and reserve the whole space. Everything ALLCAPS in oberon is a keyword, and you must not use such things as identifiers. Similarly algol 68 required surrounding keywords in back-ticks or something.

The middle ground is allowing some kind of stropping to turn what would be a keyword into an identifier, that way you can “update” in an automated fashion

if you have a module system the parser can just parse according to what the module expects.

An overview of structural pattern matching for Python

Posted May 7, 2022 1:51 UTC (Sat) by smitty_one_each (subscriber, #28989) [Link]

If I ever get around to crafting a language, I would achieve this by having editor support for wrapping all keywords in JSON, e.g. {'for':'kw'} so that it looks like `for` but lets the coder use `for` for whatever else is interesting. [Waves hands as to how that would actually work from a UX perspective.]

An overview of structural pattern matching for Python

Posted May 7, 2022 18:30 UTC (Sat) by NYKevin (subscriber, #129325) [Link]

Personally, I don't want to write code in rich text.

An overview of structural pattern matching for Python

Posted May 9, 2022 2:04 UTC (Mon) by smitty_one_each (subscriber, #28989) [Link]

Obviously editor support would be a going-in requirement.

An overview of structural pattern matching for Python

Posted May 4, 2022 17:46 UTC (Wed) by stefanha (subscriber, #55072) [Link]

It would have been nice to explain class patterns a little more. The tutorial shows what is possible:
https://peps.python.org/pep-0636/#adding-a-ui-matching-ob...


Copyright © 2022, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds