
The PEPs of Python 3.9


By Jake Edge
May 20, 2020

With the release of Python 3.9.0b1, the first of four planned betas for the development cycle, Python 3.9 is now feature-complete. There is still plenty to do in terms of testing and stabilization before the final release in October. The release announcement lists a half-dozen Python Enhancement Proposals (PEPs) that were accepted for 3.9. We have looked at some of those PEPs along the way; there are updates on those as well. It seems like a good time to fill in some of the gaps on what will be coming in Python 3.9.

String manipulations

Sometimes the simplest-seeming things are the hardest—or at least provoke an outsized discussion. Much of that was bikeshedding over—what else?—naming, but the idea of adding functions to the standard string objects to remove prefixes and suffixes was fairly uncontroversial. Whether those affixes (a word for both prefixes and suffixes) could be specified as sequences, so that more than one affix could be handled in a single call, was less clear-cut; ultimately, that part was removed from the proposal, awaiting someone else to push the change through the process.

Toward the end of March, Dennis Sweeney asked on the python-dev mailing list for a core developer to sponsor PEP 616 ("String methods to remove prefixes and suffixes"). He pointed to a python-ideas discussion from March 2019 about the idea. Eric V. Smith agreed to sponsor the PEP, which led Sweeney to post it and kick off the discussion. In the original version, he used cutprefix() and cutsuffix() as the names of the string object methods to be added. Four types of Python objects would get the new methods: str (Unicode strings), bytes (binary sequences), bytearray (mutable binary sequences), and collections.UserString (a wrapper around string objects). It would work as follows:

    'abcdef'.cutprefix('abc')   # returns 'def'
    'abcdef'.cutsuffix('ef')    # returns 'abcd'

There were plenty of suggestions in the name department. Perhaps the most widespread agreement was that few liked "cut", so "strip", "trim", and "remove" were all suggested and garnered some support. stripprefix() (and stripsuffix(), of course) ran into opposition due, at least in part, to one of the rationales specified in the PEP: the existing "strip" functions are already confusing, so reusing that name should be avoided. The str.lstrip() and str.rstrip() methods also remove leading and trailing characters, but they are a source of confusion to programmers actually looking for the cutprefix() functionality. The *strip() methods take a string argument, but treat it as a set of characters to be eliminated from the front or end of the string:

    'abcdef'.lstrip('abc')      # returns 'def' as "expected"
    'abcbadefed'.lstrip('abc')  # returns 'defed' not at all as expected

Eventually, removeprefix() and removesuffix() seemed to gain the upper hand, and Sweeney switched to those names. It probably did not hurt that Guido van Rossum supported them as well. Eric Fahlgren amusingly summed up the name fight this way:

I think name choice is easier if you write the documentation first:

cutprefix - Removes the specified prefix.
trimprefix - Removes the specified prefix.
stripprefix - Removes the specified prefix.
removeprefix - Removes the specified prefix. Duh. :)

Sweeney announced an update to the PEP that addressed a number of comments, but also added the requested ability to take a tuple of strings as an affix (that version can be seen in the PEP GitHub repository). But Steven D'Aprano was not so sure it made sense to do that. He pointed out that the only string operations that take a tuple are str.startswith() and str.endswith(), which do not return a string (just a boolean value). He is leery of adding a method that returns a (potentially changed) version of the string while taking a tuple because whatever rules are chosen on how to process the tuple will be the "wrong" choice for some. For example:

The difficulty here is that the notion of "cut one of these prefixes" is ambiguous if two or more of the prefixes match. It doesn't matter for startswith:
    "extraordinary".startswith(('ex', 'extra'))
since it is True whether you match left-to-right, shortest-to-largest, or even in random order. But for cutprefix, which prefix should be deleted?

As he said, the rule as proposed is that the first matching string, processing the tuple left to right, is used, but some might want the longest match or the last match; it all depends on the context of the use. He suggested that the feature get more "soak time" before committing to that behavior: "We ought to get some real-life exposure to the simple case first, before adding support for multiple prefixes/suffixes."
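The ambiguity D'Aprano describes can be seen by sketching the two competing rules as plain functions. These helpers (cut_first_match() and cut_longest_match() are invented names for illustration, not anything that was proposed) give different answers for the same input:

```python
def cut_first_match(s, prefixes):
    # The rule in the PEP draft: scan the tuple left to right and
    # remove the first prefix that matches.
    for p in prefixes:
        if s.startswith(p):
            return s[len(p):]
    return s

def cut_longest_match(s, prefixes):
    # The alternative rule some wanted: remove the longest matching prefix.
    longest = max((p for p in prefixes if s.startswith(p)),
                  key=len, default='')
    return s[len(longest):]

print(cut_first_match('extraordinary', ('ex', 'extra')))    # traordinary
print(cut_longest_match('extraordinary', ('ex', 'extra')))  # ordinary
```

Either answer is defensible, which is exactly why D'Aprano argued that committing to one of them in the PEP was premature.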

Ethan Furman agreed with D'Aprano. But Victor Stinner was strongly in favor of the tuple-argument idea. He wondered about the proposed behavior, however, when the empty string is passed as part of the tuple. As proposed, encountering the empty string (which effectively matches anything) when processing the tuple would simply return the original string, which leads to surprising results:

    'Hello World'.cutsuffix(('', ' World'))     # returns 'Hello World'
    'Hello World'.cutsuffix((' World', ''))     # returns 'Hello'

The problem is not likely to manifest so obviously; affixes will not necessarily be hard-coded, so empty strings might slip into unexpected places. Stinner suggested raising ValueError if an empty string is encountered, similar to str.split(). But Sweeney decided to remove the tuple-argument feature entirely to "allow someone else with a stronger opinion about it to propose and defend a set of semantics in a different PEP." He posted the final version of the PEP on March 28.

On April 9, Sweeney opened a steering council issue requesting a review of the PEP. On April 20, Stinner accepted it on behalf of the council. It is a pretty minimal change but worth the time to try to ensure that it has the right interface (and semantics) for the long haul. We will see removeprefix() and removesuffix() in Python 3.9.
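Under the accepted names, the behavior matches the PEP's specification: the affix is removed if it is present, and the string comes back unchanged otherwise (no exception, no character-set semantics). A quick sketch of what Python 3.9 will provide:

```python
# Python 3.9+: the new string methods from PEP 616.
print('abcdef'.removeprefix('abc'))  # def
print('abcdef'.removesuffix('ef'))   # abcd

# No match means the string is returned unchanged, unlike the
# character-set behavior of lstrip()/rstrip().
print('abcbadefed'.removeprefix('xyz'))  # abcbadefed
```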

New parser

It should not really surprise anyone that the new parser for CPython, covered here in mid-April, has been accepted by the steering council. PEP 617 ("New PEG parser for CPython") was proposed by project founder and former benevolent dictator for life (BDFL) Guido van Rossum, along with Pablo Galindo Salgado and Lysandros Nikolaou; it is already working well and its performance is within 10% of the existing parser in terms of speed and memory use. It will also make the language specification simpler because the parser is based on a parsing expression grammar (PEG). The existing LL(1) parser for CPython suffers from a number of shortcomings and contains some hacks that the new parser will eliminate.

The change paves the way for Python to move beyond having an LL(1) grammar—though the existing language is not precisely LL(1)—down the road. That change will not come soon, however; the plan is to keep the existing parser available in Python 3.9 behind a command-line switch. But Python 3.10 will remove the old parser, which could open the door to language changes. If those kinds of changes are made, alternative Python implementations (e.g. PyPy, MicroPython) may need to switch their parsers to something other than LL(1) in order to keep up with the language specification. That might give the core developers pause before making a change of that nature.
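The key difference between the two parsing approaches can be shown with a toy sketch (this is only an illustration of the PEG idea, not CPython's implementation; the helper names are invented). An LL(1) parser must pick an alternative from a single token of lookahead, while a PEG grammar uses ordered choice: try each alternative in order and backtrack on failure.

```python
def match_literal(text, pos, literal):
    # Try to match a literal string at pos; return the new position,
    # or None on failure (the signal to backtrack).
    if text.startswith(literal, pos):
        return pos + len(literal)
    return None

def ordered_choice(text, pos, alternatives):
    # PEG ordered choice: the first alternative that succeeds wins;
    # a failed alternative simply resets to the original position.
    for alt in alternatives:
        end = match_literal(text, pos, alt)
        if end is not None:
            return end
    return None

# Both alternatives begin with 'i', so one character of lookahead cannot
# distinguish them; ordered choice with backtracking handles it naturally.
print(ordered_choice('import', 0, ['if', 'import']))  # 6
```

Note that, unlike a context-free grammar, the order of alternatives matters in a PEG: putting a prefix like 'im' before 'import' would change the result.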

And more

We looked at PEP 615 ("Support for the IANA Time Zone Database in the Standard Library") back in early March. It would add a zoneinfo module to the standard library that would facilitate getting time-zone information from the IANA time zone database (also known as the "Olson database") to populate a time-zone object. It was looked on favorably at the time of the article and at the end of March Paul Ganssle asked for a decision on the PEP. He thought it might be amusing to have it accepted (assuming it was) during an interesting time window:

[...] I was hoping (for reasons of whimsy) to get this accepted on Sunday, April 5th either between 02:00-04:00 UTC or between 13:00 and 17:30 UTC, since those times represent ambiguous datetimes somewhere on earth (mostly in Australia). There is one other opportunity for this, which is that on Sunday April 19th, the hours between 01:00 and 03:00 UTC are ambiguous in Western Sahara.

He recognized that it might be difficult to pull off and it certainly was not a priority. The steering council did not miss the second window by much; Barry Warsaw announced the acceptance of the PEP on April 20. Python will now have a mechanism to access the system's time-zone database for creating and handling time zones. In addition, there is a tzdata module in the Python Package Index (PyPI) that contains the IANA data for systems that lack it; it will be maintained by the Python core developers as well.
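In practice, the new module lets an aware datetime be built directly from an IANA key, with daylight-saving transitions handled from the database. A minimal sketch (it requires the system's time-zone database, or the tzdata package mentioned above, to be installed):

```python
from datetime import datetime
from zoneinfo import ZoneInfo  # new in Python 3.9 (PEP 615)

# Build an aware datetime; DST in Los Angeles ended on November 1, 2020,
# so this date is still in Pacific Daylight Time.
dt = datetime(2020, 10, 31, 12, 0, tzinfo=ZoneInfo("America/Los_Angeles"))
print(dt.tzname())  # PDT
```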

PEP 593 ("Flexible function and variable annotations") adds a way to associate context-specific metadata with functions and variables. Effectively, the type-hint annotations have squeezed out other use cases that were envisioned in PEP 3107 ("Function Annotations"), which was implemented in Python 3.0 many years ago. PEP 593 creates a new mechanism for those use cases using the Annotated type hint. Another kind of cleanup comes in PEP 585 ("Type Hinting Generics In Standard Collections"). It will allow the removal of a parallel set of type aliases maintained in the typing module in order to support generic types. For example, types like typing.Dict and typing.List will no longer be needed to write annotations like "dict[str, list[int]]" (i.e., a dictionary with string keys and values that are lists of integers).
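The two PEPs can be seen together in a short sketch: Annotated carries arbitrary metadata alongside a type, and the builtin collections are used as generics directly, with no typing.Dict or typing.List needed (the function and its "seconds" metadata are made up for illustration):

```python
from typing import Annotated, get_type_hints

# PEP 585: dict and list work as generics directly.
# PEP 593: Annotated attaches context-specific metadata to a type.
def process(scores: dict[str, list[int]],
            timeout: Annotated[float, "seconds"] = 1.0) -> None:
    pass

# Tools that care about the metadata ask for it explicitly.
hints = get_type_hints(process, include_extras=True)
print(hints['timeout'].__metadata__)  # ('seconds',)
```

Code that does not pass include_extras=True sees plain float, so existing type checkers are unaffected by the metadata.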

The dictionary union operation for "addition" will also be part of Python 3.9. It was a bit contentious at times, but PEP 584 ("Add Union Operators To dict") was recommended for acceptance by Van Rossum in mid-February. The steering council promptly agreed and the feature was merged on February 24.
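The new operators work as described in PEP 584: | produces a merged dictionary with the right-hand operand winning on duplicate keys, and |= updates in place. A quick sketch (Python 3.9+):

```python
defaults = {'color': 'red', 'size': 10}
overrides = {'size': 20}

# The union operator; values from the right operand win on conflicts.
merged = defaults | overrides
print(merged)  # {'color': 'red', 'size': 20}

# The augmented form updates the left operand in place.
defaults |= overrides
print(defaults)  # {'color': 'red', 'size': 20}
```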

The last PEP on the list is PEP 602 ("Annual Release Cycle for Python"). As it says on the tin, it changes the release cadence from every 18 months to once per year. The development and release cycles overlap, though, so that a full 12 months is available for feature development. Python 3.10 feature development begins when the first Python 3.9 beta has been released—which is now. Stay tuned for the next round of PEPs in the coming year.





The PEPs of Python 3.9

Posted May 21, 2020 7:08 UTC (Thu) by sytoka (guest, #38525) [Link]

s/^abc//
s/ef$//
Sometime, regex are simpler than human language ;-)

The PEPs of Python 3.9

Posted May 21, 2020 11:07 UTC (Thu) by smurf (subscriber, #17840) [Link]

Plus, a reasonable regexp engine should be able to replace its invocation with "removeprefix" if that's the RE's effect.

On the other hand, too much magic behavior (plus heaps of TMTOWTDI) is exactly why I'm using Python instead of Perl these days.

The PEPs of Python 3.9

Posted May 21, 2020 14:41 UTC (Thu) by smitty_one_each (subscriber, #28989) [Link]

I just fear that the PEG parser and the waning of Guido's benevolent dictatorship could lead to a shift away from Monty Python and toward Steely Dan references in future releases.

The PEPs of Python 3.9

Posted May 21, 2020 14:23 UTC (Thu) by excors (subscriber, #95769) [Link]

$_ = "abcdef";     s/ef$//;  # "abcd"
$_ = "abcdef\n";   s/ef$//;  # "abcd\n"
$_ = "abcdef\n\n"; s/ef$//;  # "abcdef\n\n"
Sometimes regexes aren't quite as simple as they seem.

The PEPs of Python 3.9

Posted May 21, 2020 16:04 UTC (Thu) by kay (subscriber, #1362) [Link]

so what's the result with "abcdef\n\n".removepostfix("def")?

The PEPs of Python 3.9

Posted May 21, 2020 17:02 UTC (Thu) by NYKevin (subscriber, #129325) [Link]

  1. AttributeError (because it's removesuffix).
  2. If we fix that, then it would return the string unchanged, because "def" is not a suffix of "abcdef\n\n" (according to the endswith() method). Newline is just another character and doesn't get magical treatment here. If you want newlines to be ignored, you have to call strip("\n") (or lstrip or rstrip) to remove them.

See the specification if you're curious about any other aspects here.

The PEPs of Python 3.9

Posted May 23, 2020 7:52 UTC (Sat) by flussence (subscriber, #85566) [Link]

Raku saves the day (again):

dd (S/ ef $ //, S/ ef $$ //) for "abcdef", "abcdef\n", "abcdef\n\n";
("abcd", "abcd")
("abcdef\n", "abcd\n")
("abcdef\n\n", "abcd\n\n")

$ and $$ here are invariant replacements for $, \Z and \z that aren't affected by regex postfix switches (and don't have nightmare edge cases with blank lines apparently).

There's only one — hopefully obvious — way to do it, one that also doesn't require remembering what thesaurus word a function name uses or whether or not it accepts list args…

The PEPs of Python 3.9

Posted May 21, 2020 20:53 UTC (Thu) by gerdesj (subscriber, #5446) [Link]

I may have accidentally stumbled behind the wrong bike shed but what does pre/post fix mean in languages that don't run L->R.

Also, what's wrong with left(), right() and/or mid() which are already in use in other languages and hence "obvious"?

The PEPs of Python 3.9

Posted May 22, 2020 0:41 UTC (Fri) by excors (subscriber, #95769) [Link]

left()/right() sound even more confusing, because the 'left' character of "עִבְרִית‎" might be the one that's on either the left or the right of the screen depending on your text editor.

The PEPs of Python 3.9

Posted May 22, 2020 16:52 UTC (Fri) by smcv (subscriber, #53363) [Link]

Text is normally represented in "logical order", where the first letter of the first word appears first in memory, regardless of whether it will be displayed to the left or the right of subsequent letters. It stays in logical order during editing and manipulation, until something like Pango turns it into pixels.

For purely right-to-left text, logical order is the opposite of "visual order". For mixed left-to-right and right-to-left text (for example an English web page containing some Arabic words or vice versa), the logical order is still first-word-first, and the visual order is complicated.

Writing content in visual order basically can't work unless the text width is known and fixed (that is, the lines of text are hard-wrapped, as they would be in a terminal emulator).

For example in HTML: https://www.w3.org/International/questions/qa-visual-vs-l...

For RTL text in logical order, left- and right-oriented API names like Python's lstrip() and rstrip() or BASIC's Left() and Right() do the opposite of what their names would suggest: lstrip() deletes the first characters of the string, which are the first letters of the first word (even though they would be displayed on the right), while rstrip() deletes the last characters (even though they would be displayed on the left).

Talking about a prefix or suffix (like GLib's g_str_has_prefix(), g_str_has_suffix), or the start and end (like Python's str.startswith() and str.endswith()), or numeric positions (like Python's str[:3] or str[5:]) makes more sense than "left" and "right" when working with logical order.

The PEPs of Python 3.9

Posted May 22, 2020 18:27 UTC (Fri) by NYKevin (subscriber, #129325) [Link]

> the logical order is still first-word-first, and the visual order is complicated.

If anyone is curious about the visual order, the Unicode Consortium has written down the gory details here: https://unicode.org/reports/tr9/ (TL;DR: "Complicated" is an understatement, but basically, it tries to figure out which pieces of text are "embedded" in surrounding text of the opposite directionality, and then lays things out to preserve the visual order of each level of embedding. It also spends a great deal of complexity on "guessing" the directionality of neutral characters such as punctuation, digits, and whitespace.)

The PEPs of Python 3.9

Posted May 22, 2020 23:37 UTC (Fri) by gerdesj (subscriber, #5446) [Link]

Thank you for that explanation. Your comment speaks volumes and I can almost see the blood dripping through clenched teeth.

String manipulation function definitions are quite tricky, when the definition of string is hard.

The PEPs of Python 3.9

Posted May 29, 2020 14:05 UTC (Fri) by quietbritishjim (subscriber, #114117) [Link]

This is a great comment, thanks. I still found the term "visual order" confusing though, so I had a Google around for it. Here are two things to clarify.

The first thing I got confused by is that I assumed that "visual order" would still mean the rightmost character to be "first" if you're looking at a purely right-to-left language. After all, if you asked a native Arabic speaker what the visually first character is then they would point over on the right hand side. So when you said "For purely right-to-left text, logical order is the opposite of visual order", that would be a contradiction in terms. But it seems that in practice this term is usually used to mean what I'd call "visually left-to-right order", so "first" is always leftmost regardless of the text.

Second, I thought you only brought up "visual order" to explain the difference between how strings are stored vs. how they're rendered. I didn't realise that some people actually have stored text in visually left-to-right order *in memory*. So `mystr[0]` would be the logically last character in a right-to-left language, because that's on the left. Yuck!


Copyright © 2020, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds