The PEPs of Python 3.9
With the release of Python 3.9.0b1, the first of four planned betas for the development cycle, Python 3.9 is now feature-complete. There is still plenty to do in terms of testing and stabilization before the October final release. The release announcement lists a half-dozen Python Enhancement Proposals (PEPs) that were accepted for 3.9. We have looked at some of those PEPs along the way; there are some updates on those. It seems like a good time to fill in some of the gaps on what will be coming in Python 3.9.
String manipulations
Sometimes the simplest (seeming) things are the hardest—or at least provoke an outsized discussion. Much of that was bikeshedding over—what else?—naming, but the idea of adding functions to the standard string objects to remove prefixes and suffixes was fairly uncontroversial. Whether those affixes (a word for both prefixes and suffixes) could be specified as sequences, so more than one affix could be handled in a single call, was less clear cut; ultimately, it was removed from the proposal, awaiting someone else to push that change through the process.
Toward the end of March, Dennis Sweeney asked on the python-dev mailing list for a core developer to sponsor PEP 616 ("String methods to remove prefixes and suffixes"). He pointed to a python-ideas discussion from March 2019 about the idea. Eric V. Smith agreed to sponsor the PEP, which led Sweeney to post it and kick off the discussion. In the original version, he used cutprefix() and cutsuffix() as the names of the string object methods to be added. Four types of Python objects would get the new methods: str (Unicode strings), bytes (binary sequences), bytearray (mutable binary sequences), and collections.UserString (a wrapper around string objects). It would work as follows:
'abcdef'.cutprefix('abc')    # returns 'def'
'abcdef'.cutsuffix('ef')     # returns 'abcd'
There were plenty of suggestions in the name department. Perhaps the most widespread agreement was that few liked "cut", so "strip", "trim", and "remove" were all suggested and garnered some support. stripprefix() (and stripsuffix(), of course) seemed to run into opposition due, at least in part, to one of the rationales specified in the PEP; the existing "strip" functions are confusing so reusing that name should be avoided. The str.lstrip() and str.rstrip() methods also remove leading and trailing characters, but they are a source of confusion to programmers actually looking for the cutprefix() functionality. The *strip() calls take a string argument, but treat it as a set of characters that should be eliminated from the front or end of the string:
'abcdef'.lstrip('abc')       # returns 'def' as "expected"
'abcbadefed'.lstrip('abc')   # returns 'defed', not at all as expected
Eventually, removeprefix() and removesuffix() seemed to gain the upper hand, which is what Sweeney eventually switched to. It probably did not hurt that Guido van Rossum supported those names as well. Eric Fahlgren amusingly summed up the name fight this way:
cutprefix - Removes the specified prefix.
trimprefix - Removes the specified prefix.
stripprefix - Removes the specified prefix.
removeprefix - Removes the specified prefix. Duh. :)
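The semantics themselves are easy to state. A rough pure-Python sketch of what PEP 616 specifies (using stand-in names, not the actual C implementation) might look like:

```python
def remove_prefix(s: str, prefix: str) -> str:
    # Remove the prefix only if the entire prefix matches at the start;
    # otherwise return the string unchanged
    if prefix and s.startswith(prefix):
        return s[len(prefix):]
    return s

def remove_suffix(s: str, suffix: str) -> str:
    # Same idea at the other end; the "suffix and" guard avoids
    # s[:-0], which would wrongly produce an empty string
    if suffix and s.endswith(suffix):
        return s[:-len(suffix)]
    return s

remove_prefix('abcdef', 'abc')  # 'def'
remove_suffix('abcdef', 'ef')   # 'abcd'
remove_prefix('abcdef', 'xyz')  # 'abcdef' (no match, no change)
```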
Sweeney announced an update to the PEP that addressed a number of comments, but also added the requested ability to take a tuple of strings as an affix (that version can be seen in the PEP GitHub repository). But Steven D'Aprano was not so sure it made sense to do that. He pointed out that the only string operations that take a tuple are str.startswith() and str.endswith(), which do not return a string (just a boolean value). He is leery of adding a method that returns a (potentially changed) version of the string while taking a tuple because whatever rules are chosen on how to process the tuple will be the "wrong" choice for some. For example:
"extraordinary".startswith(('ex', 'extra'))
since it is True whether you match left-to-right, shortest-to-largest, or even in random order. But for cutprefix, which prefix should be deleted?
As he said, the rule as proposed is that the first matching string, processing the tuple left-to-right, is used, but some might want the longest match or the last match; it all depends on the context of the use. He suggested that the feature get more "soak time" before committing to adding that behavior: "We ought to get some real-life exposure to the simple case first, before adding support for multiple prefixes/suffixes."
Ethan Furman agreed with D'Aprano. But Victor Stinner was strongly in favor of the tuple-argument idea. He wondered about the proposed behavior, however, when the empty string is passed as part of the tuple. As proposed, encountering the empty string (which effectively matches anything) when processing the tuple would simply return the original string, which leads to surprising results:
cutsuffix("Hello World", ("", " World"))   # returns "Hello World"
cutsuffix("Hello World", (" World", ""))   # returns "Hello"
The problem is not likely to manifest so obviously; affixes will not necessarily be hard-coded, so empty strings might slip into unexpected places. Stinner suggested raising ValueError if an empty string is encountered, similar to str.split().
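The surprise is easy to reproduce. Here is a hypothetical helper (the name and code are ours, not from the PEP) mirroring the proposed, later withdrawn, tuple semantics of first-match-wins, scanning left to right:

```python
def cutsuffix_tuple(s, suffixes):
    # Hypothetical sketch of the withdrawn tuple semantics:
    # scan the tuple left to right, first matching suffix wins
    for suffix in suffixes:
        if s.endswith(suffix):
            # len(s) - len(suffix) handles the empty-suffix case,
            # where s[:-0] would wrongly produce an empty string
            return s[:len(s) - len(suffix)]
    return s

cutsuffix_tuple("Hello World", ("", " World"))   # "Hello World": "" matches first
cutsuffix_tuple("Hello World", (" World", ""))   # "Hello"
```

Because the empty string matches any string, its position in the tuple silently decides whether anything is removed at all.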
But Sweeney decided to remove the tuple-argument feature entirely to "allow someone else with a stronger opinion about it to propose and defend a set of semantics in a different PEP". He posted the last version of the PEP on March 28.
On April 9, Sweeney opened a steering council issue requesting a review of the PEP. On April 20, Stinner accepted it on behalf of the council. It is a pretty minimal change but worth the time to try to ensure that it has the right interface (and semantics) for the long haul. We will see removeprefix() and removesuffix() in Python 3.9.
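The contrast with the existing, set-based *strip() methods is clear with the new methods in hand (these require Python 3.9 or later):

```python
s = "abcbadefed"
s.removeprefix("abc")  # 'badefed': only the exact prefix string is removed
s.lstrip("abc")        # 'defed': every leading 'a', 'b', or 'c' is stripped
```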
New parser
It should not really surprise anyone that the new parser for CPython, covered here in mid-April, has been accepted by the steering council. PEP 617 ("New PEG parser for CPython") was proposed by project founder and former benevolent dictator for life (BDFL) Guido van Rossum, along with Pablo Galindo Salgado and Lysandros Nikolaou; it is already working well and its performance is within 10% of the existing parser in terms of speed and memory use. It will also make the language specification simpler because the parser is based on a parsing expression grammar (PEG). The existing LL(1) parser for CPython suffers from a number of shortcomings and contains some hacks that the new parser will eliminate.
The change paves the way for Python to move beyond having an LL(1) grammar—though the existing language is not precisely LL(1)—down the road. That change will not come soon as the plans are to keep the existing parser available in Python 3.9 behind a command-line switch. But Python 3.10 will remove the existing parser, which could allow language changes. If those kinds of changes are made, however, alternative Python implementations (e.g. PyPy, MicroPython) may need to switch their parsers to something other than LL(1) in order to keep up with the language specification. That might give the core developers pause before making a change of that nature.
And more
We looked at PEP 615 ("Support for the IANA Time Zone Database in the Standard Library") back in early March. It would add a zoneinfo module to the standard library that would facilitate getting time-zone information from the IANA time zone database (also known as the "Olson database") to populate a time-zone object. It was looked on favorably at the time of the article, and at the end of March Paul Ganssle asked for a decision on the PEP. He thought it might be amusing to have it accepted (assuming it was) during an interesting time window.
He recognized that it might be difficult to pull off and it certainly was not a priority. The steering council did not miss the second window by much; Barry Warsaw announced the acceptance of the PEP on April 20. Python will now have a mechanism to access the system's time-zone database for creating and handling time zones. In addition, there is a tzdata module in the Python Package Index (PyPI) that contains the IANA data for systems that lack it; it will be maintained by the Python core developers as well.
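A taste of what the new module enables, assuming the system time-zone database (or the tzdata package) is available: disambiguating the repeated wall-clock hour when daylight saving time ends, via the fold attribute.

```python
from datetime import datetime
from zoneinfo import ZoneInfo  # new in Python 3.9

# 1:30 occurs twice on 2020-11-01 in New York (DST ends at 2:00);
# fold selects which of the two occurrences is meant
tz = ZoneInfo("America/New_York")
first = datetime(2020, 11, 1, 1, 30, tzinfo=tz, fold=0)
second = datetime(2020, 11, 1, 1, 30, tzinfo=tz, fold=1)
first.utcoffset()   # UTC-4 (EDT, before the clocks fall back)
second.utcoffset()  # UTC-5 (EST, after)
```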
PEP 593 ("Flexible function and variable annotations") adds a way to associate context-specific metadata with functions and variables. Effectively, the type-hint annotations have squeezed out other use cases that were envisioned in PEP 3107 ("Function Annotations"), which was implemented in Python 3.0 many years ago. PEP 593 creates a new mechanism for those use cases using the Annotated type hint.
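In use, Annotated wraps an ordinary type hint and attaches arbitrary metadata alongside it; type checkers see the underlying type, while tools that want the metadata can ask for it explicitly:

```python
from typing import Annotated, get_type_hints

def wait(timeout: Annotated[float, "seconds"]) -> None:
    pass

# A plain get_type_hints() call still resolves to float; passing
# include_extras=True (also new in 3.9) exposes the attached metadata
hints = get_type_hints(wait, include_extras=True)
hints["timeout"].__metadata__  # ('seconds',)
```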
Another kind of cleanup comes in PEP 585 ("Type Hinting Generics In Standard Collections"). It will allow the removal of a parallel set of type aliases maintained in the typing module in order to support generic types. For example, the typing.List and typing.Dict aliases will no longer be needed to support annotations like "dict[str, list[int]]" (i.e., a dictionary with string keys and values that are lists of integers).
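With PEP 585, the built-in collection types are subscriptable at runtime, so annotations can be written without importing anything from typing:

```python
def tally(words: list[str]) -> dict[str, int]:
    # list[str] and dict[str, int] work directly in 3.9; no
    # "from typing import List, Dict" is needed
    counts: dict[str, int] = {}
    for word in words:
        counts[word] = counts.get(word, 0) + 1
    return counts

tally(["spam", "eggs", "spam"])  # {'spam': 2, 'eggs': 1}
```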
The dictionary union operation for "addition" will also be part of Python 3.9. It was a bit contentious at times, but PEP 584 ("Add Union Operators To dict") was recommended for acceptance by Van Rossum in mid-February. The steering council promptly agreed and the feature was merged on February 24.
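The new operators are | for merging two dicts into a new one and |= for updating in place; when both operands have the same key, the right-hand value wins:

```python
defaults = {"color": "red", "user": "guest"}
overrides = {"user": "admin"}

merged = defaults | overrides  # {'color': 'red', 'user': 'admin'}
defaults |= overrides          # in-place form; defaults now equals merged
```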
The last PEP on the list is PEP 602 ("Annual Release Cycle for Python"). As it says on the tin, it changes the release cadence from every 18 months to once per year. The development and release cycles overlap, though, so that a full 12 months is available for feature development. Python 3.10 feature development begins when the first Python 3.9 beta has been released—which is now. Stay tuned for the next round of PEPs in the coming year.
The PEPs of Python 3.9
Posted May 21, 2020 7:08 UTC (Thu) by sytoka (guest, #38525) [Link]
s/^abc//
s/ef$//

Sometimes, regexes are simpler than human language ;-)
The PEPs of Python 3.9
Posted May 21, 2020 11:07 UTC (Thu) by smurf (subscriber, #17840) [Link]
On the other hand, too much magic behavior (plus heaps of TMTOWTDI) is exactly why I'm using Python instead of Perl these days.
The PEPs of Python 3.9
Posted May 21, 2020 14:41 UTC (Thu) by smitty_one_each (subscriber, #28989) [Link]
The PEPs of Python 3.9
Posted May 21, 2020 14:23 UTC (Thu) by excors (subscriber, #95769) [Link]
$_ = "abcdef";     s/ef$//;  # "abcd"
$_ = "abcdef\n";   s/ef$//;  # "abcd\n"
$_ = "abcdef\n\n"; s/ef$//;  # "abcdef\n\n"

Sometimes regexes aren't quite as simple as they seem.
The PEPs of Python 3.9
Posted May 21, 2020 16:04 UTC (Thu) by kay (subscriber, #1362) [Link]
The PEPs of Python 3.9
Posted May 21, 2020 17:02 UTC (Thu) by NYKevin (subscriber, #129325) [Link]
- AttributeError (because it's removesuffix).
- If we fix that, then it would return the string unchanged, because "def" is not a suffix of "abcdef\n\n" (according to the endswith() method). Newline is just another character and doesn't get magical treatment here. If you want newlines to be ignored, you have to call strip("\n") (or lstrip or rstrip) to remove them.
See the specification if you're curious about any other aspects here.
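The two behaviors described above are easy to check with the real 3.9 methods (a small illustration, not from the original comment):

```python
s = "abcdef\n\n"
s.removesuffix("ef")               # unchanged: "ef" is not a suffix of s
s.rstrip("\n").removesuffix("ef")  # "abcd": strip the newlines first
```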
The PEPs of Python 3.9
Posted May 23, 2020 7:52 UTC (Sat) by flussence (subscriber, #85566) [Link]
Raku saves the day (again):
dd (S/ ef $ //, S/ ef $$ //) for "abcdef", "abcdef\n", "abcdef\n\n";
("abcd", "abcd")
("abcdef\n", "abcd\n")
("abcdef\n\n", "abcd\n\n")
$ and $$ here are invariant replacements for $, \Z and \z that aren't affected by regex postfix switches (and don't have nightmare edge cases with blank lines apparently).
There's only one — hopefully obvious — way to do it, one that also doesn't require remembering what thesaurus word a function name uses or whether or not it accepts list args…
The PEPs of Python 3.9
Posted May 21, 2020 20:53 UTC (Thu) by gerdesj (subscriber, #5446) [Link]
Also, what's wrong with left(), right() and/or mid() which are already in use in other languages and hence "obvious"?
The PEPs of Python 3.9
Posted May 22, 2020 0:41 UTC (Fri) by excors (subscriber, #95769) [Link]
left()/right() sound even more confusing, because the 'left' character of "עִבְרִית"
might be the one that's on either the left or the right of the screen depending on your text editor.
The PEPs of Python 3.9
Posted May 22, 2020 16:52 UTC (Fri) by smcv (subscriber, #53363) [Link]
For purely right-to-left text, logical order is the opposite of "visual order". For mixed left-to-right and right-to-left text (for example an English web page containing some Arabic words or vice versa), the logical order is still first-word-first, and the visual order is complicated.
Writing content in visual order basically can't work unless the text width is known and fixed (that is, the lines of text are hard-wrapped, as they would be in a terminal emulator).
For example in HTML: https://www.w3.org/International/questions/qa-visual-vs-l...
For RTL text in logical order, left- and right-oriented API names like Python's lstrip() and rstrip() or BASIC's Left() and Right() do the opposite of what their names would suggest: lstrip() deletes the first characters of the string, which are the first letters of the first word (even though they would be displayed on the right), while rstrip() deletes the last characters (even though they would be displayed on the left).
Talking about a prefix or suffix (like GLib's g_str_has_prefix(), g_str_has_suffix), or the start and end (like Python's str.startswith() and str.endswith()), or numeric positions (like Python's str[:3] or str[5:]) makes more sense than "left" and "right" when working with logical order.
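A small illustration of logical order in Python (our example, not from the comment): indexing and startswith() operate on the first character as written and read, regardless of where it is displayed.

```python
s = "עברית"        # "Hebrew", stored in logical (first-written-first) order
s[0]               # 'ע': the first letter read, displayed at the right edge
s.startswith("ע")  # True
```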
The PEPs of Python 3.9
Posted May 22, 2020 18:27 UTC (Fri) by NYKevin (subscriber, #129325) [Link]
If anyone is curious about the visual order, the Unicode Consortium has written down the gory details here: https://unicode.org/reports/tr9/ (TL;DR: "Complicated" is an understatement, but basically, it tries to figure out which pieces of text are "embedded" in surrounding text of the opposite directionality, and then lays things out to preserve the visual order of each level of embedding. It also spends a great deal of complexity on "guessing" the directionality of neutral characters such as punctuation, digits, and whitespace.)
The PEPs of Python 3.9
Posted May 22, 2020 23:37 UTC (Fri) by gerdesj (subscriber, #5446) [Link]
String manipulation function definitions are quite tricky, when the definition of string is hard.
The PEPs of Python 3.9
Posted May 29, 2020 14:05 UTC (Fri) by quietbritishjim (subscriber, #114117) [Link]
The first thing I got confused by is that I assumed that "visual order" would still mean the rightmost character to be "first" if you're looking at a purely right-to-left language. After all, if you asked a native Arabic speaker what the visually first character is then they would point over on the right hand side. So when you said "For purely right-to-left text, logical order is the opposite of visual order", that would be a contradiction in terms. But it seems that in practice this term is usually used to mean what I'd call "visually left-to-right order", so "first" is always leftmost regardless of the text.
Second, I thought you only brought up "visual order" to explain the difference between how strings are stored vs. how they're rendered. I didn't realise that some people actually have stored text in visually left-to-right order *in memory*. So `mystr[0]` would be the logically last character in a right-to-left language, because that's on the left. Yuck!