Reworking StringIO concatenation in Python
Python string objects are immutable, so changing the value of a string requires that a new string object be created with the new value. That is fairly well-understood within the community, but there are some "anti-patterns" that arise; it is pretty common for new users to build up a longer string by repeatedly concatenating to the end of the "same" string. The performance penalty for doing that could be avoided by switching to a type that is geared toward incremental updates, but Python 3 has already optimized the penalty away for regular strings. A recent thread on the python-ideas mailing list explored this topic some.
Paul Sokolovsky posted his lengthy description of a fairly simple idea on March 29. The common anti-pattern of building up a string might look something like:
buf = "" for i in range(50000): buf += "foo" print(buf)
That could instead be written using an io.StringIO buffer:

buf = io.StringIO()
for i in range(50000):
    buf.write("foo")
print(buf.getvalue())
To make it easier for existing programs to be switched from one form to the
other, he suggested adding a "+=" operator for StringIO
as an alias for the write() method.
Adding an __iadd__()
method for the StringIO class would allow the write() call to be removed in
favor of using +=. The buffer initialization and
getvalue() call would still be needed, but those are each typically
done in only one place, while the concatenation may be done in multiple
places. So a code base could fairly easily be switched from the
anti-pattern to more proper Python just by creating the buffer instead of a
string and getting its value where needed with getvalue(); the
rest of the code can stay the same: "it will leave the rest of code intact,
and not obfuscate the original content construction algorithm".
As Sokolovsky noted, his performance benchmarking shows that CPython 3 has already optimized for the anti-pattern, though. So even though it is still considered to be a bad practice, there is no real penalty for writing code of that sort in CPython 3—but only for that version of the language:
The optimization, which is described by Paul Ganssle in a blog post, effectively allows CPython to treat the string as mutable in the case where there are no other references to it. In a loop like the one in the example, there is no other reference to the string object being used, so instead of creating a new object and freeing the old, it simply changes the existing object in place. CPython can detect that case because it uses reference counts on its objects for garbage collection; PyPy is not reference-counted, so it cannot use the same trick.
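The effect of that reference-count check can be sketched with the standard timeit module; the function names here are illustrative, not from Sokolovsky's benchmark. The second loop keeps a second reference to the string alive, which prevents CPython's in-place resize (actual timings vary by machine, so none are shown):

```python
import timeit

def concat_one_ref(n=10000):
    buf = ""
    for _ in range(n):
        buf += "x"       # buf is the only reference: CPython can resize in place
    return buf

def concat_two_refs(n=10000):
    buf = ""
    for _ in range(n):
        alias = buf      # a second reference defeats the in-place trick
        buf = buf + "x"  # so this concatenation must copy
    return buf

# The second loop is markedly slower on CPython for large n
print(timeit.timeit(concat_one_ref, number=10))
print(timeit.timeit(concat_two_refs, number=10))
```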
But Sokolovsky is trying to target the bad practice regardless of the (lack of a)
performance impact. The practice is widespread; the optimization added to
CPython is evidence that it needs addressing, he said. He suggested that
other implementations can either follow the lead of CPython (if possible)
or try to promote better practices: "This would require improving
ergonomics of existing string buffer
object, to make its usage less painful for both writing new code and
refactoring existing.
" And, of course, he was advocating the latter.
He also noted that, since the performance problem does not really exist for
CPython, it might be seen as an argument that there is nothing to
fix. "This is related to a bigger
[question] 'whether a life outside CPython exists', or put more
formally, where's the border between Python-the-language and
CPython-the-implementation.
" Beyond that, one could "fix" the
problem by creating a new class derived from StringIO that has an
__iadd__(), but that suffers from worse performance as well, which
argues that the problem should be addressed in C in StringIO
itself.
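Such a derived class is only a few lines of Python; this is a minimal sketch (the class name StringBuf is illustrative, not something proposed in the thread):

```python
import io

class StringBuf(io.StringIO):
    """StringIO with += support; a hypothetical subclass, not stdlib."""
    def __iadd__(self, s):
        self.write(s)   # delegate to the normal StringIO write()
        return self     # += then rebinds the name to this same object

buf = StringBuf()
for i in range(3):
    buf += "foo"
print(buf.getvalue())   # foofoofoo
```

The Python-level __iadd__() call on every += is exactly the overhead the paragraph above refers to, which is why Sokolovsky argued for doing it in C.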
The overall reception to the idea was chilly, at best, perhaps partly fueled by Sokolovsky's somewhat aggressive tone in his original note and some of the followups. Andrew Barnert replied that the join() mechanism is really the better alternative.
Barnert said that StringIO is meant to be a file object that resides in memory, so it is appropriate that its API does not support +=. He concluded with a third option for alternative Python implementations beyond the two that Sokolovsky presented.
The problem with the join() mechanism is that it is somewhat non-intuitive, especially for those coming to Python from another language. As Barnert noted, though, it can use more memory as well. Sokolovsky attempted to measure the difference in memory use, but the technique he used was not entirely convincing. His focus would appear to be on embedded Python, such as his Pycopy Python implementation. Pycopy is descended from MicroPython, which he also worked on. For the embedded use case, StringIO may well be the better choice for building strings, at least from a memory perspective; is that enough of a reason to turn a file-like object (StringIO) into a string-like object, but only for concatenation (+=)? The consensus answer would seem to be "no".
There was some discussion of having a generalized mutable string type,
though that was not at all what Sokolovsky was after; there are some good reasons
why that idea has never really taken off for Python, as Christopher Barker
described. "So
I'd say it hasn't been done because (1) it's a lot of work and (2) it would
be a bit of a pain to use, and not gain much at all.
"
The objections to the original idea are basically that += can be trivially
implemented for a derived class of StringIO; if the performance of
that is not sufficient, switching to join() would fix that
problem.
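For reference, the join()-based version of the earlier loop looks like this (a sketch, not code from the thread):

```python
# Collect the pieces in a list, then join once at the end
parts = []
for i in range(50000):
    parts.append("foo")
buf = "".join(parts)    # one pass, one final allocation
print(len(buf))         # 150000
```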
The existing "join() on a list of strings" idiom works well for most people and nearly
all use cases; it is the preferred way
to solve this problem in Python, so making another idiom more usable
is muddying the water to a certain degree. As The Zen of Python puts it:
"There should be one-- and preferably only one --obvious way to do
it.
"
On the other hand, CPython is the dominant player in the ecosystem, as Steven D'Aprano pointed out; that means applications can be written to take advantage of CPython quirks. But even if all of the other Python implementations agreed on a change, it would not really be used unless CPython followed suit:
But unless CPython does so too, it won't do them much good, because hardly anyone will take advantage of it. When one platform dominates 90% of the ecosystem, one can sensibly write code that depends on that platform's specific optimizations, but going the other way, not so much.
That is something for the CPython community to keep in mind. The existence of the other implementations of the language may provide opportunities to make some changes that are meant to be CPython-only (or at least not mandated for Python the language). But those changes can still get baked into the language via the back door—because most Python code runs on CPython.
In the final analysis, it is a pretty minuscule change being sought. The existence of the string concatenation optimization indicates that there is interest in helping "badly written" code to some extent, but perhaps adding += to StringIO is a bridge too far. There definitely does not seem to be any kind of groundswell of support for the idea and there are costs, beyond just the (minimal) code maintenance required, including in documentation and user education. The benefits, which some find to be dubious to begin with, are seemingly not enough to outweigh them.
Reworking StringIO concatenation in Python
Posted Apr 2, 2020 1:48 UTC (Thu) by gus3 (guest, #61103) [Link]
In this case, I think the "new machine" is Python, and the "old machine" is C++ blended with Smalltalk. Some programmers, familiar with C++ or Smalltalk idioms, wrote Python code with the same idioms, and expected similar behavior. Their code got the job done, but the running code behaved a lot worse in Python.
If there's a way to make the += idiom work for Python strings, I'm all for it. There's a lot of Python code deployed in corner-cases and edge spaces.
But naturally, the next question will be, how do we make this next big thing work:
str = "abc"
str *= 300
Or even:
str = "abc"
str -= "c"
How far do we follow the white rabbit?
Reworking StringIO concatenation in Python
Posted Apr 2, 2020 2:07 UTC (Thu) by NYKevin (subscriber, #129325) [Link]
Reworking StringIO concatenation in Python
Posted Apr 2, 2020 8:28 UTC (Thu) by mina86 (guest, #68442) [Link]
Reworking StringIO concatenation in Python
Posted Apr 5, 2020 0:25 UTC (Sun) by jmaa (guest, #128856) [Link]
Do you have a source for which string-concatenation patterns the Java compiler can optimize? Based on an article [1] from 2015, Java 8 wouldn't optimize the loop case demonstrated in the LWN article, but it shouldn't be too hard with a slightly beefier analysis, and they may already have added it. I can't get my Java 13 compiler to generate the same bytecode as shown in [1], but this may be because of a range of alternative optimizations (loop unrolling, constant folding, whole program, etc.)
[1]: http://www.pellegrino.link/2015/08/22/string-concatenatio...
Reworking StringIO concatenation in Python
Posted Apr 17, 2020 13:24 UTC (Fri) by kevincox (guest, #93938) [Link]
So "return a + b + c" is equivalent to "StringBuilder b = new StringBuilder(); b.append(a); b.append(b); b.append(c); return b.toString()".
Reworking StringIO concatenation in Python
Posted Apr 23, 2020 20:55 UTC (Thu) by jezuch (subscriber, #52988) [Link]
https://cl4es.github.io/2019/05/14/String-Concat-Redux.html
The foundation is JEP 280:
https://openjdk.java.net/jeps/280
Enjoy!
Reworking StringIO concatenation in Python
Posted Apr 2, 2020 12:45 UTC (Thu) by FLHerne (guest, #105373) [Link]
>>> str = "abc"
>>> str *= 3
>>> str
'abcabcabc'
Reworking StringIO concatenation in Python
Posted Apr 2, 2020 23:20 UTC (Thu) by gus3 (guest, #61103) [Link]
Note to self: RTFM.
Reworking StringIO concatenation in Python
Posted Apr 2, 2020 8:42 UTC (Thu) by Homer512 (subscriber, #85295) [Link]
Reworking StringIO concatenation in Python
Posted Apr 3, 2020 4:14 UTC (Fri) by NYKevin (subscriber, #129325) [Link]
For simple cases, sorta kinda yes, but in practice, not really. Here's why not:
Python 3.7.3 (v3.7.3:ef4ec6ed12, Mar 25 2019, 21:26:53) [MSC v.1916 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import inspect
>>> def foo():
...     x = "this is a string"
...     return bar()
...
>>> def bar():
...     x = inspect.stack()[1].frame.f_locals['x']
...     return x + " which has escaped from its original context"
...
>>> foo()
'this is a string which has escaped from its original context'
An escape analysis would have to track that sort of shenanigans across runtime imports, monkey-patching, polymorphism, etc. and this rapidly becomes completely impractical. Or alternatively, it only works in completely toy examples where there's no meaningful performance advantage. It's Not Worth It (TM) when you can just tell the user to write ''.join(xs) instead.
CPython, on the other hand, just has to check the refcount. That's completely trivial, so CPython implements this "feature" and PyPy doesn't.
Reworking StringIO concatenation in Python
Posted Apr 2, 2020 11:16 UTC (Thu) by dottedmag (subscriber, #18590) [Link]
Languages aren't created to make their implementations simple, they are created to make programming more palatable.
Reworking StringIO concatenation in Python
Posted Apr 2, 2020 11:17 UTC (Thu) by dottedmag (subscriber, #18590) [Link]
Reworking StringIO concatenation in Python
Posted Apr 2, 2020 13:24 UTC (Thu) by Paf (subscriber, #91811) [Link]
If Python is really aiming to include non-specialists (and obviously it is), then they really need to exert themselves to make the obvious patterns non-disastrous. I know that’s tough, but “if you want to frob this vorb (where frobbing is a very natural, simple thing to do to a vorb), you can use a vorb but it’s bad, you actually need a vorb.frobulatable” and then that’s good...
That’s really, really ugly, and the kind of thinking that says “treating a string like other simple types is just bad programming, whatever to those guys” is very... bubble-thinking for a language that is supposed to pull in non-specialists.
Reworking StringIO concatenation in Python
Posted Apr 2, 2020 17:55 UTC (Thu) by esemwy (guest, #83963) [Link]
Reworking StringIO concatenation in Python
Posted Apr 2, 2020 18:26 UTC (Thu) by NYKevin (subscriber, #129325) [Link]
Because there's a + operator. If you have a + operator, it's generally a Bad Idea to not provide +=. That would be really confusing.
Note that, in Python, augmented assignment operators (left += right, left *= right, etc.) work in one of two ways:
- For immutable types such as numbers and tuples, they simply return left + right. The language then assigns the result to left automatically.
- For mutable types such as lists, they mutate left in-place and return left. The language then assigns left to left automatically (which is a no-op).
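Both behaviors can be observed directly with id():

```python
# Immutable type: += produces a new object and rebinds the name
t = (1, 2)
before = id(t)
t += (3,)
assert id(t) != before    # a different tuple object

# Mutable type: += mutates in place; the rebinding is a no-op
lst = [1, 2]
before = id(lst)
lst += [3]
assert id(lst) == before  # the very same list object, now [1, 2, 3]
```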
Strings are on the "immutable" side of this divide, so they would ordinarily require a realloc and copy. CPython can detect the case where the string only has one reference and implement it as if the string was mutable. Other implementations can't do that because they're not reference counted. If strings were mutable, it wouldn't be possible to use them as dictionary keys, which would break all of Python (lots of language internals are implemented as dictionaries with strings as keys). So I think it's fair to say that they really are "serious about immutable strings." It's just that they made the mistake of making a special case to deal with a common anti-pattern, and now people are reliant on that exception. The special case isn't even all that robust; if the string has any other references (even just another local variable), then it goes back to the slow path.
The real question, IMHO, is whether any of this should matter. O(n^2) is quite fast when n=5. Are people really concatenating hundreds or thousands of strings in this fashion in the first place?
Reworking StringIO concatenation in Python
Posted Apr 3, 2020 13:29 UTC (Fri) by excors (subscriber, #95769) [Link]
Languages like Perl manage to have both mutable strings and dicts (hashes). Perl effectively does a copy-by-value when adding a key to the hash, and a copy-by-value when passing keys back to the application (e.g. when iterating over the hash contents), so the hash keys remain immutable while regular strings can still be mutated efficiently.
It's far too late for Python to change now, and I expect it would have ramifications on a lot of minor details of the language design, but is there a fundamental reason why copy-by-value keys couldn't work in a Python-like language?
(I think that technically Perl has a global table of reference-counted immutable hash-key strings, so the "copy-by-value" when adding to the hash is actually just a lookup in that table, to save time and memory when the same key is used in many hashes. And I think code like $foo{"bar"} gets compiled to bytecode that directly references "bar" from that global table, so in the common OOP case where hashes are used with constant keys there's no run-time copying at all.)
Reworking StringIO concatenation in Python
Posted Apr 3, 2020 21:27 UTC (Fri) by JoeBuck (subscriber, #2330) [Link]
I only write quick and dirty little Python scripts, but I've often used the += string accumulation pattern, and since CPython does the right thing it hasn't been a problem. I have no plans to change that.
Way back in the 90s, Microsoft had their own C++ string class, part of Microsoft Foundation Classes (this was before std::string existed), which was widely used, and repeated string appends had a quadratic cost, because they made the rookie mistake (which subsequently became a common interview question) of growing the buffer by adding just enough room to store the new piece. People expect the classic += string append pattern to perform well. Microsoft defenders used to call people who did the obvious thing and got awful performance ignorant. This was wrong. The implementation was broken.
If, in your implementation, this pattern (or antipattern if you insist) has quadratic cost, it is a bug in your implementation (or at minimum a quality defect). Fix it, at least for the common cases, don't attack users for using the primitives you give them in a natural way.
Seeing that the string has one reference is one way to do this. Even in an implementation that lacks reference counts, if the accumulator string object is local to a small function, the needed analysis can easily be performed to determine that a change to an efficient mutable implementation is required.
Yes, someone posted an example where this analysis is hard to perform. That's OK; fix the cases that users are likely to write, like CPython did, if you want your alternative Python implementation to be competitive. But telling people that they must use a completely different, and less ergonomic method to get their work done, when this isn't required by the most commonly used implementation, isn't a good look.
Reworking StringIO concatenation in Python
Posted Apr 4, 2020 17:28 UTC (Sat) by NYKevin (subscriber, #129325) [Link]
The "example where this analysis is hard to perform" is just that - an example. The analysis is hard to perform any time you have a string accessible via a variable (regardless of scope), you do anything that could potentially call a function (like dereferencing an attribute, using any of the infix operators, etc.), and then you extend the string with + or +=. In practice, this is pretty much every case that users are likely to write, except for completely trivial cases like foo = "hello"; foo += "world". If you have code in between the initialization and the concatenation, it's probably intractable.
Reworking StringIO concatenation in Python
Posted Apr 9, 2020 19:37 UTC (Thu) by njs (subscriber, #40338) [Link]
> Way back in the 90s, Microsoft had their own C++ string class, part of Microsoft Foundation Classes (this was before std::string existed), which was widely used, and repeated string appends had a quadratic cost, because they made the rookie mistake (which subsequently became a common interview question) of growing the buffer by adding just enough room to store the new piece.
CPython's += implementation for strings also makes this "rookie mistake". It's not really a rookie mistake in their case, but rather an intentional design decision. All of CPython's mutable types like 'list' and 'bytearray' have amortized linear cost on repeated appends. But since 'str' objects are incredibly common and very rarely mutated in-place, you don't want to be carrying around extra buffer space inside every 'str' object just in case someone tries to append. So 'str' objects are always sized to be exactly the size they need to be, with no spare buffer space.
CPython's optimization for += is that in many cases it can use 'realloc' to extend the underlying buffer to the new size, versus a naive implementation that would always 'malloc' a new buffer + copy the data into it. And if you get lucky, the underlying memory allocator might find that it has some spare space free immediately after the string allocation, so 'realloc' might be able to extend the allocation in-place without copying.
You're particularly likely to get lucky with large buffers, since heap implementations will often round those up to whole page sizes, and these are the buffers where the quadratic overhead would hurt the most.
But CPython's += handling is still quadratic in general, even with the optimization, and AFAICT it's exactly the same as the old MS behavior that you're criticizing.
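The over-allocation of mutable containers that njs describes can be seen with sys.getsizeof(); the reported sizes are CPython implementation details, so only the pattern matters:

```python
import sys

lst = []
sizes = []
for i in range(16):
    lst.append(i)
    sizes.append(sys.getsizeof(lst))
# In CPython the reported size stays flat across several appends and then
# jumps: the list keeps spare capacity, so appends are amortized O(1).
# A str carries no such spare capacity, which is why += on a string must
# realloc (or copy) every time.
print(sizes)
```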
Reworking StringIO concatenation in Python
Posted Apr 3, 2020 20:36 UTC (Fri) by perennialmind (guest, #45817) [Link]
Immutable strings are great, and even better when they come with natural accumulator types. str + str could result in an immutable sequence type holding references to both strings. Having introduced indirect stringy types, you might as well toss in a slice type. But to make that usable you'd need a "Stringy" concept in the platform, and Python went all-in on str.
To the real question: yes, people really do obviously-wrong-in-retrospect things like building up a big doc with +=. The inevitable reality check can be inconvenient.
Reworking StringIO concatenation in Python
Posted Apr 9, 2020 19:44 UTC (Thu) by njs (subscriber, #40338) [Link]
There are trade-offs though: it makes the implementation substantially more complex, requires a really strong abstraction boundary, and can have surprising performance cliffs where innocent-looking code tweaks can suddenly cause 1000x slowdowns.
CPython OTOH has always optimized for simplicity of implementation and has a long tradition of letting users peek through the abstractions when necessary (e.g. using the C API). These are really beneficial in a lot of other cases, but the situation with 'str' concatenation is arguably one of the downsides.
Reworking StringIO concatenation in Python
Posted Apr 10, 2020 17:51 UTC (Fri) by perennialmind (guest, #45817) [Link]
Python wouldn't be Python with that kind of compromise on the ability to reason about a core type. I agree with thumperward: I'd rather string addition buf += 'foo' had been disallowed and string concatenation spelled differently. It's easy enough to write buf = f'{buf}foo' today, but I think it'd be a strong clue that buf isn't buffering.
When possible, I'd rather leave the buffering to str.join:
buf = str.join(
    'foo' for i in range(50000)
)
print(buf)
Except that it's an instance method, not a static method, so it has to be written ''.join, which is irritating and unintuitive.
Reworking StringIO concatenation in Python
Posted Apr 10, 2020 23:54 UTC (Fri) by ABCD (subscriber, #53650) [Link]
Because it's a method of the str class, it can also be written:
buf = str.join('', ('foo' for i in range(50000)))
print(buf)
You just have to pass the string to join on as the first argument, because all instance methods are effectively static methods that take self as the first argument; the generator has to be parenthesized because it is no longer (syntactically) the only argument to join.
Why not allow writing of lists as is?
Posted Apr 2, 2020 14:15 UTC (Thu) by kleptog (subscriber, #1183) [Link]
If you take a step back, what is the result of such a concatenation going to be used for? Almost always it's going to be written out to a file or socket. The interesting thing is that such I/O never actually needs to manifest the string as a whole; it can write out the individual parts. Which is essentially the optimisation Erlang uses in this context: it defines the concept of an IOString, which is a string or a list of IOStrings. These can be passed anywhere a string would be used in I/O.
Translating this to Python, perhaps a better idea is to allow users to pass such lists of strings directly to socket.write(), then you can skip the whole joining in the first place. In other words, why not make this work:
s = []
for i in range(50000):
    s.append("foo")
sys.stdout.write(("prefix", s, "suffix"))
This also integrates well with template rendering, which often spends a lot of time joining strings.
It's a bit late to change the semantics of print(["foo"]), but sys.stdout.write([]) simply raises an error today.
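For the flat (non-nested) part of this idea, file-like objects already have something close: writelines() accepts any iterable of strings, although it still issues one write per item rather than a single gathered call. A sketch using io.StringIO as a stand-in stream:

```python
import io

parts = ["prefix"] + ["foo"] * 3 + ["suffix"]
buf = io.StringIO()
buf.writelines(parts)   # takes any iterable of str; despite the name, no newlines are added
print(buf.getvalue())   # prefixfoofoofoosuffix
```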
Why not allow writing of lists as is?
Posted Apr 2, 2020 18:24 UTC (Thu) by mgedmin (subscriber, #34497) [Link]
Socket objects, OTOH, don't have .write() at all -- they have a .send() and several other variations.
Why not allow writing of lists as is?
Posted Apr 3, 2020 16:10 UTC (Fri) by cortana (subscriber, #24596) [Link]
Why not allow writing of lists as is?
Posted Apr 2, 2020 18:43 UTC (Thu) by NYKevin (subscriber, #129325) [Link]
You can already do that with map(sys.stdout.write, s). But Python 3 is lazy, so you actually need something like runes_written = sum(map(sys.stdout.write, s)), because otherwise it won't bother to evaluate the writes.
(POSIX allows write(2) to perform a partial write. So you'd actually need to wrap raw streams in something like io.BufferedWriter to do the necessary buffering, or else a string could get cut off. Fortunately, Python usually gives you such buffered streams by default when you call open(), unless you disable buffering. None of this applies to text streams, including stdio, because they have to be buffered for the Unicode encode/decode anyway.)
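The laziness of map() is easy to demonstrate with an in-memory stream standing in for sys.stdout:

```python
import io

buf = io.StringIO()
writes = map(buf.write, ["a", "b", "c"])  # map() is lazy: nothing written yet
assert buf.getvalue() == ""
chars = sum(writes)                       # forcing the iterator performs the writes
assert buf.getvalue() == "abc"
assert chars == 3                         # each write() returns its character count
```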
Why not allow writing of lists as is?
Posted Apr 3, 2020 12:17 UTC (Fri) by kleptog (subscriber, #1183) [Link]
> You can already do that with map(sys.stdout.write, s).
Not quite, because that's not recursive. You would need to flatten the list first so write(['prefix', ['foo', 'bar'], 'suffix']) works. And it calls write() many times, whereas scatter/gather techniques could be used at the OS level to make it a single OS call.
However, this sort of thing is much more important in situations where memory usage is an issue. For example, in webservers it's useful to be able to stream templated content without making unnecessary copies; any static content never needs to be copied. But being lightweight has never been one of Python's goals.
Another possibility: if write() could accept an iterator, then you can leave all the magic to the user.
Incidentally, I've often wanted a recursive flatten method in Python precisely for this reason. It's a pity itertools doesn't include one.
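Such a helper is a short generator function; this flatten() is a hypothetical sketch, not anything in itertools:

```python
def flatten(items):
    # Recursively yield leaf items from arbitrarily nested lists/tuples
    for item in items:
        if isinstance(item, (list, tuple)):
            yield from flatten(item)
        else:
            yield item

nested = ["prefix", ["foo", ["bar"]], "suffix"]
print(list(flatten(nested)))   # ['prefix', 'foo', 'bar', 'suffix']
```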
Why not allow writing of lists as is?
Posted Apr 5, 2020 21:35 UTC (Sun) by NYKevin (subscriber, #129325) [Link]
(Frankly, I'm puzzled why so many people claim to need recursive flattening in the first place. I've never needed it outside of job interviews, and I'm a bit baffled as to what kind of application logic would even produce unwanted list nesting in the first place. The list API maintains a distinction between extend() and append() for precisely the purpose of creating flat lists as you go, and on the lazy side of things, generator functions have yield from.)
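The append()/extend() distinction mentioned here is what keeps a list flat as it is built:

```python
nested = []
nested.append([1, 2])   # append() stores the list object itself, nesting it
flat = []
flat.extend([1, 2])     # extend() stores the elements, keeping it flat
print(nested)           # [[1, 2]]
print(flat)             # [1, 2]
```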
str.join
Posted Apr 3, 2020 7:13 UTC (Fri) by tnoo (subscriber, #20427) [Link]
', '.join(list_of_string)
str.join
Posted Apr 3, 2020 21:39 UTC (Fri) by nevyn (subscriber, #33129) [Link]
str.join
Posted Apr 4, 2020 13:20 UTC (Sat) by thumperward (guest, #34368) [Link]
str.join
Posted Apr 7, 2020 21:01 UTC (Tue) by nix (subscriber, #2304) [Link]
str.join's syntax is... acceptable, if weird, when there is a literal separator, and downright attractive when the separator is itself variable: but that's far from the common case, and all the common cases look strange at best.
str.join
Posted Apr 7, 2020 21:04 UTC (Tue) by Cyberax (✭ supporter ✭, #52523) [Link]
It should be. A proper replacement: "${foo}${bar}".
str.join
Posted Apr 10, 2020 18:57 UTC (Fri) by flussence (subscriber, #85566) [Link]
Real-world examples: ++ . ~
Reworking StringIO concatenation in Python
Posted Apr 5, 2020 10:28 UTC (Sun) by mathewcohle (guest, #118622) [Link]
fn main() {
let string = "This " + "is.";
println!("{}", string);
}
yields a compiler error:
error[E0369]: cannot add `&str` to `&str`
--> src/main.rs:5:26
|
5 | let string = "This " + "is.";
| ------- ^ ----- &str
| | |
| | `+` cannot be used to concatenate two `&str` strings
| &str
|
help: `to_owned()` can be used to create an owned `String` from a string reference. >>>String concatenation appends the string on the right to the string on the left and may require reallocation.<<< This requires ownership of the string on the left
The help points right away at the problem (the exact problem that makes building up a string with + an anti-pattern in Python).
Not sure if there is anything to be done at this stage of Python to fix this the same "proper" way, but imho it adds to the pile of arguments for why it's unfortunate to put the burden of keeping this in mind on the programmer's shoulders (and yeah, RTFD, but if there are too many docs to read, it becomes an infeasible fix).
Reworking StringIO concatenation in Python
Posted Apr 8, 2020 22:30 UTC (Wed) by xi0n (subscriber, #138144) [Link]
Reworking StringIO concatenation in Python
Posted Apr 9, 2020 15:55 UTC (Thu) by mathewcohle (guest, #118622) [Link]
So yes, they are definitely different beasts, just wanted to point out that the issue (bug) might not exist by design of the language.