
Reworking StringIO concatenation in Python


By Jake Edge
April 1, 2020

Python string objects are immutable, so changing the value of a string requires that a new string object be created with the new value. That is fairly well-understood within the community, but there are some "anti-patterns" that arise; it is pretty common for new users to build up a longer string by repeatedly concatenating to the end of the "same" string. The performance penalty for doing that could be avoided by switching to a type that is geared toward incremental updates, but Python 3 has already optimized the penalty away for regular strings. A recent thread on the python-ideas mailing list explored this topic some.

Paul Sokolovsky posted his lengthy description of a fairly simple idea on March 29. The common anti-pattern of building up a string might look something like:

buf = ""
for i in range(50000):
    buf += "foo"
print(buf)

As the Python FAQ notes, though, each concatenation creates a new object, which leads to a quadratic runtime cost based on the total string length. The FAQ recommends using a list to collect up all of the string pieces, then calling the join() string method to turn the list into the final string. But Sokolovsky focused on a different mechanism in his post; the FAQ also suggests using the io.StringIO class in order to change strings in place. Using that instead of repeated concatenation might look like:

import io

buf = io.StringIO()
for i in range(50000):
    buf.write("foo")
print(buf.getvalue())
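The list-plus-join() idiom that the FAQ recommends accumulates the pieces cheaply, then assembles the final string in a single pass:

```python
# FAQ-recommended idiom: collect the pieces in a list, then
# join them all at once at the end.
pieces = []
for i in range(50000):
    pieces.append("foo")
buf = "".join(pieces)
print(buf)
```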

To make it easier for existing programs to be switched from one form to the other, he suggested adding a "+=" operator for StringIO as an alias for the write() method. Adding an __iadd__() method to the StringIO class would allow the write() call to be removed in favor of using +=. The buffer initialization and getvalue() call would still be needed, but those are each typically done in only one place, while the concatenation may be done in many. So a code base could fairly easily be switched from the anti-pattern to more proper Python just by creating the buffer instead of a string and getting its value where needed with getvalue(); the rest of the code can stay the same: "it will leave the rest of code intact, and not obfuscate the original content construction algorithm".
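What he proposed can be approximated today with a small subclass; this is only an illustrative sketch (the class name StringBuf is invented here), not his actual patch:

```python
import io

# Sketch of the proposal: a StringIO whose += aliases write().
class StringBuf(io.StringIO):
    def __iadd__(self, s):
        self.write(s)
        return self  # += rebinds the name to the returned object

buf = StringBuf()
for i in range(3):
    buf += "foo"
print(buf.getvalue())  # foofoofoo
```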

As Sokolovsky noted, though, his performance benchmarking shows that CPython 3 has already optimized for the anti-pattern. So even though it is still considered to be a bad practice, there is no real penalty for writing code of that sort in CPython 3, but only in that implementation of the language:

These results can be summarized as follows: of more than half-dozen Python implementations, CPython3 is the only implementation which optimizes for the dubious usage of an immutable string type as an accumulating character buffer. For all other implementations, unintended usage of str incurs overhead of about one order of magnitude, 2 order of magnitude for implementations optimized for particular usecases (this includes PyPy optimized for speed vs MicroPython/Pycopy optimized for small code size and memory usage).

The optimization, which is described by Paul Ganssle in a blog post, effectively allows CPython to treat the string as mutable in the case where there are no other references to it. In a loop like the one in the example, there is no other reference to the string object being used, so instead of creating a new object and freeing the old, it simply changes the existing object in place. CPython can detect that case because it uses reference counts on its objects for garbage collection; PyPy is not reference-counted, so it cannot use the same trick.
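One consequence of the scheme is visible from pure Python: once a second name refers to the string, += must bind a new object and the alias keeps its old value. That much is guaranteed language semantics; whether CPython actually reuses the buffer in the single-reference case is an internal detail.

```python
buf = ""
for i in range(1000):
    buf += "foo"     # one reference: CPython can grow it in place

alias = buf          # a second reference...
buf += "bar"         # ...so this must produce a new object
print(len(buf), len(alias))  # 3003 3000
```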

But Sokolovsky is trying to target the bad practice regardless of the (lack of a) performance impact. The practice is widespread; the optimization added to CPython is evidence that it needs addressing, he said. He suggested that other implementations can either follow the lead of CPython (if possible) or try to promote better practices: "This would require improving ergonomics of existing string buffer object, to make its usage less painful for both writing new code and refactoring existing." And, of course, he was advocating the latter.

He also noted that, since the performance problem does not really exist for CPython, it might be seen as an argument that there is nothing to fix. "This is related to a bigger [question] 'whether a life outside CPython exists', or put more formally, where's the border between Python-the-language and CPython-the-implementation." Beyond that, one could "fix" the problem by creating a new class derived from StringIO that has an __iadd__(), but that suffers from worse performance as well, which argues that the problem should be addressed in C in StringIO itself.

The overall reception to the idea was chilly, at best, perhaps partly fueled by Sokolovsky's somewhat aggressive tone in his original note and some of the followups. Andrew Barnert replied that the join() mechanism is really the better alternative:

It’s usually about 40% faster than using StringIO or relying on the string-concat optimization in CPython, it’s efficient across all implementations of Python, and it’s obvious _why_ it’s efficient. It can sometimes take more memory, but the [tradeoff] is usually worth it.

Barnert said that StringIO is meant to be a file object that resides in memory, so it is appropriate that its API does not support +=. He concluded with a third option for alternative Python implementations beyond the two that Sokolovsky presented:

Recognize that Python and CPython have been promoting str.join for this problem for decades, and most performance-critical code is already doing that, and make sure that solution is efficient, and [recognize] that poorly-written code is uncommon but does exist, and may take a bit more work to optimize than a 1-line change to optimize, but that’s acceptable—and not the responsibility of any alternate Python implementation to help with.

The problem with the join() mechanism is that it is somewhat non-intuitive, especially for those coming to Python from another language. As Barnert noted, though, it can use more memory as well. Sokolovsky attempted to measure the difference in memory use, but the technique he used was not entirely convincing. His focus would appear to be on embedded Python, such as his Pycopy Python implementation. Pycopy is descended from MicroPython, which he also worked on. For the embedded use case, StringIO may well be the better choice for building strings, at least from a memory perspective; is that enough of a reason to turn a file-like object (StringIO) into a string-like object, but only for concatenation (+=)? The consensus answer would seem to be "no".

There was some discussion of having a generalized mutable string type, though that was not at all what Sokolovsky was after; there are some good reasons why that idea has never really taken off for Python, as Christopher Barker described. "So I'd say it hasn't been done because (1) it's a lot of work and (2) it would be a bit of a pain to use, and not gain much at all."

The objections to the original idea are basically that += can be trivially implemented for a derived class of StringIO; if the performance of that is not sufficient, switching to join() would fix that problem. The existing "join() on a list of strings" idiom works well for most people and nearly all use cases; it is the preferred way to solve this problem in Python, so making another idiom more usable is muddying the water to a certain degree. As The Zen of Python puts it: "There should be one-- and preferably only one --obvious way to do it."

On the other hand, CPython is the dominant player in the ecosystem, as Steven D'Aprano pointed out; that means applications can be written to take advantage of CPython quirks. But going the other way, even if all of the other Python implementations agreed on a change, it would not really be used unless CPython follows suit.

It seems to me that Paul makes a good case that, unlike the string concat optimization, just about every interpreter could add this to StringIO without difficulty or great cost. Perhaps they could even get together and agree to all do so.

But unless CPython does so too, it won't do them much good, because hardly anyone will take advantage of it. When one platform dominates 90% of the ecosystem, one can sensibly write code that depends on that platform's specific optimizations, but going the other way, not so much.

That is something for the CPython community to keep in mind. The existence of the other implementations of the language may provide opportunities to make some changes that are meant to be CPython-only (or at least not mandated for Python the language). But those changes can still get baked into the language via the back door—because most Python code runs on CPython.

In the final analysis, it is a pretty minuscule change being sought. The existence of the string concatenation optimization indicates that there is interest in helping "badly written" code to some extent, but perhaps adding += to StringIO is a bridge too far. There definitely does not seem to be any kind of groundswell of support for the idea and there are costs, beyond just the (minimal) code maintenance required, including in documentation and user education. The benefits, which some find to be dubious to begin with, are seemingly not enough to outweigh them.


Index entries for this article
Python: Enhancements



Reworking StringIO concatenation in Python

Posted Apr 2, 2020 1:48 UTC (Thu) by gus3 (guest, #61103) [Link]

This calls back to Epigram #120: Adapting old programs to fit new machines usually means adapting new machines to behave like old ones.

In this case, I think the "new machine" is Python, and the "old machine" is C++ blended with Smalltalk. Some programmers, familiar with C++ or Smalltalk idioms, wrote Python code with the same idioms, and expected similar behavior. Their code got the job done, but the running code behaved a lot worse in Python.

If there's a way to make the += idiom work for Python strings, I'm all for it. There's a lot of Python code deployed in corner-cases and edge spaces.

But naturally, the next question will be, how do we make this next big thing work:

str = "abc"
str *= 300

Or even:

str = "abc"
str -= "c"

How far do we follow the white rabbit?

Reworking StringIO concatenation in Python

Posted Apr 2, 2020 2:07 UTC (Thu) by NYKevin (subscriber, #129325) [Link]

Augmented multiplication already works. Unless you have a different definition of "work" than I do?

Reworking StringIO concatenation in Python

Posted Apr 2, 2020 8:28 UTC (Thu) by mina86 (guest, #68442) [Link]

Might be worth noting that Java faced the same problem and solved it at compile time by replacing a chain of string concatenations with a use of a StringBuilder.

Reworking StringIO concatenation in Python

Posted Apr 5, 2020 0:25 UTC (Sun) by jmaa (guest, #128856) [Link]

The (C?)Python community is famously reluctant to perform bytecode optimizations, but if Mypy & co. aren't as reluctant, it might be a viable route.

Do you have a source for which string concat patterns the Java compiler can optimize? Based on an article [1] from 2015, Java 8 wouldn't optimize the loop case demonstrated in the LWN article, but it shouldn't be too hard with a slightly beefier analysis, and they may already have added it. I can't get my Java 13 compiler to generate the same bytecode as shown in [1]; but this may be because of a range of alternative optimizations (loop unrolling, constant folding, whole program, etc.)

[1]: http://www.pellegrino.link/2015/08/22/string-concatenatio...

Reworking StringIO concatenation in Python

Posted Apr 17, 2020 13:24 UTC (Fri) by kevincox (guest, #93938) [Link]

I don't have a source but I think it only optimizes sequences of + in a single expression.

So "return a + b + c" is equivalent to "StringBuilder sb = new StringBuilder(); sb.append(a); sb.append(b); sb.append(c); return sb.toString()".

Reworking StringIO concatenation in Python

Posted Apr 23, 2020 20:55 UTC (Thu) by jezuch (subscriber, #52988) [Link]

Straight from the horse's mouth:

https://cl4es.github.io/2019/05/14/String-Concat-Redux.html

The foundation is JEP 280:

https://openjdk.java.net/jeps/280

Enjoy!

Reworking StringIO concatenation in Python

Posted Apr 2, 2020 12:45 UTC (Thu) by FLHerne (guest, #105373) [Link]

[flh ~/]$ python
>>> str = "abc"
>>> str *= 3
>>> str
'abcabcabc'

Reworking StringIO concatenation in Python

Posted Apr 2, 2020 23:20 UTC (Thu) by gus3 (guest, #61103) [Link]

*facepalm*

Note to self: RTFM.

Reworking StringIO concatenation in Python

Posted Apr 2, 2020 8:42 UTC (Thu) by Homer512 (subscriber, #85295) [Link]

Can't PyPy do an escape analysis to figure out whether it can do this kind of optimization?

Reworking StringIO concatenation in Python

Posted Apr 3, 2020 4:14 UTC (Fri) by NYKevin (subscriber, #129325) [Link]

For simple cases, sorta kinda yes, but in practice, not really. Here's why not:

Python 3.7.3 (v3.7.3:ef4ec6ed12, Mar 25 2019, 21:26:53) [MSC v.1916 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import inspect
>>> def foo():
...   x = "this is a string"
...   return bar()
...
>>> def bar():
...   x = inspect.stack()[1].frame.f_locals['x']
...   return x + " which has escaped from its original context"
...
>>> foo()
'this is a string which has escaped from its original context'

An escape analysis would have to track that sort of shenanigans across runtime imports, monkey-patching, polymorphism, etc. and this rapidly becomes completely impractical. Or alternatively, it only works in completely toy examples where there's no meaningful performance advantage. It's Not Worth It (TM) when you can just tell the user to write ''.join(xs) instead.

CPython, on the other hand, just has to check the refcount. That's completely trivial, so CPython implements this "feature" and PyPy doesn't.

Reworking StringIO concatenation in Python

Posted Apr 2, 2020 11:16 UTC (Thu) by dottedmag (subscriber, #18590) [Link]

Or just the item from Python FAQ list and let implementations improve.

Languages aren't created to make their implementations simple, they are created to make programming more palatable.

Reworking StringIO concatenation in Python

Posted Apr 2, 2020 11:17 UTC (Thu) by dottedmag (subscriber, #18590) [Link]

*drop the item

Reworking StringIO concatenation in Python

Posted Apr 2, 2020 13:24 UTC (Thu) by Paf (subscriber, #91811) [Link]

I feel like it’s worth considering just how user unfriendly just saying “this is badly written code” is, or asking users to know about a string IO class.

If Python is really aiming to include non-specialists (and obviously it is), then they really need to exert themselves to make the obvious patterns non-disastrous. I know that’s tough, but “if you want to frob this vorb (where frobbing is a very natural, simple thing to do to a vorb), you can use a vorb but it’s bad, you actually need a vorb.frobulatable” and then that’s good...

That’s really, really ugly, and the kind of thinking that says “treating a string like other simple types is just bad programming, whatever to those guys” is very... bubble-thinking for a language that is supposed to pull in non-specialists.

Reworking StringIO concatenation in Python

Posted Apr 2, 2020 17:55 UTC (Thu) by esemwy (guest, #83963) [Link]

Truly. If they’d really been serious about immutable strings, why is there a “+=“ operator?

Reworking StringIO concatenation in Python

Posted Apr 2, 2020 18:26 UTC (Thu) by NYKevin (subscriber, #129325) [Link]

Because there's a + operator. If you have a + operator, it's generally a Bad Idea to not provide +=. That would be really confusing.

Note that, in Python, augmented assignment operators (left += right, left *= right, etc.) work in one of two ways:

  • For immutable types such as numbers and tuples, they simply return left + right. The language then assigns the result to left automatically.
  • For mutable types such as lists, they mutate left in-place and return left. The language then assigns left to left automatically (which is a no-op).

Strings are on the "immutable" side of this divide, so they would ordinarily require a realloc and copy. CPython can detect the case where the string only has one reference and implement it as if the string was mutable. Other implementations can't do that because they're not reference counted. If strings were mutable, it wouldn't be possible to use them as dictionary keys, which would break all of Python (lots of language internals are implemented as dictionaries with strings as keys). So I think it's fair to say that they really are "serious about immutable strings." It's just that they made the mistake of making a special case to deal with a common anti-pattern, and now people are reliant on that exception. The special case isn't even all that robust; if the string has any other references (even just another local variable), then it goes back to the slow path.
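The two behaviors of augmented assignment are easy to see side by side; this is guaranteed language semantics, not an implementation detail:

```python
nums = [1, 2]
alias = nums
nums += [3]          # mutable: the list is extended in place
print(alias)         # [1, 2, 3] -- the alias sees the change

s = "ab"
s_alias = s
s += "c"             # immutable: a new object is bound to s
print(s_alias)       # ab -- the alias still names the old string
```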

The real question, IMHO, is whether any of this should matter. O(n^2) is quite fast when n=5. Are people really concatenating hundreds or thousands of strings in this fashion in the first place?

Reworking StringIO concatenation in Python

Posted Apr 3, 2020 13:29 UTC (Fri) by excors (subscriber, #95769) [Link]

> If strings were mutable, it wouldn't be possible to use them as dictionary keys, which would break all of Python (lots of language internals are implemented as dictionaries with strings as keys).

Languages like Perl manage to have mutable strings and dicts(/hashes). It effectively does a copy-by-value when adding a key to the hash, and copy-by-value when passing keys back to the application (e.g. when iterating over the hash contents), so the hash keys remain immutable while regular strings can still be mutated efficiently.
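A rough Python sketch of that copy-on-insert idea, using bytearray as a stand-in for a hypothetical mutable string type (both the class name and the scheme are invented here for illustration):

```python
# Snapshot mutable keys on insertion, roughly the way Perl
# handles hash keys as described above.
class CopyKeyDict(dict):
    def __setitem__(self, key, value):
        if isinstance(key, bytearray):
            key = bytes(key)  # a frozen copy becomes the real key
        super().__setitem__(key, value)

d = CopyKeyDict()
k = bytearray(b"spam")
d[k] = 1
k += b"!"            # mutating the original leaves the dict intact
print(list(d))       # [b'spam']
```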

It's far too late for Python to change now, and I expect it would have ramifications on a lot of minor details of the language design, but is there a fundamental reason why copy-on-value keys couldn't work in a Python-like language?

(I think that technically Perl has a global table of reference-counted immutable hash-key strings, so the "copy-by-value" when adding to the hash is actually just a lookup in that table, to save time and memory when the same key is used in many hashes. And I think code like $foo{"bar"} gets compiled to bytecode that directly references "bar" from that global table, so in the common OOP case where hashes are used with constant keys there's no run-time copying at all.)

Reworking StringIO concatenation in Python

Posted Apr 3, 2020 21:27 UTC (Fri) by JoeBuck (subscriber, #2330) [Link]

I only write quick and dirty little Python scripts, but I've often used the += string accumulation pattern, and since CPython does the right thing it hasn't been a problem. I have no plans to change that.

Way back in the 90s, Microsoft had their own C++ string class, part of Microsoft Foundation Classes (this was before std::string existed), which was widely used, and repeated string appends had a quadratic cost, because they made the rookie mistake (which subsequently became a common interview question) of growing the buffer by adding just enough room to store the new piece. People expect the classic += string append pattern to perform well. Microsoft defenders used to call people who did the obvious thing and got awful performance ignorant. This was wrong. The implementation was broken.

If, in your implementation, this pattern (or antipattern if you insist) has quadratic cost, it is a bug in your implementation (or at minimum a quality defect). Fix it, at least for the common cases, don't attack users for using the primitives you give them in a natural way.

Seeing that the string has one reference is one way to do this. Even in an implementation that lacks reference counts, if the accumulator string object is local to a small function, the needed analysis can easily be performed to determine that a change to an efficient mutable implementation is required.

Yes, someone posted an example where this analysis is hard to perform. That's OK; fix the cases that users are likely to write, like CPython did, if you want your alternative Python implementation to be competitive. But telling people that they must use a completely different, and less ergonomic method to get their work done, when this isn't required by the most commonly used implementation, isn't a good look.

Reworking StringIO concatenation in Python

Posted Apr 4, 2020 17:28 UTC (Sat) by NYKevin (subscriber, #129325) [Link]

> Yes, someone posted an example where this analysis is hard to perform. That's OK; fix the cases that users are likely to write, like CPython did, if you want your alternative Python implementation to be competitive.

The "example where this analysis is hard to perform" is just that - an example. The analysis is hard to perform any time you have a string accessible via a variable (regardless of scope), you do anything that could potentially call a function (like dereferencing an attribute, using any of the infix operators, etc.), and then you extend the string with + or +=. In practice, this is pretty much every case that users are likely to write, except for completely trivial cases like foo = "hello"; foo += "world". If you have code in between the initialization and the concatenation, it's probably intractable.

Reworking StringIO concatenation in Python

Posted Apr 9, 2020 19:37 UTC (Thu) by njs (subscriber, #40338) [Link]

> I only write quick and dirty little Python scripts, but I've often used the += string accumulation pattern, and since CPython does the right thing it hasn't been a problem. I have no plans to change that.

> Way back in the 90s, Microsoft had their own C++ string class, part of Microsoft Foundation Classes (this was before std::string existed), which was widely used, and repeated string appends had a quadratic cost, because they made the rookie mistake (which subsequently became a common interview question) of growing the buffer by adding just enough room to store the new piece.

CPython's += implementation for strings also makes this "rookie mistake". It's not really a rookie mistake in their case, but rather an intentional design decision. All of CPython's mutable types like 'list' and 'bytearray' have amortized linear cost on repeated appends. But since 'str' objects are incredibly common and very rarely mutated in-place, you don't want to be carrying around extra buffer space inside every 'str' object just in case someone tries to append. So 'str' objects are always sized to be exactly the size they need to be, with no spare buffer space.

CPython's optimization for += is that in many cases it can use 'realloc' to extend the underlying buffer to the new size, versus a naive implementation that would always 'malloc' a new buffer + copy the data into it. And if you get lucky, the underlying memory allocator might find that it has some spare space free immediately after the string allocation, so 'realloc' might be able to extend the allocation in-place without copying.

You're particularly likely to get lucky with large buffers, since heap implementations will often round those up to whole pages sizes, and these are the buffers where the quadratic overhead would hurt the most.

But CPython's += handling is still quadratic in general, even with the optimization, and AFAICT it's exactly the same as the old MS behavior that you're criticizing.
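The over-allocation that njs describes for mutable types can be observed with sys.getsizeof(): a list's reported size jumps in steps rather than on every append (the exact step pattern is a CPython detail):

```python
import sys

sizes = set()
lst = []
for i in range(100):
    lst.append(i)
    sizes.add(sys.getsizeof(lst))

# Far fewer distinct sizes than appends: capacity grows in chunks.
print(len(sizes) < 100, len(sizes) > 1)
```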

Reworking StringIO concatenation in Python

Posted Apr 3, 2020 20:36 UTC (Fri) by perennialmind (guest, #45817) [Link]

Immutable strings are great, and even better when they come with natural accumulator types. str + str could result in an immutable sequence type holding references to both strings. Having introduced indirect stringy types, you might as well toss in a slice type. But to make that usable you'd need a "Stringy" concept in the platform, and Python went all-in on str.

To the real question, yes, people really do obviously-wrong-in-retrospect things like building up a big doc with +=. The inevitable reality check can be inconvenient.

Reworking StringIO concatenation in Python

Posted Apr 9, 2020 19:44 UTC (Thu) by njs (subscriber, #40338) [Link]

Javascript engines commonly play tricks like this under the hood. You don't even need a 'stringy' type; they just build all the cleverness into the regular string type.

There are trade-offs though: it makes the implementation substantially more complex, requires a really strong abstraction boundary, and can have surprising performance cliffs where innocent-looking code tweaks can suddenly cause 1000x slowdowns.

CPython OTOH has always optimized for simplicity of implementation and has a long tradition of letting users peek through the abstractions when necessary (e.g. using the C API). These are really beneficial in a lot of other cases, but the situation with 'str' concatenation is arguably one of the downsides.

Reworking StringIO concatenation in Python

Posted Apr 10, 2020 17:51 UTC (Fri) by perennialmind (guest, #45817) [Link]

Python wouldn't be Python with that kind of compromise on the ability to reason about a core type. I agree with thumperward: I'd rather string addition
    buf += 'foo'
had been disallowed and string concatenation spelled differently. It's easy enough to write
    buf = f'{buf}foo'
today, but I think it'd be a strong clue that buf isn't buffering.

When possible, I'd rather leave the buffering to str.join:

    buf = str.join(
        'foo' for i in range(50000)
    )
    print(buf)

Except that it's an instance method, not a static, so it has to be written ''.join, which is irritating and unintuitive.

Reworking StringIO concatenation in Python

Posted Apr 10, 2020 23:54 UTC (Fri) by ABCD (subscriber, #53650) [Link]

Because it's a method of the str class, it can also be written:

    buf = str.join('',
        ('foo' for i in range(50000))
    )
    print(buf)

You just have to pass the string to join on as the first argument, because all instance methods are effectively static methods that take self as the first argument; the generator has to be parenthesized because it is no longer (syntactically) the only argument to join.

Why not allow writing of lists as is?

Posted Apr 2, 2020 14:15 UTC (Thu) by kleptog (subscriber, #1183) [Link]

If you take a step back, what is the result of such a concatenation going to be used for? Almost always it's going to written out to a file or socket. The interesting thing is that such I/O never actually needs to manifest the string as a whole, it can write out the individual parts. Which is essentially the optimisation Erlang uses in this context. It defines the concept of an IOString which is a string, or a list of IOStrings. These can be passed anywhere where a string would be used in I/O.

Translating this to Python, perhaps a better idea is to allow users to pass such lists of strings directly to socket.write(); then you can skip the whole joining in the first place. In other words, why not make this work:

s = []
for i in range(50000):
    s.append("foo")
sys.stdout.write(("prefix", s, "suffix"))

This also integrates well with template rendering, which often spends a lot of time joining strings.

It's a bit late to change the semantics of print(["foo"]), but sys.stdout.write([]) simply raises an error today.

Why not allow writing of lists as is?

Posted Apr 2, 2020 18:24 UTC (Thu) by mgedmin (subscriber, #34497) [Link]

Python file objects have a .writelines() method that takes a sequence of strings. It doesn't add any newlines between them, despite the unfortunate name.

Socket objects, OTOH, don't have .write() at all -- they have a .send() and several other variations.
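Despite the name, writelines() really is a separator-free concatenating write, which makes it usable for exactly the pattern under discussion:

```python
import io

buf = io.StringIO()
buf.writelines(["foo", "bar", "baz"])  # no newlines added
print(buf.getvalue())  # foobarbaz
```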

Why not allow writing of lists as is?

Posted Apr 3, 2020 16:10 UTC (Fri) by cortana (subscriber, #24596) [Link]

file.writelines matches up with file.readlines which returns a list of strings each of which ends in \n (except the last one iff the file being read doesn't end in a newline). So it's not as badly named a method as first appears.

Why not allow writing of lists as is?

Posted Apr 2, 2020 18:43 UTC (Thu) by NYKevin (subscriber, #129325) [Link]

You can already do that with map(sys.stdout.write, s). But Python 3 is lazy, so you actually need something like runes_written = sum(map(sys.stdout.write, s)), because otherwise it won't bother to evaluate the writes.

(POSIX allows write(2) to perform a partial write. So you'd actually need to wrap raw streams in something like io.BufferedWriter to do the necessary buffering, or else a string could get cut off. Fortunately, Python usually gives you such buffered streams by default when you call open(), unless you disable buffering. None of this applies to text streams, including stdio, because they have to be buffered for the Unicode encode/decode anyway.)
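The laziness being described is easy to demonstrate with an in-memory stream: nothing is written until the map object is consumed:

```python
import io

parts = ["a", "b", "c"]
out = io.StringIO()
writes = map(out.write, parts)   # lazy: no writes have happened yet
print(out.getvalue() == "")      # True

chars = sum(writes)              # consuming the map performs the writes
print(out.getvalue(), chars)     # abc 3
```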

Why not allow writing of lists as is?

Posted Apr 3, 2020 12:17 UTC (Fri) by kleptog (subscriber, #1183) [Link]

> You can already do that with map(sys.stdout.write, s).

Not quite, because that's not recursive. You would need to flatten the list first so write(['prefix', ['foo', 'bar'], 'suffix']) works. And it calls write() many times whereas scatter/gather techniques could be used at OS level to make it a single OS call.

However, this sort of thing is much more important in situations where memory usage is an issue. For example, in webservers it's useful to be able to stream templated content without making unnecessary copies; any static content never needs to be copied. But being lightweight has never been one of Python's goals.

Another possibility is if write() could accept an iterator, then you can leave all the magic to the user.

Incidentally, I've often wanted a recursive flatten method in Python precisely for this reason. It's a pity itertools doesn't include one.
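A minimal sketch of such a flatten helper, restricted to lists so that strings are treated as atoms (the restriction matters, since a string is itself a sequence of one-character strings):

```python
def flatten(items):
    # Recursively yield leaves, descending only into lists so
    # that strings (and other sequences) stay atomic.
    for item in items:
        if isinstance(item, list):
            yield from flatten(item)
        else:
            yield item

print(list(flatten(["prefix", ["foo", ["bar"]], "suffix"])))
# ['prefix', 'foo', 'bar', 'suffix']
```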

Why not allow writing of lists as is?

Posted Apr 5, 2020 21:35 UTC (Sun) by NYKevin (subscriber, #129325) [Link]

You can't recursively flatten everything. A string is a sequence of one-character strings. So if you try to recursively flatten it, you overflow the stack. Instead, you would need some kind of "list-only recursive flatten," but that sort of hard-coded type sniffing is at best a code smell. I'm skeptical of building it into write().

(Frankly, I'm puzzled why so many people claim to need recursive flattening in the first place. I've never needed it outside of job interviews, and I'm a bit baffled as to what kind of application logic would even produce unwanted list nesting in the first place. The list API maintains a distinction between extend() and append() for precisely the purpose of creating flat lists as you go, and on the lazy side of things, generator functions have yield from.)

str.join

Posted Apr 3, 2020 7:13 UTC (Fri) by tnoo (subscriber, #20427) [Link]

To concatenate many strings, there is str.join which takes a list and a separator.

', '.join(list_of_strings)

str.join

Posted Apr 3, 2020 21:39 UTC (Fri) by nevyn (subscriber, #33129) [Link]

Yes, that's the hack you need to use in non-CPython. However in CPython doing concatenation has worked fine for about a decade? Apparently PyPy still cares, which sucks for PyPy.

str.join

Posted Apr 4, 2020 13:20 UTC (Sat) by thumperward (guest, #34368) [Link]

It's not a hack. It's how the language supposes you're to do it. The problem here is that + operations should never have been permitted on strings, because the existence of two ways to do something inevitably results in people choosing the wrong one. This is supposed to be one of Python's core philosophies and it's not at all surprising that most of the more frustrating things about modern Python are similarly about being presented with unnecessary choice.

str.join

Posted Apr 7, 2020 21:01 UTC (Tue) by nix (subscriber, #2304) [Link]

So your position is that the easy-to-remember, obvious syntax of "foo"+"bar" should have been prohibited in favour of, uh, ''.join(["foo","bar"])? You seriously think the latter is a nicer syntax in the no-separators case, and you'd rather see code full of that than of +-for-concatenation? (To make a bad syntax worse, note how, in many fonts, just switching quote styles in the middle of that one makes it look like there's an unterminated quote in there.)

str.join's syntax is... acceptable, if weird, when there is a literal separator, and downright attractive when the separator is itself variable: but that's far from the common case, and all the common cases look strange at best.

str.join

Posted Apr 7, 2020 21:04 UTC (Tue) by Cyberax (✭ supporter ✭, #52523) [Link]

> So your position is that the easy-to-remember, obvious syntax of "foo"+"bar" should have been prohibited
It should be. A proper replacement: "${foo}${bar}".

str.join

Posted Apr 10, 2020 18:57 UTC (Fri) by flussence (subscriber, #85566) [Link]

Using anything else as an infix concat operator instead of ambiguating + would work too.
Real-world examples: ++ . ~

Reworking StringIO concatenation in Python

Posted Apr 5, 2020 10:28 UTC (Sun) by mathewcohle (guest, #118622) [Link]

Cannot help myself but it looks like Rust gets it right once again:

fn main() {
    let string = "This " + "is.";
    println!("{}", string);
}

yields compiler error:

error[E0369]: cannot add `&str` to `&str`
 --> src/main.rs:5:26
  |
5 |     let string = "This " + "is.";
  |                  ------- ^ ----- &str
  |                  |       |
  |                  |       `+` cannot be used to concatenate two `&str` strings
  |                  &str
  |
help: `to_owned()` can be used to create an owned `String` from a string reference. >>>String concatenation appends the string on the right to the string on the left and may require reallocation.<<< This requires ownership of the string on the left

The help message points right away at the problem (the exact reason that building a string with + is an anti-pattern in Python).

Not sure if there is anything to be done at this stage of Python to fix this the same "proper" way, but imho it adds to the pile of arguments for why it's unfortunate to put the burden of keeping this in mind on programmers' shoulders (and yeah, RTFD, but if there are too many docs to read, reading them all becomes infeasible).

Reworking StringIO concatenation in Python

Posted Apr 8, 2020 22:30 UTC (Wed) by xi0n (subscriber, #138144) [Link]

Rust uses a different memory management model (static ownership checks instead of a GC) and that’s the main reason this problem doesn’t apply to it. You may call that a happy side effect, but it doesn’t really provide much of a lesson to Python besides “be a completely different language”.

Reworking StringIO concatenation in Python

Posted Apr 9, 2020 15:55 UTC (Thu) by mathewcohle (guest, #118622) [Link]

Agreed; moreover, Python tries to be "beginner friendly", while I cannot imagine how a data scientist would move into more general programming while being constantly bullied by the compiler :)

So yes, they are definitely different beasts; I just wanted to point out that the issue (bug) might not exist by design of the language.


Copyright © 2020, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds