Development

A Brief Look at Cpython String

Ondřej Měkota

Have you ever noticed that a string with just a few characters in Python uses several dozen bytes of memory? We handle a lot of short strings in Python in our team, that works on matching offers of the same products together, and I have been wondering why Python seems to store strings seemingly so inefficiently.

In this article, we will take a brief look at how Python strings are implemented, why they consume so much memory and we will examine certain interesting aspects of Python string.

In the previous blog posts (the first and second), we analysed the implementation of float and list types in Python.

We have established that every Python object has a common ancestor, a common basis: the PyObject struct.

In this article I assume the reader is familiar with character representations and encodings (ASCII, UTF-8 – why variable character length is used) to some extent. Also, I assume a 64-bit *nix system, in any memory calculation.

Python String Structures

In lower-level languages such as C, a „string“ is commonly represented as an array of characters (generally integers; in C, types such as char, wchar_t are used) terminated by null character.

Python has three string representations (but the last one, PyUnicodeObject, is deprecated and it will be removed in Python 3.12 so I’ve omitted it from this article). The first string object, PyASCIIObject, is for ASCII only strings and the other, PyCompactUnicodeObject, is for strings that contain at least one non-ASCII character. PyCompactUnicodeObject actually uses PyASCIIObject as its base and extends it.

PyObject_HEAD is a macro for creating PyObject, which has 16 bytes as we have shown.
The length is the number of characters in the string and it has 8 bytes. There is also a hash
of the string (it does not have to be calculated) which is 8 bytes.

typedef struct {
    PyASCIIObject _base;
    Py_ssize_t utf8_length;     /* Number of bytes in utf8, excluding the
                                 * terminating \0. */
    char *utf8;                 /* UTF-8 representation (null-terminated) */
    Py_ssize_t wstr_length;     /* Number of code points in wstr, possible
                                 * surrogates count as two code points. */
} PyCompactUnicodeObject;

The PyCompactUnicodeObject definition in CPython 3.10

Then there is the struct with all the bitfields:

  • interned says whether the string is interned, not interned or interned and immortal
    • Interning is a memory-saving optimisation detail. It also improves the performance
      of some operations, such as equality comparison (if immutable objects are equal, then their contents are also equal).
    • By interning a string (or any object for that matter), Python ensures that only one instance of that object is stored in memory, i. e. if 'hello'was interned, then
      for any two strings a = 'hello' and b = 'hello' anywhere in the program a is b would be True
    • All function, class, and variable names are interned, and so are the keys of dictionaries, for example. Also, any string can be explicitly interned.
    • An interned string marked as immortal is never deleted. However, this feature has been deprecated since Python 3.10 and scheduled for removal in Python 3.12 (and it seems it is not used in practice).
      • A side note for readers interested in the immortality of objects: There is a PEP proposal on making some objects (e. g. None) truly immortal and immutable, including their reference count. This has several interesting consequences, such as the avoidance of cache invalidation after each reference count change.
  • kind either 1, 2, or 4 byte unicode symbol (ASCII has 1 byte)
  • compact (legacy bit) for the two described objects is always set to 1
  • ascii says whether the string is composed of ASCII only characters
  • ready (legacy bit) for the two described objects is always set to 1
  • And a 24-bit padding to align the structure to 4 bytes.

Finally, there is a pointer wstr that will be referred to later.

The string itself is stored just after this structure. Again, this will be explained later.

typedef struct {
    PyASCIIObject _base;
    Py_ssize_t utf8_length;     /* Number of bytes in utf8, excluding the
                                 * terminating \0. */
    char *utf8;                 /* UTF-8 representation (null-terminated) */
    Py_ssize_t wstr_length;     /* Number of code points in wstr, possible
                                 * surrogates count as two code points. */
} PyCompactUnicodeObject;

The PyCompactUnicodeObject definition in CPython 3.10

Strings are commonly created through the PyUnicode_New C function. If they contain a character with a code point higher than 127, the PyUnicodeCompactObject structure is used, otherwise PyASCIIObject. There are other ways to create string, and some result in the legacy PyUnicodeObject.

As mentioned before, the actual string is stored directly after this structure, and it’s not referenced from it. Also, the stored string is none of UTF-{8,16,32} encodings, but only the fixed-size UCS code points (=numbers, IDs,.) are stored. The reason for that is that fixed-size characters are necessary so that, for example, indexing works in constant time. It is generally is much easier to work
with strings where each character takes up the same amount of space. UTF only encodes these UCS code points to variable-sized characters, thus (usually) saving a lot of space to the detriment of some operations.

At the beginning, utf8 is NULL and utf8_length is zero. utf-8 is basically a cache for
the UTF-8 representation of the string. This representation is created on-demand, when a string is about to be written to a file (in UTF-8), for example, and it is cached for further usage. utf8_length is the number of bytes in the UTF-8 encoded version of the string, excluding the null at the end.

The wstr pointer and wstr_length are legacy and they are scheduled to be removed in Python 3.12, so I won’t describe them here. An interested reader can refer to string structs definitions and the PyObject_New function for further information.

Immutability

Strings in Python are immutable as described in the documentation. Immutable means that once the object is created it cannot be changed. There can be many references to such an object and all of them can assume the object will never change. Also, the immutability of an object is a necessary (but not a sufficient) condition for being a key in a dictionary.

However, sometimes strings are actually mutable. See the following example.

counter = 0
string1 = ''

for i in range(1_000_000):
     old_id = id(string1)
     string1 += str(i)
     if id(string1) != old_id:
         counter += 1

print(counter)  # returns 47

Example of string mutability

Edit: However, sometimes strings are actually mutable. I have been corrected by @gjmos that strings are always immutable. In this example, the actual string object remains unchanged and on the same memory address, with concatenated string appended at the end, but as @gjmos pointed out, it’s actually a new object, and we should not work with IDs of objects that have died because new object may be allocated at the same memory address, thus having the same ID (in CPython).

After most of the concatenations, the new object occupies the same space as the old object and reuses the data of the old object, appending a string at the end. It may happen only if there is only one reference to the object (and there is free memory space following the object). This is optimisation detail of CPython, making concatenation of strings much faster (in some cases)
There is a blog post about it. Or see the implementation of [unicode_resize][resize_inplace]functions in CPython.

Memory Consumption

Let us see how much memory strings use. If we create a new empty string, it uses PyASCIIObject and totals 49 bytes:

import sys

print(sys.getsizeof(''))  # prints 49

Why 49 bytes? Let’s count

  • PyObject uses 16 bytes
  • length, hash, wstr* use 8 bytes each, so 24 bytes total
  • state uses 4 bytes
  • the actual string uses 1 byte, as it is null-terminated (\0)

That makes it 45 bytes. However, the structure needs to be aligned to 8 bytes, so it is ’padded’ using 4 more bytes (the structure then takes 48 bytes).

If we add one ASCII character

print(sys.getsizeof('a')) # prints 50

it uses 50 bytes.

Now what if we use non-ASCII characters like ž. ž has code point 382, so at least 2 bytes are necessary to store this character. And it will use the PyCompactUnicodeObject:

print(sys.getsizeof('ž')) # prints 76

These strings are getting quite big 🙂. Let’s count again:

  • The PyASCIIObject uses 48 bytes
  • utf_length, utf8*, and wstr_length use 8 bytes each, so 24B total
  • 4 bytes for the actual string (two characters, including \0, each uses 2 bytes)

So that is 76 bytes in total.
Now let’s add a long ASCII text, 10000 characters:

text = 'a' * 10**4
print(sys.getsizeof(text)) # prints 10049

The size of this is expected to be 10001 bytes for the string, and 48 bytes for the structure.

Let’s add one non-ASCII character to the text, a big 4-byte one:

print(sys.getsizeof(text+🖕)) # prints 40080

The size of the string is about 4 times larger because now each character uses 4 bytes (the size of the largest character). The actual text (10002 characters) uses 40008 bytes, and the PyCompactUnicode structure uses 72 bytes as we have seen earlier.

Conclusion

In this blog post, we analyzed the structures that are used to store strings in CPython and compared them with actual Python code.

We have also shown that Python strings are not completely immutable.

Note that the above holds specifically for the reference Python implementation, CPython, and it might (and likely does) differ in other Python implementations.

References

Author

Ondřej Měkota

Ondřej is a Machine Learning Engineer in Heureka since August 2021. He works on matching product offers from eshops to products in Heureka. He enjoys watching movies, snowboarding, and cooking.

<We are social too/>

Interested in our work, technology, team, or anything else?
Contact our CTO Lukáš Putna.

lukas.putna@heureka.cz