sys.getsizeof is not what you want

Sunday 9 February 2020

This week at work, an engineer mentioned that they were looking at the sizes of data returned by an API, and the sizes were always coming out the same, which seemed strange. It turned out the data was a dict, and they were looking at the size with sys.getsizeof.

Sounds great! sys.getsizeof has an appealing name, and the description in the docs seems really good:

sys.getsizeof(object)
Return the size of an object in bytes. The object can be any type of object. All built-in objects will return correct results [...]

But the fact is, sys.getsizeof is almost never what you want, for two reasons: it doesn’t count all the bytes, and it counts the wrong bytes.

The docs go on to say:

Only the memory consumption directly attributed to the object is accounted for, not the memory consumption of objects it refers to.

This is why it doesn’t count all the bytes. In the case of a dictionary, “objects it refers to” includes all of the keys and values. getsizeof is only reporting on the memory occupied by the internal table the dict uses to track all the keys and values, not the size of the keys and values themselves. In other words, it tells you about the internal bookkeeping, and not any of your actual data!

The reason my co-worker’s API responses were all the same size was that they were dictionaries with the same number of keys, and getsizeof was ignoring all the keys and values when reporting the size:

>>> d1 = {"a": "a", "b": "b", "c": "c"}
>>> d2 = {"a": "a"*100_000, "b": "b"*100_000, "c": "c"*100_000}
>>> sys.getsizeof(d1)
232
>>> sys.getsizeof(d2)
232
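If you want to convince yourself that it really is the internal table being measured, grow the number of keys and watch the reported size jump only when that table has to expand. Here’s a little loop to try (just a sketch; the exact byte counts depend on your Python version, so I haven’t shown them):

import sys

d = {}
for i in range(20):
    d[str(i)] = "x" * 1_000_000      # huge values, tiny keys
    # The size jumps only when the dict's internal table resizes;
    # it never reflects the megabytes of string data in the values.
    print(len(d), sys.getsizeof(d))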

If you wanted to know how large all the keys and values were, you could sum their lengths:

>>> def key_value_length(d):
...     klen = sum(len(k) for k in d.keys())
...     vlen = sum(len(v) for v in d.values())
...     return klen + vlen
...
>>> key_value_length(d1)
6
>>> key_value_length(d2)
300003

You might ask, why is getsizeof like this? Wouldn’t it be more useful if it gave you the size of the whole dictionary, including its contents? Well, it’s not so simple. Data in memory can be shared:

>>> x100k = "x" * 100_000
>>> d3 = {"a": x100k, "b": x100k, "c": x100k}
>>> key_value_length(d3)
300003
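In fact, those three values aren’t just equal, they are literally one object, as an is check shows:

>>> d3["a"] is d3["b"] is d3["c"]
True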

So the dict holds three values of 100k characters each, but that 100k string only exists once in memory. Is the “complete” size of the dict 300k? Or only 100k?

It depends on why you are asking about the size. Our d3 dict is only about 100k bytes in RAM, but if we try to write it out, it will probably be about 300k bytes.
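One quick way to see the difference is to actually serialize it (json.dumps here is just a stand-in for “writing it out”): the shared string gets written three times, so the output is over 300k bytes even though only one 100k string exists in memory:

>>> import json
>>> len(json.dumps(d3)) > 300_000
True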

And sys.getsizeof also reports on the wrong bytes:

>>> sys.getsizeof(1)
28
>>> sys.getsizeof("a")
50

Huh? How can a small integer be 28 bytes? And the one-character string “a” is 50 bytes!? It’s because Python objects have internal bookkeeping, like links to their type, and reference counts for managing memory. That extra bookkeeping is overhead per-object, and sys.getsizeof includes that overhead.
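You can see that fixed overhead on its own by measuring the plainest possible object, and watch the payload grow as an int needs more room. (The exact numbers below are what a 64-bit CPython 3.8 reports; they shift a little between versions.)

>>> sys.getsizeof(object())    # nothing but the per-object bookkeeping
16
>>> sys.getsizeof(2 ** 30)     # a bigger int needs another internal digit
32
>>> sys.getsizeof(2 ** 60)
36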

Because sys.getsizeof reports on internal details, it can be baffling:

>>> sys.getsizeof("a")
50
>>> sys.getsizeof("ab")
51
>>> sys.getsizeof("abc")
52
>>> sys.getsizeof("á")
74
>>> sys.getsizeof("áb")
75
>>> sys.getsizeof("ábc")
76
>>> face = "\N{GRINNING FACE}"
>>> len(face)
1
>>> sys.getsizeof(face)
80
>>> sys.getsizeof(face + "b")
84
>>> sys.getsizeof(face + "bc")
88

With an ASCII string, we start at 50 bytes, and need one more byte for each ASCII character. With an accented character, we start at 74, but still only need one more byte for each ASCII character. With an exotic Unicode character (expressed here with the little-used \N Unicode name escape), we start at 80, and then need four bytes for each ASCII character we add! Why? Because Python has a complex internal representation for strings. I don’t know why those numbers are the way they are. PEP 393 has the details if you are curious. The point here is: sys.getsizeof is almost certainly not the thing you want.

The “size” of a thing depends on how the thing is being represented. The in-memory Python data structures are one representation. When the data is serialized to JSON, that will be another representation, with completely different reasons for the size it becomes.

In my co-worker’s case, the real question was, how many bytes will this be when written as CSV? The sum-of-len method would be much closer to the right answer than sys.getsizeof. But even sum-of-len might not be good enough, depending on how accurate the answer has to be. Quoting rules and punctuation overhead change the exact length. It might be that the only way to get an accurate enough answer is to serialize to CSV and check the actual result.
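That can be as simple as writing into an in-memory buffer and measuring what comes out. Here’s a minimal sketch, assuming each response dict becomes one CSV row (the csv_size helper and that one-row-per-dict layout are just illustrative assumptions):

import csv
import io

def csv_size(rows):
    """Bytes these dicts would occupy when written out as CSV."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    # Measure bytes rather than characters, since file size is the question.
    return len(buffer.getvalue().encode("utf-8"))

Unlike the sum-of-len estimate, this automatically accounts for delimiters, quoting, and the header row.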

So: know what question you are really asking, and choose the right tool for the job. sys.getsizeof is almost never the right tool.

Comments

I agree. The sys module is for talking to the interpreter about internal details, which one almost never wants to do in normal code. (The stdxxx streams are the main exception.) In 3.9.1, by my test, the stdlib only uses 'getsizeof' for testing. It does not occur in any lib/*.py file, but occurs 50 times in lib/test/*.py files.
Rohan Deshpande 9:01 PM on 30 May 2021
Isn't data like 1 or "a" stored as objects, and hence the memory usage of each piece of data increased by the per-object overhead?
import sys
import numpy as np

def calculate_size(obj):
    # Shallow size from getsizeof, plus a rough estimate of the contents.
    size = sys.getsizeof(obj)
    if isinstance(obj, dict):
        size += sum(calculate_size(v) for v in obj.values())
        size += sum(calculate_size(k) for k in obj.keys())
    elif isinstance(obj, (list, tuple, set)):
        size += sum(calculate_size(v) for v in obj)
    elif isinstance(obj, bytes):
        size += len(obj)
    elif isinstance(obj, str):
        size += len(obj.encode('utf-8'))
    elif isinstance(obj, type(None)):
        size += 0
    elif isinstance(obj, np.ndarray):
        if obj.dtype == np.uint8:
            size += obj.nbytes
        else:
            size += obj.nbytes + sys.getsizeof(obj)
    elif isinstance(obj, (int, float)):
        size += sys.getsizeof(obj)
    else:
        size += sum(
            calculate_size(getattr(obj, attr))
            for attr in dir(obj)
            if not callable(getattr(obj, attr)) and not attr.startswith('__')
        )

    return size

@MTV: an interesting approach, but it will over-count where objects are shared. For example, calculate_size([1, 1, 1, 1, 1, 1]) counts the size of 1 six times. There’s only one 1 in memory, and it existed before this list was created, so it should be counted zero times.
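A common way around the over-counting is to remember the id() of every object already visited and count shared objects only once; that’s roughly what the recursive-sizeof recipes floating around (and tools like Pympler’s asizeof) do. A rough sketch of the idea, far from a complete accounting of every container type:

import sys

def deep_size(obj, seen=None):
    """Shallow size of obj plus its contents, counting each object only once."""
    if seen is None:
        seen = set()
    if id(obj) in seen:
        return 0                      # shared object: already counted
    seen.add(id(obj))
    size = sys.getsizeof(obj)
    if isinstance(obj, dict):
        size += sum(deep_size(k, seen) + deep_size(v, seen) for k, v in obj.items())
    elif isinstance(obj, (list, tuple, set, frozenset)):
        size += sum(deep_size(item, seen) for item in obj)
    return size

Even then, it only answers the in-memory question, not the “how big will this be when written out” question the post is about, and whether a pre-existing shared object should be counted at all is yet another judgment call.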
