Pickle’s nine flaws

Saturday 20 June 2020

Python’s pickle module is a very convenient way to serialize and de-serialize objects. It needs no schema, and can handle arbitrary Python objects. But it has problems. This post briefly explains them.

Some people will tell you to never use pickle because it’s bad. I won’t go that far. I’ll say, only use pickle if you are OK with its nine flaws:

  • Insecure
  • Old pickles look like old code
  • Implicit
  • Over-serializes
  • __init__ isn’t called
  • Python only
  • Unreadable
  • Appears to pickle code
  • Slow

The flaws

Here is a brief explanation of each flaw, in roughly the order of importance.

Insecure

Pickles can be hand-crafted that will have malicious effects when you unpickle them. As a result, you should never unpickle data that you do not trust.

The insecurity is not because pickles contain code, but because they create objects by calling constructors named in the pickle. Any callable can be used in place of your class name to construct objects. Malicious pickles will use other Python callables as the “constructors.” For example, instead of executing “models.MyObject(17)”, a dangerous pickle might execute “os.system(‘rm -rf /’)”. The unpickler can’t tell the difference between “models.MyObject” and “os.system”. Both are names it can resolve, producing something it can call. The unpickler executes either of them as directed by the pickle.
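Here’s a sketch of the mechanism, with a made-up Evil class and a harmless echo command standing in for anything destructive. A class can tell pickle to call any importable callable when it is loaded:

import os
import pickle

class Evil:
    # __reduce__ tells pickle what to call to recreate the object.
    # A malicious pickle can name any importable callable here.
    def __reduce__(self):
        return (os.system, ("echo this could have been rm -rf /",))

payload = pickle.dumps(Evil())
pickle.loads(payload)   # runs os.system(...): never unpickle untrusted data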

More details, including an example, are in Supakeen’s post Dangers in Python’s standard library.

Old pickles look like old code

Because pickles store the structure of your objects, when they are unpickled, they have the same structure as when you pickled them. This sounds like a good thing and is exactly what pickle is designed to do. But if your code changes between the time you made the pickle and the time you used it, your objects may not correspond to your code. The objects will still have the structure created by the old code, but they will be running with the new code.

For example, if you’ve added an attribute since the pickle was made, the objects from the pickle won’t have that attribute. Your code will be expecting the attribute, causing problems.
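Here’s a small sketch of the mismatch, with a made-up User class. Redefining the class stands in for editing your code between the time the pickle was written and the time it’s read:

import pickle

class User:
    def __init__(self, name):
        self.name = name

old_pickle = pickle.dumps(User("Ned"))

# Later, the code changes: the class gains a new attribute.
class User:
    def __init__(self, name):
        self.name = name
        self.email = ""

user = pickle.loads(old_pickle)
print(user.name)     # Ned
print(user.email)    # AttributeError: 'User' object has no attribute 'email'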

Implicit

The great convenience of pickles is that they will serialize whatever structure your object has. There’s no extra work to create a serialization structure. But that brings problems of its own. Do you really want your datetimes serialized as datetimes? Or as iso8601 strings? You don’t have a choice: they will be datetimes.

Not only don’t you have to specify the serialization form, you can’t specify it.
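For example, a datetime goes into the pickle as a datetime and comes back out as a datetime; there’s no way to ask for an ISO 8601 string instead:

import datetime
import pickle

moment = {"when": datetime.datetime(2020, 6, 20, 12, 0)}
restored = pickle.loads(pickle.dumps(moment))
print(type(restored["when"]))    # <class 'datetime.datetime'>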

Over-serializes

Pickles are implicit: they serialize everything in your objects, even data you didn’t want to serialize. For example, you might have an attribute that caches an expensive computation, which you don’t want serialized. Pickle doesn’t have a convenient way to skip that attribute.

Worse, if your object contains an attribute that can’t be pickled, like an open file object, pickle won’t skip it: it will insist on trying to pickle it, and then raise an exception.
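A small sketch of that failure, with a made-up Report class holding an open file handle:

import pickle

class Report:
    def __init__(self, path):
        self.path = path
        self.output = open(path, "w")    # incidental state, but pickle won't skip it

pickle.dumps(Report("report.txt"))
# TypeError: cannot pickle '_io.TextIOWrapper' object (exact message varies by version)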

__init__ isn’t called

Pickles store the entire structure of your objects. When the pickle module recreates your objects, it does not call your __init__ method, since the object has already been created.

This can be surprising, since nowhere else do objects come into being without calling __init__. The logic here is that __init__ was already called when the object was first created in the process that made the pickle.

But your __init__ method might perform some essential work, like opening file objects. Your unpickled objects will be in a state that is inconsistent with your __init__ method. Or your __init__ might log information about the object being created. Unpickled objects won’t appear in the log.
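A quick way to see it, with a made-up Connection class that announces its own construction:

import pickle

class Connection:
    def __init__(self, host):
        print("opening connection to", host)
        self.host = host

conn = Connection("example.com")             # prints the message
restored = pickle.loads(pickle.dumps(conn))  # prints nothing: __init__ never runs
print(restored.host)                         # example.com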

Python only

Pickles are specific to Python, and are only usable by other Python programs. This isn’t strictly true: you can find packages for other languages that can read pickles, but they are rare. They will naturally be limited to the generic cross-language list/dict structures, at which point you might as well just use JSON.

Unreadable

A pickle is a binary data stream (actually, instructions for an abstract execution engine). If you open a pickle as a plain file, you cannot read its contents. The only way to know what is in a pickle is to use the pickle module to load it. This can make debugging difficult, since you might not be able to search your pickle files for data you are interested in:

>>> pickle.dumps([123, 456])
b'\x80\x03]q\x00(K{M\xc8\x01e.'

Appears to pickle code

Functions and classes are first-class objects in Python: you can store them in lists, dicts, attributes, and so on. Pickle will gladly serialize objects that contain callables like functions and classes. But it doesn’t store the code in the pickle, just the name of the function or class.

Pickles are not a way to move or store code, though they can appear to be. When you unpickle your data, the names of the functions are used to find existing code in your running process.
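You can see this by poking at the bytes of a pickled function. Here’s a minimal sketch with a made-up greet function: only the module and name are stored, and loading just looks that name up again in the running process:

import pickle

def greet(name):
    return "Hello, " + name

data = pickle.dumps(greet)
print(b"greet" in data, b"__main__" in data)   # True True: just the names
loaded = pickle.loads(data)
print(loaded is greet)                         # True: it found the existing function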

Slow

Compared to other serialization techniques, pickle can be slow, as Ben Frederickson demonstrates in Don’t pickle your data.

But but..

Some of these problems can be addressed by adding special methods to your class, like __getstate__ or __reduce__. But once you start down that path, you might as well use another serialization method that doesn’t have these flaws to begin with.
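For instance, __getstate__ and __setstate__ can leave a cache attribute out of the pickle and restore a placeholder on load. A sketch, with a made-up Document class:

import pickle

class Document:
    def __init__(self, text):
        self.text = text
        self._cache = None              # expensive derived data; don't serialize it

    def __getstate__(self):
        state = self.__dict__.copy()
        del state["_cache"]             # leave the cache out of the pickle
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)
        self._cache = None              # recompute lazily after unpickling

doc = pickle.loads(pickle.dumps(Document("hello")))
print(doc.text, doc._cache)             # hello None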

What’s better?

There are lots of other ways to serialize objects, ranging from plain-old JSON to fancier alternatives like marshmallow, cattrs, protocol buffers, and more.

I don’t have a strong recommendation for any one of these. The right answer will depend on the particulars of your problem. It might even be pickle...

Comments

[gravatar]
Keith E Lazarus 9:52 PM on 23 Jun 2020
Just wanted to say, I stumbled across this post and saw your name as author at the top. I did Notes (particularly API) dev for a good number of years and your name jumped right out at me. It was a great product whose end came too soon - there are so many times I find myself fretting over having to use Sharepoint when the same thing in Notes/Domino would've been a breeze - and better. :)
[gravatar]
Great pros and cons summary! For arbitrary structures, pickle works in many cases out of the box. In others, using the state callbacks goes a long way. Zope’s ZODB made very good use of pickling.

Most other methods require some kind of schema declarations and processing (I’ve compared some in https://ict.swisscom.ch/2017/12/python-schema/; nowadays I would probably consider Pydantic).

In that sense, pickling corresponds to dynamic, type-less Python, whereas other methods correspond to typed Python.
[gravatar]
The referenced report for pickle performance is 6 years old and out of date. Pickle has improved performance over the years. I suggest these two references as more modern benchmarks showing much better performance:
https://link.medium.com/Z3hOpWCvC7
https://voidfiles.github.io/python-serialization-benchmark/
[gravatar]
@Dave Trollope: Thanks for the links to the performance benchmarks.

I believe pickle got a reputation for bad performance because in Python 2, some people didn't realize that you had to use cPickle, and you had to explicitly specify the protocol, as otherwise the worst-performing protocol would be used.
[gravatar]
@Dave Trollope: Seconded! Before I found your comment, I even went to the trouble of updating the benchmark for Python 3, and pickle is now faster than JSON on that same benchmark (with Python 3.8) and competitive in terms of size. See https://github.com/dlukes/pickle-json-benchmark (details in charts in the readme).

Great writeup otherwise, I have a much clearer idea of what (not) to expect from pickle after reading this!
[gravatar]
I would recommend checking out the Apache Arrow python library for serializing your data. The arrow format is also inter-operable.

https://arrow.apache.org/docs/python/ipc.html
[gravatar]
There’s a tenth flaw that surprises many: equal objects may be serialized to unequal byte sequences using Pickle (even with the same optimization level and interpreter/platform). I’m not totally sure why but it seems to happen fairly easily with tuples. Depending on how the tuple was constructed, Pickle may serialize it differently. This means that two tuples that are equal with identical contents may have different serialized byte representations. There are details and an example at https://github.com/grantjenks/python-diskcache/issues/181#issuecomment-765851549

The problem is less likely when using pickletools.optimize() but is still a big gotcha.

The best solution for equal byte representations of equal objects is to use a different serialization protocol, like JSON.
[gravatar]

I think I had a good use-case for pickle. I was writing a script that had a long computation phase (several minutes), followed by a graphing phase to analyze the data and to render fancy graphics.

I finished the code for the first phase first, and I was iterating on variants of the second phase. To save time, I used pickle as a temporary serialization. It let me store the exact data structures so they could be reused in later interpreter runs. After the code for the second phase was completed, both the pickled files and the pickle-related code became obsolete.

Pickle is good as a temporary short-term data storage across multiple executions of the same-ish source-code on the same interpreter version. It’s very unlikely to be a good solution for a production environment. It’s definitely not suitable for any long-term storage.
