PEP 603: Adding a frozenmap type to collections

PEP: 603
Title: Adding a frozenmap type to collections
Version: $Revision$
Last-Modified: $Date$
Author: Yury Selivanov yury@edgedb.com
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Created: 12-Sep-2019
Python-Version: 3.9
Post-History: 12-Sep-2019

Abstract

A persistent data structure is defined as a data structure that
preserves the previous version of the data when the data is modified.
Such data structures are effectively immutable, as operations on
them do not update the structure in-place, but instead always yield
a new updated structure (see [0]_ for more details.)

This PEP proposes to add a new fully persistent and immutable mapping
type called frozenmap to the collections module.

The bulk of frozenmap’s reference implementation is already
used in CPython to implement the contextvars module.

Rationale

Python has two immutable collection types: tuple and
frozenset. These types can be used to represent immutable lists
and sets. However, a way to represent immutable mappings does not yet
exist, and this PEP proposes a frozenmap to implement an
immutable mapping.

The proposed frozenmap type:

  • implements the collections.abc.Mapping protocol,
  • supports pickling, and
  • provides an API for efficient creation of “modified” versions.

The following use cases illustrate why an immutable mapping is
desirable:

  • Immutable mappings are hashable which allows their use
    as dictionary keys or set elements.

    This hashable property permits functions decorated with
    @functools.lru_cache() to accept immutable mappings as
    arguments. Unlike an immutable mapping, passing a plain dict
    to such a function results in error.

  • Immutable mappings can hold complex state. Since immutable mappings
    can be copied by reference, transactional mutation of state can be
    efficiently implemented.

  • Immutable mappings can be used to safely share dictionaries across
    thread and asynchronous task boundaries. The immutability makes it
    easier to reason about threads and asynchronous tasks.

Lastly, CPython [1]_ already contains the main portion of the C code
required for the frozenmap implementation. The C code already
exists to implement the contextvars module (see :pep:567 for
more details.) Exposing this C code via a public collection type
drastically increases the number of users of the code. This leads to
increased code quality by discovering bugs and improving performance
which without a frozenmap collection would be very challenging
because most programs use the contextvars module indirectly.

Specification

A new public immutable type frozenmap is added to the
collections module.

Construction

frozenmap implements a dict-like construction API:

  • frozenmap() creates a new empty immutable mapping;

  • frozenmap(**kwargs) creates a mapping from **kwargs, e.g.
    frozenmap(x=10, y=0, z=-1)

  • frozenmap(collection) creates a mapping from the passed
    collection object. The passed collection object can be:

    • a dict,
    • another frozenmap,
    • an object with an items() method that is expected to return
      a series of key/value tuples, or
    • an iterable of key/value tuples.

Data Access

frozenmap implements the collection.abc.Mapping protocol.
Therefore, getters, membership checks, and iteration work the same
way that they would for a dict::

m = frozenmap(foo=‘bar’)

assert m[‘foo’] == ‘bar’
assert m.get(‘foo’) == ‘bar’
assert ‘foo’ in m

assert ‘baz’ not in m
assert m.get(‘baz’, ‘missing’) == ‘missing’

assert m == m
assert m != frozenmap() # m is not equal to an empty frozenmap

assert len(m) == 1

etc.

Mutation

frozenmap instances are immutable. That said, it is possible
to efficiently produce mutated copies of the immutable instance.

The complexity of mutation operations is O(log N) and the resulting
frozenmap copies often consume very little additional memory due
to the use of structural sharing (read [6]_ for more details.)

frozenmap.including(key, value)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The method creates a new frozenmap copy with a new key / value
pair::

m = frozenmap(foo=1)
m2 = m.including(‘bar’, 100)

print(m) # will print frozenmap({‘foo’: 1})
print(m2) # will print frozenmap({‘foo’: 1, ‘bar’: 100})

frozenmap.excluding(key)
^^^^^^^^^^^^^^^^^^^^^^^^

The method produces a copy of the frozenmap which does not
include a deleted key::

m = frozenmap(foo=1, bar=100)

m2 = m.excluding(‘foo’)

print(m) # will print frozenmap({‘foo’: 1, ‘bar’: 100})
print(m2) # will print frozenmap({‘bar’: 1})

m3 = m.excluding(‘spam’) # will throw a KeyError(‘spam’)

frozenmap.union(mapping=None, **kw)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The method produces a copy of the frozenmap and adds or modifies
multiple key/values for the created copy. The signature of
the method matches the signature of the frozenmap constructor::

m = frozenmap(foo=1)

m2 = m.union({‘spam’: ‘ham’})
print(m2) # will print frozenmap({‘foo’: 1, ‘spam’: ‘ham’})

m3 = m.union(foo=100, y=2)
print(m3) # will print frozenmap({‘foo’: 100, ‘y’: 2})

print(m) # will print frozenmap({‘foo’: 1})

Calling the union() method to add/replace N keys is more efficient
than calling the including() method N times.

frozenmap.mutating()
^^^^^^^^^^^^^^^^^^^^

The method allows efficient copying of a frozenmap instance with
multiple modifications applied. This method is especially useful
when the frozenmap in question contains thousands of key/value pairs
and there’s a need to update many of them in a performance-critical
section of the code.

The frozenmap.mutating() method returns a mutable dict-like
copy of the frozenmap object: an instance of
collections.FrozenMapCopy.

The FrozenMapCopy objects:

  • are copy-on-write views of the data of frozenmap instances
    they were created from;

  • are mutable, although any mutations on them do not affect the
    frozenmap instances they were created from;

  • can be passed to the frozenmap constructor; creating a
    frozenmap from a FrozenMapCopy object is an O(1)
    operation;

  • have O(log N) complexity for get/set operations; creating
    them is an O(1) operation;

  • have a FrozenMapCopy.close() method that prevents any
    further access/mutation of the data;

  • can be used as a context manager.

The below example illustrates how mutating() can be used with
a context manager::

numbers = frozenmap((i, i ** 2) for i in range(1_000_000))

with numbers.mutating() as copy:
for i in numbers:
if not (numbers[i] % 997):
del copy[i]

 numbers_without_997_multiples = frozenmap(copy)

 # at this point, *numbers* still has 1_000_000 key/values, and
 # *numbers_without_997_multiples* is a copy of *numbers* without
 # values that are multiples of 997.

 for i in numbers:
     if not (numbers[i] % 593):
         del copy[i]

 numbers_without_593_multiples = frozenmap(copy)

 print(copy[10])  # will print 100.

print(copy[10]) # This will throw a ValueError as copy
# has been closed when the “with” block
# was executed.

Iteration

As frozenmap implements the standard collections.abc.Mapping
protocol, so all expected methods of iteration are supported::

assert list(m) == [‘foo’]
assert list(m.items()) == [(‘foo’, ‘bar’)]
assert list(m.keys()) == [‘foo’]
assert list(m.values()) == [‘bar’]

Iteration in frozenmap, unlike in dict, does not preserve the
insertion order.

Hashing

frozenmap instances can be hashable just like tuple objects::

hash(frozenmap(foo=‘bar’)) # works
hash(frozenmap(foo=)) # will throw an error

Typing

It is possible to use the standard typing notation for frozenmaps::

m: frozenmap[str, int] = frozenmap()

Implementation

The proposed frozenmap immutable type uses a Hash Array Mapped
Trie (HAMT) data structure. Functional programming languages,
like Clojure, use HAMT to efficiently implement immutable hash tables,
vectors, and sets.

HAMT

The key design contract of HAMT is the guarantee of a predictable
value when given the hash of a key. For a pair of key and value,
the hash of the key can be used to determine the location of
value in the hash map tree.

Immutable mappings implemented with HAMT have O(log N) performance
for set() and get() operations. This efficiency is possible
because mutation operations only affect one branch of the tree,
making it possible to reuse non-mutated branches, and, therefore,
avoiding copying of unmodified data.

Read more about HAMT in [5]. The CPython implementation [1] has a
fairly detailed description of the algorithm as well.

Performance

… figure:: pep-0603-hamt_vs_dict.png
:align: center
:width: 100%

Figure 1. Benchmark code can be found here: [3]_.

The above chart demonstrates that:

  • frozenmap implemented with HAMT displays near O(1) performance
    for all benchmarked dictionary sizes.

  • dict.copy() becomes less efficient when using around
    100-200 items.

… figure:: pep-0550-lookup_hamt.png
:align: center
:width: 100%

Figure 2. Benchmark code can be found here: [4]_.

Figure 2 compares the lookup costs of dict versus a HAMT-based
immutable mapping. HAMT lookup time is ~30% slower than Python dict
lookups on average. This performance difference exists since traversing
a shallow tree is less efficient than lookup in a flat continuous array.

Further to that, quoting [6]_: “[using HAMT] means that in practice
while insertions, deletions, and lookups into a persistent hash array
mapped trie have a computational complexity of O(log n), for most
applications they are effectively constant time, as it would require
an extremely large number of entries to make any operation take more
than a dozen steps.”

Design Considerations

Why “frozenmap” and not “FrozenMap”

The lower-case “frozenmap” resonates well with the frozenset
built-in as well as with types like collections.defaultdict.

Why “frozenmap” and not “frozendict”

“Dict” has a very specific meaning in Python:

  • a dict is a concrete implementation of abc.MutableMapping with
    O(1) get and set operations (frozenmap has O(log N) complexity);

  • Python dicts preserve insertion order.

The proposed frozenmap does not have these mentioned
properties. Instead, frozenmap has an O(log N) cost of set/get
operations, and it only implements the abc.Mapping protocol.

Implementation

The full implementation of the proposed frozenmap type is
available at [2]_. The package includes C and pure Python
implementations of the type.

See also the HAMT collection implementation as part of the
CPython project tree here: [1]_.

References

… [0] Persistent data structure - Wikipedia

… [1] cpython/hamt.c at 3.8 · python/cpython · GitHub

… [2] GitHub - MagicStack/immutables: A high-performance immutable mapping type for Python.

… [3] bench.py · GitHub

… [4] HAMT lookup · GitHub

… [5] Hash array mapped trie - Wikipedia

… [6] Persistent data structure - Wikipedia

Acknowledgments

I thank Carol Willing, Łukasz Langa, Larry Hastings, and
Guido van Rossum for their feedback, ideas, edits, and discussions
around this PEP.

Copyright

This document is placed in the public domain or under the
CC0-1.0-Universal license, whichever is more permissive.

14 Likes

I don’t have terribly strong feelings about the choice of name, but I don’t think your logic for rejecting “frozendict” hangs together. To the average user it’s going to quack like an immutable dict, and once you know about abc the fact that it implements abc.Mapping rather than abs.MutableMapping is just implied by the frozen part.

There’s no great need for consistency here – I’m not going to campaign to rename tuples as frozenlists – I just thought it should be pointed out.

3 Likes

Any plans to mention PEP 416, or address the reasons it was rejected?

4 Likes

I’m a fan of the idea broadly and it definitely has value (as seen in the widely used System.Collections.Immutable package for .NET). Just a few comments, mostly about renaming surface API for consistency:

Why “frozenmap” and not “FrozenMap”

You cite defaultdict, but why would we want to end up with ChainMap, OrderedDict and frozenmap? My vote is for FrozenMap on this basis - but frozendict if we went that way…

Why “frozenmap” and not “frozendict”

The lack of ordering does indeed seem to sink this idea (one of the things I was concerned about while opposing the redefinition of dict as ordered), though we always had functional differences between dict and OrderedDict.

If we could efficiently (or optionally) keep iteration ordered, that would be enough for me to want to call this frozendict, as that parallels set/frozenset nicely (and the list/frozenlist idea we discussed in person).

Make mutation clear through the APIs

Right now, the primary APIs are very unlike everything Python already has. While it may not be academically most correct, aligning with something will make it more clear for users.

For example, I would lead with the .union method, which matches set in both definition and semantics. Similarly, having intersect and difference methods gives a consistency that is easy to follow.

The construction description could clearly indicate the valuable use cases as well - if a primary use of this is to wrap an existing dict with a read-only wrapper (regardless of what’s going on in the background), then show that. That’s consistent with how people use set/frozenset and often list/tuple. The fact that you can also update individual values is handy, but probably not going to be the primary use - another reason to promote union above including

(Is including even necessary in the presence of union(**kw)? Seems like we could just drop it and have one less API to have to explain.)

Simplify FrozenMapCopy

This idea is great, but the description seems unnecessarily complex. I believe the same thing could be implemented as:

def mutating(self):
    return ChainMap({}, self)

The return value from mutating() already has the same semantics as described in the PEP for FrozenMapCopy, apart from the .close() method (but then “rewrapping” in a FrozenMap could be specialised for ChainMap). Also the context manager is lost, but why would you use that anyway? That seems to be mutating this immutable collection, which may be valid, but doesn’t make sense.

Hashing

Presumably hashing is based on keys? Or is it based on key/value pairs? The example seems to imply that values need to be hashable as well - remembering that hash equality is weaker than real equality, and hashability is a type property not a value property (with tuple being an acknowledged exception), it would probably be less efficient but more useful to only hash keys and require full comparisons to determine true equality.

Typing

Does this mean that one of FrozenMap’s base classes will be from the typing module? Or will there be a typing.FrozenMap type added as well? None of the other collection types support class-level indexing.

3 Likes

Alright, now I get what you had in mind :slight_smile:

The answer why the ChainMap trick isn’t what you want is because frozenmap(ChainMap({...}, frozemap)) is an O(N), and frozenmap(frozenmap.mutating()) is O(1).

When we hash a frozenmap we hash both keys and values. This is what makes frozenmap a value type. I don’t think that tuple is an exception, it’s just the only built-in collection type that doesn’t require its elements to be hashable. Similarly, in frozenmap, we only require keys to be hashable.

Requiring values to be hashable might make frozenmap less usable in some scenarios. Putting a list into a tuple and mutating it is a pattern that I saw more than once.

Right now it’s implemented by defining __class_getitem__ which just always returns self. It’s enough for tools like mypy & pytype to work. I personally don’t think involving typing.py for this would be a good idea.

2 Likes

PEP 416 simply implements an immutable version of a dict, this proposal implements a persistent dict. This proposal essentially covers all the use cases of PEP 416, but also enables many other use cases.

1 Like

It’s the only built-in collection type that might raise TypeError depending on its contents, whereas the rest are either always going to be hashable or are never hashable.

This proposal would make FrozenMap a second collection that might raise TypeError, unless you’re going to require that values are hashable when being added. But that’s never going to work :slight_smile:

I agree, though it does help generality when the type annotations outside of the typing module are defined in terms of the types in the typing module. Otherwise type checkers are going to have to specialize your new type rather than relying on their existing handling of aliasing/inheritance.

1 Like

It would still be valuable to discuss it in the PEP, even briefly. They are similar enough proposals that I think PEP 416’s reasons for rejection deserve to be addressed explicitly.

Just to be clear, I am currently a fan of this proposal. But I don’t want it to get shut down just based on 416’s precedent.

5 Likes

Why would someone do that though? If I call dict(stuff) I expect it to be O(N) why would that be different for frozenmap? If frozenmap.mutating gives me an immutable mapping why would I convert that to a frozenmap?

I think I can see what Steve is getting it and it is the same thought I had reading the PEP: the proposed methods seem ad-hoc/obscure and ISTM the same usability effect can be achieved with more understandable primitives. Specifically I would propose:

  1. frozenmap (Mapping, same constructors as dict)
  2. frozenchainmap (Mapping, like chain map but frozen with immutable, hashable arguments)
  3. complementmap (Mapping, also frozen and requires frozen arguments)

Hopefully frozenmap and fronzenchainmap are easy enough to understand. The complementmap type would be complementmap(A, B): immutable mapping with keys/values from A unless the key is in B (which could be frozenset/map).

Those primitives can be composed any way the user likes and the result can always be an immutable, hashable Mapping. There would be no need to use frozenmap(frozenchainmap(...)) because the frozenchainmap is already an immutable Mapping so it should be usable directly (unless you want to incur the O(N) cost to flatten the data structure).

Nice work on the PEP!

Suggest that the virtues of immutability be emphasized. Consider a link to something like: https://queue.acm.org/detail.cfm?id=2884038

Suggest you add several more questions and answers to the “Design Considerations” section.

  1. Why should this go directly into the standard library without first living on PyPI where its value can be proved and where the API can be fully hashed-out without the constraints applied to standard library modules?

    When dataclasses were added, you expressed a similar concern. The answer given was that there was adequate API precedent an existing third-party module. The frozenmap() proposal needs to have an answer as well.

  2. Would the overall structure be hashable? Other immutable collections like bytes, tuple, and frozenset are usable as set elements and dictionary keys.

  3. One premise of the PEP is that fast copies are more important than O(1) lookups. Can you describe the use cases where frequent, fast copies are essential (and perhaps point to some real-world Python code that would benefit)?

  4. There should be a “Why not ChainMap” question and answer as well. The ChainMap was design to avoid copy costs by not copying at all (this gave a nice speed-up to ConfigParser when introduced). Instead, it provides a zero-copy view of a hierarchy of dicts. David Beazley demonstrated its effective use in managing nested scopes in a compiler (a demanding application that heavily exercises the new_child() / parents API as it steps in and out of lexically scoped function and class definitions). Terence Parr demonstrated a similar API for ANTLR is “Language Design Patterns”. Given the efficacy of the ChainMap() API in those challenging applications, the question is what new or better capability would be afforded by frozenmap().

  5. The core HAMT code is already implemented and tested (IMO, the code is of high quality and shows great craftsmanship). However, the code is also somewhat complex and won’t be easy to replicate without substantial effort. Please add a question/answer pair for “How difficult will this be for other implementations (PyPy, CircuitPython, MicroPython, IronPython, Jython)?”

  6. Does this need to go into collections? This tool is different enough from everything else (especially in how it treats keys) that it may warrant its own module, much like we did with dataclasses. This will be doubly true if, like dataclasses, it is likely to grow an ecosystem of supporting functions.

  7. AFAICT, there has never been a user request for this functionality. Can you elaborate on why we think that there will be broad enough demand to warrant in inclusion in the standard library. An explanation here will help avoid “solution looking for a problem” concerns.

4 Likes

I have an interesting observation. Data scientists use dicts and lists of strings and numbers as the main data structure very often. They don’t want to invent classes but happy with Python built-in collections at their abstraction layer.

The problem is: working with mutable collections produce very unsafe complex code, e.g. if you pass a dict into a function you never know without hard reading of the function code if the dict is modified in-place in a wild or not.
Pure-functional style where a function doesn’t modify it’s arguments but returns a new copy of passed collection if needed is very safe and practical solution for controlling the complexity of mathematician code.

There are frozen versions of list and set called tuple and frozenset correspondingly but a frozen alternative for dict doesn’t exist in the standard library.
On my former job people used pyrsistent.PMap for this task very effectively in implementations their algorithms.

4 Likes

Did you consider to add it to the builtins module?

It may be interesting to replace:

def func(map={}): ...

with:

def func(map=frozenmap()): ...

IMHO you need to elaborate this section.

dict and frozenmap implementation are completely different and so have different properties:

  • frozenmap is not a drop-in replacement for read-only dict
  • IMHO the main difference is the API: frozenmap has more methods than dict: including(), excluding(), union(), mutating()
  • dict preserves insertion order, frozenmap is unordered
  • dict uses a “compact dict” internal structure, frozenmap is implemented with HAMT
  • dict.copy() is O(n), frozenmap.copy() is O(1)
  • etc.

Did you consider to add a singleton for frozenmap(): the empty mapping? tuple and frozenset have a singleton for empty collection.

4 Likes

frozenmap.including(key, value) and frozenmap.union(key=value) are equivalent, no?

With the VECTORCALL (an FASTCALL) calling convention, frozenmap.union(key=value) doesn’t need to build a temporary dictionary anymore: it should be as efficient than frozenmap.including(key, value).

Is performance the only motivation to add a specialized method for a single item?

Why excluding(key) instead of excluding(*keys)? What if I want to exclude 3 keys? Should I call:

frozenmap.excluding(‘key1’).excluding(‘key2’).excluding(‘key3’)? Or use frozenmap.mutating() for that?

With such API, you cannot set the “mapping” key. You need the PEP 570 syntax here:

>>> def union(mapping=None, /, **kw): print(mapping, kw)
... 
>>> union()
None {}
>>> union(mapping=3)
None {'mapping': 3}
>>> union({'key': 'value'})
{'key': 'value'} {}

Does frozenmap(copy) call copy.close()?

Would it be possible to even avoid a memory allocation on frozenmap(copy)? I always wanted to avoid memory allocation on list(tuple) or bytes(bytearray), but it’s not possible because of the structure of these types.


If your implementation adds a public C API, can you please put it in the PEP as well?

4 Likes

frozenmap.union(key=value) only works for string keys.

1 Like

Right, but I expect that keys are only strings in the common case. And union({key_obj: value}) can be used for non-string keys.

1 Like

Overall I think it’s a useful addition. I’m against calling it “frozendict” for the same reasons as @yselivanov. The implementation characterics are sufficiently different to avoid putting “dict” in the name.

I’ll let other people hash out the details :wink:

2 Likes

I like it, but.
that section on frozenmap.mutating() seemed too low level. Isn’t there a way to use a comprehension to create a new frozen dict from a filter of items in a current frozen dict? That with statement seems inelegant, but I can see how it gives an easy access to the underlying code.

Current use of the word persistent on docs.python.org seems to be with reference to shelve etc. It’s confusing to introduce a slightly different use of the word.

It might be best to just describe frozen - as being accessible but not changeable once created: Although methods exist for the efficient creation of new, independent frozen objects based on the original frozen object but allowing alterations at the time of creation of this new frozen object.

I do like your mention of how freezing aids comprehension, nice one👌

+1 on the PEP overall. However, I think there’s a couple of parts that could use some further elaboration.

Could you elaborate somewhat on the internal differences between the union() and including() methods, to explain why it’s more efficient to to use a single call to union() containing multiple keys instead of multiple calls to including()?

For providing a summary, it may be useful to add to this slightly. This will allow readers to have a rough idea of how it scales without having to delve into the metrics:

dict.copy() becomes less efficient when using around
100-200 items, reaching over a 400% performance difference with 1000 items.

(based on bench.py ¡ GitHub)

I’m also in favor of naming the type frozenmap rather than frozendict. Naming the type frozendict would likely prove to be misleading, and users may falsely assume that the container is simply an immutable and hashable version of dict, similar to the relationship between set and frozenset.

This would an interesting prospect, but I think it should be implemented in collections first. If there’s significant user demand and usage, it could be added to builtins at a later point in time.

1 Like

Well, just use the frozenmap constructor. It accepts any iterable of (key, value) tuples.

2 Likes