Ordered dict surprises

Monday 12 October 2020

Since Python 3.6, regular dictionaries retain their insertion order: when you iterate over a dict, you get the items in the same order they were added to the dict. Before 3.6, dicts were unordered: the iteration order was seemingly random.

Here are two surprising things about these ordered dicts.

You can’t get the first item

Since the items in a dict have a specific order, it should be easy to get the first (or Nth) item, right? Wrong. It’s not possible to do this directly. You might think that d[0] would be the first item, but it’s not, it’s the value of the key 0, which could be the last item added to the dict.

The only way to get the Nth item is to iterate over the dict, and wait until you get to the Nth item. There’s no random access by ordered index. This is one place where lists are better than dicts. Getting the Nth element of a list is an O(1) operation. Getting the Nth element of a dict (even if it is ordered) is an O(N) operation.

OrderedDict is a little different

If dicts are ordered now, collections.OrderedDict is useless, right? Well, maybe. It won’t be removed because that would break code using that class, and it has some methods that regular dicts don’t. But there’s also one subtle difference in behavior. Regular dicts don’t take order into account when comparing dicts for equality, but OrderedDicts do:

>>> d1 = {"a": 1, "b": 2}
>>> d2 = {"b": 2, "a": 1}
>>> d1 == d2
True
>>> list(d1)
['a', 'b']
>>> list(d2)
['b', 'a']

>>> from collections import OrderedDict
>>> od1 = OrderedDict([("a", 1), ("b", 2)])
>>> od2 = OrderedDict([("b", 2), ("a", 1)])
>>> od1 == od2
False
>>> list(od1)
['a', 'b']
>>> list(od2)
['b', 'a']
>>>

BTW, this post is the result of a surprisingly long and contentious discussion in the Python Discord.

Comments

Simon Willison 8:41 PM on 12 Oct 2020

This seems to get back the first item in a dictionary reasonably efficiently, am I missing something here?

first_item = next(iter(d))

Ned Batchelder 8:53 PM on 12 Oct 2020

Yes, that works, but it counts as "iterate over the dict, and wait until you get to the Nth item." Since it's the first item, you don't have to wait long! :)

Simon Willison 8:59 PM on 12 Oct 2020

Ah gotcha - I was thinking O(N) in terms of size of the dictionary, not how-far-along-you-need-to-go - so I was thinking that accessing the first item was O(1) and any item after that was O(N).

Julian Berman 3:12 PM on 13 Oct 2020

The second gotcha is the source of even more surprising and unfortunate behavior, namely that OrderedDict will give you non-transitive equality :/...

Specifically:

OrderedDict([("a", 1), ("b", 2)]) == {"a": 1, "b": 2} == OrderedDict([("b", 2), ("a", 1)])

but OrderedDict([("a", 1), ("b", 2)]) != OrderedDict([("b", 2), ("a", 1)])

Mikhail 12:13 PM on 14 Oct 2020

It's not that difficult, actually.

d[tuple(d.keys())[0])

Ned Batchelder 12:20 PM on 14 Oct 2020

@Mikhail It's a simple line of code, but it's iterating the dict, as I mentioned. If your dict has 1000 elements, you will create a 1000-element tuple and then take the first element.

Mikhail 12:28 PM on 14 Oct 2020

Are you sure that converting dict.keys to tuple involves iteration? Please demonstrate.

Mikhail 12:35 PM on 14 Oct 2020

According to the docs dict.keys() returns a PyListObject, so converting it to tuple doesn't involve iteration.

Ned Batchelder 12:44 PM on 14 Oct 2020

Yes, it involves iteration. Try it: you get a 1000-element tuple. That means it must be an O(N) operation.

Mikhail 12:55 PM on 14 Oct 2020

And to answer your 1000 keys question here's what I ran on my PHONE. The result was 0.0466268063 seconds for a.dict with 1 MILLION keys

import time

d = {f"key_{i}": f"value_{i}" for i in range(1000000)}

t1 = time.time()
got_value = d[tuple(d.keys())[500]]
t2 = time.time()

exec_time = t2 - t1
print(f"It took {exec_time:.10f} seconds for a.dict with {len(d)} keys")

>>>It took 0.0466268063 seconds for a.dict with 1000000 keys

[Program finished]

Mikhail 12:56 PM on 14 Oct 2020

Well, it involves as much iteration, as just looking up a value by key ?

Mikhail 12:57 PM on 14 Oct 2020

Just admit you didn't think of my way ?

Mikhail 1:07 PM on 14 Oct 2020

Ooh, I found a better way, stupid me :)

list(d)[500]

Ned Batchelder 1:17 PM on 14 Oct 2020

@Mikhail: please stop. All of these involve iterating over the dictionary. Looking up a value by key is an O(1) operation. Your ways are all O(N). They may be very fast, but they are still O(N).

Mikhail 6:30 PM on 14 Oct 2020

Ok, according to your own logic, I quote the second comment from the top written by you:

"Yes, that works, but it counts as "iterate over the dict, and wait until you get to the Nth item." Since it's the first item, you don't have to wait long! :) "

So, let's check that. According to you getting the 999,999th item in the dictionary should take at least a little longer than getting the 0th item in the dictionary.

So I checked that. And to make sure, that I checked it right I did 100 runs and aggregated the results. Here they are. Left column is the index of the value that we're getting. Right column: mean of the time it took to get it.

          
0             0.018350
1000          0.018611
10000         0.018211
100000        0.018274
500000        0.018855
999999        0.018140

As you can see, there is no difference in retrieval time. And for once it would be nice to hear some argumentation from your side, apart from "oh it's iterating" and "please stop". Here's my code used for testing:

from time import time
import pandas as pd

dict_size = 1_000_000
experimental_dict = {f"key_{i}": f"value_{i}" for i in range(dict_size)}

check_index = [0, 1_000, 10_000, 100_000, 500_000, 999_999]
result = []
for attempt in range(100):
    for i in check_index:
        t1 = time()
        value = experimental_dict[tuple(experimental_dict.keys())[i]]
        t2 = time()
        result.append(
            {
                "i": i,
                "execution_time": t2 - t1
            }
        )

df = pd.DataFrame(result).groupby(by="i").mean()

By the way, just for the laughs, I did the same test on a dictionary with TEN MILLION records. That's a lot of records for a simple dict, but who cares. And guess what: it took exactly the same amount of time to retrieve the 10-millionth record as it took to retrieve the 0th.

0              0.183136
1000           0.181712
10000          0.182742
100000         0.182070
500000         0.182609
999999         0.182580
9999999        0.182191

If in some way I'm wrong, and I just fail to see the iteration and growing execution time -- by all means enlighten me, and I will admit I was wrong. Otherwise be a gentleman ;)

Ned Batchelder 6:40 PM on 14 Oct 2020

In this case, it's O(N) where N isn't the index you are looking for, but the size of the dict you are looking at.

import time

for exp in range(3, 8):
    num = 10 ** exp
    d = {f"key_{i}": f"value_{i}" for i in range(num)}

    t1 = time.time()
    got_value = d[tuple(d.keys())[500]]
    t2 = time.time()

    exec_time = t2 - t1
    print(f"It took {exec_time:.10f} seconds for a.dict with {len(d)} keys")

This produces:

It took 0.0000109673 seconds for a.dict with 1000 keys
It took 0.0002467632 seconds for a.dict with 10000 keys
It took 0.0029292107 seconds for a.dict with 100000 keys
It took 0.0338559151 seconds for a.dict with 1000000 keys
It took 0.4842569828 seconds for a.dict with 10000000 keys

Each time the dict grew ten times larger, and the time took ten times longer. O(N).

Ned Batchelder 6:48 PM on 14 Oct 2020

The reason it's O(N) in the size of the dict is that you are iterating over the entire dict in order to build your list or tuple.

Dicts don't offer numeric indexing, you have to iterate them to get it. The N could either mean the index you are looking for, if you iterate until you find what you want, or it could mean the size of the dict, if you iterate the entire dict as you are doing.

Your experiment demonstrates that once you have a list or tuple, accessing the element by index is a constant-time operation regardless of the index. This is the exact point being made in the post: lists are better than dicts at random access by numeric index. Your timing experiment shows that. But you've paid the price of iterating the entire dict first, and that total iteration is O(N) in the size of the dict.

Mikhail 7:01 PM on 14 Oct 2020

Thanks for a proper response, I appreciate that. I'm convinced, you're right. The time it takes to convert dict to list increases the longer the dictionary is.

Even if there is iteration, though, it's handled on the level of Python's C engine, so in practicality it won't matter. Especially since you can cache the result of

tuple(dict.keys())

if you need to acces items by index.

But I agree that you're better off with a list for that.

Alexander Loftus 3:57 PM on 15 Oct 2020

Mikhael, just for the record, my reaction to reading these posts is that you were being fairly obnoxious.

Mikhail 6:25 PM on 15 Oct 2020

Thanks for sharing, Alexander

Corbin 12:48 PM on 29 Oct 2020

It's worth considering why accessing the first/last element is interesting. Sometimes, it's because we've expected/proven the collection to only have one element, and we want that element uniquely. Other times, it's because we're performing some iteration, and we want the first/last instance of the iteration to have special behavior.

Paddy McCarthy 5:21 PM on 30 Oct 2020

Mikhael, just for the record, my reaction to reading these posts is that you were keen, involving, and gracious when Ned could finally convince you. Your code showed that you had worked on your point rather than just mouthing off. Ned took the time to enlighten you. And I had a lesson in good debates.

Mikhail 10:19 AM on 31 Oct 2020

Thank you Paddy, it's very kind of you!