May 10, 2019

A more Pythonic dictionary

Dictionaries are versatile, fast, and efficient. This post covers two dictionary-related features that I feel don’t get enough attention: setdefault and defaultdict. They’re presented together to highlight both the differences and the similarities between them.

Use case: how many views did each article get?

Here’s a simplified real-world scenario: a call to Google Analytics’ API returns the following list of lists where each sub-list represents an article: the first item is the article’s ID and the second one is its view count. Some article IDs may appear in more than one sub-list, and we want to sum the view counts for each distinct article:

received_list = [
    [1678, 30],  # 1678 is the ID, 30 is the view count
    [1987, 99],
    [1822, 50],
    [1678, 22],  # ID already appears
    [2299, 30],
    [1987, 100],  # ID already appears
]

If you know some Python, this should be pretty simple:

articles_and_views = {}

for each_list in received_list:
    article_id = each_list[0]
    article_views = each_list[1]

    if article_id in articles_and_views:
        articles_and_views[article_id] += article_views
    else:
        articles_and_views[article_id] = article_views

This if block is your standard “check whether some key is in the dictionary” code. If it is, we increment its corresponding value by article_views; if the key isn’t already in the dictionary, we create it by assignment.

The output is correct as article 1678 appeared twice, first with 30 views and then with 22:

{1678: 52, 1987: 199, 1822: 50, 2299: 30}

The example above is deliberately simple, so that this post can focus on what setdefault and defaultdict do rather than on the underlying data structures. In other scenarios you may be operating inside nested dictionaries, nested lists, and even more complicated structures. That’s where these two will often come in handy.

setdefault

setdefault is a dictionary method, just like get. In fact, you can think of it as a get that combines a conditional set: get a key’s value, but if the key isn’t present in the dictionary, create it with the default value provided:

my_dict.setdefault(k, v_if_not_k)
# k: the key to search for
# v_if_not_k (optional): value to assign to the previously non-existent key after creating it
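
Here is a quick sketch of both outcomes. The stock dictionary and its keys are made up purely for illustration:

```python
stock = {"apples": 3}

# Key exists: setdefault returns the current value and changes nothing.
existing = stock.setdefault("apples", 0)

# Key is missing: setdefault inserts it with the default and returns that default.
created = stock.setdefault("pears", 0)

# No default given: the missing key is created with the value None.
no_default = stock.setdefault("plums")

print(existing, created, no_default)  # 3 0 None
print(stock)  # {'apples': 3, 'pears': 0, 'plums': None}
```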

In our case, we can utilize this to get rid of the if clause:

articles_and_views = {}

for each_list in received_list:
    article_id = each_list[0]
    article_views = each_list[1]
    articles_and_views.setdefault(article_id, 0)
    articles_and_views[article_id] += article_views

That’s because there’s a hidden if inside of setdefault. We’re asking the dictionary articles_and_views: “did you see this article_id in your keys before? If so, give us that key’s value. If not, create this key and set its value to 0”. The default value can of course be a number other than 0, a list, or any other object. If you don’t provide this second argument at all, the default value will be None.

Using setdefault makes sure that when we get to this next line:

articles_and_views[article_id] += article_views

article_id is undoubtedly an existing key in the dictionary. Either we just initialized it with a value of 0, or it had already existed before, so setdefault did not alter it. In any case, we can now increment its value safely.

In this case, we’re not using the value returned by setdefault, but it’s good to keep in mind it is available if needed.
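
As a rough sketch of what using that return value could look like, here is the same loop written so that the value handed back by setdefault feeds the new total. The list is a trimmed copy of the sample data, and views_so_far is a name I made up:

```python
received_list = [[1678, 30], [1987, 99], [1678, 22]]  # trimmed sample data

articles_and_views = {}

for article_id, article_views in received_list:
    # setdefault hands back the stored count, or 0 if the key was just created
    views_so_far = articles_and_views.setdefault(article_id, 0)
    articles_and_views[article_id] = views_so_far + article_views

print(articles_and_views)  # {1678: 52, 1987: 99}
```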

While it’s not unique to setdefault, there’s one important thing to stress about this method: you can’t assign to its return value. Meaning, this won’t work:

articles_and_views.setdefault(article_id, 0) += article_views
# SyntaxError: can't assign to function call

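If you’re confused by this, remember that setdefault(...) is a method call, and the result of a call is a value, not a name or a dictionary slot that can be assigned to. (The exact wording of the error message differs between Python versions.) The above snippet is comparable to this one (which is hopefully more obviously incorrect):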

# a function/method on the left?!
n = -50
abs(n) += 25

# SyntaxError: can't assign to function call

However, you certainly can do something like this with setdefault if you wanted to simply append each article_views to a list instead of adding them up:

# notice the default value is now a list
articles_and_views = {}

for each_list in received_list:
    article_id = each_list[0]
    article_views = each_list[1]
    articles_and_views.setdefault(article_id, []).append(article_views)

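This code works because setdefault returns the very list that is stored in the dictionary, and append mutates that list in place. articles_and_views ends up looking like this: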

{1678: [30, 22], 1987: [99, 100], 1822: [50], 2299: [30]}
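
The same pattern stretches to the nested structures mentioned earlier. Below is a minimal sketch that assumes each row also carries a country code; that extra column, the rows list, and the names views_by_country and per_article are all invented for this example:

```python
# Hypothetical rows: [article_id, country, view_count]
rows = [
    [1678, "US", 30],
    [1678, "DE", 22],
    [1987, "US", 99],
    [1678, "US", 5],
]

views_by_country = {}

for article_id, country, views in rows:
    # get (or create) the inner dictionary for this article
    per_article = views_by_country.setdefault(article_id, {})
    # get (or create) the counter for this country, then add to it
    per_article[country] = per_article.setdefault(country, 0) + views

print(views_by_country)
# {1678: {'US': 35, 'DE': 22}, 1987: {'US': 99}}
```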

Every time you need a default value inside of a dictionary, consider setdefault. It will save you time and logical overhead. I found it especially useful for unifying external data:

employees_from_api = [
    {"name": "Britney", "age": 32, "bonus": 1500},
    {"name": "Jeff", "age": 32, "bonus": 2400},
    {"name": "Benjamin", "age": 21}, # no bonus
]

for employee in employees_from_api:
    bonus = employee.setdefault("bonus", 500)
    print(f"{employee['name']}'s yearly bonus is {bonus}")

Output:

Britney's yearly bonus is 1500
Jeff's yearly bonus is 2400
Benjamin's yearly bonus is 500
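
Keep in mind that, unlike get, setdefault writes the default into the dictionary. A small sketch of the difference, reusing one employee record from above (the variable names are mine):

```python
employee = {"name": "Benjamin", "age": 21}

# get only reads: the dictionary is left untouched
bonus_via_get = employee.get("bonus", 500)
has_bonus_after_get = "bonus" in employee

# setdefault reads and, if the key is missing, stores the default
bonus_via_setdefault = employee.setdefault("bonus", 500)
has_bonus_after_setdefault = "bonus" in employee

print(bonus_via_get, has_bonus_after_get)                # 500 False
print(bonus_via_setdefault, has_bonus_after_setdefault)  # 500 True
```

If you would rather not modify data you received from elsewhere, get is probably the safer choice.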

defaultdict

defaultdict is a subclass of dict and can be imported from the standard library’s collections module:

from collections import defaultdict 

For the most part, defaultdict behaves just like dict, but it has one distinct feature: if provided with a valid callable as its first argument (more on this later), it never raises a KeyError when a non-existing key is accessed with square brackets; instead, it creates the key.

A little code block should help demonstrate this:

>>> regular_dict = {}
>>> regular_dict['non_existent_key']
KeyError: 'non_existent_key'

>>> from collections import defaultdict
>>> int_defaultdict = defaultdict(int)
>>> int_defaultdict['non_existent_key']
0

>>> list_defaultdict = defaultdict(list)
>>> list_defaultdict["non_existent_key"]
[]

>>> dict_defaultdict = defaultdict(dict)
>>> dict_defaultdict["non_existent_key"]
{}
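
One detail worth spelling out: only square-bracket access creates missing keys. Membership tests and get behave exactly as they do on a regular dict. A short sketch (counts and the other names are just for illustration):

```python
from collections import defaultdict

counts = defaultdict(int)

# Neither a membership test nor get() triggers the default factory
seen = "a" in counts
via_get = counts.get("a")
size_before = len(counts)

# Subscript access on a missing key calls int() and stores the result
via_subscript = counts["a"]
size_after = len(counts)

print(seen, via_get, size_before)  # False None 0
print(via_subscript, size_after)   # 0 1
```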

To apply it to our example:

from collections import defaultdict

articles_and_views = defaultdict(int)

for each_list in received_list:
    article_id = each_list[0]
    article_views = each_list[1]
    articles_and_views[article_id] += article_views

We’ve cut the loop body from six lines to three compared with the if-block implementation. The code is cleaner, no less readable, and more Pythonic.

If, and only if, a key doesn’t already exist in the dictionary, defaultdict will create it and use the callable to set its value. In this case the callable is int, which returns 0 when invoked (and remember it will be invoked only when article_id does not exist as a key in the dictionary).

>>> print(articles_and_views)
defaultdict(<class 'int'>, {1678: 52, 1987: 199, 1822: 50, 2299: 30})

As you can see, the representation of a defaultdict is different from that of a regular dictionary. The former also specifies the callable it uses, or as the Python docs call it, the default_factory (in this case: int). We can always get the plain representation by converting back to a dict:

>>> print(dict(articles_and_views))
{1678: 52, 1987: 199, 1822: 50, 2299: 30}
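
To see for yourself that the factory runs only for missing keys, here is a small sketch. zero_and_log is a made-up stand-in for int that records every time it gets called, and the data is a trimmed copy of the sample list:

```python
from collections import defaultdict

factory_calls = []

def zero_and_log():
    # Hypothetical factory: records each call, then returns the default 0
    factory_calls.append("called")
    return 0

articles_and_views = defaultdict(zero_and_log)

for article_id, article_views in [[1678, 30], [1987, 99], [1678, 22]]:
    articles_and_views[article_id] += article_views

print(len(factory_calls))        # 2, once per distinct ID
print(dict(articles_and_views))  # {1678: 52, 1987: 99}
```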

Default factory must be a callable

Say we now wanted to boost our ego (or avoid getting fired) and start each article’s view count at 1,000:

articles_and_views = defaultdict(1000)

The above line will raise an error:

TypeError: first argument must be callable or None

defaultdict needs a callable “default factory”, and we gave it 1000. Can you call 1000()? I’m yet to meet someone who can. Maybe Guido.

So let’s give it a callable:

def return_one_thousand():
    return 1000 

articles_and_views = defaultdict(return_one_thousand)

Notice that we are not calling the function return_one_thousand (no parentheses), because that would pass the number 1000 rather than the function itself and bring back the TypeError. Instead, it’s articles_and_views that will call it each time it needs to create a missing key. That function we wrote is of course a callable, so defaultdict doesn’t complain.

If we only want to return a simple value we don’t have to define a function and can simply use a lambda:

articles_and_views = defaultdict(lambda: 1000)

After running the same loop as before, both the return_one_thousand and the lambda implementations produce the following (shown here converted to a regular dict):

{1678: 1052, 1987: 1199, 1822: 1050, 2299: 1030}

So how come int, dict, and list worked? Because they are callables too. Try invoking int with no arguments and see what you get:

>>> int()
0 
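
The other built-in types behave the same way when called without arguments. The sketch below also shows that the factory is called anew for every missing key, so two keys never share the same list (defaults and groups are names I picked for the example):

```python
from collections import defaultdict

# Each built-in type, called with no arguments, returns its "empty" value
defaults = [int(), float(), str(), list(), dict(), set()]
print(defaults)  # [0, 0.0, '', [], {}, set()]

# The factory runs once per missing key, so each key gets its own list
groups = defaultdict(list)
groups["a"].append(1)
groups["b"].append(2)

print(groups["a"] is groups["b"])  # False
print(dict(groups))                # {'a': [1], 'b': [2]}
```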

When you need a certain default behavior with a dictionary, consider defaultdict. It will often yield cleaner code than setdefault.

Summary

The use cases of setdefault and defaultdict can overlap, but they are different tools: the former is a method that works on a key-by-key basis and the latter is a subclass of the “regular” Python dict class. It’s good to remember that the convenience offered by defaultdict, never raising a KeyError, can be a double-edged sword: merely reading a missing key with square brackets adds it to the dictionary, which can hide typos and other lookup mistakes.
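
Here is a minimal sketch of that downside and one possible way around it. The mistyped ID 1687 is invented for the example, and setting default_factory to None is just one option for getting the usual KeyError back once the dictionary has been built:

```python
from collections import defaultdict

articles_and_views = defaultdict(int, {1678: 52})

# A mistyped ID (hypothetical) is silently inserted just by reading it
typo_views = articles_and_views[1687]
print(dict(articles_and_views))  # {1678: 52, 1687: 0}

# With default_factory set to None, missing keys raise KeyError again
articles_and_views.default_factory = None
try:
    articles_and_views[9999]
    raised = False
except KeyError:
    raised = True

print(raised)  # True
```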

