The Sharat's

Guide to Comprehensions in Python

Comprehensions are a syntax construct for transforming and filtering streams of data. Anything comprehensions do can also be done with plain old for-loops, but where they fit, comprehensions can improve readability and show the intent very well.

This article assumes some familiarity with Python (and comprehensions as well). I will go over the basics of comprehensions quickly and jump into the meat of the article. Most of this article applies to Python 3, unless otherwise specified.

If you’re here for the live converter or comprehension ⇔ for-loop code, it’s further down in the page.

Basic Syntax

Let’s go over the basic syntax for starters. It can be divided into three parts: the result expression, the looping construct(s) and the filter expression. Of these, the filter expression is optional, but the other two are required. Let’s look at a simple example to get an idea:

>>> [n ** 2 for n in range(4)]
[0, 1, 4, 9]

This is a list comprehension with no filtering (i.e., no if clause). Here, the n ** 2 part is the result expression and the for n in range(4) is the looping construct. This comprehension expression is the same as the following piece of code, written without comprehensions:

>>> squares = []
>>> for n in range(4):
...     squares.append(n ** 2)
...
>>> squares
[0, 1, 4, 9]

Comprehensions also support conditions on the looping variables. For instance, in the example above, if we only wanted squares of even numbers, we could do:

>>> [n ** 2 for n in range(4) if n % 2 == 0]
[0, 4]

In this case, the result expression is not evaluated when n % 2 == 0 turns out to be False.
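For comparison, the filtered comprehension can be unrolled into a plain loop, which makes it visible that the result expression only runs when the condition holds. This is a minimal sketch; the name even_squares is just illustrative.

```python
# Loop equivalent of: [n ** 2 for n in range(4) if n % 2 == 0]
even_squares = []
for n in range(4):
    if n % 2 == 0:  # the filter expression
        even_squares.append(n ** 2)  # the result expression, skipped for odd n

print(even_squares)  # [0, 4]
```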

The keen Pythonista might note that this can be accomplished more simply by using the step argument of the range builtin, but please excuse me for lacking in creativity for the examples!

Different Collectors

In addition to list comprehensions, Python supports set and dict comprehensions as well. Where list comprehensions collect the result values in a list, the latter two collect them in sets and dicts respectively.

The syntax is almost exactly the same as that of list comprehensions. The only difference is that we use braces for set and dict comprehensions, where we use square brackets for list comprehensions. The looping and filtering constructs behave the same way. The result expression behaves the same way for set comprehensions, but for dict comprehensions, we have to provide two expressions, the key and the value, separated by a colon. Let’s look at some examples:

>>> [color.lower() for color in ['Blue', 'Red', 'blue', 'yellow']]
['blue', 'red', 'blue', 'yellow']

>>> {color.lower() for color in ['Blue', 'Red', 'blue', 'yellow']}
{'blue', 'red', 'yellow'}

The first expression in the above REPL session is a list comprehension and the second is a set comprehension. Notice that the only difference between the two expressions is the surrounding bracket type.

>>> {color.lower(): len(color) for color in ['Blue', 'Red', 'blue', 'yellow']}
{'blue': 4, 'red': 3, 'yellow': 6}

This is a dictionary comprehension. Notice here, the result expression is a key-value pair of expressions, as opposed to a single expression for list and set comprehensions.
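Written as a loop, the dictionary comprehension looks roughly like the sketch below. It also shows why the result has only three entries: 'Blue' and 'blue' lower-case to the same key, and the later assignment replaces the earlier one. The names colors and lengths are illustrative.

```python
colors = ['Blue', 'Red', 'blue', 'yellow']

# Loop equivalent of: {color.lower(): len(color) for color in colors}
lengths = {}
for color in colors:
    # 'Blue' and 'blue' produce the same key, so the later one overwrites the earlier.
    lengths[color.lower()] = len(color)

print(lengths)  # {'blue': 4, 'red': 3, 'yellow': 6}
```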

Note that these two forms of comprehensions were introduced in Python 2.7 & 3. In earlier versions, we could replicate them by calling the set and dict builtins over list comprehensions. Here’s an example:

>>> set([color.lower() for color in ['Blue', 'Red', 'blue', 'yellow']])
{'blue', 'red', 'yellow'}

>>> dict([(color.lower(), len(color)) for color in ['Blue', 'Red', 'blue', 'yellow']])
{'blue': 4, 'red': 3, 'yellow': 6}

For dictionaries, we create a list of 2-tuples (key-value pairs) and pass that to dict.

Multiple Looping Constructs

In the previous examples, we’ve only used one looping construct. However, it is possible to use more than one. This works very much like a nested for-loop. Let’s look at an example:

>>> [(i, j) for i in range(0, 3) for j in range(10, 13)]
[(0, 10), (0, 11), (0, 12), (1, 10), (1, 11), (1, 12), (2, 10), (2, 11), (2, 12)]

This output is easy to visualize if you see the two for-loops nested. The following is a reproduction of the above, without comprehensions:

>>> result = []
>>> for i in range(0, 3):
...     for j in range(10, 13):
...         result.append((i, j))
...
>>> result
[(0, 10), (0, 11), (0, 12), (1, 10), (1, 11), (1, 12), (2, 10), (2, 11), (2, 12)]

This can go to further levels of nesting, although if you have comprehensions with more than three levels of nesting, you should probably rethink your data structures or the way you’re working with them.

Multiple looping constructs work just fine for set and dict comprehensions as well. Here are some examples with set comprehensions, one of them using a condition expression as well:

>>> {(i, j) for i in range(0, 3) for j in range(10, 13)}
{(1, 12), (2, 11), (0, 12), (2, 10), (0, 11), (0, 10), (2, 12), (1, 10), (1, 11)}

>>> {(i, j) for i in range(0, 3) for j in range(10, 13) if j - i > 10}
{(0, 11), (1, 12), (0, 12)}

A subtle point that’s not easy to notice in the comprehensions above is that range(10, 13) is called three times, whereas range(0, 3) is called only once. This becomes obvious if you visualize the comprehension as the nested for-loop illustrated above. It matters when using generators or other single-pass iterators, like map objects or file objects (which would need a .seek call to be read again). Check out the following example to see what I mean:
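One way to observe the evaluation counts directly is to wrap each range in a small function that records how often it is called. The helpers outer_range and inner_range and the calls dictionary are made up for this sketch.

```python
calls = {"outer": 0, "inner": 0}

def outer_range():
    calls["outer"] += 1
    return range(0, 3)

def inner_range():
    calls["inner"] += 1
    return range(10, 13)

# The outer iterable is evaluated once; the inner one once per outer item.
pairs = [(i, j) for i in outer_range() for j in inner_range()]

print(len(pairs))  # 9
print(calls)       # {'outer': 1, 'inner': 3}
```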

>>> range_for_i = map(str, range(0, 3))
>>> range_for_j = map(str, range(10, 13))

>>> [(i, j) for i in range_for_i for j in range_for_j]
[('0', '10'), ('0', '11'), ('0', '12')]

In this example, the map objects are exhausted once they have yielded all their results. That is why range_for_j produced its three numbers only once, which was enough to pair with just '0', leaving nothing to pair with '1' and '2'.

You’re not likely to encounter this in real-world code, but it’s good to know lest we end up facing it.
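If it does come up, one simple way out is to turn the inner iterator into a list before looping, so it can be walked once per outer item. A minimal sketch, reusing the names from the session above:

```python
range_for_i = map(str, range(0, 3))
# Materialize the inner iterable so it can be walked once per outer item.
range_for_j = list(map(str, range(10, 13)))

pairs = [(i, j) for i in range_for_i for j in range_for_j]

print(len(pairs))  # 9
print(pairs[:4])   # [('0', '10'), ('0', '11'), ('0', '12'), ('1', '10')]
```

The outer map object can stay as it is, since it is only iterated once.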

Zipping instead of Cross Product

Using multiple for loops like above creates a sort-of cross-product. This is by nature of the nested loop structure. But what if we’re looking for a sort-of dot-product like result? Python provides the zip builtin for this purpose. It is so specific to this problem, that using a comprehension looks like unnecessary ceremony:

>>> [(i, j) for i, j in zip(range(0, 3), range(10, 13))]
[(0, 10), (1, 11), (2, 12)]

>>> list(zip(range(0, 3), range(10, 13)))
[(0, 10), (1, 11), (2, 12)]

Of course, if we’re doing some operation with i and j instead of just creating tuples, the comprehension would still be very useful.

>>> [i * j for i, j in zip(range(0, 3), range(10, 13))]
[0, 11, 24]
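To see the two shapes side by side, the standard library’s itertools.product gives the cross product that the two-loop comprehension builds, while zip gives the pairwise result. The variable names here are illustrative.

```python
import itertools

xs = range(0, 3)
ys = range(10, 13)

# Cross product: every i paired with every j (same as the two-loop comprehension).
cross = list(itertools.product(xs, ys))

# Pairwise: the first i with the first j, the second with the second, and so on.
pairwise = list(zip(xs, ys))

print(len(cross))  # 9
print(pairwise)    # [(0, 10), (1, 11), (2, 12)]
```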

Rewriting Comprehensions with map & filter Builtins

Comprehensions can usually be a more-readable alternative to code written using map and/or filter functions.

I’ve discussed the map builtin in more detail in a previous article. Not all features of a comprehension can be translated with just the map function. In particular, there’s no way to apply a condition like we can in comprehensions, when using the map function alone. It can be done if we also make use of the filter builtin. Here’s an example of how such a comprehension can be rewritten with map and filter.

>>> [n ** 2 for n in range(10) if n % 2 == 0]
[0, 4, 16, 36, 64]

>>> list(map(lambda n: n ** 2, filter(lambda n: n % 2 == 0, range(10))))
[0, 4, 16, 36, 64]

Obviously, the comprehension reads much better, but I’d urge you to not just throw away the map and filter builtins. They have their place and sometimes, code using them can read much better than comprehensions. Check out my article on the map function for such examples and other rationales.
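As one small illustration of where map can read well: when the transformation is already an existing one-argument function, map needs neither a lambda nor a loop variable. This is only a sketch of the idea, with made-up variable names.

```python
words = ["Blue", "Red", "yellow"]

# With an existing one-argument function, map needs no lambda and no loop variable.
lowered_with_map = list(map(str.lower, words))

# The comprehension has to introduce a name just to call the method on it.
lowered_with_comprehension = [word.lower() for word in words]

print(lowered_with_map)  # ['blue', 'red', 'yellow']
```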

Reducing with Assignment Expressions

I’ve actually stumbled on a version of this idea on Reddit. Unfortunately I don’t have the source, so, wherever you are, thank you!

The functools module from the standard library provides the reduce callable, which can be used to systematically aggregate values in collections. I won’t go into details of how this can be used, but I will show how such an effect can be reproduced with comprehensions.

Let’s look at an example of using the functools.reduce:

>>> import functools
>>> functools.reduce(lambda acc, item: acc * item, range(1, 5), 1)
24

A simple implementation of the reduce function is provided in the official documentation and it’s a better explanation than I can provide here. Instead, we’ll try and reproduce this with comprehensions.
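For readers who’d rather not leave the page, here is a rough sketch of the idea behind reduce. It is deliberately simplified (the real functools.reduce also allows the initial value to be omitted), and simple_reduce is a name made up for this example.

```python
import functools

def simple_reduce(function, iterable, initial):
    """Fold `iterable` into one value, left to right, starting from `initial`."""
    acc = initial
    for item in iterable:
        acc = function(acc, item)
    return acc

product = simple_reduce(lambda acc, item: acc * item, range(1, 5), 1)
print(product)  # 24
```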

For this, we have to first familiarize ourselves with the walrus operator. This is a new feature in Python 3.8, that lets us do assignments in expressions. This means we’ll now be able to do assignment operations in places where only expressions (and not statements) are allowed, like the result expression spot in comprehensions.

By the power of the gray walrus, we can reproduce functools.reduce:

>>> acc = 1
>>> [acc := acc * item for item in range(1, 5)]
[1, 2, 6, 24]
>>> _[-1]
24

Although that works, and is quite nice, I’m not sure how readable that is. But I can attribute my discomfort to the fact that this uses a new language feature and, like anything in life, needs some getting used to. Also, since it’s new in version 3.8, it’s probably best to stay away from it in production code for a little while.
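If the list of running results is what you’re after, the standard library’s itertools.accumulate produces the same values without the walrus, and math.prod (also added in Python 3.8) covers the plain product. A small sketch for comparison:

```python
import itertools
import math
import operator

# Running products, the same list the walrus comprehension builds.
running = list(itertools.accumulate(range(1, 5), operator.mul))
print(running)      # [1, 2, 6, 24]
print(running[-1])  # 24

# For a plain product, math.prod does it directly.
print(math.prod(range(1, 5)))  # 24
```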

Set Operations with Comprehensions

Comprehensions lend themselves quite well to set operations like intersection and difference. They’ll probably be less performant (and even less obvious to readers of such code), but nonetheless, it’s a nice example to play with:

>>> rgb_colors = {"red", "green", "blue"}
>>> ryb_colors = {"red", "yellow", "blue"}

>>> intersection = {c for c in rgb_colors if c in ryb_colors}
>>> intersection
{'red', 'blue'}

>>> difference = {c for c in rgb_colors if c not in ryb_colors}
>>> difference
{'green'}

These are the same results we’d get if we used the standard set operators / methods:

>>> rgb_colors & ryb_colors
{'red', 'blue'}

>>> rgb_colors - ryb_colors
{'green'}

Again, use the standard set functionalities for this, not the comprehension based methods I illustrated above. If you do use the comprehension method of doing this in production, don’t point to me or this article as inspiration.

Generator Expressions

When a comprehension is wrapped in square brackets or braces, the result is a fully realized collection, like a list or a set. When it is wrapped in parentheses instead, the result is a generator expression, with none of the result items realized up front. (The parentheses are required, except when the generator expression is the sole argument of a function call, where the call’s own parentheses are enough.) The result items are realized as needed, for example when the generator is used in a for-loop.

Consider the following example session:

>>> [n ** 2 for n in range(4)]
[0, 1, 4, 9]

>>> (n ** 2 for n in range(4))
<generator object <genexpr> at 0x0000000005768DC8>

We can use this generator object in a for-loop or, perhaps more typically, in an aggregation function, like sum or max etc.

>>> squares = (n ** 2 for n in range(4))
>>> sum(squares)
14

Of course since this is a generator expression, it can be iterated over only once. If you want to iterate over it multiple times, just turn it into a list.
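The single-pass behavior is easy to demonstrate: a second aggregation over the same generator sees no items at all. A minimal sketch:

```python
squares = (n ** 2 for n in range(4))

first_total = sum(squares)   # consumes every item
second_total = sum(squares)  # nothing left to consume

print(first_total)   # 14
print(second_total)  # 0

# Turning it into a list first allows repeated passes.
squares_list = [n ** 2 for n in range(4)]
print(sum(squares_list), max(squares_list))  # 14 9
```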

Generator expressions were introduced in PEP-289, which contains a lot of examples. I recommend reviewing it for some cool use cases, which I won’t reproduce here.

One small note on passing generator expressions as function arguments: make it a habit to always wrap them in their own parentheses. The bare form is only allowed when the generator expression is the sole argument; as soon as there is another argument, Python raises a SyntaxError saying that the generator expression must be parenthesized. The following example illustrates this:

In the following call to sorted, we pass in a generator expression as the sole argument, and we get the expected result.

>>> sorted(word.lower() for word in "We are from planet Earth, what's up?".split())
['are', 'earth,', 'from', 'planet', 'up?', 'we', "what's"]

Now to the same call, we add the key argument hoping to sort by the string lengths. Instead, we get a SyntaxError because our generator expression is not parenthesized.

>>> sorted(word.lower() for word in "We are from planet Earth, what's up?".split(), key=len)
  File "<stdin>", line 1
SyntaxError: Generator expression must be parenthesized

So, if we add parentheses to the generator, it works fine and we get the expected result.

>>> sorted((word.lower() for word in "We are from planet Earth, what's up?".split()), key=len)
['we', 'are', 'up?', 'from', 'planet', 'earth,', "what's"]

The key Argument for sorted

The sorted builtin provides the key argument that can be set to a function. This function is applied to each item in the given list and the list items are sorted according to the sorting order of the results of these function calls. This is a very convenient feature of sorted.

While this is probably a horrible thing to do, we could use comprehensions to recreate this effect without using the key argument. The idea is that we first create a sequence of 2-tuples, where the first items are the results of the key function and the second items are the original list items. We then sort this sequence of tuples, extract the second items in each tuple and return that. Here’s an example implementation doing just that:

def sad_sorted_with_key(items, key_fn):
    return [item for _, item in sorted((key_fn(item), item) for item in items)]


print(sad_sorted_with_key(
    (word.lower() for word in "We are from planet Earth, what's up?".split()),
    len,
))

This script would produce the following output:

['we', 'are', 'up?', 'from', 'earth,', 'planet', "what's"]
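If you compare closely, this output is not quite what sorted(..., key=len) returned earlier: 'earth,' now comes before 'planet'. Tuples with equal first items are compared by their second items, so ties are broken alphabetically, whereas sorted with a key is stable and keeps ties in their original order. One way to mimic that, sketched below, is to decorate each item with its position as well. The name less_sad_sorted_with_key is made up for this example.

```python
def less_sad_sorted_with_key(items, key_fn):
    # Decorate each item with (key, original position) so that ties on the key
    # are broken by position, never by comparing the items themselves.
    decorated = sorted(
        (key_fn(item), index, item) for index, item in enumerate(items)
    )
    return [item for _, _, item in decorated]


words = [word.lower() for word in "We are from planet Earth, what's up?".split()]
print(less_sad_sorted_with_key(words, len))
# ['we', 'are', 'up?', 'from', 'planet', 'earth,', "what's"]
```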

As usual, don’t do this in production. This is just a sad experiment.

No Side Effects Please

As best practice, please strive to have no side effects in your comprehension result expressions. Check out the following example to see what I mean:

>>> [print(n ** 2) for n in range(4)]
0
1
4
9
[None, None, None, None]

While this serves the purpose of printing the squares one per line, it also builds a list of Nones. It’s also counter-intuitive when we treat comprehensions as applying a transformation over each item in a collection. Calling print is not a transformation, it’s a side effect.

For use cases like this, it’s best to use a traditional for-loop:

>>> for n in range(4):
...     print(n ** 2)
...
0
1
4
9

The intent here is clearer, which is to print each square, not to make a list of some results.

Looking Inside

As another likely-pointless exercise, let’s look at these comprehensions as Python bytecode, and compare it with the same solution written using a traditional for-loop.

First, let’s define two functions that solve the same problem, but one uses comprehensions, and the other doesn’t.

def loop_squares():
    result = []
    for n in range(4):
        result.append(n ** 2)
    return result


def comp_squares():
    return [n ** 2 for n in range(4)]

Let’s make sure they produce the same output:

>>> loop_squares()
[0, 1, 4, 9]
>>> comp_squares()
[0, 1, 4, 9]

Now let’s get the dis module and disassemble both of these functions:

>>> import dis
>>> dis.dis(loop_squares)
  2           0 BUILD_LIST               0
              2 STORE_FAST               0 (result)

  3           4 SETUP_LOOP              30 (to 36)
              6 LOAD_GLOBAL              0 (range)
              8 LOAD_CONST               1 (4)
             10 CALL_FUNCTION            1
             12 GET_ITER
        >>   14 FOR_ITER                18 (to 34)
             16 STORE_FAST               1 (n)

  4          18 LOAD_FAST                0 (result)
             20 LOAD_METHOD              1 (append)
             22 LOAD_FAST                1 (n)
             24 LOAD_CONST               2 (2)
             26 BINARY_POWER
             28 CALL_METHOD              1
             30 POP_TOP
             32 JUMP_ABSOLUTE           14
        >>   34 POP_BLOCK

  5     >>   36 LOAD_FAST                0 (result)
             38 RETURN_VALUE

>>> dis.dis(comp_squares)
  2           0 LOAD_CONST               1 (<code object <listcomp> at 0x7f3958a76c00, file "<stdin>", line 2>)
              2 LOAD_CONST               2 ('comp_squares.<locals>.<listcomp>')
              4 MAKE_FUNCTION            0
              6 LOAD_GLOBAL              0 (range)
              8 LOAD_CONST               3 (4)
             10 CALL_FUNCTION            1
             12 GET_ITER
             14 CALL_FUNCTION            1
             16 RETURN_VALUE

Disassembly of <code object <listcomp> at 0x7f3958a76c00, file "<stdin>", line 2>:
  2           0 BUILD_LIST               0
              2 LOAD_FAST                0 (.0)
        >>    4 FOR_ITER                12 (to 18)
              6 STORE_FAST               1 (n)
              8 LOAD_FAST                1 (n)
             10 LOAD_CONST               0 (2)
             12 BINARY_POWER
             14 LIST_APPEND              2
             16 JUMP_ABSOLUTE            4
        >>   18 RETURN_VALUE

I won’t discuss each instruction in the above outputs; check out the official documentation of the dis module for that. But just skimming over the above, we can see one striking difference. The comprehension version creates a separate code object, which does the work of the comprehension and returns the result to our comp_squares function. That suggests that comp_squares pushes an extra frame onto the call stack. We can confirm this by changing the functions to the following:

import traceback

def loop_squares():
    traceback.print_stack()
    result = []
    for n in range(4):
        result.append(n ** 2)
    return result


def comp_squares():
    return [[traceback.print_stack() if n == 0 else None, n ** 2][1] for n in range(4)]

Let’s see the stack they print and make sure they still produce the same result:

>>> loop_squares()
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 2, in loop_squares
[0, 1, 4, 9]
>>> comp_squares()
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 2, in comp_squares
  File "<stdin>", line 2, in <listcomp>
[0, 1, 4, 9]

The stack shows the file as "<stdin>" because I defined the functions within a REPL session. If they were in an actual file, we’d obviously get the file name there.

As we suspected, the comprehension version adds another frame to the call stack, the <listcomp>, which is doing the work of the comprehension.
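One caveat worth checking on your own interpreter: as far as I know, CPython 3.12 and later inline list, set and dict comprehensions (PEP 709), so the separate <listcomp> code object and its frame no longer appear there. The sketch below looks for nested code objects among the function’s constants instead of reading the disassembly; the variable nested_code is just an illustrative name.

```python
import sys
import types

def comp_squares():
    return [n ** 2 for n in range(4)]

# Code objects compiled for nested scopes are stored among the function's constants.
nested_code = [
    const.co_name
    for const in comp_squares.__code__.co_consts
    if isinstance(const, types.CodeType)
]

# On older interpreters this should list '<listcomp>'; on newer ones it should be empty.
print(sys.version_info[:2], nested_code)
```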

Live Code Converter

Here’s a little tool that converts your code written in the form of a list/set/dict comprehension, into one that is written using traditional for-loops.

It’s powered by an extremely light parser (doesn’t even qualify to be called that), but it can help illustrate the point. It can also be helpful for visualizing nested loops and comprehensions with multiple for statements.

Here are some examples to try it with:

Comprehension Code (click to put in converter)
[n ** 2 for n in range(4)]
[n ** 2 for n in range(4) if n % 2 == 0]
{n ** 2 for n in range(4) if n % 2 == 0}
[r"abc def" for n in range(4)]
[(1, 2) for n in range(4)]
[n * m for n in range(4) for m in range(3) if n % 2 == 0]
{n * m for n in range(4) for m in range(3) if n % 2 == 0}
{n: n ** 2 for n in range(4) if n % 2 == 0}
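For trying these examples away from the page, the conversion can be approximated with the standard library’s ast module. To be clear, this is not the converter described above, just a rough sketch of the same idea; it assumes Python 3.9 or later (for ast.unparse), ignores async comprehensions, and the function name comprehension_to_loop and the default target name result are made up.

```python
import ast

def comprehension_to_loop(source, target="result"):
    """Return for-loop source code equivalent to a list/set/dict comprehension."""
    node = ast.parse(source, mode="eval").body
    if isinstance(node, ast.ListComp):
        init, add = "[]", f"{target}.append({ast.unparse(node.elt)})"
    elif isinstance(node, ast.SetComp):
        init, add = "set()", f"{target}.add({ast.unparse(node.elt)})"
    elif isinstance(node, ast.DictComp):
        init = "{}"
        add = f"{target}[{ast.unparse(node.key)}] = {ast.unparse(node.value)}"
    else:
        raise ValueError("expected a list, set or dict comprehension")

    lines = [f"{target} = {init}"]
    depth = 0
    # Each `for` clause becomes one nesting level, and so does each `if` after it.
    for gen in node.generators:
        loop = f"for {ast.unparse(gen.target)} in {ast.unparse(gen.iter)}:"
        lines.append("    " * depth + loop)
        depth += 1
        for cond in gen.ifs:
            lines.append("    " * depth + f"if {ast.unparse(cond)}:")
            depth += 1
    lines.append("    " * depth + add)
    return "\n".join(lines)


print(comprehension_to_loop("[n * m for n in range(4) for m in range(3) if n % 2 == 0]"))
```

Running the generated code with exec and comparing its result against the original comprehension is a quick way to sanity-check the output.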

Conclusion

Comprehensions are a powerful feature in Python that can create very readable code when used correctly. However, like everything else, they have a place and time and it’s not everywhere and all-the-time. It’s important to understand them well if you’re doing more than the trivial list comprehension.

Do check out the official documentation on List Comprehensions, which contains a lot of good examples and ideas I didn’t discuss here.

Additionally, at the risk of repeating myself, there are some experiments on this page that are only intended for learning. Please do not use them in production code. Have pity on your future self.
