Goal
I was trying to transform the following structure:
rows_by_category = {
    "a": [
        ["col1.1", "col1.2", "col1.3"],
        ["a1.1", "a1.2", "a1.3"],
        ["a2.1", "a2.2", "a2.3"],
        ["a3.1", "a3.2", "a3.3"]
    ],
    "b": [
        ["col2.1", "col2.2", "col2.3"],
        ["b1.1", "b1.2", "b1.3"],
        ["b2.1", "b2.2", "b2.3"],
        ["b3.1", "b3.2", "b3.3"]
    ],
    "c": [
        ["col3.1", "col3.2", "col3.3"],
        ["c1.1", "c1.2", "c1.3"],
        ["c2.1", "c2.2", "c2.3"],
        ["c3.1", "c3.2", "c3.3"]
    ]
}
The format is such that each value (e.g. rows_by_category['a']) is a list of rows, where the first row is the header and the remaining rows are the content.
My goal was to convert this to a list of dictionaries, one per “content” row (i.e. not including the header row), where the keys come from the header row and the values are the corresponding cells of that row. In other words, the above structure should be translated into (order is not relevant):
[{'col3.1': 'c1.1', 'col3.2': 'c1.2', 'col3.3': 'c1.3'},
 {'col3.1': 'c2.1', 'col3.2': 'c2.2', 'col3.3': 'c2.3'},
 {'col3.1': 'c3.1', 'col3.2': 'c3.2', 'col3.3': 'c3.3'},
 {'col2.1': 'b1.1', 'col2.2': 'b1.2', 'col2.3': 'b1.3'},
 {'col2.1': 'b2.1', 'col2.2': 'b2.2', 'col2.3': 'b2.3'},
 {'col2.1': 'b3.1', 'col2.2': 'b3.2', 'col2.3': 'b3.3'},
 {'col1.1': 'a1.1', 'col1.2': 'a1.2', 'col1.3': 'a1.3'},
 {'col1.1': 'a2.1', 'col1.2': 'a2.2', 'col1.3': 'a2.3'},
 {'col1.1': 'a3.1', 'col1.2': 'a3.2', 'col1.3': 'a3.3'}]
Naive Solution
Obviously, we can do something like this (it’s naive, but relatively optimized thanks to its use of iterators):
from itertools import islice, izip

def naive_approach(rows_by_category):
    transformed_rows = []
    for category in rows_by_category.itervalues():
        header = category[0]
        for row in islice(category, 1, None):
            transformed_rows.append(dict(izip(header, row)))
    return transformed_rows
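The snippet above targets Python 2 (`izip`, `itervalues`). As a rough sketch of the same idea in Python 3: the built-in `zip` is already lazy and `dict.values()` returns a view, so neither helper is needed. The `sample` data is a shortened, made-up version of the structure from the Goal section.

```python
from itertools import islice


def naive_approach(rows_by_category):
    """Turn {category: [header, row, row, ...]} into a flat list of dicts."""
    transformed_rows = []
    for category in rows_by_category.values():
        header = category[0]
        # islice skips the header without copying the list (unlike category[1:]).
        for row in islice(category, 1, None):
            transformed_rows.append(dict(zip(header, row)))
    return transformed_rows


# Shortened sample data, for illustration only.
sample = {
    "a": [["col1.1", "col1.2"], ["a1.1", "a1.2"], ["a2.1", "a2.2"]],
    "b": [["col2.1", "col2.2"], ["b1.1", "b1.2"]],
}
result = naive_approach(sample)
```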
Can we do better? Definitely!
Nested list comprehension statements are a little confusing (at least to me).
Initially I thought that this would work:
# wrong!
from itertools import islice, izip

[dict(izip(category[0], row))
 for row in islice(category, 1, None)
 for category in rows_by_category.itervalues()]
I read it as “create a dict from zipping the header(category[0]) with the row, and do this for every row, in the following slice of category, where category is an element in rows_by_category“.
That made some sense and I even thought it was working! BUT, when I tested it in different environments (interactive shell vs. PyCharm), it didn’t always work: sometimes category was undefined. In the cases where it appeared to work, category had leaked from another list-comprehension statement (see more in the Conclusion section, below).
The correct solution swaps the two “for” clauses:
# correct
from itertools import islice, izip

[dict(izip(category[0], row))
 for category in rows_by_category.itervalues()
 for row in islice(category, 1, None)]
And can be read, quite intuitively, like the original for-each above, with the understanding that Python reads list comprehension from left-to-right, evaluating the “for” statements in order. It’s pretty much a direct translation of the nested for-each implementation.
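To make the clause-order rule concrete, here is a small Python 3 sketch (the function names and sample data are made up for illustration). The first two functions are meant to be equivalent; the third uses the swapped order and, because a Python 3 comprehension has its own scope and nothing outside it defines `category`, it should fail with a `NameError` rather than silently picking up a stale value.

```python
from itertools import islice

rows_by_category = {
    "a": [["h1", "h2"], ["a1", "a2"]],
    "b": [["h1", "h2"], ["b1", "b2"], ["b3", "b4"]],
}


def with_loops(data):
    out = []
    for category in data.values():              # outer loop
        for row in islice(category, 1, None):   # inner loop
            out.append(dict(zip(category[0], row)))
    return out


def with_comprehension(data):
    # Same clause order as the loops above: outer first, inner second.
    return [dict(zip(category[0], row))
            for category in data.values()
            for row in islice(category, 1, None)]


def wrong_order(data):
    # The first "for" clause is evaluated first, so "category" is looked up
    # before the second clause has had a chance to bind it.
    return [dict(zip(category[0], row))
            for row in islice(category, 1, None)
            for category in data.values()]


same = with_loops(rows_by_category) == with_comprehension(rows_by_category)

try:
    wrong_order(rows_by_category)
    failed = False
except NameError:
    failed = True
```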
Important advantage of the list-comprehension
Written with parentheses instead of square brackets, the same expression becomes a generator expression, which is evaluated lazily and so avoids copying all this data in memory. It’s great when ultimately we wish to further map each row to another type, MyClass.
My final code was something like that:
# note how the following line produces an iterator,
# rather than a materialized list
transformed_rows = (dict(izip(category[0], row))
                    for category in rows_by_category.itervalues()
                    for row in islice(category, 1, None))
my_objects = (MyClass(**row) for row in transformed_rows)
# do something with ``my_objects``...
There’s actually a slight copying of data – but that’s one row at a time stored in-memory (as a dictionary).
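A minimal Python 3 sketch of that lazy pipeline follows. `MyClass` here is a hypothetical stand-in with made-up field names, and the data is invented; the point is only that nothing is built until the consumer asks for the next item.

```python
from itertools import islice


class MyClass:
    """Hypothetical target type; the field names are placeholders."""

    def __init__(self, name, price):
        self.name = name
        self.price = price


rows_by_category = {
    "fruit": [["name", "price"], ["apple", "1.20"], ["pear", "0.90"]],
    "veg": [["name", "price"], ["leek", "2.10"]],
}

# Parentheses instead of brackets: no row dictionaries are built yet.
transformed_rows = (dict(zip(category[0], row))
                    for category in rows_by_category.values()
                    for row in islice(category, 1, None))
my_objects = (MyClass(**row) for row in transformed_rows)

# Rows are built one at a time, only when the consumer asks for them.
first = next(my_objects)
remaining = [obj.name for obj in my_objects]
```

Note that a generator can be consumed only once: after the list comprehension on the last line, `my_objects` is exhausted.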
Note: each row dictionary is unpacked into keyword arguments (**row) to instantiate a MyClass object. This could prove useful when deserializing CSV data (which was the trigger for this post).
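For the CSV case, a hedged Python 3 sketch: `Person` and the CSV text are invented for illustration, and `io.StringIO` stands in for a real file. It also shows that the standard library's `csv.DictReader` does the header-to-row zipping itself, which may be simpler when the header row comes straight from the file.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class Person:  # hypothetical record type
    name: str
    city: str


# Stand-in for a real file; with a file on disk you would pass the open file.
csv_text = "name,city\nAda,London\nLinus,Helsinki\n"

# Manual version: the first row is the header, the rest are content rows.
reader = csv.reader(io.StringIO(csv_text))
header = next(reader)
people = [Person(**dict(zip(header, row))) for row in reader]

# csv.DictReader pairs each row with the header for us.
same_people = [Person(**row) for row in csv.DictReader(io.StringIO(csv_text))]
```

Keyword unpacking only works when every header name is a valid parameter name of the target class, so headers such as "col1.1" would need to be renamed first.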
Conclusion
- List comprehension is awesome!
- That said, it needs to be tested on a small, easily understood subset of your data (not too small, but representative) and compared to a naive solution.
- When using an IDE, don’t ignore it when it warns you about an undefined variable.
- Python could be dangerous sometimes: when I tested this, category was accidentally defined because I had used it earlier in the interactive Python session, and it “leaked” (in the sense that the nested list comprehension found a value that I didn’t intend it to find). I haven’t tried it myself, but I believe Python 3+ doesn’t have this problem.
- As always, you should performance-test your code… there could be better alternatives to list comprehension (like itertools) that are more efficient. Of course, you should always keep in mind code readability, cleanliness and declarativeness (I probably just invented this word; it means that the code should state the intention of what you are trying to achieve).
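As a sketch of the “compare against the naive solution” advice, the Python 3 snippet below checks three variants against each other on a small, made-up data set. The `itertools` variant is just one possible alternative, not a claim that it is faster; that would need measuring, for example with the standard `timeit` module.

```python
from itertools import chain, islice


def naive(data):
    out = []
    for category in data.values():
        for row in islice(category, 1, None):
            out.append(dict(zip(category[0], row)))
    return out


def with_comprehension(data):
    return [dict(zip(category[0], row))
            for category in data.values()
            for row in islice(category, 1, None)]


def rows_as_dicts(category):
    """Yield one dict per content row of a single category."""
    header = category[0]
    return (dict(zip(header, row)) for row in islice(category, 1, None))


def with_itertools(data):
    # One generator per category, flattened by chain.from_iterable.
    return list(chain.from_iterable(map(rows_as_dicts, data.values())))


# Small but representative: several rows, a single row, and a header-only case.
small = {
    "a": [["k1", "k2"], ["a1", "a2"], ["a3", "a4"]],
    "b": [["k1", "k2"], ["b1", "b2"]],
    "empty": [["k1", "k2"]],  # header only: contributes no rows
}
expected = naive(small)
agree = with_comprehension(small) == expected == with_itertools(small)
```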
Here’s what I came up with in C#:
Not that bad I think.. also you could create another extension method overload for ToDictionary that takes Tuple<T,V> and then it would be just:
Or even create a simple extension method like ZipToDictoinary:
To my eye, these would be more readable than that Python magic you used, but then again I’m probably biased 😉