Class 9: Advanced and Performant Python#

One of the best things about Python is how easy it is to get started with. The syntax is clear, it has all the basic features, things work in a relatively intuitive way, and life is good. But Python also supports very advanced features, which make coding with Python an enjoyable experience even after you think you’ve learned everything there is to know about the language. You can always dive a little deeper. You’re not required to, but you can.

Generators#

In simplistic terms, generators are iterators. Meaning, a generator is always an object you can iterate over. In Python you can iterate over most data structures, including dictionaries, lists, tuples and more - and in this sense generators are similar. However, when we iterate over a list, for example, we’re iterating over an existing data structure with existing items. The same goes for dictionaries: when we iterate over them, Python “hands over” the dictionary’s keys and values.

a_list = [0, 10, 20]
for item in a_list:
    print(item)
0
10
20
a_dict = dict(a=0, b=10, c=20)
for key in a_dict:
    print(key, a_dict[key])
a 0
b 10
c 20

This is the first major difference between a generator and the other iterators. A generator is a recipe for creating the next item in the chain. A generator is a piece of code telling the Python interpreter how to create the next item, but it doesn’t hold that item in memory yet. A simple example might be a list containing values from 0 to 1000. A generator of these values will not hold 1000 cells in memory - it only holds instructions on how many items to produce and how to calculate the next value.

We’ve already met a kind of generator-like object: the range() function. When we tell Python to give us a range of numbers between 0 and 1000 by writing range(1000) - we’re not actually generating the 1000 “cells” of values, only the recipe. Let’s see it in “action”:

range(1000)  # a "range" object
range(0, 1000)
items = range(1000)
items
range(0, 1000)
import sys
sys.getsizeof(items)
48
sys.getsizeof(list(items)) 
8056

A simple 1000-element list isn’t that heavy for a computer (even a modest single-board computer), but when lists get longer, with bigger arrays and massive data structures inside them, it’s very inefficient to hold this amount of unused data in memory.

Generator Creation and Iteration#

Let’s define our own generator:

def my_range(n):
    """
    Yields numbers from 0 up to (not including) n.
    
    Parameters
    ----------
    n : int
        Number of items to generate
    """
    num = 0
    while num < n:
        yield num
        num += 1

Calling a generator function doesn’t immediately run any of its code - it only creates a generator object. The yield keyword is the reserved word that turns a function into a generator.

The code runs only when Python’s next() function is called on the generator: execution proceeds until it reaches a yield, where it pauses, or “saves” its current state, until next() is called again:

new_range = my_range(3)

print(next(new_range))

print(next(new_range))

print(next(new_range))

print(next(new_range))
0
1
2
---------------------------------------------------------------------------
StopIteration                             Traceback (most recent call last)
Cell In[8], line 9
      5 print(next(new_range))
      7 print(next(new_range))
----> 9 print(next(new_range))

StopIteration: 

Each time next() is called, execution resumes where it last paused and continues until another yield statement is reached. In the my_range function, as long as num is smaller than n the code will reach a yield. Once we no longer satisfy this condition, the code skips the loop and reaches the end of the function. This raises a special StopIteration exception, used only in these special cases. It means you can catch this exception and know that your generator went through all of its items.
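
For example, a hand-rolled loop could catch it explicitly (a small sketch using the my_range generator defined above):

gen = my_range(2)
while True:
    try:
        value = next(gen)
    except StopIteration:
        print("Generator exhausted")  # reached once all items were yielded
        break
    print(value)
# prints 0, 1 and then "Generator exhausted"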

But calling next() multiple times isn’t practical. Luckily, for loops implement this exact interface automatically, allowing us to use them instead of the tedious, repetitive next() calls:

loop_range = my_range(10)

for item in loop_range:
    print(item)
0
1
2
3
4
5
6
7
8
9

The for loop is also smart enough to catch the StopIteration exception and terminate the loop, without raising any “visible” exceptions. A for loop is the common way to iterate over generators.

Note that “holding off” the generation of the values means we can’t access the complete data structure. If we try to print it, we will simply get the generator instance’s representation string:

range2 = my_range(10)
print(range2)
<generator object my_range at 0x7fe768707f90>

The same applies for indexing:

range2[3]
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[11], line 1
----> 1 range2[3]

TypeError: 'generator' object is not subscriptable

Once used, generators are “depleted” - you can’t reuse them. This is another major difference between a generator and a list: you can iterate over a list as many times as you like.

for item in loop_range:
    print(item)

The for loop doesn’t print anything, because we already depleted loop_range.

Let’s try something more explicit:

next(loop_range)
---------------------------------------------------------------------------
StopIteration                             Traceback (most recent call last)
Cell In[13], line 1
----> 1 next(loop_range)

StopIteration: 

It’s important to stress that any function becomes a generator if it contains a yield statement:

def print_stuff():
    print("Hello, ")
    yield True
    print("World")
    yield False
stuff_printer = print_stuff()
stuff = next(stuff_printer)
print(stuff)
Hello, 
True
more_stuff = next(stuff_printer)
print(more_stuff)
World
False

Generator Expressions#

Another way to create generators is with “genexps”, or generator expressions, which look very similar to list comprehensions:

nums = (2 * n for n in range(10))
nums
<generator object <genexpr> at 0x7fe768707cf0>
for num in nums:
    print(num)
0
2
4
6
8
10
12
14
16
18

The round brackets (()) - rather than the square brackets of a list comprehension - tell the interpreter that we’re creating a generator.
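
To see the memory difference, we can compare a list comprehension with the equivalent generator expression (a small sketch; the exact byte counts vary between Python versions and platforms):

import sys

as_list = [2 * n for n in range(100_000)]  # all 100,000 items exist in memory
as_gen = (2 * n for n in range(100_000))   # only the "recipe" exists
print(sys.getsizeof(as_list))  # on the order of hundreds of kilobytes
print(sys.getsizeof(as_gen))   # roughly a hundred bytes, regardless of the range's size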

Use cases#

When should we use generators? Many libraries use them almost ubiquitously. For example, pathlib uses them to iterate over the content of directories - it’s not “cheap” to get the full directory’s content and then iterate over it, so pathlib uses a generator to yield the next item every time.

In our field of work, generators can either be great to have in order to improve the performance of your code, or an absolute necessity if you’re working with very large data structures that simply cannot be handled in-memory. In any case, implementing and using generators is easy and incredibly beneficial for various computations.
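
As a sketch of such a use case, a generator can hand us one chunk of a huge binary file at a time, so that only a single chunk ever lives in memory (the function and file names here are made up for illustration):

def read_in_chunks(fname, chunk_size=1_000_000):
    """Yields successive chunks of a (possibly huge) binary file."""
    with open(fname, 'rb') as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:  # end of file - ends the generator (StopIteration)
                return
            yield chunk

# for chunk in read_in_chunks('huge_recording.bin'):
#     process(chunk)  # 'process' stands for any per-chunk computation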

Exercise: Fibonacci Generator

Write a generator function that yields the first n Fibonacci numbers.

Solution
def generate_n_fibonacci(n):
    """
    Generates the first n Fibonacci numbers.
    
    Parameters
    ----------
    n : int
        Length of the generated Fibonacci sequence
    """
    index, a, b = 0, 0, 1
    while index < n:
        yield a
        a, b = b, a + b
        index += 1

Bonus: Modify your solution to create an infinite Fibonacci generator (infinite in the sense that as long as iteration over it is continued, the next number will be generated).

Solution
def generate_fibonacci():
    """
    Fibonacci sequence generator.
    """
    a, b = 0, 1
    while True:
        yield a
        a, b = b, a + b

*args, **kwargs#

You use *args and **kwargs when you want a function to accept a flexible number of arguments. Strictly speaking, the syntax is only * and ** - the names args and kwargs are simply used by convention. args stands for arguments, i.e. the unnamed, positional arguments given to a function, and kwargs stands for keyword arguments, i.e. arguments given as key=value pairs. Let’s see a simple example:

def f(required_argument, *args, **kwargs):
    print(required_argument)

    if args:
        print("I found something in args!")
        print(args)
    
    if kwargs:
        print("I found something in kwargs!")
        for key, value in kwargs.items():
            print(key, value)
f()  # doesn't work - we have one required argument to the function
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[20], line 1
----> 1 f()  # doesn't work - we have one required argument to the function

TypeError: f() missing 1 required positional argument: 'required_argument'
f('required')
required
f('required', 1, 2, 3)  # the second printed row is the args
required
I found something in args!
(1, 2, 3)
f('required', 1, 2, 3, kw1='a', kw2='b')
required
I found something in args!
(1, 2, 3)
I found something in kwargs!
kw1 a
kw2 b

We see that args is a tuple containing all unnamed parameters that were given to the function, in the order they were given, and kwargs is a dictionary mapping the keyword argument names to their values.

Essentially, using * (or **) in the function’s signature is the inverse of using it to unpack arguments when calling a function:

def f(a=1, b=2):
    print(a, b)

inputs = {'a': 10, 'b': 20}
f(**inputs)

def f2(a, b):
    print(a, b)

my_inputs = (1, 2)
f2(*my_inputs)
10 20
1 2

What we see here is that the function’s signature doesn’t have to contain *args or **kwargs for unpacking to work. The ** operator unpacks the input dictionary into keyword arguments, and * unpacks a tuple or list into positional arguments, allowing f() and f2() to use the parameters without any issues.

Decorators#

Decorators are functions that receive other functions as their arguments. When you wrap an existing function with another function, you’re creating a decorator. This feature is used extensively in web frameworks, in pytest and in many other important Python use cases, which is why it has a special syntax: @decorator. Let’s look at an example:

Assume I have a large data-processing pipeline script, built out of many smaller functions, which unfortunately takes a long time to run. I wish to understand why it’s taking so long, so I decide to add a print statement at the start and end of each function, so that I can see with my own eyes where the code “hangs”. This is how I implemented it:

def main_pipeline(fname):
    data = load_data(fname)
    processed = process_data(data)
    appended = append_data(processed)
    logged = log_data(appended)

def load_data(fname):
    print("Starting 'load_data'...")
    # ... Code ...
    print("Ending 'load_data'...")

def process_data(data):
    print("Starting 'process_data'...")
    # ... Code ...
    print("Ending 'process_data'...")
    
# And so on...

This is obviously very tedious. Even when I only have four functions, it’s very repetitive and feels wrong. Moreover, it might not have solved my issue. Let’s say my manual examination showed that all four functions take a considerable time to run, so I decide to profile the execution time of each function, to better understand which function is the most costly and optimize it first.

Here’s how I redefined all functions to measure their execution time:

import time


def load_data(fname):
    start_time = time.time()
    print(f"Starting 'load_data' at {start_time}...")    
    # ... Code ...
    end_time = time.time()
    print(f"Ended 'load_data' at {end_time}...")
    duration = end_time - start_time
    print(f"It took the code {duration} milliseconds to run.")    

    
def process_data(data):
    start_time = time.time()
    print(f"Starting 'process_data' at {start_time}...")    
    # ... Code ...
    end_time = time.time()
    print(f"Ended 'process_data' at {end_time}...")
    duration = end_time - start_time
    print(f"It took the code {duration} milliseconds to run.")    
    
# And so on...

This works, but again, it’s very repetitive. Also, if I decide that I want to see the execution time in milliseconds, and not seconds, I have to go through each function and re-implement it. Very tedious.

Consider the following solution instead:

def printer(func):
    def inner_func(a, b):
        print(f"Starting {func.__name__}...")
        result = func(a, b)
        print(f"Ending {func.__name__}...")
        return result
    return inner_func      


def timer(func):
    def inner_func(argument):
        start_time = time.time()
        result = func(argument)
        print(f"It tooks the code {time.time() - start_time} milliseconds to run.")
        return result
    return inner_func

This looks complex at first, but it’s really pretty simple. It uses the fact that functions in Python are objects, like any other element in the language. And because they’re objects, they can be passed around as arguments:

def f(func):
    """ Runs the func functions and prints 'hi' at the end """
    func()
    print("hi")
    
def print_hello():
    print("hello")
    

f(print_hello)
hello
hi

Like all objects, functions have attributes. Namely, they have the __name__ attribute which contains… their name.

print(f.__name__)
print(print_hello.__name__)
f
print_hello

Now we know we can pass functions as arguments to other functions. Let’s try to examine the printer and timer functions again.

Each of them is a function that receives a different, “unknown” function as its argument. So far, so good. It then defines another function which “wraps” the original function with some actions, like printing or timing. This inner function runs the original function and returns its result. In essence, it creates a “new implementation” of the original function that does the exact same thing, but with the wrapping functionality (printing, timing, etc.). This new function (inner_func) can replace any instance of the original function without any trouble, since in essence it just calls it. It adds a couple of statements before and after that call, but the essential functionality remains unchanged.

Lastly, the outer function, which we call the decorator, returns the inner function as its return value. So this function receives a function as its argument and returns a new, improved function as its output. To use it, we just “rename” the existing functions:

load_data_printer = printer(load_data)
load_data_timed = timer(load_data)

process_data_printer = printer(process_data)
process_data_timed = timer(process_data)

We can obviously use this timer function on any function we wish to time. When we wish to time functions in milliseconds, rather than seconds, we’ll just change this one instance of timer and be done with it, and likewise for printer.

The only small caveat here is that each inner function currently expects a fixed number of arguments - two for printer and one for timer. This implementation detail is small but very impactful: it means each decorator will only successfully decorate functions with that exact signature. To remedy this we’ll have to use *args and **kwargs:

def printer(func):
    def inner_func(*args, **kwargs):
        print(f"Starting {func.__name__}...")
        result = func(*args, **kwargs)
        print(f"Ending {func.__name__}...")
        return result
    return inner_func      


def timer(func):
    def inner_func(*args, **kwargs):
        start_time = time.time()
        result = func(*args, **kwargs)
        print(f"It tooks the code {time.time() - start_time} milliseconds to run.")
        return result
    return inner_func

We can now be sure that our functions will always run regardless of the number of inputs given to them. However, we still have to redefine all functions as we’ve seen before:

load_data_printer = printer(load_data)
load_data_timed = timer(load_data)

process_data_printer = printer(process_data)
process_data_timed = timer(process_data)

This still requires us to rename all instances of these functions in all places of the code, and when we’re done with the printing and timing, we have to rename them back.

Why not “rename” the function back to its original name?

load_data = printer(load_data)
process_data = timer(process_data)

This idiom is common enough to have a built-in language syntax:

@timer
def load_data(fname):
    # ... Code ...
    pass

load_data('fname.txt')
load_data(fname='fname.txt')
It took the code 7.152557373046875e-07 seconds to run.
It took the code 7.152557373046875e-07 seconds to run.

We can use multiple decorators for functions as well:

@printer
@timer
def process_data(data, shape):
    # ... Code ...
    pass

When we wish to stop printing and timing our function, we simply delete this decorator in the relevant places.

Decorators allow more complex calls, like calling them with arguments, but we’ll leave that topic for another day.

More Useful Built-in Modules#

The Python standard library comes with a number of built-in modules that can make your life much easier. As (neuroscience-oriented) data scientists of sorts, you’ll save yourself a lot of hassle by familiarizing yourself with some, namely collections and itertools. Let’s explore a couple of examples.

collections#

namedtuple#

When you want a tiny object with named fields, but without the hassle of creating a fully-fledged class, what you actually want is a namedtuple:

from collections import namedtuple

Point = namedtuple('Point', ['x', 'y'])
p1 = Point(0, 0)
p2 = Point(x=0, y=1)
p3 = Point(1, y=2)

print(p2)
print(p3.y)
print(p1[0])
Point(x=0, y=1)
2
0

You can access the data inside a namedtuple using either the positional index ([0]) or the name of that field (x). If all you wish to do is keep a small record of something, namedtuple is your best option.

defaultdict#

A defaultdict is a dictionary that falls back to a predefined factory function whenever it doesn’t find a key. Compare this with a regular dictionary:

d = dict(one=1, two=2)
print(d['one'])
print(d['three'])
1
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[37], line 3
      1 d = dict(one=1, two=2)
      2 print(d['one'])
----> 3 print(d['three'])

KeyError: 'three'

Rather than raising a KeyError, a defaultdict runs its predefined factory function:

from collections import defaultdict

d2 = defaultdict(list, one=1, two=2)
d2
defaultdict(list, {'one': 1, 'two': 2})

However, when we call it with an unknown key:

print(d2['one'])
print(d2['three'])
1
[]

It used the list “factory” to create a new, empty list under that key. This is particularly useful when grouping key-value pairs.

s = [('yellow', 1), ('blue', 2), ('yellow', 3), ('blue', 4), ('red', 1)]
d2 = defaultdict(list)
for k, v in s:
    d2[k].append(v)

d2
defaultdict(list, {'yellow': [1, 3], 'blue': [2, 4], 'red': [1]})

itertools#

Chained iterables#

If we wish to iterate over several iterables one after the other, we can use the chain function from the itertools module:

import itertools

chained = itertools.chain('abcd', 'efg')
for letter in chained:
    print(letter)
a
b
c
d
e
f
g

Or from an iterable of iterables:

chained_2 = itertools.chain.from_iterable([[1, 2, 3, 4], [5, 6, 7, 8]])
for number in chained_2:
    print(number)
1
2
3
4
5
6
7
8

Note that itertools functions return lazy, generator-like iterators rather than building full lists from the items they receive as input.

Permutations#

list(itertools.permutations('ABCD', 2))
[('A', 'B'),
 ('A', 'C'),
 ('A', 'D'),
 ('B', 'A'),
 ('B', 'C'),
 ('B', 'D'),
 ('C', 'A'),
 ('C', 'B'),
 ('C', 'D'),
 ('D', 'A'),
 ('D', 'B'),
 ('D', 'C')]

Combinations#

list(itertools.combinations('ABCD', 2))
[('A', 'B'), ('A', 'C'), ('A', 'D'), ('B', 'C'), ('B', 'D'), ('C', 'D')]

If you’re looking for more advanced iteration recipes, like chunking, running windows and more, take a look at the more-itertools package.
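
For instance, assuming more-itertools is installed (pip install more-itertools), chunking and running windows might look roughly like this:

import more_itertools

print(list(more_itertools.chunked(range(7), 3)))   # [[0, 1, 2], [3, 4, 5], [6]]
print(list(more_itertools.windowed(range(5), 3)))  # [(0, 1, 2), (1, 2, 3), (2, 3, 4)]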

Multiprocessing#

There are several ways to utilize parallel processing in Python. The easiest of all is multiprocessing, i.e. the use of several CPU cores to run jobs in parallel. This works best when each process is independent of the others and doesn’t need to share data with them.

A typical use case is when we have a list holding data, or filenames to where the data is, and we wish to perform the same computation on each element of that list. If this computation is truly independent, the multiprocessing module has some very easy-to-use solutions.

import multiprocessing

def add_tuple(tup):
    return tup[0] + tup[1]

tups = [(0, 1), (2, 3), (4, 5), (6, 7)]
with multiprocessing.Pool() as pool:  # can also enter the number of processes you wish to use
    result = pool.map(add_tuple, tups)
result  # [1, 5, 9, 13]

The code above doesn’t work in IPython and Jupyter (see ipyparallel for that purpose), but the general idea of using parallel processing in Python is usually something along these lines.
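
In a standalone script the same idea is typically written with an if __name__ == '__main__': guard, since on most platforms each worker process re-imports the module (a minimal sketch, with a made-up script name):

# parallel_sums.py - a hypothetical standalone script
import multiprocessing


def add_tuple(tup):
    return tup[0] + tup[1]


if __name__ == '__main__':
    tups = [(0, 1), (2, 3), (4, 5), (6, 7)]
    with multiprocessing.Pool(processes=4) as pool:
        result = pool.map(add_tuple, tups)
    print(result)  # [1, 5, 9, 13]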

Threading is Python’s weak point because of the GIL, and we’ll not discuss it in this class. Another form of parallel processing is asynchronous programming, which we’ll also not cover, but is actually one of Python’s strongest points.

Numba#

numba is a special library designed to speed up computations in Python. Let’s jump right into it and then discuss some of the magic afterwards:

from numba import jit
import numpy as np


@jit
def sum2d(arr):
    M, N = arr.shape
    result = 0.0
    for i in range(M):
        for j in range(N):
            result += arr[i,j]
    return result
/tmp/ipykernel_2234/2655857958.py:6: NumbaDeprecationWarning: The 'nopython' keyword argument was not supplied to the 'numba.jit' decorator. The implicit default value for this argument is currently False, but it will be changed to True in Numba 0.59.0. See https://numba.readthedocs.io/en/stable/reference/deprecation.html#deprecation-of-object-mode-fall-back-behaviour-when-using-jit for details.
  def sum2d(arr):
arr = np.ones((10000, 10000))

print("Numpy:")
%timeit arr.sum()
print("Numba:")
%timeit sum2d(arr)
Numpy:
36.5 ms ± 436 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Numba:
93.2 ms ± 79.9 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

The results above deserve a closer look. In this run NumPy - a library that has been optimized for ages and runs in bare C - is still a few times faster than numba. Even so, the fact that decorating a simple, perhaps simplistic, Python loop brings it within a small factor of NumPy, instead of being orders of magnitude slower as plain Python would be, is quite amazing.

This magic happens with LLVM, an open-source project that aims to build a very fast, cross-language compiler. numba translates the Python code into LLVM’s intermediate representation and lets LLVM take care of the optimization details.

Numba has more tricks up its sleeve. You can declare the input types and specify nopython=True to push performance even further:

from numba import jit, float64
import numpy as np


@jit(float64(float64[:, :]), nopython=True)
def sum2d_inps(arr):
    M, N = arr.shape
    result = np.float64(0.0)
    for i in range(M):
        for j in range(N):
            result += arr[i,j]
    return result
print("Numpy:")
%timeit arr.sum()
print("Numba:")
%timeit sum2d_inps(arr)
Numpy:
36.2 ms ± 283 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Numba:
93.4 ms ± 43.7 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

We can also use parallel looping:

from numba import njit, prange
import numpy as np


@njit([float64(float64[:, :])], parallel=True) # njit is equivalent to jit with nopython=True
def sum2d_p(arr):
    M, N = arr.shape
    result = np.float64(0.0)
    for i in prange(M): # Note range was replaced with prange
        for j in prange(N):
            result += arr[i,j]
    return result
%timeit sum2d_p(arr)  # pretty cool
25 ms ± 536 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

When implementing long computations heavily based on numpy and iteration, numba might be the way to go. For very complicated functions that use fancy linear algebra algorithms, it might be the case that numba doesn’t support these methods yet. On these occasions, resort to basic numpy functions and wait until the numba developers implement that method - or do so yourself! numba is completely open source.

Cython#

When you wish to write performant code that utilizes significant parts of the standard library alongside numpy and the scientific stack, neither numpy nor numba will fully help you. They require that you work with arrays, which are not as flexible as lists, for example. Dictionaries are also very helpful, but using them with the standard Python interpreter alone will hinder your performance considerably.

These are the cases where Cython shines. It allows you to write code with Python-like syntax and compile it ahead of time into a C source file (e.g. myfile.c), which is generated automatically. When your code calls a function that was written in Cython, it actually calls the compiled, optimized C version of that function instead.

As stated, Cython requires you to compile your code before running the parent Python script. To do that, you have to create a setup.py file that tells the Cython compiler where to find the files in question.

A Cython source file ends with the .pyx extension, so setup.py should point there. Here’s a basic example of setup.py:

from distutils.core import setup
from Cython.Build import cythonize

setup(
    ext_modules = cythonize('my_file.pyx'),
    # other setup.py options come here
)

Then you navigate with your command line to the folder containing setup.py and write python setup.py build_ext --inplace, which tells Cython to “build”, i.e. compile, the code in the .pyx file and add it inplace, i.e. to this directory.
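
For illustration only, a minimal my_file.pyx might look like this - ordinary Python syntax plus optional cdef static type declarations (the function name is made up for this example):

# my_file.pyx - translated to C and compiled by Cython
def first_n_squares(int n):
    cdef int i
    cdef list result = []
    for i in range(n):
        result.append(i * i)
    return result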

An example can be found in the cython_demo folder. Let’s see it here in action:

from cython_demo import plain_python
from cython_demo import primes_cython
plain_python.primes_python(20)
[2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61, 67, 71]
primes_cython.primes(20)
[2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61, 67, 71]
%timeit plain_python.primes_python(1000)
%timeit primes_cython.primes(1000)
22.5 ms ± 16.4 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
997 µs ± 417 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

Let’s compare NumPy’s basic performance with the same functionality implemented with numba or cython:

rands = np.random.random((1000000))

NumPy:

%timeit rands[rands < 0.5]
5.63 ms ± 29.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Numba:

@njit(parallel=True)
def filter_larger(rands):
    arr = np.zeros_like(rands)
    thresh = 0.5
    last_idx = 0
    for idx in prange(len(rands)):
        if rands[idx] < 0.5:
            arr[last_idx] = rands[idx]
            last_idx += 1
            
    return arr[:last_idx]
%timeit filter_larger(rands)
1.68 ms ± 343 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

Note that the shared last_idx variable creates a dependency between iterations, which likely hinders the performance of the parallel loop (and can even make its results unreliable).

Cython:

from cython_filter_demo import filter_array
%timeit filter_array.filter_larger_cython(rands)

Memoization (Caching)#

Yet another way to improve the performance of your scripts, perhaps a more straightforward one, is memoization. This essentially means caching (saving) the results of computations done for a given set of parameters. Every time the function is called it first checks whether the result of the operation was already computed earlier, and if so it immediately returns it rather than re-computing it all over again.

Caching is extremely easy to do in Python. The standard library includes a module called functools which contains several important functions that work on other functions, and one of them is lru_cache, which stands for “least recently used”. While it’s not the only way to do memoization in Python (there are multiple 3rd party libraries that implement fancy memoization techniques), lru_cache is usually good enough.

Using it is extremely simple:

def fib(n: int) -> int:
    """
    Returns the *n*th Fibonacci sequence element.
    
    Parameters
    ----------
    n : int
        Index of the desired Fibonacci number
        
    Returns
    -------
    int
        The value of the *n*th Fibonacci number
    """
    if n < 2:
        return n
    return fib(n-1) + fib(n-2)

The Fibonacci series is a classic example, since every computation of a new element in the sequence is built on previous calculations. The function above is a simple implementation using recursion, but it currently doesn’t cache its results, meaning that it has to re-compute all values whenever it’s called.

To cache the result we simply have to add a decorator to it:

import functools


@functools.lru_cache()
def fib(n: int) -> int:
    """
    Returns the *n*th Fibonacci sequence element.
    
    Parameters
    ----------
    n : int
        Index of the desired Fibonacci number
        
    Returns
    -------
    int
        The value of the *n*th Fibonacci number
    """
    if n < 2:
        return n
    return fib(n-1) + fib(n-2)

Let’s look at the timings:

%timeit -n1 -r1 fib(60)
35.9 µs ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
%timeit -n1 -r1 fib(61)
1.92 µs ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

Running fib(61) takes almost no time, since the result of fib(60) is already cached.
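
lru_cache also attaches a cache_info() method to the decorated function, which lets us peek at how well the cache is doing:

print(fib.cache_info())  # e.g. CacheInfo(hits=..., misses=..., maxsize=128, currsize=...)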

“Code Smells”#

We’ll now turn our attention to higher-level concepts that you should pay attention to when creating software. The term above refers to elements in your code base that represent something which isn’t wrong, but could probably be better. That’s what the “smell” means - it’s like a bad feeling about the code, but not something that will tear down your application if it remains as is.

Repetitive code#

The rule of thumb here is once you understand that some piece of code will be re-used somewhere else, immediately extract it out into a function and call that function instead. This will help you test it for correctness, document it more thoroughly and improve the readability of the piece of code using this new function.

Consider the following made-up snippets from some fantastic analysis:

import tifffile


# ...
# In the middle of some data analysis script
file_list_type_1 = ['a.tif', 'b.tif', 'c.tif']
for file in file_list_type_1:
    img = tifffile.imread(file)
    img -= img.min()
    img /= img.max()

# ...
file_list_type_2 = ['x.tif', 'y.tif', 'z.tif']
for file in file_list_type_2:
    img = tifffile.imread(file)
    img -= img.min()
    img /= img.max()

I can definitely smell something. Let’s try:

import numpy as np


def normalize_image(image: np.ndarray) -> np.ndarray:
    """ 
    Receives an image in the form of a numpy array, makes 
    it positive and normalizes it between 0 and 1.
    
    Parameters
    ----------
    image : np.ndarray
        Image to be normalized
    
    Returns
    -------
    np.ndarray
        Normalized image
    """
    image -= image.min()
    image /= image.max()
    return image

# ...
file_list_type_1 = ['a.tif', 'b.tif', 'c.tif']
for file in file_list_type_1:
    img = tifffile.imread(file)
    img = normalize_image(img)

# ...
file_list_type_2 = ['x.tif', 'y.tif', 'z.tif']
for file in file_list_type_2:
    img = tifffile.imread(file)
    img = normalize_image(img)

Even though the code in question is only two lines long, I decided to extract it into its own function. Besides the increased readability, I may notice at a later stage of my coding that this function isn’t as harmless as it seems - for example, a constant image leads to a division by zero (ZeroDivisionError-style trouble), and integer images can overflow or misbehave with in-place operations. So now the function is longer than two lines, and more tests have to be added. Coding these upgrades in the first case, where we didn’t extract the code snippet, would’ve been double the work with a higher chance for bugs.
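
A sketch of what the more careful version might look like (the exact handling of constant images is a design choice made up for this example):

def normalize_image(image: np.ndarray) -> np.ndarray:
    """Normalizes an image to the range [0, 1], guarding against edge cases."""
    image = image.astype(np.float64)  # avoid integer overflow and in-place casting issues
    image -= image.min()
    maximum = image.max()
    if maximum == 0:  # constant image - nothing to scale, avoid dividing by zero
        return image
    image /= maximum
    return image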

Long functions (with “block comments”)#

Ideally, functions should be about 10-20 lines long in total, including documentation. I.e. a single function or method shouldn’t be longer than your screen. There’s no lower-bound limit, meaning that very short functions where the code is 2-3 lines long, as seen above, are absolutely fine.

Long functions are hard to understand, hard to test, and will usually contain several blocks with distinct purposes. There’s absolutely no reason to group blocks that have different “responsibilities” into a single function. On occasions when we do write these long functions, we sometimes like to add block comments, like:

###################################################
# This part deals with reading the data into memory
###################################################
data = tifffile.imread(...)
# ...

#############################################
# Find the active areas in the processed data
#############################################
# ...

Here’s a contrived example:

def process_data(filename):
    # Checks for validity of data and reads it
    assert pathlib.Path(filename).exists()
    raw = tifffile.imread(filename)
    assert raw.ndim == 3
    assert raw.shape[0] > 1
    
    # Now we process the data
    summed = raw.sum(0)
    summed = (summed - summed.mean())
    summed /= summed.max()
    
    # ...

It should be clear to you that each of the code blocks, annotated by a comment, should be a different function.
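
For example, a possible refactoring might look like this (a sketch; the new function names are made up):

import pathlib

import tifffile


def read_valid_data(filename):
    """Checks the validity of the data file and reads it."""
    assert pathlib.Path(filename).exists()
    raw = tifffile.imread(filename)
    assert raw.ndim == 3
    assert raw.shape[0] > 1
    return raw


def normalize_summed_data(raw):
    """Sums the data over its first axis and normalizes the result."""
    summed = raw.sum(0)
    summed = summed - summed.mean()
    summed /= summed.max()
    return summed


def process_data(filename):
    raw = read_valid_data(filename)
    summed = normalize_summed_data(raw)
    # ...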

Objects that should be functions, functions that should be objects#

A “healthy” code base will contain a mix of objects (with their methods) and functions. Using only one of these programming paradigms in a medium-to-large scale project is probably not the way to go. But how do you decide whether some code should go into a function, or “deserves” its own object? Here are a couple of rules of thumb:

Long list of arguments#

Whenever an algorithm has several functions that perform a task back to back, and they all take approximately the same arguments (number of pixels in the image, filename, the data array, etc.), you should probably turn these functions into methods of an object and ditch the repeated arguments by storing them on self.

This refactoring into an object will also let you organize the code and improve its readability. As separate functions you might not remember which to call first - do I first divide_by_largest() and then find_most_popular(), or the other way around? As methods of an object you can order them inside a single, publicly available main() method which exposes the one true way to use these functions, as sketched below.
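
A sketch of such a refactoring (the class name and attributes are made up for illustration):

class PopularityPipeline:
    """Groups functions that used to pass the same arguments back and forth."""
    def __init__(self, filename, data, num_pixels):
        self.filename = filename
        self.data = data
        self.num_pixels = num_pixels
        self.result = None

    def main(self):
        """The single public entry point - fixes the correct order of the steps."""
        self._divide_by_largest()
        self._find_most_popular()
        return self.result

    def _divide_by_largest(self):
        self.data = self.data / self.data.max()  # assumes an array-like self.data

    def _find_most_popular(self):
        self.result = self.data  # placeholder for the real computation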

Objects with either one or two methods#

Usually if you have an object which has no more than a couple of methods, it’s best to just turn these methods into functions and use them instead. Objects create more boilerplate and clutter, and testing will be generally harder.
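
For example, something like this (with illustrative names) is usually better off as a plain function:

# Instead of a one-method class...
class DataSummer:
    def __init__(self, data):
        self.data = data

    def sum(self):
        return sum(self.data)

# ...a plain function does the same with less boilerplate:
def sum_data(data):
    return sum(data)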

Hard to explain#

This one might sound naive or simplistic, but it’s profoundly true. When things are engineered correctly, they are easier to explain. Try to spend time considering the purpose of each distinguishable building block of your code, how it interacts with the other pieces, what information it requires, what other parts of your code make use of this information, how else are they related, etc. With enough practice, you’ll find that the way you write code in the first place changes and requires less revisions and rewrites every time something in your workflow changes.

Too many nested levels#

Having too many nested levels in your code gives its readers a harder time - they have to remember the last condition that was met (or wasn’t), and to understand its relation to the current condition. But how do we reduce nesting? We have two main methods: early returns and “switch-like” statements.

Early returns#

BAD CODE BELOW

def f(a, b, c, d):
    if a > b:
        c = func1(a)
        if c:
            print(f"C is {c}")
            for item in c:
                d = [m for m in item if item is not None]
        else:
            return None
        return d
    else:
        c = func1(b)
        if c:
            print(f"C is {c}")
            d = []
            for item in c:
                d.append([m for m in item if item is not None])
        else:
            return None
        return d

It would be better to:

def f(a, b, c, d):
    c = func1(max(a, b))
    if not c:
        return None

    print(f"C is {c}")
    d = []
    for item in c:
        d.append([m for m in item if item is not None])
    return d

There are two main differences here:

  • The use of the built-in max() function to drop the outer if/else, since the two code paths do essentially the same thing.

  • “Early” return. Instead of asking if c: and then having a fully-indented code block below, we reverse the condition, asking if not c: return None, and then we can safely unindent the following code path, since we’re sure that c has the right value for us. It’s also easier to read: for all lines of code below the if not c check you know that c is neither False nor None, and there are no else clauses making it less obvious which condition is actually being checked at each line.

Switch-like statement in Python#

Programmers in other languages, including MATLAB, are usually aware of the switch-case statement, which allows you to choose what to do based on the specific value of some variable during runtime. For example:

def my_func1():
    return 4

def my_func2(data):
    print(f"Data is {data}")

def my_func3(data):
    pass
data = my_func1()
switch data:
    case 4:
        my_func2(data)
    case 15:
        my_func3(data)
    # etc...

Python doesn’t have a proper switch statement, but you can mimic this behavior using dictionaries! Here’s an equivalent piece of code:

switch = {4: my_func2, 15: my_func3}
data = my_func1()
switch[data](data)
Data is 4

When we access the switch dictionary at the key data, we get back the function object that was mapped there. This is like running the following statement:

a = my_func3
a
<function __main__.my_func3(data)>

The variable a is just a reference to the function. Printing it doesn’t call the function - we have to add parentheses in order for the function to be executed. And this is why we have the (data) part after switch[data]: the parentheses, with the argument inside them, make the actual function call happen.

Switch statements aren’t too common out in the wild, but sometimes they best fit your mental model of the code. When that is the case, a dictionary is a suitable replacement for the missing switch. By the way, there are libraries which try to mimic a switch in a clearer manner.

Note

Python 3.10 introduced the match statement (structural pattern matching), so if you really don’t want to use a dictionary you can go ahead and write:

def f(x):
    match x:
        case 'a':
            return 1
        case 'b':
            return 2

Software Design Principles#

The previous part dealt with low-level concepts with very clear “do’s and don’ts”. We’ll now turn to some higher-level concepts to keep in mind when designing your software. Most of the ideas presented below come from the lectures and textbooks of Robert Martin, AKA Uncle Bob, one of the founding figures of object-oriented design.

Object Orthogonality and Encapsulation#

In many cases objects interact with one another. Consider, for example, a ProcessData class that processes instances of a Data class, each of which holds a couple of Series and some metadata. ProcessData communicates with the data inside the Data instances and modifies it further.

A preliminary design might look like the following:

import numpy as np
import pandas as pd


class Data:
    """ Simple container for DataFrames and their metadata """
    def __init__(self, arr1: np.ndarray, arr2: np.ndarray, date: float):
            self.ser1 = pd.Series(arr1, dtype=np.uint8)
            self.ser2 = pd.Series(arr2, dtype=np.int16)
            self.metadata = dict(shape1=self.ser1.shape,
                                 shape2=self.ser2.shape,
                                 total=self.ser1.shape[0] + self.ser2.shape[0],
                                 date=date)

            
class ProcessData:
    """ Pipeline to process twin Data instances """
    def __init__(self, data1: Data, data2: Data):
        self.data1 = data1
        self.data2 = data2
        self.result = []
        self.metadata = dict(shape1=data1.ser1.shape,
                             shape2=data2.ser1.shape,
                             metadata=data1.metadata)
        
    def process(self):
        self.result.extend([self.data1.ser1.sum(), self.data2.ser1.sum()])
        self.result.append(self.data1.ser1.mean() + self.data2.ser2.mean())
        return self.result

We have here a Data class which serves as a container for two Series that are logically connected. It also simplifies access to some of the metadata of these Series.

We also have a ProcessData class that uses the Data instances to calculate some statistical properties and keep them for later use.

While this design works (which is important), it’s flawed in the sense that the ProcessData object is highly dependent on the implementation details of the Data class. How would you write tests for ProcessData? Many of the possible tests you may write are reliant on proper Data implementation. When higher-level objects are dependent on specific attributes of some lower-level module, we need to perform “dependency inversion”. This decoupling process can also be called “object orthogonality”.

We’ll make a couple of major changes to our design which will solve, step by step, the design issues we encountered.

First we’ll create a new DataContainer class that holds Data instances, and redefine the Data class more appropriately:

class Data:
    """ Simple container for DataFrames and their metadata """
    def __init__(self, arr1: np.ndarray, arr2: np.ndarray, date: float):
            self._ser1 = pd.Series(arr1, dtype=np.uint8)
            self._ser2 = pd.Series(arr2, dtype=np.int16)
            self._metadata = dict(shape1=self._ser1.shape,
                                 shape2=self._ser2.shape,
                                 total=self._ser1.shape[0] + self._ser2.shape[0],
                                 date=date)
        
    @property
    def data(self):
        """ Returns the actual data variables as an iterable"""
        result = [self._ser1, self._ser2]
        return result
    
    @property
    def metadata(self):
        return self._metadata
    
    def sum(self):
        return [x.sum() for x in self.data]


class DataContainer:
    """ Holds, in order, instances of Data """
    def __init__(self, datas):
        self._data = []
        self._metadata = {}
        try:
            for idx, data in enumerate(datas):
                if isinstance(data, Data):
                    self._data.append(data)
                    self._metadata[idx] = data.metadata
                else:
                    raise TypeError(f"TypeError: Data {data} isn't a 'Data' type.")
        except TypeError as e:
            print(e)
    
    @property
    def data(self):
        return self._data
    
    @property
    def metadata(self):
        return self._metadata
    
    def sum(self):
        result = []
        for data in self._data:
            result.append(data.sum())
        return result

First, note the new technical term: the @property decorator. If we define a method as a property, it can be accessed like a regular attribute, except that it’s read-only by default:

class Trial:
    def __init__(self):
        self.two_as_attr = 2
    
    def two_as_method(self):
        return 2
    
    @property
    def two_as_prop(self):
        return 2

tr = Trial()

# Changing attributes is possible:
print(f"The original attribute: {tr.two_as_attr}")
tr.two_as_attr = 3
print(f"Attributes can be changed: {tr.two_as_attr}")
print("------")

# Using the regular method requires brackets
print(f"Using the method: {tr.two_as_method()}")
print("And of course, it can't be changed (though you could override the function).")
print("------")

# Using a property "feels" like using an attribute:
print(f"As a property: {tr.two_as_prop}")  # no brackets
try:
    tr.two_as_prop = 3  # AttributeError
except AttributeError as e:
    print(f"AttributeError: {e} - properties can't (by default) be changed.")
The original attribute: 2
Attributes can be changed: 3
------
Using the method: 2
And of course, it can't be changed (though you could override the function).
------
As a property: 2
AttributeError: can't set attribute - properties can't (by default) be changed.

But besides this new, exciting feature of Python, what else has changed with the implementation?

Data:#

  1. We redefined Data. The new object doesn’t allow anyone from the outside to change the data it holds, it only allows for a “view” of the data. The use of properties ensures that once the object is created, the internal structure of the instance remains intact. The single underscore before the variable names also signals that these attributes are private and shouldn’t be accessed directly. This idea is called encapsulation.

  2. Furthermore, if we examine the sum() method, we see that it’s now bound to the Data object itself. Written explicitly it makes sense: the sum of the data is a method bound to our data - an intrinsic property of it. If we ever decide to change how our data is stored, the sum() method should change accordingly, but no other object will be affected.

DataContainer:#

  1. The new DataContainer class doesn’t really know what it’s holding. All it cares about is that they’re Data instances. It doesn’t peek inside the methods of the different Data instances.

  2. It doesn’t allow access to the list of Data instances itself; instead it exposes a data property which returns the list. If we decide to change the internal implementation of DataContainer, users of this class won’t care as long as the output of the data property stays the same. Even if the list is empty, it will always return something.
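To make these two points concrete, here’s a small, hypothetical demonstration using the refactored Data class (the values are made up):

import numpy as np

d = Data(np.arange(10), np.arange(10), date=2024.0)
print(d.metadata['total'])  # read access through the property works: 20
try:
    d.data = []             # but overwriting the property is not allowed
except AttributeError as e:
    print(e)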

Let’s see the redefined implementation of the ProcessData class:

class ProcessData:
    """ Pipeline to process twin Data instances """
    def __init__(self, data_container: DataContainer):
        self.data_container = data_container
        self.result = {}
        self.metadata = data_container.metadata
        
    def process(self):
        """ Mock processing pipeline """
        self.result['sum'] = self.data_container.sum()
        means = [ser.mean() for data in self.data_container.data
                 for ser in data.data]
        self.result['mean'] = means
        return self.result

The code snippet above is now much cleaner than the one we had before. It uses the “API” of the DataContainer in two ways: either through a fully-featured sum() method, or by (securely) accessing the underlying data through the data property and running non-standard processing on it (mean calculation in our case).
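To tie it all together, here’s a hypothetical end-to-end run of the refactored pipeline (the arrays and dates are made up):

import numpy as np

d1 = Data(np.arange(10), np.arange(20), date=2024.0)
d2 = Data(np.arange(5), np.arange(5), date=2024.5)
container = DataContainer([d1, d2])
pipeline = ProcessData(container)
print(pipeline.process())  # {'sum': [[...], [...]], 'mean': [...]}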

The downside is the added class - more code to write, more tests, more imports at the top. But the added value is tremendous. Think how easy it is to add new functionality to the pipeline. Everything is flexible, allowing us, for example, to add a new median() method to the DataContainer class, as sketched below. We can even change the internal structure of the Data class and still use the downstream classes effectively.
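One way to sketch such an extension, using hypothetical subclasses purely for illustration (the same methods could simply be added to the original classes; the container is assumed to hold MedianData instances):

class MedianData(Data):
    """ Hypothetical Data extension with a median() method """
    def median(self):
        return [x.median() for x in self.data]


class MedianDataContainer(DataContainer):
    """ Hypothetical DataContainer extension with a matching median() """
    def median(self):
        return [data.median() for data in self.data]

Downstream code that only knows about the data and metadata properties keeps working unchanged.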

Classes as Data Types and Class Methods#

Yet another fairly important use case for classes in Python is their use as user-defined types for particular data. Programming languages define the basic data types for us: floating-point numbers, integers, strings and so on. But what if (some of) our data is not composed of these primitive types? Can we construct data types of our own?

For instance, assume I’m collecting data from participants in a study I’m running, and one of the data points I’m gathering is their age. How should I encode it?

The age of a person is not an integer. It can be thought of as a floating-point number, but then being 41.9 means that your age is 41 years and almost eleven months, which isn’t obvious from just looking at 41.9, since the 9 could be interpreted as the month of September. We could try to write things like ‘41.9’ or ‘41 years and 9 months’ or ‘41.9.14’, but none of these look very good.

Instead, what we should do is write a class that defines an age. We’ll start with a simpler example of the same idea - a Grade class that validates its value - and then move on to the Age class itself:

from typing import Union


class Grade:
    """
    Represents a single grade.
    """
    def __init__(self, value: Union[int, float]):
        self._value = self._verify_grade(value)

    def _verify_grade(self, value: Union[int, float]):
        """Verifies that the given grade holds up to our standards"""
        if value < 0:
            raise ValueError('too low')
        if value > 100:
            raise ValueError('too high')
        return value

    @property
    def value(self):
        return self._value

    @value.setter
    def value(self, value):
        self._value = self._verify_grade(value)
YEARS_OUT_OF_RANGE = "Years should be a valid integer, received {years}"
MONTHS_OUT_OF_RANGE = "Months should be an integer between 1 and 12, received {months}"
DAYS_OUT_OF_RANGE = "Days should be an integer between 1 and 31, received {days}"

class Age:
    """
    Represents the age of a person.
    """
    def __init__(self, years: int, months: int = 1, days: int = 1):
        self.validate_input(years, months, days)
        self.years = years
        self.months = months
        self.days = days
    
    def validate_input(self, years: int, months: int, days: int):
        valid_years = isinstance(years, int) and 0 <= years < 150
        valid_months = isinstance(months, int) and 0 < months < 13
        valid_days = isinstance(days, int) and 0 < days < 32
        if not valid_years:
            message = YEARS_OUT_OF_RANGE.format(years=years)
            raise TypeError(message)
        if not valid_months:
            message = MONTHS_OUT_OF_RANGE.format(months=months)
            raise TypeError(message)
        if not valid_days:
            message = DAYS_OUT_OF_RANGE.format(days=days)
            raise TypeError(message)        

Now the DataFrame or array containing our data can have a column of type Age which will contain meaningful data about the person’s age. Notice how compact this class is. It doesn’t contain the ID number of the person, nor their name. All it does is encode the age. It’s important that each class we write has one specific purpose, and not more.
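A quick, hypothetical usage example of both validated types defined above (the values are made up):

g = Grade(88)
g.value = 95          # allowed - the setter re-validates the new value
try:
    g.value = 120
except ValueError as e:
    print(e)          # too high

age = Age(41, months=9)       # 41 years and 9 months - unambiguous
try:
    Age(41, months=13)        # out-of-range months fail loudly
except TypeError as e:
    print(e)          # Months should be an integer between 1 and 12, received 13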

However, we’re not quite done here. There is another possible “representation” of age and that is the date of birth. It’s quite a natural requirement that given a date of birth, a string or a datetime object, our Age class will know how to generate a proper Age instance. Similarly, given an Age instance, we should be able to generate the person’s date of birth.

The second requirement is pretty easy, we can simply make a get_dob() method that returns the date of birth. But how should we approach the first requirement, of instantiating an Age from a given date? Let’s try to refactor our class:

import datetime


class Age:
    """
    Represents the age of a person.
    """
    
    cur_year = datetime.date.today().year
    
    @classmethod
    def from_str(cls, date_str):
        """ Instantiate from a string containing a date in the standard ISO format. """
        try:
            date = datetime.date.fromisoformat(date_str)
        except ValueError:
            raise
        else:    
            return cls(cls.cur_year - date.year, date.month, date.day)
        
    @classmethod
    def from_dob(cls, dob):
        """ Instatiates from a datetime.date or a datetime.datetime object """
        try:
            years = dob.year
            months = dob.month
            days = dob.day
        except AttributeError:
            print(f"Input should be a datetime.datetime or a datetime.date instance. Received {dob} which is a {type(dob)}.")
            raise
        else:
            return cls(years, months, days)    
    
    def __init__(self, years: int, months: int = 1, days: int = 1):
        self.validate_input(years, months, days)
        self.years = years
        self.months = months
        self.days = days
        
    def __str__(self):
        return f"Age(years={self.years}, months={self.months}, days={self.days})"

    def validate_input(self, years: int, months: int, days: int):
        valid_years = isinstance(years, int) and 0 <= years < 150
        valid_months = isinstance(months, int) and 0 < months < 13
        valid_days = isinstance(days, int) and 0 < days < 32
        if not valid_years:
            message = YEARS_OUT_OF_RANGE.format(years=years)
            raise TypeError(message)
        if not valid_months:
            message = MONTHS_OUT_OF_RANGE.format(months=months)
            raise TypeError(message)
        if not valid_days:
            message = DAYS_OUT_OF_RANGE.format(days=days)
            raise TypeError(message)       
        
    def get_dob(self):
        """ Returns the date of birth """
        return datetime.date(self.cur_year - self.years, self.months, self.days)
age = Age(42, 11, 1)
age.get_dob()
age2 = Age.from_str('2001-04-05')
print(age2)
Age(years=23, months=4, days=5)
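Similarly, a hypothetical instantiation from an existing datetime.date of birth (the exact years value depends on the current year):

import datetime

dob = datetime.date(1990, 7, 15)
age3 = Age.from_dob(dob)
print(age3)            # e.g. Age(years=34, months=7, days=15)
print(age3.get_dob())  # datetime.date(1990, 7, 15)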

Typestates#

Typestates are a way to enforce the state of our data/application with strict types.

Let’s assume I have 24 human volunteers in a combined fMRI + questionnaire study. I keep them all in a single DataFrame for brevity and ease-of-use, but in effect they’re in different stages of my experiment. A few were just recruited last week, and I haven’t even set a date for our first meeting. A few others were already scanned in the magnet once, but still have to go through my second questionnaire session.

My application monitors these volunteers, alerts me to upcoming meeting dates, and (of course) analyzes the results of the questionnaires and scans.

The correctness of this application can be enforced in many ways - tests, mock data, daily use - but here I choose to show another mechanism - typestates. The fact that the current status of each volunteer isn’t specified with a simple string in a table, but is actually a different class altogether, is another way to make sure that I always receive the expected output from each method call.

import copy
import datetime
import pandas as pd


# Helper types
class Name:
    """ First and last name """
    # Implementation omitted


class Age:
    """ Special age type """
    # Implementation omitted


class FmriResult:
    """ Results from an fMRI scan """
    # Implementation omitted


# Volunteer types    
class Volunteer:
    """ Base class for all volunteers in my project """
    def __init__(self, name: Name, age: Age, call_date: datetime.time, vol_id: int):
        self.name = name
        self.age = age
        self.call_date = call_date
        self.id = vol_id
        
    def __str__(self):
        return f"{self.name}, age {self.age}, first called at {self.call_date}."
        
    def update_df(self, records: pd.DataFrame):
        """ Add the instance to the dataframe containing the rest of the data """
        record = pd.DataFrame([self.name, self.age, self.call_date,
                               self.id, self.metadata, type(self), copy.copy(self)])
        records = pd.concat([records, record])
        return records

    def remove_from_df(self, records: pd.DataFrame):
        """ Remove the instance from the volunteer records """
        records = records[records.id != self.id]
        return records

    
class PreScanOne(Volunteer):
    """ Volunteer before the first session """
    loc = 0  # ordinal place in hierarchy
    
    def __init__(self, name: Name, age: Age, call_date: datetime.time, vol_id: int, 
                 scan_one_date: datetime.time):
        super().__init__(name, age, call_date, vol_id)
        self.metadata = dict(scan_one_date=scan_one_date)
        
    def advance(self, result: FmriResult, next_date: datetime.time):
        """ Advance a PreScanOne to a PostScanOne """
        new = PostScanOne(self, result, next_date)
        return new
    

class PostScanOne(Volunteer):
    """ Volunteer after the first session """
    loc = 1
    
    def __init__(self, pre_volunteer: PreScanOne, scan_one_data: FmriResult, 
                 scan_two_date: datetime.time):
        super().__init__(pre_volunteer.name, pre_volunteer.age, pre_volunteer.call_date, pre_volunteer.id)
        self.metadata = pre_volunteer.metadata
        self.metadata['scan_one_data'] = scan_one_data
        self.metadata['scan_two_date'] = scan_two_date
    
    def advance(self, result: FmriResult, next_date: datetime.time):
        """ Advance a PostScanOne to a PreScanTwo """
        new = PreScanTwo(self, result, next_date)
        return new


# Examples of generic methods that use this interface
def advance_volunteer(old_vol, results: FmriResult, next_date: datetime.time, records: pd.DataFrame):
    """
    Move the volunteer to the next step in the experiment, returning the new
    instance and the updated records.
    """
    records = old_vol.remove_from_df(records)
    new_vol = old_vol.advance(results, next_date)
    records = new_vol.update_df(records)
    return new_vol, records


def process_data(records):
    """ Run the same processing function over all fMRI data """
    results = []
    for vol in records:
        try:
            results.append(vol.process_data())
        except AttributeError:  # instance doesn't have data
            pass
    return results

This is long, but interesting, so let’s try to break it down.

At the beginning we have a few helper classes which I merely declared, but didn’t implement. These shouldn’t look strange to you. We talked during class about how an Age type is an important example of defining our own types in a program, since it’s neither an integer nor a floating-point number.

The second part is the most interesting. We have a base class called Volunteer which contains basic information common to all experiment volunteers. But it’s actually more than that - it also defines the interface between the classes: it forces them to have specific attributes that comply with this protocol, linking their behavior together.

The other two classes inherit from Volunteer and represent the first two steps in the “Volunteer path”; the loc class variable signifies that. From phase one (PreScanOne) a volunteer can only advance forward (or drop out of the experiment) to step 2, and likewise from step 2 to 3 - you’ll always find the same .advance() method that takes you to the next step, even though the implementation is slightly different. To handle the variability in the held data, we have the metadata attribute, which can hold different parameters and data points.

The last part shows how to use such an interface. We have a function that advances an instance of a class “one step” to the next phase, a function that runs some processing on the data held inside the instances, and we can add as many functions (and classes) as we wish. It’s completely extensible since the API is well-defined.
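Here’s a short, hypothetical walk-through of the flow (all dates and results are made up):

vol = PreScanOne(name=Name(), age=Age(), call_date=datetime.time(9, 0),
                 vol_id=1, scan_one_date=datetime.time(10, 30))
print(type(vol).__name__)  # PreScanOne
# After the first scan, advancing returns an instance of the *next* type
vol = vol.advance(FmriResult(), next_date=datetime.time(11, 0))
print(type(vol).__name__)  # PostScanOne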

Helper Concepts and Libraries#

In practice, good and clear software design can be aided by unique Python features and packages. We’ll review a few of the more prominent ones:

Code formatting, styling and linting#

As you remember, from day one I insisted that your code should have a very specific look, as defined in PEP8, the official document describing how to style your Python. Happily enough, there are a few tools that can automate this work for us, the most famous one being black, which can be run from either the command line or directly from VSCode. I’ll show how to set it up after we review a few other libraries of the same type below.

Type Annotations and MyPy#

Since version 3.6, Python allows this syntax:

from typing import Tuple, Dict

def doer_of_stuffs(a: float, b: int, c: str = 'ccc') -> Tuple[str, Dict[int, float]]:
    """
    Does stuff to a, b, and c.
    Returns: A tuple of a string and a dictionary mapping ints to floats
    """
    a_helper: float = a + 2
    b_helper: float = b / 3
    int_a = int(a_helper)
    c2: str = c + c
    return c2, {b: a_helper, int_a: b_helper}

While a bit more verbose, these type annotations make things clearer when dealing with large codebases. Knowing the defined type of your variables as they bounce around between modules and functions can help with the debugging process of your code tremendously.

Moreover, modern IDEs like PyCharm and VSCode will alert you to possible type errors before you even run the code. For example:

import numpy as np


def main():
    a = 3  # integer
    a /= 2  # now it's a float
    arr = np.array([1, 2, 3])

    # ... lots of code here

    b = arr[a]  # error - an array can't be indexed with a float

VSCode will mark this arr[a] expression and try to prevent you from running this code.

A more comprehensive approach is mypy, which was developed at Dropbox, a company very reliant on its Python-based product. When the Dropbox codebase grew, its engineers wanted to keep using Python due to its amazing features, but avoid the problems that come with a dynamically-typed language. Thus, mypy was born. In essence, it’s a command-line tool that runs type checks on your entire code base, verifying the type-correctness of your application. In many places a clean mypy error log is required before committing changes to the code base.

mypy supports both comment-based type annotations for older versions of Python (Dropbox used Python 2.7 until 2019) and the new style of type annotations shown above. It can also generate type annotations on the fly, using PyAnnotate, while you run your application.
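As a rough, hypothetical illustration (the file name and the exact wording of the output are approximate), here’s a snippet that a mypy run would flag:

# contents of a hypothetical example.py
def greet(name: str) -> str:
    return "Hello, " + name

greet(42)  # `mypy example.py` would report something like:
           # example.py:5: error: Argument 1 to "greet" has incompatible type "int"; expected "str"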

Linters (pylint, flake8)#

The last tools we’ll discuss are linters, which check the correctness of your code in several key aspects. First, they point out violations of simple PEP8 rules, like improper variable naming. More importantly, they’ll make sure that your code is in a runnable state - that you’re not using variables you haven’t declared, or libraries you haven’t imported, and so on. Just like black and mypy, these tools can also be configured to work with VSCode.

Enumerations#

Enumeration support was added to Python in version 3.4, and it’s starting to pop up more and more in new code bases. An enumeration is a list of discrete possible values. Assume I have a simple addition function:

def add_or_sub(a, b, add=True):
    """ Simple addition\subtraction """
    return a + b if add else a - b

The list of possible values for a and b is endless, so these cannot be enumerated. The add keyword is called a “flag”, since it has two possible values - True and False. It’s an enumeration of two possible values.

When we have more than two options, or when our two options aren’t simply booleans, we can use an enumeration. Here’s a simple example:

from enum import Enum


class Color(Enum):
    RED = 'r'
    GREEN = 'g'
    BLUE = 'b'
    BLACK = 'k'

Each of the Enum’s attributes now has a name and a value:

print(Color.RED.name)
print(Color.RED.value)
RED
r

In the “real world” enumerations aren’t that common yet, largely because they were introduced relatively late. But a use-case could look like the following:

import pandas as pd


rng = pd.date_range('1/1/2018',periods=100, freq='D')  # 'D' is days
rng
DatetimeIndex(['2018-01-01', '2018-01-02', '2018-01-03', '2018-01-04',
               '2018-01-05', '2018-01-06', '2018-01-07', '2018-01-08',
               '2018-01-09', '2018-01-10', '2018-01-11', '2018-01-12',
               '2018-01-13', '2018-01-14', '2018-01-15', '2018-01-16',
               '2018-01-17', '2018-01-18', '2018-01-19', '2018-01-20',
               '2018-01-21', '2018-01-22', '2018-01-23', '2018-01-24',
               '2018-01-25', '2018-01-26', '2018-01-27', '2018-01-28',
               '2018-01-29', '2018-01-30', '2018-01-31', '2018-02-01',
               '2018-02-02', '2018-02-03', '2018-02-04', '2018-02-05',
               '2018-02-06', '2018-02-07', '2018-02-08', '2018-02-09',
               '2018-02-10', '2018-02-11', '2018-02-12', '2018-02-13',
               '2018-02-14', '2018-02-15', '2018-02-16', '2018-02-17',
               '2018-02-18', '2018-02-19', '2018-02-20', '2018-02-21',
               '2018-02-22', '2018-02-23', '2018-02-24', '2018-02-25',
               '2018-02-26', '2018-02-27', '2018-02-28', '2018-03-01',
               '2018-03-02', '2018-03-03', '2018-03-04', '2018-03-05',
               '2018-03-06', '2018-03-07', '2018-03-08', '2018-03-09',
               '2018-03-10', '2018-03-11', '2018-03-12', '2018-03-13',
               '2018-03-14', '2018-03-15', '2018-03-16', '2018-03-17',
               '2018-03-18', '2018-03-19', '2018-03-20', '2018-03-21',
               '2018-03-22', '2018-03-23', '2018-03-24', '2018-03-25',
               '2018-03-26', '2018-03-27', '2018-03-28', '2018-03-29',
               '2018-03-30', '2018-03-31', '2018-04-01', '2018-04-02',
               '2018-04-03', '2018-04-04', '2018-04-05', '2018-04-06',
               '2018-04-07', '2018-04-08', '2018-04-09', '2018-04-10'],
              dtype='datetime64[ns]', freq='D')
rng = pd.date_range('1/1/2018',periods=100, freq='M')  # it can also be 'M'
rng
DatetimeIndex(['2018-01-31', '2018-02-28', '2018-03-31', '2018-04-30',
               '2018-05-31', '2018-06-30', '2018-07-31', '2018-08-31',
               '2018-09-30', '2018-10-31', '2018-11-30', '2018-12-31',
               '2019-01-31', '2019-02-28', '2019-03-31', '2019-04-30',
               '2019-05-31', '2019-06-30', '2019-07-31', '2019-08-31',
               '2019-09-30', '2019-10-31', '2019-11-30', '2019-12-31',
               '2020-01-31', '2020-02-29', '2020-03-31', '2020-04-30',
               '2020-05-31', '2020-06-30', '2020-07-31', '2020-08-31',
               '2020-09-30', '2020-10-31', '2020-11-30', '2020-12-31',
               '2021-01-31', '2021-02-28', '2021-03-31', '2021-04-30',
               '2021-05-31', '2021-06-30', '2021-07-31', '2021-08-31',
               '2021-09-30', '2021-10-31', '2021-11-30', '2021-12-31',
               '2022-01-31', '2022-02-28', '2022-03-31', '2022-04-30',
               '2022-05-31', '2022-06-30', '2022-07-31', '2022-08-31',
               '2022-09-30', '2022-10-31', '2022-11-30', '2022-12-31',
               '2023-01-31', '2023-02-28', '2023-03-31', '2023-04-30',
               '2023-05-31', '2023-06-30', '2023-07-31', '2023-08-31',
               '2023-09-30', '2023-10-31', '2023-11-30', '2023-12-31',
               '2024-01-31', '2024-02-29', '2024-03-31', '2024-04-30',
               '2024-05-31', '2024-06-30', '2024-07-31', '2024-08-31',
               '2024-09-30', '2024-10-31', '2024-11-30', '2024-12-31',
               '2025-01-31', '2025-02-28', '2025-03-31', '2025-04-30',
               '2025-05-31', '2025-06-30', '2025-07-31', '2025-08-31',
               '2025-09-30', '2025-10-31', '2025-11-30', '2025-12-31',
               '2026-01-31', '2026-02-28', '2026-03-31', '2026-04-30'],
              dtype='datetime64[ns]', freq='M')

What are the possible values for the freq keyword? Day is D, month is M, year will probably be Y. Are there any more options? Will a lowercase d also work, or do I have to use a capital D? Checking the official documentation doesn’t immediately answer these questions.

This is where enumerations come into play. Things would be much simpler if we could only choose a value from a predefined list of possible options:

class DateRangeFreq(Enum):
    D = 'days'
    M = 'months'
    Y = 'years'

rng = pd.date_range('1/1/2018',periods=100, freq=pd.DateRangeFreq.D)  # doesn't actually work...

If we were unsure of the available parameters, we could import the DateRangeFreq object and inspect its possible values. As you can see, each key has a value associated with it. This value can be an integer, string or even a Python object.
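For instance, a hypothetical inspection of the DateRangeFreq enumeration sketched above:

for freq in DateRangeFreq:
    print(freq.name, '->', freq.value)
# D -> days
# M -> months
# Y -> years

print(DateRangeFreq('days'))  # members can also be looked up by value: DateRangeFreq.D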

Enumerations are gaining popularity in the Python ecosystem. They’re a simple feature that’s easy to adopt and increases the robustness of your codebase.

attrs - Classes without boilerplate#

Python classes are extremely useful, but they’re also pretty verbose. They require you to write a lot of code for very basic operations.

For example, in the __init__() method you have to go through each variable in the function signature and assign it to an instance attribute:

class Example:
    def __init__(self, param1, param2, param3='no', param4=None):
        self.param1 = param1
        self.param2 = param2
        self.param3 = param3
        self.param4 = param4

    def my_method(self):
        """ Do stuff """
        pass

So many lines of repetitive code doing basically nothing. I didn’t assert the types of the variables, I didn’t do any basic pre-processing - this is called “boilerplate” code. Python requires me to write these tedious lines every time I create a class, and as classes grow bigger, these assignments become a hassle to write.

attrs to the rescue:

import attr
from attr.validators import instance_of


@attr.s
class ExampleTwo:
    param1 = attr.ib(validator=instance_of(int))
    param2 = attr.ib(validator=instance_of(float))
    param3 = attr.ib(default='no')
    param4 = attr.ib(default=attr.Factory(list))
    
    def my_method(self):
        """ Do stuff """
        pass


@attr.s(auto_attribs=True)
class ExampleThree:
    param1: int
    param2: float
    param3: str = 'no'
    param4: list = attr.Factory(list)
a = ExampleTwo(1, 2., 'a', [4, 5, 6])
a
ExampleTwo(param1=1, param2=2.0, param3='a', param4=[4, 5, 6])
b = ExampleTwo(1.1, 2., 'a', [4, 5, 6])
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[92], line 1
----> 1 b = ExampleTwo(1.1, 2., 'a', [4, 5, 6])

File <attrs generated init __main__.ExampleTwo>:10, in __init__(self, param1, param2, param3, param4)
      8     self.param4 = __attr_factory_param4()
      9 if _config._run_validators is True:
---> 10     __attr_validator_param1(self, __attr_param1, self.param1)
     11     __attr_validator_param2(self, __attr_param2, self.param2)

File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/attr/validators.py:106, in _InstanceOfValidator.__call__(self, inst, attr, value)
     99 if not isinstance(value, self.type):
    100     msg = "'{name}' must be {type!r} (got {value!r} that is a {actual!r}).".format(
    101         name=attr.name,
    102         type=self.type,
    103         actual=value.__class__,
    104         value=value,
    105     )
--> 106     raise TypeError(
    107         msg,
    108         attr,
    109         self.type,
    110         value,
    111     )

TypeError: ("'param1' must be <class 'int'> (got 1.1 that is a <class 'float'>).", Attribute(name='param1', default=NOTHING, validator=<instance_of validator for type <class 'int'>>, repr=True, eq=True, eq_key=None, order=True, order_key=None, hash=None, init=True, metadata=mappingproxy({}), type=None, converter=None, kw_only=False, inherited=False, on_setattr=None, alias='param1'), <class 'int'>, 1.1)

That’s it. No __init__ is required - each paramX variable is already assigned to self.paramX. attrs also allows the addition of validators, default values and converter functions (not shown), and it even implements the comparison methods (__eq__, __gt__, etc.) for you. It has a ton of other useful features which I won’t go into right now, but you can be sure it’s a package worth using.
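For example, the generated __repr__ and __eq__ can be seen directly (a small sketch reusing the ExampleTwo class from above):

a = ExampleTwo(1, 2., 'a', [4, 5, 6])
b = ExampleTwo(1, 2., 'a', [4, 5, 6])
print(a == b)                  # True - the generated __eq__ compares field by field
print(a == ExampleTwo(2, 2.))  # False - param3 and param4 fall back to their defaults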

Dimensionality analysis and units#

When working with numbers that have units, it’s usually a good idea to keep the physical unit attached to the value itself, as close to the data as possible.

When you’re measuring the local field potential using some electrode array, it’s good practice to verify that throughout the entire processing pipeline the voltage values aren’t divided by a number with units of time, because units of [Volts] / [seconds] usually have no physical meaning in that context. It can also help you assert that your dF/F calculation is indeed dimensionless, and doesn’t carry some other arbitrary units.

There are many options in the Python world for dimensionality analysis. If you’re using Python to write symbolic math and solve equations, I suggest you use SymPy’s physics.units module. Otherwise, use pint.

import numpy as np
import pint


ureg = pint.UnitRegistry()
3 * ureg.meter + 4 * ureg.cm
3.04 meter
measures = ureg.Quantity(np.random.random(10), 'volts')
print(measures)
[0.3586641401590277 0.5534316826292836 0.4610886120447638 0.7014110975659558 0.2477853570321078 0.6177401323911661 0.7653688037377275 0.6458052803670696 0.39673205247474685 0.8256946643857462] volt
print(measures * 2)
[0.7173282803180554 1.106863365258567 0.9221772240895276 1.4028221951319115 0.4955707140642156 1.2354802647823322 1.530737607475455 1.2916105607341393 0.7934641049494937 1.6513893287714925] volt
amps = measures / (2 * ureg.ohm)  # I = V/R
amps.dimensionality
<UnitsContainer({'[current]': 1})>
amps.to('seconds')  # DimensionalityError
---------------------------------------------------------------------------
DimensionalityError                       Traceback (most recent call last)
Cell In[97], line 1
----> 1 amps.to('seconds')  # DimensionalityError

File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/pint/facets/plain/quantity.py:528, in PlainQuantity.to(self, other, *contexts, **ctx_kwargs)
    511 """Return PlainQuantity rescaled to different units.
    512 
    513 Parameters
   (...)
    524 pint.PlainQuantity
    525 """
    526 other = to_units_container(other, self._REGISTRY)
--> 528 magnitude = self._convert_magnitude_not_inplace(other, *contexts, **ctx_kwargs)
    530 return self.__class__(magnitude, other)

File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/pint/facets/plain/quantity.py:477, in PlainQuantity._convert_magnitude_not_inplace(self, other, *contexts, **ctx_kwargs)
    474     with self._REGISTRY.context(*contexts, **ctx_kwargs):
    475         return self._REGISTRY.convert(self._magnitude, self._units, other)
--> 477 return self._REGISTRY.convert(self._magnitude, self._units, other)

File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/pint/facets/plain/registry.py:922, in PlainRegistry.convert(self, value, src, dst, inplace)
    919 if src == dst:
    920     return value
--> 922 return self._convert(value, src, dst, inplace)

File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/pint/facets/context/registry.py:392, in ContextRegistry._convert(self, value, src, dst, inplace)
    388             src = self._active_ctx.transform(a, b, self, src)
    390         value, src = src._magnitude, src._units
--> 392 return super()._convert(value, src, dst, inplace)

File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/pint/facets/nonmultiplicative/registry.py:205, in NonMultiplicativeRegistry._convert(self, value, src, dst, inplace)
    200     raise DimensionalityError(
    201         src, dst, extra_msg=f" - In destination units, {ex}"
    202     )
    204 if not (src_offset_unit or dst_offset_unit):
--> 205     return super()._convert(value, src, dst, inplace)
    207 src_dim = self._get_dimensionality(src)
    208 dst_dim = self._get_dimensionality(dst)

File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/pint/facets/plain/registry.py:954, in PlainRegistry._convert(self, value, src, dst, inplace, check_dimensionality)
    951     # If the source and destination dimensionality are different,
    952     # then the conversion cannot be performed.
    953     if src_dim != dst_dim:
--> 954         raise DimensionalityError(src, dst, src_dim, dst_dim)
    956 # Here src and dst have only multiplicative units left. Thus we can
    957 # convert with a factor.
    958 factor, _ = self._get_root_units(src / dst)

DimensionalityError: Cannot convert from 'volt / ohm' ([current]) to 'second' ([time])

For some projects this can be overkill, but for others it can save many “silent” bugs.
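As a final, hypothetical sketch of the dF/F point from above - pint can confirm that a ratio of two like-unit quantities is dimensionless:

import numpy as np

f = ureg.Quantity(np.random.random(10), 'lumen')   # made-up fluorescence trace
f0 = ureg.Quantity(np.random.random(10), 'lumen')  # made-up baseline
dff = (f - f0) / f0
print(dff.dimensionless)  # True - the units cancel out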

Design vs. Productivity#

Before we start exercising, one important note: there’s a thin line between under- and over-engineering. Very small scripting projects require almost no engineering at all. After you gain a few extra months of experience in Python, the structure of the code for a small scripting job might be obvious to you right from the get-go: you’ll know which data structures you’ll need, whether or not you’ll want a class or two, and how the user interface might look.

On the other hand, large applications which span at least a few thousand lines of code will always need some form of pre-planning. It would be senseless not to sketch a diagram of the main modules in your code and their interfaces. One can consider this common knowledge, or simple programmer’s instinct. Just like architects sit down and plan the construction months in advance, programmers should spell out the architecture of their own programs. This in no way guarantees you’ll get the architecture right the first time, but the design will serve as a good starting point when you begin refactoring.

Problems mostly occur when you write medium-sized scripts, up to a couple of thousand lines. These scripts usually start out small - a few functions that deal with file I/O and displaying data - but can grow quite quickly once you start adding functionality. When the script was short you probably didn’t even write tests, since you were sure you were handling some insignificant piece of code, and now it starts biting back at you.

It’s hard to write rules for these occasions. When someone asks me for improved functionality in some short script I wrote, I sometimes tell them it will take more time than they’d expect, since I want to devote time to refactoring the code, adding tests, and making the new functionality feel natural inside it.

It’s also good practice to use classes to bind data and methods together, even when you think they might be overkill. It’s much easier to expand the functionality of classes than of an assortment of functions.