Class 8a: Testing, Test-Driven Development#

Introduction#

We’ve mentioned how writing object-oriented code might be considered an industry standard (in terms of code quality), and yet in practice it remains a somewhat niche skill in the academic environment. Another commonly neglected industry practice is testing. In this lesson, we will learn the basic concepts underlying writing tests for your code in Python.

The Implications of Writing Untested Code#

This is a somewhat sensitive subject, and while beyond the scope of the course material, I wanted to share one relatively neutral example:

This article from 2018, checking how a dynamic image of a circle affects the understanding of speech, employed many modern open-science practices, as described here: pre-registration, data availability, source code availability, and even peer replication. Unfortunately, the article as published in 2018 is completely wrong. The original conclusion was that these dynamic circles improved the understanding of what was said, while the truth is they worsened it. The fault lay in a very simple mistake in the source code, still visible thanks to the author’s comment (line ~25 in the link above), which flipped the results to their opposite.

Looking at it after the correction, every responsible programmer would immediately know this code would never be accepted into any code base in any company, i.e. it would never make it to “production”. In many respects, the code violates numerous basic programming practices, e.g.:

  1. Everything is one large script.

  2. Variables are given uninformative and confusing names.

  3. Too many comments, zero functions.

  4. Lots of commented out lines of code.

In my opinion, it is important we learn to value our research and the responsibility assumed in publishing it no less than any company would with its products.

Public Trust in Research#

Another famous example, from the COVID-19 outbreak, is the code used for official epidemiological simulations in the UK. The code was made public and turned out to be a huge mess. It even prompted a huge discussion all across the internet (and in the repo itself), with people urging officials not to take any action suggested by the results of this simulation.

The “truth” of the matter, if there is such a thing, is much more complex. The code - originally a 15k LOC .c file - revolves around a model that was created in 2005 and used in many important scientific publications. These publications were peer-reviewed, and the articles themselves are considered very helpful and important in their respective fields. The code had very few tests, but the scientists who wrote it “tested” it in a more intuitive manner - they gave it simple inputs for which they knew the expected answer. They also visualized many of the results the model generated to make sure nothing suspicious was happening.

The scientists released their code to the public with good intentions, advocating “open science”, but received a very unwelcoming response from the community, especially from software engineers. These types of reactions really don’t encourage more scientists and labs to release their code to the public space, which ultimately is the right thing to do. In this case, many people started helping with the refactoring process, including people like John Carmack. This process uncovered bugs and made the code more readable and hopefully a bit more trustworthy, but there’s still much more work to do.

So, how can we do better? There are basic, simple tools we can acquire as researchers that make a large difference. We’ve touched on how we should design our code and mentioned concepts such as parameterization and encapsulation. We’ve also emphasized the importance of readability and expansive documentation. Another one of these tools is testing, and if you learn to integrate it adequately into your coding process it will dramatically elevate the quality of your work.

Testing#

Tests are (usually) short pieces of code designed to assert that a small portion of your program does what you intend it to do.

For example, if I have a class designed to perform calculations on some dataframe that was created elsewhere - perhaps by adding some of its columns together, averaging them and displaying the result - then I wish for my code to be correct and deterministic, in the sense that a single, defined input will always give the same correct output.

This isn’t trivial even for the most basic functions in Python. Let’s look at a simple normalization function:

import numpy as np

def normalize_array(a):
    """Normalizes array"""
    minimum = min(a)
    no_min = a - minimum
    return no_min / max(no_min)

We’ll even check to see that it works:

normalize_array(np.array([1, 2, 3]))
array([0. , 0.5, 1. ])
normalize_array(np.array([0, 100, 250]))
array([0. , 0.4, 1. ])

Amazing! We can move on to our next function.

…Or can we?

What issues does this function have? We’ll start from the more basic ones, and slowly dive deeper.

  1. The docstring is lacking, to say the least.

  2. Variable names are awful.

  3. It uses the built-in Python min and max functions instead of numpy’s. This hurts performance, and it also behaves differently for certain inputs (see the example after this list).

  4. Many different input values will raise warnings or exceptions, or worse - won’t raise any error and will simply return the wrong result.

  5. It doesn’t take data types into consideration.

  6. This functionality exists in a few pretty popular libraries - why not use them instead of re-inventing the wheel?
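
To make point 3 concrete, here is a small sketch (with arbitrary values) of one behavioral difference between the built-in min() and numpy’s own reductions:

import numpy as np

arr_2d = np.array([[1, 2], [3, 4]])

# numpy's reduction considers every element (or a chosen axis):
print(arr_2d.min())  # 1

# The built-in min() iterates over the first axis and tries to compare whole
# rows, which raises "ValueError: The truth value of an array with more than
# one element is ambiguous".
try:
    min(arr_2d)
except ValueError as err:
    print(err)

On top of this, the built-in functions loop in pure Python, so they are also much slower than numpy’s vectorized reductions on large arrays.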

Now, you might have written this function differently - perhaps you remembered to use numpy’s version of min and max, and perhaps you weren’t so lazy and wrote a decent docstring - but I’m pretty sure most of you wouldn’t have taken all of the points above into consideration when writing this function.

The importance of this function is also pretty clear: it probably sits somewhere in the heart of our data processing pipeline, and if there’s a chance it will sometimes return the wrong result, then our resulting computation is in jeopardy. The only way I know of solving this issue is by writing tests, so let’s do that.

You’ve already seen unit tests in your homework assignments. You were asked to make sure that your solution passes all tests before submitting. This exemplifies a way through which we can be more certain that our program does what we thought it was doing.

Tests are important to us for two reasons. The first one is that even simple programs are more complicated than we think. The normalization function is a good example, but you can imagine that as soon as we add interfaces between classes, methods and functions, things might get a bit messy. For example, in the aviation industry, for each line of application code you may find about 8 lines dedicated to testing it.

Moreover, when we deal with user input (e.g. data files, parameters for some script, etc.) we should expect the unexpected, even if the main user is us. Our future selves, a few months from now, will probably not remember the type of every parameter they have to enter.

The second reason is Python’s dynamic nature, or duck typing. If you want to force a function to only accept inputs of a single type, you must be the one writing these assertions, either outside or inside the function. For example, a function that adds two numbers needs an isinstance(value, (int, float)) check somewhere near its top to avoid these mistakes. Statically-typed languages, like C, define a type for each variable; a function adding two integers simply cannot accept a non-integer input.

Python’s dynamic nature is a blessing on many occasions, but it can sometimes be a real pain, and it is the second important reason to write tests for our code. Many cases that in other programming languages would have resulted in a simple TypeError can cause major bugs in Python due to wrong input types.
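
As a minimal sketch of how duck typing lets wrong types slip through silently (the functions here are illustrative, not part of the lesson’s pipeline):

def add(a, b):
    return a + b


# Python happily "adds" strings too - no error, just a silently wrong result:
add(2, 3)      # 5
add("2", "3")  # '23'


def add_checked(a, b):
    # A guarded version: fail loudly when the inputs aren't numbers.
    if not isinstance(a, (int, float)) or not isinstance(b, (int, float)):
        raise TypeError(f"Expected two numbers, got {type(a)} and {type(b)}")
    return a + b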

[Image: QA Engineer]

The Solution#

As I’m sure you’ve gathered by now, we want you to write tests. Let’s examine the iterative process that produces both code and tests. We’ll come back to the min-max scaling example later; we’ll start with a simpler function. For now, let’s assume we have two positive integers we wish to add together, perhaps the results of some complicated pipeline, which means we would also like to validate that this condition is in fact satisfied.

What would our code look like? What about something like:

NEGATIVE_INPUT_MESSAGE = "Invalid input: ({value_1}, {value_2})! Negative result detected."


def combine_results(value_1: int, value_2: int) -> int:
    """
    Returns the result of adding the two preceding analyses results together.
    
    Parameters
    ----------
    value_1, value_2 : int
        Positive integers
    
    Returns
    -------
    int
        Resulting positive integer
    
    Raises
    ------
    ValueError
        If either number is negative 
    """
    if value_1 < 0 or value_2 < 0:
        message = NEGATIVE_INPUT_MESSAGE.format(value_1=value_1, value_2=value_2)
        raise ValueError(message)
    return value_1 + value_2

That’s a good start! We defined the input and output types and even added a check that verifies the numbers are indeed positive. The next step, after I have a working skeleton of my implementation, is to test it. To do so I’ll start writing a few tests:

def test_basic_addition():
    result = combine_results(2, 4)
    expected = 6
    assert result == expected

    
def test_large_number_additions():
    value_1 = 1_000_000
    value_2 = 2_000_001
    expected = value_1 + value_2
    result = combine_results(value_1, value_2)
    assert result == expected
    
test_basic_addition()
test_large_number_additions()

Note

Test function names should start with test_ - this is how pytest discovers them.

Above you see an example of using assertions in tests, which works seamlessly with pytest, the library we’ve been using to run tests in the homework assignments. Pytest looks for these asserts and executes them in a smart manner, which helps us in writing very concise and clear tests. We can easily run a Python file or module containing test functions by calling the pytest command-line interface.
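
For example, assuming the tests above live in a file named test_combine.py (a hypothetical name), running them from a terminal could look like this:

# From the directory containing the file:
#
#     pytest test_combine.py
#
# pytest collects every function whose name starts with "test_" in that file,
# runs it, and reports each failing assert with a detailed explanation.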

Short Demo#

After running these two simple tests and seeing that they passed, we remembered we also coded another piece of functionality into our combine_results() function - it raises a ValueError when it encounters negative numbers. We would like to verify that it indeed raises an exception (and the correct one) when it encounters these illegal inputs.

import pytest


def test_negative_argument_1():
    with pytest.raises(ValueError):
        result = combine_results(-1, 1)


def test_negative_argument_2():
    with pytest.raises(ValueError):
        result = combine_results(1, -1)
        
test_negative_argument_1()
test_negative_argument_2()

Again, pytest excels at brevity and readability - we use a context manager (a with statement) to verify that calling the function with these illegal inputs indeed raises the correct exception.

In the time it took us to write the original combine_results() implementation and to test it, we thought of another case: what if the inputs are floating point numbers? That’s also illegal, and it may tell us that there’s a problem with our computation, which resulted in a float instead of an integer. Let’s add this functionality and matching tests:

NEGATIVE_INPUT_MESSAGE = "Invalid input: ({value_1}, {value_2})! Negative result detected."
FLOAT_INPUT_MESSAGE = "Invalid input: ({value_1}, {value_2})! Floating point result detected."


def combine_results(value_1: int, value_2: int) -> int:
    """
    Returns the result of adding the two preceding analyses results together.
    
    Parameters
    ----------
    value_1, value_2 : int
        Positive integers
    
    Returns
    -------
    int
        Resulting positive integer
    
    Raises
    ------
    ValueError
        If either number is negative 
    TypeError
        If either number isn't an integer        
    """
    if value_1 < 0 or value_2 < 0:
        message = NEGATIVE_INPUT_MESSAGE.format(value_1=value_1, value_2=value_2)
        raise ValueError(message)
    if isinstance(value_1, float) or isinstance(value_2, float):
        message = FLOAT_INPUT_MESSAGE.format(value_1=value_1, value_2=value_2)
        raise TypeError(message)        
    return value_1 + value_2


def test_float_argument():
    with pytest.raises(TypeError):
        combine_results(1.0, 1)
    with pytest.raises(TypeError):
        combine_results(1, 1.0)

test_float_argument()

Great, we have floating points covered. When the test cases are really simple and clearly connected, it’s OK to include a couple of assert expressions in the same test. Otherwise it’s better to have only one assert per test, especially for these simple tests, which we also call unit tests.

We’re almost all set, but again, I thought of another edge case. We’re testing floating point numbers, but what if our input isn’t a float and also not an integer? The current tests only check for positive integers and floats, but the input could be a string, or None, or many other things. We need a more general type validation that precedes the more specific positive-integer validation:

BAD_TYPE_MESSAGE = "Invalid input: ({value_1}, {value_2})! Only integers are accepted."
NEGATIVE_INPUT_MESSAGE = "Invalid input: ({value_1}, {value_2})! Negative result detected."


def combine_results(value_1: int, value_2: int) -> int:
    """
    Returns the result of adding the two preceding analyses results together.
    
    Parameters
    ----------
    value_1, value_2 : int
        Positive integers
    
    Returns
    -------
    int
        Resulting positive integer
    
    Raises
    ------
    ValueError
        If either number is negative
    TypeError
        If either number isn't an integer       
    """
    integer_input = isinstance(value_1, int) and isinstance(value_2, int)
    if not integer_input:
        message = BAD_TYPE_MESSAGE.format(value_1=value_1, value_2=value_2)
        raise TypeError(message)
    elif value_1 < 0 or value_2 < 0:
        message = NEGATIVE_INPUT_MESSAGE.format(value_1=value_1, value_2=value_2)
        raise ValueError(message)
    return value_1 + value_2

Input Type Testing Exercise 1

Write three functions to test that a TypeError is raised for strings, lists and None values.

Solution
def test_str_input():
    with pytest.raises(TypeError):
        combine_results('s', 0)

        
def test_none_input():
    with pytest.raises(TypeError):
        combine_results(None, 0)

        
def test_list_input():
    with pytest.raises(TypeError):
        combine_results([], 1)

We should probably add a few more cases, but it’s getting tedious. Each test itself is short, but we’re repeating ourselves. pytest is a great library, and luckily it allows us to parametrize over the input, leaving the skeleton of the function intact. It uses a special syntax called decorators, which we haven’t met yet. Until we discuss it, which will happen shortly, let’s just use the feature without really understanding what’s going on:

TYPEERROR_INPUTS = [
    ('s', 0), 
    (3, None), 
    ([], 1),
    ([], "!"),
    (10, ()), 
    ({}, 2), 
    (20, {1}),
    (True, 2)
]

@pytest.mark.parametrize("value_1, value_2", TYPEERROR_INPUTS)
def test_invalid_input_raises_typeerror(value_1, value_2):
    with pytest.raises(TypeError):
        combine_results(value_1, value_2)

The decorator function, starting with @, receives two inputs:

  1. The names of the function arguments you wish to parametrize, as a single comma-separated string (as in the example above).

  2. The arguments that should be passed to the function, as a list of tuples, where each tuple has one entry per parametrized argument.

Then pytest runs this function once for each given input and asserts that the condition holds. If one of the inputs fails, pytest won’t stop running the remaining ones, but will point us at the specific input that didn’t pass the test.

We’re almost there; however, there’s one failing test - the one with the boolean value. Apparently it’s not raising a TypeError. This happens because bool is a subclass of int in Python, so the isinstance check passes. We have to handle it “manually”.
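
A quick check in the interpreter confirms this quirk:

# bool is a subclass of int, so the isinstance() check lets booleans through:
print(issubclass(bool, int))  # True
print(isinstance(True, int))  # True
print(True + 2)               # 3 - True behaves like the integer 1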

Input Type Testing Exercise 2

  • Fix combine_results() to raise the expected TypeError for boolean input values.

Solution
BAD_TYPE_MESSAGE = "Invalid input: ({value_1}, {value_2})! Only integers are accepted."
NEGATIVE_INPUT_MESSAGE = "Invalid input: ({value_1}, {value_2})! Negative result detected."


def combine_results(value_1: int, value_2: int) -> int:
    """
    Returns the result of adding the two preceding analyses results together.
    
    Parameters
    ----------
    value_1, value_2 : int
        Positive integers
    
    Returns
    -------
    int
        Resulting positive integer
    
    Raises
    ------
    ValueError
        If either number is negative
    TypeError
        If either number isn't an integer       
    """
    boolean_input = isinstance(value_1, bool) or isinstance(value_2, bool)
    integer_input = isinstance(value_1, int) and isinstance(value_2, int)
    if boolean_input or not integer_input:
        message = BAD_TYPE_MESSAGE.format(value_1=value_1, value_2=value_2)
        raise TypeError(message)
    elif value_1 < 0 or value_2 < 0:
        message = NEGATIVE_INPUT_MESSAGE.format(value_1=value_1, value_2=value_2)
        raise ValueError(message)
    return value_1 + value_2
  • Use the pytest.mark.parametrize() decorator to write concise tests for valid inputs, inputs that are expected to raise a ValueError, and inputs that are expected to raise a TypeError.

Solution
TYPEERROR_INPUTS = [('s', 0), (3, None), ([], 1), ([], "!"), (10, ()), ({}, 2),
                    (20, {1}), (True, 2)]
VALID_INPUTS = [(1, 2, 3), (100000, 200000, 100000 + 200000)]
VALUEERROR_INPUTS = [(-1, 1), (1, -2), (-2, -3)]


@pytest.mark.parametrize('value_1, value_2, expected', VALID_INPUTS)
def test_valid_inputs(value_1, value_2, expected):
    result = combine_results(value_1, value_2)
    assert result == expected


@pytest.mark.parametrize('value_1, value_2', VALUEERROR_INPUTS)
def test_negative_input_raises_valueerror(value_1, value_2):
    with pytest.raises(ValueError):
        combine_results(value_1, value_2)
    with pytest.raises(ValueError):        
        combine_results(value_2, value_1)


@pytest.mark.parametrize("value_1, value_2", TYPEERROR_INPUTS)
def test_invalid_input_raises_typeerror(value_1, value_2):
    with pytest.raises(TypeError):
        combine_results(value_1, value_2)
    with pytest.raises(TypeError):
        combine_results(value_2, value_1)

We could perhaps continue a bit more, but I think that the principle is clear.

Before moving on to a more difficult example, let’s examine a few more features of tests:

  1. The test functions have no docstrings, long descriptive names, and concise bodies. This makes them very readable, fast to execute, and leaves a very low chance that they themselves contain any bugs.

  2. Writing tests forces us to design an “API” for the function. API stands for application programming interface, and it’s usually a property of large libraries, but here we mean that after writing the tests, the function outputs something meaningful only if you use it in a very strict way, which is great. This makes sure that bugs won’t just spread throughout the execution, but will stop at some point.

  3. There are many more lines of testing code than of the actual function being tested. This is usually a sign of good coding practices.

  4. Some people write the tests before the actual function, a practice known as test-driven development (TDD). In this method you first think of edge cases and possible inputs, write a test that fails, and then correct the function until the test passes (see the short sketch after this list).

  5. Pytest has a ton of other features which we won’t cover. Here is one link for the basics, and here you can find a clear blog post giving more information about parameterization.

  6. Tests should be run as often as possible, to make sure that your changes aren’t affecting the existing behavior of your code.

  7. You should try very hard to “translate” bugs into tests. These might be the most important tests you’ll write.
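
Here is a minimal sketch of the TDD cycle mentioned in point 4, using a made-up helper function:

# Step 1: write the test first - running it at this point fails with a
# NameError, because clip_negatives() doesn't exist yet.
def test_clip_negatives():
    assert clip_negatives([-1, 2, -3]) == [0, 2, 0]


# Step 2: write just enough code to make the test pass.
def clip_negatives(values):
    return [max(value, 0) for value in values]


# Step 3: run the test again, watch it pass, and refactor if needed.
test_clip_negatives()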

Back to the normalization example#

def normalize_array(a):
    """Normalizes array"""
    minimum = min(a)
    no_min = a - minimum
    return no_min / max(no_min)

Above, we have the previous normalization function that we wrote. It’s less obvious how to test it due to a couple of reasons:

  1. It’s very hard to calculate the correct result for any arbitrary input, at least without effectively rewriting the tested function, since the mathematical operation here isn’t as trivial as in the previous combine_results example.

  2. It’s not as easy to create the inputs, and what exactly are the allowed input shapes and types?

As often repeated during this course, we shouldn’t really write it ourselves, since smarter people already wrote it in great libraries such as scikit-learn. Let’s just look at their final implementation and tests. Note that their implementation can work with arbitrary normalization ranges, not only (0, 1), but it’s the same idea. Here’s the class and here are its tests. It’s clear that a lot of effort was put into writing and testing this functionality, so our humble attempt will not be as complete as theirs, but the exercise is useful nonetheless.

To start off our attempt we’ll re-write the function itself using 100% numpy-based code:

def normalize_array(arr: np.ndarray) -> np.ndarray:
    """
    Normalizes an array to the range [0, 1].
    
    The maximal element will be 1 and the minimal 0, while the 
    scale remains the same.
    
    Parameters
    ----------
    arr : np.ndarray
        Input array to normalize
    
    Returns
    -------
    np.ndarray
        Normalized array
    """
    arr -= arr.min()
    return arr / arr.max()

The best way to know if it’s working is not to run it, but to run it inside a test environment. Let’s do that:

def test_norm_normalized():
    inp = np.array([0., 0.5, 1.])
    truth = inp.copy()
    result = normalize_array(inp)
    np.testing.assert_array_equal(truth, result)
    
test_norm_normalized()

A couple of things to note:

  1. We specifically wrote an array that should remain exactly the same, this is the easiest example.

  2. To test array equality we actually want to traverse the array and verify that each element is identical to its counterpart. We also want to make sure that the shapes and data types are identical. Numpy helps with that through its built-in np.testing module.

  3. Floating point arrays may differ by a tiny bit (on the order of 1e-6) and still be considered equal for many use cases. In the test above we’re asserting exact equality, but oftentimes (like below) we’ll use assert_array_almost_equal instead (see the example after this list).
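
Here is a small example of why “almost equal” matters for floating point arrays:

a = np.array([0.1 + 0.2])
b = np.array([0.3])

# Exact comparison fails, because 0.1 + 0.2 is actually 0.30000000000000004:
# np.testing.assert_array_equal(a, b)  # would raise an AssertionError

# Comparison up to 6 decimal places (the default) passes:
np.testing.assert_array_almost_equal(a, b)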

What else can we do? Let’s write a few more tests:

def test_norm_simple():
    inp = np.array([0., 1., 2.])
    result = normalize_array(inp)
    truth = np.array([0, 0.5, 1])
    np.testing.assert_array_almost_equal(truth, result)
    

def test_norm_simple_negative():
    inp = np.array([0., -1., -2.])
    result = normalize_array(inp)
    truth = np.array([1, 0.5, 0])
    np.testing.assert_array_almost_equal(truth, result)
    
    
test_norm_simple()
test_norm_simple_negative()

Thinking about the possible case where the input is negative led us to figure out the interface, or API, of our function. Meaning that we now know how it behaves with (basic) negative inputs, and we also “documented” this behavior by writing a test for it.

In the combine_results() example we could craft pretty complex inputs and still know the expected result, but here it’s a bit harder to compute the “truth” by ourselves. We also don’t have any exceptions we wish to raise. So how can we test a function too complex for us to know the “true” result?

What we’ll do as a first step is assert some properties of the resulting array. For example, the resulting array should be of floating point type (which float?), since our function has division in it. We also know that the maximal value has to be 1, and the minimal 0. We’d also like to conserve the shape of the input array. That’s already not bad at all!

Naive MinMaxScaler Tests Exercise

Create three tests to test for the following expected properties of the returned array:

  • Minimal and maximal values are 0 and 1.

  • Returned data type matches the input array’s data type.

  • Returned shape matches the input array’s shape.

Solution

SHAPES = [(1,), (10,), (1, 2), (5, 1), (20, 20, 30)]

def test_norm_min_max():
    rand_arr = np.random.randint(0, 100, (100, 20))
    result = normalize_array(rand_arr)
    assert result.max() == 1.0
    assert result.min() == 0.0

@pytest.mark.parametrize('dtype', [np.float32, np.float64])
def test_norm_dtype(dtype):
    a = np.array([1., 2., 3.], dtype=dtype)
    result = normalize_array(a)
    assert result.dtype == dtype

@pytest.mark.parametrize('shape', SHAPES)
def test_norm_shapes(shape):
    rand_arr = np.random.randint(10, 20, shape)
    result = normalize_array(rand_arr)
    assert result.shape == shape

Now that we have the basic properties of the returned array covered, we should think of edge cases, i.e. inputs that could mean trouble. Three types of inputs come to mind: Negative numbers (which we already dealt with, maybe), zero, and NaNs.

Let’s write a test with 0 in the input and see what happens:

def test_norm_zero_is_max():
    arr = np.array([0, 0])
    result = normalize_array(arr)
    np.testing.assert_array_almost_equal(result, np.array([0, 0]))
    
test_norm_zero_is_max()
/tmp/ipykernel_2112/2147493839.py:19: RuntimeWarning: invalid value encountered in divide
  return arr / arr.max()
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
Cell In[15], line 6
      3     result = normalize_array(arr)
      4     np.testing.assert_array_almost_equal(result, np.array([0, 0]))
----> 6 test_norm_zero_is_max()

Cell In[15], line 4, in test_norm_zero_is_max()
      2 arr = np.array([0, 0])
      3 result = normalize_array(arr)
----> 4 np.testing.assert_array_almost_equal(result, np.array([0, 0]))

File /opt/hostedtoolcache/Python/3.10.14/x64/lib/python3.10/contextlib.py:79, in ContextDecorator.__call__.<locals>.inner(*args, **kwds)
     76 @wraps(func)
     77 def inner(*args, **kwds):
     78     with self._recreate_cm():
---> 79         return func(*args, **kwds)

File /opt/hostedtoolcache/Python/3.10.14/x64/lib/python3.10/site-packages/numpy/_utils/__init__.py:85, in _rename_parameter.<locals>.decorator.<locals>.wrapper(*args, **kwargs)
     83             raise TypeError(msg)
     84         kwargs[new_name] = kwargs.pop(old_name)
---> 85 return fun(*args, **kwargs)

    [... skipping hidden 1 frame]

File /opt/hostedtoolcache/Python/3.10.14/x64/lib/python3.10/contextlib.py:79, in ContextDecorator.__call__.<locals>.inner(*args, **kwds)
     76 @wraps(func)
     77 def inner(*args, **kwds):
     78     with self._recreate_cm():
---> 79         return func(*args, **kwds)

    [... skipping hidden 1 frame]

File /opt/hostedtoolcache/Python/3.10.14/x64/lib/python3.10/site-packages/numpy/testing/_private/utils.py:745, in assert_array_compare.<locals>.func_assert_same_pos(x, y, func, hasval)
    738 if np.bool(x_id == y_id).all() != True:
    739     msg = build_err_msg(
    740         [x, y],
    741         err_msg + '\n%s location mismatch:'
    742         % (hasval), verbose=verbose, header=header,
    743         names=names,
    744         precision=precision)
--> 745     raise AssertionError(msg)
    746 # If there is a scalar, then here we know the array has the same
    747 # flag as it everywhere, so we should return the scalar flag.
    748 if isinstance(x_id, bool) or x_id.ndim == 0:

AssertionError: 
Arrays are not almost equal to 6 decimals

nan location mismatch:
 ACTUAL: array([nan, nan])
 DESIRED: array([0, 0])

We expected the function to return the original array, since you can’t really normalize an array made of zeros, but this clearly didn’t happen. This is the first real bug in our function, and it also exposes a weakness in its API - how should we deal with these types of inputs? Division by 0 is a big no-no, and it’s very possible that without tests we wouldn’t have thought of such edge cases.

How can we correct this behavior? Currently, if the minimal value is 0 we’ll have no issues, and if the maximal value is 0 we also won’t have any issues, since we subtract the minimal value from the entire array, which changes the maximal value. The problem arises only when all values are identical, making the maximum 0 after the subtraction. So let’s add a check that sees whether we’re about to divide by zero.

def normalize_array(arr: np.ndarray) -> np.ndarray:
    """Normalizes an array to the range [0, 1].
    
    The maximal element will be 1 and the minimal 0, while the 
    scale remains the same.
    
    An array containing only identical values will be returned as a zeroed array.
    
    Parameters
    ----------
    arr : np.ndarray
        Input array to normalize
    
    Returns
    -------
    np.ndarray
        Normalized array
    """
    arr -= arr.min()
    max_ = arr.max()
    return arr if np.isclose(max_, 0.) else arr / max_

We used a ternary expression to concisely express the new logic. As a reminder, this expression returns arr if np.isclose(max_, 0.) is True, and arr / max_ otherwise. The np.isclose function compares floating point numbers safely, within a small tolerance, instead of testing exact equality.
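
A tiny demonstration of why a tolerance-based comparison is needed:

print(0.1 + 0.2 == 0.3)            # False - classic floating point surprise
print(np.isclose(0.1 + 0.2, 0.3))  # True - equal within a small tolerance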

How does the test run now?

def test_norm_zero_is_max():
    arr = np.array([0, 0])
    result = normalize_array(arr)
    np.testing.assert_array_almost_equal(result, arr)


def test_norm_identical_is_zeroed():
    arr = np.array([10, 10, 10.], dtype=np.float32)
    result = normalize_array(arr)
    expected = np.array([0., 0., 0.], dtype=np.float32)
    np.testing.assert_array_almost_equal(expected, result)

test_norm_zero_is_max()
test_norm_identical_is_zeroed()

Great! Another potential bug squashed.

One more special input type remains - NaNs. What do we expect to happen? There are three ways to deal with NaNs, as we know:

  1. Using NaN-aware functions instead of the standard ones, and then NaNs can live happily with other values.

  2. Replace them with a sentinel value (0, some large number, etc.).

  3. Remove them altogether.

Removing them (option 3) is usually not a good choice, as it changes the shape of the array. Replacing them is an option, but it changes the statistics of the array, so it might not work for all cases. We can use NaN-aware functions, but they often show degraded performance, which is less than ideal. The “correct” answer really depends on the data. If we don’t intend to calculate global statistics on it, like its total mean or standard deviation - for example when working with EEG recordings - then it might be fine to simply replace them. Otherwise we’d have to use NaN-aware functions and leave the NaNs as-is.
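
A short sketch of what options 1 and 2 look like in numpy (the sentinel value is arbitrary):

arr = np.array([1., np.nan, 3.])

# Option 1: NaN-aware functions ignore NaNs, while the standard ones propagate them:
print(arr.max())       # nan
print(np.nanmax(arr))  # 3.0

# Option 2: replace NaNs with a sentinel value before computing:
print(np.nan_to_num(arr))                 # [1. 0. 3.]
print(np.where(np.isnan(arr), -1., arr))  # [ 1. -1.  3.]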

Due to the origin of our data we believe it’s fine to simply replace NaNs with 0 (np.nan_to_num also caps any infinities at large finite values), so we’ll add that line to our function and then write a couple more tests.

def normalize_array(arr: np.ndarray) -> np.ndarray:
    """Normalizes an array to the range [0, 1].
    
    The maximal element will be 1 and the minimal 0, while the 
    scale remains the same.
    
    Notes
    -----
    * An array containing only identical values will be returned as a zeroed array.
    * NaNs are automatically converted to 0's before computation.
    
    Parameters
    ----------
    arr : np.ndarray
        Input array to normalize
    
    Returns
    -------
    np.ndarray
        Normalized array
    """
    arr = np.nan_to_num(arr)
    arr -= arr.min()
    max_ = arr.max()
    return arr if np.isclose(max_, 0.) else arr / max_


def test_norm_with_nan():
    arr = np.array([np.nan, 1])
    result = normalize_array(arr)
    truth = np.array([0, 1.])
    np.testing.assert_array_almost_equal(result, truth) 

    
def test_norm_all_nans():
    arr = np.array([[np.nan, np.nan]])
    result = normalize_array(arr)
    truth = np.array([[0., 0.]])
    np.testing.assert_array_almost_equal(result, truth)


test_norm_with_nan()
test_norm_all_nans()

One of the benefits of testing is that it makes us feel very confident in our code. In this case I’m pretty sure that once all tests pass, the function is doing what I think it’s doing. This confidence is rare, and if you feel confident about untested code you wrote, then I hope this normalization example shows you the number of pitfalls you can fall into when writing “super-simple” scientific code.

If you think about it a bit more, I haven’t really tested the function for most inputs. I looked for edge cases and simple ones, but I didn’t really do anything with the typical inputs I should actually handle. This should come as no surprise - I can’t really normalize large arrays manually, so how could I get my “truth” array?

The way to get these “truth” values varies. In statistical software, for example, each new model added to a library is often verified against a few other implementations, to make sure that SPSS, R and JMP output the same results given the same inputs. Simpler cases, like the one we have here, may create their ground truths by using “snapshot testing”, which simply means taking a snapshot of the current output the code is giving, and making sure that new code we write doesn’t give different results from the old code.

The process is simple: Run your code for some random input, and save both input and result. Then write a test that reads the input, runs the function on it and tests that the new result is identical to the one saved to disk.

# Let's create mock data
shape1 = (100, 100, 3)
input1 = np.random.randint(20, 100, shape1)
output1 = normalize_array(input1)

shape2 = (10, 4, 1, 3)
input2 = np.random.randn(*shape2) * 12
output2 = normalize_array(input2)

np.savez('tests_demo/normalize_array/snapshot_05_2020.npz', input1=input1, output1=output1, input2=input2, output2=output2)


def test_snapshots():
    fname = 'tests_demo/normalize_array/snapshot_05_2020.npz'
    data = np.load(fname)
    num_of_snaps = len(data) // 2
    for snap in range(1, num_of_snaps + 1):
        inp = data[f"input{snap}"]
        old_output = data[f"output{snap}"]
        new_output = normalize_array(inp)
        np.testing.assert_array_almost_equal(old_output, new_output)

At this point we can say we’re done. 16 tests for this function seems about right. We can do better (spoiler alert! wait for a couple of lessons from now!), but I believe that the importance of tests came across.

Integration Testing#

Unit tests repeatedly test the functions or methods you write under different inputs, and they are the backbone of any reliable test suite. However, unit tests are not enough, since they don’t check the interfaces between the different functions and classes in your application.

Integration tests are larger, heavier tests that take at least two components, or units, of your application and make sure that they interact well with each other.

Obviously, if we start taking every two consecutive functions and writing an integration test for that pair, and then continue with every three consecutive functions, and so on - we’ll never finish writing the damn application. That’s why integration testing is reserved for crucial junctions of our application, between major classes for example.
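
As a toy example (the load_data() helper is hypothetical), an integration test for our normalization pipeline could check that loading and normalizing work together, not just separately:

def load_data(path):
    """Hypothetical first unit of the pipeline: reads numbers from a text file."""
    return np.loadtxt(path)


def test_load_and_normalize(tmp_path):
    # tmp_path is a built-in pytest fixture that provides a temporary directory
    fname = tmp_path / "data.txt"
    np.savetxt(fname, np.array([1., 2., 3.]))
    result = normalize_array(load_data(fname))
    np.testing.assert_array_almost_equal(result, np.array([0., 0.5, 1.]))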

Prime-Fibonacci Difference Exercise

Write a function that returns the difference between the Fibonacci series and the prime numbers series for the first n numbers in both series. The output should be an array of numbers, \(n\) items long, and a plot to accompany it. As an example, for \(n=3\), the Fibonacci sequence is [0, 1, 1] while the primes are [2, 3, 5], so I expect the returned array to be [-2, -2, -4]. Write the code in a test-driven development style.

Hint

  • The Fibonacci series is a series of numbers starting from (0, 1), with its next element being the sum of the previous two numbers: 0, 1, 1, 2, 3, 5, … .

  • Prime numbers start from 2 and are only divisible by themselves and 1 without a remainder.

Don’t try to implement things in a performant way with fancy algorithms. The focus here is the unit tests and test-driven development.

Solution
def generate_fibonacci(n: int):
    """
    Returns the first n elements of the Fibonacci series.
    
    Parameters
    ----------
    n : int
        Number of Fibonacci elements to return
    """
    vals = (0, 1)
    result = [0, 1]
    for _ in range(n):
        vals = (vals[1], vals[0] + vals[1])
        result.append(vals[1])
    return np.array(result[:n])

def generate_primes(n: int):
    """
    Returns the first n prime numbers.
    
    Parameters
    ----------
    n : int
        Number of prime numbers to return
    """
    num = 3
    result = [2]
    primes_found = 0
    while primes_found < (n - 1):
        for divisor in range(num-1, 1, -1):
            if num % divisor == 0:
                break
        else:
            primes_found += 1
            result.append(num)
        num += 1
    return np.array(result)

def generate_differences(n: int):
    """
    Returns the element-wise difference between the Fibonacci and prime series.

    Parameters
    ----------
    n : int
        Number of elements to compare in each series
    """
    # The tests below expect a TypeError for any invalid input, including
    # negative values, so all validation happens here.
    if not isinstance(n, int) or isinstance(n, bool) or n < 0:
        raise TypeError(f"Invalid input: {n}! Only non-negative integers are accepted.")
    if n == 0:
        return np.array([])
    fib_arr = generate_fibonacci(n)
    prime_arr = generate_primes(n)
    return fib_arr - prime_arr

Add tests for your new function.

Solution
class TestFib:
    fib = np.array([0, 1, 1, 2, 3, 5, 8, 13])
    primes = np.array([2, 3, 5, 7, 11, 13, 17, 19])

    @pytest.mark.parametrize('inp', [-1, 1., 'a', {},])
    def test_invalid_inp(self, inp):
        with pytest.raises(TypeError):
            generate_differences(inp)

    def test_fib_valid(self):
        result = generate_fibonacci(4)
        assert np.array_equal(result, self.fib[:4])

    def test_fib_single(self):
        result = generate_fibonacci(1)
        assert np.array_equal(result, np.array([0]))

    def test_primes_valid(self):
        result = generate_primes(6)
        assert np.array_equal(result, self.primes[:6])

    def test_primes_single(self):
        result = generate_primes(1)
        assert result == np.array([2])

    # Integration tests
    def test_valid(self):
        result = generate_differences(8)
        assert np.array_equal((self.fib - self.primes), result)

    def test_zero(self):
        assert len(generate_differences(0)) == 0

Class 8b: Linting & Formatting#

(Or how to keep your code up-to-standards)

Linting#

The word “lint” means fluff (fibers and dirt), hinting at the action being performed - “cleaning” and optimizing the code. In general, a linter is a tool designed to identify logical errors (execution issues, “bugs”) and stylistic errors, allowing you to find and correct parts of the code that are likely to cause errors or that aren’t written according to the conventions of the programming language being used.

For example, in Python, a linter can identify unused variables (which won’t cause a runtime error), unmatched parentheses, variables that are used but not defined (which will cause runtime errors), etc. The linter also highlights these errors in a way that hints at the type of error by emphasizing the parts of the code causing it (and a message describing the type of error will appear in a small window if you hover over the highlighted piece of code).
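
For instance, a linter such as Flake8 or Pylint would flag several problems in this short, intentionally bad snippet (the exact error codes and messages vary between tools):

import os  # flagged: 'os' imported but unused


def compute(x):
    unused = 42   # flagged: local variable assigned but never used
    return x + y  # flagged: undefined name 'y' - an actual bug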

There are many types of linters for Python, differing in the types of errors they identify and fix. For example, some linters identify both logical and stylistic errors (like Pylint), while others identify only one type of error (logical or stylistic, like PyFlakes or pycodestyle, respectively). Additionally, there are combo-linters (multi-linters?) that combine the functionalities of several different linters (like Flake8 and Pylama). From my experience, Pylint and Flake8 are the most popular.

Formatting#

As the name suggests, formatting organizes the code. Its main purpose is to arrange the code uniformly so that all code related to a project is written under the same stylistic conventions (spaces between operators and variables, number of spaces defining a “tab”, maximum number of characters per line before splitting it, etc.). Generally, it’s important to use a formatter to style the code because it allows us to maintain uniformity and readability, which helps when collaborating with fellow programmers on the same piece of code or project.

There are several tools available for formatting as well (formatters?), each offering a different approach to styling. For example, Black, one of the most popular formatters, does not arrange imported packages in a specific order, although PEP 8 (the official style guide for Python code) has several rules for ordering imports. Conversely, isort focuses solely on arranging imports in the code. In fact, there was an almost four-year-long discussion about integrating import sorting into Black’s functionality, but in the end it was decided to keep them separate.
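
As an illustration, here is a messy (but valid) snippet next to roughly what a formatter like Black would turn it into:

# Before formatting - inconsistent spacing and quoting:
x={'a':1,'b' :2}
def f ( a,b = 3 ):
    return a+ b


# After formatting - normalized spacing, double quotes, and blank lines
# between top-level definitions (approximately what Black produces):
x = {"a": 1, "b": 2}


def f(a, b=3):
    return a + b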

Additionally, there are other formatters like autopep8 and YAPF (short for Yet Another Python Formatter, kudos for the name), which are similar to Black but operate under different assumptions about the code and focus on different aspects of it. Finally, although strictly speaking the formatter’s task falls under the linter’s responsibilities, each tool excels in its own field and is sensitive to different types of stylistic issues in the code. So, rather than competing with each other, they complement each other and help us write code that works properly (linting), in a readable and consistent manner (linting and formatting).