Class 8a: Testing, Test-Driven Development#
Introduction#
We’ve mentioned how writing object-oriented code might be considered an industry standard (in terms of code quality), and yet in practice it is still a somewhat niche skill in the academic environment. Another such often neglected common practice is testing. In this lesson, we will learn the basic concepts underlying writing tests for your code in Python.
The Implications of Writing Untested Code#
This is a somewhat sensitive subject, and while beyond the scope of the course material, I wanted to share one relatively neutral example:
This article from 2018, which examined how a dynamic image of a circle affects the understanding of speech, employed many modern open-science practices, as described here: pre-registration, data availability, source code availability, and even peer replication. Unfortunately, the article as published in 2018 is completely wrong. The original conclusion was that these dynamic circles improved the understanding of what was said, while the truth is they worsened it. The fault lay in a very simple mistake in the source code, still visible thanks to the author's comment (line ~25 in the link above), which flipped the results to their opposite.
Looking at it after the correction, any experienced programmer would immediately recognize that this code would never be accepted into any code base in any company, i.e. make it to “production”. In many ways, the code violates numerous basic programming practices, e.g.:
Everything is one large script.
Variables are given uninformative and confusing names.
Too many comments, zero functions.
Lots of commented out lines of code.
In my opinion, it is important we learn to value our research and the responsibility assumed in publishing it no less than any company would with its products.
Public Trust in Research#
Another famous example from the COVID-19 outbreak is the code used for official epidemiological simulations in the UK. The code was made public and turned out to be a huge mess. It even prompted a huge discussion all across the internet (and in the repo itself) with people urging officials to not take any action that was suggested by results of this simulation.
The “truth” of the matter, if there is such a thing, is much more complex. The code, originally a 15k LOC .c file, revolves around a model that was created in 2005 and used in many important scientific publications. These publications were peer-reviewed, and the articles themselves are considered very helpful and important in their respective field. The code had very few tests, but the scientists who wrote it “tested” their code in a more intuitive manner - they gave it simple inputs for which they knew the expected answer. They also visualized many of the results the model generated to make sure nothing suspicious was happening.
The scientists released their code to the public with good intentions, advocating “open science”, but received a very unwelcoming response from the community, especially from software engineers. These types of reactions really don’t encourage more scientists and labs to release their code to the public space, which ultimately is the right thing to do. In this case, many people started helping with the refactoring process, including people like John Carmack. This process uncovered bugs and made the code more readable and hopefully a bit more trustworthy, but there’s still much more work to do.
So, how can we do better? There are basic, simple tools we can acquire as researchers that make a large difference. We've touched on how we should design our code and mentioned concepts such as parameterization and encapsulation. We've also emphasized the importance of readability and of including extensive documentation. Another one of these tools is testing, and if you learn to integrate it adequately into your coding process it will dramatically elevate the quality of your work.
Testing#
Tests are (usually) short pieces of code designed to assert that a small portion of your program does what you intend it to do.
For example, if I have a class designed to perform calculations on some dataframe that was created somewhere else, perhaps by adding some of its columns together, averaging them and displaying the result, then I wish for my code to be correct and deterministic, in the sense that a single, defined input will always give the same correct output.
This isn’t trivial even for the most basic functions in Python. Let’s look at a simple normalization function:
import numpy as np
def normalize_array(a):
"""Normalizes array"""
minimum = min(a)
no_min = a - minimum
return no_min / max(no_min)
We’ll even check to see that it works:
normalize_array(np.array([1, 2, 3]))
array([0. , 0.5, 1. ])
normalize_array(np.array([0, 100, 250]))
array([0. , 0.4, 1. ])
Amazing! We can move on to our next function.
…
…
…Or can we?
What issues does this function have? We’ll start from the more basic ones, and slowly dive deeper.
The docstring is lacking, to say the least.
Variable names are awful.
It uses the built-in min and max functions instead of numpy's min and max. This reduces performance, and the built-in functions also behave differently for certain inputs (see the short example after this list).
Many different input values will raise warnings or exceptions, or worse - won't raise any error and will simply return the wrong result.
It doesn’t take data types into consideration.
This functionality exists in a few pretty popular libraries - why not use them instead of re-inventing the wheel?
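To make the min/max point concrete, here is a small sketch of how the built-in functions and numpy's reductions can disagree:
import numpy as np

values = np.array([1.0, np.nan, 2.0])

# NumPy's reduction propagates the NaN:
print(np.min(values))   # nan

# The built-in min() relies on pairwise `<` comparisons, and every comparison
# with NaN is False, so here it silently returns 1.0 and the NaN is ignored:
print(min(values))      # 1.0

# With a 2-D array, the built-in min() iterates over rows and tries to compare
# whole arrays, which raises an error, while np.min() reduces over all elements:
matrix = np.array([[3, 4], [1, 2]])
print(np.min(matrix))   # 1
# min(matrix)           # ValueError: the truth value of an array ... is ambiguous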
Now, you might have written this function differently, perhaps you’ve remembered to use numpy’s version for min
and max
, and perhaps you weren’t so lazy and managed to write a decent docstring, but I’m pretty sure that most of you wouldn’t have taken all of the points above into consideration when writing this function.
The importance of this function is also pretty clear: it probably sits somewhere at the heart of our data processing pipeline, and if there's a chance it will sometimes return the wrong result, then our entire computation is in jeopardy. The only way I know of to solve this issue is by writing tests, so let's do that.
You’ve already seen unit tests in your homework assignments. You were asked to make sure that your solution passes all tests before submitting. This exemplifies a way through which we can be more certain that our program does what we thought it was doing.
Tests are important to us for two reasons. The first is that even simple programs are more complicated than we think. The normalization function is a good example, but you can imagine that as soon as we add interfaces between classes, methods and functions, things might get a bit messy. For example, in the aviation industry, for every line of application code you may find about 8 lines dedicated to testing it.
Moreover, when we deal with user input (e.g. data files, parameters for some script, etc.) we should expect the unexpected, even if the main user is us. Our future selves, a few months from now, will probably not remember the type of every parameter that has to be entered.
The second reason is Python's dynamic nature, or duck typing. If you want a function to only accept inputs of a single type, you must be the one writing these assertions, either outside or inside the function. For example, a function that adds two numbers needs an isinstance(value, (int, float)) check somewhere near its top to avoid these mistakes. Statically-typed languages, like C, define a type for each variable; a function adding two integers simply cannot accept a non-integer input.
Python’s dynamic nature is a blessing on many occasions, but it can sometimes be a real pain. This nature is the second important reason to write tests to our code. Many cases that in other programming languages would’ve resulted in a simple TypeError
, can cause major bugs in Python due to wrong input types.
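As a small illustration (the add_checked name below is made up for this example), duck typing can silently produce the wrong kind of result, and an explicit check is what stands in for a static type system:
def add(a, b):
    return a + b

add(2, 3)      # 5, as intended
add("2", "3")  # "23" - no error is raised, we just get the wrong kind of result

def add_checked(a, b):
    # Explicit validation replaces what a static type system would enforce
    for value in (a, b):
        if not isinstance(value, (int, float)):
            raise TypeError(f"Expected a number, got {type(value).__name__}")
    return a + b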
The Solution#
As I'm sure you've gathered by now, we want you to write tests. Let's examine the iterative process that produces code and tests. We'll go back to the min-max scaling example later; for now, let's start with a simpler function. Assume we have two positive integers we wish to add together, perhaps the results of some complicated pipeline, which means we would also like to validate that this condition is in fact satisfied.
What would our code look like? What about something like:
NEGATIVE_INPUT_MESSAGE = "Invalid input: ({value_1}, {value_2})! Negative result detected."
def combine_results(value_1: int, value_2: int) -> int:
"""
Returns the result of adding the two preceding analyses results together.
Parameters
----------
value_1, value_2 : int
Positive integers
Returns
-------
int
Resulting positive integer
Raises
------
ValueError
If either number is negative
"""
if value_1 < 0 or value_2 < 0:
message = NEGATIVE_INPUT_MESSAGE.format(value_1=value_1, value_2=value_2)
raise ValueError(message)
return value_1 + value_2
That's a good start! We defined the input and output types and even added a check that verifies the numbers are indeed positive. The next step, now that I have a working skeleton of my implementation, is to test it. To do so I'll start writing a few tests:
def test_basic_addition():
result = combine_results(2, 4)
expected = 6
assert result == expected
def test_large_number_additions():
value_1 = 1_000_000
value_2 = 2_000_001
expected = value_1 + value_2
result = combine_results(value_1, value_2)
assert result == expected
test_basic_addition()
test_large_number_additions()
Note
Test function names always start with test_ - this is how pytest discovers them.
Above you see an example of using assertions in tests, which works seamlessly with pytest
, the library we’ve been using to run tests in the homework assignments. Pytest looks for these asserts and executes them in a smart manner, which helps us in writing very concise and clear tests. We can easily run a Python file or module containing test functions by calling the pytest
command-line interface.
Short Demo#
After running these two simple tests and seeing that they passed, we remembered that we also added another piece of functionality to our combine_results() function - it raises a ValueError when it encounters negative numbers. We would like to verify that it indeed raises an exception (and the correct one) when it encounters these illegal inputs.
import pytest
def test_negative_argument_1():
with pytest.raises(ValueError):
result = combine_results(-1, 1)
def test_negative_argument_2():
with pytest.raises(ValueError):
result = combine_results(1, -1)
test_negative_argument_1()
test_negative_argument_2()
Again, pytest excels at brevity and readability - we use a context manager (a with statement) to verify that calling the function with these illegal inputs indeed raises the correct exception.
During the time it took us to write the original combine_results() implementation and to test it, we thought of another case: what if the inputs are floating point numbers? That's also illegal, and it may tell us that there's a problem with our computation which resulted in a float instead of an integer. Let's add this functionality and matching tests:
NEGATIVE_INPUT_MESSAGE = "Invalid input: ({value_1}, {value_2})! Negative result detected."
FLOAT_INPUT_MESSAGE = "Invalid input: ({value_1}, {value_2})! Floating point result detected."
def combine_results(value_1: int, value_2: int) -> int:
"""
Returns the result of adding the two preceding analyses results together.
Parameters
----------
value_1, value_2 : int
Positive integers
Returns
-------
int
Resulting positive integer
Raises
------
ValueError
If either number is negative
TypeError
If either number isn't an integer
"""
if value_1 < 0 or value_2 < 0:
message = NEGATIVE_INPUT_MESSAGE.format(value_1=value_1, value_2=value_2)
raise ValueError(message)
if isinstance(value_1, float) or isinstance(value_2, float):
message = FLOAT_INPUT_MESSAGE.format(value_1=value_1, value_2=value_2)
raise TypeError(message)
return value_1 + value_2
def test_float_argument():
with pytest.raises(TypeError):
combine_results(1.0, 1)
with pytest.raises(TypeError):
combine_results(1, 1.0)
test_float_argument()
Great, we have floating points covered. When the test cases are really simple and clearly related, it's OK to include a couple of assert expressions in the same test. Otherwise it's better to have only one assert per test, especially for these simple tests, which we also call unit tests.
We’re almost all set, but again, I thought of another edge case. We’re testing floating points, but what if our input isn’t a floating point number and is also not an integer? The current tests only check for positive integers and floats, but the input could be strings, or None, or many other things. We need a more general validation that precedes the more specific positive integer validation:
BAD_TYPE_MESSAGE = "Invalid input: ({value_1}, {value_2})! Only integers are accepted."
NEGATIVE_INPUT_MESSAGE = "Invalid input: ({value_1}, {value_2})! Negative result detected."
def combine_results(value_1: int, value_2: int) -> int:
"""
Returns the result of adding the two preceding analyses results together.
Parameters
----------
value_1, value_2 : int
Positive integers
Returns
-------
int
Resulting positive integer
Raises
------
ValueError
If either number is negative
TypeError
If either number isn't an integer
"""
integer_input = isinstance(value_1, int) and isinstance(value_2, int)
if not integer_input:
message = BAD_TYPE_MESSAGE.format(value_1=value_1, value_2=value_2)
raise TypeError(message)
elif value_1 < 0 or value_2 < 0:
message = NEGATIVE_INPUT_MESSAGE.format(value_1=value_1, value_2=value_2)
raise ValueError(message)
return value_1 + value_2
Input Type Testing Exercise 1
Write three functions to test that a TypeError
is raised for strings, lists and None
values.
Solution
def test_str_input():
with pytest.raises(TypeError):
combine_results('s', 0)
def test_none_input():
with pytest.raises(TypeError):
combine_results(None, 0)
def test_list_input():
with pytest.raises(TypeError):
combine_results([], 1)
We should probably add a few more cases, but it’s getting tedious. The test itself is short, but we’re repeating ourselves. pytest
is a great library, and luckily it allows us to parameterize over the input, leaving the skeleton of the function intact. It uses a special syntax called decorators which we haven’t met yet. Until we discuss it, which will happen shortly, let’s just use the feature without really understanding what’s going on:
TYPEERROR_INPUTS = [
('s', 0),
(3, None),
([], 1),
([], "!"),
(10, ()),
({}, 2),
(20, {1}),
(True, 2)
]
@pytest.mark.parametrize("value_1, value_2", TYPEERROR_INPUTS)
def test_invalid_input_raises_typeerror(value_1, value_2):
with pytest.raises(TypeError):
combine_results(value_1, value_2)
The decorator, marked by the @ symbol, receives two inputs:
The names of all function arguments you wish to parametrize.
The values that should be passed to the function, as a list of tuples, where each tuple contains one value for each parametrized argument.
pytest then runs the test function once for each set of inputs and asserts that the given condition holds. If one of them fails, it won't stop running the remaining inputs, but will point us to the specific input that didn't pass the test.
We're almost there; however, there's one failing test - the one with the boolean values. Apparently it's not raising a TypeError. This happens because bool is a subclass of int in Python, so the isinstance check passes, and we have to handle it "manually".
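A quick way to see this quirk for yourselves:
isinstance(True, int)  # True - bool is a subclass of int
issubclass(bool, int)  # True
True + 1               # 2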
Input Type Testing Exercise 2
Fix
combine_results()
to raise the expectedTypeError
for boolean input values.
Solution
BAD_TYPE_MESSAGE = "Invalid input: ({value_1}, {value_2})! Only integers are accepted."
NEGATIVE_INPUT_MESSAGE = "Invalid input: ({value_1}, {value_2})! Negative result detected."
def combine_results(value_1: int, value_2: int) -> int:
"""
Returns the result of adding the two preceding analyses results together.
Parameters
----------
value_1, value_2 : int
Positive integers
Returns
-------
int
Resulting positive integer
Raises
------
ValueError
If either number is negative
TypeError
If either number isn't an integer
"""
    boolean_input = isinstance(value_1, bool) or isinstance(value_2, bool)
integer_input = isinstance(value_1, int) and isinstance(value_2, int)
if boolean_input or not integer_input:
message = BAD_TYPE_MESSAGE.format(value_1=value_1, value_2=value_2)
raise TypeError(message)
elif value_1 < 0 or value_2 < 0:
message = NEGATIVE_INPUT_MESSAGE.format(value_1=value_1, value_2=value_2)
raise ValueError(message)
return value_1 + value_2
Use the
pytest.mark.parametrize()
decorator to write concise tests for valid inputs, inputs that are expected to raise aValueError
, and inputs that are expected to raise aTypeError
.
Solution
TYPEERROR_INPUTS = [('s', 0), (3, None), ([], 1), ([], "!"), (10, ()), ({}, 2),
(20, {1}), (True, 2)]
VALID_INPUTS = [(1, 2, 3), (100000, 200000, 100000 + 200000)]
VALUEERROR_INPUTS = [(-1, 1), (1, -2), (-2, -3)]
@pytest.mark.parametrize('value_1, value_2, expected', VALID_INPUTS)
def test_valid_inputs(value_1, value_2, expected):
result = combine_results(value_1, value_2)
assert result == expected
@pytest.mark.parametrize('value_1, value_2', VALUEERROR_INPUTS)
def test_negative_input_raises_valueerror(value_1, value_2):
with pytest.raises(ValueError):
combine_results(value_1, value_2)
with pytest.raises(ValueError):
combine_results(value_2, value_1)
@pytest.mark.parametrize("value_1, value_2", TYPEERROR_INPUTS)
def test_invalid_input_raises_typeerror(value_1, value_2):
with pytest.raises(TypeError):
combine_results(value_1, value_2)
with pytest.raises(TypeError):
combine_results(value_2, value_1)
We could perhaps continue a bit more, but I think that the principle is clear.
Before moving on to a more difficult example, let’s examine a few more features of tests:
The test functions have no docstring, very long (descriptive) names, and a concise body. This keeps them very readable and fast to execute, and leaves a very low chance that they themselves will contain any bugs.
Writing tests forces us to design an “API” for the function. API stands for application programming interface, and it's usually a property of large libraries, but here we mean that after writing the tests, the function can only output something meaningful if you use it in a very strict way, which is great. This makes sure that bugs won't just spread throughout the execution, but will stop at some point.
There are many more lines of testing code than of the actual function being tested. This is usually a sign of good coding practices.
Some people write the tests before the actual function, a practice known as test-driven development (TDD). In this method you first think of edge cases and possible inputs, write a test that fails, and then correct the function until the test passes (see the short sketch after this list).
Pytest has a ton of other features which we won’t cover. Here is one link for the basics, and here you can find a clear blog post giving more information about parameterization.
Tests should be run as often as possible, to make sure that your changes aren’t affecting the existing behavior of your code.
You should try very hard to “translate” bugs into tests. These might be the most important tests you’ll write.
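Here is a minimal sketch of the TDD "red-green" cycle, using a hypothetical is_even() helper that is not part of our codebase:
# 1. Write the test first and run it - it fails (red), since is_even() doesn't exist yet.
def test_is_even():
    assert is_even(4) is True
    assert is_even(3) is False

# 2. Write the simplest implementation that makes the test pass (green):
def is_even(number: int) -> bool:
    return number % 2 == 0

# 3. Refactor if needed, re-running the test to make sure it stays green.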
Back to the normalization example#
def normalize_array(a):
"""Normalizes array"""
minimum = min(a)
no_min = a - minimum
return no_min / max(no_min)
Above, we have the previous normalization function that we wrote. It's less obvious how to test it, for a couple of reasons:
It's very hard to calculate the correct result for any arbitrary input, at least without effectively rewriting the tested function, since the mathematical operation here isn't as trivial as in the previous combine_results example.
It's not as easy to create the inputs, and what exactly are the allowed input shapes and types?
As often repeated during this course, we shouldn’t really write it ourselves since smarter people already wrote it in great libraries such as scikit-learn
. Let’s just look at their final implementation and tests. Note that their function can work with arbitrary normalization ranges, not only (0, 1), but it’s the same idea. Here’s the class and here are its tests. It’s clear that a lot of effort was put into writing and testing this function, so our humble attempt will not be as complete as theirs, but the exercise is useful nonetheless.
To start off our attempt we’ll re-write the function itself using 100% numpy-based code:
def normalize_array(arr: np.ndarray) -> np.ndarray:
"""
Normalizes an array to the range [0, 1].
The maximal element will be 1 and the minimal 0, while the
scale remains the same.
    Parameters
----------
arr : np.ndarray
Input array to normalize
Returns
-------
np.ndarray
Normalized array
"""
arr -= arr.min()
return arr / arr.max()
The best way to know if it’s working is not to run it, but to run it inside a test environment. Let’s do that:
def test_norm_normalized():
inp = np.array([0., 0.5, 1.])
result = normalize_array(inp)
truth = inp.copy()
np.testing.assert_array_equal(truth, result)
test_norm_normalized()
A couple of things to note:
We specifically wrote an array that should remain exactly the same, this is the easiest example.
To test array equality we actually want to traverse the array and test that each element is identical to its counterpart. We also want to make sure that the shapes and data types are identical. Numpy helps with that through its built-in testing module.
Floating point arrays may differ by a tiny bit (on the order of 1e-6) and still be considered equal for many use cases. In the test above we're asserting exact equality, but often (like below) we'll use assert_array_almost_equal instead (see the short example after this list).
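For example, assuming the default tolerance of 6 decimals:
import numpy as np

a = np.array([0.0, 0.5, 1.0])
b = a + 1e-9  # differs by a tiny floating point amount

np.testing.assert_array_almost_equal(a, b)  # passes - equal up to 6 decimals
# np.testing.assert_array_equal(a, b)       # would fail - exact equality is required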
What else can we do? Let’s write a few more tests:
def test_norm_simple():
inp = np.array([0., 1., 2.])
result = normalize_array(inp)
truth = np.array([0, 0.5, 1])
np.testing.assert_array_almost_equal(truth, result)
def test_norm_simple_negative():
inp = np.array([0., -1., -2.])
result = normalize_array(inp)
truth = np.array([1, 0.5, 0])
np.testing.assert_array_almost_equal(truth, result)
test_norm_simple()
test_norm_simple_negative()
Thinking about the possible case where the input is negative led us to figure out the interface, or API, of our function. We now know how it behaves with (basic) negative inputs, and we've also "documented" that behavior by writing a test for it.
In the combine_results() example we could write pretty complex inputs and still know what the result should be, but here it's a bit harder to "compute" the truth by ourselves. We also don't have any exceptions we wish to raise. So how can we test a function too complex for us to know the "true" result?
What we'll do as a first step is to assert some properties of the resulting array. For example, the resulting array should be of floating point type (which float?), since our function has division in it. We also know that the maximal value has to be 1, and the minimal 0. We'd also like to conserve the shape of the input array. That's already not bad at all!
Naive MinMaxScaler Tests Exercise
Create three tests to test for the following expected properties of the returned array:
Minimal and maximal values are 0 and 1.
Returned data type matches the input array’s data type.
Returned shape matches the input array's shape.
Solution
SHAPES = [(1,), (10,), (1, 2), (5, 1), (20, 20, 30)]
def test_norm_min_max():
rand_arr = np.random.randint(0, 100, (100, 20))
result = normalize_array(rand_arr)
assert result.max() == 1.0
assert result.min() == 0.0
@pytest.mark.parametrize('dtype', [np.float32, np.float64])
def test_norm_dtype(dtype):
    a = np.array([1., 2, 3], dtype=dtype)
    result = normalize_array(a)
    assert result.dtype == dtype
@pytest.mark.parametrize('shape', SHAPES)
def test_norm_shapes(shape):
rand_arr = np.random.randint(10, 20, shape)
result = normalize_array(rand_arr)
assert result.shape == shape
Now that we have the basic properties of the returned array covered, we should think of edge cases, i.e. inputs that could mean trouble. Three types of inputs come to mind: Negative numbers (which we already dealt with, maybe), zero, and NaNs.
Let’s write a test with 0 in the input and see what happens:
def test_norm_zero_is_max():
arr = np.array([0, 0])
result = normalize_array(arr)
np.testing.assert_array_almost_equal(result, np.array([0, 0]))
test_norm_zero_is_max()
/tmp/ipykernel_2112/2147493839.py:19: RuntimeWarning: invalid value encountered in divide
return arr / arr.max()
---------------------------------------------------------------------------
AssertionError Traceback (most recent call last)
Cell In[15], line 6
3 result = normalize_array(arr)
4 np.testing.assert_array_almost_equal(result, np.array([0, 0]))
----> 6 test_norm_zero_is_max()
Cell In[15], line 4, in test_norm_zero_is_max()
2 arr = np.array([0, 0])
3 result = normalize_array(arr)
----> 4 np.testing.assert_array_almost_equal(result, np.array([0, 0]))
File /opt/hostedtoolcache/Python/3.10.14/x64/lib/python3.10/contextlib.py:79, in ContextDecorator.__call__.<locals>.inner(*args, **kwds)
76 @wraps(func)
77 def inner(*args, **kwds):
78 with self._recreate_cm():
---> 79 return func(*args, **kwds)
File /opt/hostedtoolcache/Python/3.10.14/x64/lib/python3.10/site-packages/numpy/_utils/__init__.py:85, in _rename_parameter.<locals>.decorator.<locals>.wrapper(*args, **kwargs)
83 raise TypeError(msg)
84 kwargs[new_name] = kwargs.pop(old_name)
---> 85 return fun(*args, **kwargs)
[... skipping hidden 1 frame]
File /opt/hostedtoolcache/Python/3.10.14/x64/lib/python3.10/contextlib.py:79, in ContextDecorator.__call__.<locals>.inner(*args, **kwds)
76 @wraps(func)
77 def inner(*args, **kwds):
78 with self._recreate_cm():
---> 79 return func(*args, **kwds)
[... skipping hidden 1 frame]
File /opt/hostedtoolcache/Python/3.10.14/x64/lib/python3.10/site-packages/numpy/testing/_private/utils.py:745, in assert_array_compare.<locals>.func_assert_same_pos(x, y, func, hasval)
738 if np.bool(x_id == y_id).all() != True:
739 msg = build_err_msg(
740 [x, y],
741 err_msg + '\n%s location mismatch:'
742 % (hasval), verbose=verbose, header=header,
743 names=names,
744 precision=precision)
--> 745 raise AssertionError(msg)
746 # If there is a scalar, then here we know the array has the same
747 # flag as it everywhere, so we should return the scalar flag.
748 if isinstance(x_id, bool) or x_id.ndim == 0:
AssertionError:
Arrays are not almost equal to 6 decimals
nan location mismatch:
ACTUAL: array([nan, nan])
DESIRED: array([0, 0])
We expected the test to return the original array since you can’t really normalize an array made of zeros, but this clearly didn’t happen. This is the first real bug of our function, and it also exposes a weakness in the API of that function - how should we deal with these types of inputs? Division by 0 is a big no no, and it’s very possible that without tests we wouldn’t have thought of such edge cases.
How can we correct this behavior? Currently, if the minimal value is 0 we'll have no issues, and if the maximal value is 0 we also won't have any issues, since we subtract the minimal value from the entire array, which changes the maximal value. So let's add a check that sees whether we're about to divide by zero.
def normalize_array(arr: np.ndarray) -> np.ndarray:
"""Normalizes an array to the range [0, 1].
The maximal element will be 1 and the minimal 0, while the
scale remains the same.
An array containing only identical values will be returned as a zeroed array.
    Parameters
----------
arr : np.ndarray
Input array to normalize
Returns
-------
np.ndarray
Normalized array
"""
arr -= arr.min()
max_ = arr.max()
return arr if np.isclose(max_, 0.) else arr / max_
We used a ternary expression to concisely express the new logic. As a reminder, this expression returns arr if np.isclose(max_, 0.) returns True, and arr / max_ if it returns False. The np.isclose function compares floating point numbers safely, within a small tolerance.
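A classic example of why a tolerance-based comparison matters:
import numpy as np

print(0.1 + 0.2 == 0.3)            # False - floating point representation error
print(np.isclose(0.1 + 0.2, 0.3))  # True - comparison within a small tolerance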
How does the test run now?
def test_norm_zero_is_max():
arr = np.array([0, 0])
result = normalize_array(arr)
np.testing.assert_array_almost_equal(result, arr)
def test_norm_identical_is_zeroed():
arr = np.array([10, 10, 10.], dtype=np.float32)
result = normalize_array(arr)
expected = np.array([0., 0., 0.], dtype=np.float32)
np.testing.assert_array_almost_equal(expected, result)
test_norm_zero_is_max()
test_norm_identical_is_zeroed()
Great! Another potential bug squashed.
One more special input type remains - NaNs. What do we expect to happen? There are three ways to deal with NaNs, as we know:
Using NaN-aware functions instead of the standard ones, and then NaNs can live happily with other values.
Replace them with a sentinel value (0, some large number, etc.).
Remove them altogether.
Removing them (option 3) is usually not a good idea, as it changes the shapes of the arrays. Replacing them is an option, but it changes the statistics of the array, so it might not work for all cases. We can use NaN-aware functions, but their degraded performance is often less than ideal. The "correct" answer really depends on the data. If we don't intend to calculate global statistics for it, like its total mean or standard deviation - for example when working with EEG recordings - then it might be fine to simply replace them. Otherwise we'd have to use NaN-aware functions and leave the NaNs as-is.
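The trade-off between the first two options, in a short sketch:
import numpy as np

data = np.array([1.0, np.nan, 3.0])

print(np.mean(data))        # nan - the NaN propagates through regular functions
print(np.nanmean(data))     # 2.0 - the NaN-aware version ignores it
print(np.nan_to_num(data))  # [1. 0. 3.] - replacement with a sentinel value (0)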
Due to the origin of our data we believe it's fine to simply replace NaNs with 0 (note that np.nan_to_num also converts infinities, to very large finite numbers), so we'll add that line to our function and then write a couple more tests.
def normalize_array(arr: np.ndarray) -> np.ndarray:
"""Normalizes an array to the range [0, 1].
The maximal element will be 1 and the minimal 0, while the
scale remains the same.
Notes
-----
* An array containing only identical values will be returned as a zeroed array.
* NaNs are automatically converted to 0's before computation.
    Parameters
----------
arr : np.ndarray
Input array to normalize
Returns
-------
np.ndarray
Normalized array
"""
arr = np.nan_to_num(arr)
arr -= arr.min()
max_ = arr.max()
return arr if np.isclose(max_, 0.) else arr / max_
def test_norm_with_nan():
arr = np.array([np.nan, 1])
result = normalize_array(arr)
truth = np.array([0, 1.])
np.testing.assert_array_almost_equal(result, truth)
def test_norm_all_nans():
arr = np.array([[np.nan, np.nan]])
result = normalize_array(arr)
truth = np.array([[0., 0.]])
np.testing.assert_array_almost_equal(result, truth)
test_norm_with_nan()
test_norm_all_nans()
One of the benefits of testing is the confidence we gain in our code. In this case I'm pretty sure that once all tests pass, the function is doing what I think it's doing. This sense of security is rare, and if you feel secure about untested code you wrote, then I believe this normalization example shows the number of pitfalls you can fall into when writing "super-simple" scientific code.
If you think about it a bit more, I haven't really tested the function for most inputs. I looked for edge cases and simple ones, but I didn't really do anything with the typical inputs I should actually handle. This should come as no surprise - I can't really normalize large arrays by hand, so how could I get my "truth" array?
The way to get these "truth" values varies. In statistics code, for example, each new model added to a library is verified against a few other implementations, to make sure that SPSS, R and JMP all output the same results given the same inputs. Simpler cases, like the one we have here, may create their ground truths by using "snapshot testing", which simply means taking a snapshot of the current output the code is giving, and making sure that new code we write doesn't give results that differ from the old code.
The process is simple: Run your code for some random input, and save both input and result. Then write a test that reads the input, runs the function on it and tests that the new result is identical to the one saved to disk.
# Let's create mock data
shape1 = (100, 100, 3)
input1 = np.random.randint(20, 100, shape1)
output1 = normalize_array(input1)
shape2 = (10, 4, 1, 3)
input2 = np.random.randn(*shape2) * 12
output2 = normalize_array(input2)
np.savez('tests_demo/normalize_array/snapshot_05_2020.npz', input1=input1, output1=output1, input2=input2, output2=output2)
def test_snapshots():
fname = 'tests_demo/normalize_array/snapshot_05_2020.npz'
data = np.load(fname)
num_of_snaps = len(data) // 2
for snap in range(1, num_of_snaps + 1):
inp = data[f"input{snap}"]
old_output = data[f"output{snap}"]
new_output = normalize_array(inp)
np.testing.assert_array_almost_equal(old_output, new_output)
At this point we can say we’re done. 16 tests for this function seems about right. We can do better (spoiler alert! wait for a couple of lessons from now!), but I believe that the importance of tests came across.
Integration Testing#
Unit tests repeatedly test the functions or methods you write under different inputs, and they are the backbone of any reliable test suite. However, unit tests are not enough, since they don't check the interfaces between the different functions and classes in your application.
Integration tests are larger, heavier tests that take at least two components, or units, of your application and make sure that they interact well with each other.
Obviously, if we take every pair of consecutive functions and write an integration test for it, and then continue with every triplet of consecutive functions, and so on - we'll never finish writing the application. That's why integration testing is used at crucial junctions of our application, between major classes for example.
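As a minimal, hypothetical sketch (using pytest's built-in tmp_path fixture and the normalize_array() function from earlier), an integration test might exercise the I/O layer and the processing layer together rather than each in isolation:
import numpy as np

def test_pipeline_loads_and_normalizes(tmp_path):
    fname = tmp_path / "measurements.npy"
    np.save(fname, np.array([2.0, 4.0, 6.0]))

    raw = np.load(fname)           # unit 1: the I/O layer
    result = normalize_array(raw)  # unit 2: the processing layer

    np.testing.assert_array_almost_equal(result, np.array([0.0, 0.5, 1.0]))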
Prime-Fibonacci Difference Exercise
Write a function that returns the difference between the Fibonacci series and the prime numbers series for the first \(n\) numbers in both series. The output should be an array of numbers, \(n\) items long, and a plot to accompany it. As an example, for \(n=3\), the Fibonacci sequence is [0, 1, 1] while the primes are [2, 3, 5], so I expect the returned array to be [-2, -2, -4]. Write the function in a test-driven development style.
Hint
The Fibonacci series is a series of numbers starting from (0, 1), with its next element being the sum of the previous two numbers: 0, 1, 1, 2, 3, 5, … .
Prime numbers start from 2 and are only divisible by themselves and 1 without a remainder.
Don’t try to implement things in a performant way with fancy algorithms. The focus here is the unit tests and test-driven development.
Solution
def generate_fibonacci(n: int):
"""
Returns the Fibonacci series up to some n.
Parameters
----------
n : int
Number of Fibonacci elements to return
"""
vals = (0, 1)
result = [0, 1]
    for idx in range(n):
vals = (vals[1], vals[0] + vals[1])
result.append(vals[1])
    return np.array(result[:n])
def generate_primes(n: int):
"""
Returns the prime numbers series up to some n.
Parameters
----------
n : int
Number of prime numbers to return
"""
num = 3
result = [2]
primes_found = 0
    while primes_found < (n - 1):
for divisor in range(num-1, 1, -1):
if num % divisor == 0:
break
else:
primes_found += 1
result.append(num)
num += 1
return np.array(result)
def generate_differences(n: int):
    """
    Returns the element-wise difference between the Fibonacci series and
    the prime numbers series.
    Parameters
    ----------
    n : int
        Number of elements to compare in each series
    """
    if not isinstance(n, int) or isinstance(n, bool) or n < 0:
        raise TypeError(f"Expected a non-negative integer, got {n!r}")
    if n == 0:
        return np.array([])
    fib_arr = generate_fibonacci(n)
    prime_arr = generate_primes(n)
    return fib_arr - prime_arr
Add tests for your new function.
Solution
class TestFib:
fib = np.array([0, 1, 1, 2, 3, 5, 8, 13])
primes = np.array([2, 3, 5, 7, 11, 13, 17, 19])
@pytest.mark.parametrize('inp', [-1, 1., 'a', {},])
def test_invalid_inp(self, inp):
with pytest.raises(TypeError):
generate_differences(inp)
def test_fib_valid(self):
result = generate_fibonacci(4)
assert np.array_equal(result, self.fib[:4])
def test_fib_single(self):
result = generate_fibonacci(1)
assert np.array_equal(result, np.array([0]))
def test_primes_valid(self):
result = generate_primes(6)
assert np.array_equal(result, self.primes[:6])
def test_primes_single(self):
result = generate_primes(1)
assert result == np.array([2])
# Integration tests
def test_valid(self):
result = generate_differences(8)
assert np.array_equal((self.fib - self.primes), result)
def test_zero(self):
assert len(generate_differences(0)) == 0
Class 8b: Linting & Formatting#
(Or how to keep your code up-to-standards)
Linting#
The word “lint” means fluff (fibers and dirt), hinting at the action being performed – “cleaning” and optimizing the code. In general, a linter is a tool designed to identify logical (execution issues, “bugs”) and stylistic errors in a way that allows for the identification and correction of code parts likely to cause errors or that are not written according to the conventions of the programming language being used.
For example, in Python, a linter can identify unused variables (which won’t cause a runtime error), unmatched parentheses, variables that are used but not defined (which will cause runtime errors), etc. The linter also highlights these errors in a way that hints at the type of error by emphasizing the parts of the code causing it (moreover, a message describing the type of error will appear in a small window if you hover over the highlighted piece of code).
There are many types of linters for Python, differing in the types of errors they identify and fix. For example, some linters identify both logical and stylistic errors (like Pylint), while others identify only one type of error (logical or stylistic, like PyFlakes or pycodestyle, respectively). Additionally, there are combo-linters (multi-linters?) that combine the functionalities of several different linters (like Flake8 and Pylama). From my experience, Pylint and Flake8 are the most popular.
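For instance, a linter such as Flake8 or Pylint would flag every line in this short, intentionally broken snippet (the exact codes and messages vary between tools):
import os  # flagged as an unused import

def greet(name):
    greeting = "Hello"     # flagged as an unused local variable
    return f"Hi, {nmae}!"  # flagged as an undefined name (typo in 'name')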
Formatting#
As the name suggests, formatting organizes the code. Its main purpose is to arrange the code uniformly so that all code related to a project is written under the same stylistic conventions (spaces between operators and variables, number of spaces defining a “tab”, maximum number of characters per line before splitting it, etc.). Generally, it’s important to use a formatter to style the code because it allows us to maintain uniformity and readability, which helps when collaborating with fellow programmers on the same piece of code or project.
There are several tools available for formatting as well (formatters?), each offering a different approach to styling. For example, Black, one of the most popular formatters, does not arrange imported packages in a specific order, although PEP8 (the official style guide for Python code) has several rules for ordering imports. Conversely, isort focuses solely on the task of arranging imports in the code. In fact, there was an almost four-year-long discussion about integrating import sorting into Black's functionality, but in the end it was decided to keep them separate.
Additionally, there are other formatters like autopep8 and YAPF (short for Yet Another Python Formatter, kudos for the name), which are more similar to Black but operate under different assumptions about the code and focus on different aspects of it. Finally, although the dry definition of the formatter's task falls under the linter's responsibilities, each tool excels in its own field and is sensitive to different types of stylistic issues in the code. So rather than competing with each other, they complement each other and help us write code that works properly (linting) in a readable and consistent manner (linting and formatting).
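To get a feel for what a formatter does, here is a rough before/after sketch of running Black on a messy snippet (the exact output may differ slightly between Black versions):
# Before formatting - inconsistent spacing, quoting and layout:
x={'a':1,"b" :2}
def f (a,b = 3): return a+ b

# After formatting - one consistent, PEP8-friendly style:
x = {"a": 1, "b": 2}


def f(a, b=3):
    return a + b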