Class 7: More pandas, Visualizations, xarray and Modeling

More pandas!#

Working with String DataFrames#

Pandas’ Series instances with a dtype of object or string expose a str attribute that enables vectorized string operations. These can come in tremendously handy, particularly when cleaning the data and performing aggregations on manually submitted fields.

Let’s imagine having the misfortune of reading some CSV data and finding the following headers:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

messy_strings = [
    'Id___name', 'AGE', ' DomHand ', np.nan, 'qid    score1', 'score2', 3,
    ' COLOR_ SHAPe   _size', 'origin_residence immigration'
]
s = pd.Series(messy_strings, dtype="string", name="Messy Strings")
s

                     Id___name
                           AGE
                      DomHand 
                          <NA>
                 qid    score1
                        score2
                             3
          COLOR_ SHAPe   _size
  origin_residence immigration
Name: Messy Strings, dtype: string

To try and parse something more reasonable, we might first want to remove all unnecessary whitespace and underscores. One way to achieve that would be:

s_1 = s.str.strip().str.replace(r"[_\s]+", " ", regex=True).str.lower()  # raw string avoids an invalid "\s" escape
s_1

                       id name
                           age
                       domhand
                          <NA>
                    qid score1
                        score2
                             3
              color shape size
  origin residence immigration
Name: Messy Strings, dtype: string

Let’s break this down:

strip() removed all whitespace from the beginning and end of the string.
We used a regular expression to replace every run of one or more (+) whitespace characters (\s) or underscores with a single space.
We converted all characters to lowercase.

Next, we’ll split() strings separated by whitespace and extract an array of the values:

s_2 = s_1.str.split(expand=True)
print(f"DataFrame:\n{s_2}")

s_3 = s_2.to_numpy().flatten()
print(f"\nArray:\n{s_3}")

DataFrame:
         0          1            2
0       id       name         <NA>
1      age       <NA>         <NA>
2  domhand       <NA>         <NA>
3     <NA>       <NA>         <NA>
4      qid     score1         <NA>
5   score2       <NA>         <NA>
6        3       <NA>         <NA>
7    color      shape         size
8   origin  residence  immigration

Array:
['id' 'name' <NA> 'age' <NA> <NA> 'domhand' <NA> <NA> <NA> <NA> <NA> 'qid'
 'score1' <NA> 'score2' <NA> <NA> '3' <NA> <NA> 'color' 'shape' 'size'
 'origin' 'residence' 'immigration']

Finally, we can get rid of the <NA> values:

column_names = s_3[~pd.isnull(s_3)]
column_names

array(['id', 'name', 'age', 'domhand', 'qid', 'score1', 'score2', '3',
       'color', 'shape', 'size', 'origin', 'residence', 'immigration'],
      dtype=object)
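The same cleanup can also be written as a single chain. The sketch below is one possible alternative, not necessarily the approach intended here: it calls split() without expand=True so that each entry becomes a list of tokens, then uses explode() to give each token its own row and dropna() to discard the missing entry. The shortened `messy` series is a stand-in for the headers above.

```python
import numpy as np
import pandas as pd

# A shortened, hypothetical stand-in for the messy headers above
messy = pd.Series(
    ["Id___name", "AGE", " DomHand ", np.nan, "qid    score1", " COLOR_ SHAPe   _size"]
)

tokens = (
    messy.str.strip()                         # trim leading/trailing whitespace
    .str.replace(r"[_\s]+", " ", regex=True)  # collapse underscores/whitespace runs
    .str.lower()
    .str.split()                              # each entry becomes a list of tokens
    .explode()                                # one token per row
    .dropna()                                 # discard the missing entry
    .tolist()
)
print(tokens)
```

Compared with the expand=True route, this avoids building a rectangular dataframe padded with missing values, at the cost of a repeated index after explode().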

DataFrame String Operations Exercise

Generate a 1000x1-shaped pd.DataFrame filled with 3-letter strings. Use the string module’s ascii_lowercase attribute and numpy’s random module.

Add a column indicating if the string in this row has a z as its 2nd character.

Add a third column containing the capitalized and reversed versions of the original strings.

Concatenation and Merging#

Similarly to NumPy arrays, Series and DataFrame objects can be concatenated as well. However, having indices can often make this operation somewhat less trivial.

ser1 = pd.Series(['a', 'b', 'c'], index=[1, 2, 3])
ser2 = pd.Series(['d', 'e', 'f'], index=[4, 5, 6])
pd.concat([ser1, ser2])  # row-wise (axis=0) by default

1    a
2    b
3    c
4    d
5    e
6    f
dtype: object

Let’s do the same with dataframes:

df1 = pd.DataFrame([['a', 'A'], ['b', 'B']], columns=['let', 'LET'], index=[0, 1])
df2 = pd.DataFrame([['c', 'C'], ['d', 'D']], columns=['let', 'LET'], index=[2, 3])
pd.concat([df1, df2])  # again, along the first axis

	let	LET
0	a	A
1	b	B
2	c	C
3	d	D

This time, let’s complicate things a bit, and introduce different column names:

df1 = pd.DataFrame([['a', 'A'], ['b', 'B']], columns=['let1', 'LET1'], index=[0, 1])
df2 = pd.DataFrame([['c', 'C'], ['d', 'D']], columns=['let2', 'LET2'], index=[2, 3])
pd.concat([df1, df2])  # the column labels differ, so pandas takes their union and fills the gaps with NaN

	let1	LET1	let2	LET2
0	a	A	NaN	NaN
1	b	B	NaN	NaN
2	NaN	NaN	c	C
3	NaN	NaN	d	D

Because the row indices do not overlap either, the same result would be achieved by concatenating along the columns:

pd.concat([df1, df2], axis=1)

	let1	LET1	let2	LET2
0	a	A	NaN	NaN
1	b	B	NaN	NaN
2	NaN	NaN	c	C
3	NaN	NaN	d	D

But what happens if we introduce overlapping indices?

df1 = pd.DataFrame([['a', 'A'], ['b', 'B']], columns=['let', 'LET'], index=[0, 1])
df2 = pd.DataFrame([['c', 'C'], ['d', 'D']], columns=['let', 'LET'], index=[0, 2])
pd.concat([df1, df2])

	let	LET
0	a	A
1	b	B
0	c	C
2	d	D

Nothing, really! While not recommended in practice, pandas won’t judge you.

If, however, we wish to keep the integrity of the indices, we can use the verify_integrity keyword:

df1 = pd.DataFrame([['a', 'A'], ['b', 'B']], columns=['let', 'LET'], index=[0, 1])
df2 = pd.DataFrame([['c', 'C'], ['d', 'D']], columns=['let', 'LET'], index=[0, 2])
pd.concat([df1, df2], verify_integrity=True)

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[10], line 3
      1 df1 = pd.DataFrame([['a', 'A'], ['b', 'B']], columns=['let', 'LET'], index=[0, 1])
      2 df2 = pd.DataFrame([['c', 'C'], ['d', 'D']], columns=['let', 'LET'], index=[0, 2])
----> 3 pd.concat([df1, df2], verify_integrity=True)

File /opt/hostedtoolcache/Python/3.10.16/x64/lib/python3.10/site-packages/pandas/core/reshape/concat.py:395, in concat(objs, axis, join, ignore_index, keys, levels, names, verify_integrity, sort, copy)
    380     copy = False
    382 op = _Concatenator(
    383     objs,
    384     axis=axis,
   (...)
    392     sort=sort,
    393 )
--> 395 return op.get_result()

File /opt/hostedtoolcache/Python/3.10.16/x64/lib/python3.10/site-packages/pandas/core/reshape/concat.py:671, in _Concatenator.get_result(self)
    669 for obj in self.objs:
    670     indexers = {}
--> 671     for ax, new_labels in enumerate(self.new_axes):
    672         # ::-1 to convert BlockManager ax to DataFrame ax
    673         if ax == self.bm_axis:
    674             # Suppress reindexing on concat axis
    675             continue

File properties.pyx:36, in pandas._libs.properties.CachedProperty.__get__()

File /opt/hostedtoolcache/Python/3.10.16/x64/lib/python3.10/site-packages/pandas/core/reshape/concat.py:702, in _Concatenator.new_axes(self)
    699 @cache_readonly
    700 def new_axes(self) -> list[Index]:
    701     ndim = self._get_result_dim()
--> 702     return [
    703         self._get_concat_axis if i == self.bm_axis else self._get_comb_axis(i)
    704         for i in range(ndim)
    705     ]

File /opt/hostedtoolcache/Python/3.10.16/x64/lib/python3.10/site-packages/pandas/core/reshape/concat.py:703, in <listcomp>(.0)
    699 @cache_readonly
    700 def new_axes(self) -> list[Index]:
    701     ndim = self._get_result_dim()
    702     return [
--> 703         self._get_concat_axis if i == self.bm_axis else self._get_comb_axis(i)
    704         for i in range(ndim)
    705     ]

File properties.pyx:36, in pandas._libs.properties.CachedProperty.__get__()

File /opt/hostedtoolcache/Python/3.10.16/x64/lib/python3.10/site-packages/pandas/core/reshape/concat.py:766, in _Concatenator._get_concat_axis(self)
    761 else:
    762     concat_axis = _make_concat_multiindex(
    763         indexes, self.keys, self.levels, self.names
    764     )
--> 766 self._maybe_check_integrity(concat_axis)
    768 return concat_axis

File /opt/hostedtoolcache/Python/3.10.16/x64/lib/python3.10/site-packages/pandas/core/reshape/concat.py:774, in _Concatenator._maybe_check_integrity(self, concat_index)
    772 if not concat_index.is_unique:
    773     overlap = concat_index[concat_index.duplicated()].unique()
--> 774     raise ValueError(f"Indexes have overlapping values: {overlap}")

ValueError: Indexes have overlapping values: Index([0], dtype='int64')

If we don’t care about the indices, we can just ignore them:

pd.concat([df1, df2], ignore_index=True)  # resets the index

	let	LET
0	a	A
1	b	B
2	c	C
3	d	D

We can also create a new MultiIndex if that happens to make more sense:

pd.concat([df1, df2], keys=['df1', 'df2'])  # "remembers" the origin of the data, super useful!

		let	LET
df1	0	a	A
df1	1	b	B
df2	0	c	C
df2	2	d	D
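As a small sketch of why the keys are useful, the outer level of the resulting MultiIndex can be used to pull each original dataframe back out, and the (key, label) pairs stay unique even though the inner labels overlap. The variable name `combined` is introduced here for illustration only.

```python
import pandas as pd

df1 = pd.DataFrame([["a", "A"], ["b", "B"]], columns=["let", "LET"], index=[0, 1])
df2 = pd.DataFrame([["c", "C"], ["d", "D"]], columns=["let", "LET"], index=[0, 2])

combined = pd.concat([df1, df2], keys=["df1", "df2"])

# Select everything that originated from df2 via the outer index level
print(combined.loc["df2"])

# Address a single cell with a (key, original label) tuple
print(combined.loc[("df2", 2), "let"])

# The inner labels overlap (0 appears twice), but the full MultiIndex is unique
print(combined.index.is_unique)
```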

A common real-world example of concatenation is joining two datasets sampled at different times. Suppose that on day 1 we took measurements at 8:00, 10:00, 14:00 and 16:00, but on day 2 we were a bit dizzy and took them at 8:00, 10:00, 13:00 and 16:30. On top of that, on day 2 we recorded another parameter that we forgot to measure on day 1.

The default concatenation behavior of pandas keeps all the data. In database terms (SQL people rejoice!) it’s called an “outer join”:

# Prepare mock data
day_1_times = pd.to_datetime(['08:00', '10:00', '14:00', '16:00'],
                             format='%H:%M').time
day_2_times = pd.to_datetime(['08:00', '10:00', '13:00', '16:30'],
                             format='%H:%M').time

day_1_data = {
    "temperature": [36.6, 36.7, 37.0, 36.8],
    "humidity": [30., 31., 30.4, 30.4]
}
day_2_data = {
    "temperature": [35.9, 36.1, 36.5, 36.2],
    "humidity": [32.2, 34.2, 30.9, 32.6],
    "light": [200, 130, 240, 210]
}

df_1 = pd.DataFrame(day_1_data, index=day_1_times)
df_2 = pd.DataFrame(day_2_data, index=day_2_times)

df_1

	temperature	humidity
08:00:00	36.6	30.0
10:00:00	36.7	31.0
14:00:00	37.0	30.4
16:00:00	36.8	30.4

Note

Note how pd.to_datetime() returns a DatetimeIndex object which exposes a time property, allowing us to easily remove the “date” part of the returned “datetime”, considering it is not represented in our mock data.
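A minimal sketch of what the note describes: parsing time-only strings still yields full timestamps (pandas fills in a placeholder date), and the time property strips that date away again. The names `stamps` and `times` are illustrative.

```python
import datetime

import pandas as pd

stamps = pd.to_datetime(["08:00", "16:30"], format="%H:%M")
print(stamps)  # a DatetimeIndex; a default date is attached to each time

times = stamps.time  # NumPy array of plain datetime.time objects
print(times)
print(times[0] == datetime.time(8, 0))
```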

df_2

	temperature	humidity	light
08:00:00	35.9	32.2	200
10:00:00	36.1	34.2	130
13:00:00	36.5	30.9	240
16:30:00	36.2	32.6	210

# Outer join
pd.concat([df_1, df_2], join='outer')  # outer join is the default behavior  

	temperature	humidity	light
08:00:00	36.6	30.0	NaN
10:00:00	36.7	31.0	NaN
14:00:00	37.0	30.4	NaN
16:00:00	36.8	30.4	NaN
08:00:00	35.9	32.2	200.0
10:00:00	36.1	34.2	130.0
13:00:00	36.5	30.9	240.0
16:30:00	36.2	32.6	210.0

To take the intersection of the columns, we have to use an inner join. The intersection consists of the columns that are common to all datasets.

# Inner join - the excess data column was dropped (index is still not unique)
pd.concat([df_1, df_2], join='inner')

	temperature	humidity
08:00:00	36.6	30.0
10:00:00	36.7	31.0
14:00:00	37.0	30.4
16:00:00	36.8	30.4
08:00:00	35.9	32.2
10:00:00	36.1	34.2
13:00:00	36.5	30.9
16:30:00	36.2	32.6
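To make the difference between the two join modes concrete, the sketch below compares the columns that survive each one. The two tiny dataframes (`df_a`, `df_b`) are made-up stand-ins for the day 1 and day 2 data.

```python
import pandas as pd

# Hypothetical miniature versions of the two days of measurements
df_a = pd.DataFrame({"temperature": [36.6, 36.7], "humidity": [30.0, 31.0]})
df_b = pd.DataFrame({"temperature": [35.9], "humidity": [32.2], "light": [200]})

outer = pd.concat([df_a, df_b], join="outer")  # union of the columns
inner = pd.concat([df_a, df_b], join="inner")  # intersection of the columns

print(sorted(outer.columns))
print(sorted(inner.columns))

# In the outer join, rows that came from df_a have no "light" value
print(outer["light"].isna().sum())
```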

pd.concat() has no keyword for choosing the exact columns of the result, but we can select them from the concatenated dataframe afterwards (or reindex the inputs beforehand). All in all, this basic functionality is easy to understand and allows for high flexibility.

Finally, joining on the columns will require the indices to be unique:

pd.concat([df_1, df_2], join='inner', axis='columns')

	temperature	humidity	temperature	humidity	light
08:00:00	36.6	30.0	35.9	32.2	200
10:00:00	36.7	31.0	36.1	34.2	130

This doesn’t look so good. The columns are a mess and we’re barely left with any data.

Our best option using pd.concat() might be something like:

df_concat = pd.concat([df_1, df_2], keys=["Day 1", "Day 2"])
df_concat

		temperature	humidity	light
Day 1	08:00:00	36.6	30.0	NaN
	10:00:00	36.7	31.0	NaN
	14:00:00	37.0	30.4	NaN
	16:00:00	36.8	30.4	NaN
Day 2	08:00:00	35.9	32.2	200.0
	10:00:00	36.1	34.2	130.0
	13:00:00	36.5	30.9	240.0
	16:30:00	36.2	32.6	210.0

Or maybe an unstacked version:

df_concat.unstack(0)

	temperature		humidity		light
	Day 1	Day 2	Day 1	Day 2	Day 1	Day 2
08:00:00	36.6	35.9	30.0	32.2	NaN	200.0
10:00:00	36.7	36.1	31.0	34.2	NaN	130.0
13:00:00	NaN	36.5	NaN	30.9	NaN	240.0
14:00:00	37.0	NaN	30.4	NaN	NaN	NaN
16:00:00	36.8	NaN	30.4	NaN	NaN	NaN
16:30:00	NaN	36.2	NaN	32.6	NaN	210.0

We could also use pd.merge():

pd.merge(df_1,
         df_2,
         how="outer",           # Keep all indices (rather than just the intersection)
         left_index=True,       # Use left index
         right_index=True,      # Use right index
         suffixes=("_1", "_2")) # Suffixes to use for overlapping columns

	temperature_1	humidity_1	temperature_2	humidity_2	light
08:00:00	36.6	30.0	35.9	32.2	200.0
10:00:00	36.7	31.0	36.1	34.2	130.0
13:00:00	NaN	NaN	36.5	30.9	240.0
14:00:00	37.0	30.4	NaN	NaN	NaN
16:00:00	36.8	30.4	NaN	NaN	NaN
16:30:00	NaN	NaN	36.2	32.6	210.0

The dataframe’s merge() method also enables easily combining columns from a different (but similarly indexed) dataframe:

mouse_id = [511, 512, 513, 514]
meas1 = [67, 66, 89, 92]
meas2 = [45, 45, 65, 61]

data_1 = {"ID": [500, 501, 502, 503], "Blood Volume": [100, 102, 99, 101]}
data_2 = {"ID": [500, 501, 502, 503], "Monocytes": [20, 19, 25, 21]}

df_1 = pd.DataFrame(data_1)
df_2 = pd.DataFrame(data_2)
df_1

	ID	Blood Volume
0	500	100
1	501	102
2	502	99
3	503	101

df_1.merge(df_2)  # merge identified that the only "key" shared by the two tables is the 'ID' column

	ID	Blood Volume	Monocytes
0	500	100	20
1	501	102	19
2	502	99	25
3	503	101	21
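When the keys of the two tables only partially overlap, it helps to state the key explicitly and ask merge() to report where each row came from. The sketch below uses made-up tables (`blood`, `mono`) in the spirit of the example above; `on`, `how` and `indicator` are regular merge() parameters.

```python
import pandas as pd

# Hypothetical tables whose IDs only partially overlap
blood = pd.DataFrame({"ID": [500, 501, 502], "Blood Volume": [100, 102, 99]})
mono = pd.DataFrame({"ID": [501, 502, 503], "Monocytes": [19, 25, 21]})

merged = blood.merge(
    mono,
    on="ID",         # name the key column explicitly instead of relying on inference
    how="outer",     # keep IDs that appear in either table
    indicator=True,  # adds a "_merge" column: left_only / right_only / both
)
print(merged)
```

Rows marked left_only or right_only are the ones that a default (inner) merge would have dropped silently.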

Database-like operations are a very broad topic with advanced implementations in pandas.

Concatenation and Merging Exercise

Create three dataframes with random values and shapes of (10, 2), (10, 1), (15, 3). Their index should be simple ordinal integers, and their column names should be different.

Concatenate these dataframes over the second axis using pd.concat().

Concatenate these dataframes over the second axis using pd.merge().

Grouping#

Yet another SQL-like feature that pandas possesses is the group-by operation, sometimes known as “split-apply-combine”.

# Mock data
subject = range(100, 200)
alive = np.random.choice([True, False], 100)
placebo = np.random.choice([True, False], 100)
measurement_1 = np.random.random(100)
measurement_2 = np.random.random(100)
data = {
    "Subject ID": subject,
    "Alive": alive,
    "Placebo": placebo,
    "Measurement 1": measurement_1,
    "Measurement 2": measurement_2
}
df = pd.DataFrame(data).set_index("Subject ID")
df

	Alive	Placebo	Measurement 1	Measurement 2
Subject ID
100	True	True	0.061373	0.416078
101	False	True	0.070015	0.336608
102	False	False	0.013424	0.357559
103	True	True	0.329372	0.945809
104	False	True	0.235029	0.124525
...	...	...	...	...
195	False	True	0.551530	0.805579
196	True	False	0.919658	0.583379
197	True	True	0.976370	0.191885
198	False	True	0.655571	0.457506
199	False	False	0.460519	0.251235

100 rows × 4 columns

The most sensible thing to do is to group by either the “Alive” or the “Placebo” columns (or both). This is the “split” part.

grouped = df.groupby('Alive')
grouped  # DataFrameGroupBy object - intermediate object ready to be evaluated

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7fdb42c8f580>

This intermediate object is an internal pandas representation that lets it run computations very quickly once we actually ask something about these groups. Suppose we want the mean of Measurement 1: until we explicitly call grouped.mean(), pandas does very little actual computation. This is called “lazy evaluation”.

The intermediate object has some useful attributes:

grouped.groups

{False: [101, 102, 104, 105, 107, 108, 111, 113, 115, 118, 119, 120, 121, 122, 123, 124, 128, 131, 132, 133, 135, 137, 138, 139, 141, 144, 148, 152, 154, 155, 157, 159, 160, 162, 163, 164, 165, 171, 172, 174, 175, 176, 177, 178, 181, 183, 184, 186, 187, 188, 190, 191, 194, 195, 198, 199], True: [100, 103, 106, 109, 110, 112, 114, 116, 117, 125, 126, 127, 129, 130, 134, 136, 140, 142, 143, 145, 146, 147, 149, 150, 151, 153, 156, 158, 161, 166, 167, 168, 169, 170, 173, 179, 180, 182, 185, 189, 192, 193, 196, 197]}

len(grouped)  # True and False

If we wish to run some actual processing, we have to use an aggregation function:

grouped.sum()

	Placebo	Measurement 1	Measurement 2
Alive
False	33	26.580577	27.395772
True	26	24.388858	17.539337

grouped.mean()

	Placebo	Measurement 1	Measurement 2
Alive
False	0.589286	0.474653	0.489210
True	0.590909	0.554292	0.398621

grouped.size()

Alive
False    56
True     44
dtype: int64

If we just wish to see one of the groups, we can use get_group():

grouped.get_group(True).head()

	Alive	Placebo	Measurement 1	Measurement 2
Subject ID
100	True	True	0.061373	0.416078
103	True	True	0.329372	0.945809
106	True	True	0.927303	0.936306
109	True	True	0.853655	0.090558
110	True	False	0.506752	0.070763

We can also call several functions at once using the agg() method:

grouped.agg(["mean", "std"]).drop("Placebo", axis=1)  # string names avoid the FutureWarning raised for NumPy callables

	Measurement 1		Measurement 2
	mean	std	mean	std
Alive
False	0.474653	0.306343	0.489210	0.291327
True	0.554292	0.300351	0.398621	0.320286
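Another way to request several statistics is "named aggregation", where each output column is given as name=(input column, function). The sketch below rebuilds a small mock dataframe with a seeded generator; the output column names (`m1_mean`, `m2_std`, `n`) are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded so the sketch is reproducible
df = pd.DataFrame({
    "Alive": rng.choice([True, False], 100),
    "Measurement 1": rng.random(100),
    "Measurement 2": rng.random(100),
})

summary = df.groupby("Alive").agg(
    m1_mean=("Measurement 1", "mean"),
    m2_std=("Measurement 2", "std"),
    n=("Measurement 1", "count"),
)
print(summary)
```

The result has flat, self-describing column names instead of a two-level column index.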

Grouping by multiple columns:

grouped2 = df.groupby(['Alive', 'Placebo'])
grouped2

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7fdb42b04820>

grouped2.agg(["sum", "var"])  # string names avoid the FutureWarning raised for NumPy callables

		Measurement 1		Measurement 2
		sum	var	sum	var
Alive	Placebo
False	False	10.313805	0.073764	11.969978	0.062914
False	True	16.266773	0.109746	15.425794	0.101431
True	False	11.413448	0.062501	7.040621	0.098873
True	True	12.975410	0.104905	10.498716	0.109142

groupby() offers many more features, described in the pandas documentation.

Grouping Exercise

Create a dataframe with two columns, 10,000 entries in length. The first should be a random boolean column, and the second should be a sine wave from 0 to 20$\pi$. This simulates measuring a parameter from two distinct groups.

Group the dataframe by your boolean column, creating a GroupBy object.

Plot the values of the grouped dataframe.

Use the rolling() method to create a rolling average window of length 5 and overlay the result.

Other Pandas Features#

Pandas has a TON of features and small implementation details that are there to make your life simpler. Features like IntervalIndex to index the data between two numbers instead of having a single label, for example, are very nice and ergonomic if you need them. Sparse DataFrames are also included, as well as many other computational tools, serialization capabilities, and more. If you need it - there’s a good chance it already exists as a method in the pandas jungle.
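As a small illustration of the IntervalIndex idea, the sketch below labels three ranges and looks values up by any number that falls inside them. The breakpoints and labels are invented for the example.

```python
import pandas as pd

# Three right-closed intervals: (0, 10], (10, 20], (20, 30]
bins = pd.IntervalIndex.from_breaks([0, 10, 20, 30])
levels = pd.Series(["low", "mid", "high"], index=bins)

print(levels.loc[15])  # looked up by the interval that contains 15
print(levels.loc[10])  # 10 belongs to (0, 10] because intervals are closed on the right
```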

Data Visualization#

As mentioned previously, the visualization landscape in Python is rich, and is becoming richer by the day. Below, we’ll explore some of the options we have.

* We’ll assume that 2D data is accessed from a dataframe.

`matplotlib`#

The built-in df.plot() method is a simple wrapper around pyplot from matplotlib, and as we’ve seen before it works quite well for many types of plots, as long as we’re happy to keep them all overlaid in one figure. Let’s look at examples taken straight from the visualization manual of pandas:

ts = pd.Series(np.random.randn(1000),
               index=pd.date_range('1/1/2000', periods=1000))
df = pd.DataFrame(np.random.randn(1000, 4),
                  index=ts.index,
                  columns=list('ABCD'))
df = df.cumsum()
df

	A	B	C	D
2000-01-01	1.806046	-1.315358	-0.118597	-0.869575
2000-01-02	2.615834	-1.527681	-0.321535	-0.019145
2000-01-03	5.313991	-1.298248	-1.802322	0.765442
2000-01-04	6.391797	-0.893820	-1.467048	1.528912
2000-01-05	7.264324	-2.085985	-2.208286	0.355300
...	...	...	...	...
2002-09-22	47.939012	-57.208453	-59.112450	-13.008850
2002-09-23	46.619567	-57.203694	-57.556765	-12.782334
2002-09-24	46.181044	-56.922109	-57.951912	-12.397390
2002-09-25	46.193654	-56.648810	-59.023997	-13.253690
2002-09-26	45.639238	-58.586726	-59.984425	-12.305346

1000 rows × 4 columns

_ = df.plot()

[Figure: line plot of the cumulative sums in columns A to D against the date index]

Nice to see we got a few things for “free”, like sane x-axis labels and the legend.

We can tell pandas which column corresponds to x, and which to y:

_ = df.plot(x='A', y='B')

[Figure: line plot of column B against column A]

There are, of course, many possible types of plots that can be directly called from the pandas interface:

_ = df.iloc[:10, :].plot.bar()

[Figure: grouped bar chart of the first 10 rows, one bar per column]

_ = df.plot.hist(alpha=0.5)

[Figure: overlaid semi-transparent histograms of columns A to D]

Histogramming each column separately can be done by calling the hist() method directly:

_ = df.hist()

[Figure: one histogram per column, arranged in a grid]

Lastly, a personal favorite:

_ = df.plot.hexbin(x='A', y='B', gridsize=25)

[Figure: hexagonal-bin plot of column B against column A]
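When overlaying every column on one set of axes is not what we want, plot() also accepts subplots=True, optionally with a layout. The sketch below is a minimal example under the assumption that a non-interactive backend is acceptable (the "Agg" line can be dropped inside a notebook); the random data merely mimics the dataframe above.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the sketch also runs without a display

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(200, 4).cumsum(axis=0), columns=list("ABCD"))

# One axes per column, arranged in a 2x2 grid that shares the x-axis
axes = df.plot(subplots=True, layout=(2, 2), figsize=(8, 6), sharex=True)
print(axes.shape)
```

The call returns a NumPy array of matplotlib Axes objects, so each panel can still be customized individually.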

Altair#

Matplotlib (and pandas’ interface to it) is the gold standard in the Python ecosystem, but there are other ecosystems as well. For example, Vega-Lite is a well-known plotting library for the web and JavaScript, and it uses a different grammar to define its plots. If you’re familiar with it, you’ll be delighted to hear that Python’s altair provides bindings to it. Even if you’ve never heard of it, it’s always nice to see that there are many different ways to tell a computer how to draw stuff on the screen. Let’s look at a couple of examples:

import altair as alt

chart = alt.Chart(df)
chart.mark_point().encode(x='A', y='B')

In Altair you first create a chart object (a simple Chart above), and then you ask it to mark_point(), or mark_line(), to add that type of visualization to the chart. Then we specify the axis and other types of parameters (like color) and map (or encode) them to their corresponding column.

Let’s see how Altair works with other datatypes:

datetime_df = pd.DataFrame({'value': np.random.randn(100).cumsum()},
                           index=pd.date_range('2020', freq='D', periods=100))

datetime_df.head()

	value
2020-01-01	-0.804062
2020-01-02	-1.773340
2020-01-03	-0.634067
2020-01-04	-1.304664
2020-01-05	-1.145437

chart = alt.Chart(datetime_df.reset_index())
chart.mark_line().encode(x='index:T', y='value:Q')

Above we plot the datetime data by telling Altair that the column named “index” is of type T, i.e. Time, while the column “value” is of type Q for quantitative.

One of the great things about these charts is that they can easily be made to be interactive:

from vega_datasets import data  # ready-made DFs for easy visualization examples

cars = data.cars
cars()

	Name	Miles_per_Gallon	Cylinders	Displacement	Horsepower	Weight_in_lbs	Acceleration	Year	Origin
0	chevrolet chevelle malibu	18.0	8	307.0	130.0	3504	12.0	1970-01-01	USA
1	buick skylark 320	15.0	8	350.0	165.0	3693	11.5	1970-01-01	USA
2	plymouth satellite	18.0	8	318.0	150.0	3436	11.0	1970-01-01	USA
3	amc rebel sst	16.0	8	304.0	150.0	3433	12.0	1970-01-01	USA
4	ford torino	17.0	8	302.0	140.0	3449	10.5	1970-01-01	USA
...	...	...	...	...	...	...	...	...	...
401	ford mustang gl	27.0	4	140.0	86.0	2790	15.6	1982-01-01	USA
402	vw pickup	44.0	4	97.0	52.0	2130	24.6	1982-01-01	Europe
403	dodge rampage	32.0	4	135.0	84.0	2295	11.6	1982-01-01	USA
404	ford ranger	28.0	4	120.0	79.0	2625	18.6	1982-01-01	USA
405	chevy s-10	31.0	4	119.0	82.0	2720	19.4	1982-01-01	USA

406 rows × 9 columns

cars_url = data.cars.url
cars_url  # The data is online and in json format which is standard practice for altair-based workflows

'https://cdn.jsdelivr.net/npm/vega-datasets@v1.29.0/data/cars.json'

alt.Chart(cars_url).mark_point().encode(
    x='Miles_per_Gallon:Q',
    y='Horsepower:Q',
    color='Origin:N',  # N for nominal, i.e. discrete and unordered (just like colors)
)

brush = alt.selection_interval()  # selection of type 'interval'

alt.Chart(cars_url).mark_point().encode(
    x='Miles_per_Gallon:Q',
    y='Horsepower:Q',
    color='Origin:N',  # N for nominal, i.e. discrete and unordered (just like colors)
).add_selection(brush)

The selection looks good but doesn’t do anything. Let’s add functionality:

alt.Chart(cars_url).mark_point().encode(
    x='Miles_per_Gallon:Q',
    y='Horsepower:Q',
    color=alt.condition(brush, 'Origin:N', alt.value('lightgray'))
).add_selection(
    brush
)

Altair has a ton more visualization types, some of which are more easily generated than others, and some are easier to generate using Altair rather than Matplotlib.

Bokeh, Holoviews and pandas-bokeh#

Bokeh is another visualization effort in the Python ecosystem, but this time it revolves around web-based plots. Bokeh can be used directly, but it also serves as a backend plotting device for more advanced plotting libraries, like Holoviews and pandas-bokeh. It’s also designed with huge datasets in mind, including ones that don’t fit in memory, which is something that other tools might have trouble visualizing.

import bokeh
from bokeh.io import output_notebook, show
from bokeh.plotting import figure as bkfig
output_notebook()

Loading BokehJS ...

bokeh_figure = bkfig(width=400, height=400)
x = [1, 2, 3, 4, 5]
y = [6, 7, 2, 4, 5]
bokeh_figure.scatter(x,  # circle() with a size value is deprecated since Bokeh 3.4
                     y,
                     size=15,
                     line_color="navy",
                     fill_color="orange",
                     fill_alpha=0.5)

show(bokeh_figure)

We see how bokeh immediately outputs an interactive graph, i.e. an HTML document that will open in your browser (a couple of cells above we kindly asked bokeh to output its plots to the notebook instead). Bokeh can be used for many other types of plots, like:

datetime_df = datetime_df.reset_index()
datetime_df

	index	value
0	2020-01-01	-0.804062
1	2020-01-02	-1.773340
2	2020-01-03	-0.634067
3	2020-01-04	-1.304664
4	2020-01-05	-1.145437
...	...	...
95	2020-04-05	-0.317013
96	2020-04-06	-0.312850
97	2020-04-07	-1.624781
98	2020-04-08	-2.127312
99	2020-04-09	-3.445261

100 rows × 2 columns

bokeh_figure_2 = bkfig(x_axis_type="datetime",
                       title="Value over Time",
                       height=350,
                       width=800)

bokeh_figure_2.xgrid.grid_line_color = None
bokeh_figure_2.ygrid.grid_line_alpha = 0.5
bokeh_figure_2.xaxis.axis_label = 'Time'
bokeh_figure_2.yaxis.axis_label = 'Value'

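# After reset_index() the dates live in the "index" column; datetime_df.index is now a plain RangeIndex
bokeh_figure_2.line(datetime_df["index"], datetime_df["value"])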
show(bokeh_figure_2)

Let’s look at energy consumption, split by source (from the Pandas-Bokeh manual):

url = "https://raw.githubusercontent.com/PatrikHlobil/Pandas-Bokeh/master/docs/Testdata/energy/energy.csv"
df_energy = pd.read_csv(url, parse_dates=["Year"])
df_energy.head()

	Year	Oil	Gas	Coal	Nuclear Energy	Hydroelectricity	Other Renewable
0	1970-01-01	2291.5	826.7	1467.3	17.7	265.8	5.8
1	1971-01-01	2427.7	884.8	1459.2	24.9	276.4	6.3
2	1972-01-01	2613.9	933.7	1475.7	34.1	288.9	6.8
3	1973-01-01	2818.1	978.0	1519.6	45.9	292.5	7.3
4	1974-01-01	2777.3	1001.9	1520.9	59.6	321.1	7.7

Another Bokeh-based library is Holoviews. Its uniqueness stems from the way it handles DataFrames with multiple columns, and the way you add plots to each other. It’s very suitable for Jupyter notebook based plots:

import holoviews as hv
hv.extension("bokeh")

df_energy.head()

	Year	Oil	Gas	Coal	Nuclear Energy	Hydroelectricity	Other Renewable
0	1970-01-01	2291.5	826.7	1467.3	17.7	265.8	5.8
1	1971-01-01	2427.7	884.8	1459.2	24.9	276.4	6.3
2	1972-01-01	2613.9	933.7	1475.7	34.1	288.9	6.8
3	1973-01-01	2818.1	978.0	1519.6	45.9	292.5	7.3
4	1974-01-01	2777.3	1001.9	1520.9	59.6	321.1	7.7

scatter = hv.Scatter(df_energy, 'Oil', 'Gas')
scatter

scatter + hv.Curve(df_energy, 'Oil', 'Hydroelectricity')

def get_year_coal(df, year) -> pd.Series:
    # Boolean-mask the rows of the requested year and return their "Coal" values
    return df.loc[df["Year"] == year, "Coal"]

items = {year: hv.Bars(get_year_coal(df_energy, year)) for year in df_energy["Year"]}
                       
hv.HoloMap(items, kdims=['Year'])

Holoviews really needs an entire class (or two) to go over its concepts, but once you get them you can create complicated visualizations which include a strong interactive component in a few lines of code.

Seaborn#

A library which has really become a shining example of quick, efficient and clear plotting in the post-pandas era is seaborn. It combines many of the features of the previous libraries into a very concise API. Unlike a few of the previous libraries, however, it doesn’t use bokeh as its backend, but matplotlib, which means that the interactivity of the resulting plots isn’t as good. Be that as it may, it’s still a widely used library, and for good reasons.

In order to use seaborn to its full extent (and really all of the above libraries) we have to take a short detour and understand how to transform our data into a long-form format.

Long form (“tidy”) data#

Tidy data was first defined in the R language (its “tidyverse” subset) as the preferred format for analysis and visualization. If you assume that the data you’re about to visualize is always in such a format, you can design plotting libraries that use these assumptions to cut the number of lines of code you have to write in order to see the final art. Tidy data migrated to the Pythonic data science ecosystem, and nowadays it’s the preferred data format in the pandas ecosystem as well. The way to construct a “tidy” table is to follow three simple rules:

Each variable forms a column.
Each observation forms a row.
Each type of observational unit forms a table.

In the paper defining tidy data, the following example is given. Assume we have the following data table:

name	treatment a	treatment b
John Smith	-	20.1
Jane Doe	15.1	13.2
Mary Johnson	22.8	27.5

Is this the “tidy” form? What are the variables and observations here? Well, we could’ve written this table in a different (‘transposed’) format:

treatment type	John Smith	Jane Doe	Mary Johnson
treat. a	-	15.1	22.8
treat. b	20.1	13.2	27.5

Is this “long form”?

In both cases, the answer is no. We have to move each observation into its own row, and in the above two tables two (or more) observations were placed in the same row. For example, both observations concerning Mary Johnson (the measured values of treatments a and b) were located in the same row, which violates rule #2 of the “tidy” data rules. This is what the tidy version of the above tables looks like:

name	treatment	measurement
John Smith	a	-
Jane Doe	a	15.1
Mary Johnson	a	22.8
John Smith	b	20.1
Jane Doe	b	13.2
Mary Johnson	b	27.5

Now each measurement has a single row, and the treatment column became an “index” of some sort. The only shortcoming of this approach is that we now have more cells in the table: the first version had 9 data cells, while this one has 18. This is quite a jump, but if we’re smart about our data types (e.g. categorical data types), the increase in memory usage won’t be too severe.
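In pandas, the reshaping from the first table to the tidy one can be sketched with melt(). This is a minimal illustration: the missing value shown as "-" is represented as NaN here, which is an assumption about how it would be encoded, and the variable names `wide` and `tidy` are illustrative.

```python
import numpy as np
import pandas as pd

# The wide table from the example; "-" is assumed to mean a missing value
wide = pd.DataFrame({
    "name": ["John Smith", "Jane Doe", "Mary Johnson"],
    "treatment a": [np.nan, 15.1, 22.8],
    "treatment b": [20.1, 13.2, 27.5],
})

# Turn the two treatment columns into (treatment, measurement) pairs
tidy = wide.melt(id_vars="name", var_name="treatment", value_name="measurement")

# Keep only the treatment letter and store it as a memory-friendly categorical
tidy["treatment"] = (
    tidy["treatment"].str.replace("treatment ", "", regex=False).astype("category")
)
print(tidy)
print(tidy.shape)
```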

As I wrote in the previous class, pandas has methods to transform data into its long form. You’ll usually need to use df.stack() or df.melt() to make it tidy. Let’s try to make our own data tidy:

df = pd.read_csv("pew_raw.csv")
df

	religion	10000	20000	30000	40000	50000	75000
0	Agnostic	27	34	60	81	76	137
1	Atheist	12	27	37	52	35	70
2	Buddhist	27	21	30	34	33	58
3	Catholic	418	617	732	670	638	1116
4	Dont know/refused	15	14	15	11	10	35
5	Evangelical Prot	575	869	1064	982	881	1486
6	Hindu	1	9	7	9	11	34
7	Historically Black Prot	228	244	236	238	197	223
8	Jehovahs Witness	20	27	24	24	21	30
9	Jewish	19	19	25	25	30	95

This is a table from the Pew Research Center on the relation between income (in USD) and religion. This dataset is not in a tidy format, since the column headers contain information about specific observations (measurements). For example, the 27 agnostic individuals in the lowest income bracket (the 10000 column) represent one measurement, the 34 in the next bracket (the 20000 column) represent another one, and so on.

To make it tidy we’ll use melt():

tidy_df = (pd.melt(df,
                  id_vars=["religion"],
                  var_name="income",
                  value_name="freq")
           .sort_values(by="religion")
           .reset_index(drop=True)
           .astype({'income': 'category', 'religion': 'category'}))

tidy_df

	religion	income	freq
0	Agnostic	10000	27
1	Agnostic	40000	81
2	Agnostic	50000	76
3	Agnostic	75000	137
4	Agnostic	20000	34
5	Agnostic	30000	60
6	Atheist	50000	35
7	Atheist	30000	37
8	Atheist	20000	27
9	Atheist	40000	52
10	Atheist	10000	12
11	Atheist	75000	70
12	Buddhist	50000	33
13	Buddhist	10000	27
14	Buddhist	20000	21
15	Buddhist	40000	34
16	Buddhist	75000	58
17	Buddhist	30000	30
18	Catholic	50000	638
19	Catholic	40000	670
20	Catholic	30000	732
21	Catholic	75000	1116
22	Catholic	20000	617
23	Catholic	10000	418
24	Dont know/refused	30000	15
25	Dont know/refused	50000	10
26	Dont know/refused	10000	15
27	Dont know/refused	75000	35
28	Dont know/refused	20000	14
29	Dont know/refused	40000	11
30	Evangelical Prot	30000	1064
31	Evangelical Prot	75000	1486
32	Evangelical Prot	20000	869
33	Evangelical Prot	10000	575
34	Evangelical Prot	50000	881
35	Evangelical Prot	40000	982
36	Hindu	75000	34
37	Hindu	30000	7
38	Hindu	50000	11
39	Hindu	20000	9
40	Hindu	40000	9
41	Hindu	10000	1
42	Historically Black Prot	50000	197
43	Historically Black Prot	40000	238
44	Historically Black Prot	75000	223
45	Historically Black Prot	30000	236
46	Historically Black Prot	20000	244
47	Historically Black Prot	10000	228
48	Jehovahs Witness	10000	20
49	Jehovahs Witness	40000	24
50	Jehovahs Witness	75000	30
51	Jehovahs Witness	50000	21
52	Jehovahs Witness	30000	24
53	Jehovahs Witness	20000	27
54	Jewish	30000	25
55	Jewish	40000	25
56	Jewish	20000	19
57	Jewish	10000	19
58	Jewish	50000	30
59	Jewish	75000	95

The first argument to melt is the dataframe itself. id_vars is the name of the column (or columns) that will be used as the “identifier variable”, i.e. it will be repeated as necessary to serve as an “index” of sorts. var_name is the name of the new column made from the old column headers, and value_name is the name of the column that contains the actual values from the cells of the original table.

After the melting I sorted the dataframe to make it look prettier (all agnostics in a row, etc.) and threw away the old and irrelevant index. Finally, I converted the “religion” and “income” columns to a categorical data type, which saves memory and better conveys their true meaning.
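For comparison, here is a rough sketch of the df.stack() route to the same long form. It uses a hypothetical three-row, two-bracket excerpt of the Pew table (the variable names `wide` and `stacked` are made up for the example), so it illustrates the idea rather than replacing the melt() call above.

```python
import pandas as pd

# Hypothetical excerpt of the Pew table: three religions, two income brackets
wide = pd.DataFrame({
    "religion": ["Agnostic", "Atheist", "Buddhist"],
    "10000": [27, 12, 27],
    "20000": [34, 27, 21],
})

stacked = (
    wide.set_index("religion")          # identifier column becomes the index
        .rename_axis(columns="income")  # name the column axis before stacking
        .stack()                        # column labels move into an inner index level
        .rename("freq")                 # name the resulting Series of values
        .reset_index()                  # back to a flat dataframe
)
print(stacked)
```

With stack() the identifier columns have to be moved into the index first, whereas melt() takes them directly through id_vars. Depending on the pandas version, stack() may also emit a FutureWarning about its upcoming implementation.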

Now, once we have this long form data, we can put seaborn to the test.

import seaborn as sns

income_barplot = sns.barplot(data=tidy_df, x='income', y='freq', hue='religion')

../../_images/72758670a70a04a8da9d71fac38c210e8c5fb279d4312dacb95cf0d908472e39.png

To fix the legend location:

income_barplot = sns.barplot(data=tidy_df, x='income', y='freq', hue='religion')
_ = income_barplot.legend(bbox_to_anchor=(1, 1))

../../_images/34f20d363a658b71bc86fcc94846ccbdd4aef321a5b8794f6e004174f1a44b78.png
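Recent seaborn versions (0.11.2 and later, as far as I know) also provide sns.move_legend() for repositioning an existing legend. The sketch below uses a hypothetical miniature stand-in for tidy_df, named `demo_df`, so that it runs on its own.

```python
import pandas as pd
import seaborn as sns

# Hypothetical miniature stand-in for tidy_df
demo_df = pd.DataFrame({
    "religion": ["Agnostic", "Agnostic", "Atheist", "Atheist"],
    "income": ["10000", "20000", "10000", "20000"],
    "freq": [27, 34, 12, 27],
})

ax = sns.barplot(data=demo_df, x="income", y="freq", hue="religion")
# Pin the legend's upper-left corner to the upper-right corner of the axes
sns.move_legend(ax, "upper left", bbox_to_anchor=(1, 1))
```

Unlike calling ax.legend() again, move_legend() works on the legend seaborn already created, so the entries and title are carried over.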

Each seaborn visualization function has a “data” keyword to which you pass your dataframe, and a few others (such as x, y and hue) with which you specify how the columns relate to one another. Look how simple it was to produce this beautiful bar chart.

income_catplot = sns.catplot(data=tidy_df, x="religion", y="freq", hue="income", aspect=2)
_ = plt.xticks(rotation=45)

../../_images/d9e625808e1291801584f6a9d185a1a9f31e16ad00a4c7f11cfc82e4a7dfc833.png

Seaborn also takes care of faceting the data for us:

_ = sns.catplot(data=tidy_df, x="religion", y="freq", hue="income", col="income")

../../_images/967ae33dcac9d5124414c2ae82ba1b90f925f8ccf60df4b716c9000403247e90.png
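When there are many facets, the col_wrap argument wraps them onto several rows, and the returned FacetGrid object can relabel all panels at once. A small sketch, again on a hypothetical stand-in dataframe (`demo_df`) rather than the full tidy_df:

```python
import pandas as pd
import seaborn as sns

# Hypothetical miniature stand-in for tidy_df: two religions, three brackets
demo_df = pd.DataFrame({
    "religion": ["Agnostic", "Atheist"] * 3,
    "income": ["10000", "10000", "20000", "20000", "30000", "30000"],
    "freq": [27, 12, 34, 27, 60, 37],
})

# One panel per income bracket, at most two panels per row
grid = sns.catplot(data=demo_df, x="religion", y="freq",
                   col="income", col_wrap=2, height=3)
grid.set_titles("income: {col_name}")          # per-panel title template
grid.set_axis_labels("religion", "frequency")  # shared axis labels
```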

Figure is a bit small? We can use matplotlib to change it:

_, ax = plt.subplots(figsize=(25, 8))
_ = sns.stripplot(data=tidy_df, x="religion", y="freq", hue="income", ax=ax)

../../_images/7ce082582646a0cb6cf6eba855005ced4a93b978d0a6b9ed9da61b9c2a4fadb3.png
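The ax= route works for axes-level functions such as stripplot. Figure-level functions such as catplot create their own figure, so their size is usually controlled through the height and aspect arguments instead. A minimal sketch, assuming a hypothetical stand-in dataframe named `demo_df`:

```python
import pandas as pd
import seaborn as sns

# Hypothetical miniature stand-in for tidy_df
demo_df = pd.DataFrame({
    "religion": ["Agnostic", "Atheist", "Buddhist"],
    "freq": [27, 12, 27],
})

# height is the facet height in inches; aspect is the width-to-height ratio
grid = sns.catplot(data=demo_df, x="religion", y="freq", height=4, aspect=2)
width, height = grid.figure.get_size_inches()
print(width, height)  # the width is expected to be height * aspect
```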

Simpler data can also be visualized, no need for categorical variables:

simple_df = pd.DataFrame(np.random.random((1000, 4)), columns=list('abcd'))
simple_df

	a	b	c	d
0	0.582505	0.219027	0.940992	0.852302
1	0.183157	0.176459	0.023548	0.884956
2	0.459231	0.799980	0.016649	0.080180
3	0.657102	0.861479	0.006403	0.621801
4	0.826114	0.932863	0.370964	0.510185
...	...	...	...	...
995	0.434883	0.578981	0.960100	0.451182
996	0.935176	0.325224	0.584419	0.913218
997	0.698224	0.988464	0.078299	0.723948
998	0.092100	0.969418	0.520707	0.997747
999	0.894900	0.910261	0.799504	0.898984

1000 rows × 4 columns

_ = sns.jointplot(data=simple_df, x='a', y='b', kind='kde')

../../_images/7d8c12e8ce42884ecdf9915d6389f8221461c1c2dacd2336d40de919f58c2cc8.png

And complex relations can also be visualized:

_ = sns.pairplot(data=simple_df)

../../_images/4175136ef6e1d7268949a314173987762f6f33accba12fa19c94c15712c32331.png

Seaborn should probably be your go-to choice when all you need is a 2D graph.
