Class 7b: Tidy format, Visualizations, xarray and Modeling

Class 7b: Tidy format, Visualizations, `xarray` and Modeling#

Long form (“tidy”) data#

Tidy data was first defined in the R language (its “tidyverse” subset) as the preferred format for analysis and visualization. If you assume that the data you’re about to visualize is always in such a format, you can design plotting libraries that use these assumptions to cut the number of lines of code you have to write in order to see the final art. Tidy data migrated to the Pythonic data science ecosystem, and nowadays it’s the preferred data format in the pandas ecosystem as well. The way to construct a “tidy” table is to follow three simple rules:

Each variable forms a column.
Each observation forms a row.
Each type of observational unit forms a table.

The paper defining tidy data gives the following example. Assume we have this data table:

name	treatment a	treatment b
John Smith	-	20.1
Jane Doe	15.1	13.2
Mary Johnson	22.8	27.5

Is this the “tidy” form? What are the variables and observations here? Well, we could’ve written this table in a different (‘transposed’) format:

treatment type	John Smith	Jane Doe	Mary Johnson
treat. a	-	15.1	22.8
treat. b	20.1	13.2	27.5

Is this “long form”?

In both cases, the answer is no. We have to move each observation into its own row, and in the above two tables two (or more) observations were placed in the same row. For example, both observations concerning Mary Johnson (the measured values of treatments a and b) were located in the same row, which violates rule #2 of the “tidy” data rules. This is what the tidy version of the above tables looks like:

name	treatment	measurement
John Smith	a	-
Jane Doe	a	15.1
Mary Johnson	a	22.8
John Smith	b	20.1
Jane Doe	b	13.2
Mary Johnson	b	27.5

Now each measurement has a single row, and the treatment column became an “index” of sorts. The only shortcoming of this approach is that we now have more cells in the table: the first version had 9 and the transposed one had 8, while this one has 18. This is quite a jump, but if we’re smart about our data types (e.g. by using categorical data types) the increase in memory usage won’t be too severe.
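As a minimal sketch of this reshaping in pandas, the wide table from the example can be rebuilt by hand and melted into the tidy layout. The variable names (`wide`, `tidy`) are my own, and storing the missing value for John Smith as NaN is an assumption about how the dash would be loaded.

```python
import numpy as np
import pandas as pd

# The wide table from the example; "-" is stored as NaN (an assumption)
wide = pd.DataFrame(
    {
        "name": ["John Smith", "Jane Doe", "Mary Johnson"],
        "a": [np.nan, 15.1, 22.8],
        "b": [20.1, 13.2, 27.5],
    }
)

# One row per (name, treatment) pair
tidy = wide.melt(id_vars="name", var_name="treatment", value_name="measurement")

print(tidy)
print(wide.size, "cells before,", tidy.size, "cells after")
```

The cell counts printed at the end correspond to the 9 and 18 cells discussed above.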

As I wrote in the previous class, pandas has methods to transform data into its long form. You’ll usually need to use df.stack() or df.melt() to make it tidy. Let’s try to make our own data tidy:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

df = pd.read_csv("pew_raw.csv")
df

	religion	10000	20000	30000	40000	50000	75000
0	Agnostic	27	34	60	81	76	137
1	Atheist	12	27	37	52	35	70
2	Buddhist	27	21	30	34	33	58
3	Catholic	418	617	732	670	638	1116
4	Dont know/refused	15	14	15	11	10	35
5	Evangelical Prot	575	869	1064	982	881	1486
6	Hindu	1	9	7	9	11	34
7	Historically Black Prot	228	244	236	238	197	223
8	Jehovahs Witness	20	27	24	24	21	30
9	Jewish	19	19	25	25	30	95

This is a table from the Pew Research Center on the relations between income (in USD) and religion. This dataset is not in a tidy format since the column headers contain information about specific observations (measurements). For example, the 27 agnostic individuals who earned less than $10k represent one measurement, the 34 who earned $10k-20k represent another, and so on.

To make it tidy we’ll use melt():

tidy_df = (
    pd.melt(df, id_vars=["religion"], var_name="income", value_name="freq")
    .sort_values(by="religion")
    .reset_index(drop=True)
    .astype({"income": "category", "religion": "category"})
)

tidy_df

	religion	income	freq
0	Agnostic	10000	27
1	Agnostic	40000	81
2	Agnostic	50000	76
3	Agnostic	75000	137
4	Agnostic	20000	34
5	Agnostic	30000	60
6	Atheist	50000	35
7	Atheist	30000	37
8	Atheist	20000	27
9	Atheist	40000	52
10	Atheist	10000	12
11	Atheist	75000	70
12	Buddhist	50000	33
13	Buddhist	10000	27
14	Buddhist	20000	21
15	Buddhist	40000	34
16	Buddhist	75000	58
17	Buddhist	30000	30
18	Catholic	50000	638
19	Catholic	40000	670
20	Catholic	30000	732
21	Catholic	75000	1116
22	Catholic	20000	617
23	Catholic	10000	418
24	Dont know/refused	30000	15
25	Dont know/refused	50000	10
26	Dont know/refused	10000	15
27	Dont know/refused	75000	35
28	Dont know/refused	20000	14
29	Dont know/refused	40000	11
30	Evangelical Prot	30000	1064
31	Evangelical Prot	75000	1486
32	Evangelical Prot	20000	869
33	Evangelical Prot	10000	575
34	Evangelical Prot	50000	881
35	Evangelical Prot	40000	982
36	Hindu	75000	34
37	Hindu	30000	7
38	Hindu	50000	11
39	Hindu	20000	9
40	Hindu	40000	9
41	Hindu	10000	1
42	Historically Black Prot	50000	197
43	Historically Black Prot	40000	238
44	Historically Black Prot	75000	223
45	Historically Black Prot	30000	236
46	Historically Black Prot	20000	244
47	Historically Black Prot	10000	228
48	Jehovahs Witness	10000	20
49	Jehovahs Witness	40000	24
50	Jehovahs Witness	75000	30
51	Jehovahs Witness	50000	21
52	Jehovahs Witness	30000	24
53	Jehovahs Witness	20000	27
54	Jewish	30000	25
55	Jewish	40000	25
56	Jewish	20000	19
57	Jewish	10000	19
58	Jewish	50000	30
59	Jewish	75000	95

The id_vars argument of melt names the column(s) that will be used as the “identifier variable”, i.e. that will be repeated as necessary to serve as an “index” of sorts (the first positional argument is the dataframe itself). var_name is the name of the new column we made from the old column headers, and value_name is the name of the column that contains the actual values from the old cells.
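For comparison, the same long form can usually be reached with stack(), the other method mentioned earlier. The sketch below uses a two-row excerpt of the table, typed in by hand, and the names `wide`, `melted` and `stacked` are illustrative only.

```python
import pandas as pd

# Two rows and two income columns copied from the table above
wide = pd.DataFrame(
    {"religion": ["Agnostic", "Atheist"], "10000": [27, 12], "20000": [34, 27]}
)

melted = pd.melt(wide, id_vars=["religion"], var_name="income", value_name="freq")

# stack() moves the column labels into an inner index level
stacked = (
    wide.set_index("religion")
    .rename_axis(columns="income")
    .stack()
    .rename("freq")
    .reset_index()
)

# Same rows, different order: melt goes column by column, stack row by row
by = ["religion", "income"]
print(melted.sort_values(by).reset_index(drop=True))
print(stacked.sort_values(by).reset_index(drop=True))
```

Which of the two reads better is mostly a matter of taste; melt() tends to be more convenient when the identifier is a regular column rather than the index.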

After the melting I sorted the dataframe to make it look prettier (all agnostics in a row, etc.) and threw away the old and irrelevant index. Finally I converted the “religion” and “income” columns to a categorical data type, which saves memory and better conveys their true meaning.
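The memory claim is easy to check on a larger column. The following is a rough sketch with made-up data (three labels repeated many times); the exact byte counts depend on the pandas version and string backend, so none are quoted here.

```python
import pandas as pd

# 60,000 rows that repeat only three distinct labels (made-up data)
labels = pd.Series(["Agnostic", "Atheist", "Buddhist"] * 20_000)
as_category = labels.astype("category")

# deep=True also counts the memory held by the strings themselves
plain_bytes = labels.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(f"plain strings: {plain_bytes:,} bytes")
print(f"categorical:   {category_bytes:,} bytes")
```

A categorical column stores each distinct label once and keeps a small integer code per row, which is why the saving grows with the number of repeats.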

Data Visualization#

As mentioned previously, the visualization landscape in Python is rich, and is becoming richer by the day. Below, we’ll explore some of the options we have.

* We’ll assume that 2D data is accessed from a dataframe.

`matplotlib`#

The built-in df.plot() method is a simple wrapper around pyplot from matplotlib, and as we’ve seen before it works quite well for many types of plots, as long as we wish to keep them all overlaid in some way. Let’s look at examples taken straight from the visualization manual of pandas:

ts = pd.Series(np.random.randn(1000),
               index=pd.date_range('1/1/2000', periods=1000))
df = pd.DataFrame(np.random.randn(1000, 4),
                  index=ts.index,
                  columns=list('ABCD'))
df = df.cumsum()
df

	A	B	C	D
2000-01-01	0.816364	1.316394	-0.848722	1.807674
2000-01-02	0.326226	1.931632	-1.336688	1.654345
2000-01-03	-0.327805	2.062384	-1.899247	1.400437
2000-01-04	-0.244425	2.411731	-2.067694	0.547233
2000-01-05	0.160672	1.948629	-1.966393	0.412699
...	...	...	...	...
2002-09-22	41.788482	-3.598979	-43.564976	10.499952
2002-09-23	41.747076	-3.489523	-42.480919	7.461325
2002-09-24	41.457258	-2.594209	-41.980355	8.031147
2002-09-25	40.408441	-1.160594	-43.754431	7.342947
2002-09-26	40.917328	-1.514896	-42.349892	7.084579

1000 rows × 4 columns

_ = df.plot()


Nice to see we got a few things for “free”, like sane x-axis labels and the legend.
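If overlaying every column on one set of axes is not what we want, df.plot() also accepts subplots=True. Here is a small sketch using freshly generated random data (`demo` is a stand-in name, not the `df` from above):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
demo = pd.DataFrame(
    rng.standard_normal((200, 4)).cumsum(axis=0),
    index=pd.date_range("2000-01-01", periods=200),
    columns=list("ABCD"),
)

# One panel per column, arranged on a 2x2 grid with a shared x-axis
axes = demo.plot(subplots=True, layout=(2, 2), sharex=True, figsize=(8, 5))

# Each element is a regular matplotlib Axes and can be customized further
for ax in np.asarray(axes).ravel():
    ax.set_ylabel("cumulative sum")
```

Because the result is made of plain matplotlib Axes objects, anything matplotlib offers (titles, limits, annotations) remains available after the pandas call.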

We can tell pandas which column corresponds to x, and which to y:

_ = df.plot(x='A', y='B')


There are, of course, many possible types of plots that can be directly called from the pandas interface:

_ = df.iloc[:10, :].plot.bar()


_ = df.plot.hist(alpha=0.5)


Histogramming each column separately can be done by calling the hist() method directly:

_ = df.hist()


Lastly, a personal favorite:

_ = df.plot.hexbin(x='A', y='B', gridsize=25)


Altair#

Matplotlib (and pandas’ interface to it) is the gold standard in the Python ecosystem, but there are other ecosystems as well. For example, Vega-Lite is a well-known plotting library for the web and JavaScript, and it uses a different grammar to define its plots. If you’re familiar with it you’ll be delighted to hear that Python’s altair provides bindings to it. Even if you’ve never heard of it, it’s always nice to see that there are many other ways to tell a computer how to draw stuff on the screen. Let’s look at a couple of examples:

import altair as alt

chart = alt.Chart(df)
chart.mark_point().encode(x='A', y='B')

In Altair you first create a chart object (a simple Chart above), and then you ask it to mark_point(), or mark_line(), to add that type of visualization to the chart. Then we specify the axes and other parameters (like color) and map (or encode) them to their corresponding columns.

Let’s see how Altair works with other datatypes:

datetime_df = pd.DataFrame({'value': np.random.randn(100).cumsum()},
                           index=pd.date_range('2020', freq='D', periods=100))

datetime_df.head()

	value
2020-01-01	1.758362
2020-01-02	2.413572
2020-01-03	3.496669
2020-01-04	2.956879
2020-01-05	1.692051

chart = alt.Chart(datetime_df.reset_index())
chart.mark_line().encode(x='index:T', y='value:Q')

Above we plot the datetime data by telling Altair that the column named “index” is of type T, i.e. Time, while the column “value” is of type Q for quantitative.

One of the great things about these charts is that they can easily be made to be interactive:

from vega_datasets import data  # ready-made DFs for easy visualization examples

cars = data.cars
cars()

	Name	Miles_per_Gallon	Cylinders	Displacement	Horsepower	Weight_in_lbs	Acceleration	Year	Origin
0	chevrolet chevelle malibu	18.0	8	307.0	130.0	3504	12.0	1970-01-01	USA
1	buick skylark 320	15.0	8	350.0	165.0	3693	11.5	1970-01-01	USA
2	plymouth satellite	18.0	8	318.0	150.0	3436	11.0	1970-01-01	USA
3	amc rebel sst	16.0	8	304.0	150.0	3433	12.0	1970-01-01	USA
4	ford torino	17.0	8	302.0	140.0	3449	10.5	1970-01-01	USA
...	...	...	...	...	...	...	...	...	...
401	ford mustang gl	27.0	4	140.0	86.0	2790	15.6	1982-01-01	USA
402	vw pickup	44.0	4	97.0	52.0	2130	24.6	1982-01-01	Europe
403	dodge rampage	32.0	4	135.0	84.0	2295	11.6	1982-01-01	USA
404	ford ranger	28.0	4	120.0	79.0	2625	18.6	1982-01-01	USA
405	chevy s-10	31.0	4	119.0	82.0	2720	19.4	1982-01-01	USA

406 rows × 9 columns

cars_url = data.cars.url
cars_url  # The data is online and in json format which is standard practice for altair-based workflows

'https://cdn.jsdelivr.net/npm/vega-datasets@v1.29.0/data/cars.json'

alt.Chart(cars_url).mark_point().encode(
    x='Miles_per_Gallon:Q',
    y='Horsepower:Q',
    color='Origin:N',  # N for nominal, i.e. discrete and unordered (just like colors)
)

brush = alt.selection_interval()  # selection of type 'interval'

alt.Chart(cars_url).mark_point().encode(
    x='Miles_per_Gallon:Q',
    y='Horsepower:Q',
    color='Origin:N',  # N for nominal, i.e. discrete and unordered (just like colors)
).add_params(brush)  # Altair >= 5; older versions use add_selection(brush)

The selection looks good but doesn’t do anything. Let’s add functionality:

alt.Chart(cars_url).mark_point().encode(
    x='Miles_per_Gallon:Q',
    y='Horsepower:Q',
    color=alt.condition(brush, 'Origin:N', alt.value('lightgray'))
).add_params(  # Altair >= 5; older versions use add_selection
    brush
)

Altair has a ton more visualization types, some of which are more easily generated than others, and some are easier to generate using Altair rather than Matplotlib.

Bokeh, Holoviews and pandas-bokeh#

Bokeh is another visualization effort in the Python ecosystem, but this time it revolves around web-based plots. Bokeh can be used directly, but it also serves as a backend plotting device for more advanced plotting libraries, like Holoviews and pandas-bokeh. It’s also designed with huge datasets in mind, including ones that don’t fit in memory, which is something that other tools might have trouble visualizing.

import bokeh
from bokeh.io import output_notebook, show
from bokeh.plotting import figure as bkfig
output_notebook()


bokeh_figure = bkfig(width=400, height=400)
x = [1, 2, 3, 4, 5]
y = [6, 7, 2, 4, 5]
bokeh_figure.scatter(x,
                    y,
                    size=15,
                    line_color="navy",
                    fill_color="orange",
                    fill_alpha=0.5)

show(bokeh_figure)

We see how bokeh immediately outputs an interactive graph. By default this is an HTML document that opens in your browser, but a couple of cells above we kindly asked bokeh to output its plots to the notebook instead. Bokeh can be used for many other types of plots, like:

datetime_df = datetime_df.reset_index()
datetime_df

	index	value
0	2020-01-01	1.758362
1	2020-01-02	2.413572
2	2020-01-03	3.496669
3	2020-01-04	2.956879
4	2020-01-05	1.692051
...	...	...
95	2020-04-05	0.325621
96	2020-04-06	-0.614649
97	2020-04-07	0.228853
98	2020-04-08	-0.796353
99	2020-04-09	-1.327180

100 rows × 2 columns

bokeh_figure_2 = bkfig(x_axis_type="datetime",
                       title="Value over Time",
                       height=350,
                       width=800)

bokeh_figure_2.xgrid.grid_line_color = None
bokeh_figure_2.ygrid.grid_line_alpha = 0.5
bokeh_figure_2.xaxis.axis_label = 'Time'
bokeh_figure_2.yaxis.axis_label = 'Value'

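# After reset_index() the dates live in the "index" column; datetime_df.index is now a RangeIndex
bokeh_figure_2.line(datetime_df["index"], datetime_df["value"])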
show(bokeh_figure_2)
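One pandas detail is worth spelling out here: after reset_index(), the dates live in a column that happens to be called "index", but attribute access (`frame.index`) still returns the row index. A short sketch with a made-up frame (`frame` is an illustrative name):

```python
import pandas as pd

frame = pd.DataFrame(
    {"value": [0.1, 0.2, 0.3]},
    index=pd.date_range("2020-01-01", periods=3, freq="D"),
).reset_index()

# Attribute access returns the row labels (a RangeIndex), not the column
print(type(frame.index).__name__)

# Bracket access returns the column that holds the dates
print(frame["index"].dtype)
```

When passing x-values to a plotting call on a datetime axis, the bracket form is therefore the one that carries the dates.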

Let’s look at energy consumption, split by source (from the Pandas-Bokeh manual):

url = "https://raw.githubusercontent.com/PatrikHlobil/Pandas-Bokeh/master/docs/Testdata/energy/energy.csv"
df_energy = pd.read_csv(url, parse_dates=["Year"])
df_energy.head()

	Year	Oil	Gas	Coal	Nuclear Energy	Hydroelectricity	Other Renewable
0	1970-01-01	2291.5	826.7	1467.3	17.7	265.8	5.8
1	1971-01-01	2427.7	884.8	1459.2	24.9	276.4	6.3
2	1972-01-01	2613.9	933.7	1475.7	34.1	288.9	6.8
3	1973-01-01	2818.1	978.0	1519.6	45.9	292.5	7.3
4	1974-01-01	2777.3	1001.9	1520.9	59.6	321.1	7.7

Another Bokeh-based library is Holoviews. Its uniqueness stems from the way it handles DataFrames with multiple columns, and the way you add plots to each other. It’s very suitable for Jupyter notebook based plots:

import holoviews as hv
hv.extension("bokeh")

df_energy.head()

	Year	Oil	Gas	Coal	Nuclear Energy	Hydroelectricity	Other Renewable
0	1970-01-01	2291.5	826.7	1467.3	17.7	265.8	5.8
1	1971-01-01	2427.7	884.8	1459.2	24.9	276.4	6.3
2	1972-01-01	2613.9	933.7	1475.7	34.1	288.9	6.8
3	1973-01-01	2818.1	978.0	1519.6	45.9	292.5	7.3
4	1974-01-01	2777.3	1001.9	1520.9	59.6	321.1	7.7

scatter = hv.Scatter(df_energy, 'Oil', 'Gas')
scatter

scatter + hv.Curve(df_energy, 'Oil', 'Hydroelectricity')

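def get_year_coal(df, year) -> pd.Series:
    """Return the "Coal" values of the rows whose "Year" equals the given year."""
    return df.loc[df["Year"] == year, "Coal"]

# One bar chart per year, keyed by the year itself
items = {year: hv.Bars(get_year_coal(df_energy, year)) for year in df_energy["Year"]}

hv.HoloMap(items, kdims=['Year'])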

Holoviews really needs an entire class (or two) to go over its concepts, but once you get them you can create complicated visualizations which include a strong interactive component in a few lines of code.

Seaborn#

A library which has really become a shining example of quick, efficient and clear plotting in the post-pandas era is seaborn. It combines many of the features of the previous libraries into a very concise API. Unlike a few of the previous libraries, however, it doesn’t use bokeh as its backend, but matplotlib, which means that the interactivity of the resulting plots isn’t as good. Be that as it may, it’s still a widely used library, and for good reasons.

In order to use seaborn to its full extent (and really all of the above libraries) we have to transform our data into a long-form format.
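As a reminder of what that transformation looks like, here is a sketch that melts the first rows of the energy table shown earlier. The values are typed in by hand from that output (three of the sources only), and the column names `source` and `consumption` are my own choices.

```python
import pandas as pd

# First three rows of the energy table, three of its columns
energy_wide = pd.DataFrame(
    {
        "Year": pd.to_datetime(["1970-01-01", "1971-01-01", "1972-01-01"]),
        "Oil": [2291.5, 2427.7, 2613.9],
        "Gas": [826.7, 884.8, 933.7],
        "Coal": [1467.3, 1459.2, 1475.7],
    }
)

# One row per (Year, source) pair; the source labels become a categorical column
energy_tidy = energy_wide.melt(
    id_vars="Year", var_name="source", value_name="consumption"
).astype({"source": "category"})

print(energy_tidy)
```

A frame in this shape can then be handed to a plotting function with, for instance, Year on the x-axis, consumption on the y-axis and source as the color.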

Once we have this long form data, we can put seaborn to the test.

tidy_df.sample(10)

	religion	income	freq
2	Agnostic	50000	76
11	Atheist	75000	70
53	Jehovahs Witness	20000	27
45	Historically Black Prot	30000	236
43	Historically Black Prot	40000	238
0	Agnostic	10000	27
58	Jewish	50000	30
34	Evangelical Prot	50000	881
32	Evangelical Prot	20000	869
4	Agnostic	20000	34

import seaborn as sns

income_barplot = sns.barplot(data=tidy_df, x='income', y='freq', hue='religion')


To fix the legend location:

income_barplot = sns.barplot(data=tidy_df, x='income', y='freq', hue='religion')
_ = income_barplot.legend(bbox_to_anchor=(1, 1))


Each seaborn visualization function has a “data” keyword to which you pass your dataframe, and then a few others with which you specify the relations of the columns to one another. Look how simple it was to receive this beautiful bar chart.
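To make the mapping explicit, here is a sketch on a four-row excerpt of the tidy table (typed in by hand; `mini_df` is an illustrative name). The axis labels are taken from the column names passed as x and y.

```python
import pandas as pd
import seaborn as sns

mini_df = pd.DataFrame(
    {
        "religion": ["Agnostic", "Agnostic", "Atheist", "Atheist"],
        "income": ["10000", "20000", "10000", "20000"],
        "freq": [27, 34, 12, 27],
    }
)

# data= is the frame; x, y and hue name the columns that play each role
ax = sns.barplot(data=mini_df, x="income", y="freq", hue="religion")

print(ax.get_xlabel(), ax.get_ylabel())
```

Swapping the roles (for example x="religion" and hue="income") only requires changing the keyword values; the dataframe itself stays untouched.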

income_catplot = sns.catplot(data=tidy_df, x="religion", y="freq", hue="income", aspect=2)
_ = plt.xticks(rotation=45)


Seaborn also takes care of faceting the data for us:

_ = sns.catplot(data=tidy_df, x="religion", y="freq", hue="income", col="income")


Figure is a bit small? We can use matplotlib to change it:

_, ax = plt.subplots(figsize=(25, 8))
_ = sns.stripplot(data=tidy_df, x="religion", y="freq", hue="income", ax=ax)


Simpler data can also be visualized, no need for categorical variables:

simple_df = pd.DataFrame(np.random.random((1000, 4)), columns=list('abcd'))
simple_df

	a	b	c	d
0	0.483864	0.719078	0.560177	0.616056
1	0.913328	0.229166	0.368916	0.063013
2	0.124544	0.676949	0.292933	0.787386
3	0.316716	0.918850	0.179946	0.900219
4	0.629568	0.742966	0.726829	0.325813
...	...	...	...	...
995	0.452657	0.938954	0.789799	0.684253
996	0.439011	0.280001	0.416180	0.209283
997	0.891278	0.128377	0.715805	0.657397
998	0.268303	0.930716	0.394275	0.530545
999	0.980559	0.864520	0.495984	0.490965

1000 rows × 4 columns

_ = sns.jointplot(data=simple_df, x='a', y='b', kind='kde')


And complex relations can also be visualized:

_ = sns.pairplot(data=simple_df)


Seaborn should probably be your go-to choice when all you need is a 2D graph.
