Class 7a: More pandas#

More pandas!#

Working with String DataFrames#

Pandas’ Series instances with a dtype of object or string expose a str attribute that enables vectorized string operations. These can come in tremendously handy, particularly when cleaning the data and performing aggregations on manually submitted fields.
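
As a minimal sketch (with made-up values), every element-wise string method hangs off that str accessor:

import pandas as pd

# Each string method is applied element-wise; missing values propagate as <NA>
names = pd.Series([" alice ", "BOB", None], dtype="string")
names.str.strip().str.capitalize()  # 'Alice', 'Bob', <NA>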

Let’s imagine having the misfortune of reading some CSV data and finding the following headers:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

messy_strings = [
    'Id___name', 'AGE', ' DomHand ', np.nan, 'qid    score1', 'score2', 3,
    ' COLOR_ SHAPe   _size', 'origin_residence immigration'
]
s = pd.Series(messy_strings, dtype="string", name="Messy Strings")
s
0                       Id___name
1                             AGE
2                        DomHand 
3                            <NA>
4                   qid    score1
5                          score2
6                               3
7            COLOR_ SHAPe   _size
8    origin_residence immigration
Name: Messy Strings, dtype: string

To try and parse something more reasonable, we might first want to remove all unnecessary whitespace and underscores. One way to achieve that would be:

s_1 = s.str.strip().str.replace(r"[_\s]+", " ", regex=True).str.lower()
s_1
0                         id name
1                             age
2                         domhand
3                            <NA>
4                      qid score1
5                          score2
6                               3
7                color shape size
8    origin residence immigration
Name: Messy Strings, dtype: string

Let’s break this down:

  • strip() removed all whitespace from the beginning and end of each string.

  • We used a regular expression to replace every run of one or more (+) whitespace characters (\s) or underscores with a single space.

  • We converted all characters to lowercase.
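
To see each of these steps in isolation, we can apply them one at a time (a quick sketch using the same series):

stripped = s.str.strip()                                     # ' DomHand ' -> 'DomHand'
despaced = stripped.str.replace(r"[_\s]+", " ", regex=True)  # 'Id___name' -> 'Id name'
lowered = despaced.str.lower()                               # 'AGE' -> 'age'
lowered.equals(s_1)                                          # True - identical to the chained version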

Next, we’ll split() strings separated by whitespace and extract an array of the values:

s_2 = s_1.str.split(expand=True)
print(f"DataFrame:\n{s_2}")

s_3 = s_2.to_numpy().flatten()
print(f"\nArray:\n{s_3}")
DataFrame:
         0          1            2
0       id       name         <NA>
1      age       <NA>         <NA>
2  domhand       <NA>         <NA>
3     <NA>       <NA>         <NA>
4      qid     score1         <NA>
5   score2       <NA>         <NA>
6        3       <NA>         <NA>
7    color      shape         size
8   origin  residence  immigration

Array:
['id' 'name' <NA> 'age' <NA> <NA> 'domhand' <NA> <NA> <NA> <NA> <NA> 'qid'
 'score1' <NA> 'score2' <NA> <NA> '3' <NA> <NA> 'color' 'shape' 'size'
 'origin' 'residence' 'immigration']

Finally, we can get rid of the <NA> values:

column_names = s_3[~pd.isnull(s_3)]
column_names
array(['id', 'name', 'age', 'domhand', 'qid', 'score1', 'score2', '3',
       'color', 'shape', 'size', 'origin', 'residence', 'immigration'],
      dtype=object)
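
An equivalent route that stays entirely within pandas (a sketch, one option among several): split into lists, explode() each list into its own rows, and drop the missing values:

# split() without expand=True returns lists; explode() gives each item its own row
column_names_alt = s_1.str.split().explode().dropna().to_numpy()
column_names_alt  # same labels as above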

DataFrame String Operations Exercise

  • Generate a 1000x1-shaped pd.DataFrame filled with random 3-letter strings. Use the string module’s ascii_lowercase attribute and numpy’s random module.

Solution
import string
import numpy as np
import pandas as pd

letters = list(string.ascii_lowercase)
n_strings = 1000
string_length = 3 
string_generator = ("".join(np.random.choice(letters, string_length))
                    for _ in range(n_strings))
df = pd.DataFrame(string_generator, columns=["Letters"])
  • Add a column indicating whether each row’s string has a “z” as its 2nd character.

Solution
target_char = "z"
target_index = 1
# Check the 2nd character directly; str.find() would return 0 (not 1) for strings
# like "zza", where the target also appears earlier in the string
df["z!"] = df["Letters"].str[target_index] == target_char
  • Add a third column containing the uppercased and reversed versions of the original strings.

Solution
df["REVERSED"] = df["Letters"].str.upper().apply(lambda s: s[::-1])

Concatenation and Merging#

Similarly to NumPy arrays, Series and DataFrame objects can be concatenated as well. However, having indices can often make this operation somewhat less trivial.

ser1 = pd.Series(['a', 'b', 'c'], index=[1, 2, 3])
ser2 = pd.Series(['d', 'e', 'f'], index=[4, 5, 6])
pd.concat([ser1, ser2])  # row-wise (axis=0) by default
1    a
2    b
3    c
4    d
5    e
6    f
dtype: object

Let’s do the same with dataframes:

df1 = pd.DataFrame([['a', 'A'], ['b', 'B']], columns=['let', 'LET'], index=[0, 1])
df2 = pd.DataFrame([['c', 'C'], ['d', 'D']], columns=['let', 'LET'], index=[2, 3])
pd.concat([df1, df2])  # again, along the first axis
  let LET
0   a   A
1   b   B
2   c   C
3   d   D

This time, let’s complicate things a bit, and introduce different column names:

df1 = pd.DataFrame([['a', 'A'], ['b', 'B']], columns=['let1', 'LET1'], index=[0, 1])
df2 = pd.DataFrame([['c', 'C'], ['d', 'D']], columns=['let2', 'LET2'], index=[2, 3])
pd.concat([df1, df2])  # the column labels don't match, so pandas takes their union and fills the gaps with NaN
  let1 LET1 let2 LET2
0    a    A  NaN  NaN
1    b    B  NaN  NaN
2  NaN  NaN    c    C
3  NaN  NaN    d    D

In this particular case, with disjoint indices and disjoint columns, the same result can be achieved by concatenating along the columns:

pd.concat([df1, df2], axis=1)
  let1 LET1 let2 LET2
0    a    A  NaN  NaN
1    b    B  NaN  NaN
2  NaN  NaN    c    C
3  NaN  NaN    d    D

But what happens if we introduce overlapping indices?

df1 = pd.DataFrame([['a', 'A'], ['b', 'B']], columns=['let', 'LET'], index=[0, 1])
df2 = pd.DataFrame([['c', 'C'], ['d', 'D']], columns=['let', 'LET'], index=[0, 2])
pd.concat([df1, df2])
  let LET
0   a   A
1   b   B
0   c   C
2   d   D

Nothing, really! While not recommended in practice, pandas won’t judge you.
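
A quick sketch of why it is discouraged, though: once a label is duplicated, label-based selection no longer returns a single row.

combined = pd.concat([df1, df2])
combined.loc[0]  # returns a 2-row DataFrame rather than a single row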

If, however, we wish to keep the integrity of the indices, we can use the verify_integrity keyword:

df1 = pd.DataFrame([['a', 'A'], ['b', 'B']], columns=['let', 'LET'], index=[0, 1])
df2 = pd.DataFrame([['c', 'C'], ['d', 'D']], columns=['let', 'LET'], index=[0, 2])
pd.concat([df1, df2], verify_integrity=True)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[11], line 3
      1 df1 = pd.DataFrame([['a', 'A'], ['b', 'B']], columns=['let', 'LET'], index=[0, 1])
      2 df2 = pd.DataFrame([['c', 'C'], ['d', 'D']], columns=['let', 'LET'], index=[0, 2])
----> 3 pd.concat([df1, df2], verify_integrity=True)

File ~/Projects/courses/python_for_neuroscientists/textbook-public/venv/lib/python3.11/site-packages/pandas/core/reshape/concat.py:395, in concat(objs, axis, join, ignore_index, keys, levels, names, verify_integrity, sort, copy)
    380     copy = False
    382 op = _Concatenator(
    383     objs,
    384     axis=axis,
   (...)    392     sort=sort,
    393 )
--> 395 return op.get_result()

File ~/Projects/courses/python_for_neuroscientists/textbook-public/venv/lib/python3.11/site-packages/pandas/core/reshape/concat.py:671, in _Concatenator.get_result(self)
    669 for obj in self.objs:
    670     indexers = {}
--> 671     for ax, new_labels in enumerate(self.new_axes):
    672         # ::-1 to convert BlockManager ax to DataFrame ax
    673         if ax == self.bm_axis:
    674             # Suppress reindexing on concat axis
    675             continue

File properties.pyx:36, in pandas._libs.properties.CachedProperty.__get__()

File ~/Projects/courses/python_for_neuroscientists/textbook-public/venv/lib/python3.11/site-packages/pandas/core/reshape/concat.py:702, in _Concatenator.new_axes(self)
    699 @cache_readonly
    700 def new_axes(self) -> list[Index]:
    701     ndim = self._get_result_dim()
--> 702     return [
    703         self._get_concat_axis if i == self.bm_axis else self._get_comb_axis(i)
    704         for i in range(ndim)
    705     ]

File ~/Projects/courses/python_for_neuroscientists/textbook-public/venv/lib/python3.11/site-packages/pandas/core/reshape/concat.py:703, in <listcomp>(.0)
    699 @cache_readonly
    700 def new_axes(self) -> list[Index]:
    701     ndim = self._get_result_dim()
    702     return [
--> 703         self._get_concat_axis if i == self.bm_axis else self._get_comb_axis(i)
    704         for i in range(ndim)
    705     ]

File properties.pyx:36, in pandas._libs.properties.CachedProperty.__get__()

File ~/Projects/courses/python_for_neuroscientists/textbook-public/venv/lib/python3.11/site-packages/pandas/core/reshape/concat.py:766, in _Concatenator._get_concat_axis(self)
    761 else:
    762     concat_axis = _make_concat_multiindex(
    763         indexes, self.keys, self.levels, self.names
    764     )
--> 766 self._maybe_check_integrity(concat_axis)
    768 return concat_axis

File ~/Projects/courses/python_for_neuroscientists/textbook-public/venv/lib/python3.11/site-packages/pandas/core/reshape/concat.py:774, in _Concatenator._maybe_check_integrity(self, concat_index)
    772 if not concat_index.is_unique:
    773     overlap = concat_index[concat_index.duplicated()].unique()
--> 774     raise ValueError(f"Indexes have overlapping values: {overlap}")

ValueError: Indexes have overlapping values: Index([0], dtype='int64')

If we don’t care about the indices, we can just ignore them:

pd.concat([df1, df2], ignore_index=True)  # resets the index
  let LET
0   a   A
1   b   B
2   c   C
3   d   D

We can also create a new MultiIndex if that happens to make more sense:

pd.concat([df1, df2], keys=['df1', 'df2'])  # "remembers" the origin of the data, super useful!
      let LET
df1 0   a   A
    1   b   B
df2 0   c   C
    2   d   D

A common real-world example of concatenation is joining two datasets sampled at different times. For example, say that on day 1 we took measurements at 8:00, 10:00, 14:00 and 16:00, but on day 2 we were a bit dizzy and took them at 8:00, 10:00, 13:00 and 16:30. On top of that, on day 2 we recorded another parameter that we forgot to measure on day 1.

The default concatenation behavior of pandas keeps all the data. In database terms (SQL people rejoice!) it’s called an “outer join”:

# Prepare mock data
day_1_times = pd.to_datetime(['08:00', '10:00', '14:00', '16:00'],
                             format='%H:%M').time
day_2_times = pd.to_datetime(['08:00', '10:00', '13:00', '16:30'],
                             format='%H:%M').time

day_1_data = {
    "temperature": [36.6, 36.7, 37.0, 36.8],
    "humidity": [30., 31., 30.4, 30.4]
}
day_2_data = {
    "temperature": [35.9, 36.1, 36.5, 36.2],
    "humidity": [32.2, 34.2, 30.9, 32.6],
    "light": [200, 130, 240, 210]
}

df_1 = pd.DataFrame(day_1_data, index=day_1_times)
df_2 = pd.DataFrame(day_2_data, index=day_2_times)

df_1
          temperature  humidity
08:00:00         36.6      30.0
10:00:00         36.7      31.0
14:00:00         37.0      30.4
16:00:00         36.8      30.4

Note

Note how pd.to_datetime() returns a DatetimeIndex object which exposes a time property, allowing us to easily remove the “date” part of the returned “datetime”, considering it is not represented in our mock data.
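
A minimal sketch of what the note describes:

# Parsing times without a date yields a DatetimeIndex (the date part defaults to 1900-01-01)...
idx = pd.to_datetime(['08:00', '10:00'], format='%H:%M')
# ...and its .time property keeps only plain datetime.time objects
idx.time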

df_2
          temperature  humidity  light
08:00:00         35.9      32.2    200
10:00:00         36.1      34.2    130
13:00:00         36.5      30.9    240
16:30:00         36.2      32.6    210
# Outer join
pd.concat([df_1, df_2], join='outer')  # outer join is the default behavior  
          temperature  humidity  light
08:00:00         36.6      30.0    NaN
10:00:00         36.7      31.0    NaN
14:00:00         37.0      30.4    NaN
16:00:00         36.8      30.4    NaN
08:00:00         35.9      32.2  200.0
10:00:00         36.1      34.2  130.0
13:00:00         36.5      30.9  240.0
16:30:00         36.2      32.6  210.0

To keep only the intersection of the columns, i.e. the columns that are common to all datasets, we use an inner join:

# Inner join - the excess data column was dropped (index is still not unique)
pd.concat([df_1, df_2], join='inner')
          temperature  humidity
08:00:00         36.6      30.0
10:00:00         36.7      31.0
14:00:00         37.0      30.4
16:00:00         36.8      30.4
08:00:00         35.9      32.2
10:00:00         36.1      34.2
13:00:00         36.5      30.9
16:30:00         36.2      32.6

One can also restrict the result to an explicit set of columns, for example by reindexing the concatenated dataframe, as sketched below. All in all, this basic functionality is easy to understand and allows for high flexibility.
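
A sketch of that column selection, using reindex() on the concatenated result (one option among several):

# Keep only an explicit subset of columns after concatenating
pd.concat([df_1, df_2]).reindex(columns=["temperature", "light"])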

Finally, concatenating along the columns aligns on the index (which therefore has to be unique), so an inner join keeps only the timestamps that appear in both days:

pd.concat([df_1, df_2], join='inner', axis='columns')
          temperature  humidity  temperature  humidity  light
08:00:00         36.6      30.0         35.9      32.2    200
10:00:00         36.7      31.0         36.1      34.2    130

This doesn’t look so good. The columns are a mess and we’re barely left with any data.

Our best option using pd.concat() might be something like:

df_concat = pd.concat([df_1, df_2], keys=["Day 1", "Day 2"])
df_concat
                temperature  humidity  light
Day 1 08:00:00         36.6      30.0    NaN
      10:00:00         36.7      31.0    NaN
      14:00:00         37.0      30.4    NaN
      16:00:00         36.8      30.4    NaN
Day 2 08:00:00         35.9      32.2  200.0
      10:00:00         36.1      34.2  130.0
      13:00:00         36.5      30.9  240.0
      16:30:00         36.2      32.6  210.0

Or maybe an unstacked version:

df_concat.unstack(0)
         temperature        humidity         light       
               Day 1  Day 2    Day 1  Day 2  Day 1  Day 2
08:00:00        36.6   35.9     30.0   32.2    NaN  200.0
10:00:00        36.7   36.1     31.0   34.2    NaN  130.0
13:00:00         NaN   36.5      NaN   30.9    NaN  240.0
14:00:00        37.0    NaN     30.4    NaN    NaN    NaN
16:00:00        36.8    NaN     30.4    NaN    NaN    NaN
16:30:00         NaN   36.2      NaN   32.6    NaN  210.0

We could also use pd.merge():

pd.merge(df_1,
         df_2,
         how="outer",           # Keep all indices (rather than just the intersection)
         left_index=True,       # Use left index
         right_index=True,      # Use right index
         suffixes=("_1", "_2")) # Suffixes to use for overlapping columns
          temperature_1  humidity_1  temperature_2  humidity_2  light
08:00:00           36.6        30.0           35.9        32.2  200.0
10:00:00           36.7        31.0           36.1        34.2  130.0
13:00:00            NaN         NaN           36.5        30.9  240.0
14:00:00           37.0        30.4            NaN         NaN    NaN
16:00:00           36.8        30.4            NaN         NaN    NaN
16:30:00            NaN         NaN           36.2        32.6  210.0

The dataframe’s merge() method also makes it easy to combine columns from a different dataframe that shares a key column:


data_1 = {"ID": [500, 501, 502, 503], "Blood Volume": [100, 102, 99, 101]}
data_2 = {"ID": [500, 501, 502, 503], "Monocytes": [20, 19, 25, 21]}

df_1 = pd.DataFrame(data_1)
df_2 = pd.DataFrame(data_2)
df_1
    ID  Blood Volume
0  500           100
1  501           102
2  502            99
3  503           101
df_1.merge(df_2)  # merge identified that the only "key" connecting the two tables is the "ID" column
    ID  Blood Volume  Monocytes
0  500           100         20
1  501           102         19
2  502            99         25
3  503           101         21

Database-like operations are a very broad topic with advanced implementations in pandas.
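
As a small taste, merge() also accepts explicit join keys and join types, much like SQL. The extra dataframe and its Weight column below are made up for illustration:

# A left join on an explicit key column; IDs without a match get NaN
extra = pd.DataFrame({"ID": [500, 501, 999], "Weight": [21.5, 22.1, 19.8]})
df_1.merge(extra, on="ID", how="left")  # 502 and 503 have no Weight -> NaN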

Concatenation and Merging Exercise

  • Create three dataframes with random values and shapes of (10, 2), (10, 1), (15, 3). Their index should be simple ordinal integers, and their column names should be different.

Solution
df_1 = pd.DataFrame(np.random.random((10, 2)), columns=['a', 'b'])
df_2 = pd.DataFrame(np.random.random((10, 1)), columns=['c'])
df_3 = pd.DataFrame(np.random.random((15, 3)), columns=['d', 'e', 'f'])
  • Concatenate these dataframes over the second axis using pd.concat().

Solution
pd.concat([df_1, df_2, df_3], axis=1)
  • Concatenate these dataframes over the second axis using pd.merge().

Solution
merge_kwargs = {"how": "outer", "left_index": True, "right_index": True}
pd.merge(pd.merge(df_1, df_2, **merge_kwargs), df_3, **merge_kwargs)

Grouping#

Yet another SQL-like feature that pandas possesses is the group-by operation, sometimes known as “split-apply-combine”.

# Mock data
subject = range(100, 200)
alive = np.random.choice([True, False], 100)
placebo = np.random.choice([True, False], 100)
measurement_1 = np.random.random(100)
measurement_2 = np.random.random(100)
data = {
    "Subject ID": subject,
    "Alive": alive,
    "Placebo": placebo,
    "Measurement 1": measurement_1,
    "Measurement 2": measurement_2
}
df = pd.DataFrame(data).set_index("Subject ID")
df
            Alive  Placebo  Measurement 1  Measurement 2
Subject ID                                              
100         False     True       0.693578       0.109215
101          True     True       0.536397       0.662925
102         False    False       0.224192       0.007915
103          True    False       0.541027       0.973593
104         False    False       0.497384       0.268210
...           ...      ...            ...            ...
195          True     True       0.001750       0.103425
196          True    False       0.339509       0.312900
197         False    False       0.825436       0.575736
198         False    False       0.340220       0.550900
199          True     True       0.202218       0.730814

100 rows × 4 columns

The most sensible thing to do is to group by either the “Alive” or the “Placebo” columns (or both). This is the “split” part.

grouped = df.groupby('Alive')
grouped  # DataFrameGroupBy object - intermediate object ready to be evaluated
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x78ae0be4b110>

This intermediate object is an internal pandas representation which allows very fast computation the moment we actually want to know something about these groups. If we want the mean of Measurement 1, for example, pandas will do very little actual computation until we explicitly call grouped.mean(). This is known as “lazy evaluation”.
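
One way to see the “split” part directly is to iterate over the GroupBy object, which yields a (label, sub-dataframe) pair per group (a small sketch):

for label, sub_df in grouped:
    print(label, sub_df.shape)  # each group's label and the shape of its rows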

The intermediate object has some useful attributes:

grouped.groups
{False: [100, 102, 104, 105, 107, 109, 110, 111, 112, 113, 114, 115, 116, 119, 123, 124, 125, 126, 130, 131, 135, 139, 141, 142, 143, 144, 146, 151, 152, 153, 155, 156, 157, 158, 165, 166, 168, 169, 170, 171, 173, 174, 179, 184, 187, 191, 192, 193, 197, 198], True: [101, 103, 106, 108, 117, 118, 120, 121, 122, 127, 128, 129, 132, 133, 134, 136, 137, 138, 140, 145, 147, 148, 149, 150, 154, 159, 160, 161, 162, 163, 164, 167, 172, 175, 176, 177, 178, 180, 181, 182, 183, 185, 186, 188, 189, 190, 194, 195, 196, 199]}
len(grouped)  # True and False
2
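
A couple of related attributes that may come in handy (a small aside):

grouped.ngroups  # 2 - the same information as len(grouped)
grouped.indices  # like .groups, but maps each label to positional indices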

If we wish to run some actual processing, we have to use an aggregation function:

grouped.sum()
       Placebo  Measurement 1  Measurement 2
Alive                                       
False       19      26.355717      24.211523
True        29      23.786082      28.607547
grouped.mean()
       Placebo  Measurement 1  Measurement 2
Alive                                       
False     0.38       0.527114       0.484230
True      0.58       0.475722       0.572151
grouped.size()
Alive
False    50
True     50
dtype: int64

If we just wish to see one of the groups, we can use get_group():

grouped.get_group(True).head()
            Alive  Placebo  Measurement 1  Measurement 2
Subject ID                                              
101          True     True       0.536397       0.662925
103          True    False       0.541027       0.973593
106          True     True       0.072025       0.743331
108          True    False       0.982296       0.969610
117          True    False       0.641704       0.475893

We can also call several aggregation functions at once using the agg() method:

# Passing function names as strings avoids the FutureWarning raised for NumPy callables
grouped.agg(["mean", "std"]).drop("Placebo", axis=1)
      Measurement 1           Measurement 2          
               mean       std          mean       std
Alive                                                
False      0.527114  0.263592      0.484230  0.323475
True       0.475722  0.279444      0.572151  0.301904

Grouping by multiple columns:

grouped2 = df.groupby(['Alive', 'Placebo'])
grouped2
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x78ae0b608410>
grouped2.agg(["sum", "var"])
              Measurement 1            Measurement 2          
                        sum       var            sum       var
Alive Placebo                                                  
False False       14.460490  0.063328      15.173510  0.096224
      True        11.895227  0.066926       9.038013  0.124345
True  False       10.230053  0.079387      12.305695  0.113690
      True        13.556029  0.079782      16.301852  0.078051

groupby() offers many more features, detailed in the pandas documentation.
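
For example, transform() returns a result aligned with the original rows, which makes within-group normalization a one-liner (a sketch, assuming we want per-group z-scores of Measurement 1):

# Z-score each measurement relative to its own group's mean and std
zscored = grouped["Measurement 1"].transform(lambda x: (x - x.mean()) / x.std())
df.assign(m1_zscore=zscored).head()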

Grouping Exercise

  • Create a dataframe with two columns, 10,000 entries in length. The first should be a random boolean column, and the second should be a sine wave from 0 to 20π. This simulates measuring a parameter from two distinct groups.

Solution
boolean_groups = np.array([False, True])
n_samples = 10_000
stop = 20 * np.pi
group_choice = np.random.choice(boolean_groups, n_samples)
values = np.sin(np.linspace(start=0, stop=stop, num=n_samples))
df = pd.DataFrame({'group': group_choice, 'value': values})
  • Group the dataframe by your boolean column, creating a GroupBy object.

Solution
grouped = df.groupby("group")
  • Plot the values of the grouped dataframe.

Solution
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(15, 10))
grouped_plot = grouped.value.plot(ax=ax)
[Figure: the 'value' column plotted separately for each group]
  • Use the rolling() method to create a rolling average window of length 5 and overlay the result.

Solution
window_size = 5
rolling_mean = df.value.rolling(window=window_size).mean()
rolling_mean.plot(ax=ax, label="Rolling Mean", linewidth=5)
ax.legend(loc="upper left")
[Figure: the grouped values with the 5-sample rolling mean overlaid]

Other Pandas Features#

Pandas has a TON of features and small implementation details that are there to make your life simpler. Features like IntervalIndex, which indexes data by the interval between two numbers instead of a single label, are very nice and ergonomic if you need them. Sparse DataFrames are also included, as well as many other computational tools, serialization capabilities, and more. If you need it, there's a good chance it already exists as a method somewhere in the pandas jungle.
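
To give a flavor of one such feature, pd.cut() bins values into intervals backed by an IntervalIndex (a minimal sketch with made-up ages):

# Bin made-up ages into labeled, half-open intervals
ages = pd.Series([3, 17, 25, 42, 67])
binned = pd.cut(ages, bins=[0, 18, 65, 120])
binned.value_counts()  # counts per (0, 18], (18, 65], (65, 120]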