Class 5¶
In our previous class we discussed NumPy, which in many ways is the cornerstone of the scientific ecosystem of Python. Besides NumPy, there are a few additional libraries which every scientific Python user should know. In this class we will discuss matplotlib and pandas.
Data Visualization with matplotlib¶
The most widely-used plotting library in the Python ecosystem is matplotlib. It has a number of strong alternatives and complementary libraries, e.g. bokeh and seaborn, but in terms of raw features it still has no real contenders.
matplotlib allows for very complicated data visualizations, and has two parallel interfaces: a procedural one and an object-oriented one. The procedural one closely resembles the MATLAB plotting interface, allowing for a very quick transition for MATLAB veterans. This is no coincidence: matplotlib was initially inspired by MATLAB’s approach to visualization.
Having said that, and considering how “old habits die hard”, it’s important to emphasize that the object-oriented interface is better in the long run, since it matches more of the examples found online and allows for easier plot manipulations. Finally, the “best” way to visualize your data will be to coerce it into a seaborn-like format and use that library to do so. More on that later.
Object-oriented Examples¶
This time, we will start things off by instantiating the Figure object using matplotlib’s subplots() function:
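A minimal sketch of this step, assuming the conventional plt alias for matplotlib.pyplot. The plotted values are arbitrary placeholders, not data from the class:

```python
import matplotlib.pyplot as plt

# With no arguments, subplots() returns a Figure and a single Axes
fig, ax = plt.subplots()
ax.plot([1, 2, 3], [4, 5, 6])  # arbitrary demo values
ax.set_xlabel("x")
ax.set_ylabel("y")
```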
A figure is the “complete” plot, which can contain many subplots, and the axis is the “container” for data itself.
Figures and axes can also be created separately, using the following two lines:
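One way this could look, again assuming the plt alias:

```python
import matplotlib.pyplot as plt

fig = plt.figure()         # an empty figure with no axes yet
ax = fig.add_subplot(111)  # 1 row, 1 column, first (and only) axis
```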
Multiple plots¶

To save a plot we could use the Figure object’s savefig() method:
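A short sketch of saving a figure. The file name and the temporary directory are hypothetical choices made for this illustration, and any writable path would do:

```python
import tempfile
from pathlib import Path

import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([0, 1, 2], [0, 1, 4])  # arbitrary demo values

# Hypothetical output location inside a fresh temporary directory
out_path = Path(tempfile.mkdtemp()) / "my_plot.png"
fig.savefig(out_path, dpi=150)  # the file type is inferred from the extension
```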
matplotlib is used in conjunction with numpy to visualize arrays:
import matplotlib.pyplot as plt
import numpy as np

def f(t):
    return np.exp(-t) * np.cos(2 * np.pi * t)

def g(t):
    return np.sin(t) * np.cos(1 / (t + 0.1))

t1 = np.arange(0.0, 5.0, 0.1)  # (start, stop, step)
t2 = np.arange(0.0, 5.0, 0.02)
# Create figure and axis
fig2 = plt.figure()
ax1 = fig2.add_subplot(111)
# Plot g over t1 and f over t2 in one line
ax1.plot(t1, g(t1), 'ro', t2, f(t2), 'k')
# Add grid
ax1.grid(color='b', alpha=0.5, linestyle='dashed', linewidth=0.5)
# Assigning labels and creating a legend
f_label = r'$e^{-t}\cos(2 \pi t)$'  # Using r'' allows us to use "\" in our strings
g_label = r'$\sin(t) \cdot \cos(\frac{1}{t + 0.1})$'
_ = ax1.legend([g_label, f_label])

Multiple Axes¶
data = np.random.randn(2, 100) # random numbers from normal distribution
fig, axs = plt.subplots(2, 2, figsize=(10, 6)) # 4 axes in a 2-by-2 grid.
axs[0, 0].hist(data[0])
axs[1, 0].scatter(data[0], data[1])
axs[0, 1].plot(data[0], data[1], '-.', linewidth=0.15)
_ = axs[1, 1].hist2d(data[0], data[1])

Note that axs is a numpy.ndarray instance, and that in order to draw on a specific axis (plot), we start by selecting that axis according to its location in the array. In the matplotlib version used here, the array itself is of type numpy.ndarray and each of its elements is of type matplotlib.axes._subplots.AxesSubplot.


Using matplotlib’s style objects opens up a world of possibilities.
To display available predefined styles:
['Solarize_Light2', '_classic_test_patch', 'bmh', 'classic', 'dark_background', 'fast', 'fivethirtyeight', 'ggplot', 'grayscale', 'seaborn', 'seaborn-bright', 'seaborn-colorblind', 'seaborn-dark', 'seaborn-dark-palette', 'seaborn-darkgrid', 'seaborn-deep', 'seaborn-muted', 'seaborn-notebook', 'seaborn-paper', 'seaborn-pastel', 'seaborn-poster', 'seaborn-talk', 'seaborn-ticks', 'seaborn-white', 'seaborn-whitegrid', 'tableau-colorblind10']
Usage example:
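A possible usage sketch, assuming the 'ggplot' style from the list above. The context manager applies the style only inside the with block, whereas plt.style.use('ggplot') would apply it globally:

```python
import matplotlib.pyplot as plt

print(plt.style.available)  # names of the predefined styles

# Apply a style temporarily, for this one figure only
with plt.style.context('ggplot'):
    fig, ax = plt.subplots()
    ax.plot([1, 2, 3], [2, 4, 3])  # arbitrary demo values
```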
SciPy¶
SciPy is a large library consisting of many smaller modules, each targeting a single field of scientific computing.
Available modules include scipy.stats, scipy.linalg, scipy.fftpack, scipy.signal and many more.
Because of its extremely wide scope of available use-cases, we won’t go through all of them. All you need to remember is that many functions that you’re used to finding in different MATLAB toolboxes are located somewhere in SciPy.
Below you’ll find a few particularly interesting use-cases.
.mat files input/output¶
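A small round-trip sketch using scipy.io. The file name, the temporary directory and the variable name "a" are hypothetical choices for this illustration:

```python
import os
import tempfile

import numpy as np
from scipy import io

mat_path = os.path.join(tempfile.mkdtemp(), "demo.mat")  # hypothetical file name

# Save a dictionary of arrays; the keys become MATLAB variable names
io.savemat(mat_path, {"a": np.arange(6).reshape(2, 3)})

# Load it back as a dictionary (it also holds a few metadata keys)
loaded = io.loadmat(mat_path)
print(loaded["a"])
```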
Linear algebra¶
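A brief sketch of a few scipy.linalg routines on a made-up 2x2 system:

```python
import numpy as np
from scipy import linalg

A = np.array([[1.0, 2.0], [3.0, 4.0]])  # arbitrary demo matrix
b = np.array([5.0, 6.0])

det_A = linalg.det(A)   # determinant
inv_A = linalg.inv(A)   # matrix inverse
x = linalg.solve(A, b)  # solves A @ x = b without forming the inverse
```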
Curve fitting¶
from scipy import optimize

def test_func(x, a, b):
    return a * np.sin(b * x)

# Create noisy data
x_data = np.linspace(-5, 5, num=50)
y_data = 2.9 * np.sin(1.5 * x_data)  # create baseline - sin wave
y_data += np.random.normal(size=50)  # add normally-distributed noise
fig, ax = plt.subplots()
ax.scatter(x_data, y_data)
params, params_covariance = optimize.curve_fit(test_func,
                                               x_data,
                                               y_data,
                                               p0=[3, 1])
_ = ax.plot(x_data, test_func(x_data, params[0], params[1]), 'k')

Statistics¶
from scipy import stats
# Create two random normal distributions with different parameters
a = np.random.normal(loc=0, scale=1, size=100)
b = np.random.normal(loc=1, scale=1, size=10)
# Draw histograms describing the distributions
fig, ax = plt.subplots()
ax.hist(a, color='b', density=True)
ax.hist(b, color='orange', density=True)
# Calculate the T-test for the means of two independent samples of scores
stats.ttest_ind(a, b)
IPython¶
IPython is the REPL in which this text was written. As stated, it’s the most popular “command window” of Python. When most Python programmers wish to write and execute a small Python script, they won’t use the regular Python interpreter, accessible with python my_file.py. Instead, they will run it with IPython since it has more features. For instance, the popular MATLAB feature of keeping the variables created by a script available after it finishes running is accessible when running a script as ipython -i my_file.py.
Let’s examine some of IPython’s other features, accessible by using the % magic operator before writing your actual code:
%%prun - benchmark each function line¶
Run a statement through the python code profiler.
%run - run external script¶
Run the named file inside IPython as a program.
%matplotlib [notebook|inline]¶
Easily display matplotlib figures inside the notebook and set up matplotlib to work interactively.
%reset¶
Resets the namespace by removing all names defined by the user, if called without arguments, or by removing some types of objects, such as everything currently in IPython’s In[] and Out[] containers (see the parameters for details).
scikit-image¶
scikit-image is one of the main image processing libraries in Python. We’ll look at it in greater depth later in the semester, but for now let’s examine some of its algorithms:
Edge detection using a Sobel filter.¶
from skimage import data, io, filters
fig, axes = plt.subplots(1, 2, figsize=(14, 6))
image = data.coins()
io.imshow(image, ax=axes[0])
axes[0].set_title("Original", fontsize=18)
edges = filters.sobel(image)  # edge-detection filter
io.imshow(edges, ax=axes[1])
axes[1].set_title("Edges", fontsize=18)
plt.tight_layout()

Segmentation using a “random walker” algorithm¶
Further reading
The random walker algorithm determines the segmentation of an image from a set of markers labeling several phases (2 or more). An anisotropic diffusion equation is solved with tracers initiated at the markers’ position. The local diffusivity coefficient is greater if neighboring pixels have similar values, so that diffusion is difficult across high gradients. The label of each unknown pixel is attributed to the label of the known marker that has the highest probability to be reached first during this diffusion process.
In this example, two phases are clearly visible, but the data are too noisy to perform the segmentation from the histogram only. We determine markers of the two phases from the extreme tails of the histogram of gray values, and use the random walker for the segmentation.
from skimage.segmentation import random_walker
from skimage.data import binary_blobs
import skimage
# Generate noisy synthetic data
data1 = skimage.img_as_float(binary_blobs(length=128, seed=1)) # data
data1 += 0.35 * np.random.randn(*data1.shape) # added noise
markers = np.zeros(data1.shape, dtype=np.uint)
markers[data1 < -0.3] = 1
markers[data1 > 1.3] = 2
# Run random walker algorithm
labels = random_walker(data1, markers, beta=10, mode='bf')
# Plot results
fig, (ax1, ax2, ax3) = plt.subplots(1,
                                    3,
                                    figsize=(8, 3.2),
                                    sharex=True,
                                    sharey=True)
ax1.imshow(data1, cmap='gray', interpolation='nearest')
ax1.axis('off')
ax1.set_adjustable('box')
ax1.set_title('Noisy data')
ax2.imshow(markers, cmap='hot', interpolation='nearest')
ax2.axis('off')
ax2.set_adjustable('box')
ax2.set_title('Markers')
ax3.imshow(labels, cmap='gray', interpolation='nearest')
ax3.axis('off')
ax3.set_adjustable('box')
ax3.set_title('Segmentation')
fig.tight_layout()

Template matching¶
Further reading
We use template matching to identify the occurrence of an image patch (in this case, a sub-image centered on a single coin). Here, we return a single match (the exact same coin), so the maximum value in the match_template result corresponds to the coin location. The other coins look similar, and thus have local maxima; if you expect multiple matches, you should use a proper peak-finding function.
The match_template function uses fast, normalized cross-correlation to find instances of the template in the image. Note that the peaks in the output of match_template correspond to the origin (i.e. top-left corner) of the template.
from skimage.feature import match_template
image = skimage.data.coins()
coin = image[170:220, 75:130]
result = match_template(image, coin)
ij = np.unravel_index(np.argmax(result), result.shape)
x, y = ij[::-1]
fig = plt.figure(figsize=(8, 3))
ax1 = plt.subplot(1, 3, 1)
ax2 = plt.subplot(1, 3, 2, adjustable='box')
ax3 = plt.subplot(1, 3, 3, sharex=ax2, sharey=ax2, adjustable='box')
ax1.imshow(coin, cmap=plt.cm.gray)
ax1.set_axis_off()
ax1.set_title('template')
ax2.imshow(image, cmap=plt.cm.gray)
ax2.set_axis_off()
ax2.set_title('image')
# highlight matched region
hcoin, wcoin = coin.shape
rect = plt.Rectangle((x, y),
                     wcoin,
                     hcoin,
                     edgecolor='r',
                     facecolor='none',
                     linewidth=2)
ax2.add_patch(rect)
ax3.imshow(result)
ax3.set_axis_off()
ax3.set_title('`match_template`\nresult')
# highlight matched region
ax3.autoscale(False)
ax3.plot(x, y, 'o', markeredgecolor='r', markerfacecolor='none', markersize=10)
Exercise¶
Perform these exercises using the object-oriented interface of matplotlib. Search for the proper methods from the different SciPy and matplotlib modules.
Create 1000 normally-distributed points. Histogram them. Overlay the histogram with a dashed line showing the theoretical normal distribution we would expect from the data.

Create a (1000, 3)-shaped matrix of uniformly distributed points between [0, 1). Create a scatter plot with the first two columns as the x and y columns, while the third should control the size of the created point.

Using np.random.choice, “roll a die” 100 times. Create a 6x1 figure panel with a shared x-axis containing values between 0 and 100 (exclusive). The first panel should show a vector with a value of 1 everywhere the die roll came out as 1, and 0 elsewhere. The second panel should show a vector with a value of 1 everywhere the die roll came out as 2, and 0 elsewhere, and so on. Create a title for the entire figure. The y-axis of each panel should indicate the value this plot refers to.
plt.style.use('ggplot')
die = np.arange(1, 7)
num = 100
rolls = np.random.choice(die, num)
fig, ax = plt.subplots(6, 1, sharex=True)
for roll, axis in enumerate(ax, 1):
    axis.scatter(np.arange(num), rolls == roll,
                 s=5)  # notice how we plot a boolean vector
    axis.set_ylabel(roll)
    axis.yaxis.set_ticks([])
# The x-axis is shared, so configuring the last panel is enough
axis.set_xlim([0, num])
axis.set_xlabel('Roll number')
fig.suptitle('Dice Roll Distribution')
_ = fig.text(0.01,
             0.5,
             'Roll value',
             ha='center',
             va='center',
             rotation='vertical')

Data Analysis with pandas¶
A large part of what makes Python so popular nowadays is pandas, or the “Python data analysis library”.
pandas has been around since 2008, and while in itself it’s built on the solid foundations of numpy, it introduced a vast array of important features that can hardly be found anywhere outside of the Python ecosystem.
The general principle in working with pandas is to first look up in its immense codebase (via its docs), or somewhere online, an existing function that does exactly what you’re looking for, and only if you can’t find one should you implement it yourself.
Much of the discussion below is taken from the Python Data Science Handbook, by Jake VanderPlas. Be sure to check it out if you need further help with one of the topics.
The need for pandas¶
With only clean data in the world, pandas wouldn’t be as necessary. By clean we mean that all of our data was sampled properly, without any missing data points. We also mean that the data is homogeneous, i.e. of a single type (floats, ints), and one-dimensional.
An example of this simple data might be an electrophysiological measurement of a neuron’s voltage over time, a calcium trace of a single imaged neuron and other simple cases such as these.
pandas provides flexibility for our numerical computing tasks via its two main data types: DataFrame and Series, which are multi-purpose data containers with very useful features, which you’ll soon learn about.
Mastering pandas is one of the most important goals of this course. Your work as scientists will be greatly simplified if you feel comfortable in the pandas jungle.
Series¶
A pandas Series is a generalization of a simple numpy array. It’s the basic building block of pandas objects.
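A minimal sketch of creating a series, assuming the conventional pd alias. The values are arbitrary placeholders:

```python
import pandas as pd

ser = pd.Series([50.0, 100.0, 150.0, 200.0])  # arbitrary demo values
print(ser)
```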
We received a series instance with our values and an associated index. The index was given automatically, and it defaults to ordinal numbers. Notice how the data is displayed as a column. This is because the pandas library deals with tabular data.
We can access the internal arrays, data and indices, by using the array and index attributes:
Note that in many places you’ll see series.values used when trying to access the raw data. This is no longer encouraged, and you should generally use either series.array or, even better, series.to_numpy().
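The accessors mentioned above can be sketched as follows, on a small series with made-up values:

```python
import numpy as np
import pandas as pd

ser = pd.Series([50.0, 100.0, 150.0, 200.0])  # arbitrary demo values

print(ser.array)       # a pandas array object wrapping the data
print(ser.index)       # RangeIndex(start=0, stop=4, step=1)
print(ser.to_numpy())  # a plain numpy.ndarray
```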
The index of the array is a true index, just like that of a dictionary, making item access pretty intuitive:
While this feature is very similar to a numpy array’s index, a series can also have non-integer indices:
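A short sketch of both kinds of access. The string labels are hypothetical and were chosen only for this illustration:

```python
import pandas as pd

ser = pd.Series([50.0, 100.0, 150.0, 200.0])
print(ser[1])  # 100.0, looked up by the index label 1

# Hypothetical string labels in place of the default integers
ser_labeled = pd.Series([50.0, 100.0, 150.0, 200.0],
                        index=['a', 'b', 'c', 'd'])
print(ser_labeled['c'])  # 150.0
```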
The index of a series is one of its most important features. It also strengthens the analogy of a series to an enhanced Python dictionary. The main difference between a series and a dictionary lies in its vectorization - data inside a series can be processed in a vectorized manner, just like you would act upon a standard numpy array.
Series Instantiation¶
Simplest form:
Or, very similarly:
Indices can be specified, as we’ve seen:
A series (and a DataFrame) can be composed out of a dictionary as well:
Notice how the right dtype was inferred automatically.
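The instantiation forms listed above might look like the following sketch. All names and values here are made up for the illustration:

```python
import numpy as np
import pandas as pd

from_list = pd.Series([1, 2, 3])                          # from a list
from_array = pd.Series(np.array([1, 2, 3]))               # from a numpy array
with_index = pd.Series([1, 2, 3], index=['x', 'y', 'z'])  # explicit index

# Dictionary keys become the index; the values are demo numbers
continents = pd.Series({'Europe': 10.0, 'Africa': 21.0, 'America': 9.0,
                        'Asia': 9.0, 'Australia': 19.0})
print(continents.dtype)  # float64
```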
When creating a series from a dictionary, the importance of the index is revealed again:
We can also use slicing on these non-numeric indices:
Note
Note the inclusive last index - string indices are inclusive on both ends. This makes more sense when using location-based indices, since in day-to-day speech we regularly talk with “inclusive” indices - “hand me over the tests of students 1-5” obviously refers to 5 students, not 4.
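A sketch of both points, the role of the index when building from a dictionary and the inclusive label slice. The series contents are hypothetical:

```python
import pandas as pd

continents = pd.Series({'Europe': 10.0, 'Africa': 21.0, 'America': 9.0,
                        'Asia': 9.0, 'Australia': 19.0})

# Passing an index selects keys; labels missing from the dict become NaN
subset = pd.Series({'Europe': 10.0, 'Africa': 21.0},
                   index=['Africa', 'Antarctica'])

# Label-based slice: both 'Africa' and 'Asia' are included
print(continents['Africa':'Asia'])
```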
We’ll discuss pandas indexing extensively later on, but I do want to point out now that indexes can be non-unique:
A few operations require a unique index, making them raise an exception, but most operations should work seamlessly.
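A small sketch of a non-unique index. Using reindex() as the example of an operation that needs unique labels is a choice made for this illustration:

```python
import pandas as pd

dup = pd.Series([1, 2, 3], index=['a', 'a', 'b'])  # 'a' appears twice
print(dup['a'])             # a two-element Series, one row per matching label
print(dup.index.is_unique)  # False

# reindex() is one operation that requires unique labels
try:
    dup.reindex(['a', 'b'])
except ValueError as err:
    print(err)
```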
Lastly, series objects can have a name attached to them as well:
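For instance, with a hypothetical series of medal counts:

```python
import pandas as pd

medals = pd.Series([10, 21, 9], index=['Europe', 'Africa', 'America'],
                   name='medals')
print(medals.name)  # medals
```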
DataFrame¶
A DataFrame is a concatenation of multiple Series objects that share the same index. It’s a generalization of a two dimensional numpy array.
You can also think of it as a dictionary of Series objects, as a database table, or a spreadsheet.
Due to its flexibility, DataFrame is the more widely used data structure.
A dataframe has a row index (“index”) and a column index (columns):
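A minimal sketch showing both indices on a small dataframe with made-up values:

```python
import pandas as pd

df = pd.DataFrame({'one': [1.0, 2.0, 3.0], 'two': [4.0, 5.0, 6.0]},
                  index=['a', 'b', 'c'])  # arbitrary demo values
print(df.index)    # row labels
print(df.columns)  # column labels
```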
Instantiation¶
Creating a dataframe can be done in one of several ways:
Dictionary of 1D numpy arrays, lists, dictionaries or Series
A 2D numpy array
A Series
A different dataframe
Alongside the data itself, you can pass two important arguments to the constructor:
columns - An iterable of the headers of each data column.
index - Similar to a series.
Just like in the case of the series, passing these arguments ensures that the resulting dataframe will contain these specific columns and indices, which might lead to NaNs in certain rows and/or columns.
| | one | two |
---|---|---|
a | 1.0 | 1.0 |
b | 2.0 | 2.0 |
c | 3.0 | 3.0 |
d | NaN | 4.0 |
Again, rows will be dropped for missing indices:
A column of NaNs is forced in the case of a missing column:
| | two | three |
---|---|---|
d | 4.0 | NaN |
b | 2.0 | NaN |
a | 1.0 | NaN |
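Tables like the ones above can be produced by a sketch along these lines. The variable names are hypothetical and the values are taken from the displayed tables:

```python
import pandas as pd

d = {'one': pd.Series([1.0, 2.0, 3.0], index=['a', 'b', 'c']),
     'two': pd.Series([1.0, 2.0, 3.0, 4.0], index=['a', 'b', 'c', 'd'])}

# Union of both indices; 'one' has no value at 'd', so it becomes NaN
df_full = pd.DataFrame(d)

# Only the requested rows are kept, in the requested order
df_rows = pd.DataFrame(d, index=['d', 'b', 'a'])

# 'three' does not exist in d, so the whole column is NaN
df_cols = pd.DataFrame(d, index=['d', 'b', 'a'], columns=['two', 'three'])
```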
A 1D dataframe is also possible:
| | data |
---|---|
0 | 1 |
1 | 2 |
2 | 3 |
Columnar Operations¶
If we continue with the dictionary analogy, we can observe how intuitive the operations on series and dataframe columns can be:
| | population | medals |
---|---|---|
Europe | 100.0 | 10 |
Africa | 907.8 | 21 |
America | 700.1 | 9 |
Asia | 2230.0 | 9 |
Australia | 73.7 | 19 |
A dataframe can be thought of as a dictionary. Thus, accessing a column is done in the following manner:
This will definitely be one of your main sources of confusion - in a 2D array, arr[0] will return the first row. In a dataframe, df['col0'] will return the first column. Thus, the dictionary analogy might be better suited for indexing operations.
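A sketch of dictionary-style column access. The variable name olympics is an assumption, and the values are copied from the table above:

```python
import pandas as pd

olympics = pd.DataFrame(
    {'population': [100.0, 907.8, 700.1, 2230.0, 73.7],
     'medals': [10, 21, 9, 9, 19]},
    index=['Europe', 'Africa', 'America', 'Asia', 'Australia'])

medals = olympics['medals']  # a Series, keyed by the row index
print(medals['Africa'])      # 21
```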
To show a few operations on a dataframe, let’s remind ourselves of the df variable:
First we see that we can access columns using standard dot notation as well (although it’s usually not recommended):
Can you guess what these two operations will do?
Columns can be deleted with del, or popped like a dictionary:
Insertion of some scalar value will propagate throughout the column:
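The column operations described above could be sketched as follows. The two derived columns and the year column are hypothetical examples, not necessarily the ones used in class:

```python
import pandas as pd

df = pd.DataFrame(
    {'population': [100.0, 907.8, 700.1, 2230.0, 73.7],
     'medals': [10, 21, 9, 9, 19]},
    index=['Europe', 'Africa', 'America', 'Asia', 'Australia'])

print(df.medals)  # attribute access, equivalent to df['medals']

df['double_medals'] = df['medals'] * 2  # element-wise arithmetic, new column
df['has_medals'] = df['medals'] > 0     # comparisons yield a boolean column

del df['double_medals']       # remove a column in place
flags = df.pop('has_medals')  # remove a column and get it back as a Series

df['year'] = 2016             # a scalar is broadcast to every row
```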
Simple plotting¶
You can plot dataframes and series objects quite easily using the plot() method:
More plotting methods will be shown in class 8.
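As a small illustration, a bar plot of one column. The choice of a bar plot is arbitrary, and the data is the table shown earlier:

```python
import pandas as pd

df = pd.DataFrame(
    {'population': [100.0, 907.8, 700.1, 2230.0, 73.7],
     'medals': [10, 21, 9, 9, 19]},
    index=['Europe', 'Africa', 'America', 'Asia', 'Australia'])

ax = df.plot(kind='bar', y='medals')  # returns a matplotlib Axes
ax.set_ylabel('medals')
```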
The assign method¶
There’s a more powerful way to insert a column into a dataframe, using the assign method:
| | population | medals | rel_medals |
---|---|---|---|
Europe | 100.0 | 10 | 0.100000 |
Africa | 907.8 | 21 | 0.023133 |
America | 700.1 | 9 | 0.012855 |
Asia | 2230.0 | 9 | 0.004036 |
Australia | 73.7 | 19 | 0.257802 |
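The rel_medals column above is the ratio of medals to population, so it can be reproduced with a sketch like this one. The name olympics for the original dataframe is an assumption:

```python
import pandas as pd

olympics = pd.DataFrame(
    {'population': [100.0, 907.8, 700.1, 2230.0, 73.7],
     'medals': [10, 21, 9, 9, 19]},
    index=['Europe', 'Africa', 'America', 'Asia', 'Australia'])

# assign() returns a new dataframe and leaves the original untouched
olympics_new = olympics.assign(rel_medals=lambda x: x.medals / x.population)
print(olympics_new)
```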
But assign() can also help us do more complicated stuff:
# We create an intermediate dataframe and run the calculations on it
area = [100, 89, 200, 21, 45]
olympics_new["area"] = area
olympics_new.assign(rel_area_medals=lambda x: x.medals / x.area).plot(
    kind='scatter', x='population', y='rel_area_medals')
plt.show()
print("Note that the DataFrame itself didn't change:\n",olympics_new)
Indexing¶
pandas indexing can seem complicated at times due to its high flexibility. However, its relative importance should motivate you to overcome this initial barrier.
The pandas documentation summarizes it in the following manner:
Operation | Syntax | Result |
---|---|---|
Select column | df[col] | Series |
Select row by label | df.loc[label] | Series |
Select row by integer location | df.iloc[loc] | Series |
Slice rows | df[5:10] | DataFrame |
Select rows by boolean vector | df[bool_vec] | DataFrame |
Another helpful summary is the following:
Like lists, you can index by integer position (df.iloc[intloc]).
Like dictionaries, you can index by label (df[col] or df.loc[row_label]).
Like NumPy arrays, you can index with boolean masks (df[bool_vec]).
Any of these indexers could be scalar indexes, or they could be arrays, or they could be slices.
Any of these should work on the index (=row labels) or columns of a DataFrame.
And any of these should work on hierarchical indexes (we’ll discuss hierarchical indices later).
Let’s see what all the fuss is about:
.loc¶
.loc is primarily label based, but may also be used with a boolean array. .loc will raise KeyError when the items are not found. Allowed inputs are:
A single label, e.g. 5 or ‘a’ (note that 5 is interpreted as a label of the index. This use is not an integer position along the index)
A list or array of labels [‘a’, ‘b’, ‘c’]
A slice object with labels 'a':'f' (note that contrary to usual Python slices, both the start and the stop are included, when present in the index!)
A boolean array
A callable function with one argument (the calling Series or DataFrame) and that returns valid output for indexing (one of the above)
Slicing with labels is always inclusive on both ends. This is because it’s more “natural” this way, according to pandas devs. As natural as it may be, it’s definitely confusing.
2D indexing also works:
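The allowed .loc inputs can be sketched on the continents table from earlier. The variable names are hypothetical:

```python
import pandas as pd

df = pd.DataFrame(
    {'population': [100.0, 907.8, 700.1, 2230.0, 73.7],
     'medals': [10, 21, 9, 9, 19]},
    index=['Europe', 'Africa', 'America', 'Asia', 'Australia'])

row = df.loc['Asia']                # single label -> Series
some = df.loc[['Europe', 'Asia']]   # list of labels -> DataFrame
span = df.loc['Africa':'Asia']      # label slice, inclusive on both ends
big = df.loc[df['medals'] > 9]      # boolean array
cell = df.loc['Africa', 'medals']   # 2D: row label, then column label
```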
.iloc¶
.iloc is primarily integer position based (from 0 to length-1 of the axis), but may also be used with a boolean array. .iloc will raise IndexError if a requested indexer is out-of-bounds, except slice indexers which allow out-of-bounds indexing (this conforms with Python/numpy slice semantics). Allowed inputs are:
An integer, e.g. 5
A list or array of integers [4, 3, 0]
A slice object with ints 1:7
A boolean array
A callable function with one argument (the calling Series or DataFrame) and that returns valid output for indexing (one of the above)
2D indexing works as expected:
We can also slice rows in a more intuitive fashion:
Notice how no exception was raised even though we tried to slice outside the dataframe boundary. This conforms to standard Python and numpy behavior.
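A sketch of position-based indexing on the same continents table, including a slice that runs past the last row. The variable names are hypothetical:

```python
import pandas as pd

df = pd.DataFrame(
    {'population': [100.0, 907.8, 700.1, 2230.0, 73.7],
     'medals': [10, 21, 9, 9, 19]},
    index=['Europe', 'Africa', 'America', 'Asia', 'Australia'])

first_row = df.iloc[0]       # single integer -> Series
picked = df.iloc[[4, 3, 0]]  # list of positions -> DataFrame, in that order
cell = df.iloc[1, 1]         # 2D: second row, second column
tail = df.iloc[3:10]         # the slice runs past the end, yet no error
plain = df[3:10]             # the same rows, using plain slice notation
```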
This slice notation (without .iloc or .loc) works fine, but it is sometimes counter-intuitive. Try this example:
| | A | B | C | D |
---|---|---|---|---|
10 | 1 | 2 | 3 | 4 |
20 | 5 | 6 | 7 | 8 |
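A sketch that rebuilds this table and contrasts the two behaviors. The name df2 follows the traceback shown in this section, and the constructor call is an assumption:

```python
import pandas as pd

df2 = pd.DataFrame([[1, 2, 3, 4], [5, 6, 7, 8]],
                   columns=['A', 'B', 'C', 'D'], index=[10, 20])

print(df2[1:])  # a slice is positional: rows from position 1 onward

try:
    df2[1]      # a scalar is looked up among the columns, so this fails
except KeyError as err:
    print('KeyError:', err)
```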
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
/opt/hostedtoolcache/Python/3.8.7/x64/lib/python3.8/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
3079 try:
-> 3080 return self._engine.get_loc(casted_key)
3081 except KeyError as err:
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
KeyError: 1
The above exception was the direct cause of the following exception:
KeyError Traceback (most recent call last)
<ipython-input-77-9b10e048991a> in <module>
----> 1 df2[1] # we fail, since the key "1" isn't in the columns
2 # df2[10] - this also fails
/opt/hostedtoolcache/Python/3.8.7/x64/lib/python3.8/site-packages/pandas/core/frame.py in __getitem__(self, key)
3022 if self.columns.nlevels > 1:
3023 return self._getitem_multilevel(key)
-> 3024 indexer = self.columns.get_loc(key)
3025 if is_integer(indexer):
3026 indexer = [indexer]
/opt/hostedtoolcache/Python/3.8.7/x64/lib/python3.8/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
3080 return self._engine.get_loc(casted_key)
3081 except KeyError as err:
-> 3082 raise KeyError(key) from err
3083
3084 if tolerance is not None:
KeyError: 1
This is why we generally prefer indexing with either .loc or .iloc - we know what we’re after, and we explicitly write it.
Indexing with query and where¶
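A brief sketch of both methods on the continents table. The conditions are arbitrary examples: query() filters rows with a string expression, while where() keeps the shape of the dataframe and replaces the entries that fail the condition.

```python
import pandas as pd

df = pd.DataFrame(
    {'population': [100.0, 907.8, 700.1, 2230.0, 73.7],
     'medals': [10, 21, 9, 9, 19]},
    index=['Europe', 'Africa', 'America', 'Asia', 'Australia'])

# Keep only the rows satisfying the expression
rich = df.query('medals > 9 and population < 1000')

# Keep every cell greater than 15 and replace the rest with 0
capped = df.where(df > 15, other=0)
```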
Exercise¶
Give it a try before you reveal the hidden cells (“Click to show”)!
Basics #1:
Create a mock pd.Series containing the number of autonomous cars in different cities in Israel. Use proper naming and datatypes, and have at least 7 data points.
Show the mean, standard deviation and median of the Series.
Create another mock Series for the population counts of the cities you used in question 1.
Make a DataFrame from both series and plot (scatter plot) the number of autonomous cars as a function of the population using the pandas’ API only, without a direct call to matplotlib (besides plt.show()).
Basics #2:
Create three random pd.Series and generate a pd.DataFrame from them. Name each series, but make sure to use the same, non-numeric, index for the different series.
Display the underlying numpy array.
Create a new column from the addition of two of the columns without the assign() method.
Create a new column from the multiplication of two of the columns using assign(), and plot the result.
Take the sine of the entire DF.
| | one | two | three |
---|---|---|---|
a | 0.997695 | 0.597312 | 0.096502 |
b | 0.576848 | 0.858550 | 0.282407 |
c | 0.811523 | 0.064637 | 0.421879 |
d | 0.250657 | 0.744438 | 0.080021 |
e | 0.307660 | 0.450906 | 0.625949 |
f | 0.980502 | 0.628293 | 0.497054 |
g | 0.283198 | 0.476355 | 0.828951 |
h | 0.113362 | 0.083312 | 0.742331 |
i | 0.250469 | 0.611273 | 0.108273 |
j | 0.402790 | 0.260669 | 0.448805 |
| | one | two | three | four |
---|---|---|---|---|
a | 0.997695 | 0.597312 | 0.096502 | 1.094197 |
b | 0.576848 | 0.858550 | 0.282407 | 0.859255 |
c | 0.811523 | 0.064637 | 0.421879 | 1.233403 |
d | 0.250657 | 0.744438 | 0.080021 | 0.330679 |
e | 0.307660 | 0.450906 | 0.625949 | 0.933609 |
f | 0.980502 | 0.628293 | 0.497054 | 1.477557 |
g | 0.283198 | 0.476355 | 0.828951 | 1.112149 |
h | 0.113362 | 0.083312 | 0.742331 | 0.855694 |
i | 0.250469 | 0.611273 | 0.108273 | 0.358742 |
j | 0.402790 | 0.260669 | 0.448805 | 0.851594 |
| | one | two | three | four |
---|---|---|---|---|
a | 0.840223 | 0.562422 | 0.096353 | 0.888560 |
b | 0.545385 | 0.756896 | 0.278668 | 0.757356 |
c | 0.725337 | 0.064592 | 0.409476 | 0.943621 |
d | 0.248041 | 0.677558 | 0.079936 | 0.324685 |
e | 0.302829 | 0.435781 | 0.585867 | 0.803772 |
f | 0.830777 | 0.587765 | 0.476838 | 0.995656 |
g | 0.279428 | 0.458543 | 0.737223 | 0.896652 |
h | 0.113120 | 0.083216 | 0.676008 | 0.755026 |
i | 0.247859 | 0.573910 | 0.108062 | 0.351097 |
j | 0.391986 | 0.257727 | 0.433889 | 0.752332 |
Dates and times in pandas:
Create a DataFrame with at least two columns, a datetime index (look at pd.date_range) and random data.
Convert the dtype of one of the columns (int <-> float).
View the top and bottom of the dataframe using the head and tail methods. Make sure to visit describe() as well.
Use the sort_values method to sort your DF by column values. What happened to the indices?
Re-sort the dataframe with the sort_index method.
Display the value in the third row, at the second column. What’s the most well suited indexing method?
| | A | B | C | D |
---|---|---|---|---|
2018-01-31 | -0.212414 | 0.990292 | 0.242673 | 1.193698 |
2018-02-28 | -0.295161 | 0.235889 | -0.796223 | 0.914655 |
2018-03-31 | -0.267040 | -0.550909 | NaN | 0.042857 |
2018-04-30 | -1.268755 | 0.349107 | -0.797251 | -0.493387 |
2018-05-31 | 0.300880 | -0.750616 | -1.058389 | 1.232958 |
2018-06-30 | 2.289042 | 0.822776 | -0.931744 | 0.320625 |
| | A | B | C | D |
---|---|---|---|---|
2018-01-31 | 0 | 0.990292 | 0.242673 | 1.193698 |
2018-02-28 | 0 | 0.235889 | -0.796223 | 0.914655 |
2018-03-31 | 0 | -0.550909 | NaN | 0.042857 |
| | A | B | C | D |
---|---|---|---|---|
count | 6.000000 | 6.000000 | 5.000000 | 6.000000 |
mean | 0.166667 | 0.182757 | -0.668187 | 0.535234 |
std | 0.983192 | 0.707345 | 0.520681 | 0.694393 |
min | -1.000000 | -0.750616 | -1.058389 | -0.493387 |
25% | 0.000000 | -0.354210 | -0.931744 | 0.112299 |
50% | 0.000000 | 0.292498 | -0.797251 | 0.617640 |
75% | 0.000000 | 0.704359 | -0.796223 | 1.123937 |
max | 2.000000 | 0.990292 | 0.242673 | 1.232958 |
| | A | B | C | D |
---|---|---|---|---|
2018-05-31 | 0 | -0.750616 | -1.058389 | 1.232958 |
2018-06-30 | 2 | 0.822776 | -0.931744 | 0.320625 |
2018-04-30 | -1 | 0.349107 | -0.797251 | -0.493387 |
2018-02-28 | 0 | 0.235889 | -0.796223 | 0.914655 |
2018-01-31 | 0 | 0.990292 | 0.242673 | 1.193698 |
2018-03-31 | 0 | -0.550909 | NaN | 0.042857 |
| | A | B | C | D |
---|---|---|---|---|
2018-01-31 | 0 | 0.990292 | 0.242673 | 1.193698 |
2018-02-28 | 0 | 0.235889 | -0.796223 | 0.914655 |
2018-03-31 | 0 | -0.550909 | NaN | 0.042857 |
2018-04-30 | -1 | 0.349107 | -0.797251 | -0.493387 |
2018-05-31 | 0 | -0.750616 | -1.058389 | 1.232958 |
2018-06-30 | 2 | 0.822776 | -0.931744 | 0.320625 |
| | A | B | C | D |
---|---|---|---|---|
2018-05-31 | 0 | -0.750616 | -1.058389 | 1.232958 |
2018-06-30 | 2 | 0.822776 | -0.931744 | 0.320625 |
2018-04-30 | -1 | 0.349107 | -0.797251 | -0.493387 |
2018-02-28 | 0 | 0.235889 | -0.796223 | 0.914655 |
2018-01-31 | 0 | 0.990292 | 0.242673 | 1.193698 |
2018-03-31 | 0 | -0.550909 | NaN | 0.042857 |
DataFrame comparisons and operations:
Generate another DataFrame with at least two columns. Populate it with random values between -1 and 1.
Find the places where the dataframe contains negative values, and replace them with their positive inverse (-0.21 turns to 0.21).
Set one of the values to NaN using .loc.
Drop the entire column containing this null value.
| | back | front |
---|---|---|
0 | 0.364488 | -0.678984 |
1 | 0.935116 | -0.739421 |
2 | -0.807927 | 0.949968 |
3 | 0.294234 | -0.700900 |
4 | 0.045437 | -0.244420 |
5 | -0.474362 | -0.933394 |
6 | 0.301566 | -0.476887 |
7 | 0.109245 | -0.639445 |
8 | -0.078301 | 0.588149 |
9 | -0.033571 | 0.410518 |
10 | 0.292354 | -0.050363 |
11 | 0.297447 | -0.538185 |
12 | 0.091290 | -0.515612 |
13 | 0.089143 | -0.989013 |
14 | 0.499024 | -0.084825 |
| | back | front |
---|---|---|
0 | 0.364488 | 0.678984 |
1 | 0.935116 | 0.739421 |
2 | 0.807927 | 0.949968 |
3 | 0.294234 | 0.700900 |
4 | 0.045437 | 0.244420 |
5 | 0.474362 | 0.933394 |
6 | 0.301566 | 0.476887 |
7 | 0.109245 | 0.639445 |
8 | 0.078301 | 0.588149 |
9 | 0.033571 | 0.410518 |
10 | 0.292354 | 0.050363 |
11 | 0.297447 | 0.538185 |
12 | 0.091290 | 0.515612 |
13 | 0.089143 | 0.989013 |
14 | 0.499024 | 0.084825 |