Introduction to Pandas#

In basic Python, we can use dictionaries for data handling, containing data as lists. While these basic structures are convenient for simple data handling, they are suboptimal for advanced data processing and analysis.

For such, we introduce pandas, one of the primary tools for handling and analysing tabular data in the Python. Pandas’ primary object, the “DataFrame”, is a 2-dimensional indexed and labeled data structure with columns of different data types, and is extremely useful in wrangling data.

Pandas DataFrame

See also:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

Creating DataFrames#

DataFrames from dictionaries#

Assume we did some measurements and have some results available in a dictionary:

measurements = {
    "label": ["A", "B", "C", "D", "E", "F", "G", "H", "I", "J"],
    "value1": [9, 11, 10.2, 9.5, 15,  400, 9, 11, 11.3, 10],
    "value2": [2, np.nan, 8, 5, 7, np.nan, 1, 4, 6, 9],
    "value3": [3.1, 0.4, 5.6, np.nan, 4.4, 2.2, 1.1, 0.9, 3.3, 4.1],
    "valid": [True, False, True, True, False, True, False, True, True, False],
}

This data structure can be easily transferred into a DataFrame which provides additional functionalities, like visualizing it nicely:

df = pd.DataFrame(data=measurements)
df
label value1 value2 value3 valid
0 A 9.0 2.0 3.1 True
1 B 11.0 NaN 0.4 False
2 C 10.2 8.0 5.6 True
3 D 9.5 5.0 NaN True
4 E 15.0 7.0 4.4 False
5 F 400.0 NaN 2.2 True
6 G 9.0 1.0 1.1 False
7 H 11.0 4.0 0.9 True
8 I 11.3 6.0 3.3 True
9 J 10.0 9.0 4.1 False

DataFrames also provide some convenient methods to get an overview of the DataFrames’ structure and the containing data:

  • info() to summarize the DataFrame

  • describe() for descriptive statistics

  • plot() for simple plots of DataFrames and Series

# Show structure of the DataFrame
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   label   10 non-null     object 
 1   value1  10 non-null     float64
 2   value2  8 non-null      float64
 3   value3  9 non-null      float64
 4   valid   10 non-null     bool   
dtypes: bool(1), float64(3), object(1)
memory usage: 458.0+ bytes
# Show basic descriptive statistics of the containing data
df.describe().transpose()
count mean std min 25% 50% 75% max
value1 10.0 49.600000 123.130184 9.0 9.625 10.6 11.225 400.0
value2 8.0 5.250000 2.815772 1.0 3.500 5.5 7.250 9.0
value3 9.0 2.788889 1.769495 0.4 1.100 3.1 4.100 5.6
# Show descriptive statistics for categorical data
df.describe(exclude=['number']).transpose()
count unique top freq
label 10 10 A 1
valid 10 2 True 6
# Plot the data
df.plot(x='label', y=['value1', 'value2', 'value3'], title='Measurements')
plt.show()
../_images/6cee58efdf82ff7611c2aadafe3b6899a450b5d03cd0ffd0b36fe3cdd4e8905c.png

Saving and loading DataFrames#

We can save this DataFrames for continuing to work with it. We chose to save it as a CSV file. You should generally always store your data in such an open, compatible format with well-defined specifications, (not necessarily CSV).

df.to_csv('data/measurements_1.csv', index=False)

Stored DataFrames can be read with pandas from various formats, so we are able to read files into a DataFrame again.

See also:

df_new = pd.read_csv('data/measurements_1.csv')
df_new
label value1 value2 value3 valid
0 A 9.0 2.0 3.1 True
1 B 11.0 NaN 0.4 False
2 C 10.2 8.0 5.6 True
3 D 9.5 5.0 NaN True
4 E 15.0 7.0 4.4 False
5 F 400.0 NaN 2.2 True
6 G 9.0 1.0 1.1 False
7 H 11.0 4.0 0.9 True
8 I 11.3 6.0 3.3 True
9 J 10.0 9.0 4.1 False