Introduction to Pandas

Introduction to Pandas#

In basic Python, we can use dictionaries for data handling, containing data as lists. While these basic structures are convenient for simple data handling, they are suboptimal for advanced data processing and analysis.

For such, we introduce pandas, one of the primary tools for handling and analysing tabular data in the Python. Pandas’ primary object, the “DataFrame”, is a 2-dimensional indexed and labeled data structure with columns of different data types, and is extremely useful in wrangling data.

Creating DataFrames#

DataFrames from dictionaries#

Assume we did some measurements and have some results available in a dictionary:

measurements = {
    "label": ["A", "B", "C", "D", "E", "F", "G", "H", "I", "J"],
    "value1": [9, 11, 10.2, 9.5, 15,  400, 9, 11, 11.3, 10],
    "value2": [2, np.nan, 8, 5, 7, np.nan, 1, 4, 6, 9],
    "value3": [3.1, 0.4, 5.6, np.nan, 4.4, 2.2, 1.1, 0.9, 3.3, 4.1],
    "valid": [True, False, True, True, False, True, False, True, True, False],
}

This data structure can be easily transferred into a DataFrame which provides additional functionalities, like visualizing it nicely:

df = pd.DataFrame(data=measurements)
df

	label	value1	value2	value3	valid
0	A	9.0	2.0	3.1	True
1	B	11.0	NaN	0.4	False
2	C	10.2	8.0	5.6	True
3	D	9.5	5.0	NaN	True
4	E	15.0	7.0	4.4	False
5	F	400.0	NaN	2.2	True
6	G	9.0	1.0	1.1	False
7	H	11.0	4.0	0.9	True
8	I	11.3	6.0	3.3	True
9	J	10.0	9.0	4.1	False

DataFrames also provide some convenient methods to get an overview of the DataFrames’ structure and the containing data:

info() to summarize the DataFrame
describe() for descriptive statistics
plot() for simple plots of DataFrames and Series

# Show structure of the DataFrame
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   label   10 non-null     object 
 1   value1  10 non-null     float64
 2   value2  8 non-null      float64
 3   value3  9 non-null      float64
 4   valid   10 non-null     bool   
dtypes: bool(1), float64(3), object(1)
memory usage: 458.0+ bytes

# Show basic descriptive statistics of the containing data
df.describe().transpose()

	count	mean	std	min	25%	50%	75%	max
value1	10.0	49.600000	123.130184	9.0	9.625	10.6	11.225	400.0
value2	8.0	5.250000	2.815772	1.0	3.500	5.5	7.250	9.0
value3	9.0	2.788889	1.769495	0.4	1.100	3.1	4.100	5.6

# Show descriptive statistics for categorical data
df.describe(exclude=['number']).transpose()

	count	unique	top	freq
label	10	10	A	1
valid	10	2	True	6

# Plot the data
df.plot(x='label', y=['value1', 'value2', 'value3'], title='Measurements')
plt.show()

../_images/6cee58efdf82ff7611c2aadafe3b6899a450b5d03cd0ffd0b36fe3cdd4e8905c.png

Saving and loading DataFrames#

We can save this DataFrames for continuing to work with it. We chose to save it as a CSV file. You should generally always store your data in such an open, compatible format with well-defined specifications, (not necessarily CSV).

df.to_csv('data/measurements_1.csv', index=False)

Stored DataFrames can be read with pandas from various formats, so we are able to read files into a DataFrame again.

Introduction to Pandas

Contents

Introduction to Pandas#

Creating DataFrames#

DataFrames from dictionaries#

Saving and loading DataFrames#