Group by: split-apply-combine#
“group by” is an analysis process involving one or more of the following steps:
Splitting the data into groups based on some criteria
Applying a function to each group independently
Combining the results into a data structure
In the apply step, we might wish to do one of the following:
Aggregation: compute a summary statistic (or statistics) for each group, e.g., sums or means
Transformation: run group-specific computations and return a like-indexed object, e.g., standardize data or fill missing values within a group
Filter: discard some groups, according to a group-wise computation that evaluates to True or False, e.g., discard data based on a groups mean
See also:
import pandas as pd
import numpy as np
Let’s load the dataframe we saved in exercise notebook 03. It contains measurements of objects from two different files.
df = pd.read_csv('data/filled_measurements.csv', index_col=0, parse_dates=True, dtype={'label': 'string', 'valid': 'boolean'})
df.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 11 entries, 2023-01-01 07:00:00 to 2023-01-01 20:00:00
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 label 11 non-null string
1 valid 11 non-null boolean
2 value1 11 non-null float64
3 value2 11 non-null float64
4 value3 11 non-null float64
5 value4 11 non-null float64
6 value5 11 non-null float64
dtypes: boolean(1), float64(5), string(1)
memory usage: 638.0 bytes
Let’s say we want to compute the mean of all values for both valid and non-valid measurements.
Ignoring for the second the mechanics of how we would do this with Python, let’s think about it in English. What do we need to do?
Split the data set according to the ‘valid’ criterion, i.e., split it, so we have a separate data set for the two classes
True
andFalse
.Apply a mean function to the value columns.
Combine the results of these averages on the split data set into a new summary data set that contains the two classes.
We see that the strategy we want is a split-apply-combine strategy, and it turns out that this is a strategy we want to use very often.
Split the data in terms of some criterion.
Apply some function to the split data.
Combine the results into a new data frame.
Note that if the data is tidy, this procedure makes a lot of sense. Choose the column you want to use to split by. All rows with like entries in the splitting column are then grouped into a new data set. You can then apply any function you want into these new data sets. You can then combine the results into a new data frame.
Pandas’ split-apply-combine operations are achieved using the DataFrames groupby()
method. You can think of groupby()
as the splitting part. You can then apply functions to the resulting DataFrameGroupBy object. The Pandas documentation on split-apply-combine is worth reading through. It is extensive though, so don’t let yourself get intimidated by it.
df
label | valid | value1 | value2 | value3 | value4 | value5 | |
---|---|---|---|---|---|---|---|
timestamp | |||||||
2023-01-01 07:00:00 | A | True | 9.0 | 2.0 | 3.1 | 0.98 | 1.23 |
2023-01-01 08:00:00 | B | False | 11.0 | 3.5 | 0.4 | 0.40 | 1.11 |
2023-01-01 10:00:00 | D | True | 9.5 | 5.0 | 3.0 | 2.56 | 1.52 |
2023-01-01 10:15:00 | C | True | 10.2 | 8.0 | 5.6 | 2.85 | 1.93 |
2023-01-01 11:00:00 | E | False | 15.0 | 7.0 | 4.4 | 3.14 | 2.34 |
2023-01-01 12:00:00 | F | True | 400.0 | 4.0 | 2.2 | 2.71 | 2.45 |
2023-01-01 16:00:00 | G | False | 9.0 | 1.0 | 1.1 | 3.58 | 0.98 |
2023-01-01 17:00:00 | H | True | 11.0 | 4.0 | 0.9 | 4.12 | 1.05 |
2023-01-01 18:00:00 | I | True | 11.3 | 6.0 | 3.3 | 3.33 | 1.67 |
2023-01-01 19:00:00 | J | False | 10.0 | 9.0 | 4.1 | 2.98 | 2.89 |
2023-01-01 20:00:00 | K | True | 10.0 | 9.0 | 4.1 | 3.87 | 2.89 |
Aggregation of columns mean#
Split#
Let’s start splitting the data using groupby()
and have a look at the resulting object.
grouped = df.groupby('valid')
grouped
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x12d643460>
There is not much to see on the resulting DataFrameGroupBy object itself, but there is a lot we can do with it. First, let’s iterate through the groups and have a closer look:
for name, group in grouped:
print(name)
print(type(group))
False
<class 'pandas.core.frame.DataFrame'>
True
<class 'pandas.core.frame.DataFrame'>
Apply and Combine#
Using the .
operator on the DataFrameGroupBy object and hitting tab will show you the many possibilities. For most of these possibilities, the apply and combine steps are comprised and a new DataFrame is returned.
The .mean()
method is exactly what we want. Here, the numeric_only
option is set to exclude the string-typed columns from the calculation.
df_valid_mean = grouped.mean(numeric_only=True)
# Take a look
df_valid_mean
value1 | value2 | value3 | value4 | value5 | |
---|---|---|---|---|---|
valid | |||||
False | 11.250000 | 5.125000 | 2.500000 | 2.525000 | 1.83 |
True | 65.857143 | 5.428571 | 3.171429 | 2.917143 | 1.82 |
The resulting DataFrame has the means of all numerical columns. Note that the column ‘valid’ has been set as row index. If we want to keep ‘valid’ as a normal column, we can use the reset_index()
method.
df_valid_mean.reset_index()
valid | value1 | value2 | value3 | value4 | value5 | |
---|---|---|---|---|---|---|
0 | False | 11.250000 | 5.125000 | 2.500000 | 2.525000 | 1.83 |
1 | True | 65.857143 | 5.428571 | 3.171429 | 2.917143 | 1.82 |
Applying custom functions#
If we want to apply an aggregation function that is not built-in, we can also do so.
For example, let’s check numerical columns for outliers based on the standard deviation. We can define a generic function like shown below.
def check_for_outliers(data):
"""Check whether any outliers are present based on the standard deviation."""
mean, std = data.mean(), data.std()
if std == 0:
return False
outliers = np.abs(data - mean) > (2 * std)
return outliers.any()
Let’s group our data again and apply our custom function as an aggregating function to the numerical columns only.
df.groupby('valid')[df.select_dtypes(include="number").columns].agg(check_for_outliers)
value1 | value2 | value3 | value4 | value5 | |
---|---|---|---|---|---|
valid | |||||
False | False | False | False | False | False |
True | True | False | False | False | False |