Group by: split-apply-combine#

“Group by” is an analysis process involving one or more of the following steps:

  • Splitting the data into groups based on some criteria

  • Applying a function to each group independently

  • Combining the results into a data structure

In the apply step, we might wish to do one of the following:

  • Aggregation: compute a summary statistic (or statistics) for each group, e.g., sums or means

  • Transformation: run group-specific computations and return a like-indexed object, e.g., standardize data or fill missing values within a group

  • Filter: discard some groups, according to a group-wise computation that evaluates to True or False, e.g., discard data based on a group’s mean
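The three flavors of the apply step can be sketched on a small, hypothetical toy frame (the column names `group` and `x` are made up for illustration):

```python
import pandas as pd

# Hypothetical toy frame: two groups, one numeric column
df = pd.DataFrame({
    "group": ["a", "a", "b", "b"],
    "x": [1.0, 3.0, 10.0, 30.0],
})
g = df.groupby("group")["x"]

# Aggregation: one summary value per group
means = g.mean()

# Transformation: a like-indexed result (standardize within each group)
z = g.transform(lambda s: (s - s.mean()) / s.std())

# Filter: keep only rows from groups whose mean exceeds a threshold
big = df.groupby("group").filter(lambda d: d["x"].mean() > 5)
```

Note that the aggregation collapses each group to one row, the transformation returns a result with the same index as the input, and the filter returns a subset of the original rows.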


import pandas as pd
import numpy as np

Let’s load the dataframe we saved in exercise notebook 03. It contains measurements of objects from two different files.

df = pd.read_csv('data/filled_measurements.csv', index_col=0, parse_dates=True, dtype={'label': 'string', 'valid': 'boolean'})
df.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 11 entries, 2023-01-01 07:00:00 to 2023-01-01 20:00:00
Data columns (total 7 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   label   11 non-null     string 
 1   valid   11 non-null     boolean
 2   value1  11 non-null     float64
 3   value2  11 non-null     float64
 4   value3  11 non-null     float64
 5   value4  11 non-null     float64
 6   value5  11 non-null     float64
dtypes: boolean(1), float64(5), string(1)
memory usage: 638.0 bytes

Let’s say we want to compute the mean of all values for both valid and non-valid measurements.

Ignoring for the moment the mechanics of how we would do this with Python, let’s think about it in English. What do we need to do?

  • Split the data set according to the ‘valid’ criterion, so that we have a separate data set for each of the two classes, True and False.

  • Apply a mean function to the value columns.

  • Combine the resulting averages into a new summary data set with one row per class.

We see that the strategy we want is split-apply-combine, and it turns out to be a strategy we will use very often:

  • Split the data in terms of some criterion.

  • Apply some function to the split data.

  • Combine the results into a new data frame.

Note that if the data is tidy, this procedure makes a lot of sense. Choose the column you want to split by; all rows with like entries in that column are grouped into a new data set. You can then apply any function you want to each of these new data sets and combine the results into a new data frame.

Pandas’ split-apply-combine operations are achieved using the DataFrame’s groupby() method. You can think of groupby() as the splitting part. You can then apply functions to the resulting DataFrameGroupBy object. The Pandas documentation on split-apply-combine is worth reading through. It is extensive though, so don’t let yourself get intimidated by it.

df
label valid value1 value2 value3 value4 value5
timestamp
2023-01-01 07:00:00 A True 9.0 2.0 3.1 0.98 1.23
2023-01-01 08:00:00 B False 11.0 3.5 0.4 0.40 1.11
2023-01-01 10:00:00 D True 9.5 5.0 3.0 2.56 1.52
2023-01-01 10:15:00 C True 10.2 8.0 5.6 2.85 1.93
2023-01-01 11:00:00 E False 15.0 7.0 4.4 3.14 2.34
2023-01-01 12:00:00 F True 400.0 4.0 2.2 2.71 2.45
2023-01-01 16:00:00 G False 9.0 1.0 1.1 3.58 0.98
2023-01-01 17:00:00 H True 11.0 4.0 0.9 4.12 1.05
2023-01-01 18:00:00 I True 11.3 6.0 3.3 3.33 1.67
2023-01-01 19:00:00 J False 10.0 9.0 4.1 2.98 2.89
2023-01-01 20:00:00 K True 10.0 9.0 4.1 3.87 2.89

Aggregation of columns mean#

Split#

Let’s start splitting the data using groupby() and have a look at the resulting object.

grouped = df.groupby('valid')
grouped
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x12d643460>

There is not much to see on the resulting DataFrameGroupBy object itself, but there is a lot we can do with it. First, let’s iterate through the groups and have a closer look:

for name, group in grouped:
    print(name)
    print(type(group))
False
<class 'pandas.core.frame.DataFrame'>
True
<class 'pandas.core.frame.DataFrame'>
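Besides iterating, individual groups can be inspected and pulled out directly. A minimal sketch, using a small hypothetical frame standing in for the measurements data:

```python
import pandas as pd

# Small stand-in for the measurements frame
df = pd.DataFrame({
    "valid": [True, False, True, False],
    "value1": [9.0, 11.0, 10.2, 15.0],
})
grouped = df.groupby("valid")

# Mapping from group label to the row labels it contains
print(grouped.groups)

# Extract one group as a regular DataFrame
valid_rows = grouped.get_group(True)
```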

Apply and Combine#

Using the . operator on the DataFrameGroupBy object and hitting tab will show you the many possibilities. For most of these, the apply and combine steps are performed in one go and a new DataFrame is returned.

The .mean() method is exactly what we want. Here, the numeric_only option is set to exclude the string-typed columns from the calculation.

df_valid_mean = grouped.mean(numeric_only=True)

# Take a look
df_valid_mean
value1 value2 value3 value4 value5
valid
False 11.250000 5.125000 2.500000 2.525000 1.83
True 65.857143 5.428571 3.171429 2.917143 1.82

The resulting DataFrame has the means of all numerical columns. Note that the column ‘valid’ has been set as the row index. If we want to keep ‘valid’ as a normal column, we can use the reset_index() method.

df_valid_mean.reset_index()
valid value1 value2 value3 value4 value5
0 False 11.250000 5.125000 2.500000 2.525000 1.83
1 True 65.857143 5.428571 3.171429 2.917143 1.82
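Alternatively, groupby() accepts an as_index=False argument, so the grouping column stays a regular column from the start. A sketch with hypothetical toy data:

```python
import pandas as pd

# Hypothetical toy data standing in for the measurements frame
df = pd.DataFrame({
    "valid": [True, False, True, False],
    "value1": [9.0, 11.0, 10.0, 13.0],
})

# as_index=False keeps 'valid' as a normal column,
# so no reset_index() call is needed afterwards
out = df.groupby("valid", as_index=False).mean(numeric_only=True)
```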

Applying custom functions#

If we want to apply an aggregation function that is not built-in, we can also do so.

For example, let’s check the numerical columns for outliers based on the standard deviation. We can define a generic function as shown below.

def check_for_outliers(data):
    """Check whether any outliers are present based on the standard deviation."""
    mean, std = data.mean(), data.std()
    if std == 0:
        return False
    outliers = np.abs(data - mean) > (2 * std)
    return outliers.any()

Let’s group our data again and apply our custom function as an aggregating function to the numerical columns only.

df.groupby('valid')[df.select_dtypes(include="number").columns].agg(check_for_outliers)
value1 value2 value3 value4 value5
valid
False False False False False False
True True False False False False
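agg() also accepts a list mixing built-in function names and custom functions, producing one output column per function. A sketch using the ‘valid’ and ‘value1’ numbers from the frame above (the custom column is named after the function’s __name__):

```python
import pandas as pd
import numpy as np

def check_for_outliers(data):
    """Check whether any outliers are present based on the standard deviation."""
    mean, std = data.mean(), data.std()
    if std == 0:
        return False
    outliers = np.abs(data - mean) > (2 * std)
    return outliers.any()

# The 'valid' and 'value1' columns from the measurements frame
df = pd.DataFrame({
    "valid": [True, False, True, True, False, True, False, True, True, False, True],
    "value1": [9.0, 11.0, 9.5, 10.2, 15.0, 400.0, 9.0, 11.0, 11.3, 10.0, 10.0],
})

# One column per function: a 'mean' column and a 'check_for_outliers' column
summary = df.groupby("valid")["value1"].agg(["mean", check_for_outliers])
```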