Group by: split-apply-combine#

“Group by” refers to an analysis process involving one or more of the following steps:

  • Splitting the data into groups based on some criteria

  • Applying a function to each group independently

  • Combining the results into a data structure

In the apply step, we might wish to do one of the following:

  • Aggregation: compute a summary statistic (or statistics) for each group, e.g., sums or means

  • Transformation: run group-specific computations and return a like-indexed object, e.g., standardize data or fill missing values within a group

  • Filter: discard some groups, according to a group-wise computation that evaluates to True or False, e.g., discard data based on a group’s mean
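
The three flavors above can be sketched on a small made-up DataFrame (the column names `group` and `value` are illustrative, not from this notebook):

```python
import pandas as pd

# Small made-up frame to illustrate the three kinds of apply step
df = pd.DataFrame({
    'group': ['a', 'a', 'b', 'b'],
    'value': [1.0, 3.0, 10.0, 30.0],
})

# Aggregation: one summary value per group
sums = df.groupby('group')['value'].sum()

# Transformation: a like-indexed result (de-meaning within each group)
demeaned = df['value'] - df.groupby('group')['value'].transform('mean')

# Filter: keep only the groups whose mean value is below 10
small = df.groupby('group').filter(lambda g: g['value'].mean() < 10)
```

Aggregation shrinks the data to one row per group, transformation keeps the original shape, and filter keeps or drops whole groups.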


import pandas as pd
import numpy as np

Let’s load the dataframe we saved in exercise notebook 03. It contains measurements of objects from two different files.

df = pd.read_csv('../data/BBBC007_concat.csv', index_col=0)
df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 111 entries, 0 to 110
Data columns (total 6 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   area               111 non-null    int64  
 1   intensity_mean     111 non-null    float64
 2   major_axis_length  111 non-null    float64
 3   minor_axis_length  111 non-null    float64
 4   aspect_ratio       111 non-null    float64
 5   source_file        111 non-null    object 
dtypes: float64(4), int64(1), object(1)
memory usage: 6.1+ KB

Let’s say we want to compute the median “intensity_mean” of round objects and also to distinguish these objects per source file. Ignoring for the moment the mechanics of how we would do this with Python, let’s think about it in English. What do we need to do?

  • Split the data set according to a ‘round’ criterion, i.e., split it so we have a separate data set for each of the two classes: those that are round and those that are not.

  • Apply a median function to the intensity in these split data sets.

  • Combine the medians computed on the split data sets into a new summary data set that contains the two classes (round and not round) and the median intensity for each.

We see that the strategy we want is a split-apply-combine strategy, and it turns out that this is a strategy we want to use very often. Split the data in terms of some criterion. Apply some function to the split data. Combine the results into a new data frame.

Note that if the data is tidy, this procedure makes a lot of sense. Choose the column you want to split by. All rows with like entries in the splitting column are then grouped into a new data set. You can then apply any function you want to each of these new data sets, and combine the results into a new data frame.

Pandas’ split-apply-combine operations are achieved using the DataFrame’s groupby() method. You can think of groupby() as the splitting part. You can then apply functions to the resulting DataFrameGroupBy object. The Pandas documentation on split-apply-combine is worth reading through. It is extensive though, so don’t let yourself get intimidated by it.

Before all that, let’s first create a new column with our criterion for roundness.

df['round'] = df['aspect_ratio'] < 1.2
df.head()
area intensity_mean major_axis_length minor_axis_length aspect_ratio source_file round
0 256 93.250000 19.995017 17.021559 1.174688 BBBC007_20P1_POS0007_D_1UL True
1 90 82.488889 15.939969 7.516326 2.120713 BBBC007_20P1_POS0007_D_1UL False
2 577 90.637782 35.324458 21.759434 1.623409 BBBC007_20P1_POS0007_D_1UL False
3 270 95.640741 20.229431 17.669052 1.144908 BBBC007_20P1_POS0007_D_1UL True
4 153 84.908497 15.683703 12.420475 1.262730 BBBC007_20P1_POS0007_D_1UL False

Aggregation of median intensity#

Split#

Let’s start splitting the data using groupby() and have a look at the resulting object.

grouped = df.groupby('round')
grouped
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x10a3bc1f0>

There is not much to see in the resulting DataFrameGroupBy object itself, but there is a lot we can do with it. First, let’s iterate through the groups and have a closer look:

for name, group in grouped:
    print(name)
    print(type(group))
False
<class 'pandas.core.frame.DataFrame'>
True
<class 'pandas.core.frame.DataFrame'>
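
A single group can also be pulled out directly with the get_group() method, without looping. A minimal sketch on a made-up stand-in frame (only the ‘round’ and ‘intensity_mean’ columns are reproduced here):

```python
import pandas as pd

# Made-up stand-in for the notebook's DataFrame
df = pd.DataFrame({
    'round': [True, False, True],
    'intensity_mean': [100.0, 90.0, 110.0],
})

grouped = df.groupby('round')

# get_group() returns the sub-DataFrame for one group key
round_objects = grouped.get_group(True)
```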

Apply and Combine#

Using the . operator on the DataFrameGroupBy object and hitting tab will show you the many possibilities. Most of these carry out the apply and combine steps in one go and return a new DataFrame. The .median() method is exactly what we want. Here, the numeric_only option is set to exclude the object-typed ‘source_file’ column from the calculation.

df_median = grouped.median(numeric_only=True)

# Take a look
df_median
area intensity_mean major_axis_length minor_axis_length aspect_ratio
round
False 270.0 92.788345 21.459495 15.858324 1.412849
True 291.0 100.256000 20.155547 18.352287 1.101700

The resulting DataFrame has the medians of all numerical columns, including the intensities that we wanted. Note that the column ‘round’ has been set as row index. If we want to keep ‘round’ as a normal column (which, remember, is what we used to split up the data before we computed the summary statistics), we can use the reset_index() method.

df_median = df_median.reset_index()
# Alternatively, do:
# df_median.reset_index(inplace=True)
df_median
round area intensity_mean major_axis_length minor_axis_length aspect_ratio
0 False 270.0 92.788345 21.459495 15.858324 1.412849
1 True 291.0 100.256000 20.155547 18.352287 1.101700
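
If only the intensity medians are wanted, the column of interest can be selected from the grouped object before aggregating. A sketch on a small made-up frame with the same column names:

```python
import pandas as pd

# Small made-up stand-in with the columns used above
df = pd.DataFrame({
    'round': [True, True, False, False],
    'intensity_mean': [100.0, 102.0, 90.0, 94.0],
    'area': [290, 292, 268, 272],
})

# Select the column first, then aggregate:
# the result is a Series indexed by 'round', not a full DataFrame
intensity_medians = df.groupby('round')['intensity_mean'].median()
```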

Split on multiple columns#

Assume we wish to look at four groups: round / not round from the first file, and round / not round from the second file. To do this, we can pass a list of columns to the groupby() method. We will chain the commands, performing a groupby, applying a median, and then resetting the index of the result, all in one line.

df.groupby(['round', 'source_file']).median().reset_index()
round source_file area intensity_mean major_axis_length minor_axis_length aspect_ratio
0 False BBBC007_20P1_POS0007_D_1UL 323.0 91.796791 23.755227 17.072477 1.467410
1 False BBBC007_20P1_POS0010_D_1UL 237.0 93.269113 20.410737 14.832035 1.353858
2 True BBBC007_20P1_POS0007_D_1UL 293.0 98.227799 20.307965 18.599043 1.101700
3 True BBBC007_20P1_POS0010_D_1UL 277.5 103.299825 19.662330 17.680741 1.103133
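
As an alternative to chaining .reset_index(), groupby() can be asked not to turn the grouping keys into an index in the first place, via as_index=False. A sketch on a made-up frame (file names shortened to ‘f1’ and ‘f2’):

```python
import pandas as pd

# Made-up stand-in frame
df = pd.DataFrame({
    'round': [True, True, False],
    'source_file': ['f1', 'f2', 'f1'],
    'area': [290.0, 292.0, 270.0],
})

# as_index=False keeps the grouping keys as regular columns,
# so no reset_index() is needed afterwards
result = df.groupby(['round', 'source_file'], as_index=False)['area'].median()
```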

Applying custom functions#

If we want to apply a function that is not built-in, we can also do so. For example, let’s compute the coefficient of variation. We can define a generic function that calculates it as shown below.

def coefficient_of_variance(data):
    """Compute coefficient of variation from an array of data."""
    return np.std(data) / np.mean(data)
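
A quick sanity check of the helper on a small array (the definition is repeated so the snippet runs standalone). Note that np.std defaults to the population standard deviation (ddof=0), unlike pandas’ .std(), which defaults to ddof=1:

```python
import numpy as np

def coefficient_of_variance(data):
    """Compute coefficient of variation from an array of data."""
    return np.std(data) / np.mean(data)

# For [2, 4, 6]: mean = 4.0, population std = sqrt(8/3)
cv = coefficient_of_variance(np.array([2.0, 4.0, 6.0]))
```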

Let’s group our data again and apply our custom function as an aggregating function. Note that any non-numeric columns that are not part of the grouping keys would have to be excluded first, since the function cannot be applied to them.

grouped = df.groupby(['round', 'source_file'])
grouped.agg(coefficient_of_variance).reset_index()
round source_file area intensity_mean major_axis_length minor_axis_length aspect_ratio
0 False BBBC007_20P1_POS0007_D_1UL 0.540399 0.145956 0.349857 0.289063 0.243450
1 False BBBC007_20P1_POS0010_D_1UL 0.765156 0.143506 0.447638 0.402708 0.316206
2 True BBBC007_20P1_POS0007_D_1UL 0.248799 0.099636 0.145247 0.147626 0.036950
3 True BBBC007_20P1_POS0010_D_1UL 0.275120 0.107008 0.167722 0.166214 0.043755
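
As an aside, .agg() can also apply different statistics to different columns in one pass via named aggregation, where each keyword names an output column and maps to an (input column, function) pair. A sketch on a small made-up frame:

```python
import pandas as pd

# Made-up stand-in frame
df = pd.DataFrame({
    'round': [True, True, False, False],
    'area': [280.0, 300.0, 260.0, 280.0],
    'intensity_mean': [100.0, 104.0, 90.0, 96.0],
})

# Named aggregation: output column name = (input column, function)
summary = df.groupby('round').agg(
    area_median=('area', 'median'),
    intensity_median=('intensity_mean', 'median'),
).reset_index()
```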