Group by: split-apply-combine#
A “group by” operation is an analysis process involving one or more of the following steps:
Splitting the data into groups based on some criteria
Applying a function to each group independently
Combining the results into a data structure
In the apply step, we might wish to do one of the following:
Aggregation: compute a summary statistic (or statistics) for each group, e.g., sums or means
Transformation: run group-specific computations and return a like-indexed object, e.g., standardize data or fill missing values within a group
Filtration: discard some groups, according to a group-wise computation that evaluates to True or False, e.g., discard data based on a group’s mean
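The three flavors of the apply step can be sketched on a small toy DataFrame (hypothetical data, not from the notebook):

```python
import pandas as pd

# Hypothetical toy data: two groups of measurements
toy = pd.DataFrame({"group": ["a", "a", "b", "b"],
                    "value": [1.0, 3.0, 10.0, 30.0]})

# Aggregation: one summary number per group
means = toy.groupby("group")["value"].mean()

# Transformation: a like-indexed result (here, center values within each group)
centered = toy.groupby("group")["value"].transform(lambda s: s - s.mean())

# Filtration: keep only the groups whose mean exceeds 5
big = toy.groupby("group").filter(lambda d: d["value"].mean() > 5)
```

Aggregation shrinks each group to a single row, transformation returns a result with the same index as the input, and filtration returns the original rows of the surviving groups.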
import pandas as pd
import numpy as np
Let’s load the DataFrame we saved in exercise notebook 03. It contains measurements of objects from two different files.
df = pd.read_csv('../data/BBBC007_concat.csv', index_col=0)
df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 111 entries, 0 to 110
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 area 111 non-null int64
1 intensity_mean 111 non-null float64
2 major_axis_length 111 non-null float64
3 minor_axis_length 111 non-null float64
4 aspect_ratio 111 non-null float64
5 source_file 111 non-null object
dtypes: float64(4), int64(1), object(1)
memory usage: 6.1+ KB
Let’s say we want to compute the median “intensity_mean” of round objects and also discriminate these objects per source file. Ignoring for the moment the mechanics of how we would do this with Python, let’s think about it in English. What do we need to do?
Split the data set according to a ‘round’ criterion field, i.e., split it so we have a separate data set for each of the two classes: those which are round and those which are not.
Apply a median function to the intensity in these split data sets.
Combine the results of these medians on the split data sets into a new summary data set that contains the two classes (round and not round) and the median for each.
We see that the strategy we want is a split-apply-combine strategy, and it turns out that this is a strategy we want to use very often. Split the data in terms of some criterion. Apply some function to the split data. Combine the results into a new data frame.
Note that if the data is tidy, this procedure makes a lot of sense. Choose the column you want to split by. All rows with like entries in the splitting column are then grouped into a new data set. You can then apply any function you want to these new data sets, and combine the results into a new data frame.
Pandas’ split-apply-combine operations are achieved using the DataFrame’s groupby() method. You can think of groupby() as the splitting part. You can then apply functions to the resulting DataFrameGroupBy object. The Pandas documentation on split-apply-combine is worth reading through. It is extensive, though, so don’t let yourself get intimidated by it.
Before all that, let’s first create a new column with our criterion for roundness.
df['round'] = df['aspect_ratio'] < 1.2
df.head()
|   | area | intensity_mean | major_axis_length | minor_axis_length | aspect_ratio | source_file | round |
|---|---|---|---|---|---|---|---|
| 0 | 256 | 93.250000 | 19.995017 | 17.021559 | 1.174688 | BBBC007_20P1_POS0007_D_1UL | True |
| 1 | 90 | 82.488889 | 15.939969 | 7.516326 | 2.120713 | BBBC007_20P1_POS0007_D_1UL | False |
| 2 | 577 | 90.637782 | 35.324458 | 21.759434 | 1.623409 | BBBC007_20P1_POS0007_D_1UL | False |
| 3 | 270 | 95.640741 | 20.229431 | 17.669052 | 1.144908 | BBBC007_20P1_POS0007_D_1UL | True |
| 4 | 153 | 84.908497 | 15.683703 | 12.420475 | 1.262730 | BBBC007_20P1_POS0007_D_1UL | False |
Aggregation of median intensity#
Split#
Let’s start splitting the data using groupby() and have a look at the resulting object.
grouped = df.groupby('round')
grouped
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x10a3bc1f0>
There is not much to see on the resulting DataFrameGroupBy object itself, but there is a lot we can do with it. First, let’s iterate through the groups and have a closer look:
for name, group in grouped:
    print(name)
    print(type(group))
False
<class 'pandas.core.frame.DataFrame'>
True
<class 'pandas.core.frame.DataFrame'>
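A single group can also be pulled out directly with get_group(), without iterating. A minimal sketch on made-up toy data (the notebook’s df comes from the CSV file, so a stand-in is used here):

```python
import pandas as pd

# Hypothetical toy data mimicking the 'round' column
toy = pd.DataFrame({"round": [True, False, True],
                    "area": [256, 90, 270]})
grouped = toy.groupby("round")

# get_group() returns the sub-DataFrame for one group key
round_objects = grouped.get_group(True)
print(round_objects["area"].tolist())  # [256, 270]
```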
Apply and Combine#
Typing the . operator on the DataFrameGroupBy object and hitting tab will show you the many possibilities. For most of them, the apply and combine steps are performed together and a new DataFrame is returned. The .median() method is exactly what we want. Here, the numeric_only option is set to exclude the object-typed ‘source_file’ column from the calculation.
df_median = grouped.median(numeric_only=True)
# Take a look
df_median
| round | area | intensity_mean | major_axis_length | minor_axis_length | aspect_ratio |
|---|---|---|---|---|---|
| False | 270.0 | 92.788345 | 21.459495 | 15.858324 | 1.412849 |
| True | 291.0 | 100.256000 | 20.155547 | 18.352287 | 1.101700 |
The resulting DataFrame has the medians of all numerical columns, including the intensities that we wanted. Note that the column ‘round’ has been set as the row index. If we want to keep ‘round’ as a normal column (which, remember, is what we used to split up the data before we computed the summary statistics), we can use the reset_index() method.
df_median = df_median.reset_index()
# Alternatively, do:
# df_median.reset_index(inplace=True)
df_median
|   | round | area | intensity_mean | major_axis_length | minor_axis_length | aspect_ratio |
|---|---|---|---|---|---|---|
| 0 | False | 270.0 | 92.788345 | 21.459495 | 15.858324 | 1.412849 |
| 1 | True | 291.0 | 100.256000 | 20.155547 | 18.352287 | 1.101700 |
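An alternative to calling reset_index() afterwards is to pass as_index=False to groupby(), which keeps the grouping column as a regular column from the start. A sketch on toy data:

```python
import pandas as pd

# Hypothetical toy data standing in for the notebook's DataFrame
toy = pd.DataFrame({"round": [True, False, True, False],
                    "area": [256.0, 90.0, 270.0, 577.0]})

# as_index=False: 'round' stays a normal column in the result
medians = toy.groupby("round", as_index=False).median()
print(medians.columns.tolist())  # ['round', 'area']
```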
Split on multiple columns#
Suppose we wish to look at four groups: round / not round from the first file, and round / not round from the second file. To do this, we can pass a list of columns to the groupby() method. We will chain the commands, performing a groupby, applying a median, and then resetting the index of the result, all in one line.
df.groupby(['round', 'source_file']).median().reset_index()
|   | round | source_file | area | intensity_mean | major_axis_length | minor_axis_length | aspect_ratio |
|---|---|---|---|---|---|---|---|
| 0 | False | BBBC007_20P1_POS0007_D_1UL | 323.0 | 91.796791 | 23.755227 | 17.072477 | 1.467410 |
| 1 | False | BBBC007_20P1_POS0010_D_1UL | 237.0 | 93.269113 | 20.410737 | 14.832035 | 1.353858 |
| 2 | True | BBBC007_20P1_POS0007_D_1UL | 293.0 | 98.227799 | 20.307965 | 18.599043 | 1.101700 |
| 3 | True | BBBC007_20P1_POS0010_D_1UL | 277.5 | 103.299825 | 19.662330 | 17.680741 | 1.103133 |
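If only the intensity median is needed, the column can be selected from the grouped object before aggregating, so medians are computed for just that column. A sketch on toy data using the notebook’s column names (the values here are made up):

```python
import pandas as pd

# Hypothetical toy data with the notebook's column names
toy = pd.DataFrame({
    "round": [True, True, False, False],
    "source_file": ["file1", "file1", "file2", "file2"],
    "intensity_mean": [100.0, 104.0, 92.0, 94.0],
})

# Select a single column from the DataFrameGroupBy before aggregating
result = (toy.groupby(["round", "source_file"])["intensity_mean"]
             .median()
             .reset_index())
```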
Applying custom functions#
If we want to apply a function that is not built-in, we can also do so. For example, let’s compute the coefficient of variation. We can define a generic function that calculates it as shown below.
def coefficient_of_variance(data):
    """Compute coefficient of variation from an array of data."""
    return np.std(data) / np.mean(data)
Let’s group our data again and apply our custom function as an aggregating function. Note that any non-numeric columns that are not grouping keys would break this calculation and should be excluded first; here, ‘round’ and ‘source_file’ are both used for grouping, so only numeric columns remain.
grouped = df.groupby(['round', 'source_file'])
grouped.agg(coefficient_of_variance).reset_index()
|   | round | source_file | area | intensity_mean | major_axis_length | minor_axis_length | aspect_ratio |
|---|---|---|---|---|---|---|---|
| 0 | False | BBBC007_20P1_POS0007_D_1UL | 0.540399 | 0.145956 | 0.349857 | 0.289063 | 0.243450 |
| 1 | False | BBBC007_20P1_POS0010_D_1UL | 0.765156 | 0.143506 | 0.447638 | 0.402708 | 0.316206 |
| 2 | True | BBBC007_20P1_POS0007_D_1UL | 0.248799 | 0.099636 | 0.145247 | 0.147626 | 0.036950 |
| 3 | True | BBBC007_20P1_POS0010_D_1UL | 0.275120 | 0.107008 | 0.167722 | 0.166214 | 0.043755 |
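agg() is not limited to one function for all columns: with named aggregations, different functions can be assigned to different columns in a single call. A sketch on made-up values, redefining the custom function from above so the snippet stands alone:

```python
import pandas as pd
import numpy as np

def coefficient_of_variance(data):
    """Compute coefficient of variation from an array of data."""
    return np.std(data) / np.mean(data)

# Hypothetical toy data with the notebook's column names
toy = pd.DataFrame({
    "round": [True, True, False, False],
    "area": [256.0, 270.0, 90.0, 577.0],
    "intensity_mean": [93.0, 95.0, 82.0, 90.0],
})

# Named aggregation: new_column=(source_column, function)
summary = toy.groupby("round").agg(
    cv_area=("area", coefficient_of_variance),
    median_intensity=("intensity_mean", "median"),
).reset_index()
```

The result has one row per group, with a custom statistic for ‘area’ and a built-in one for ‘intensity_mean’.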