Group by: split-apply-combine#
A “group by” operation is an analysis process involving one or more of the following steps:
Splitting the data into groups based on some criteria
Applying a function to each group independently
Combining the results into a data structure
In the apply step, we might wish to do one of the following:
Aggregation: compute a summary statistic (or statistics) for each group, e.g., sums or means
Transformation: run group-specific computations and return a like-indexed object, e.g., standardize data or fill missing values within a group
Filtration: discard some groups, according to a group-wise computation that evaluates to True or False, e.g., discard data based on a group’s mean
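The three flavors of the apply step can be sketched on a small toy DataFrame (hypothetical data, not from the notebook):

```python
import pandas as pd

# Hypothetical toy data: two groups of measurements
toy = pd.DataFrame({"group": ["a", "a", "b", "b"],
                    "value": [1.0, 3.0, 10.0, 30.0]})

# Aggregation: one summary number per group
means = toy.groupby("group")["value"].mean()

# Transformation: a like-indexed result (here, center values within each group)
centered = toy.groupby("group")["value"].transform(lambda s: s - s.mean())

# Filtration: keep only the groups whose mean exceeds 5
big = toy.groupby("group").filter(lambda d: d["value"].mean() > 5)
```

Aggregation shrinks each group to a single row, transformation returns a result with the same index as the input, and filtration returns the original rows of the surviving groups.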
import pandas as pd
import numpy as np
Let’s load the DataFrame we saved in exercise notebook 03. It contains measurements of objects from two different files.
df = pd.read_csv('../data/BBBC007_concat.csv', index_col=0)
df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 111 entries, 0 to 110
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 area 111 non-null int64
1 intensity_mean 111 non-null float64
2 major_axis_length 111 non-null float64
3 minor_axis_length 111 non-null float64
4 aspect_ratio 111 non-null float64
5 source_file 111 non-null object
dtypes: float64(4), int64(1), object(1)
memory usage: 6.1+ KB
Let’s say we want to compute the median “intensity_mean” of round objects and also discriminate these objects per source file. Ignoring for the moment the mechanics of how we would do this with Python, let’s think about it in English. What do we need to do?
Split the data set according to a ‘round’ criterion field, i.e., split it so we have a separate data set for each of the two classes: those which are round and those which are not.
Apply a median function to the intensity in these split data sets.
Combine the results of these medians on the split data sets into a new summary data set that contains the two classes (round and not round) and the median for each.
We see that the strategy we want is a split-apply-combine strategy, and it turns out that this is a strategy we want to use very often. Split the data in terms of some criterion. Apply some function to the split data. Combine the results into a new data frame.
Note that if the data is tidy, this procedure makes a lot of sense. Choose the column you want to split by. All rows with like entries in the splitting column are then grouped into a new data set. You can then apply any function you want to these new data sets, and combine the results into a new data frame.
Pandas’ split-apply-combine operations are achieved using the DataFrame’s groupby() method. You can think of groupby() as the splitting part. You can then apply functions to the resulting DataFrameGroupBy object. The Pandas documentation on split-apply-combine is worth reading through. It is extensive, though, so don’t let yourself get intimidated by it.
Before all that, let’s first create a new column with our criterion for roundness.
df['round'] = df['aspect_ratio'] < 1.2
df.head()
|   | area | intensity_mean | major_axis_length | minor_axis_length | aspect_ratio | source_file | round |
|---|---|---|---|---|---|---|---|
| 0 | 256 | 93.250000 | 19.995017 | 17.021559 | 1.174688 | BBBC007_20P1_POS0007_D_1UL | True |
| 1 | 90 | 82.488889 | 15.939969 | 7.516326 | 2.120713 | BBBC007_20P1_POS0007_D_1UL | False |
| 2 | 577 | 90.637782 | 35.324458 | 21.759434 | 1.623409 | BBBC007_20P1_POS0007_D_1UL | False |
| 3 | 270 | 95.640741 | 20.229431 | 17.669052 | 1.144908 | BBBC007_20P1_POS0007_D_1UL | True |
| 4 | 153 | 84.908497 | 15.683703 | 12.420475 | 1.262730 | BBBC007_20P1_POS0007_D_1UL | False |
Aggregation of median intensity#
Split#
Let’s start splitting the data using groupby() and have a look at the resulting object.
grouped = df.groupby('round')
grouped
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x10a3bc1f0>
There is not much to see on the resulting DataFrameGroupBy object itself, but there is a lot we can do with it. First, let’s iterate through the groups and have a closer look:
for name, group in grouped:
    print(name)
    print(type(group))
False
<class 'pandas.core.frame.DataFrame'>
True
<class 'pandas.core.frame.DataFrame'>
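A single group can also be pulled out directly with get_group(), without iterating. A minimal sketch on made-up toy data (the notebook’s df comes from the CSV file, so a stand-in is used here):

```python
import pandas as pd

# Hypothetical toy data mimicking the 'round' column
toy = pd.DataFrame({"round": [True, False, True],
                    "area": [256, 90, 270]})
grouped = toy.groupby("round")

# get_group() returns the sub-DataFrame for one group key
round_objects = grouped.get_group(True)
print(round_objects["area"].tolist())  # [256, 270]
```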
Apply and Combine#
Typing the . operator on the DataFrameGroupBy object and hitting tab will show you the many possibilities. For most of them, the apply and combine steps are performed together and a new DataFrame is returned. The .median() method is exactly what we want. Here, the numeric_only option is set to exclude the object-typed ‘source_file’ column from the calculation.
df_median = grouped.median(numeric_only=True)
# Take a look
df_median
| round | area | intensity_mean | major_axis_length | minor_axis_length | aspect_ratio |
|---|---|---|---|---|---|
| False | 270.0 | 92.788345 | 21.459495 | 15.858324 | 1.412849 |
| True | 291.0 | 100.256000 | 20.155547 | 18.352287 | 1.101700 |
The resulting DataFrame has the medians of all numerical columns, including the intensities that we wanted. Note that the column ‘round’ has been set as the row index. If we want to keep ‘round’ as a normal column (which, remember, is what we used to split up the data before we computed the summary statistics), we can use the reset_index() method.
df_median = df_median.reset_index()
# Alternatively, do:
# df_median.reset_index(inplace=True)
df_median
|   | round | area | intensity_mean | major_axis_length | minor_axis_length | aspect_ratio |
|---|---|---|---|---|---|---|
| 0 | False | 270.0 | 92.788345 | 21.459495 | 15.858324 | 1.412849 |
| 1 | True | 291.0 | 100.256000 | 20.155547 | 18.352287 | 1.101700 |
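An alternative to calling reset_index() afterwards is to pass as_index=False to groupby(), which keeps the grouping column as a regular column from the start. A sketch on toy data:

```python
import pandas as pd

# Hypothetical toy data standing in for the notebook's DataFrame
toy = pd.DataFrame({"round": [True, False, True, False],
                    "area": [256.0, 90.0, 270.0, 577.0]})

# as_index=False: 'round' stays a normal column in the result
medians = toy.groupby("round", as_index=False).median()
print(medians.columns.tolist())  # ['round', 'area']
```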
Split on multiple columns#
Suppose we wish to look at four groups: round / not round from the first file, and round / not round from the second file. To do this, we can pass a list of columns to the groupby() method. We will chain the commands, performing a groupby, applying a median, and then resetting the index of the result, all in one line.
df.groupby(['round', 'source_file']).median().reset_index()
|   | round | source_file | area | intensity_mean | major_axis_length | minor_axis_length | aspect_ratio |
|---|---|---|---|---|---|---|---|
| 0 | False | BBBC007_20P1_POS0007_D_1UL | 323.0 | 91.796791 | 23.755227 | 17.072477 | 1.467410 |
| 1 | False | BBBC007_20P1_POS0010_D_1UL | 237.0 | 93.269113 | 20.410737 | 14.832035 | 1.353858 |
| 2 | True | BBBC007_20P1_POS0007_D_1UL | 293.0 | 98.227799 | 20.307965 | 18.599043 | 1.101700 |
| 3 | True | BBBC007_20P1_POS0010_D_1UL | 277.5 | 103.299825 | 19.662330 | 17.680741 | 1.103133 |
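If only the intensity median is needed, the column can be selected from the grouped object before aggregating, so medians are computed for just that column. A sketch on toy data using the notebook’s column names (the values here are made up):

```python
import pandas as pd

# Hypothetical toy data with the notebook's column names
toy = pd.DataFrame({
    "round": [True, True, False, False],
    "source_file": ["file1", "file1", "file2", "file2"],
    "intensity_mean": [100.0, 104.0, 92.0, 94.0],
})

# Select a single column from the DataFrameGroupBy before aggregating
result = (toy.groupby(["round", "source_file"])["intensity_mean"]
             .median()
             .reset_index())
```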
Applying custom functions#
If we want to apply a function that is not built-in, we can also do so. For example, let’s compute the coefficient of variation. We can define a generic function that calculates it as shown below.
def coefficient_of_variance(data):
    """Compute coefficient of variation from an array of data."""
    return np.std(data) / np.mean(data)
Let’s group our data again and apply our custom function as an aggregating function. Note that any non-numeric columns that are not grouping keys would break this calculation and should be excluded first; here, ‘round’ and ‘source_file’ are both used for grouping, so only numeric columns remain.
grouped = df.groupby(['round', 'source_file'])
grouped.agg(coefficient_of_variance).reset_index()
|   | round | source_file | area | intensity_mean | major_axis_length | minor_axis_length | aspect_ratio |
|---|---|---|---|---|---|---|---|
| 0 | False | BBBC007_20P1_POS0007_D_1UL | 0.540399 | 0.145956 | 0.349857 | 0.289063 | 0.243450 |
| 1 | False | BBBC007_20P1_POS0010_D_1UL | 0.765156 | 0.143506 | 0.447638 | 0.402708 | 0.316206 |
| 2 | True | BBBC007_20P1_POS0007_D_1UL | 0.248799 | 0.099636 | 0.145247 | 0.147626 | 0.036950 |
| 3 | True | BBBC007_20P1_POS0010_D_1UL | 0.275120 | 0.107008 | 0.167722 | 0.166214 | 0.043755 |
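agg() is not limited to one function for all columns: with named aggregations, different functions can be assigned to different columns in a single call. A sketch on made-up values, redefining the custom function from above so the snippet stands alone:

```python
import pandas as pd
import numpy as np

def coefficient_of_variance(data):
    """Compute coefficient of variation from an array of data."""
    return np.std(data) / np.mean(data)

# Hypothetical toy data with the notebook's column names
toy = pd.DataFrame({
    "round": [True, True, False, False],
    "area": [256.0, 270.0, 90.0, 577.0],
    "intensity_mean": [93.0, 95.0, 82.0, 90.0],
})

# Named aggregation: new_column=(source_column, function)
summary = toy.groupby("round").agg(
    cv_area=("area", coefficient_of_variance),
    median_intensity=("intensity_mean", "median"),
).reset_index()
```

The result has one row per group, with a custom statistic for ‘area’ and a built-in one for ‘intensity_mean’.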