Appending and extending DataFrames#

When processing different data, e.g., multiple images, potentially using multiple image processing libraries, a common task is to combine the resulting DataFrames or tables. Pandas provides a couple of approaches to do this.

We start with two small tables of measurements that could have been obtained from different functions or different libraries.

import pandas as pd
table1 = pd.DataFrame({
    "label":       [1,   2,   3],
    "circularity": [0.3, 0.5, 0.7],
    "elongation":  [2.3, 3.4, 1.2],
})
table1
label circularity elongation
0 1 0.3 2.3
1 2 0.5 3.4
2 3 0.7 1.2
table2 = pd.DataFrame({
    "label":    [3,   2,   1,   4],
    "area":     [22,  32,  25,  18],
    "skewness": [0.5, 0.6, 0.3, 0.3],
})
table2
label area skewness
0 3 22 0.5
1 2 32 0.6
2 1 25 0.3
3 4 18 0.3

Combining columns of DataFrames#

There are multiple ways of combining tables. We first use concat without further adaptation as an example to highlight pitfalls when combining tables.

wrongly_combined_tables = pd.concat([table1, table2], axis=1)
wrongly_combined_tables
label circularity elongation label area skewness
0 1.0 0.3 2.3 3 22 0.5
1 2.0 0.5 3.4 2 32 0.6
2 3.0 0.7 1.2 1 25 0.3
3 NaN NaN NaN 4 18 0.3

In the above example, the measurements for labels 1 and 3 have been mixed up. Since both tables were not indexed on the labels but on a numbered row index, concat used that row index to combine them. Furthermore, one of our tables did not contain measurements for label 4.
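One way to sidestep this pitfall while still using concat is to set the label column as the index of both tables first, so that concat aligns rows by label rather than by position. The following sketch re-creates the two small tables from above; label 4 then receives NaN for the measurements that only exist in one table:

```python
import pandas as pd

# Re-creating the two small tables from above
table1 = pd.DataFrame({
    "label":       [1,   2,   3],
    "circularity": [0.3, 0.5, 0.7],
    "elongation":  [2.3, 3.4, 1.2],
})
table2 = pd.DataFrame({
    "label":    [3,   2,   1,   4],
    "area":     [22,  32,  25,  18],
    "skewness": [0.5, 0.6, 0.3, 0.3],
})

# Index both tables on "label" first, so that concat aligns rows
# by label instead of by row position.
aligned = pd.concat([table1.set_index("label"), table2.set_index("label")], axis=1)
```

With this, area 25 ends up next to the circularity of label 1, as it should.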

A better way of combining tables is the merge function. It allows explicitly specifying how the merge is performed and on which column the tables should be combined. Data scientists speak of the ‘index’ or ‘identifier’ of rows in the tables.

correctly_combined_tables1 = pd.merge(table1, table2, how='inner', on='label')
correctly_combined_tables1
label circularity elongation area skewness
0 1 0.3 2.3 25 0.3
1 2 0.5 3.4 32 0.6
2 3 0.7 1.2 22 0.5

You may note that in the above example, label 4 is missing. We can also include it in our table by performing an outer join.

correctly_combined_tables2 = pd.merge(table1, table2, how='outer', on='label')
correctly_combined_tables2
label circularity elongation area skewness
0 1 0.3 2.3 25 0.3
1 2 0.5 3.4 32 0.6
2 3 0.7 1.2 22 0.5
3 4 NaN NaN 18 0.3
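merge can also report where each row came from: the indicator parameter adds a _merge column marking each row as 'both', 'left_only' or 'right_only'. A small sketch, re-creating the tables from above:

```python
import pandas as pd

table1 = pd.DataFrame({
    "label":       [1,   2,   3],
    "circularity": [0.3, 0.5, 0.7],
    "elongation":  [2.3, 3.4, 1.2],
})
table2 = pd.DataFrame({
    "label":    [3,   2,   1,   4],
    "area":     [22,  32,  25,  18],
    "skewness": [0.5, 0.6, 0.3, 0.3],
})

# indicator=True adds a "_merge" column telling whether a row was
# found in both tables or only in one of them.
merged = pd.merge(table1, table2, how='outer', on='label', indicator=True)
```

Here, labels 1 to 3 are marked 'both' and label 4 is marked 'right_only', which makes the missing measurements easy to spot in larger tables.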

Suppose different tables contain a measurement with the same name. For example, table3 below also contains “elongation”.

table3 = pd.DataFrame({
    "label":    [3,   2,   1,   4],
    "area":     [22,  32,  25,  18],
    "skewness": [0.5, 0.6, 0.3, 0.3],
    "elongation":  [2.3, 3.4, 1.2, 1.1]
    })
table3
label area skewness elongation
0 3 22 0.5 2.3
1 2 32 0.6 3.4
2 1 25 0.3 1.2
3 4 18 0.3 1.1

Merging tables containing the same column name still preserves both measurements in different, renamed columns.

correctly_combined_tables3 = pd.merge(table1, table3, how='outer', on='label')
correctly_combined_tables3
label circularity elongation_x area skewness elongation_y
0 1 0.3 2.3 25 0.3 1.2
1 2 0.5 3.4 32 0.6 3.4
2 3 0.7 1.2 22 0.5 2.3
3 4 NaN NaN 18 0.3 1.1

We can define suffixes instead of the default ‘_x’ and ‘_y’ to better distinguish both columns:

correctly_combined_tables3 = pd.merge(table1, table3, how='outer', on='label', suffixes=('_method-1', '_method-3'))
correctly_combined_tables3
label circularity elongation_method-1 area skewness elongation_method-3
0 1 0.3 2.3 25 0.3 1.2
1 2 0.5 3.4 32 0.6 3.4
2 3 0.7 1.2 22 0.5 2.3
3 4 NaN NaN 18 0.3 1.1
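When each label is expected to appear at most once per table, merge's validate parameter can catch accidental duplicates early. A sketch with simplified stand-in tables (the duplicated row is made up for illustration):

```python
import pandas as pd

table1 = pd.DataFrame({
    "label":       [1,   2,   3],
    "circularity": [0.3, 0.5, 0.7],
})
table3 = pd.DataFrame({
    "label": [3,  2,  1,  4],
    "area":  [22, 32, 25, 18],
})

# validate='one_to_one' passes silently when labels are unique ...
pd.merge(table1, table3, how='outer', on='label', validate='one_to_one')

# ... and raises a MergeError if a label occurs twice:
with_duplicate = pd.concat([table3, table3.iloc[[0]]], ignore_index=True)
try:
    pd.merge(table1, with_duplicate, how='outer', on='label', validate='one_to_one')
    duplicate_detected = False
except pd.errors.MergeError:
    duplicate_detected = True
```

This is a cheap safety net when merging measurement tables produced by different tools.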

Combining rows of DataFrames#

When applying a workflow to many images, you get tables with the same column names but varying numbers of rows. To calculate statistics for whole folders or to perform machine learning, we usually need to concatenate those tables, but it is important to keep track of the source files.
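For a whole folder, this can be done in a loop: read each CSV, tag its rows with the file name, and concatenate at the end. A sketch of this pattern; here a temporary folder with two tiny CSV files stands in for the real measurement folder:

```python
import glob
import os
import tempfile

import pandas as pd

# A temporary folder with two tiny CSVs stands in for the real data folder.
folder = tempfile.mkdtemp()
pd.DataFrame({"area": [10, 20]}).to_csv(os.path.join(folder, "image_a.csv"), index=False)
pd.DataFrame({"area": [30]}).to_csv(os.path.join(folder, "image_b.csv"), index=False)

tables = []
for path in sorted(glob.glob(os.path.join(folder, "*.csv"))):
    df = pd.read_csv(path)
    df["source_file"] = os.path.basename(path)  # keep track of the origin
    tables.append(df)

# One big table; ignore_index=True creates a fresh consecutive index.
big_table = pd.concat(tables, axis=0, ignore_index=True)
```

The next cells walk through the same steps with two real measurement files.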

Let’s open two tables generated from applying the same workflow to different files.

df1 = pd.read_csv('../data/BBBC007_20P1_POS0007_D_1UL.csv')
df1.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 64 entries, 0 to 63
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   area               64 non-null     int64  
 1   intensity_mean     64 non-null     float64
 2   major_axis_length  64 non-null     float64
 3   minor_axis_length  64 non-null     float64
 4   aspect_ratio       64 non-null     float64
dtypes: float64(4), int64(1)
memory usage: 2.6 KB
df2 = pd.read_csv('../data/BBBC007_20P1_POS0010_D_1UL.csv')
df2.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 47 entries, 0 to 46
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   area               47 non-null     int64  
 1   intensity_mean     47 non-null     float64
 2   major_axis_length  47 non-null     float64
 3   minor_axis_length  47 non-null     float64
 4   aspect_ratio       47 non-null     float64
dtypes: float64(4), int64(1)
memory usage: 2.0 KB

In this particular case, where we know both tables have the same columns, we could concatenate them into a single big table. We can use concat with axis=0 to concatenate along the rows (the index).

big_df = pd.concat([df1, df2], axis=0)
big_df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 111 entries, 0 to 46
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   area               111 non-null    int64  
 1   intensity_mean     111 non-null    float64
 2   major_axis_length  111 non-null    float64
 3   minor_axis_length  111 non-null    float64
 4   aspect_ratio       111 non-null    float64
dtypes: float64(4), int64(1)
memory usage: 5.2 KB
big_df.head()
area intensity_mean major_axis_length minor_axis_length aspect_ratio
0 256 93.250000 19.995017 17.021559 1.174688
1 90 82.488889 15.939969 7.516326 2.120713
2 577 90.637782 35.324458 21.759434 1.623409
3 270 95.640741 20.229431 17.669052 1.144908
4 153 84.908497 15.683703 12.420475 1.262730

The problem here is that we have lost the source identity: we can no longer tell which row originated from which file.

An easy fix for that is to add a new column with the file name before concatenating them. This will facilitate splitting them back and plotting later on. When we give a single value to a new column, it is assigned to all rows.

df1['source_file'] = 'BBBC007_20P1_POS0007_D_1UL'
df2['source_file'] = 'BBBC007_20P1_POS0010_D_1UL'

Now we can safely distinguish the source of each row.

big_df = pd.concat([df1, df2], axis=0)
big_df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 111 entries, 0 to 46
Data columns (total 6 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   area               111 non-null    int64  
 1   intensity_mean     111 non-null    float64
 2   major_axis_length  111 non-null    float64
 3   minor_axis_length  111 non-null    float64
 4   aspect_ratio       111 non-null    float64
 5   source_file        111 non-null    object 
dtypes: float64(4), int64(1), object(1)
memory usage: 6.1+ KB

Looking at the index information (which is no longer a RangeIndex), the number of entries and the index labels suggest that there are duplicates. We can double-check with the duplicated method:

big_df.index.duplicated().any()
True

In such cases, where the index carries no meaningful information but is just a consecutive number, we can fix this by passing ignore_index=True to the concat method. This creates a new index for the combined DataFrame while discarding the old indices.

big_df = pd.concat([df1, df2], axis=0, ignore_index=True)
big_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 111 entries, 0 to 110
Data columns (total 6 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   area               111 non-null    int64  
 1   intensity_mean     111 non-null    float64
 2   major_axis_length  111 non-null    float64
 3   minor_axis_length  111 non-null    float64
 4   aspect_ratio       111 non-null    float64
 5   source_file        111 non-null    object 
dtypes: float64(4), int64(1), object(1)
memory usage: 5.3+ KB
big_df.head()
area intensity_mean major_axis_length minor_axis_length aspect_ratio source_file
0 256 93.250000 19.995017 17.021559 1.174688 BBBC007_20P1_POS0007_D_1UL
1 90 82.488889 15.939969 7.516326 2.120713 BBBC007_20P1_POS0007_D_1UL
2 577 90.637782 35.324458 21.759434 1.623409 BBBC007_20P1_POS0007_D_1UL
3 270 95.640741 20.229431 17.669052 1.144908 BBBC007_20P1_POS0007_D_1UL
4 153 84.908497 15.683703 12.420475 1.262730 BBBC007_20P1_POS0007_D_1UL
big_df.tail()
area intensity_mean major_axis_length minor_axis_length aspect_ratio source_file
106 315 91.133333 20.927095 19.209283 1.089426 BBBC007_20P1_POS0010_D_1UL
107 206 94.262136 23.381879 11.669668 2.003646 BBBC007_20P1_POS0010_D_1UL
108 45 68.377778 9.406371 6.276445 1.498678 BBBC007_20P1_POS0010_D_1UL
109 33 76.727273 10.724275 4.174568 2.568955 BBBC007_20P1_POS0010_D_1UL
110 16 76.750000 7.745967 2.783882 2.782433 BBBC007_20P1_POS0010_D_1UL
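As an aside, pd.concat's keys parameter offers an alternative way of keeping track of the source: it records one label per input table in the outer level of a MultiIndex instead of an extra column. A sketch with small stand-in tables (the real df1 and df2 would work the same way):

```python
import pandas as pd

# Small stand-ins for df1 and df2 from above
d1 = pd.DataFrame({"area": [256, 90]})
d2 = pd.DataFrame({"area": [315]})

# keys= labels each input table; the labels become the outer
# level of a MultiIndex on the combined DataFrame.
labelled = pd.concat([d1, d2], keys=["POS0007", "POS0010"])
```

Rows from one source can then be selected with labelled.loc["POS0007"]. A plain source_file column, as used above, is often more convenient for plotting and CSV export, though.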

Let’s store this data for later use:

big_df.to_csv('../data/BBBC007_concat.csv')
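Note that to_csv writes the index as an unnamed first column by default; when reading the file back, index_col=0 restores it (alternatively, pass index=False when writing). A minimal sketch of the round trip using an in-memory buffer instead of a file:

```python
import io

import pandas as pd

df = pd.DataFrame({"area": [256, 90]})

buffer = io.StringIO()
df.to_csv(buffer)                            # the index is written as the first column
buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)  # index_col=0 restores it as the index
```

Without index_col=0, the old index would reappear as a column named "Unnamed: 0".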