Appending and extending DataFrames#
When processing data, e.g., multiple images, potentially using multiple image processing libraries, a common task is combining the resulting DataFrames or tables. Pandas provides a couple of approaches for this.
We start with two small tables of measurements that could have been obtained from different functions or different libraries.
import pandas as pd
table1 = pd.DataFrame({
"label": [1, 2, 3],
"circularity": [0.3, 0.5, 0.7],
"elongation": [2.3, 3.4, 1.2],
})
table1
| | label | circularity | elongation |
|---|---|---|---|
| 0 | 1 | 0.3 | 2.3 |
| 1 | 2 | 0.5 | 3.4 |
| 2 | 3 | 0.7 | 1.2 |
table2 = pd.DataFrame({
"label": [3, 2, 1, 4],
"area": [22, 32, 25, 18],
"skewness": [0.5, 0.6, 0.3, 0.3],
})
table2
| | label | area | skewness |
|---|---|---|---|
| 0 | 3 | 22 | 0.5 |
| 1 | 2 | 32 | 0.6 |
| 2 | 1 | 25 | 0.3 |
| 3 | 4 | 18 | 0.3 |
Combining columns of DataFrames#
There are multiple ways of combining tables. We first use `concat` without further adaptation as an example to highlight pitfalls when combining tables.
wrongly_combined_tables = pd.concat([table1, table2], axis=1)
wrongly_combined_tables
| | label | circularity | elongation | label | area | skewness |
|---|---|---|---|---|---|---|
| 0 | 1.0 | 0.3 | 2.3 | 3 | 22 | 0.5 |
| 1 | 2.0 | 0.5 | 3.4 | 2 | 32 | 0.6 |
| 2 | 3.0 | 0.7 | 1.2 | 1 | 25 | 0.3 |
| 3 | NaN | NaN | NaN | 4 | 18 | 0.3 |
In the above example, measurements of labels 1 and 3 have been mixed up. Since both tables were not indexed by the labels but by a numbered row index, `concat` used the row index to combine them. Furthermore, one of our tables did not contain measurements for label 4.
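One way to avoid the positional mix-up is to make `label` the index of both tables before concatenating, so that `concat` aligns rows by label instead of by position. A minimal, standalone sketch (rebuilding the two example tables):

```python
import pandas as pd

# Rebuild the two example tables so this sketch is standalone.
table1 = pd.DataFrame({
    "label": [1, 2, 3],
    "circularity": [0.3, 0.5, 0.7],
    "elongation": [2.3, 3.4, 1.2],
})
table2 = pd.DataFrame({
    "label": [3, 2, 1, 4],
    "area": [22, 32, 25, 18],
    "skewness": [0.5, 0.6, 0.3, 0.3],
})

# concat aligns on the index, so making 'label' the index first
# matches rows by label rather than by their row position.
aligned = pd.concat(
    [table1.set_index("label"), table2.set_index("label")],
    axis=1,
)
print(aligned)
```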
A better way to combine tables is the `merge` function. It allows explicitly specifying `how` the merging is done and `on` which column the tables should be combined. Data scientists speak of the 'index' or 'identifier' of the rows in the tables.
correctly_combined_tables1 = pd.merge(table1, table2, how='inner', on='label')
correctly_combined_tables1
| | label | circularity | elongation | area | skewness |
|---|---|---|---|---|---|
| 0 | 1 | 0.3 | 2.3 | 25 | 0.3 |
| 1 | 2 | 0.5 | 3.4 | 32 | 0.6 |
| 2 | 3 | 0.7 | 1.2 | 22 | 0.5 |
You may note that in the above example, label 4 is missing. We can also include it in our table by performing an outer join.
correctly_combined_tables2 = pd.merge(table1, table2, how='outer', on='label')
correctly_combined_tables2
| | label | circularity | elongation | area | skewness |
|---|---|---|---|---|---|
| 0 | 1 | 0.3 | 2.3 | 25 | 0.3 |
| 1 | 2 | 0.5 | 3.4 | 32 | 0.6 |
| 2 | 3 | 0.7 | 1.2 | 22 | 0.5 |
| 3 | 4 | NaN | NaN | 18 | 0.3 |
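If it is unclear which rows came from which table, `merge` also accepts `indicator=True`, which adds a `_merge` column recording whether each row was found in the left table, the right table, or both. A standalone sketch with the same example data:

```python
import pandas as pd

table1 = pd.DataFrame({
    "label": [1, 2, 3],
    "circularity": [0.3, 0.5, 0.7],
    "elongation": [2.3, 3.4, 1.2],
})
table2 = pd.DataFrame({
    "label": [3, 2, 1, 4],
    "area": [22, 32, 25, 18],
    "skewness": [0.5, 0.6, 0.3, 0.3],
})

# indicator=True adds a '_merge' column with values
# 'left_only', 'right_only', or 'both'.
merged = pd.merge(table1, table2, how="outer", on="label", indicator=True)
print(merged[["label", "_merge"]])
```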
Suppose different tables share a common measurement name. For example, table3 below also contains "elongation".
table3 = pd.DataFrame({
"label": [3, 2, 1, 4],
"area": [22, 32, 25, 18],
"skewness": [0.5, 0.6, 0.3, 0.3],
"elongation": [2.3, 3.4, 1.2, 1.1]
})
table3
| | label | area | skewness | elongation |
|---|---|---|---|---|
| 0 | 3 | 22 | 0.5 | 2.3 |
| 1 | 2 | 32 | 0.6 | 3.4 |
| 2 | 1 | 25 | 0.3 | 1.2 |
| 3 | 4 | 18 | 0.3 | 1.1 |
Merging tables that contain the same column name preserves both measurements, placing them in separate, automatically renamed columns.
correctly_combined_tables3 = pd.merge(table1, table3, how='outer', on='label')
correctly_combined_tables3
| | label | circularity | elongation_x | area | skewness | elongation_y |
|---|---|---|---|---|---|---|
| 0 | 1 | 0.3 | 2.3 | 25 | 0.3 | 1.2 |
| 1 | 2 | 0.5 | 3.4 | 32 | 0.6 | 3.4 |
| 2 | 3 | 0.7 | 1.2 | 22 | 0.5 | 2.3 |
| 3 | 4 | NaN | NaN | 18 | 0.3 | 1.1 |
We can define suffixes instead of the defaults '_x' and '_y' to better distinguish the two columns:
correctly_combined_tables3 = pd.merge(table1, table3, how='outer', on='label', suffixes=('_method-1', '_method-3'))
correctly_combined_tables3
| | label | circularity | elongation_method-1 | area | skewness | elongation_method-3 |
|---|---|---|---|---|---|---|
| 0 | 1 | 0.3 | 2.3 | 25 | 0.3 | 1.2 |
| 1 | 2 | 0.5 | 3.4 | 32 | 0.6 | 3.4 |
| 2 | 3 | 0.7 | 1.2 | 22 | 0.5 | 2.3 |
| 3 | 4 | NaN | NaN | 18 | 0.3 | 1.1 |
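With distinct suffixes in place, the two elongation measurements can be compared directly, e.g., by computing their difference. A standalone sketch (rebuilding `table1` and `table3`; the column name `elongation_diff` is just an illustrative choice):

```python
import pandas as pd

table1 = pd.DataFrame({
    "label": [1, 2, 3],
    "circularity": [0.3, 0.5, 0.7],
    "elongation": [2.3, 3.4, 1.2],
})
table3 = pd.DataFrame({
    "label": [3, 2, 1, 4],
    "area": [22, 32, 25, 18],
    "skewness": [0.5, 0.6, 0.3, 0.3],
    "elongation": [2.3, 3.4, 1.2, 1.1],
})

combined = pd.merge(table1, table3, how="outer", on="label",
                    suffixes=("_method-1", "_method-3"))

# The suffixed names make the two measurements directly comparable;
# 'elongation_diff' is an illustrative column name, not a pandas feature.
combined["elongation_diff"] = (
    combined["elongation_method-1"] - combined["elongation_method-3"]
)
print(combined[["label", "elongation_diff"]])
```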
Combining rows of DataFrames#
When applying a workflow to many images, you get tables with the same column names but a variable number of rows. To calculate statistics for whole folders or to perform machine learning, we usually need to concatenate those tables, but it is important to keep track of the source files.
Let’s open two tables generated from applying the same workflow to different files.
df1 = pd.read_csv('../data/BBBC007_20P1_POS0007_D_1UL.csv')
df1.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 64 entries, 0 to 63
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 area 64 non-null int64
1 intensity_mean 64 non-null float64
2 major_axis_length 64 non-null float64
3 minor_axis_length 64 non-null float64
4 aspect_ratio 64 non-null float64
dtypes: float64(4), int64(1)
memory usage: 2.6 KB
df2 = pd.read_csv('../data/BBBC007_20P1_POS0010_D_1UL.csv')
df2.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 47 entries, 0 to 46
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 area 47 non-null int64
1 intensity_mean 47 non-null float64
2 major_axis_length 47 non-null float64
3 minor_axis_length 47 non-null float64
4 aspect_ratio 47 non-null float64
dtypes: float64(4), int64(1)
memory usage: 2.0 KB
In this particular case, where we know both tables have the same columns, we can concatenate them into a single big table. We use `concat` with `axis=0` to concatenate along the rows (the index).
big_df = pd.concat([df1, df2], axis=0)
big_df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 111 entries, 0 to 46
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 area 111 non-null int64
1 intensity_mean 111 non-null float64
2 major_axis_length 111 non-null float64
3 minor_axis_length 111 non-null float64
4 aspect_ratio 111 non-null float64
dtypes: float64(4), int64(1)
memory usage: 5.2 KB
big_df.head()
| | area | intensity_mean | major_axis_length | minor_axis_length | aspect_ratio |
|---|---|---|---|---|---|
| 0 | 256 | 93.250000 | 19.995017 | 17.021559 | 1.174688 |
| 1 | 90 | 82.488889 | 15.939969 | 7.516326 | 2.120713 |
| 2 | 577 | 90.637782 | 35.324458 | 21.759434 | 1.623409 |
| 3 | 270 | 95.640741 | 20.229431 | 17.669052 | 1.144908 |
| 4 | 153 | 84.908497 | 15.683703 | 12.420475 | 1.262730 |
The problem here is that we have lost the source identity: we can no longer tell which row originated from which file.
An easy fix is to add a new column with the file name before concatenating. This makes it easy to split the rows apart again and to plot the data later on. When we assign a single value to a new column, it is applied to all rows.
df1['source_file'] = 'BBBC007_20P1_POS0007_D_1UL'
df2['source_file'] = 'BBBC007_20P1_POS0010_D_1UL'
Now we can safely distinguish the source of each row.
big_df = pd.concat([df1, df2], axis=0)
big_df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 111 entries, 0 to 46
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 area 111 non-null int64
1 intensity_mean 111 non-null float64
2 major_axis_length 111 non-null float64
3 minor_axis_length 111 non-null float64
4 aspect_ratio 111 non-null float64
5 source_file 111 non-null object
dtypes: float64(4), int64(1), object(1)
memory usage: 6.1+ KB
If we look at the index information (which is no longer a range index), the number of entries and the index labels suggest that there are duplicates. We can double-check with the corresponding method:
big_df.index.duplicated().any()
True
In cases like this, where the index carries no meaningful information but is just a consecutive number, we can fix this by passing `ignore_index=True` to `concat`. This creates a new index for the combined DataFrame and discards the old indices.
big_df = pd.concat([df1, df2], axis=0, ignore_index=True)
big_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 111 entries, 0 to 110
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 area 111 non-null int64
1 intensity_mean 111 non-null float64
2 major_axis_length 111 non-null float64
3 minor_axis_length 111 non-null float64
4 aspect_ratio 111 non-null float64
5 source_file 111 non-null object
dtypes: float64(4), int64(1), object(1)
memory usage: 5.3+ KB
big_df.head()
| | area | intensity_mean | major_axis_length | minor_axis_length | aspect_ratio | source_file |
|---|---|---|---|---|---|---|
| 0 | 256 | 93.250000 | 19.995017 | 17.021559 | 1.174688 | BBBC007_20P1_POS0007_D_1UL |
| 1 | 90 | 82.488889 | 15.939969 | 7.516326 | 2.120713 | BBBC007_20P1_POS0007_D_1UL |
| 2 | 577 | 90.637782 | 35.324458 | 21.759434 | 1.623409 | BBBC007_20P1_POS0007_D_1UL |
| 3 | 270 | 95.640741 | 20.229431 | 17.669052 | 1.144908 | BBBC007_20P1_POS0007_D_1UL |
| 4 | 153 | 84.908497 | 15.683703 | 12.420475 | 1.262730 | BBBC007_20P1_POS0007_D_1UL |
big_df.tail()
| | area | intensity_mean | major_axis_length | minor_axis_length | aspect_ratio | source_file |
|---|---|---|---|---|---|---|
| 106 | 315 | 91.133333 | 20.927095 | 19.209283 | 1.089426 | BBBC007_20P1_POS0010_D_1UL |
| 107 | 206 | 94.262136 | 23.381879 | 11.669668 | 2.003646 | BBBC007_20P1_POS0010_D_1UL |
| 108 | 45 | 68.377778 | 9.406371 | 6.276445 | 1.498678 | BBBC007_20P1_POS0010_D_1UL |
| 109 | 33 | 76.727273 | 10.724275 | 4.174568 | 2.568955 | BBBC007_20P1_POS0010_D_1UL |
| 110 | 16 | 76.750000 | 7.745967 | 2.783882 | 2.782433 | BBBC007_20P1_POS0010_D_1UL |
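With the `source_file` column in place, per-file statistics are one `groupby` away. A minimal sketch with a small stand-in DataFrame (the values and file names `file_A`/`file_B` are hypothetical, chosen so the sketch runs on its own):

```python
import pandas as pd

# A small stand-in for big_df; file names and values are hypothetical.
big_df = pd.DataFrame({
    "area": [256, 90, 577, 315, 206],
    "aspect_ratio": [1.17, 2.12, 1.62, 1.09, 2.00],
    "source_file": ["file_A", "file_A", "file_A", "file_B", "file_B"],
})

# Per-file summary statistics via groupby on the source column.
stats = big_df.groupby("source_file")["area"].agg(["count", "mean"])
print(stats)
```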
Let’s store this data for later use:
big_df.to_csv('../data/BBBC007_concat.csv')