Appending and extending DataFrames#

When processing different data, e.g., multiple images, potentially using multiple image processing libraries, a common task is to combine the resulting DataFrames or tables. Pandas provides a couple of approaches to do this.

We start with two small tables of measurements that could have been obtained from different functions or different libraries.

import pandas as pd
table1 = pd.DataFrame({
    "label":       [1,   2,   3],
    "circularity": [0.3, 0.5, 0.7],
    "elongation":  [2.3, 3.4, 1.2],
})
table1
label circularity elongation
0 1 0.3 2.3
1 2 0.5 3.4
2 3 0.7 1.2
table2 = pd.DataFrame({
    "label":    [3,   2,   1,   4],
    "area":     [22,  32,  25,  18],
    "skewness": [0.5, 0.6, 0.3, 0.3],
})
table2
label area skewness
0 3 22 0.5
1 2 32 0.6
2 1 25 0.3
3 4 18 0.3

Combining columns of DataFrames#

There are multiple ways of combining tables. We first use concat without further adaptation as an example to highlight pitfalls when combining tables.

wrongly_combined_tables = pd.concat([table1, table2], axis=1)
wrongly_combined_tables
label circularity elongation label area skewness
0 1.0 0.3 2.3 3 22 0.5
1 2.0 0.5 3.4 2 32 0.6
2 3.0 0.7 1.2 1 25 0.3
3 NaN NaN NaN 4 18 0.3

In the above example, the measurements for labels 1 and 3 have been mixed up. Since both tables were not indexed on the labels but on a numbered row index, concat used that row index to combine them. Furthermore, one of our tables did not contain measurements for label 4.
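One way to sidestep this pitfall while still using concat is to set the label column as the index of both tables first, so that concat aligns rows by label rather than by position. The following sketch re-creates the two small tables from above; label 4 then receives NaN for the measurements that only exist in one table:

```python
import pandas as pd

# Re-creating the two small tables from above
table1 = pd.DataFrame({
    "label":       [1,   2,   3],
    "circularity": [0.3, 0.5, 0.7],
    "elongation":  [2.3, 3.4, 1.2],
})
table2 = pd.DataFrame({
    "label":    [3,   2,   1,   4],
    "area":     [22,  32,  25,  18],
    "skewness": [0.5, 0.6, 0.3, 0.3],
})

# Index both tables on "label" first, so that concat aligns rows
# by label instead of by row position.
aligned = pd.concat([table1.set_index("label"), table2.set_index("label")], axis=1)
```

With this, area 25 ends up next to the circularity of label 1, as it should.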

A better way of combining tables is the merge function. It allows explicitly specifying how the merge is performed and on which column the tables should be combined. Data scientists speak of the ‘index’ or ‘identifier’ of rows in the tables.

correctly_combined_tables1 = pd.merge(table1, table2, how='inner', on='label')
correctly_combined_tables1
label circularity elongation area skewness
0 1 0.3 2.3 25 0.3
1 2 0.5 3.4 32 0.6
2 3 0.7 1.2 22 0.5

You may note that in the above example, label 4 is missing. We can also include it in our table by performing an outer join.

correctly_combined_tables2 = pd.merge(table1, table2, how='outer', on='label')
correctly_combined_tables2
label circularity elongation area skewness
0 1 0.3 2.3 25 0.3
1 2 0.5 3.4 32 0.6
2 3 0.7 1.2 22 0.5
3 4 NaN NaN 18 0.3
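merge can also report where each row came from: the indicator parameter adds a _merge column marking each row as 'both', 'left_only' or 'right_only'. A small sketch, re-creating the tables from above:

```python
import pandas as pd

table1 = pd.DataFrame({
    "label":       [1,   2,   3],
    "circularity": [0.3, 0.5, 0.7],
    "elongation":  [2.3, 3.4, 1.2],
})
table2 = pd.DataFrame({
    "label":    [3,   2,   1,   4],
    "area":     [22,  32,  25,  18],
    "skewness": [0.5, 0.6, 0.3, 0.3],
})

# indicator=True adds a "_merge" column telling whether a row was
# found in both tables or only in one of them.
merged = pd.merge(table1, table2, how='outer', on='label', indicator=True)
```

Here, labels 1 to 3 are marked 'both' and label 4 is marked 'right_only', which makes the missing measurements easy to spot in larger tables.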

Suppose different tables contain a measurement with the same name. For example, table3 below also contains “elongation”.

table3 = pd.DataFrame({
    "label":    [3,   2,   1,   4],
    "area":     [22,  32,  25,  18],
    "skewness": [0.5, 0.6, 0.3, 0.3],
    "elongation":  [2.3, 3.4, 1.2, 1.1]
    })
table3
label area skewness elongation
0 3 22 0.5 2.3
1 2 32 0.6 3.4
2 1 25 0.3 1.2
3 4 18 0.3 1.1

Merging tables containing the same column name still preserves both measurements in different, renamed columns.

correctly_combined_tables3 = pd.merge(table1, table3, how='outer', on='label')
correctly_combined_tables3
label circularity elongation_x area skewness elongation_y
0 1 0.3 2.3 25 0.3 1.2
1 2 0.5 3.4 32 0.6 3.4
2 3 0.7 1.2 22 0.5 2.3
3 4 NaN NaN 18 0.3 1.1

We can define suffixes instead of the default ‘_x’ and ‘_y’ to better distinguish both columns:

correctly_combined_tables3 = pd.merge(table1, table3, how='outer', on='label', suffixes=('_method-1', '_method-3'))
correctly_combined_tables3
label circularity elongation_method-1 area skewness elongation_method-3
0 1 0.3 2.3 25 0.3 1.2
1 2 0.5 3.4 32 0.6 3.4
2 3 0.7 1.2 22 0.5 2.3
3 4 NaN NaN 18 0.3 1.1
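When each label is expected to appear at most once per table, merge's validate parameter can catch accidental duplicates early. A sketch with simplified stand-in tables (the duplicated row is made up for illustration):

```python
import pandas as pd

table1 = pd.DataFrame({
    "label":       [1,   2,   3],
    "circularity": [0.3, 0.5, 0.7],
})
table3 = pd.DataFrame({
    "label": [3,  2,  1,  4],
    "area":  [22, 32, 25, 18],
})

# validate='one_to_one' passes silently when labels are unique ...
pd.merge(table1, table3, how='outer', on='label', validate='one_to_one')

# ... and raises a MergeError if a label occurs twice:
with_duplicate = pd.concat([table3, table3.iloc[[0]]], ignore_index=True)
try:
    pd.merge(table1, with_duplicate, how='outer', on='label', validate='one_to_one')
    duplicate_detected = False
except pd.errors.MergeError:
    duplicate_detected = True
```

This is a cheap safety net when merging measurement tables produced by different tools.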

Combining rows of DataFrames#

When applying a workflow to many images, you get tables with the same column names but varying numbers of rows. To calculate statistics for whole folders or to perform machine learning, we usually need to concatenate those tables, but it is important to keep track of the source files.
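For a whole folder, this can be done in a loop: read each CSV, tag its rows with the file name, and concatenate at the end. A sketch of this pattern; here a temporary folder with two tiny CSV files stands in for the real measurement folder:

```python
import glob
import os
import tempfile

import pandas as pd

# A temporary folder with two tiny CSVs stands in for the real data folder.
folder = tempfile.mkdtemp()
pd.DataFrame({"area": [10, 20]}).to_csv(os.path.join(folder, "image_a.csv"), index=False)
pd.DataFrame({"area": [30]}).to_csv(os.path.join(folder, "image_b.csv"), index=False)

tables = []
for path in sorted(glob.glob(os.path.join(folder, "*.csv"))):
    df = pd.read_csv(path)
    df["source_file"] = os.path.basename(path)  # keep track of the origin
    tables.append(df)

# One big table; ignore_index=True creates a fresh consecutive index.
big_table = pd.concat(tables, axis=0, ignore_index=True)
```

The next cells walk through the same steps with two real measurement files.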

Let’s open two tables generated from applying the same workflow to different files.

df1 = pd.read_csv('../data/BBBC007_20P1_POS0007_D_1UL.csv')
df1.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 64 entries, 0 to 63
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   area               64 non-null     int64  
 1   intensity_mean     64 non-null     float64
 2   major_axis_length  64 non-null     float64
 3   minor_axis_length  64 non-null     float64
 4   aspect_ratio       64 non-null     float64
dtypes: float64(4), int64(1)
memory usage: 2.6 KB
df2 = pd.read_csv('../data/BBBC007_20P1_POS0010_D_1UL.csv')
df2.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 47 entries, 0 to 46
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   area               47 non-null     int64  
 1   intensity_mean     47 non-null     float64
 2   major_axis_length  47 non-null     float64
 3   minor_axis_length  47 non-null     float64
 4   aspect_ratio       47 non-null     float64
dtypes: float64(4), int64(1)
memory usage: 2.0 KB

In this particular case, where we know both tables have the same columns, we could concatenate them into a single big table. We can use concat with axis=0 to concatenate along the rows (the index).

big_df = pd.concat([df1, df2], axis=0)
big_df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 111 entries, 0 to 46
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   area               111 non-null    int64  
 1   intensity_mean     111 non-null    float64
 2   major_axis_length  111 non-null    float64
 3   minor_axis_length  111 non-null    float64
 4   aspect_ratio       111 non-null    float64
dtypes: float64(4), int64(1)
memory usage: 5.2 KB
big_df.head()
area intensity_mean major_axis_length minor_axis_length aspect_ratio
0 256 93.250000 19.995017 17.021559 1.174688
1 90 82.488889 15.939969 7.516326 2.120713
2 577 90.637782 35.324458 21.759434 1.623409
3 270 95.640741 20.229431 17.669052 1.144908
4 153 84.908497 15.683703 12.420475 1.262730

The problem here is that we have lost the source identity: we can no longer tell which row originated from which file.

An easy fix for that is to add a new column with the file name before concatenating them. This will facilitate splitting them back and plotting later on. When we give a single value to a new column, it is assigned to all rows.

df1['source_file'] = 'BBBC007_20P1_POS0007_D_1UL'
df2['source_file'] = 'BBBC007_20P1_POS0010_D_1UL'

Now we can safely distinguish the source of each row.

big_df = pd.concat([df1, df2], axis=0)
big_df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 111 entries, 0 to 46
Data columns (total 6 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   area               111 non-null    int64  
 1   intensity_mean     111 non-null    float64
 2   major_axis_length  111 non-null    float64
 3   minor_axis_length  111 non-null    float64
 4   aspect_ratio       111 non-null    float64
 5   source_file        111 non-null    object 
dtypes: float64(4), int64(1), object(1)
memory usage: 6.1+ KB

Looking at the index information (which is no longer a RangeIndex), the number of entries and the index labels suggest that there are duplicates. We can double-check with the duplicated method:

big_df.index.duplicated().any()
True

In such cases, where the index carries no meaningful information but is just a consecutive number, we can fix this by passing ignore_index=True to the concat method. This creates a new index for the combined DataFrame while discarding the old indices.

big_df = pd.concat([df1, df2], axis=0, ignore_index=True)
big_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 111 entries, 0 to 110
Data columns (total 6 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   area               111 non-null    int64  
 1   intensity_mean     111 non-null    float64
 2   major_axis_length  111 non-null    float64
 3   minor_axis_length  111 non-null    float64
 4   aspect_ratio       111 non-null    float64
 5   source_file        111 non-null    object 
dtypes: float64(4), int64(1), object(1)
memory usage: 5.3+ KB
big_df.head()
area intensity_mean major_axis_length minor_axis_length aspect_ratio source_file
0 256 93.250000 19.995017 17.021559 1.174688 BBBC007_20P1_POS0007_D_1UL
1 90 82.488889 15.939969 7.516326 2.120713 BBBC007_20P1_POS0007_D_1UL
2 577 90.637782 35.324458 21.759434 1.623409 BBBC007_20P1_POS0007_D_1UL
3 270 95.640741 20.229431 17.669052 1.144908 BBBC007_20P1_POS0007_D_1UL
4 153 84.908497 15.683703 12.420475 1.262730 BBBC007_20P1_POS0007_D_1UL
big_df.tail()
area intensity_mean major_axis_length minor_axis_length aspect_ratio source_file
106 315 91.133333 20.927095 19.209283 1.089426 BBBC007_20P1_POS0010_D_1UL
107 206 94.262136 23.381879 11.669668 2.003646 BBBC007_20P1_POS0010_D_1UL
108 45 68.377778 9.406371 6.276445 1.498678 BBBC007_20P1_POS0010_D_1UL
109 33 76.727273 10.724275 4.174568 2.568955 BBBC007_20P1_POS0010_D_1UL
110 16 76.750000 7.745967 2.783882 2.782433 BBBC007_20P1_POS0010_D_1UL
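As an aside, pd.concat's keys parameter offers an alternative way of keeping track of the source: it records one label per input table in the outer level of a MultiIndex instead of an extra column. A sketch with small stand-in tables (the real df1 and df2 would work the same way):

```python
import pandas as pd

# Small stand-ins for df1 and df2 from above
d1 = pd.DataFrame({"area": [256, 90]})
d2 = pd.DataFrame({"area": [315]})

# keys= labels each input table; the labels become the outer
# level of a MultiIndex on the combined DataFrame.
labelled = pd.concat([d1, d2], keys=["POS0007", "POS0010"])
```

Rows from one source can then be selected with labelled.loc["POS0007"]. A plain source_file column, as used above, is often more convenient for plotting and CSV export, though.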

Let’s store this data for later use:

big_df.to_csv('../data/BBBC007_concat.csv')
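Note that to_csv writes the index as an unnamed first column by default; when reading the file back, index_col=0 restores it (alternatively, pass index=False when writing). A minimal sketch of the round trip using an in-memory buffer instead of a file:

```python
import io

import pandas as pd

df = pd.DataFrame({"area": [256, 90]})

buffer = io.StringIO()
df.to_csv(buffer)                            # the index is written as the first column
buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)  # index_col=0 restores it as the index
```

Without index_col=0, the old index would reappear as a column named "Unnamed: 0".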