Appending and extending DataFrames

Appending and extending DataFrames#

When processing different data, e.g., multiple measurements, potentially using multiple processing libraries, a common task is to combine the resulting DataFrames. Pandas provides several approaches to do this.

See also:

We start with two small tables of measurements that could have been obtained from different functions or different libraries.

import pandas as pd
import numpy as np

df1 = pd.read_csv('data/measurements_1.csv')
df2 = pd.read_csv('data/measurements_2.csv')

df1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   label   10 non-null     object 
 1   value1  10 non-null     float64
 2   value2  8 non-null      float64
 3   value3  9 non-null      float64
 4   valid   10 non-null     bool   
dtypes: bool(1), float64(3), object(1)
memory usage: 462.0+ bytes

df2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   label      10 non-null     object 
 1   value4     10 non-null     float64
 2   timestamp  10 non-null     object 
dtypes: float64(1), object(2)
memory usage: 372.0+ bytes

Combining columns of DataFrames#

There are multiple ways of combining tables. We first use concat without further adaption as an example to highlight pitfalls when combining tables.

# You can check the different result for the `axis` parameter which can either be 0 (along rows) or 1 (along columns).
wrongly_combined_tables = pd.concat([df1, df2], axis=1)
wrongly_combined_tables

	label	value1	value2	value3	valid	label	value4	timestamp
0	A	9.0	2.0	3.1	True	A	0.98	2023-01-01 07:00
1	B	11.0	NaN	0.4	False	B	0.40	2023-01-01 08:00
2	C	10.2	8.0	5.6	True	D	2.56	2023-01-01 10:00
3	D	9.5	5.0	NaN	True	E	3.14	2023-01-01 11:00
4	E	15.0	7.0	4.4	False	F	2.71	2023-01-01 12:00
5	F	400.0	NaN	2.2	True	G	3.58	2023-01-01 16:00
6	G	9.0	1.0	1.1	False	H	4.12	2023-01-01 17:00
7	H	11.0	4.0	0.9	True	I	3.33	2023-01-01 18:00
8	I	11.3	6.0	3.3	True	J	2.98	2023-01-01 19:00
9	J	10.0	9.0	4.1	False	K	3.87	2023-01-01 20:00

In the above example, measurements for labels have been mixed. Since both tables where not indexed on the labels but with a numbered row index,c used this to combine them.

While concat also provides additional parameters, a better way for combining data is merge. It allows explicitly specifying how the merging is done, and on which column the tables should be combined.

# You can check the different results of the `how` parameter for its allowed values: ‘left’, ‘right’, ‘outer’, ‘inner’, ‘cross’
merged_tables = pd.merge(df1, df2, how='inner', on='label')
merged_tables

	label	value1	value2	value3	valid	value4	timestamp
0	A	9.0	2.0	3.1	True	0.98	2023-01-01 07:00
1	B	11.0	NaN	0.4	False	0.40	2023-01-01 08:00
2	D	9.5	5.0	NaN	True	2.56	2023-01-01 10:00
3	E	15.0	7.0	4.4	False	3.14	2023-01-01 11:00
4	F	400.0	NaN	2.2	True	2.71	2023-01-01 12:00
5	G	9.0	1.0	1.1	False	3.58	2023-01-01 16:00
6	H	11.0	4.0	0.9	True	4.12	2023-01-01 17:00
7	I	11.3	6.0	3.3	True	3.33	2023-01-01 18:00
8	J	10.0	9.0	4.1	False	2.98	2023-01-01 19:00

Since we want to include all measurements, we perform an outer join on label.

merged_tables = pd.merge(df1, df2, how='outer', on='label')
merged_tables

	label	value1	value2	value3	valid	value4	timestamp
0	A	9.0	2.0	3.1	True	0.98	2023-01-01 07:00
1	B	11.0	NaN	0.4	False	0.40	2023-01-01 08:00
2	C	10.2	8.0	5.6	True	NaN	NaN
3	D	9.5	5.0	NaN	True	2.56	2023-01-01 10:00
4	E	15.0	7.0	4.4	False	3.14	2023-01-01 11:00
5	F	400.0	NaN	2.2	True	2.71	2023-01-01 12:00
6	G	9.0	1.0	1.1	False	3.58	2023-01-01 16:00
7	H	11.0	4.0	0.9	True	4.12	2023-01-01 17:00
8	I	11.3	6.0	3.3	True	3.33	2023-01-01 18:00
9	J	10.0	9.0	4.1	False	2.98	2023-01-01 19:00
10	K	NaN	NaN	NaN	NaN	3.87	2023-01-01 20:00

Exercise#

Now we want to read three files [measurements_1.csv, measurements_2.csv, measurements_3.csv] from the local path data and merge them into one DataFrame. Complete the following code:

import os

file_paths = sorted(
    file.path for file in os.scandir('data')
    if file.is_file() and file.name.startswith('measurements') and file.name.endswith('.csv')
)
print(file_paths)

# TODO: read and collect the dataframes

# TODO: merge the dataframes one-by-one

['data/measurements_1.csv', 'data/measurements_2.csv', 'data/measurements_3.csv']

Appending and extending DataFrames

Contents

Appending and extending DataFrames#

Combining columns of DataFrames#

Exercise#