Appending and extending DataFrames#

When processing different data, e.g., multiple measurements, potentially using multiple processing libraries, a common task is to combine the resulting DataFrames. Pandas provides several approaches to do this.

See also:

We start with two small tables of measurements that could have been obtained from different functions or different libraries.

import pandas as pd
import numpy as np
df1 = pd.read_csv('data/measurements_1.csv')
df2 = pd.read_csv('data/measurements_2.csv')
df1.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   label   10 non-null     object 
 1   value1  10 non-null     float64
 2   value2  8 non-null      float64
 3   value3  9 non-null      float64
 4   valid   10 non-null     bool   
dtypes: bool(1), float64(3), object(1)
memory usage: 462.0+ bytes
df2.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   label      10 non-null     object 
 1   value4     10 non-null     float64
 2   timestamp  10 non-null     object 
dtypes: float64(1), object(2)
memory usage: 372.0+ bytes

Combining columns of DataFrames#

There are multiple ways of combining tables. We first use concat without further adaption as an example to highlight pitfalls when combining tables.

# You can check the different result for the `axis` parameter which can either be 0 (along rows) or 1 (along columns).
wrongly_combined_tables = pd.concat([df1, df2], axis=1)
wrongly_combined_tables
label value1 value2 value3 valid label value4 timestamp
0 A 9.0 2.0 3.1 True A 0.98 2023-01-01 07:00
1 B 11.0 NaN 0.4 False B 0.40 2023-01-01 08:00
2 C 10.2 8.0 5.6 True D 2.56 2023-01-01 10:00
3 D 9.5 5.0 NaN True E 3.14 2023-01-01 11:00
4 E 15.0 7.0 4.4 False F 2.71 2023-01-01 12:00
5 F 400.0 NaN 2.2 True G 3.58 2023-01-01 16:00
6 G 9.0 1.0 1.1 False H 4.12 2023-01-01 17:00
7 H 11.0 4.0 0.9 True I 3.33 2023-01-01 18:00
8 I 11.3 6.0 3.3 True J 2.98 2023-01-01 19:00
9 J 10.0 9.0 4.1 False K 3.87 2023-01-01 20:00

In the above example, measurements for labels have been mixed. Since both tables where not indexed on the labels but with a numbered row index,c used this to combine them.

While concat also provides additional parameters, a better way for combining data is merge. It allows explicitly specifying how the merging is done, and on which column the tables should be combined.

# You can check the different results of the `how` parameter for its allowed values: ‘left’, ‘right’, ‘outer’, ‘inner’, ‘cross’
merged_tables = pd.merge(df1, df2, how='inner', on='label')
merged_tables
label value1 value2 value3 valid value4 timestamp
0 A 9.0 2.0 3.1 True 0.98 2023-01-01 07:00
1 B 11.0 NaN 0.4 False 0.40 2023-01-01 08:00
2 D 9.5 5.0 NaN True 2.56 2023-01-01 10:00
3 E 15.0 7.0 4.4 False 3.14 2023-01-01 11:00
4 F 400.0 NaN 2.2 True 2.71 2023-01-01 12:00
5 G 9.0 1.0 1.1 False 3.58 2023-01-01 16:00
6 H 11.0 4.0 0.9 True 4.12 2023-01-01 17:00
7 I 11.3 6.0 3.3 True 3.33 2023-01-01 18:00
8 J 10.0 9.0 4.1 False 2.98 2023-01-01 19:00

Since we want to include all measurements, we perform an outer join on label.

merged_tables = pd.merge(df1, df2, how='outer', on='label')
merged_tables
label value1 value2 value3 valid value4 timestamp
0 A 9.0 2.0 3.1 True 0.98 2023-01-01 07:00
1 B 11.0 NaN 0.4 False 0.40 2023-01-01 08:00
2 C 10.2 8.0 5.6 True NaN NaN
3 D 9.5 5.0 NaN True 2.56 2023-01-01 10:00
4 E 15.0 7.0 4.4 False 3.14 2023-01-01 11:00
5 F 400.0 NaN 2.2 True 2.71 2023-01-01 12:00
6 G 9.0 1.0 1.1 False 3.58 2023-01-01 16:00
7 H 11.0 4.0 0.9 True 4.12 2023-01-01 17:00
8 I 11.3 6.0 3.3 True 3.33 2023-01-01 18:00
9 J 10.0 9.0 4.1 False 2.98 2023-01-01 19:00
10 K NaN NaN NaN NaN 3.87 2023-01-01 20:00

Exercise#

Now we want to read three files [measurements_1.csv, measurements_2.csv, measurements_3.csv] from the local path data and merge them into one DataFrame. Complete the following code:

import os

file_paths = sorted(
    file.path for file in os.scandir('data')
    if file.is_file() and file.name.startswith('measurements') and file.name.endswith('.csv')
)
print(file_paths)

# TODO: read and collect the dataframes

# TODO: merge the dataframes one-by-one
['data/measurements_1.csv', 'data/measurements_2.csv', 'data/measurements_3.csv']