Appending and extending DataFrames#
When processing different data, e.g., multiple measurements, potentially using multiple processing libraries, a common task is to combine the resulting DataFrames. Pandas provides several approaches to do this.
See also:
We start with two small tables of measurements that could have been obtained from different functions or different libraries.
import pandas as pd
import numpy as np
df1 = pd.read_csv('data/measurements_1.csv')
df2 = pd.read_csv('data/measurements_2.csv')
df1.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 label 10 non-null object
1 value1 10 non-null float64
2 value2 8 non-null float64
3 value3 9 non-null float64
4 valid 10 non-null bool
dtypes: bool(1), float64(3), object(1)
memory usage: 462.0+ bytes
df2.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 label 10 non-null object
1 value4 10 non-null float64
2 timestamp 10 non-null object
dtypes: float64(1), object(2)
memory usage: 372.0+ bytes
Combining columns of DataFrames#
There are multiple ways of combining tables. We first use concat
without further adaption as an example to highlight pitfalls when combining tables.
# You can check the different result for the `axis` parameter which can either be 0 (along rows) or 1 (along columns).
wrongly_combined_tables = pd.concat([df1, df2], axis=1)
wrongly_combined_tables
label | value1 | value2 | value3 | valid | label | value4 | timestamp | |
---|---|---|---|---|---|---|---|---|
0 | A | 9.0 | 2.0 | 3.1 | True | A | 0.98 | 2023-01-01 07:00 |
1 | B | 11.0 | NaN | 0.4 | False | B | 0.40 | 2023-01-01 08:00 |
2 | C | 10.2 | 8.0 | 5.6 | True | D | 2.56 | 2023-01-01 10:00 |
3 | D | 9.5 | 5.0 | NaN | True | E | 3.14 | 2023-01-01 11:00 |
4 | E | 15.0 | 7.0 | 4.4 | False | F | 2.71 | 2023-01-01 12:00 |
5 | F | 400.0 | NaN | 2.2 | True | G | 3.58 | 2023-01-01 16:00 |
6 | G | 9.0 | 1.0 | 1.1 | False | H | 4.12 | 2023-01-01 17:00 |
7 | H | 11.0 | 4.0 | 0.9 | True | I | 3.33 | 2023-01-01 18:00 |
8 | I | 11.3 | 6.0 | 3.3 | True | J | 2.98 | 2023-01-01 19:00 |
9 | J | 10.0 | 9.0 | 4.1 | False | K | 3.87 | 2023-01-01 20:00 |
In the above example, measurements for labels have been mixed. Since both tables where not indexed on the labels but with a numbered row index,c used this to combine them.
While concat
also provides additional parameters, a better way for combining data is merge
. It allows explicitly specifying how
the merging is done, and on
which column the tables should be combined.
# You can check the different results of the `how` parameter for its allowed values: ‘left’, ‘right’, ‘outer’, ‘inner’, ‘cross’
merged_tables = pd.merge(df1, df2, how='inner', on='label')
merged_tables
label | value1 | value2 | value3 | valid | value4 | timestamp | |
---|---|---|---|---|---|---|---|
0 | A | 9.0 | 2.0 | 3.1 | True | 0.98 | 2023-01-01 07:00 |
1 | B | 11.0 | NaN | 0.4 | False | 0.40 | 2023-01-01 08:00 |
2 | D | 9.5 | 5.0 | NaN | True | 2.56 | 2023-01-01 10:00 |
3 | E | 15.0 | 7.0 | 4.4 | False | 3.14 | 2023-01-01 11:00 |
4 | F | 400.0 | NaN | 2.2 | True | 2.71 | 2023-01-01 12:00 |
5 | G | 9.0 | 1.0 | 1.1 | False | 3.58 | 2023-01-01 16:00 |
6 | H | 11.0 | 4.0 | 0.9 | True | 4.12 | 2023-01-01 17:00 |
7 | I | 11.3 | 6.0 | 3.3 | True | 3.33 | 2023-01-01 18:00 |
8 | J | 10.0 | 9.0 | 4.1 | False | 2.98 | 2023-01-01 19:00 |
Since we want to include all measurements, we perform an outer join
on label
.
merged_tables = pd.merge(df1, df2, how='outer', on='label')
merged_tables
label | value1 | value2 | value3 | valid | value4 | timestamp | |
---|---|---|---|---|---|---|---|
0 | A | 9.0 | 2.0 | 3.1 | True | 0.98 | 2023-01-01 07:00 |
1 | B | 11.0 | NaN | 0.4 | False | 0.40 | 2023-01-01 08:00 |
2 | C | 10.2 | 8.0 | 5.6 | True | NaN | NaN |
3 | D | 9.5 | 5.0 | NaN | True | 2.56 | 2023-01-01 10:00 |
4 | E | 15.0 | 7.0 | 4.4 | False | 3.14 | 2023-01-01 11:00 |
5 | F | 400.0 | NaN | 2.2 | True | 2.71 | 2023-01-01 12:00 |
6 | G | 9.0 | 1.0 | 1.1 | False | 3.58 | 2023-01-01 16:00 |
7 | H | 11.0 | 4.0 | 0.9 | True | 4.12 | 2023-01-01 17:00 |
8 | I | 11.3 | 6.0 | 3.3 | True | 3.33 | 2023-01-01 18:00 |
9 | J | 10.0 | 9.0 | 4.1 | False | 2.98 | 2023-01-01 19:00 |
10 | K | NaN | NaN | NaN | NaN | 3.87 | 2023-01-01 20:00 |
Exercise#
Now we want to read three files [measurements_1.csv
, measurements_2.csv
, measurements_3.csv
] from the local path data
and merge them into one DataFrame. Complete the following code:
import os
file_paths = sorted(
file.path for file in os.scandir('data')
if file.is_file() and file.name.startswith('measurements') and file.name.endswith('.csv')
)
print(file_paths)
# TODO: read and collect the dataframes
# TODO: merge the dataframes one-by-one
['data/measurements_1.csv', 'data/measurements_2.csv', 'data/measurements_3.csv']