Tidy-Data#
Hadley Wickham wrote a great article in favor of “tidy data.” Tidy data follows the rules:
Each variable is a column.
Each observation is a row.
Each type of observation has its separate table / DataFrame.
This is less pretty to visualize as a table, but we rarely look at data in tables. Indeed, the representation of data which is convenient for visualization is different from that which is convenient for analysis. A tidy data frame is almost always much easier to work with than non-tidy formats.
Let’s import a prepared table with measurements and have a closer look. Is this table tidy?
import pandas as pd
df = pd.read_csv('data/multi-index-measurements.csv', index_col=[0,1], parse_dates=[1], header=[0,1])
df.info()
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 6 entries, ('dev_1', Timestamp('2023-01-01 07:00:00')) to ('dev_2', Timestamp('2023-01-01 09:00:00'))
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 (before, value1) 6 non-null float64
1 (before, value2) 6 non-null float64
2 (before, value3) 6 non-null float64
3 (after, value1) 6 non-null float64
4 (after, value2) 6 non-null float64
5 (after, value3) 6 non-null float64
dtypes: float64(6)
memory usage: 588.0+ bytes
df
before | after | ||||||
---|---|---|---|---|---|---|---|
device | value1 | value2 | value3 | value1 | value2 | value3 | |
dev_1 | 2023-01-01 07:00:00 | 9.0 | 2.0 | 3.1 | 9.2 | 2.1 | 3.15 |
2023-01-01 08:00:00 | 10.1 | 2.5 | 3.0 | 10.3 | 2.6 | 3.05 | |
2023-01-01 09:00:00 | 9.8 | 3.2 | 3.2 | 10.0 | 3.3 | 3.25 | |
dev_2 | 2023-01-01 07:00:00 | 9.3 | 2.2 | 3.1 | 9.5 | 2.3 | 3.15 |
2023-01-01 08:00:00 | 10.4 | 2.7 | 3.0 | 10.6 | 2.8 | 3.05 | |
2023-01-01 09:00:00 | 10.1 | 3.4 | 3.2 | 10.3 | 3.5 | 3.25 |
A useful method for tidying a DataFrame is Pandas’ melt()
. It reformats the DataFrame from wide format to long format, incorporating the columns’ index labels into column values.
df.melt().head()
None | device | value | |
---|---|---|---|
0 | before | value1 | 9.0 |
1 | before | value1 | 10.1 |
2 | before | value1 | 9.8 |
3 | before | value1 | 9.3 |
4 | before | value1 | 10.4 |
By providing additional parameters, we can specify names for the to-be-created columns for the values. Before this, we need to flatten the multi-index for the columns and put the index back as regular columns.
# Work on a copy to keep the original intact
tidy = df.copy()
# First flatten the MultiIndex columns -> 'before_value1'
flat_cols = []
for c in tidy.columns:
if isinstance(c, tuple):
if c[0] in ("", None):
flat_cols.append(c[1])
else:
flat_cols.append(f"{c[0]}_{c[1]}")
else:
flat_cols.append(c)
tidy.columns = flat_cols
tidy.head()
before_value1 | before_value2 | before_value3 | after_value1 | after_value2 | after_value3 | ||
---|---|---|---|---|---|---|---|
dev_1 | 2023-01-01 07:00:00 | 9.0 | 2.0 | 3.1 | 9.2 | 2.1 | 3.15 |
2023-01-01 08:00:00 | 10.1 | 2.5 | 3.0 | 10.3 | 2.6 | 3.05 | |
2023-01-01 09:00:00 | 9.8 | 3.2 | 3.2 | 10.0 | 3.3 | 3.25 | |
dev_2 | 2023-01-01 07:00:00 | 9.3 | 2.2 | 3.1 | 9.5 | 2.3 | 3.15 |
2023-01-01 08:00:00 | 10.4 | 2.7 | 3.0 | 10.6 | 2.8 | 3.05 |
# Then bring the index columns "device","timestamp"" back as regular columns
tidy.index.set_names(["device","timestamp"], inplace=True)
tidy.reset_index(inplace=True)
tidy.head()
device | timestamp | before_value1 | before_value2 | before_value3 | after_value1 | after_value2 | after_value3 | |
---|---|---|---|---|---|---|---|---|
0 | dev_1 | 2023-01-01 07:00:00 | 9.0 | 2.0 | 3.1 | 9.2 | 2.1 | 3.15 |
1 | dev_1 | 2023-01-01 08:00:00 | 10.1 | 2.5 | 3.0 | 10.3 | 2.6 | 3.05 |
2 | dev_1 | 2023-01-01 09:00:00 | 9.8 | 3.2 | 3.2 | 10.0 | 3.3 | 3.25 |
3 | dev_2 | 2023-01-01 07:00:00 | 9.3 | 2.2 | 3.1 | 9.5 | 2.3 | 3.15 |
4 | dev_2 | 2023-01-01 08:00:00 | 10.4 | 2.7 | 3.0 | 10.6 | 2.8 | 3.05 |
# Then melt the DataFrame
tidy = tidy.melt(
id_vars=["device", "timestamp"],
var_name="phase_variable", # var_name only allows scalar names
value_name="value"
).assign(
phase=lambda x: x["phase_variable"].str.split("_", expand=True)[0], # split phase_variable into two new columns
variable=lambda x: x["phase_variable"].str.split("_", expand=True)[1] # split phase_variable into two new columns
).drop(
columns=["phase_variable"] # drop the now redundant column
)
# Reorder and sort columns
tidy = tidy[["device", "timestamp", "phase", "variable", "value"]].sort_values(["device", "timestamp", "phase", "variable"])
tidy.head(10)
device | timestamp | phase | variable | value | |
---|---|---|---|---|---|
18 | dev_1 | 2023-01-01 07:00:00 | after | value1 | 9.20 |
24 | dev_1 | 2023-01-01 07:00:00 | after | value2 | 2.10 |
30 | dev_1 | 2023-01-01 07:00:00 | after | value3 | 3.15 |
0 | dev_1 | 2023-01-01 07:00:00 | before | value1 | 9.00 |
6 | dev_1 | 2023-01-01 07:00:00 | before | value2 | 2.00 |
12 | dev_1 | 2023-01-01 07:00:00 | before | value3 | 3.10 |
19 | dev_1 | 2023-01-01 08:00:00 | after | value1 | 10.30 |
25 | dev_1 | 2023-01-01 08:00:00 | after | value2 | 2.60 |
31 | dev_1 | 2023-01-01 08:00:00 | after | value3 | 3.05 |
1 | dev_1 | 2023-01-01 08:00:00 | before | value1 | 10.10 |
A DataFrame formatted this way is easier to manipulate in analysis, because now we can more easily mask by columns.
# Here we select the values for "dev_1", "before" and "value3".
tidy[
(tidy["device"] == "dev_1") &
(tidy["phase"] == "before") &
(tidy["variable"] == "value3")
]
device | timestamp | phase | variable | value | |
---|---|---|---|---|---|
12 | dev_1 | 2023-01-01 07:00:00 | before | value3 | 3.1 |
13 | dev_1 | 2023-01-01 08:00:00 | before | value3 | 3.0 |
14 | dev_1 | 2023-01-01 09:00:00 | before | value3 | 3.2 |