Tidy-Data#

Hadley Wickham wrote a great article in favor of “tidy data.” Tidy data follows the rules:

  • Each variable is a column.

  • Each observation is a row.

  • Each type of observation has its separate table / DataFrame.

This is less pretty to visualize as a table, but we rarely look at data in tables. Indeed, the representation of data which is convenient for visualization is different from that which is convenient for analysis. A tidy data frame is almost always much easier to work with than non-tidy formats.

Let’s import a prepared table with measurements and have a closer look. Is this table tidy?

import pandas as pd
df = pd.read_csv('data/multi-index-measurements.csv', index_col=[0,1], parse_dates=[1], header=[0,1])
df.info()
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 6 entries, ('dev_1', Timestamp('2023-01-01 07:00:00')) to ('dev_2', Timestamp('2023-01-01 09:00:00'))
Data columns (total 6 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   (before, value1)  6 non-null      float64
 1   (before, value2)  6 non-null      float64
 2   (before, value3)  6 non-null      float64
 3   (after, value1)   6 non-null      float64
 4   (after, value2)   6 non-null      float64
 5   (after, value3)   6 non-null      float64
dtypes: float64(6)
memory usage: 588.0+ bytes
df
before after
device value1 value2 value3 value1 value2 value3
dev_1 2023-01-01 07:00:00 9.0 2.0 3.1 9.2 2.1 3.15
2023-01-01 08:00:00 10.1 2.5 3.0 10.3 2.6 3.05
2023-01-01 09:00:00 9.8 3.2 3.2 10.0 3.3 3.25
dev_2 2023-01-01 07:00:00 9.3 2.2 3.1 9.5 2.3 3.15
2023-01-01 08:00:00 10.4 2.7 3.0 10.6 2.8 3.05
2023-01-01 09:00:00 10.1 3.4 3.2 10.3 3.5 3.25

A useful method for tidying a DataFrame is Pandas’ melt(). It reformats the DataFrame from wide format to long format, incorporating the columns’ index labels into column values.

df.melt().head()
None device value
0 before value1 9.0
1 before value1 10.1
2 before value1 9.8
3 before value1 9.3
4 before value1 10.4

By providing additional parameters, we can specify names for the to-be-created columns for the values. Before this, we need to flatten the multi-index for the columns and put the index back as regular columns.

# Work on a copy to keep the original intact
tidy = df.copy()
# First flatten the MultiIndex columns -> 'before_value1'
flat_cols = []
for c in tidy.columns:
    if isinstance(c, tuple):
        if c[0] in ("", None):
            flat_cols.append(c[1])
        else:
            flat_cols.append(f"{c[0]}_{c[1]}")
    else:
        flat_cols.append(c)

tidy.columns = flat_cols
tidy.head()
before_value1 before_value2 before_value3 after_value1 after_value2 after_value3
dev_1 2023-01-01 07:00:00 9.0 2.0 3.1 9.2 2.1 3.15
2023-01-01 08:00:00 10.1 2.5 3.0 10.3 2.6 3.05
2023-01-01 09:00:00 9.8 3.2 3.2 10.0 3.3 3.25
dev_2 2023-01-01 07:00:00 9.3 2.2 3.1 9.5 2.3 3.15
2023-01-01 08:00:00 10.4 2.7 3.0 10.6 2.8 3.05
# Then bring the index columns "device","timestamp"" back as regular columns
tidy.index.set_names(["device","timestamp"], inplace=True)
tidy.reset_index(inplace=True)
tidy.head()
device timestamp before_value1 before_value2 before_value3 after_value1 after_value2 after_value3
0 dev_1 2023-01-01 07:00:00 9.0 2.0 3.1 9.2 2.1 3.15
1 dev_1 2023-01-01 08:00:00 10.1 2.5 3.0 10.3 2.6 3.05
2 dev_1 2023-01-01 09:00:00 9.8 3.2 3.2 10.0 3.3 3.25
3 dev_2 2023-01-01 07:00:00 9.3 2.2 3.1 9.5 2.3 3.15
4 dev_2 2023-01-01 08:00:00 10.4 2.7 3.0 10.6 2.8 3.05
# Then melt the DataFrame
tidy = tidy.melt(
    id_vars=["device", "timestamp"],
    var_name="phase_variable", # var_name only allows scalar names
    value_name="value"
).assign(
    phase=lambda x: x["phase_variable"].str.split("_", expand=True)[0], # split phase_variable into two new columns
    variable=lambda x: x["phase_variable"].str.split("_", expand=True)[1] # split phase_variable into two new columns
).drop(
    columns=["phase_variable"] # drop the now redundant column
)
# Reorder and sort columns
tidy = tidy[["device", "timestamp", "phase", "variable", "value"]].sort_values(["device", "timestamp", "phase", "variable"])
tidy.head(10)
device timestamp phase variable value
18 dev_1 2023-01-01 07:00:00 after value1 9.20
24 dev_1 2023-01-01 07:00:00 after value2 2.10
30 dev_1 2023-01-01 07:00:00 after value3 3.15
0 dev_1 2023-01-01 07:00:00 before value1 9.00
6 dev_1 2023-01-01 07:00:00 before value2 2.00
12 dev_1 2023-01-01 07:00:00 before value3 3.10
19 dev_1 2023-01-01 08:00:00 after value1 10.30
25 dev_1 2023-01-01 08:00:00 after value2 2.60
31 dev_1 2023-01-01 08:00:00 after value3 3.05
1 dev_1 2023-01-01 08:00:00 before value1 10.10

A DataFrame formatted this way is easier to manipulate in analysis, because now we can more easily mask by columns.

# Here we select the values for "dev_1", "before" and "value3".
tidy[
    (tidy["device"] == "dev_1") &
    (tidy["phase"] == "before") &
    (tidy["variable"] == "value3")
]
device timestamp phase variable value
12 dev_1 2023-01-01 07:00:00 before value3 3.1
13 dev_1 2023-01-01 08:00:00 before value3 3.0
14 dev_1 2023-01-01 09:00:00 before value3 3.2